The parsing of undifferentiated tone sequences into groups of qualitatively distinct elements is one of the earliest rhythmic phenomena to have been investigated experimentally (Bolton, 1894). The present study aimed to replicate and extend these findings through online experimentation using a spontaneous grouping paradigm with forced-choice response (from 1 to 12 tones per group). Two types of isochronous sequences were used: equitone sequences, which varied only with respect to signal rate (200, 550, or 950 ms interonset intervals), and accented sequences, in which accents were added every two or three tones to test the effect of induced grouping (duple vs. triple) and accent type (intensity, duration, or pitch). In equitone sequences, participants’ grouping percepts (N = 4,194) were asymmetrical and tempo-dependent, with “no grouping” and groups of four being most frequently reported. In accented sequences, slower rate, induced triple grouping, and intensity accents correlated with increases in group length. Furthermore, the probability of observing a mixed metric type—that is, grouping percepts divisible by both two and three (6 and 12)—was found to be highest in faster sequences with induced triple grouping. These findings suggest that lower-level triple grouping gives rise to binary grouping percepts at higher metrical levels.
In everyday life, our ears are continuously stimulated by a wide variety of sounds that come from multiple sources and serve different purposes. Rather than perceiving these sounds as a chaotic mass, our mind is able to organize them in such a way that meaningful signals, such as music or speech, are differentiated from noise. One important aspect that contributes to the disentanglement of sounds is the way they are grouped in time, their “rhythm.” As summarized by Handel (1989), subjective rhythms result from the tendency to group undifferentiated sequences of tones into subjective groups (see Figure 1). Early studies of subjective rhythmization have found that equitone isochronous sequences are typically perceived to form groups from 2 to 8 elements (with groups of 4, 2, and 3, in that order of preference, being most commonly reported), leading to the perception of the first element of each group as louder and the time interval between the last element of one group and the first element of the next as longer (Bolton, 1894; Harrell, 1937; Woodrow, 1909). These percepts have also been found to match listeners’ experience of isochronous sequences in which periodic accents are produced through physical differences in duration (Povel, 1981; Povel & Ockerman, 1981; Vos, 1977) as well as in intensity and pitch (Bolton, 1894; Thomassen, 1982). In Figure 1, similar grouping percepts are shown to result from series of onsets that are equally spaced and undifferentiated (a), or in which duple or triple grouping is induced by physical differences in intensity (b), onset-to-offset duration (c), and pitch (e). 
In (d), duple grouping is induced through differences in offset-to-onset intervals: relatively larger intervals between the last element of a group of two tones and the first element of the next group result in the perception of the last element of the group as accented. Hasty (1997) interprets this as a shift in the order of elements within groups, with the first element understood as an anacrusis to an accented downbeat. Given that spontaneous grouping percepts have been linked to a perceived differentiation between accented (strong) and unaccented (weak) events in the absence of such physical differences, it has been proposed that this phenomenon might be better described as “subjective metricization” (London, 2012, p. 13) or “spontaneous higher order resonance” (Large, 2008, p. 215). In a context where a given grouping percept persists, then, one may posit that it is likely to result in the perception of nested pulses; that is, in the perception of a metric hierarchy, with the stimulus tones corresponding to subdivisions at more than one pulse level.
Early studies of subjective rhythmization typically involved guided verbal responses from participants, raising issues of demand characteristics and response bias. In his report on the first systematic study to investigate 30 listeners’ grouping percepts of undifferentiated isochronous sequences with interonset intervals (IOIs) ranging from 115 ms to 2,304 ms, Bolton (1894) comments that “generally, however, it required some kind of a suggestion to direct the attention of the subjects to the grouping of the sounds” (p. 185), such as asking a participant who compared the stimuli to the ticks of a clock “if there was the same difference of intensity or quality in the sounds as was apparent in the clock ticks.” It is not clear whether this participant’s comparison of the experimental stimuli to the “ticks of a clock” resulted from a spontaneous binary grouping percept, but the experimenter’s follow-up question is nonetheless problematic. Verbal suggestions appear to have also been made in a more indirect manner, but with similarly questionable effects:
The operator frequently directed the attention of the subjects to respiration, or asked them to feel the pulse. Most of the subjects incline to the view that respiration accommodated itself to the form of grouping that was found most natural with the rate to which they were listening. Inhalation and exhalation each lasted during the time of a 4-group. In this way a kind of higher grouping was accomplished, for the clicks heard during inspiration were more intense. When the rate was slow both inspiration and expiration were accommodated to the time of one click. (Bolton, 1894, p. 224)
The findings from these studies have also often been reported without taking into account whether grouping percepts occurred spontaneously or as a result of experimental procedure (e.g., Fraisse, 1956; Handel, 1989; MacDougall, 1903; Repp, 2007). For example, in an experiment that asked participants to listen to an undifferentiated sequence of clicks and adjust the rate to match their preferred rate, Harrell (1937) found that fewer than half of the 27 participants reported spontaneous grouping percepts. Similarly, Temperley (1963) reports that in a preliminary tapping task with isochronous series at varying rates (IOIs from 200 ms to 2,200 ms), only four of the 23 participants spontaneously perceived groups or imposed accents, while all but one participant did so when it was suggested. These reports, which involved observations from both perceptual and performance tasks, suggest that spontaneous grouping might not be as “spontaneous” as has been generally assumed.
More recent studies have addressed some of these limitations. Brochard, Abecasis, Potter, Ragot, and Drake (2003) measured differences in event-related brain potentials (ERPs) to occasional deviant tones (4 dB softer) in odd versus even positions in equitone isochronous sequences. The experimenters found that differences between position 9 as compared to 8, and 11 as compared to 10, suggested a binary percept, providing direct evidence for subjective rhythmization. In a follow-up study, Abecasis, Brochard, Granot, and Drake (2005) induced binary and ternary percepts by making every second or third tone longer (onset-to-offset interval) and found similar differences in the positions corresponding to strong beats in a metrical pattern (positions 9 and 11 for the binary condition and position 10 for the ternary condition), with the pattern of differences suggesting a preference for binary structure. Nonetheless, differences between ERP responses to successive strong beat positions (e.g., 9 vs. 11 in the binary condition), either of which could correspond to the first beat in a four-beat pattern, were not analyzed. In another study, Schaefer, Vlek, and Desain (2011) compared ERP responses to accented and unaccented events in 2-beat, 3-beat, and 4-beat patterns in two conditions, i.e., externally presented (induced) followed by internally generated (self-imposed). The experimenters found significant differences between accented and unaccented beats in both conditions, including between the first unaccented (beat 2) and last unaccented (beats 3 and 4) beats in the 3-beat and 4-beat patterns. Although these findings are consistent with the neural resonance model of pulse and meter (Large & Snyder, 2009), the experimental “metric context” was limited to the relationship of beats within a single pulse level (“metric position”). 
Furthermore, differences between beats 1 and 3 in the 4-beat pattern and differences between successive accented first beats, which would be suggestive of a spontaneously perceived lower or higher pulse level, were not analyzed. Thus, the relationship between spontaneous grouping and the perception of meter (understood as a hierarchy of nested beats) has yet to be examined systematically.
Most recently, Bååth (2015) replicated the Bolton (1894) study with the aim of testing two theoretical models that may account for spontaneous rhythmization, i.e., preferred tempo (Temperley, 1963) and neural resonance (Large, 2008). Bååth’s study featured three experimental tasks, including a spontaneous rhythmization task with equitone isochronous sequences with IOIs ranging from 150 ms to 2,000 ms, with the slowest rate having been included to detect participants who may have misunderstood the instructions. Prior to the spontaneous rhythmization task, participants listened to a sequence with IOIs of 600 ms and were informed that the clicks did not differ in any way. Participants listened to the same sequence a second time and were instructed to report whether they nonetheless perceived groups of tones, with available responses ranging from 1 (no grouping) to 8 clicks per group; all participants reported having experienced a grouping of clicks. Results from the experimental task, which consisted of four blocks of eight trials each (one for each sequence rate), showed that groups of four, two, and eight clicks were most frequently perceived, accounting for 29.6%, 29.5%, and 10.4% of trials, respectively, with peak rates ranging from 150 ms to 600 ms. Nonetheless, a relatively large proportion of “no grouping” responses (23.9%) was also reported, with more than half of the trials at the slowest rate of 2,000 ms resulting in no perceived grouping, while most of the remaining “no grouping” responses were reported for sequences with IOIs ranging from 600 ms to 1,500 ms (with a probability of about 10% to over 50% within rate, respectively). Five of the 30 participants reported spontaneous grouping percepts for each of the four sequences with IOIs of 2,000 ms, but these participants' data were not included in the analyses. 
Based on the analysis of the relationship between perceptual and motor tasks, Bååth concluded that the pattern of observed spontaneous grouping percepts was consistent with a neural resonance explanation for spontaneous rhythmization.
The goal of the present study was to address some of the limitations of the earlier studies, more specifically, the limited range of possible grouping responses (i.e., the reported number of tones per group) and limitations in study design with regard to the exploration of the metric implications of the reported groupings; that is, the potential for the reported groups to be heard as nested beats within a larger metric framework. The first limitation was addressed by extending the range of possible grouping responses from 8 to 12 elements per group, and the second was addressed by using differentiated isochronous sequences in which a physical accent was added every two or three tones to induce lower-level grouping. Although the present study did not investigate the theoretical underpinnings of spontaneous rhythmization, the hypothesized presence of implied metric percepts is consistent with the neural resonance explanation, which was supported by the findings of Bååth (2015). Thus, the specific goals of the present study were to: (1) investigate the variability of spontaneously perceived groups both in terms of cardinality, that is, the number of elements per group or grouping response (GR), and in terms of duration or group length (GL); (2) characterize the implied metric hierarchy or metric type (MT) resulting from the relationship between the perceived number of elements per group and induced grouping (i.e., the physical accent applied every two or three tones); and (3) determine the relationship between the physical attributes of the sequences (rate, induced grouping, and accent type) and spontaneous grouping percepts.
One hundred and forty participants (women = 69) took part in the study, and 97 of them completed the entire study. Across all participants who took part, age ranged from 17 to 73, with a median of 27 (M = 30.70, SD = 13.08). Participants ranked their overall level of music training on a scale of 1 (none) to 5 (professional), and the median level was 3 (M = 3.24; SD = 1.31). The majority of the participants were either natives of, or resided in, the United States (55% and 61.2%, respectively); most of the participants (66.4%) identified English as their native language, although a large proportion (59.3%) also reported being fluent in more than one language. Most participants reported playing an instrument, with piano/keyboard being the most common (41.4%); on the other hand, fewer than half of the participants had specialized music training (41.4% and 40.7% had received composition and voice training, respectively). Finally, only a small percentage of participants (5%) reported having absolute pitch.
Materials and Procedure
Each trial consisted of an isochronous sequence of tones with a base frequency of 833 Hz (G♯5) and the default rise and decay of 1 ms, with a base onset-to-offset interval of 50 ms. Three variables were manipulated: rate (200, 550, and 950 ms interonset intervals, corresponding to a tempo range of 63 bpm to 300 bpm), induced grouping (duple vs. triple), and accent type (duration, intensity, and pitch). Two types of isochronous tone sequences were used: three equitone sequences that varied only with respect to rate, and 18 accented sequences that varied with respect to rate, induced grouping, and accent type. Rates were selected to provide a range of tempi within the limits of pulse perception (for a summary, see London, 2002, pp. 535–539), with 200 ms representing the lower limit for maximal pulse salience (Parncutt, 1994; Warren, 1993) and 950 ms falling well under the 1.5–2 s upper limit of pulse perception (e.g., Fraisse, 1982; Woodrow, 1932), while still allowing for a higher-level pulse in the induced triple grouping condition to fall under the upper limit of 3.4 s identified by Woodrow as “the vanishing point of the capacity for synchronization” (as cited in London, 2002, p. 536). For the accented isochronous sequences, low-level grouping was induced by applying a periodic physical cue, that is, an accent, every two or three tones (duple vs. triple induced grouping). Accented tones were always presented first and were either twice as long (duration accent with onset-to-offset interval of 100 ms instead of 50 ms), louder (intensity accent with a difference of 4.7 dB), or one semitone lower (pitch accent effected by frequency contrast) than unaccented tones. The pitch accent difference of one semitone was selected to prevent stream segregation, i.e., the separation of a tone sequence into two perceptual groups based on pitch distance (e.g., Bregman, 1990; Szalárdy et al., 2014).
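The correspondence between interonset intervals and tempo, and the check that a triple-level pulse remains under the 3.4 s limit, can be sketched as follows. This is a minimal illustration rather than part of the study materials; the function and constant names are hypothetical.

```python
# Minimal sketch: convert the three interonset intervals (IOIs) to tempo
# and check the period of a higher-level pulse under induced triple grouping.
# Names are illustrative assumptions, not taken from the study materials.

RATES_MS = (200, 550, 950)   # IOIs used in the study
UPPER_LIMIT_MS = 3400        # 3.4 s limit attributed to Woodrow in the text


def ioi_to_bpm(ioi_ms: float) -> float:
    """Tempo in beats per minute for a given interonset interval in ms."""
    return 60_000 / ioi_ms


def higher_level_period_ms(ioi_ms: float, tones_per_group: int) -> float:
    """Period of the pulse formed by grouping tones in twos or threes."""
    return ioi_ms * tones_per_group


for ioi in RATES_MS:
    bpm = round(ioi_to_bpm(ioi))
    triple = higher_level_period_ms(ioi, 3)
    within = triple < UPPER_LIMIT_MS
    print(f"IOI {ioi} ms = {bpm} bpm; triple-level pulse {triple} ms; under limit: {within}")
```

At the slowest rate, the triple-level pulse has a period of 2,850 ms, which is the value that has to stay below the 3.4 s limit.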
Pure tones were used to allow for a more direct comparison with previous studies. Although timbre has been shown to play a role in rhythm perception (e.g., Schutz & Vaisberg, 2014), this factor was not tested here, as it was determined that differences of rate, induced grouping, and accent type provided the necessary conditions to address the experimental questions. The addition of further conditions would also have extended the time required for participants to complete the study beyond what can be reasonably expected for an online study. Finally, to avoid a direct relationship between presentation rate and sequence duration, and to minimize habituation, the number of tones per sequence was varied across sequences (18 to 72 tones per sequence), with sequences’ total duration ranging from 8.25 to 24.75 s (M = 14.85, SD = 4.93); sound files for each of the 21 experimental stimuli can be accessed at https://osf.io/asu84/.
Thirty-nine trials were presented across three experimental blocks with the presentation order varied randomly within each block. To test the emergence of spontaneous grouping percepts in the absence of physical cues, block 1 presented three trials that varied only by rate. This block was tested first in order to minimize a carry-over effect or grouping response bias resulting from differences in the physical characteristics of the sequences (i.e., the presence of periodic accents every two or three tones), which could have induced participants to select the grouping response “1” (no grouping) when presented with equitone sequences after having heard accented sequences. The influence of physical accent cues on spontaneous grouping percepts and implied metric types was tested in blocks 2 and 3. Each of these two blocks presented the same 18 trials combining the three variables of rate, induced grouping, and accent type (3 × 2 × 3) in a different random order; the repetition of the accented block of trials also allowed for testing whether task repetition would give rise to learning or habituation.
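The block structure described above can be sketched as follows: three equitone trials in block 1 and the 3 × 2 × 3 accented design presented once in each of blocks 2 and 3. The function name, dictionary keys, and seeding scheme are hypothetical and do not describe how the online platform implemented randomization.

```python
# Minimal sketch of the trial structure, under assumed names.
import random
from itertools import product

RATES_MS = (200, 550, 950)
GROUPINGS = ("duple", "triple")
ACCENTS = ("duration", "intensity", "pitch")

# Block 1: equitone sequences that vary only by rate.
EQUITONE = [{"rate_ms": r} for r in RATES_MS]

# Blocks 2 and 3: every combination of rate, induced grouping, and accent type.
ACCENTED = [
    {"rate_ms": r, "grouping": g, "accent": a}
    for r, g, a in product(RATES_MS, GROUPINGS, ACCENTS)
]


def build_session(seed: int) -> list[list[dict]]:
    """Return three blocks, each shuffled independently (hypothetical scheme)."""
    rng = random.Random(seed)
    blocks = [list(EQUITONE), list(ACCENTED), list(ACCENTED)]
    for block in blocks:
        rng.shuffle(block)
    return blocks


session = build_session(seed=1)
print([len(block) for block in session], sum(len(block) for block in session))
```

The block sizes of 3, 18, and 18 account for the 39 trials per participant and the 21 distinct stimuli.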
Spontaneous grouping response data were collected using an online survey on the MARL platform. Invitations were distributed in the teaching and research community via e-mail, list-serv, and social media (convenience sampling). There was no compensation offered to take part in the study and participants were informed that they could stop at any time without negative consequences. Before the experiment, participants were asked to complete a questionnaire about their background, including gender, age, nationality, language, general education as well as music training. Participants were also required to conduct a sound test to adjust the volume of their electronic device at a comfortable level. Participants were instructed to do the entire study in one sitting, in a quiet room or wearing noise cancelling earphones, and focusing only on doing the experiment. Because the study was conducted online, it is not possible to determine the specific environment in which participants completed the experimental tasks or what type of listening device (e.g., headphones, earphones, or computer speakers) was used by each participant, nor is it possible to account for equipment malfunction or erratic behavior. However, participants did have the opportunity to provide feedback at the end of the experiment, and no issues related to equipment were reported.
In each trial, participants heard the tone sequence only once, after which they were asked to answer the question “How many tones per group did you hear?” and to select a number from 1 to 12 (forced-choice response). Before the first trial of each block, participants were given the following instructions:
It is possible that you will hear different levels of grouping or no grouping at all. If you hear different levels of grouping, you should select the answer that corresponds to the group with the larger number of elements. For example:
If you hear the sounds grouping together in such a way that you could count them as “1-2-3-4-1-2-3-4-1…” your answer should be “4”;
If you hear the sounds grouping together in such a way that you could count them as “1-and-2-and-3-and-1-and-2-and-3…” your answer should be “6”, not “2”.
If you do not hear the sounds grouping together at all, you should select “1”.
For undifferentiated sequences (block 1), participants were presented with each of the three rate conditions only once. In total, there were 415 responses recorded, from 140 unique participants, with a maximally even total number of trials per rate condition. For differentiated sequences (blocks 2 and 3), participants were presented with each condition resulting from the combination of the three factors of rate, induced grouping and accent type twice (once in each block). There were 3,779 responses recorded (2,003 for block 2 and 1,776 for block 3), from 127 unique participants, with a maximally even number of trials per induced grouping condition across blocks 2 and 3 (block 2 = 1,003 duple and 1,000 triple trials; block 3 = 886 duple and 890 triple trials) and an average of 1,260 trials per rate and accent type condition (SD = 6.43 and 2.08, respectively, with a minimum of 1,255 responses per condition across rate and accent). The total number of grouping responses recorded over the three blocks of trials was 4,194.
Measures and Analysis
Grouping responses (GR) were used for data visualization and to identify trends in grouping percepts within and across the three blocks of trials. Data visualization was conducted for anomaly detection. Given that no anomaly was detected and that trials were randomized within blocks, incomplete data sets were not discarded and all recorded responses were used for analysis, including those from the 43 participants who did not complete the entire study. Descriptive statistics (frequency and percentage distribution) of GRs within and across the three experimental blocks are presented in the Results section. Because block 1 presented undifferentiated sequences that varied only by rate, and blocks 2 and 3 presented differentiated sequences that varied by rate as well as induced grouping and accent type, the statistical significance of observed differences in the distribution of GRs in block 1 was tested separately using chi-squared tests, which are appropriate for categorical data.
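For readers unfamiliar with the chi-squared test of independence, the statistic and its degrees of freedom can be computed from a contingency table as sketched below. The table is a hypothetical toy example, not study data, and the function name is an assumption; the sketch also assumes that no row or column total is zero.

```python
# Minimal sketch of a chi-squared test of independence on a contingency table.
# Rows could stand for rate conditions and columns for response categories.

def chi_squared_independence(table: list[list[int]]) -> tuple[float, int]:
    """Return the chi-squared statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2 x 2 toy table (NOT data from the study).
toy_table = [[10, 20], [20, 10]]
statistic, df = chi_squared_independence(toy_table)
print(round(statistic, 3), df)
```

With three rate conditions in the rows, the degrees of freedom are twice the number of response categories minus one.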
To better account for the influence of rate and accent manipulations on participants’ spontaneous grouping percepts in differentiated sequences, two measures were derived from participants’ grouping responses. Group length (GL) was calculated by multiplying participants’ grouping responses by the rate variable (e.g., GR 4 × 550 ms = 2,200 ms). While the experimental design did not allow for the direct testing of participants’ metric percepts, it was assumed that such percepts could be inferred from the spontaneous grouping responses to differentiated sequences in which the perception of low-level groups (duple vs. triple) was induced by applying a periodic accent. Given that metric hierarchy is conceived as a temporal framework of nested pulse levels, each of which is composed of groups of two or three evenly spaced timepoints (e.g., Lerdahl & Jackendoff, 1983), a participant’s spontaneous grouping percept of six tones per group (GR 6) in response to an equitone sequence, for example, does not provide enough information to posit a specific metric percept. On the other hand, the same response to a differentiated isochronous sequence with a duple accentual pattern suggests a ternary metric percept; that is, three higher-level beats, each of which is subdivided into two evenly spaced lower-level beats, as can be indicated by the time signatures 3/8, 3/4, and 3/2. Thus, each grouping response was assigned to one of four metric types (MT), based on whether it was a multiple of 2, 3, both 2 and 3, or neither 2 nor 3, i.e., pure duple (GRs 2, 4, 8, and 10), pure triple (GRs 3 and 9), mixed (GRs 6 and 12), and other (GRs 1, 5, 7, and 11). It is important to note that this categorization rests on the assumption that the addition of periodic accents every two or three tones did result in the perception of low-level duple or triple grouping.
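The two derived measures can be expressed compactly as follows. This is a minimal sketch of the definitions given above; the function names are hypothetical.

```python
# Minimal sketch of the two derived measures: group length (GL) and
# metric type (MT). Function names are illustrative assumptions.

def group_length_ms(gr: int, ioi_ms: int) -> int:
    """Group length: grouping response multiplied by the interonset interval."""
    return gr * ioi_ms


def metric_type(gr: int) -> str:
    """Assign a grouping response (1 to 12) to one of the four metric types."""
    if not 1 <= gr <= 12:
        raise ValueError("grouping responses range from 1 to 12")
    by_two = gr % 2 == 0
    by_three = gr % 3 == 0
    if by_two and by_three:
        return "mixed"        # GRs 6 and 12
    if by_two:
        return "pure duple"   # GRs 2, 4, 8, and 10
    if by_three:
        return "pure triple"  # GRs 3 and 9
    return "other"            # GRs 1, 5, 7, and 11


print(group_length_ms(4, 550))
print({gr: metric_type(gr) for gr in range(1, 13)})
```

Note that GR 1 ("no grouping") falls into the "other" category under this rule, together with the odd responses 5, 7, and 11.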
Since the two derived measures of GL and MT represent different types of data (continuous and categorical, respectively), the analysis required different statistical models. Observed differences in the continuous GL data from blocks 2 and 3 were tested using a linear mixed-effects model with participant treated as a random effect factor using the R (https://www.r-project.org/) package “lme4” (Bates, Maechler, Bolker, & Walker, 2015). Since participant-specific effects on spontaneous grouping responses (from which the GL data were derived) should not be dismissed, it was determined that treating participant-specific effects as a random effect factor would afford a more accurate estimation of the fixed effects from the independent variables of rate (200, 550, or 950 ms), induced grouping (duple vs. triple), accent type (duration, intensity, or pitch), and block (2 vs. 3). To test the contribution of each factor in isolation as well as each combination of factors up to the full model, estimates of model fit were calculated using the Akaike information criterion (AIC) via a maximum likelihood approach.
For the categorical MT data, two complementary statistical models were considered, a multinomial logistic regression model and a Bayesian multilevel model, both of which are appropriate when analyzing response data that correspond to discrete categories. The goal was to estimate the effects of the same four independent variables on the probability of observing a particular type of metric hierarchy as inferred from the relationship between participants’ spontaneous grouping responses and the low-level groups induced by periodic accents every two or three tones. Due to limitations related to the integration of random effects in a multinomial logistic regression using the existing R packages, and to account for between-participant variability when estimating the fixed effects, it was determined that a Bayesian multilevel model with participant treated as a group-level factor would be appropriate. Both models estimated the changes in log probability between pure duple and other MTs based on changes from a baseline level for each explanatory variable. For the Bayesian multilevel model, a categorical distribution for the response with a logit link function was used. Normal priors with a mean of 0 and a standard deviation of 8 were chosen for the fixed effects, so as to provide a centered and reasonably wide prior, given the context of the data. A random intercept was included to account for each participant. In an effort to provide a more thorough picture of the relationships underlying the data, the results from both analyses are reported; the multinomial logistic regression analysis was performed using the R package “nnet” (Venables & Ripley, 2002) and the Bayesian multilevel analysis was performed using the R package “brms” (Bürkner, 2017, 2018).
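To illustrate how coefficients from a categorical model with a logit link translate into category probabilities, the sketch below converts linear predictors expressed relative to a pure duple baseline into probabilities for the four metric types. It assumes the usual baseline-category (softmax) parameterization; the function name and the input values are hypothetical and are not estimates from the study.

```python
# Minimal sketch, assuming a baseline-category logit parameterization with
# pure duple as the reference category. Input values are hypothetical.
import math

CATEGORIES = ("pure duple", "pure triple", "mixed", "other")


def category_probabilities(predictors: dict[str, float]) -> dict[str, float]:
    """Convert linear predictors relative to the baseline into probabilities."""
    # The baseline category has a linear predictor fixed at 0.
    scores = {"pure duple": 0.0, **predictors}
    denominator = sum(math.exp(scores[c]) for c in CATEGORIES)
    return {c: math.exp(scores[c]) / denominator for c in CATEGORIES}


# Hypothetical predictors (NOT study estimates): "mixed" is three times as
# likely as the baseline; the other categories are as likely as the baseline.
example = category_probabilities(
    {"pure triple": 0.0, "mixed": math.log(3), "other": 0.0}
)
print({category: round(p, 3) for category, p in example.items()})
```

When all predictors equal 0, each of the four categories receives a probability of 0.25, which is the sense in which a prior centered on 0 favors no category over the baseline.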
Finally, a considerable number of participant characteristics were recorded, but including all of them would have reduced the effectiveness of the statistical models in estimating the relationship of the responses to the independent variables. Instead, a series of exploratory post hoc analyses were conducted to identify potential avenues of inquiry for future studies on the effects of participants’ characteristics on grouping responses and implied metric percepts, including native language, music training, and gender. The relationship between participants’ GRs and induced grouping in accented sequences was also further examined to verify the assumption that physical accents every two or three tones would result in grouping percepts consistent with these accentual patterns. This analysis revealed a small portion of dissonant GRs, that is, responses inconsistent with the induced accentual pattern. Since these represented only 5.5% of responses, distributed among a large number of unique participants, and to avoid tailoring the data to preconceived ideas, these responses were not discarded. Instead, a series of post hoc analyses were conducted to identify potential relationships between dissonant grouping percepts and factors pertaining to both stimuli and participants’ characteristics. Descriptive statistics (percentage distribution) for observed differences across a number of factors (rate, accent type, and gender) are presented in the Results section.
As shown in Figure 2, GR 3 and GR 2 were the most frequent spontaneous grouping percepts overall (N = 4,194 responses), each accounting for about a third of responses (32.1% and 28.6%, respectively); GRs 5, 7, 9, 10, and 11 were relatively rare, each representing less than 1% of all responses. As shown in Table 1, when considering only equitone isochronous sequences (block 1), GR 1 and GR 4 were the preferred responses, with GR 2 and GR 3 being much less frequent. On the other hand, when duple and triple grouping were induced through duration, intensity, or pitch accents every two or three tones (blocks 2 and 3), GR 3 and GR 2 were the most frequent responses (34.9% and 30.5% of responses across these two blocks), followed by GRs 4, 6, 8, and 12, in that order of preference across blocks (3.3–10.8% within block). These results suggest that in the absence of physical cues, spontaneous grouping is limited, with participants showing a strong preference for groups of four tones. On the other hand, when duple and triple grouping is induced through periodic accents, the resulting accentual structure predicts a large proportion of, but not all, grouping percepts. The small increases in the proportion of GR 6 and GR 12 from block 2 to block 3 also suggest that task repetition may have played a role in participants’ perception of higher cardinality groups.
Spontaneous grouping percepts in equitone isochronous sequences (block 1) also varied by rate, χ²(20, N = 415) = 72.75, p < .0001. As shown in Figure 3, GR 1 was most frequently reported at the slowest rate of 950 ms (63 beats per minute or bpm), representing close to half of the “no grouping” responses (49.7%, as compared with 26.8% and 23.5% for sequences with a rate of 550 ms and 200 ms, respectively). On the other hand, GR 4 was most frequent at the fastest rate of 200 ms or 300 bpm (45.7% within response group, as compared with 31.5% and 22.8% for sequences with a rate of 550 ms and 950 ms, respectively). While GR 1 and GR 4 were the most common responses at the moderate rate of 550 ms (109 bpm), this was also the rate at which GR 2 and GR 3 were most frequently reported. GR 8, which represented the fifth most frequent response, was most frequently reported at the fastest rate of 200 ms.
Table 2 presents the summary statistics for GL across and within the three blocks of trials, including GLs resulting from GR 1, which corresponded to the response for “no grouping.” The addition of accents to induce low-level duple and triple grouping in blocks 2 and 3 resulted in a noticeable increase in average GL, even beyond what one might expect to result from grouping percepts limited to the induced duple and triple grouping (GR 2 and GR 3) given an equal number of sequences at each of the three rates. That is, the average GL for an equal number of induced duple and triple grouping percepts at each of the three sequence rates of 200, 550, and 950 ms is only 1,417 ms ([200 × (2 + 3) + 550 × (2 + 3) + 950 × (2 + 3)] / 6 = 1,417). The increase in average GL across the three blocks of trials as well as the increase in the average GL from block 2 to block 3 are consistent with previously observed increases in the proportion of higher cardinality grouping responses across the three blocks of trials. GL variance was also observed to decrease from block 1 to block 2, but increase from block 2 to block 3. As shown in the violin plot (Figure 4), the slowest rate resulted in higher and more variable GL (M = 2,943; SD = 1,672), while induced triple grouping gave rise to a noticeably wider spread distribution around the median across rates (Mdn = 1,650; IQR = 1,650) as compared to induced duple grouping (Mdn = 1,600; IQR = 800).
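The baseline figure of 1,417 ms can be reproduced with a short calculation. It assumes, as stated above, that responses are limited to the induced grouping (GR 2 or GR 3) and are equally distributed across the three rates; the variable names are hypothetical.

```python
# Minimal sketch: expected average group length if every response simply
# matched the induced grouping, with equal numbers of trials per condition.
from statistics import mean

RATES_MS = (200, 550, 950)
INDUCED_GROUPINGS = (2, 3)   # tones per group in duple and triple conditions

baseline_gl_ms = mean(rate * gr for rate in RATES_MS for gr in INDUCED_GROUPINGS)
print(round(baseline_gl_ms))
```

Average group lengths above this value therefore indicate that a share of responses had a higher cardinality than the induced grouping.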
A likelihood ratio test found that block (block 2 vs. block 3) was a significant factor, χ²(1, N = 3,779) = 11.27, p < .001, providing further justification for the inclusion of block as a fixed effect factor in the analysis of the GL data. To account for the relative influence of the different variables, GL data were fitted using a linear mixed-effects model, with participant treated as a random effect factor (see Table 3). When compared with the baseline rate level of 200 ms, changes in rate were associated with the largest increases in GL, followed by the change from induced duple to triple grouping; by comparison, changes in accent type and block resulted in relatively small differences.
To evaluate the effects of each factor in isolation as well as each combination of factors on GL data, estimates of model fit were calculated using the Akaike information criterion (AIC) via a maximum likelihood approach. Every combination of induced grouping, rate, and accent type was fit with participant treated as a random effect factor; block (2 vs. 3) was excluded except in the full model. Table 4 presents the AIC values for each model from lowest to highest, along with the corresponding degree of freedom (df). When comparing multiple models fit by maximum likelihood to the same data, the smaller the AIC value, the better the fit. The results indicate that when considered in isolation, rate appears to be the most important factor, followed by induced grouping and accent type, and that the combination of rate and induced grouping (over accent type) provides a better fit. However, the full model (i.e., the model that includes all the factors) provides the best fit to the data, as measured by the AIC.
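As a reminder of how such a comparison works, the sketch below computes AIC from a maximized log-likelihood and a parameter count (AIC = 2k − 2 ln L) and ranks candidate models from best to worst. The model labels, log-likelihoods, and parameter counts are invented placeholders for illustration only; they are not the values reported in Table 4.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical (log-likelihood, number of parameters) pairs, NOT study results.
candidates = {
    "model_a": (-1050.0, 4),
    "model_b": (-1042.5, 6),
    "model_c": (-1041.9, 9),
}

# Sort model labels from lowest (best) to highest (worst) AIC.
ranked = sorted(candidates, key=lambda name: aic(*candidates[name]))
for name in ranked:
    print(name, aic(*candidates[name]))
```

In this toy example the most heavily parameterized model does not rank first, because the penalty term 2k outweighs its small gain in likelihood. This is what makes the full model's lowest AIC in Table 4 informative rather than automatic.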
The previously observed influence of rate on GRs in block 1 also manifested as a significant relationship between rate and MT, χ²(6, N = 415) = 32.04, p < .001. As shown in Figure 5, pure duple (i.e., GRs 2, 4, 8, and 10) metric percepts in equitone sequences were more frequent at the faster rates of 200 ms and 550 ms while “no grouping” (GR 1) was the preferred response at the slowest rate of 950 ms. Pure triple (GRs 3 and 9) and mixed metric percepts (GRs 6 and 12) were much less frequent, with larger proportions of pure triple and mixed MTs observed at the faster rates of 550 ms and 200 ms, respectively. Finally, odd metric percepts (i.e., GRs 5, 7, and 11) were relatively rare, accounting for less than 2% of the overall responses in block 1.
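The mapping from grouping response to metric type described above can be stated compactly. The following is a minimal sketch of that mapping; the function name and the string labels are illustrative choices rather than the study's own coding scheme.

```python
def metric_type(gr: int) -> str:
    """Map a grouping response (1 to 12 tones per group) to a metric type (MT)."""
    if not 1 <= gr <= 12:
        raise ValueError("grouping response must be between 1 and 12")
    if gr == 1:
        return "no grouping"
    divisible_by_2 = gr % 2 == 0
    divisible_by_3 = gr % 3 == 0
    if divisible_by_2 and divisible_by_3:
        return "mixed"        # GRs 6 and 12
    if divisible_by_2:
        return "pure duple"   # GRs 2, 4, 8, and 10
    if divisible_by_3:
        return "pure triple"  # GRs 3 and 9
    return "odd"              # GRs 5, 7, and 11


print({gr: metric_type(gr) for gr in range(1, 13)})
```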
When grouping was induced by physical accentual cues (blocks 2 and 3), MT was observed to vary based on rate as well as induced grouping (duple vs. triple), with “no grouping” (GR 1) and odd grouping percepts (GRs 5, 7, and 11) being relatively rare (see Table 5). While the proportion of pure duple MT was relatively stable across equitone and accented sequences, the presence of periodic accents was associated with an expected increase in pure triple MT as well as an increase in mixed MT. As shown in Figure 6, induced duple grouping most frequently gave rise to pure duple metric percepts across rates (accounting for 88.4% of MT percepts within the duple grouping condition). On the other hand, mixed MT was perceived most frequently in sequences in which triple rather than duple grouping was induced (accounting for 82.0% of mixed MT percepts across rates and induced grouping conditions). Across induced grouping conditions, mixed MT percepts were also most frequently reported at the fastest rate (59.7%, as compared to 23.0% and 17.4% for IOIs of 550 ms and 950 ms, respectively).
For the subsequent analyses of spontaneous metric percepts in accented sequences (blocks 2 and 3), “no grouping” and odd MT data were combined into a single metric category (“other” MT). Tables 6 and 7 present the parameter estimates for MT across blocks 2 and 3 based on a multinomial logistic regression model with no random effects and a Bayesian multilevel model with participant treated as group-level factor, respectively. The multinomial logistic regression model (see Table 6) estimates the changes in log odds between pure duple MT and the three other metric types based on changes from a baseline level for each of the explanatory variables (rate, induced grouping, and accent type). For example, the change in induced grouping from duple to triple is associated with coefficient estimates of 5.76 (pure triple MT), 4.11 (mixed MT), and 2.40 (other MT), each of which corresponds to a t value that is significant at the p < .001 level. The relative size of the coefficient estimates suggests that the probability of a change from the baseline pure duple MT to each of the three other metric types is highest when the induced grouping changes from duple to triple as compared with other changes in rate, accent type, and block. Furthermore, while the probability of observing a change from pure duple MT to another metric type based on a change in induced grouping from duple to triple was strongest for pure triple MT, it was also relatively high for mixed MT. Smaller significant effects can also be observed based on changes in rate, accent type, and block. Slower rates were associated with a decrease in the probability of observing a change from pure duple to mixed MT, while task repetition (block 3) was associated with a small increase in parameter estimates in favor of mixed MT.
Finally, coefficient estimates based on accent type were very small, with the increase in the probability of observing a change in metric percept from pure duple MT to other MT being only significant when associated with a change from duration to intensity accents.
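To give a sense of scale for the induced-grouping coefficients reported above, they can be exponentiated. This sketch assumes the estimates are on the log-odds scale relative to the pure duple baseline, as is standard for multinomial logistic regression, and that the other predictors are held constant; the variable names are illustrative.

```python
import math

# Coefficients for the change from induced duple to induced triple grouping,
# as reported in the text (baseline outcome category: pure duple MT).
coefs = {"pure triple": 5.76, "mixed": 4.11, "other": 2.40}

# Under the log-odds assumption, exp(coefficient) is the multiplicative change
# in the odds of each metric type relative to pure duple MT.
odds_ratios = {mt: math.exp(b) for mt, b in coefs.items()}

for mt, ratio in odds_ratios.items():
    print(f"{mt}: odds multiplied by about {ratio:.0f}")
```

On this reading, moving from induced duple to induced triple grouping multiplies the odds of a pure triple, mixed, or other percept (relative to pure duple) by roughly 317, 61, and 11, respectively.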
The results from the Bayesian multilevel model (see Table 7) were very similar, with changes in induced grouping from duple to triple being strongly predictive of observed changes from pure duple MT to each of the other three metric types; faster rate and task repetition were also associated with increases in the probability of observing mixed MT. One small difference between the two analyses relates to the previously observed effect associated with accent type. Changes in accent type from duration accents to either pitch or intensity accents did not seem to have a noticeable effect, as their 95% credible intervals contain zero. Since the previous analysis did not account for participant-specific effects, it might be the case that the previously observed effect of accent type might have been due to participant-specific tendencies. Nonetheless, when considering the overall results of these two analyses, it would appear that participant-specific effects did not play a major role in the observed changes in implied metric types.
A series of post hoc exploratory analyses were conducted to examine the effects of participants’ characteristics on spontaneous grouping responses (GRs) and implied metric percepts (MTs). Most observed differences were relatively small, but a few trends were observed. The results suggest potential effects related to native language (larger proportion of GR 1 in equitone sequences for participants with a native language other than English, i.e., 44.6% as compared to 36.9% across groups), composition training (larger proportion of mixed MT percepts in accented sequences for participants with composition training, i.e., 19.8% as compared with 13.3% across groups), and instrumental performance (lower proportion of GR 4 in equitone sequences for participants with no instrumental experience, i.e., 12.2% as compared with 30.6% across groups).
Several unexpected differences were observed based on whether participants self-identified as men or women. Without considering any other variable, the relationship between gender and GR was found to be significant across blocks, χ²(11, N = 4,194) = 75.46, p < .001. Overall, participants who self-identified as men accounted for a larger proportion of GR 1 (58.4%) and GR 12 (66.3%) across blocks, as well as within blocks 2 and 3 combined (61.2% and 67.5%, respectively). Participants who self-identified as women were associated with a larger proportion of GR 4 across blocks (58.8%), as well as within blocks 2 and 3 combined (61.5%), χ²(11, N = 3,779) = 71.20, p < .001.
Dissonant Grouping Percepts
To verify the assumption that periodic accents every two or three tones would result in the perception of low-level duple and triple grouping, the relationship between participants’ grouping percepts and induced grouping in accented sequences was further examined (blocks 2 and 3, N = 3,779 responses from 127 unique participants). Post hoc analyses revealed a small proportion (5.5%) of dissonant grouping responses distributed among a large number of unique participants. Dissonant GRs included mismatched grouping percepts (e.g., GR 3 in the induced duple grouping condition) and odd grouping percepts (GRs 5, 7, and 11), which are dissonant in both induced grouping conditions and accounted for only 12.9% of all dissonant GRs. Dissonant GRs occurred most frequently in the induced triple grouping condition (69.9%, distributed among 51 unique participants), with mismatched GR 4 accounting for almost half of the dissonant GRs (48.0%). Dissonant GRs in the induced duple grouping condition were distributed among 30 unique participants, with mismatched GR 3 accounting for most of the dissonant GRs (81.0%). As shown in Table 8, dissonant grouping percepts were reported more frequently by women than men participants, at the fastest rate of 200 ms, and with intensity accents. Women also accounted for the large majority of odd grouping percepts (81.5%). While odd GRs could be easily dismissed as simple errors (they represented only 0.7% of all GRs within blocks 2 and 3 combined), mismatched GRs are more difficult to explain. Mismatched grouping percepts could be associated with a reduced audibility of physical cues (possibly due to technical or environmental conditions related to online data collection), response errors (e.g., responding based on the higher-level grouping of accented tones instead of the number of tones per group or carelessness), or a higher tolerance for syncopated accents in preferred grouping percepts. 
Nonetheless, the relatively small overall proportion of either type of dissonant grouping response provided support for the use of induced grouping for the investigation of spontaneous metric percepts (or spontaneous “metricization”).
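The classification of responses as dissonant can be sketched as follows. The rule used here is an interpretation of the description above: odd GRs (5, 7, and 11) are dissonant in both conditions, and any other GR that is not a multiple of the induced group size is mismatched. The treatment of GR 1 (“no grouping”) as a separate category is an assumption, as are the function name and labels.

```python
ODD_GRS = {5, 7, 11}


def classify_response(gr: int, induced: int) -> str:
    """Classify a grouping response against the induced grouping (2 or 3)."""
    if induced not in (2, 3):
        raise ValueError("induced grouping must be 2 (duple) or 3 (triple)")
    if not 1 <= gr <= 12:
        raise ValueError("grouping response must be between 1 and 12")
    if gr == 1:
        return "no grouping"  # assumption: not counted as dissonant
    if gr in ODD_GRS:
        return "odd"          # dissonant in both induced grouping conditions
    if gr % induced == 0:
        return "consonant"    # a whole number of induced groups
    return "mismatched"       # e.g., GR 3 under duple, GR 4 under triple


def is_dissonant(gr: int, induced: int) -> bool:
    return classify_response(gr, induced) in {"odd", "mismatched"}


print(classify_response(3, 2), classify_response(4, 3), classify_response(6, 3))
```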
The purpose of the present study was to investigate the variability of spontaneously perceived groups in both equitone and accented isochronous sequences that fell within the temporal window previously identified as giving rise to spontaneous rhythmization, and to explore the metric implications of these spontaneous grouping percepts. The present study found that when presented with equitone isochronous sequences that varied only with respect to rate, listeners reported a wide range of grouping percepts, with “no grouping” and groups of four tones (rather than two or three) being the most frequent responses (61.5% combined overall). More than half of the trials for which no grouping percepts were reported (55.1%) were associated with the slowest rate of 950 ms (63 bpm), and the largest share of the quadruple grouping percepts (45.7%) was associated with the fastest rate of 200 ms (300 bpm). The relatively large proportion of trials for which no grouping percepts were reported is consistent with earlier reports that direct or indirect suggestion was often necessary for listeners to hear rhythmic groups (Bolton, 1894; Harrell, 1937; Temperley, 1963), a finding that has been largely ignored by later reports. On the other hand, the observed influence of rate on spontaneous grouping percepts is consistent with a number of studies on spontaneous synchronization (Repp, 2005; Repp & Su, 2013). For example, in an experiment with 50 graduate and undergraduate students, Duke (1989) found that while a large proportion of the participants’ taps to equitone sequences in a tempo range of 40 bpm to 200 bpm (IOIs of 1,500 ms to 300 ms) was consistent with stimulus tones being perceived as subdivisions, the number of subdivisions as well as spontaneous grouping percepts gradually decreased with slower rates.
Thus, 90% of the sequences were perceived as being made up of subdivision tones at 200 bpm (300 ms), 22% at 100 bpm (600 ms), and only 10% at 60 bpm (1,000 ms); no spontaneous grouping percepts were observed at the slowest rate of 40 bpm (1,500 ms). The current findings, however, contrast with earlier studies with respect to the slower limit for spontaneous rhythmization. For example, Bååth (2015) reports a peak rate of 2,000 ms for trials that resulted in no grouping percepts (23.9% overall), while the probability of observing no grouping percepts at 900 ms (the rate closest to the slowest rate in the current study) was only about 15% (as compared with 55.1% in the current study).
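Because the studies compared here report rate either in beats per minute or as interonset intervals, a small conversion helper makes the correspondences easy to check. This is a minimal sketch with illustrative function names.

```python
MS_PER_MINUTE = 60_000


def bpm_to_ioi_ms(bpm: float) -> float:
    """Interonset interval in ms for a given tempo in beats per minute."""
    return MS_PER_MINUTE / bpm


def ioi_ms_to_bpm(ioi_ms: float) -> float:
    """Tempo in beats per minute for a given interonset interval in ms."""
    return MS_PER_MINUTE / ioi_ms


# Tempi from Duke (1989) as cited in the text, and the rates of the current study.
for bpm in (200, 100, 60, 40):
    print(bpm, "bpm ->", bpm_to_ioi_ms(bpm), "ms")
for ioi in (200, 550, 950):
    print(ioi, "ms ->", round(ioi_ms_to_bpm(ioi)), "bpm")
```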
The preference for quadruple (rather than duple) spontaneous grouping percepts observed in the current study also contrasts with the findings of Duke (1989), who reported that most of the participants’ taps were in a duple relation with the stimulus tones (with only 5.4% and 7.3% of tones corresponding to triple and quadruple subdivisions of the taps, respectively). On the other hand, a quadruple grouping preference is consistent with some of the early reports (e.g., Bolton, 1894; Harrell, 1937), a finding that also appears to have been largely ignored by later reports, with the exception of Bååth (2015). In some cases, the contrast in observed grouping preferences could be related to the nature of the task (perceptual grouping rather than tapping responses) or to participants’ characteristics (e.g., all of Duke’s participants were involved in a college-level music program). Nonetheless, these differences in spontaneous grouping preferences as well as the observed differences in the slower limit for spontaneous grouping call for a closer examination of the results of more recent studies that report subjective rhythmization in equitone sequences with occasional deviant tones (e.g., Brochard et al., 2003), including preference for binary percepts without a consideration of metric hierarchy (e.g., Abecasis et al., 2005; Schaefer et al., 2011).
As expected, when grouping was induced by adding an accent every two or three tones, listeners’ percepts correlated with the physical attributes of the sequences, with groups of three and two tones (in that order) being most frequently reported. Spontaneous grouping in accented sequences was also influenced by rate, with an increase in larger grouping percepts and group length at the fastest rate, and an increased variance in group length at the slowest rate. Most importantly, the examination of the implied metrical structure revealed that listeners had a marked preference for perceiving sequences with an induced low-level triple grouping as forming higher-level binary groups (i.e., GR 6 = 2 × 3 instead of 3 × 2, and GR 12 = 4 × 3 instead of 6 × 2), with non-binary higher-level groups (i.e., GR 9 = 3 × 3 and GR 10 = 5 × 2) being rare. Finally, odd grouping percepts were also relatively rare, as predicted by previous findings that tapping every 5th or 7th tone is more difficult possibly due to the higher cognitive demands associated with repeated counting of larger numbers that cannot be subdivided evenly (Repp, 2007). Overall, the findings are consistent with an observed preference for binary percepts in a variety of task procedures (e.g., Bergeson & Trehub, 2006; Drake, 1993; Parncutt, 1994), although very little attention has been given to the metric implications of these preferences.
The significance of rate and accentual structure as the main physical attributes contributing to listeners’ grouping percepts (with respect to both group length and metric type) was confirmed by mixed-effects models with and without group-level effects based on within-participants’ variability. The increase in larger grouping percepts in the repeated block also suggests the presence of habituation or a learning effect, by which lower-level accents are assimilated by higher-level grouping percepts. This process is consistent with the perception of metric hierarchy, and with findings that listeners with more musical experience have access to more beat levels (e.g., Drake, Penel, & Bigand, 2000; Geiser, Sandmann, Jäncke, & Meyer, 2010) and tend to prefer a slower tactus (e.g., Drake, Jones, & Baruch, 2000). On the other hand, and in contrast to earlier studies that suggest that pitch and/or duration accents play a determining role in meter perception (Ellis & Jones, 2009; Hannon, Snyder, Eerola, & Krumhansl, 2004), the effects of accent type were small and limited to intensity accents, which were associated with an increase in the proportion of reported percepts associated with dissonant groups (i.e., odd and mismatched grouping responses). These findings may have been related to the magnitude of change used to create accents (e.g., Ellis & Jones, 2009, used differences of two to five semitones, while the current study used a difference of only one semitone), a limitation that could be addressed in future experiments. On the other hand, there is also some evidence that the effect of different types of accents on beat perception may differ based on participants’ musical expertise (Bouwer, Burgoyne, Odijk, Honing, & Grahn, 2018; Poudrier, 2017).
In the current study, post hoc observations of grouping preferences based on native language and music training provide further evidence that basic auditory processes supporting pulse perception may be shaped by enculturation (Hannon, Soley, & Ullal, 2012; Iversen, Patel, & Ohgushi, 2008). Thus, future studies could explore the relative contribution of acoustic and participant-specific factors on the effect of different types of accents on spontaneous metricization.
Finally, the post hoc analysis of spontaneous grouping percepts that are dissonant with the low-level induction of duple and triple grouping by means of periodic accents every two or three tones revealed some interesting avenues for further investigation. Of the possible explanations offered for this finding, the possibility of a higher tolerance for syncopated accents in preferred grouping percepts is the most intriguing. Because a large proportion of the dissonant grouping responses were observed in both induced duple and triple grouping conditions (e.g., GR 3 in sequences with accents every two tones and GR 4 in sequences with accents every three tones), and because a relatively large proportion of dissonant grouping responses were associated with participants who self-identified as women, it would seem that individual preferences as well as sex differences might play a role. Apart from studies of rhythmic abilities in children (e.g., Pollatou, Karadimou, & Gerodimos, 2005; Schleuter & Schleuter, 1985), relatively few studies have addressed sex and gender differences in rhythm perception and production, as opposed to single-tone and interstimulus interval estimation in nonmusical contexts or the influence of background music on duration estimation (e.g., Koglbauer, 2015). Nonetheless, there is some evidence of sex differences in benefits associated with the use of synchronous music during exercise (e.g., Karageorghis, Priest, Williams, Hirani, Lannon, & Bates, 2010) and preferred dance tempo (Dahl & Huron, 2007; Dahl, Huron, Brod, & Altenmüller, 2014). Results from studies on preferred dance tempo are particularly suggestive.
In a first study that asked participants to adjust the tempo of a drum machine while dancing alone, Dahl and Huron (2007) found that height and leg length, both of which were highly correlated with sex, were the best predictors of tempo variance across participants, with shorter measurements being correlated with faster tempos. In a second study that aimed to replicate and further investigate the influence of sex and leg length by matching participants in terms of height, Dahl and colleagues (2014) found that age was the best predictor of preferred tempo, followed by height; sex was not found to be a significant factor. These results provide further evidence for an embodied model of rhythm perception (e.g., Phillips-Silver & Trainor, 2007; Toiviainen, Luck, & Thompson, 2010). They also suggest a possible explanation for the observed preference of participants who self-identified as men for larger grouping percepts (i.e., GR 12), and that of participants who self-identified as women for smaller grouping percepts (i.e., GR 4), even in the context of syncopated accentuation. The latter finding is also consistent with an explanation of dissonant grouping percepts resulting from a higher tolerance for syncopated accents in preferred grouping percepts (such as GR 4 in the context of induced triple grouping).
Despite having been one of the earliest rhythm-related phenomena to be investigated experimentally, subjective rhythmization has yet to be more fully understood. The current study shows that in isochronous tone sequences, preference for binary grouping percepts is influenced by rate and accentual structure. In the absence of acoustic grouping cues, spontaneous grouping most often fails to emerge, and when it does, listeners show a preference for quadruple rather than duple grouping, with “no grouping” being more frequently associated with slower rates and quadruple grouping with faster rates. However, when stimulus characteristics suggest a lower-level triple grouping, listeners’ binary preference manifests itself in the form of grouping percepts consistent with compound meters that exhibit a higher-level binary as opposed to ternary structure (as represented by 6/8 and 12/8 as compared with 9/8 time signatures).
Although the experimental design of the current study strongly supported an unambiguous perception of duple and triple grouping by the addition of accents every two or three tones (with each sequence beginning with an accented tone), it was not possible to determine the specific location of the accented tones within listeners’ perceived groups, an important aspect of spontaneous metricization given that the association of accent with group beginning would seem to be influenced by cultural factors (e.g., Iversen et al., 2008; Rothstein, 2008). Furthermore, the post hoc findings of grouping preferences based on gender, including the larger proportion of dissonant grouping percepts associated with participants who identified themselves as women, and the larger proportion of “no grouping” (GR 1) associated with men participants, are without a theoretical basis and warrant further investigation. Future studies could use a design that affords the investigation of the perceived position of the accent within groups as well as differences based on participants’ characteristics, including specialized music training, cultural background, age, and sex.
Thus, the experimental study of the phenomenon of spontaneous rhythmization still has the potential to reveal interesting aspects of temporal processing, including the role of individual preferences in the hierarchical organization of auditory signals. Individual differences in spontaneous grouping percepts could be fruitfully explored using experimental procedures that combine perceptual and motor tasks with measures of brain activity (e.g., Koelsch, Grossmann, Gunter, Hahne, Schröger, & Friederici, 2003; Koelsch, Maess, Grossmann, & Friederici, 2003) using a wider range of accentual patterns and taking into account differences in physical attributes (e.g., height and leg length) for a wider range of participants. Such methods may contribute to a richer understanding of the variegated psychological experience of sound within distinct moving bodies.
This research is supported in part by funding from the Social Sciences and Humanities Research Council of Canada.
The author thanks Mary Farbood (New York University) for the use of the MARL online platform; Stefanie Acevedo, Damian Blättler, and Sherlock Campbell (Yale University) for technical assistance with an earlier pilot study; and Jiayi Bao, Creagh Briercliffe, and Hailey Wu (University of British Columbia) for technical assistance and statistics consulting for the current report.
I have no known conflict of interest to disclose.