The speech-to-song illusion is a perceptual transformation in which a spoken phrase initially heard as speech begins to sound like song across repetitions. In two experiments, we tested whether phrase-specific learning and memory processes engaged by repetition contribute to the illusion. In Experiment 1, participants heard 16 phrases across two conditions. In both conditions, participants heard eight repetitions of each phrase and rated their experience after each repetition using a 10-point scale from “sounds like speech” to “sounds like song.” The conditions differed in whether the repetitions were heard consecutively or interleaved such that participants were exposed to other phrases between each repetition. The illusion was strongest when exposures to phrases happened consecutively, but phrases were still rated as more song-like after interleaved exposures. In Experiment 2, participants heard eight consecutive repetitions of each of eight phrases. Seven days later, participants were exposed to eight consecutive repetitions of the eight phrases heard previously as well as eight novel phrases. The illusion was preserved across a delay of one week: familiar phrases were rated as more song-like in session two than novel phrases. The results provide evidence for the role of rapid phrase-specific learning and long-term memory in the speech-to-song illusion.
The speech-to-song illusion is a perceptual transformation in which spoken phrases can sound increasingly like song as they are repeated (Deutsch et al., 2011). The transformation is usually measured by participant reports that a phrase sounds more song-like on the final repetition compared to the first. While repetition is commonly deployed in music composition (Margulis, 2014), the cognitive mechanisms by which repetition contributes to the perceptual experience of music remain unclear.
The transformation from speech to song influences participants’ ability to recite a phrase and shapes their perceptual expectations upon hearing the phrase again. Deutsch et al. (2011) asked participants to repeat back phrases and found that phrases for which the illusion was experienced were repeated back as song rather than as speech, with the pitches distorted to conform to the melodic expectations of Western tonal music. Once a phrase is perceived as song, participants are also less likely to notice changes in pitch when the phrase is heard again, so long as the changes conform to familiar musical scale structures (Vanden Bosch der Nederlanden et al., 2015). Similarly, participants are better at detecting temporal irregularities when transformed phrases are heard again compared to phrases that continue to be perceived as speech (Graber et al., 2017). Consistent with the behavioral work, Tierney et al. (2013) found that auditory-motor brain regions previously associated with music perception show increased blood-oxygen-level-dependent signal during repetitions of transformed speech compared to untransformed speech. Overall, the findings provide converging evidence that the illusion entails updating perceptual expectations across repetitions based on prior knowledge of Western tonal and rhythmic structures.
Critically, when a phrase is transposed or the syllables are scrambled across repetitions, the illusion does not occur (Deutsch et al., 2011). These results suggest that the illusion, and the underlying changes in perceptual expectations driven by repetition, are tailored to a particular phrase and do not merely reflect a general tendency to perceive music when varying short segments of speech are repeated out of context. In the present study, we were interested in determining whether phrase-specific knowledge (e.g., about the specific sounds and the transitions between them) learned across repetitions of spoken phrases contributes to the illusion.
Early work on the illusion hypothesized that increases in perceived musicality result from the interaction between cognitive processes differentially engaged by speech and music perception. Deutsch et al. (2011) proposed that repetition frees cognitive resources devoted to pitch processing that are inhibited during speech perception, resulting in the emergent salience of perceived pitches. More recently, Castro et al. (2018) suggested that the tension between pitch and speech processing can be explained by the dynamics between the lexical and syllable nodes in a connectionist model of language (Vitevitch et al., 2021). Neither of these hypotheses can explain why the illusion is found for non-speech stimuli such as tones (Margulis & Simchy-Gross, 2016; Tierney et al., 2018a) and environmental sounds (Rowland et al., 2019; Simchy-Gross & Margulis, 2018), given that such stimuli do not engage lexical processing. Moreover, studies testing hypotheses that appeal to competing dynamics between cognitive processes most often examine the effects of repetition over short periods of time (seconds) and do not provide a clear framework for examining the role of learning and memory processes in the illusion, for example, by testing whether stimulus-specific knowledge is learned across repetitions and consolidated in long-term memory.
Several other hypotheses have been proposed, including two variations of a template matching process, according to which perceptual stimuli are continuously compared and matched to existing music templates of common tonal and rhythmic structures found in Western music (Rowland et al., 2019; Tierney et al., 2018a). Presumably, once a match is found, perceptual expectations are updated and a stimulus is perceived as more song-like. According to Tierney et al. (2018a), the cognitive mechanisms by which repetition contributes to the illusion include two components: short-term memory for the storage of a phrase’s melodic structure, and a mechanism for comparing the structure held in short-term memory to music templates. We suggest that the latter mechanism engages working memory processes (Naveh-Benjamin & Cowan, 2023; van Ede & Nobre, 2023) to maintain and manipulate the melodic structure across repetitions until a match is made. Alternatively, Rowland et al. (2019) suggest that the interaction between existing music templates and attentional processes gives rise to the illusion by biasing perception towards musical structure. These hypotheses can explain why the illusion occurs for non-speech stimuli but would not be able to explain how the illusion could persist despite delays and interference from other stimuli between repetitions.
Studies have found that the illusion can occur even when the final repetition comes at the end of the study, following exposure to other stimuli and a short delay period (Graber et al., 2017; Margulis & Simchy-Gross, 2016). If too many other stimuli are heard between repetitions, or if enough time has passed since the last repetition, phrase-specific information cannot be maintained in short-term or working memory, because both processes are limited in the duration and quantity of information that can remain active (Brem et al., 2013; Cowan et al., 2012; Miller, 1956). Based on the hypothesis put forth by Tierney et al. (2018a), the illusion should not be experienced when intervening stimuli are heard between repetitions of the same stimulus. Soehlke et al. (2022), however, demonstrated the opposite effect: interleaved repetitions of different spoken phrases still produced the illusion. Even if some manner of template matching is biasing attention (Rowland et al., 2019), evidence that the illusion is experienced despite interleaved presentations suggests that the results of the template matching process are temporarily stored and redeployed after attention has been focused on intervening stimuli, implicating a role for learning processes. Although Soehlke et al. (2022) provided preliminary evidence that learning and long-term memory contribute to the illusion, further work is needed to test phrase-specific learning, to determine whether interleaved and consecutive stimulus repetitions produce an illusion of similar strength, and to establish how long the illusion lasts.
In the present study, we examined the role of phrase-specific learning and long-term memory in the illusion by designing two experiments that required phrase-specific knowledge to be encoded and retained across delays for the illusion to occur. As a result, changes in ratings of perceived musicality, measured within participants at the level of individual phrases, served as a measure of phrase-specific memory in experimental conditions that precluded sustained contributions from attention and working memory processes. Previous work shows that repetition drives learning of the note and chord sequences particular to a piece of music (Hébert & Peretz, 1997; Janata & Grafton, 2003; Kubit & Janata, 2022a, 2022b) and that the resulting veridical representations in memory interact with schematic knowledge about Western music to shape listeners’ expectations (Bharucha, 1987; Tillmann & Bigand, 2010; Vuust et al., 2022). Repetition is also likely to drive learning of the structure particular to a spoken phrase. We hypothesize that phrase-specific knowledge plays an important role in perceiving musicality in speech. Such learning may be differentially engaged by music perception compared to speech perception because the higher frequency of repetition characteristic of music stimuli (Margulis, 2014) affords more opportunities to learn the time-varying structure. In Experiment 1, we tested the hypothesis that phrase-specific knowledge learned across repetitions influences perception such that a phrase sounds more musical.
Anecdotally, people report that the illusion can persist across long temporal delays, but this has not yet been tested in an experimental setting. Evidence that the illusion persists across delays greater than a day would further implicate learning and memory processes and would undercut previous hypotheses that only consider attention, short-term memory, and working memory: such processes cannot influence perception across time periods spanning multiple days unless learning takes place and the results are stored in long-term memory. In Experiment 2, we tested the hypothesis that phrase-specific knowledge learned across repetition is consolidated in long-term memory and increases perceived musicality upon hearing a phrase after a seven-day delay period.
Experiment 1
In Experiment 1, we extend previous work by Soehlke et al. (2022) by examining, within participants at the level of individual spoken phrases, the effects of interleaved compared to blocked (consecutive) repetitions on the illusion. Interleaved repetitions, during which different phrases are heard between repetitions of a phrase, prevent the same phrase-specific content from remaining the focus of attention and being maintained in working memory across repetitions (see hypotheses described in Rowland et al., 2019 and Tierney et al., 2018a). Thus, results showing the illusion is reliably experienced across interleaved repetitions would provide evidence for the role of phrase-specific learning processes that result in the rapid encoding and retrieval of knowledge despite delays and interference between exposures. We hypothesized that interleaved repetitions would still produce the illusion, but that the illusion would be stronger after blocked repetitions. Blocked repetition may lead to a stronger illusion by engaging the attention and working memory processes previously suggested to underlie the illusion (Rowland et al., 2019; Tierney et al., 2018a) that are limited in contribution during interleaved repetition. Alternatively, blocked repetition may lead to a stronger illusion because it is more conducive to learning, as interleaved and blocked presentation schedules are known to differentially influence learning rates (Brunmair & Richter, 2019; Schorn & Knowlton, 2021; Shea & Morgan, 1979). In the case of the speech-to-song illusion, if phrase-specific learning contributes to the transformation, then the difference in learning rates between blocked and interleaved presentations would produce illusions of different strength.
Method
Participants
All data collection procedures used in the study were approved by the Princeton University Institutional Review Board. We estimated the required number of participants for Experiment 1 based on the results of a previous experiment in which participants heard repetitions of spoken phrases and provided musicality ratings after every repetition (Experiment 1 in Tierney et al., 2021). Using the partial eta squared of the F-test for the repetition variable as the effect size, we determined that 22 participants would suffice for power of .80. We quadrupled the expected number of participants in Experiment 1 to counterbalance condition order and to help account for extra noise inherent to the online study environment.
Eighty-two Princeton University undergraduate students (47 females, 19–23 years; mean age = 21 years) participated in Experiment 1 after providing informed consent. Participants reported neither neurological nor hearing impairments and declared English to be their primary language. Fifty-two participants reported having more than 1 year of formal musical instrument training (“2” – “10 or more years”; mean training = “4 – 5 years”). Participants were compensated with research credits for completing each of the two days of the experiment.
Materials
Equipment
Participants were tested online using their own desktop or laptop computer in a location of their own choosing. Responses were made using a mouse and keyboard. Participants were instructed to wear headphones and to find a quiet and comfortable place to complete the study. The experiment was hosted by Pavlovia (https://pavlovia.org) and controlled by jsPsych (de Leeuw, 2015).
Speech Stimuli
Phrase stimuli were chosen from a stimulus set used in previous studies of the speech-to-song illusion (Graber et al., 2017; Tierney et al., 2013, 2018a, 2018b, 2021). We selected the 16 phrases from the set of 24 “illusion” stimuli that had, on average, the greatest increase in musicality ratings from the 1st to the 8th repetition in Experiment 1 from Tierney et al. (2021). The mean length of stimuli was 6.3 syllables (minimum = 4, maximum = 8) and 1.3 seconds (minimum = 0.84, maximum = 1.80). Speakers featured in the phrase stimuli were three different males who were native speakers of American or British English.
Procedure
Participants heard 16 phrases across two conditions in a single session. In both conditions, participants heard eight presentations of each phrase and, after each repetition, clicked buttons labeled 1 through 10 to indicate the extent to which the phrase was perceived as music, where “1” indicated completely speech-like and “10” indicated completely song-like. A trial comprised a single phrase presentation and musicality rating, and eight trials formed a block. Participants pressed a button to start each block. Within a block, participants were given two seconds to respond on each trial, after which the experiment automatically advanced to the next trial. The two conditions differed in whether repetitions of a phrase were heard consecutively within a block (blocked condition) or interleaved (interleaved condition) such that participants heard other phrases between each repetition (Figure 1). Phrase order was randomized within each interleaved condition block, which always comprised a single presentation of each of eight different phrases. As a result, each phrase was heard eight times, either within a single block (blocked condition) or interleaved across eight blocks (interleaved condition). The assignment of phrases to conditions and the order of stimulus presentation were randomized across participants, and the starting condition was counterbalanced across participants. Participants were instructed to take a short break (no more than five minutes) after completing the first condition. A sketch of how such presentation schedules could be generated is shown below.
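For concreteness, the following is a minimal sketch of how the blocked and interleaved presentation schedules could be generated; it is not the jsPsych experiment code, and the phrase labels and seed are hypothetical placeholders.

```r
# Minimal sketch of schedule generation (not the experiment code).
set.seed(1)  # randomization differed across participants

blocked_phrases     <- paste0("phrase_", 1:8)   # phrases assigned to the blocked condition
interleaved_phrases <- paste0("phrase_", 9:16)  # phrases assigned to the interleaved condition

# Blocked condition: eight consecutive repetitions of a single phrase per block,
# with block order randomized.
blocked_schedule <- unlist(lapply(sample(blocked_phrases), function(p) rep(p, 8)))

# Interleaved condition: each of eight blocks contains one presentation of all
# eight phrases in a freshly randomized order.
interleaved_schedule <- unlist(lapply(1:8, function(b) sample(interleaved_phrases)))
```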
After blocks 3, 6, 11, and 14, participants encountered catch trials during which they heard the sentence, “Don’t rate this speech, instead choose response” followed by a spoken number (“two,” “four,” “five,” or “eight”). Participants were expected to click the button on the musicality scale that corresponded to the spoken number. Half of the catch trials were spoken by a female voice and the rest by a male voice. The purpose of these trials was to identify participants who were listening attentively to the phrases. Fourteen participants incorrectly answered more than one catch trial and were excluded from analyses. Ten participants were also excluded for missing multiple responses in a single block more than once. On average, each participant missed 1.39 trials (SD = 1.53) and 1.46 trials (SD = 1.55) in the blocked and interleaved conditions, respectively.
At the end of the session, participants filled out the Goldsmiths Musical Sophistication Index (GMSI; Müllensiefen et al., 2014) to measure individual differences in music aptitude (see Supplementary Materials accompanying this paper online at online.ucpress.edu/mp; Experiment 1: Individual differences in illusion strength and Table S1). Participants also indicated the extent to which they agreed with the statement, “I paid attention throughout the experiment” using a 5-point scale (i.e., “Strongly Disagree,” “Disagree,” “Neither Agree nor Disagree,” “Agree,” “Strongly Agree”). Two participants responded that they did not agree with the statement and were excluded from analyses.
Analyses
Mixed models were estimated in R (https://www.R-project.org/) using the lmer(), glmer(), and bootMer() functions from the lme4 package (Bates et al., 2014). Descriptions of all analyses are provided in Supplementary Materials Table S2. For each experiment, separate linear mixed models (LMMs) were used to estimate the effect of repetition number and condition on musicality ratings. LMMs were estimated using maximum likelihood based on Laplace approximation. We modeled participant as a random intercept for every model to account for variance resulting from repeated measurements.
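To illustrate the model specification, the sketch below shows one way such an LMM could be fit with lme4. The data frame, simulated values, and column names (ratings_df, rating, condition, repetition, participant) are placeholders rather than the names used in our analysis code, which is archived on OSF (see Author Note).

```r
library(lme4)

# Simulated stand-in for the trial-level data: one row per trial, with columns
# for the musicality rating, condition, repetition number, and participant ID.
ratings_df <- expand.grid(participant = factor(1:20),
                          condition   = factor(c("blocked", "interleaved")),
                          phrase      = factor(1:8),
                          repetition  = 1:8)
ratings_df$rating <- rnorm(nrow(ratings_df), mean = 3 + 0.2 * ratings_df$repetition, sd = 1)

# Musicality rating as a function of condition, repetition, and their interaction,
# with a random intercept per participant (maximum likelihood estimation).
m <- lmer(rating ~ condition * repetition + (1 | participant),
          data = ratings_df, REML = FALSE)
summary(m)
```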
For every LMM we performed 2,000 parametric bootstraps of the model using the bootMer() function in lme4. Both fixed and random effects were estimated for every bootstrapped sample. The fixed effect coefficients were extracted from a bootstrapped model to create a sampling distribution for each coefficient. The median value from the sampling distribution served as the estimated fixed effect coefficient and values from the 2.5 and 97.5 percentiles served as the lower and upper bounds for the 95% confidence interval (CI). For all models we considered a coefficient to be significant if the 95% CI did not include zero. Prior to model evaluation, the data were preprocessed so that model coefficients served as effect sizes that can be interpreted as the estimated change in the dependent variable for a standardized increase in the predictor variable.
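Continuing the sketch above (reusing the fitted model m), the parametric bootstrap of the fixed effects and the percentile CIs could be computed roughly as follows; details beyond the 2,000 resamples are illustrative assumptions.

```r
# Parametric bootstrap of the fixed-effect coefficients (2,000 resamples).
boot_fe <- bootMer(m, FUN = fixef, nsim = 2000, type = "parametric")

# boot_fe$t contains one row per bootstrap sample and one column per fixed
# effect. The median of each column serves as the coefficient estimate, and
# the 2.5th and 97.5th percentiles serve as the bounds of the 95% CI.
fe_summary <- t(apply(boot_fe$t, 2, function(x)
  c(estimate = median(x),
    ci_lower = unname(quantile(x, 0.025)),
    ci_upper = unname(quantile(x, 0.975)))))
fe_summary
```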
To aid comprehension, for each bootstrap iteration we also extracted the estimated marginal means for a model across the full range of observed values by using the predict() function in the multcomp package (Hothorn et al., 2008). This allowed us to create a sampling distribution for each marginal mean. Error bars representing the standard error of the mean of the distribution are included in all figures to convey the variability in our samples, while the CIs reported in text provide information on whether an effect was statistically significant. When a hypothesis warranted further comparisons we used the estimated marginal means at each iteration to calculate the desired contrasts. Because the contrasts were based on the estimated marginal means, the resulting contrast coefficients served as effect sizes that describe the difference between levels of a predictor in the units of the dependent variable. The contrast coefficients were extracted from each bootstrap iteration and the median value from the sampling distribution served as the estimated contrast coefficient. Values from the 2.5 and 97.5 percentiles served as the lower and upper bounds for the 95% confidence interval (CI). For all models we considered a contrast to be significant if the 95% CI did not include zero.
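The per-iteration marginal means and contrasts could likewise be derived from the same fitted model. In this sketch the prediction grid, condition labels, and the example blocked-minus-interleaved contrast at the eighth repetition are illustrative, and the base predict() method for merMod objects is used rather than any package-specific helper.

```r
# Grid of condition x repetition values at which to evaluate marginal means.
newdat <- expand.grid(condition  = factor(c("blocked", "interleaved")),
                      repetition = 1:8)

# Population-level predictions on the grid (re.form = NA ignores random effects),
# recomputed for each parametric bootstrap sample.
boot_emm <- bootMer(m, nsim = 2000, type = "parametric",
                    FUN = function(fit) predict(fit, newdata = newdat, re.form = NA))

# Example contrast: blocked minus interleaved at the eighth repetition,
# calculated at each iteration of the bootstrap.
blk8 <- which(newdat$condition == "blocked"     & newdat$repetition == 8)
int8 <- which(newdat$condition == "interleaved" & newdat$repetition == 8)
contrast_draws <- boot_emm$t[, blk8] - boot_emm$t[, int8]

# Median contrast and percentile 95% CI.
quantile(contrast_draws, probs = c(0.025, 0.5, 0.975))
```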
Results and Discussion
We predicted that repetition would increase the perceived musicality of phrases heard in both conditions. We used a linear mixed model to estimate ratings for a phrase as a function of condition and repetition (see Supplementary Materials Table S2 for descriptions of analyses). Results show participants perceived phrases as more musical on the final repetition compared to the first during both conditions (Figure 2A). For each additional repetition, the musicality rating for a phrase increased by 0.287 (95% CI [0.261, 0.315], p < .001) in the blocked condition and 0.123 (95% CI [0.096, 0.151], p < .001) in the interleaved condition. Phrases heard in the blocked condition demonstrated a greater rate of increase in musicality than phrases heard in the interleaved condition (difference = 0.165, 95% CI [0.127, 0.204], p < .001).
Post hoc contrasts suggest that the blocked and interleaved conditions resulted in similar musicality ratings during the first two repetitions. Across repetitions three through eight, musicality ratings were greater in the blocked condition than in the interleaved condition. By the last repetition, phrases heard in the blocked condition were, on average, perceived as 1.05 rating scale values more song-like than phrases heard in the interleaved condition (Table 1, Figure 2A). The different rates of change across repetitions are also clearly visible in the unmodeled data using participants’ average ratings (Figure 2B). Overall, the pattern of results demonstrates that phrase-specific learning contributes to the illusion across interleaved repetitions, but that the illusion is strongest after consecutive repetitions. The design of Experiment 2 directly tested phrase-specific learning during blocked presentations by probing long-term memory for phrases.
Table 1. Contrasts between the blocked and interleaved conditions at each repetition.

| Contrast | Coefficient | 95% CI | p |
| --- | --- | --- | --- |
| rep. 1 BLK - INT | −0.109 | [−0.271, 0.048] | .189 |
| rep. 2 BLK - INT | 0.055 | [−0.078, 0.186] | .412 |
| rep. 3 BLK - INT | 0.219* | [0.116, 0.326] | < .001 |
| rep. 4 BLK - INT | 0.383* | [0.295, 0.475] | < .001 |
| rep. 5 BLK - INT | 0.549* | [0.457, 0.640] | < .001 |
| rep. 6 BLK - INT | 0.714* | [0.606, 0.819] | < .001 |
| rep. 7 BLK - INT | 0.878* | [0.747, 1.007] | < .001 |
| rep. 8 BLK - INT | 1.043* | [0.884, 1.201] | < .001 |
Note: Values are bootstrapped contrast coefficients and 95% CIs. Contrasts from all models reflect the difference between estimated marginal means. Asterisks denote significant effects. Blocked condition (BLK); Interleaved condition (INT); rep (repetition).
Experiment 2
In Experiment 2, we examine whether the illusion persists, within participants at the level of individual spoken phrases, across longer delays. Re-exposure to a phrase days after the illusion was experienced prevents the same phrase-specific content from remaining the focus of attention and being maintained in working memory across the delay (see hypotheses described in Rowland et al., 2019, and Tierney et al., 2018a). Thus, results showing that phrases are still perceived as more song-like after several days would provide strong evidence for the role of phrase-specific long-term memory in the illusion. Based on the results of Experiment 1, we hypothesized that phrase-specific knowledge initially learned across blocked repetitions would be consolidated in long-term memory, resulting in previously heard phrases sounding more song-like at the start of a second session seven days later, compared to novel phrases heard for the first time.
Method
Participants
Fifty-two Princeton University undergraduate students (32 females, 19–24 years; mean age = 21 years) participated in Experiment 2 after providing informed consent. Participants reported neither neurological nor hearing impairments and declared English to be their primary language. Thirty participants reported having more than 1 year of formal musical instrument training (“2” – “10 or more years”; mean training = “6 – 9 years”). Participants were compensated with research credits for completing each of the two days of the experiment.
Materials
The same apparatus was used as in Experiment 1. Phrase stimuli were the same as those used in Experiment 1.
Procedure
Participants heard 16 phrases across three conditions and two sessions. During the first session, all participants heard eight consecutive presentations of each of eight phrases (day 1 condition). Seven days later, participants again heard eight consecutive presentations of the eight familiar phrases first heard during session one (day 2 familiar condition) as well as eight novel phrases (day 2 novel condition). The structure of trials and blocks was the same as in Experiment 1. In all three conditions, repetitions of a phrase were heard consecutively within a block, as in the blocked condition from Experiment 1. The assignment of phrases to conditions and the order of stimulus presentation were randomized across participants. Participants were instructed to take a short break (no more than five minutes) after completing the first eight blocks during the second session.
After blocks 3 and 6 in session one and blocks 3, 6, 11, and 14 in session two, participants encountered catch trials during which they heard the sentence, “Don’t rate this speech, instead choose response” followed by a spoken number (“two,” “four,” “five,” or “eight”). As in Experiment 1, participants were expected to click the button on the musicality scale that corresponded to the spoken number. Half of the catch trials on each day were spoken by a female voice and the rest by a male voice. Eight participants incorrectly answered more than one catch trial on a given day and were excluded from analyses. Three participants were also excluded for missing multiple responses in a single block more than once. On average, each participant missed 1.54 trials (SD = 1.53) on day 1, 0.88 trials (SD = 1.46) during the day 2 familiar condition, and 0.58 trials (SD = 0.98) during the day 2 novel condition.
At the end of the second session, participants filled out the GMSI (see Supplementary Materials, Experiment 2: Individual differences in illusion strength and Table S3). Participants also indicated the extent to which they agreed with the statement, “I paid attention throughout the experiment” using a 5-point scale (i.e., “Strongly Disagree,” “Disagree,” “Neither Agree nor Disagree,” “Agree,” “Strongly Agree”). A single participant responded that they did not agree with the statement and was excluded from analyses.
Results and Discussion
We first examined whether repetition increased the perceived musicality of phrases heard in all three conditions. We used a linear mixed model to estimate ratings for a phrase as a function of condition and repetition (see Supplementary Materials Table S2 for descriptions of analyses). Results show that participants perceived phrases as more musical on the final repetition compared to the first in all conditions (Figure 3A). For each additional repetition, the musicality rating for a phrase increased by 0.301 (95% CI [0.268, 0.337], p < .001) in the day 1 condition, and by 0.244 (95% CI [0.209, 0.278], p < .001) and 0.312 (95% CI [0.279, 0.345], p < .001) in the day 2 familiar and novel conditions, respectively. Phrases heard in the day 2 novel and day 1 conditions demonstrated similar rates of increase in musicality (difference = 0.011, 95% CI [−0.037, 0.058], p = .647), while day 2 familiar phrases increased at a slightly slower rate than day 2 novel phrases (difference = −0.067, 95% CI [−0.115, −0.023], p = .004) and day 1 phrases (difference = −0.057, 95% CI [−0.104, −0.011], p = .018). Post hoc analysis suggests the lower rate of change in perceived musicality in the day 2 familiar condition is the result of phrases starting off as more song-like on the first presentation.
We predicted that phrase-specific representations encoded in long-term memory during the first session would increase perceived musicality when the same phrase was heard again after the one-week delay period. Post hoc contrasts show that during the second session, perceived musicality was greater for the first repetitions of familiar phrases compared to the first presentations of novel phrases (difference = 0.745; Table 2 and Figure 3A). The first repetitions of phrases in the day 2 familiar condition were also perceived as more song-like compared to the first time the phrases were heard in the day 1 condition (difference = 0.896). Though we found evidence for an effect of long-term memory for phrases, perceived musicality still, on average, decreased by 1.214 rating scale values (95% CI [1.019, 1.408], p < .001) between the last repetition of a phrase in session one and the first repetition in session two. Day 2 familiar phrases heard for the second time in session two were also perceived as more song-like on the final repetition compared to day 2 novel phrases, as well as in comparison to the final repetition of the same phrases at the end of session one (difference = 0.274 and difference = 0.500, respectively; Table 2). The different rates of change across repetitions are also clearly visible in the unmodeled data using participants’ average ratings (Figure 3B). Overall, despite a slightly smaller rate of change across repetitions, day 2 familiar phrases started off and ended up being perceived as more song-like than phrases heard for the first time. The pattern of results demonstrates that phrase-specific long-term memory contributes to the illusion.
Table 2. Contrasts between conditions at the first and final repetitions in Experiment 2.

| Contrast | Coefficient | 95% CI | p |
| --- | --- | --- | --- |
| rep. 1 D2 Familiar - D1 | 0.896* | [0.704, 1.097] | < .001 |
| rep. 1 D2 Familiar - D2 Novel | 0.745* | [0.543, 0.927] | < .001 |
| rep. 1 D2 Novel - D1 | 0.157 | [−0.038, 0.349] | .124 |
| rep. 8 D2 Familiar - D1 | 0.500* | [0.298, 0.690] | < .001 |
| rep. 8 D2 Familiar - D2 Novel | 0.274* | [0.067, 0.455] | .006 |
| rep. 8 D2 Novel - D1 | 0.226* | [0.038, 0.420] | .022 |
Note: Values are bootstrapped contrast coefficients and 95% CIs. Contrasts from all models reflect the difference between estimated marginal means. Asterisks denote significant effects. Day 1 condition (D1); Day 2 Familiar condition (D2 Familiar); Day 2 Novel condition (D2 Novel); rep (repetition).
Unexpectedly, even though we did not find a difference in ratings between the first phrase presentations in the day 1 and day 2 novel conditions, day 2 novel phrases were rated as more song-like on the final repetition (difference = 0.226; Table 2). Though the difference was small, the finding suggests that task-related knowledge that is not specific to a stimulus but is still conducive to the illusion can be learned and transferred to novel phrases, much like the transfer of skills in various learning paradigms (Kóbor et al., 2020; Mosha & Robertson, 2016; Schorn & Knowlton, 2021). Though we used phrases with different words spoken in distinct voices, the transfer effect may reflect pitch and/or rhythmic similarities between phrases that were not directly manipulated in the current study. Importantly, the current study was not designed to test for a transfer effect, and the result could reflect changes in participant behavior across sessions unrelated to learning, e.g., an upward drift in participants’ musicality ratings. Future work measuring the illusion across multiple days is required to establish the transfer effect.
General Discussion
The present study provides evidence for the role of rapid phrase-specific learning and long-term memory in the speech-to-song illusion. Experiments 1 and 2 required phrase-specific knowledge to be encoded and retained across delays for the illusion to occur. As a result, the magnitude of the change in perceived musicality across repetitions also served as a measure of phrase-specific memory. In Experiment 1, learning that took place across interleaved repetitions of different phrases led to increases in perceived musicality. The information encoded was sufficient to produce a more song-like perception even though the interleaved presentations prevented a phrase from being maintained by attentional and working memory processes. In Experiment 2, phrase-specific knowledge learned during the first session led to increases in perceived musicality at the start of the second session one week later. Even though participants only heard eight repetitions, the knowledge contributing to the illusion was consolidated into long-term memory and biased subsequent perception towards a more song-like experience. In both experiments, repetition provided the opportunity for phrase-specific learning to take place which was sufficient to produce the speech-to-song illusion.
Overall, participants in the present study experienced levels of musicality in the speech excerpts comparable to those reported in previous studies using the same speech stimuli. Tierney et al. (2018b) reported average musicality ratings for the final repetitions in Experiments 1–3 of approximately 5 (using a 1–10 response scale), and a similar value was reported in Tierney et al. (2018a). The average musicality ratings for the final repetition in the present study were near or slightly above the median scale value: 5.20 (Experiment 1 blocked condition), 4.16 (Experiment 1 interleaved condition), 5.16 (Experiment 2 day 1 condition), 5.66 (Experiment 2 day 2 familiar condition), and 5.39 (Experiment 2 day 2 novel condition). Note that, to facilitate comparison with previous studies that used a 1–10 response scale, the values reported in this paragraph were shifted from the 0–9 response scale we used to analyze and report the results. Participants in the present study also experienced the illusion to an extent comparable to that experienced in previous studies using the same speech stimuli. The average subject-level difference between the last and first presentation in the present set of experiments was approximately 2 rating scale values in most conditions: 2.01 (Exp. 1 blocked condition), 0.86 (Exp. 1 interleaved condition), 2.11 (Exp. 2 day 1 condition), 1.71 (Exp. 2 day 2 familiar condition), and 2.18 (Exp. 2 day 2 novel condition). The corresponding value reported by Tierney et al. (2021) was approximately 1.9.
Once a phrase transforms, participants make clear predictions about the syllable sequence and are differentially sensitive to pitch and timing deviations when it recurs (Graber et al., 2017; Vanden Bosch der Nederlanden et al., 2015). We suggest that repetition transforms speech into song because the illusion requires learning how a phrase unfolds over time, much like how repetition drives learning of the tonal and temporal (rhythmic) sequence of a particular piece of music (Kubit & Janata, 2022a, 2022b). Previous work examined the illusion after removing temporal structure and found the illusion strength to be unaffected by the introduction of random jitter between syllables in a phrase (Falk et al., 2014; Graber et al., 2017; Tierney et al., 2018b) and between repetitions (Margulis et al., 2015). While beat structure and meter are undoubtedly important features in music, humans also learn to represent higher-order structure in temporal sequences (Dehaene et al., 2015; Janata & Grafton, 2003). For example, ordinal knowledge about a syllable sequence can be learned even if meter is lacking and more complex structures like melodic contour are independent of timing (Dowling, 1978; Hébert & Peretz, 1997). We hypothesize that repetition drives the learning of higher-order sequence structure in phrases and that the cognitive mechanisms that support sequence learning in the illusion are likely not unique to music listening (Janata & Grafton, 2003; Zatorre et al., 2007) or speech perception (Christiansen & Chater, 2008; Conway & Pisoni, 2008).
In Experiment 1, phrases heard in the interleaved condition were perceived as more song-like by the final repetition than they had been on the initial presentation; however, the final repetition of the blocked condition phrases was perceived as more song-like than the final repetition of the interleaved condition phrases (Table 1, Figure 2A). One explanation of these results is that phrase learning is facilitated by blocked repetition. Although the blocked condition precludes a direct measure of phrase-specific learning within a single experiment session free from the influence of attention and working memory processes, we found in Experiment 2 that familiar phrases that had been presented in session one were perceived as more song-like on the first repetition within session two, seven days later. The endurance of the perceptual transformation from speech to song suggests that phrase-specific learning influenced the illusion during blocked conditions in both experiments. Research on visuo-motor sequence learning provides evidence that blocked presentations lead to superior short-term retention but poor long-term memory of the sequences, compared to interleaved presentations (Schorn & Knowlton, 2021). During interleaved presentations, interference arising from other stimuli inhibits performance during learning, but the same interference eventually helps produce a more stable memory trace (Robertson et al., 2004; Shea & Morgan, 1979). The pattern of short-term learning observed in such studies resembles the pattern of musicality ratings between conditions in Experiment 1 and suggests that the perceptual illusion experienced for interleaved phrases would be better preserved over time. An important question for future work is whether additional interleaved repetitions can further strengthen the illusion such that the magnitude of the effect is comparable to that experienced after blocked repetitions. Finding that additional interleaved repetitions produce an illusion comparable to blocked repetitions would provide evidence that differences in learning rates lead to differences in illusion strength between conditions. Additionally, while the results of Experiment 1 demonstrated a reliable increase in perceived musicality in the interleaved condition, future work is needed to clarify whether some threshold exists at which an increase in perceived musicality is experienced as the illusion.
Participants tend to experience greater changes in perceived musicality across repetitions when phrases conform to melodic features typically found in Western music (Tierney et al., 2018b). Previous work has interpreted such results as support for the hypothesis that the illusion results from warping phrases to fit internalized templates of common musical patterns (Rowland et al., 2019; Tierney et al., 2018b). However, given the present results showing that the illusion entails the learning of phrase-specific knowledge, findings on stimulus-related differences can be explained in the context of their influence on learning. We hypothesize that differences in stimuli such as within- and between-syllable pitch slopes (Tierney et al., 2018b), rhythmic stability (Falk et al., 2014), and stimulus length (Rowland et al., 2019) influence the extent to which the structure of a temporal sequence can be learned. For example, shorter phrases and phrases composed of syllables that have flat rather than steep pitch contours produce a stronger illusion and may be easier to learn when heard repeatedly. Indeed, the mnemonic benefit of structure has been well documented in learning paradigms using visuo-motor sequences (Howard et al., 2004; Kóbor et al., 2020; Nissen & Bullemer, 1987) and music (Bharucha & Krumhansl, 1983; Dowling, 1991; Lévêque et al., 2022). One way of distinguishing between sequence learning and mechanisms that entail music-specific template matching (Rowland et al., 2019; Tierney et al., 2018a) is to test whether structures that benefit learning but are atypical music patterns still strengthen the speech-to-song illusion.
Music listening is an active process, during which predictions about what’s next are continuously updated according to prior schematic knowledge of typical music structure as well as representations of veridical sequences in memory (Bharucha, 1987; Tillmann & Bigand, 2010; Vuust et al., 2022). The present study provides evidence that stimulus-specific learning shapes listeners’ expectations across repetitions of spoken phrases and suggests that the influence of such expectations on perception underlies the transformation from speech to song. While the illusion manifests as an explicit awareness of a change in perceived musicality, the underlying changes in perceptual expectations may not require awareness of the learned knowledge or any overt effort by the listener. Sequence learning has been shown to improve task performance even when participants aren’t explicitly aware of the sequences (Robertson, 2007) much like how implicit memory for music reflects prior exposure even when participants don’t explicitly recognize the music (Halpern & Müllensiefen, 2008; Thorpe et al., 2019). Thus, perceiving a stimulus as music may not require active engagement, but simply the opportunity to learn sequence structures found in music—made possible by the repetition inherent to much of it (Margulis, 2014).
Author Note
The authors would like to thank Madeline Kushan for her help with setting up the online experiment, and Mara Breen and Aniruddh Patel for helpful early conversations about this work.
All audio files, data, and analysis code are available through the Open Science Framework at https://osf.io/7dxwu/ (https://doi.org/10.17605/OSF.IO/7DXWU).
Preliminary findings from this study were presented at the Society for Music Perception and Cognition 2022 Meeting in Portland, Oregon.