In a recent article, Bonin, Trainor, Belyk, and Andrews (2016) proposed a novel way in which basic processes of auditory perception may influence affective responses to music. According to their source dilemma hypothesis (SDH), the relative fluency of a particular aspect of musical processing—the parsing of the music into distinct audio streams—is hedonically marked: Efficient stream segregation elicits pleasant affective experience whereas inefficient segregation results in unpleasant affective experience, thereby contributing to (dis)preference for a musical stimulus. Bonin et al. (2016) conducted two experiments, the results of which were ostensibly consistent with the SDH. However, their research designs introduced major confounds that undermined the ability of these initial studies to offer unequivocal evidence for their hypothesis. To address this, we conducted a large-scale (N = 311) constructive replication of Bonin et al. (2016; Experiment 2), significantly modifying the design to rectify these methodological shortfalls and thereby better assess the validity of the SDH. Results successfully replicated those of Bonin et al. (2016), although they indicated that source dilemma effects on music preference may be more modest than their original findings would suggest. Unresolved issues and directions for future investigation of the SDH are discussed.
According to the processing fluency theory of aesthetic pleasure (Reber, Schwarz, & Winkielman, 2004), when an object is more quickly and/or accurately processed, this “ … elicit[s] positive affect because [the experience of ease in perceptual processing] is associated with progress toward successful recognition of the stimulus … or the availability of appropriate knowledge structures to interpret the stimulus … ” (p. 366). In line with this notion, research has confirmed that a number of features of artworks that heighten processing fluency, including figure-ground contrast, symmetry, and visual clarity, promote aesthetic preference (Reber et al., 2004), at least in initial stages of stimulus evaluation (Graf & Landwehr, 2015).
In accordance with research showing a formal relationship between aesthetic responses to visual and audio features (e.g., Liew, Lindborg, Rodrigues, & Styles, 2018), Bonin, Trainor, Belyk, and Andrews (2016) have recently put forth a processing fluency-based model that is specifically relevant to understanding the aesthetic appeal of music. According to their source dilemma hypothesis (SDH), the relative fluency of a particular aspect of musical processing—the parsing of the music into distinct audio streams—is hedonically marked: Efficient stream segregation elicits pleasant affective experience whereas inefficient segregation results in unpleasant affective experience, thereby contributing to (dis)preference for a musical stimulus. To elaborate, following Bregman (1990; Wright & Bregman, 1987), Bonin et al. (2016) propose that a primary task of the auditory system is to conduct a spectrotemporal analysis of the complex sound wave that constitutes the summation of waves that reach the ear from multiple sources in the surrounding environment. This analysis entails identifying which frequency components should be grouped together and thereby mentally represented as emanating from the same physical source, for instance, from the same musical instrument during a live performance.
As Bonin et al. (2016) review, the auditory system uses several cues in segregating audio streams: frequency components that share onset/offset times, exhibit parallel changes in pitch contour, and show similar patterns of interaural timing (cuing similar spatial location) are more likely to be perceived as originating from a single source. Likewise, when multiple sounds are heard over time and these share a similar timbre (e.g., that of a piano versus trumpet) these tend to be grouped together as part of a distinct auditory stream (as they possess a similar spectrotemporal profile of frequency components; McLachlan, 2016). Finally, frequency components that are related harmonically, as integer multiples of a fundamental frequency, tend to be perceived as a single sound source with a distinct pitch, whereas inharmonically related components tend to be perceived as representing distinct sound sources.
Bonin et al. (2016) note that although these various auditory source cues are often congruent with one another, at times they provide conflicting interpretations as to the nature of the auditory scene. Such source dilemmas may arise in the case of music, for instance, when two notes a diatonic semitone apart are sounded simultaneously—here, the shared onset time suggests a single source whereas the interval of a minor second formed by the dyad engenders an inharmonic complex spectrum, suggesting the presence of multiple sound sources. As alluded to above, according to the SDH, ambiguities of this sort impair auditory scene analysis (ASA), eliciting negative affect and leading the music to sound more unpleasant than it would have given congruent source cues.
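The harmonicity notion underlying these source cues can be made concrete with a small numerical sketch (our illustration, not part of Bonin et al.'s materials): a heuristic check of whether an interval ratio lies near a small-integer ratio. The octave (2:1) and perfect fifth (3:2) pass, whereas the equal-tempered minor second (ratio 2^(1/12) ≈ 1.0595) does not, which is why a semitone dyad yields a combined spectrum that resists interpretation as a single harmonic series.

```python
from fractions import Fraction

SEMITONE = 2 ** (1 / 12)  # equal-tempered semitone frequency ratio

def small_integer_ratio(ratio, max_den=16, tol=0.002):
    """Return (p, q) if `ratio` lies within proportional tolerance `tol`
    of a small-integer ratio p/q (denominator <= max_den), else None.
    A heuristic stand-in for 'the partials share a common fundamental'."""
    approx = Fraction(ratio).limit_denominator(max_den)
    if abs(float(approx) - ratio) / ratio < tol:
        return approx.numerator, approx.denominator
    return None

print(small_integer_ratio(2.0))        # octave: (2, 1)
print(small_integer_ratio(SEMITONE))   # minor 2nd: None
```

The thresholds (`max_den`, `tol`) are arbitrary choices for illustration; perceptual fusion is of course graded rather than all-or-none.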
Bonin et al. (2016) conducted two lab experiments to test this hypothesis. In the first study, participants were administered a series of two-alternative forced choice (2AFC) trials. Here, they had to indicate which of two versions of a simple tonal melody they preferred as well as which of the two they found more pleasant. On each trial, participants had to choose between a standard version of the melody played in a piano timbre and a version that was altered in one of three ways. Specifically, in one altered version, approximately 20% of the notes were shifted up or down by a semitone such that they were out of key. In the second version, these same notes were panned 90° such that they were presented to either the left or right ear only as opposed to both ears (i.e., dichotically as opposed to diotically). Finally, in the third version, these notes were played in either a trumpet or xylophone timbre instead of a piano timbre. Bonin et al. (2016) proposed that each of these variations would promote uncertainty in ASA because they were interspersed with harmonic, spatial, or timbral cues that suggested the possibility of multiple sound sources despite an overall musical context consistent with the presence of a single auditory stream. As such, Bonin and his colleagues predicted that individuals would show a clear preference for the standard, unaltered melodies, a prediction that was strongly borne out.
Unfortunately, as Bonin et al. (2016) themselves note, the results of this initial experiment were equivocal in terms of their support for the SDH as they confounded source cue ambiguity with sheer unpredictability. Restated, each melodic variant (harmonic, spatial, or timbral) not only created confusion regarding the number of auditory streams present, but also included stimuli that were relatively unexpected. According to both classic (Meyer, 1956) and contemporary (e.g., Huron, 2006) theories of musical emotion, unexpected musical events (i.e., incorrect musical predictions) at least initially engender negative affect, diminishing liking for the music.
To rectify this issue and provide more compelling support for the SDH, Bonin et al. (2016) conducted a second experiment in which they again asked participants to make a series of choices between a standard and one of three alternative versions of each of several melodies. In this case, each standard melody consisted entirely of parallel dyads of a minor ninth (i.e., intervals of 13 semitones), played using a piano timbre. Each of these dyads created an inharmonic complex spectrum, which has been found to lead tone combinations to sound relatively dissonant (e.g., Bowling, Purves, & Gill, 2017; Prete, Fabri, Foschi, Brancucci, & Tommasi, 2016). The first type of altered version of this stimulus involved changing the timbre of one of the musical voices in each dyad to that of a trumpet on half of the trials and to that of a xylophone on the remaining half of the trials. The second type of altered version left the timbre of the standard stimuli intact but panned one of the musical voices by 90° such that the upper and lower voices were presented to different ears. Finally, a third altered version represented a combination of the latter two, simultaneously modifying the timbre and spatial location of the voices.
Bonin et al. (2016) posited that the standard version of each melody incorporated several cues (including shared spatial location and shared timbre), suggesting the presence of a single auditory source; however, the inharmonicity inherent in each dyad strongly implied the presence of multiple sources. Theoretically, modifying the stimuli to place one of the musical voices in a different timbre and/or a different perceived spatial location should help disambiguate this source dilemma, diminishing negative affect and thereby fostering a preference for the altered stimulus. Although Bonin et al. (2016) did not report the raw proportions of choices of the altered versus standard melodic stimuli, they did report a tally of the number of study participants who predominantly chose each version across trials. This showed that an overwhelming majority of participants expressed a preference for the altered versions relative to the corresponding standard versions. Participants also overwhelmingly agreed that the altered versions were more pleasant than the standard versions. In contrast to Bonin et al.'s (2016) first experiment, this pattern of results reflects a tendency to choose a more, rather than less, complex and unpredictable stimulus, thereby arguing against an expectancy-based alternative account and providing stronger evidence for the SDH.
However, upon consideration, Bonin et al.'s (2016) second experiment also suffered from limitations that undermine its ability to unequivocally support their hypothesis. Most prominently, the experiment did not include a control condition in which musical stimuli were constructed using harmonic as opposed to inharmonic dyads. According to the SDH, if the stimuli were harmonic, the available source cues would unambiguously point to the presence of a single auditory stream. In this case, changing the timbre of one of the voices in the musical stimuli should serve to enhance rather than diminish the source dilemma, leading individuals to show less, as opposed to more, preference for the altered (mixed-timbre) relative to standard (shared-timbre) stimuli. The absence of a “harmonic” control group leaves Bonin et al.'s (2016) results vulnerable to at least two non-ASA-based alternative explanations for participants’ preference for mixed-timbre stimuli: First, according to recent findings by Broze, Paul, Allen, and Guarna (2014), polyphonic music with higher voice multiplicity (i.e., a greater number of perceived voices) is rated by listeners as sounding happier. They propose that these findings may reflect a tendency for multiple musical voices to symbolically represent a (positive) social interaction involving two or more people. Although Broze et al.'s (2014) results were based on differences in the number of same-timbre voices perceived in Baroque fugues, it seems likely that the perception of multiple timbres may have a similar symbolic effect as that of multiple musical lines. If so, the preference for mixed-timbre stimuli reported by Bonin et al. (2016) might have reflected greater liking for musical stimuli with a higher number of perceived voices, irrespective of whether the parsing out of these voices was easy or difficult.
Second, it is possible that participants preferred the mixed-timbre option simply because they liked one or more of the altered timbres (trumpet or xylophone) better than piano. This possibility gains credence from work by Huron and his colleagues (Huron, Anderson, & Shanahan, 2014; Schutz, Huron, Keeton, & Loewer, 2008) suggesting that the trumpet and the xylophone are less capable of expressing sadness than the piano and that the xylophone is a particularly “happy” instrument for which composers rarely write music in a minor key. As such, participants in Bonin et al.'s (2016) study may have judged the mixed-timbre option as more pleasant because the instrumental timbres used to produce it tended to express more positive emotion.
Another concern regarding Bonin et al.'s (2016; Experiment 2) design pertains to the choice of a xylophone timbre to create half of the mixed-timbre versions of the musical stimuli they employed. To reiterate, these stimuli were made up of sequential dyads of a minor ninth, which sound very dissonant when played on the piano. However, as discussed by Sethares (1993, 2005), percussion instruments such as xylophones—which are made up of a series of beams that are free at both ends—are unusual in that they produce overtones that are inharmonically related. The use of such inharmonic instrumental timbres can profoundly alter the subjective quality of musical intervals, for instance, making an octave sound relatively dissonant or an interval roughly approximating an equal-tempered minor ninth sound relatively consonant by altering the pattern of interference between frequency components (Sethares, 1993; Figure 8, p. 1223). This raises the prospect that participants in Bonin et al.'s study may have opted for the mixed-timbre musical stimuli on average largely because the half of trials that included a xylophone sounded far more consonant than those that included a piano alone. Although Bonin et al. (2016) also included mixed-timbre trials using a trumpet rather than a xylophone timbre, preference/pleasantness judgments for these stimuli were not reported separately, making it impossible to assess whether their effect was largely driven by the trials including a xylophone. However, even if participants’ judgments did not reliably differ based on the specific “altered” timbre they heard, given that trials using a xylophone timbre were always intermixed with trials using a trumpet timbre, the consonance of the piano-xylophone dyads may have fostered a mixed-timbre response bias, leading participants to express greater preference for the piano-trumpet dyads than they would have otherwise.
In short, without a more comprehensive accounting for the effect of timbre on music preference, it is possible that Bonin et al.'s (2016) effects were an artifact of the particular instrumental voices that they used to create the mixed-timbre versions of their musical stimuli.
The present study represented a large-scale constructive replication of Bonin et al. (2016; Experiment 2), meant to address these methodological shortfalls and thereby better assess the validity of the SDH. To this end, we: 1) added an experimental condition in which the dyads comprising the musical stimuli were separated by an octave rather than a minor ninth; and 2) manipulated the particular timbre used in the altered stimuli between- rather than within-subjects, such that some participants were presented with piano-trumpet dyads and others with piano-xylophone dyads. If the predictions of the SDH are supported, participants should show greater preference for mixed-timbre musical stimuli when these are based on inharmonic minor ninth dyads and should do so irrespective of the specific “altered” timbre employed (trumpet vs. xylophone). However, participants should show less preference for mixed-timbre musical stimuli when these are based on parallel octaves relative to when they are based on minor ninths—here, the harmonicity of the constituent dyads should strongly cue the presence of a single auditory source and the use of two timbres (e.g., piano and trumpet) should exacerbate rather than mitigate ambiguity in ASA.
Participants were 325 undergraduate students at the University at Albany who completed the study for partial course credit in an introductory psychology course. They were recruited for an approximately 45-min-long study involving “listening to pairs of short melodies and choosing which one you prefer.” Due to technical issues during data collection, 14 of these participants were unable to complete the study in its entirety. Therefore, the final sample consisted of 311 participants (202 female; age: M = 18.83, SD = 1.69). Thirty-one percent reported having at least some formal training in music theory and 67% reported at least some formal training on a musical instrument. Seventy-six participants were randomly assigned to the Low Harmonicity/Trumpet Timbre group, 80 to the High Harmonicity/Trumpet Timbre group, 76 to the Low Harmonicity/Xylophone Timbre group, and 79 to the High Harmonicity/Xylophone Timbre group. There were no significant differences in age, gender, or music training between experimental groups as determined by a series of 2 (Harmonicity: low vs. high) X 2 (Altered Timbre: trumpet vs. xylophone) ANOVAs on each of these demographic variables (p's > .12 for all main effects and interactions). In addition, to assess whether they moderated the effects of Harmonicity and Timbre on music preference, each of the demographic variables was separately included as a predictor in each of the regression analyses to be reported below. Neither age, nor gender, nor music training moderated the main or interactive effects to be reported below (all p's > .11) and, as such, none of these variables will be discussed further.1
The musical stimuli presented to participants were adapted from those developed by Bonin et al. (2016), which included 12 distinct polyphonic melodies, one for each of the 12 major keys. Melodies were each set at a constant tempo ranging from 72 to 106 BPM and varied in length from 11-17 s (see Bonin et al., 2016, for the musical scores, which include tempo markings). For the purposes of the present experiment, several variants of the stimuli were created: First, as in Bonin et al.'s (2016) study, one melodic variant was entirely made up of inharmonic parallel minor 9th dyads and used a piano timbre for both the upper and lower voices. Two mixed-timbre variants of these inharmonic stimuli were then created by changing the timbre of each upper voice from piano to either trumpet or xylophone (producing an additional 24 stimuli, i.e., 12 piano-trumpet and 12 piano-xylophone variants). Next, consonant versions of each of these 36 stimuli were generated by modifying the upper voices such that they formed intervals of an octave, as opposed to a minor 9th, with each corresponding lower voice (see Figure 1). The final set of 72 melodic stimuli was transcribed into Sibelius First music notation software (v. 2018.7; Finn & Finn, 2018). They were then exported as MIDI files, which were played through Sibelius’ virtual instrument sample banks and recorded as WAV files, which were quasi-controlled for loudness by equating RMS amplitudes in Praat (v. 6.0.43; Boersma & Weenink, 2018). All stimuli were presented diotically via headphones (Koss UR-20).
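Loudness equating was performed in Praat; purely as an illustration of the underlying operation, the following Python/NumPy sketch (function name and target level are our own) rescales a set of waveforms to a common RMS amplitude:

```python
import numpy as np

def equate_rms(signals, target_rms=0.1):
    """Rescale each waveform so that all share the same RMS amplitude
    (a rough proxy for loudness, as in the stimulus preparation)."""
    equated = []
    for x in signals:
        rms = np.sqrt(np.mean(np.square(x)))
        equated.append(x * (target_rms / rms))  # assumes non-silent input
    return equated
```

In practice one would also check for clipping after rescaling and, as the phrase “quasi-controlled” signals, RMS equating only approximates equal perceived loudness.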
Upon arrival at the lab, participants were seated at visually isolated computer workstations and instructed to put on a pair of experimenter-provided headphones. Participants were told that they would be listening to pairs of melodies and that, following each pair, they would be asked which melody they preferred as well as which one they found to be more unpleasant. To familiarize them with the task, they then completed a practice trial. The trial began with a 2 s display (“Get ready … “) alerting participants to attend to the stimuli that followed. They were then sequentially presented with two variants of the well-known children's song “Twinkle, Twinkle, Little Star.” In both variants, the melody was doubled at the interval of an octave. However, in the first variant, the upper and lower voices appeared in the same (piano) timbre, whereas in the second variant, the upper voice appeared in a xylophone timbre. There was a 2 s period of silence between variants. Afterward, participants were asked to complete a pair of two-alternative forced choice (2AFC) tasks. Specifically, they were asked, “Which melody did you prefer?” followed by, “Which melody was more unpleasant?” Following the procedure of Bonin et al. (2016), the two tasks were always administered in this fixed order. Responses were tendered by indicating “Melody 1” or “Melody 2” on screen using the computer mouse.
On each of the subsequent experimental trials, participants were presented with the same alerting screen, after which they were sequentially presented with two variants of one of the melodic stimuli developed by Bonin et al. (2016; Experiment 2). In one of the variants, the upper and lower voices appeared in the same (piano) timbre, whereas in the other variant the upper voice appeared in a different timbre. Here, this altered timbre was always either that of a trumpet or a xylophone, depending on the between-subjects condition to which participants had been randomly assigned. In addition, participants randomly assigned to the “low harmonicity” condition heard only pairs of dissonant melodic variants, whereas those assigned to the “high harmonicity” condition heard only pairs of consonant variants. The order in which the same- and mixed-timbre variants of a given melodic stimulus appeared on a given trial was counterbalanced between participants. (To clarify, whereas the same-timbre version of a particular melody appeared first for some participants, it appeared second for others). At the end of each trial, participants were asked to complete the same 2AFC items as in the practice trial. Trials were administered in two blocks. In each block, 12 distinct melody pairs were presented in a different random order. As such, participants completed two identical trials for each of 12 melodic variants for a total of 24 experimental trials.
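As a procedural summary of this design, the sketch below (our illustration; all names hypothetical) builds one participant's trial sequence: two blocks, each presenting all 12 melody pairs in a fresh random order, with the within-pair presentation order fixed per participant and counterbalanced between participants.

```python
import random

def build_trials(n_melodies=12, n_blocks=2, mixed_first=False, seed=None):
    """One participant's trial list. Each block presents every melody
    pair once in a random order; the same-/mixed-timbre presentation
    order within a pair is held constant for a given participant."""
    rng = random.Random(seed)
    pair = ("mixed", "same") if mixed_first else ("same", "mixed")
    trials = []
    for _ in range(n_blocks):
        melodies = list(range(n_melodies))
        rng.shuffle(melodies)
        trials.extend((m, pair) for m in melodies)
    return trials  # 24 trials in total with the default arguments
```

Assigning `mixed_first=True` to half of the participants implements the between-participants counterbalancing described above.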
Following the experimental procedure, basic demographic information including age and gender was collected and participants were asked how many years of formal training they had received in music theory as well as how many years of formal training they had received on a musical instrument. Finally, they were fully debriefed and released.
Table 1 presents the proportion of trials on which participants expressed a preference for the mixed-timbre melodic variant as well as the proportion of trials on which they indicated that the mixed-timbre variant was more pleasant (less unpleasant), with results indexed by condition. To review, the SDH predicts that participants should show greater preference for the mixed-timbre musical stimuli (and, accordingly, rate them as more pleasant) when these are relatively low in harmonicity.
Table 1 (column headers only; cell values not preserved in this copy):
Top panel (preference judgments): |Altered Timbre||Low Harmonicity: M, SD||High Harmonicity: M, SD|
Bottom panel (pleasantness judgments): |Altered Timbre||Low Harmonicity: M, SD||High Harmonicity: M, SD|
Note: Participants were originally asked, “Which melody did you prefer?” followed by, “Which melody was more unpleasant?” For ease of interpretation, descriptive statistics for the latter item were reverse coded to reflect proportion of choice of mixed-timbre option as more pleasant.
To test this, the data for the dichotomous 2AFC preference and unpleasantness trials were separately modeled using a set of 2 (Harmonicity: low vs. high) X 2 (Altered Timbre: trumpet vs. xylophone) mixed-model multiple logistic regression analyses. Beginning with the analyses of preference judgments, consistent with the predictions of the SDH, these revealed a main effect of Harmonicity, Wald χ2(1, N = 311) = 165.26, p < .0001, suggesting that participants were more likely to express a preference for the mixed-timbre melodic variants when they were made up of inharmonic (M = 61.27%) as opposed to harmonic dyads (M = 30.21%). There was also a main effect of Altered Timbre, suggesting that participants were more likely to express preference for a mixed-timbre variant if the upper voice was played using a xylophone (M = 67.85%) as opposed to a trumpet timbre (M = 23.08%), Wald χ2(1, N = 311) = 294.10, p < .0001. These effects were qualified by a significant Harmonicity X Altered Timbre interaction, Wald χ2(1, N = 311) = 45.74, p < .0001, suggesting that the preference for inharmonic versus harmonic mixed-timbre melodic variants differed based on the particular altered timbre used to produce the stimuli: As indicated by the raw proportions in Table 1, when the altered timbre was a xylophone sound, participants in the “low harmonicity” condition showed an almost uniform preference (M = 92.16%) for the mixed-timbre melodic variant; however, when the altered timbre was a trumpet sound, participants in this condition no longer showed an absolute preference for the mixed-timbre variant, but instead, only a relative preference (M = 30.37%) compared to those in the “high harmonicity” condition (M = 16.15%). According to a supplementary planned comparison, this simple effect within the “trumpet” condition was nonetheless highly reliable, Wald χ2 (1, N = 156) = 16.90, p < .0001, in line with the predictions of the SDH. 
(All remaining simple effects were also reliable, including that of Harmonicity within the Xylophone Timbre group, Wald χ2 [1, N = 155] = 183.02, p < .0001, that of Altered Timbre within the Low Harmonicity group, Wald χ2 [1, N = 152] = 239.13, p < .0001, and that of Altered Timbre within the High Harmonicity group, Wald χ2 [1, N = 159] = 59.91, p < .0001; see Table 1 for all descriptive statistics).
Proceeding to the analyses of 2AFC unpleasantness judgments, these were first reverse coded to reflect choice of the more pleasant option for the sake of interpretability. As shown in Table 1 (bottom panel), the pattern of pleasantness judgments closely mirrored that of preference judgments and, indeed, there was a near-perfect correlation, r(309) = .98, between the proportion of total choices of mixed-timbre options for the 2AFC preference and pleasantness tasks across the 24 experimental trials. Accordingly, the results of mixed-model multiple logistic regression analyses revealed the exact same pattern of effects, including main effects of Harmonicity, Wald χ2(1, N = 311) = 165.24, p < .0001 and Altered Timbre, Wald χ2(1, N = 311) = 310.26, p < .0001, qualified by a Harmonicity X Altered Timbre interaction, Wald χ2(1, N = 311) = 38.65, p < .0001.
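The statistical software behind these mixed-model analyses is not specified in this report; whatever the implementation, each reported Wald χ² with one degree of freedom is simply the squared z statistic of a coefficient. A minimal helper (our sketch) makes that relationship explicit:

```python
import math

def wald_chi2(beta, se):
    """Wald chi-square (df = 1) and two-sided p-value for a single
    regression coefficient: chi2 = (beta / se)**2, with p taken from
    the normal tail, since a chi-square(1) variate is a squared
    standard normal."""
    z = beta / se
    chi2 = z ** 2
    p = math.erfc(abs(z) / math.sqrt(2.0))  # = 2 * (1 - Phi(|z|))
    return chi2, p
```

For example, a coefficient twice its standard error gives χ²(1) = 4.0, p ≈ .046, just under the conventional threshold.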
In this experiment, we reinvestigated Bonin et al.'s (2016) source dilemma hypothesis, the notion that the fluency of auditory stream segregation experienced when listening to music is positively associated with music preference. To this end, we conducted a large-scale constructive replication of one of their studies (Bonin et al., 2016; Experiment 2), the results of which were ostensibly consistent with the SDH, yet which also suffered from methodological limitations that ultimately called its support for the hypothesis into doubt. Specifically, we asked participants to indicate their preferences for same- versus mixed-timbre versions of several polyphonic melodies. For some participants, the mixed-timbre versions were produced with a piano timbre in the lower voice and a trumpet timbre in the upper voice, whereas for others, the mixed-timbre versions were produced with a xylophone rather than a trumpet timbre in the upper voice. In addition, for some participants, all melodies were made up of parallel minor ninths (“low harmonicity” condition), whereas for others, they were made up of parallel octaves (“high harmonicity” condition).
According to Bonin et al. (2016), the shared onset time of the musical voices comprising such stimuli suggests the presence of a single auditory source; however, when the stimuli are low in harmonicity, it suggests the presence of multiple sound sources, creating a “source dilemma.” As such, their model predicts greater preference for the mixed-timbre musical stimuli in the “low harmonicity” condition as the use of different timbres in the upper and lower voices—indicating the presence of two distinct sound sources—helps resolve this perceptual dilemma, diminishing the negative affect with which it is associated.
Our results supported this prediction: Participants who heard pairs of melodies made up of inharmonic dyads, relative to those who heard pairs made up of harmonic dyads, were more likely to express preference for mixed-timbre variants of these musical stimuli. Notably, this effect was significantly moderated by timbre—when a xylophone sound was used to alter the timbre of the upper melodic voices, participants in the “low harmonicity” condition expressed a much stronger preference for the mixed-timbre variants than when a trumpet sound was used. This is consistent with our speculation that Bonin et al.'s (2016) original effects, which were obtained with the inclusion of a xylophone timbre in each block of trials, may have at least partially reflected a general preference for this timbre (e.g., due to its “happy” expressive tone) and/or a tendency for such timbres to produce a “smoother” complex spectrum when used to generate dyads of a minor ninth. However, by showing that mixed-timbre dyads featuring a trumpet rather than a xylophone sound were also relatively preferable when the dyads were low versus high in harmonicity, the present study rules out the possibility that the effects found by Bonin et al. (2016) were solely an artifact of their timbre manipulation. In addition, by adding a high-harmonicity control group and showing that participants were more inclined to choose the same-timbre, as opposed to mixed-timbre variants in this condition (again, consistent with the predictions of the SDH), the present study also rules out the possibility that Bonin et al.'s (2016) participants merely preferred melodic stimuli with a greater number of musical voices.
To be clear, the present findings do not exactly replicate those of Bonin et al. (2016), as they fail to show an absolute preference for inharmonic mixed-timbre melodies across conditions—indeed, when a trumpet instead of a xylophone timbre was used, participants were strongly inclined to prefer the same-timbre inharmonic variant on average. However, with its addition of an experimental manipulation of stimulus harmonicity, the present study was able to reveal a robust relative preference for inharmonic versus harmonic mixed-timbre melodies. This effect remained reliable irrespective of which of two distinct altered timbres was employed, offering converging evidence for the overarching predictions of the SDH. A fuller understanding of the precise manner in which timbre variations influence the generation and resolution of source dilemmas will require additional experimentation. The results of future studies will likely prompt further refinement of the model.
Although our study does provide renewed support for the SDH, it does share an important limitation of the initial research of Bonin et al. (2016) in that it offers no direct evidence that the use of different timbres for each of the musical voices in a low-harmonicity polyphonic melody reduces the fluency of perceptual processing of this melody. In future research, one way to test this would be to build on Bonin et al.'s (2016) assumption that “source dilemmas are cognitively effortful” (p. 180) and as such should impair performance on an unrelated attentionally-demanding secondary task. Pertinent to this proposal, Bonin and Smilek (2016) recently reported the results of a study showing that presentation of task-irrelevant contrapuntal background music interferes more with performance on a focal short-term memory task when the music is inharmonic as opposed to harmonic. Here, the background musical stimuli were presented using a single (piano) timbre. However, if one of the voices in the inharmonic stimuli were altered in timbre as in the present study, the SDH would predict diminished interference and thereby improved performance on the focal task due to the resolution of the source dilemma. Likewise, the background presentation of mixed-timbre variants of the harmonic musical stimuli should create rather than resolve a source dilemma, increasing cognitive interference and impairing performance relative to when a single timbre is employed. To our knowledge, these predictions await empirical scrutiny.
In addition, although the present study did rule out a number of alternative explanations for Bonin et al.'s (2016) initial findings, we believe that there may be at least one other competing account: As mentioned above, Broze et al. (2014) have proposed that polyphonic music with a greater number of musical voices may be perceived as sounding happier and less lonely because it symbolically represents a social setting. Though speculative, it is possible that such effects differ depending on the conventionality of the music: When exposed to relatively unconventional musical stimuli, such as the inharmonic melodies used in the present experiment, participants presumably have difficulty processing this music using standard harmonic and/or melodic schemas. As a result, they may be more likely to attend to secondary cues such as voice multiplicity when judging the music's emotional character. Assuming that, as in the case of multiple musical lines, the use of multiple timbres symbolizes a larger social group, this might prompt listeners to interpret inharmonic music as expressing more happiness/less loneliness, and thereby to find it relatively preferable. To evaluate this alternative hypothesis, it will be necessary to test whether individuals indeed perceive mixed-timbre musical stimuli as more likely to connote a social setting (i.e., being in a group versus alone) and whether such stimuli actually express greater happiness and less loneliness, particularly when they are also harmonically unconventional.
Beyond this, there are at least two ways in which the SDH might be refined to more comprehensively contribute to understanding music preference. First, as the results of the present study suggest, the model needs to more fully account for how harmonicity and timbre shape the (in)coherence of a musical percept and its associated affective consequences. As alluded to earlier, Bonin et al. (2016) assume that certain, traditionally “consonant” musical dyads (e.g., the octave) tend to produce complex spectra with harmonically related frequency components. This leads the components to fuse in auditory perception, thereby implying the presence of a single sound source. Conversely, traditionally “dissonant” dyads (e.g., the minor 2nd) tend to produce complex spectra with inharmonically related frequency components, reducing fusion and implying the presence of multiple sound sources. However, as discussed by Sethares (2005), the fusion of components in the complex sound wave constituting a dyad depends on the harmonicity of the individual tones used to produce it. When the tones making up a dyad contain partials that are not harmonically related, their components will not necessarily fuse at traditionally “consonant” intervals such as the octave—indeed, such intervals may even sound particularly “rough” (Sethares, 1993). In addition, even when the timbres used to produce the individual tones in a dyad (or chord) are each relatively harmonic, but differ in the amplitude and/or timing of particular partials, those with greater spectrotemporal similarity (e.g., a violin and viola vs. a violin and flute) are presumably more likely to lend themselves to a unified perceptual representation. As such, in future iterations of the SDH, it would be worthwhile to incorporate a systematic means of estimating the coherence of the auditory percept elicited by the combination of tones with various spectrotemporal features.
This would enable the model to make more precise predictions regarding how variables critical to ASA—including harmonicity, timbre, and spatial location—collectively impact processing fluency and thereby music preference.
A second general limitation of the SDH is its fairly exclusive focus on bottom-up processing in shaping aesthetic responses to musical stimuli. As mentioned earlier, a considerable amount of evidence supports the notion that when stimuli are processed automatically, experiences of fluency are associated with enhanced aesthetic preference (Reber et al., 2004). However, Graf and Landwehr (2015) have recently proposed that when individuals have a high need for cognitive enrichment—a motivation to “ … revise and adapt their knowledge structures to the stimulus … ” (p. 402)—this can spark controlled (top-down) processing aimed at reducing feelings of disfluency when stimuli are initially difficult to process. Progress at disfluency reduction is posited to foster pleasant feelings of interest or “flow” (Csikszentmihalyi, 1990), whereas a lack of progress gives rise to unpleasant feelings of confusion. This model implies that for individuals whose personalities are characterized by higher levels of openness to experience (Goldberg, 1993), or for those who are at least momentarily inclined to effortfully scrutinize a musical stimulus, the initial discomfort caused by a source dilemma may heighten curiosity, ultimately leading to positive aesthetic responses, provided that deliberate efforts at processing the music are fruitful. Speculatively, controlled processing effects of this sort may help account for cases in which traditionally dissonant stimuli are associated with positive emotional ratings (e.g., Blood, Zatorre, Bermudez, & Evans, 1999) or are even preferred over traditionally consonant stimuli (e.g., Lahdelma & Eerola, 2016). In the present study, no attempt was made either to manipulate processing style (automatic vs. controlled) or to measure individual differences in the need for cognitive enrichment.
However, the SDH may offer a useful point of departure for empirically exploring how bottom-up versus perceiver-driven processes uniquely contribute to aesthetic preferences, in the auditory domain and otherwise.
In sum, the results of the present study constructively replicate those of Bonin et al. (2016), offering converging evidence for their recently proposed source dilemma hypothesis. Clearly, additional research is needed to provide more direct support for the assumptions underlying the SDH and to explore the full implications of the model. However, if the hypothesis is ultimately borne out empirically, it will uniquely contribute to understanding how fundamental processes of auditory perception shape preferences for and emotional responses to music.