Music expertise has been shown to enhance emotion recognition from speech prosody. Yet, it is currently unclear whether music training enhances the recognition of emotions through other communicative modalities such as vision and whether it enhances the feeling of such emotions. Musicians and nonmusicians were presented with visual, auditory, and audiovisual clips consisting of the biological motion and speech prosody of two agents interacting. Participants judged as quickly as possible whether the expressed emotion was happiness or anger, and subsequently indicated whether they also felt the emotion they had perceived. Measures of accuracy and reaction time were collected from the emotion recognition judgements, while yes/no responses were collected as indication of felt emotions. Musicians were more accurate than nonmusicians at recognizing emotion in the auditory-only condition, but not in the visual-only or audiovisual conditions. Although music training enhanced recognition of emotion through sound, it did not affect the felt emotion. These findings indicate that emotional processing in music and language may use overlapping but also divergent resources, or that some aspects of emotional processing are less responsive to music training than others. Hence music training may be an effective rehabilitative device for interpreting others’ emotion through speech.
Playing a musical instrument requires the development and application of numerous specialized skills, such as the rapid processing of auditory information, the translation of written notation into coordinated movements, and precise rhythmic timing. These skills develop through substantial training and practice, a complex and motivating multisensory experience that requires the integration of auditory and visual sensory information with motor responses (Edward & Carole, 1990; Lee & Noppeney, 2014; Zatorre, Chen, & Penhune, 2007). Therefore, the study of musicians provides an ideal model to explore experience-dependent changes in cognition, perception, and neuroplasticity.
A central component of music training involves the production and perception of complex, expressive patterns of sound (Gardiner, 2008), and it is therefore not surprising that musicians show perceptual differences compared to nonmusicians in the auditory domain. In terms of perceptual ability, musicians have been shown to be more capable of segmenting speech from background noise (Parbery-Clark, Strait, Hittner, & Kraus, 2013; however, see Boebinger et al., 2015, and Madsen, Marschall, Dau, & Oxenham, 2019, for a lack of difference between musicians and nonmusicians), and show enhanced perception of auditory features such as pitch (Kishon-Rabin, Amir, Vexler, & Zaltz, 2001; Moreno et al., 2009), timbre (Chartrand & Belin, 2006), and intensity (Hausen, Torppa, Salmela, Vainio, & Särkämö, 2013). It is plausible that these enhanced perceptual abilities extend to the perception of emotions, which, in both music and speech, are expressed through variations in acoustic features such as intonation, pitch, and intensity (Juslin & Laukka, 2003). However, the effect of music expertise on emotion processing has been sparsely studied and research has predominantly focused on emotions expressed through excerpts of musical performances.
Within the music domain, research has indicated that musicians and nonmusicians do indeed perceive emotions differently. For example, length of music training is positively correlated with the accuracy with which emotions are perceived when hearing musical excerpts (i.e., when the emotion perceived by participants matches the emotion intended in the music excerpt; Lima & Castro, 2011a; Livingstone, Muhlberger, Brown, & Thompson, 2010), and musicians and nonmusicians differ in perception of expressiveness and emotion when observing solo musical performances of a drummer (Di Mauro, Toffalini, Grassi, & Petrini, 2018). It is also possible that these differences extend to nonmusical stimuli, as Juslin and Laukka (2003) have indicated that acoustic cues used to express emotions are similar in both music and speech. For example, anger in music is expressed through fast tempo, high sound level and high-frequency energy, and rising pitch. Similarly, anger in speech is expressed through fast speech rate, high voice intensity and high-frequency energy, and rising pitch (Juslin & Laukka, 2003). Thus, if common mechanisms are used for processing emotions in both domains, it is plausible that the advantage shown by musicians for processing emotion in music extends to the processing of emotion in speech.
Only a few studies have examined whether music training affects emotion perception from auditory sources beyond the music domain, although some evidence can be drawn from research on linguistic prosody, a nonverbal component of speech characterized by variations in pitch, loudness, rhythm, and timbre that can signify the emotion of an utterance (Wu & Liang, 2011). For example, Thompson, Schellenberg, and Husain (2004) showed across three experiments not only that musically trained adults performed better than nonmusicians when recognizing different emotions from prosody, but also that 6-year-old children who had been trained with musical instruments for one year were better able to recognize emotions from prosody than children of the same age who had taken drama classes for one year. Similarly, Good et al. (2017) showed that children (aged between 6 and 15 years) who used cochlear implants and undertook six months of piano lessons showed enhanced emotional speech prosody perception compared to children with cochlear implants who undertook painting lessons for six months. Finally, Lima and Castro (2011b) assessed the effect of musical experience on recognition of emotions from speech prosody, and showed that musicians perceived emotion more accurately than participants with no music training, a difference that was found across all six basic emotions (anger, disgust, fear, happiness, sadness, and surprise; Ekman, 1992). However, the stimuli consisted of single-speaker sentences, whereas studies have indicated that emotion recognition is more accurate when stimuli represent multiple people interacting with the same emotion (Cauldwell, 2000; Clarke, Bradshaw, Field, Hampson, & Rose, 2005). Furthermore, the findings of Lima and Castro (2011b) contrast with an earlier study by Trimmer and Cuddy (2008), which found no association between music training and recognition of emotional speech prosody.
However, as Trimmer and Cuddy (2008) had carried out a correlational study, there was no cut-off specified for musicianship and thus no indication of the number of participants who had music training. Since recognition of emotions is a fundamental social function (Klaus, Rainer, & Harald, 2001), and therefore likely to still be efficient in nonmusicians (Hauser & McDermott, 2003), it is plausible that an extensive period of music training is needed to detect differences in emotion perception (Lima & Castro, 2011b).
Although sound is a dominant feature of music, auditory information usually coincides with other elements of music practice or performance, such as translating musical notation into motor activity (Herholz & Zatorre, 2012). Furthermore, visual expressive skills are highly important to musicians (Lindström, Juslin, Bresin, & Williamon, 2003) who must learn to use body movement to visually express intention and emotion (Dahl & Friberg, 2007), and must also learn to visually interpret the intentions of other musicians in order to communicate during live performances (Gardiner, 2008). Nevertheless, few studies have explored the effects of music training on visual and multisensory perception of emotion. Vines, Krumhansl, Wanderley, Dalca, and Levitin (2011) found that when musicians were presented with the sound and video of a solo clarinet performance, the visual information augmented the emotional content perceived through the sound, compared to when the sound was presented alone. This effect did not extend to nonmusicians in separate studies by Petrini, McAleer, and Pollick (2010) and Petrini, Crabbe, Sheridan, and Pollick (2011), who found that sound dominated the visual signal in the perception of affective expression of excerpts of drumming and saxophone performances. Additionally, Lima et al. (2016) recently showed that individuals with congenital amusia (a lifelong impairment in music perception and production) show a reduced ability to recognize emotion not only from prosody and other emotional sounds (e.g., crying) but also from silent facial expressions, indicating that impairment in music processing affects emotion recognition beyond the auditory domain. Taken together, these findings suggest that musicians may place greater weight on emotional visual information compared to nonmusicians, at least when perceiving emotions from musical performances. This suggestion has been recently supported by Di Mauro et al. (2018), who found that musicians placed greater weight on visual information when perceiving emotion from drumming clips. However, the effect of music expertise on visual and multisensory perception of emotion has only been tested using musical stimuli, where musicians have more expressive knowledge, and so it remains unclear as to whether this effect extends to visual and multisensory information in nonmusical domains, where both musicians and nonmusicians can be considered to have a similar level of expertise.
Finally, it remains unclear whether music training affects the feeling of emotion. Indeed, it has been postulated that the perception of emotion, whether exhibited by another person or within a piece of music, may directly induce the same emotion in the perceiver (Hatfield, Cacioppo, & Rapson, 1993; Juslin, Harmat, & Eerola, 2014). For example, Neumann and Strack (2000) have shown that listeners often feel the emotions portrayed in the vocal expressions of other people, and similarly, Juslin et al. (2014) were able to induce sadness in listeners by playing music featuring a voice-like cello timbre performing a song with slow tempo, legato articulation, and a low volume; acoustic features consistent with the vocal expression of sadness (Juslin & Laukka, 2003). In both cases, the induction of emotion requires the detection of sound patterns, and therefore if musicians do show enhanced recognition of emotion, it is possible they would be more likely to feel such emotions themselves. However, the mechanisms through which emotions are induced extend beyond the mere perception of emotion and include other factors such as subjective appraisal of the stimulus (Roseman & Smith, 2001; Scherer, 1999). Consequently, emotions felt do not always coincide with emotions perceived (Juslin & Laukka, 2004; Scherer, 1999), and it is therefore unclear whether any perceptual benefit of music training would affect emotion induction.
The aim of the current study was to investigate whether music expertise affects the perception and feeling of emotions expressed by others’ social interactions. We used a recently created and validated set of audiovisual clips depicting two people interacting (Piwek, Petrini, & Pollick, 2016) with angry or happy emotions, and presented the clips as auditory-only, visual-only, and audiovisual combined, to test 1) whether the previously found benefit for the combined clips in nonmusicians (Piwek, Pollick, & Petrini, 2015) extends to musicians; 2) whether musicians’ enhanced ability to recognize emotions from musical gestures extends to nonmusical gestures, thus indicating whether musicians’ enhanced ability to recognize emotions is specific to speech prosody; 3) whether musicians’ enhanced ability to recognize emotions extends to their feeling of such emotions. Specifically, we tested for differences in how accurately and quickly musicians and nonmusicians recognized the emotion in each clip, and in how frequently musicians and nonmusicians felt the perceived emotions. This knowledge is necessary to further our understanding of how music training affects cognitive and emotional processes, to establish whether emotional processing in music and language shares resources beyond those specific to sound (Lima & Castro, 2011b), and to indicate to what extent music can be used as an effective therapeutic device for individuals with emotion-processing disorders (e.g., autistic persons; Sharda et al., 2018).
In light of previous research indicating that musicians detect emotions more accurately in music and speech (Lima & Castro, 2011a; 2011b)—together with the acoustic similarities between emotion expression in music and speech (Juslin & Laukka, 2003)—we hypothesized that musicians would perceive emotions in the auditory-only stimuli faster and more accurately than nonmusicians. Furthermore, given the greater weight musicians place on visual information in emotion perception from music performance (Di Mauro et al., 2018; Vines et al., 2011), and their improved multisensory abilities in other cognitive domains (Lee & Noppeney, 2011, 2014; Petrini et al., 2009), we hypothesized that musicians would also be able to detect emotions faster and more accurately than nonmusicians in the visual-only and audiovisual conditions. Finally, because some mechanisms of emotion induction depend on recognition of affective cues (Juslin et al., 2014; Neumann & Strack, 2000), it was hypothesized that if musicians were more accurate at recognizing emotion, they would also feel the emotion they had perceived more frequently than nonmusicians. However, because the induction of emotion involves mechanisms that extend beyond the perception of emotion alone (Roseman & Smith, 2001; Scherer, 1999), this latter hypothesis remained exploratory.
The number of participants in each group was similar to or higher than that in previous studies investigating the effect of music expertise on cognitive and perceptual abilities (e.g., Petrini, Holt, & Pollick, 2010; Petrini, Pollick, et al., 2011; Bhatara, Tirovolas, Duan, Levy, & Levitin, 2011; Lee & Noppeney, 2014; Lu, Paraskevopoulos, Herholz, Kuchenbuch, & Pantev, 2014; Lima & Castro, 2011b). As estimated in a previous study (Di Mauro et al., 2018), the effect size reported in previous research examining differences in emotion perception and audiovisual perception between musicians and nonmusicians is medium to very large (Cohen’s d ≈ 0.50–2.0; e.g., Bhatara et al., 2011; Lima & Castro, 2011a; Castro & Lima, 2014; Lee & Noppeney, 2014; Lu et al., 2014; Petrini, Holt, & Pollick, 2010). Hence, we expected at least a medium effect size, which required a total sample size of 20. We calculated this sample size using G*Power 3.1 (Faul, Erdfelder, Lang, & Buchner, 2007), running an a priori power analysis for a repeated measures ANOVA (within-between interaction) and assuming a Cohen’s f effect size of 0.25 (medium effect size), a power of 0.80, 2 groups, 6 measurements (2 emotions × 3 modalities), and an alpha level of .05.
Forty participants were recruited through opportunity sampling. Twenty participants (6 males) were nonmusicians, with ages ranging from 18–29 years (M = 21.70, SD = 2.56). Twenty participants (7 males) were musicians, with ages ranging from 21–28 years (M = 22.25, SD = 2.10). Nonmusicians had no music training other than basic music classes within the school curriculum, whereas musicians had at least 5 years of music training (ranging from 5–24 years; M = 10.15, SD = 5.86). Musicians were instrumentalists who played guitar (n = 10), piano (n = 10), violin (n = 2), trumpet (n = 2), saxophone (n = 2), clarinet (n = 1), and flute (n = 1). All participants were fluent English speakers, had normal hearing, and had normal (or corrected to normal) vision. The study received ethical approval from the Department of Psychology Research Ethics Committee at the University of Bath [17-262], and written informed consent was obtained from all participants.
Materials and Stimuli
The stimuli were selected from a set of audiovisual affective clips developed and validated by Piwek et al. (2016). The clips represented the biological motion of two people interacting in the form of point-light displays, which offer the benefit of removing contextual information such as clothing or body appearance (Johansson, 1973). The clips also included dialogue in the form of speech prosody of the two people (see details below) which, in addition to the point-light displays, conferred information about movement and some morphological characteristics of the speakers (e.g., body size as depicted by formant spacing; Pisanski et al., 2014).
This helps prevent emotional bias that may be associated with certain cues such as identity, and ensures that visual attention is primarily focussed on body movement and expressivity (Hill, Jinno, & Johnston, 2003). The dialogue in each clip was either a deliberation consisting of two affirmative sentences (Actor 1: “I want to meet with John”; Actor 2: “I will speak to him tomorrow”) or an inquiry consisting of a question and answer (Actor 1: “Where have you been?”; Actor 2: “I have just met with John”).
From the set of clips in Piwek et al. (2016) we selected eight angry and eight happy audiovisual displays that portrayed the emotion at medium intensity, and where the emotion had been identified previously with 75% accuracy (Piwek et al., 2016). These criteria were used to avoid any ceiling effects, while ensuring the correct emotion could still be identified above chance. These clips were then edited in Adobe Premiere Pro 2017 to produce auditory-only (where the video was replaced with a black background) and visual-only (where the audio track was muted) versions.
Multisensory facilitation is greater when the different senses have the same reliability (Ernst & Banks, 2002). As such, a high-cut filter attenuating sound above 280 Hz was applied to the auditory stimuli using Adobe Premiere Pro 2017, in order to maintain the vocal prosody and intonation while decreasing the auditory reliability to a level more similar to the visual stimuli (similarity in reliability between the auditory and visual information when using this filter was pre-assessed by running a small pilot study). The high-cut filter was chosen as a means of reducing the reliability of the auditory information while emulating real-life conditions and therefore maintaining ecological validity (Knoll, Uther, & Costall, 2009; Scherer, 2003). Finally, the average sound amplitude of each clip was normalized to −0.5 dBFS using Adobe Premiere Pro 2017 to ensure consistency in volume. The clips were exported as MPEG-4 (mp4) files with a resolution of 800 by 600 pixels, a frame rate of 30 fps, and Advanced Audio Coding (AAC) audio with a sample rate of 44.1 kHz.
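As a rough illustration of what a high-cut (low-pass) filter does to speech-range frequencies, the sketch below applies a first-order low-pass with a 280 Hz cutoff to two pure tones sampled at 44.1 kHz. This is not the Adobe Premiere Pro filter used for the stimuli, whose order and slope are not reported here. The function names (`one_pole_lowpass`, `gain_db`) and the test tones are illustrative assumptions.

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate_hz):
    """First-order (RC-style) low-pass filter; an illustrative assumption,
    not the filter implemented in the editing software."""
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)      # smoothing coefficient between 0 and 1
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)    # move the output a fraction alpha toward the input
        out.append(y)
    return out

def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def tone(freq_hz, sample_rate_hz=44100, duration_s=0.5):
    n = int(sample_rate_hz * duration_s)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]

def gain_db(freq_hz, cutoff_hz=280.0, sample_rate_hz=44100):
    """Level change (in dB) of a pure tone after filtering."""
    x = tone(freq_hz, sample_rate_hz)
    y = one_pole_lowpass(x, cutoff_hz, sample_rate_hz)
    return 20.0 * math.log10(rms(y) / rms(x))

low_gain = gain_db(100.0)     # below the cutoff: passes almost unchanged
high_gain = gain_db(2000.0)   # well above the cutoff: strongly attenuated
```

A steeper filter would attenuate the high-frequency tone more sharply, but the qualitative effect is the same: energy near the fundamental frequency of the voice is retained, while higher-frequency detail is reduced.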
Thus, the final set of stimuli consisted of 48 clips (each lasting between 2500 and 3500 ms), comprising 2 emotions (happiness and anger) expressed by 8 different pairs of actors and presented as 3 stimulus types (visual-only, auditory-only, and audiovisual). The stimuli were presented using a 15” MacBook Pro laptop with retina display, and Beyerdynamic DT 880 Pro headphones.
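The factorial structure of the stimulus set can be sketched as follows. The numeric actor-pair labels, the dictionary layout, and the use of a simple shuffle for the random presentation order are assumptions made for illustration. The experiment itself was run in MATLAB with Psychophysics Toolbox.

```python
import itertools
import random

EMOTIONS = ("happy", "angry")
MODALITIES = ("visual-only", "auditory-only", "audiovisual")
ACTOR_PAIRS = tuple(range(1, 9))  # hypothetical labels for the 8 actor pairs

def build_trial_list(seed=None):
    """Cross 2 emotions x 3 modalities x 8 actor pairs and shuffle the result."""
    trials = [
        {"emotion": e, "modality": m, "actor_pair": a}
        for e, m, a in itertools.product(EMOTIONS, MODALITIES, ACTOR_PAIRS)
    ]
    random.Random(seed).shuffle(trials)  # seed only makes the example repeatable
    return trials

trials = build_trial_list(seed=1)
```

Each of the six Emotion by Modality cells contains eight clips, one per actor pair, which gives the 48 trials presented to every participant.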
We focused our investigation on anger and happiness. These two emotions were selected because they are both basic emotions recognized in all cultures (Ekman, 1992), and are easy for actors to accurately convey in various scenarios (Ma, Paterson, & Pollick, 2006; Pollick, Paterson, Bruderlin, & Sanford, 2001). Furthermore, anger and happiness are the most frequently reported emotions when people are asked to reflect upon their experienced emotions, and are also most commonly experienced as pure emotions, as opposed to combinations of multiple other emotions (Scherer & Tannenbaum, 1986). Finally, both share similar acoustic and visual properties such as high voice intensity and large movements (Dittrich, Troscianko, Lea, & Morgan, 1996; Juslin & Laukka, 2003), thus making the discrimination between these emotions a challenging task for both musicians and nonmusicians.
Participants were tested individually. On arrival, they were given information about the study, and asked to confirm their history of music training (to ensure they met the criteria for musician or nonmusician). After giving informed consent, participants were seated approximately 55 cm from the screen of the laptop, and wore the headphones with an intensity at the sound source of 60 dB.
Participants were told they would be watching clips of two people interacting, and that some clips would be composed of video and audio, while others would contain only video or only audio. Participants were instructed to make two consecutive responses for each clip. First, they made a forced-choice identification of the emotion expressed in the clip by pressing the “1” key on the laptop keyboard to indicate “happy,” or the “3” key to indicate “angry.” Participants were asked to respond as quickly as possible when giving this response. Following this initial response, participants were instructed to indicate whether they felt the emotion they had perceived by pressing either the “1” or “3” key on the laptop keyboard to indicate “Yes” or “No,” respectively. Participants were first presented with three randomly selected practice trials, containing one visual-only, one auditory-only, and one audiovisual clip. Following the practice trials, participants had the opportunity to ask any questions, and were then presented with the full stimulus set of 48 clips, which were presented in a random order. Each trial began with a black screen and text saying “Test Clip”, which lasted for 1 s. This was immediately followed by the clip. Reaction times were measured from onset of the clip until the first key was pressed. The next trial began immediately after the participant made their second response. MATLAB Release 2017b (MathWorks, Inc.) software with Psychophysics Toolbox extensions (Brainard, 1997; Kleiner et al., 2007; Pelli, 1997) was used to present the stimuli and collect the responses.
The average value for each dependent measure was calculated from the eight actor interaction clips per condition for each participant. Data from a single trial for one participant were removed from subsequent analysis because MATLAB failed to display the clip, and so the 7 remaining trials for this condition were used to calculate the average value. The remaining 47 clips for this participant, and all 48 clips for every other participant, were presented successfully. Three separate mixed ANOVAs were used to analyze the three dependent measures. The assumptions of sphericity and homogeneity of variance were tested using Mauchly's test of sphericity and Levene's test for equality of variances, respectively. Both tests were nonsignificant (p > .05) for each ANOVA, indicating these two assumptions had been met. However, visual inspection of Q-Q plots indicated that the assumption of normally distributed residuals was violated for the ANOVA on proportion of emotions felt, and outliers were detected in the reaction time data. Data treatment in relation to these violations is discussed in the relevant sections below. Significant findings were followed up with pairwise comparisons, with Bonferroni correction applied to maintain control of family-wise Type I error.
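For readers unfamiliar with the Bonferroni correction, a minimal sketch is given below. It assumes the common convention of reporting adjusted p values (each raw p multiplied by the number of comparisons and capped at 1). The statistics software used for the analyses is not stated here, and the raw p values in the example are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Multiply each p value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.0004, 0.02, 0.60]        # hypothetical unadjusted p values
adjusted = bonferroni_adjust(raw)  # approximately [0.0012, 0.06, 1.0]
```

Under this convention, an adjusted value is compared directly against the nominal alpha level of .05, and the cap explains why an adjusted p value can be reported as exactly 1.00.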
A response was considered correct when it matched the intended emotion of the interaction in the stimuli. The average proportion of correct responses was submitted to a mixed ANOVA, with Musical Experience (musician or nonmusician) as a between-subjects factor, and with Emotion (happy and angry) and Modality (visual-only, auditory-only, and audiovisual) as within-subjects factors. There was a significant main effect for Emotion, F(1, 38) = 9.18, p = .004, η2 = .195, indicating that participants identified a higher proportion of emotions correctly when judging happy (.75) compared to angry (.63) clips. Additionally, the main effect of Emotion was independent of Modality or Musical Experience, as there was no significant interaction between Modality and Emotion, F(2, 76) = 0.73, p = .49, or between Musical Experience and Emotion, F(1, 38) < 0.001, p = .986.
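The scoring rule and per-condition averaging described above can be sketched as follows. The data layout (a list of dictionaries, with `None` marking a trial that could not be presented) and the function name `accuracy_by_condition` are assumptions for illustration only.

```python
from collections import defaultdict

def accuracy_by_condition(trials):
    """Proportion of correct responses per (emotion, modality) cell.

    trials: iterable of dicts with keys 'emotion', 'modality', 'response'.
    'response' is None for a trial that could not be presented.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t in trials:
        if t["response"] is None:
            continue  # skip the missing trial; average over the remaining ones
        key = (t["emotion"], t["modality"])
        total[key] += 1
        # A response is correct when it matches the intended emotion.
        correct[key] += int(t["response"] == t["emotion"])
    return {key: correct[key] / total[key] for key in total}

# Hypothetical participant: 6 correct, 1 incorrect, 1 missing trial in one cell.
example = (
    [{"emotion": "happy", "modality": "audiovisual", "response": "happy"}] * 6
    + [{"emotion": "happy", "modality": "audiovisual", "response": "angry"}]
    + [{"emotion": "happy", "modality": "audiovisual", "response": None}]
)
acc = accuracy_by_condition(example)
```

In this example the cell mean is computed over 7 rather than 8 trials, mirroring the treatment of the single trial that failed to display.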
There was no significant main effect for Musical Experience, F(1, 38) = 0.81, p = .37, but there was a significant main effect for Modality, F(2, 76) = 15.10, p < .001, η2 = .284, and a significant interaction between Modality and Musical Experience, F(2, 76) = 3.46, p = .04, η2 = .083 (Figure 1). Pairwise comparisons with Bonferroni correction indicated that nonmusicians identified a higher proportion of emotions correctly in the audiovisual clips (.76) than in the auditory-only (.62; p < .001; 95% CI [0.062, 0.217]) and visual-only (.65; p = .01; 95% CI [0.026, 0.188]) clips, but no differences were found between the auditory-only and visual-only clips (p = 1.00). In contrast, musicians identified a higher proportion of emotions correctly in the audiovisual clips (.77) than in the visual-only clips (.63; p < .001; 95% CI [0.059, 0.221]), but no difference was found between the audiovisual and auditory-only clips (.72; p = .29), or the auditory-only and visual-only clips (p = .07). Furthermore, musicians identified a higher proportion of emotions correctly compared to nonmusicians in the auditory-only clips (p = .02; 95% CI [−0.175, −0.016]), whereas there were no significant differences between musicians and nonmusicians for the visual-only (p = .55) and audiovisual (p = .83) clips. No significant 3-way interaction between Emotion, Modality, and Musical Experience was found, F(2, 76) = 0.02, p = .98.
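The effect sizes reported for the accuracy ANOVA can be recovered from the F ratios and their degrees of freedom, assuming the η2 values are partial eta squared (the text does not state this explicitly). The sketch below uses the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error).

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# (F, df_effect, df_error, reported effect size) for the accuracy ANOVA
reported = {
    "Emotion": (9.18, 1, 38, 0.195),
    "Modality": (15.10, 2, 76, 0.284),
    "Modality x Musical Experience": (3.46, 2, 76, 0.083),
}

recomputed = {
    name: round(partial_eta_squared(f, df1, df2), 3)
    for name, (f, df1, df2, _) in reported.items()
}
```

The recomputed values match the reported ones to three decimal places, which is consistent with the assumption that partial eta squared was used.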
An additional analysis was conducted on the association between years of music training and recognition accuracy for all six different conditions (2 emotions: angry and happy x 3 modalities: auditory-only, visual-only, and audiovisual). Years of music training did not significantly predict the level of accuracy in the musicians’ group for any of the conditions, F(1, 18) ≤ 2.04, p ≥ .17.
Average reaction times were submitted to a second mixed ANOVA with the same factors as described above. Errors and outliers (reaction times exceeding the mean of each participant by 2 SD) were not included in the analysis. There was no significant main effect for Modality, F(2, 76) = 1.66, p = .20, or Musical Experience, F(1, 38) = 0.01, p = .95. Concerning Emotion, reaction times were shorter for happy clips than for angry clips, F(1, 38) = 8.90, p = .005, η2 = .190, but there was a significant interaction between Emotion and Musical Experience, F(1, 38) = 5.03, p = .03, η2 = .117 (Figure 2).
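The exclusion rule for reaction times can be sketched as below. Several details are assumptions rather than reported facts: that "errors" refers to incorrect responses, that the mean and SD were computed over each participant's remaining trials across all conditions, and that only slow responses (above mean + 2 SD) were removed. The function name and example values are hypothetical.

```python
import statistics

def trim_reaction_times(trials, n_sd=2.0):
    """Return one participant's reaction times after dropping errors and slow outliers.

    trials: list of dicts with 'rt' (in seconds) and 'correct' (bool).
    """
    correct_rts = [t["rt"] for t in trials if t["correct"]]
    # Cutoff assumed to be the participant's own mean plus n_sd sample SDs.
    cutoff = statistics.mean(correct_rts) + n_sd * statistics.stdev(correct_rts)
    return [rt for rt in correct_rts if rt <= cutoff]

# Hypothetical participant: ten correct trials (one very slow) and one error.
participant_trials = (
    [{"rt": rt, "correct": True}
     for rt in (1.0, 1.1, 1.2, 1.3, 1.2, 1.1, 1.0, 1.2, 1.3, 4.0)]
    + [{"rt": 1.1, "correct": False}]
)
kept = trim_reaction_times(participant_trials)
```

Here the 4.0 s response exceeds the participant's cutoff and is dropped along with the error trial, leaving nine reaction times to be averaged.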
Pairwise comparisons with Bonferroni correction showed that for musicians, reaction times were significantly shorter when responding to the happy clips (1.207 s) compared to the angry clips (1.386 s; p = .001; 95% CI [−0.278, −0.081]). No significant differences were found between happy (1.296 s) and angry clips (1.321 s) for nonmusicians (p = .60), or between musicians and nonmusicians for either happy (p = .62) or angry (p = .73) clips.
No significant interaction between Modality and Musical Experience, F(2, 76) = 0.66, p = .52, Modality and Emotion, F(2, 76) = 2.63, p = .08, or Modality, Musical Experience, and Emotion, F(2, 76) = 2.05, p = .14, was found.
An additional analysis was conducted on the association between years of music training and reaction time for all six different conditions (2 emotions: angry and happy x 3 modalities: auditory-only, visual-only and audiovisual). Years of music training significantly predicted the reaction time in the musicians’ group for all conditions, F(1, 18) ≥ 5.99, p ≤ .03 (Figure 3).
Finally, an analysis was conducted to examine whether there was any association between accuracy of responses and reaction time for musicians and nonmusicians separately. We found only two significant associations for the nonmusician group, between the accuracy and reaction times for happy auditory-only (r = −.85, p < .001) and for happy audiovisual (r = −.51, p = .02) conditions. No significant associations of this type were found for musicians. The same analyses carried out on the average reaction time for only accurate responses returned the same results (see Appendix).
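The significance of the two correlations reported for nonmusicians can be checked informally by converting r to a t statistic, t = r√(n − 2) / √(1 − r²). The sketch assumes Pearson correlations computed across the 20 nonmusicians (18 degrees of freedom). The type of correlation is not specified in the text.

```python
import math

def t_from_r(r, n):
    """t statistic for testing a Pearson correlation against zero (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

T_CRIT_DF18 = 2.101  # two-tailed critical t at alpha = .05 with 18 df

t_auditory = t_from_r(-0.85, 20)     # happy auditory-only condition
t_audiovisual = t_from_r(-0.51, 20)  # happy audiovisual condition
```

Both statistics exceed the critical value in absolute terms, in line with the reported p values. The negative sign indicates that more accurate participants tended to respond faster in these conditions.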
A third mixed ANOVA, with the same factors as above, was used to analyze the average proportion of responses where participants reported feeling the emotion they had perceived. Visual inspection of Q-Q plots indicated that residuals for each factor level were positively skewed, demonstrating that while most participants rarely reported feeling the emotion they had perceived (average proportion of emotions felt = .20), a small number of participants reported feeling emotion more frequently. The analysis was therefore repeated following square root transformation of raw data, and subsequent Q-Q plot inspection of transformed data indicated approximately normal residual distribution. However, analysis using raw and transformed data revealed equivalent findings, and therefore only the results using raw data have been reported.
There was no significant main effect for Emotion, F(1, 38) = 2.02, p = .16, or Musical Experience, F(1, 38) = 0.05, p = .83, but there was a significant main effect for Modality, F(2, 76) = 8.04, p = .001, η2 = .175, and a significant interaction between Modality and Emotion, F(2, 76) = 4.35, p = .02, η2 = .103 (Figure 4). Pairwise comparisons with Bonferroni correction indicated that, for angry clips, the proportion of trials where emotion was felt was lower in the auditory-only clips (.13) than in the visual-only (.19; p = .03; 95% CI [−0.127, −0.004]) and audiovisual clips (.25; p < .001; 95% CI [−0.190, −0.060]), but there was no significant difference between the visual-only and audiovisual clips (p = .11). For happy clips, the proportion of trials where emotion was felt was lower in the visual-only clips (.18) than in the audiovisual clips (.26; p = .04; 95% CI [−0.141, −0.003]), but there were no differences between the visual-only and auditory-only clips (.22; p = .89), or the auditory-only and audiovisual clips (p = .71). Finally, within the auditory-only clips, the proportion of trials where emotion was felt was significantly higher when the expressed emotion was happy (.22) than when it was angry (.12; p = .002; 95% CI [0.036, 0.151]).
No significant interaction between Modality and Musical Experience, F(2, 76) = 0.27, p = .76, Emotion and Musical Experience, F(1, 38) = 0.17, p = .69, or Modality, Musical Experience, and Emotion, F(2, 76) = 0.60, p = .55, was found.
An additional analysis was conducted on the association between years of music training and felt emotion for all six different conditions (2 emotions: angry and happy x 3 modalities: auditory-only, visual-only and audiovisual). Years of music training did not significantly predict the level of felt emotion in the musicians’ group for any of the conditions, F(1, 18) ≤ 1.62, p ≥ .22.
The same analyses carried out on the average proportion of felt emotions only for accurate responses returned the same results (see Appendix).
Since a null result does not imply the absence of a difference between musicians and nonmusicians in the level of felt emotions, we also calculated Bayes Factors (BF10) to evaluate the strength of evidence in favor of the null hypothesis relative to the alternative hypothesis. We used a one-way ANOVA approach with the default JZS prior (Jeffreys, 1961; Rouder, Speckman, Sun, Morey, & Iverson, 2009; Zellner & Siow, 1980) to obtain a Bayes Factor for each of the six conditions (2 emotions × 3 modalities) when comparing musicians and nonmusicians. For all conditions, we obtained BF10 values ranging from .124 to .151, which are considered moderate evidence in favor of the null hypothesis (e.g., Schönbrodt & Wagenmakers, 2018).
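To make the interpretation of these values concrete, the sketch below converts BF10 to its reciprocal BF01 (evidence for the null relative to the alternative) and attaches a descriptive label. The thresholds of 3 and 10 are a commonly used convention and are an assumption here, as is the function name `evidence_label`.

```python
def bf01(bf10):
    """Evidence for the null relative to the alternative hypothesis."""
    return 1.0 / bf10

def evidence_label(bf10):
    """Descriptive label using conventional thresholds of 3 and 10 (an assumption)."""
    favored = "alternative" if bf10 >= 1 else "null"
    strength_ratio = bf10 if bf10 >= 1 else 1.0 / bf10
    if strength_ratio < 3:
        strength = "anecdotal"
    elif strength_ratio < 10:
        strength = "moderate"
    else:
        strength = "strong"
    return f"{strength} evidence for the {favored} hypothesis"

lowest = bf01(0.151)   # weakest support for the null among the six conditions
highest = bf01(0.124)  # strongest support for the null among the six conditions
```

Under these thresholds, the data are roughly 6.6 to 8.1 times more likely under the null hypothesis than under the alternative, which falls in the moderate range for every condition.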
The aim of this study was to assess whether music expertise affected the perception and feeling of emotion from others’ social interaction through sound alone (speech prosody) and also through other communicative modalities such as vision (biological motion) and sound and vision (speech prosody and biological motion) combined. First, we found that musicians were more accurate than nonmusicians at perceiving both happy and angry emotions within the auditory-only condition, but no differences in accuracy were found between musicians and nonmusicians for the visual-only or audiovisual conditions. Second, musicians, but not nonmusicians, took longer to respond to angry stimuli than to happy stimuli, but reaction times did not differ significantly between musicians and nonmusicians. Finally, music training did not affect the feeling of emotion, but we found that across all participants anger was felt significantly less often than happiness within the auditory-only condition.
Concerning emotion perception, our results are consistent with prior research by Lima and Castro (2011b), who found that musicians were more accurate than nonmusicians at perceiving emotions expressed in speech. Additionally, we extend their findings in two important ways. First, we found consistent results while using more complex speech prosody stimuli of two people interacting, thereby indicating that the effect of music expertise on emotion recognition through speech prosody extends to multiagent social interactions. Second, by comparing performance for auditory-only, visual-only, and audiovisual conditions, we were able to test whether perceptual differences between musicians and nonmusicians were specific to certain communication channels. Here, our results showed that music expertise had no effect on the accuracy of emotion perception when the clips were presented in the visual-only or audiovisual conditions. This indicates that the effect of music training on emotion perception of nonmusical stimuli is limited to the auditory domain, and likely occurs due to the enhanced processing of features such as pitch (Moreno et al., 2009), timbre (Chartrand & Belin, 2006), and intensity (Hausen et al., 2013) associated with extensive music training.
Moreover, we found that for nonmusicians, accuracy of emotion recognition improved when stimuli from both auditory and visual modalities were presented together in the audiovisual condition, compared to the auditory-only and visual-only conditions. This result replicates the findings of Piwek et al. (2015), where the same stimuli were used, in showing that nonmusicians can increase their emotion recognition accuracy through multisensory integration. Musicians, in contrast, did not show greater accuracy in the audiovisual compared to auditory-only conditions, indicating that accuracy of emotion recognition for musicians did not benefit significantly from inclusion of the visual information. Conceptually, the integration of multisensory emotional information relies on the reliability of the sensory modalities being used (Collignon et al., 2008; Piwek et al., 2015). Indeed, multisensory facilitation (e.g., the higher accuracy for audiovisual compared to auditory-only and visual-only displays) is greatest when sensory cues have similar levels of reliability (Ernst & Banks, 2002), whereas facilitation diminishes when the reliability of one cue dominates the other (Alais & Burr, 2004). As such, due to enhanced processing in the auditory domain, musicians received the most reliable information through the auditory modality (even when degraded through a high-cut filter), such that the addition of visual information in the audiovisual condition did not offer any additional multisensory benefit. This partly contrasts with the findings of Vines et al. (2011) and Di Mauro et al. (2018), who found that musicians placed more weight than nonmusicians on visual information when judging the expressed emotion of music clips. 
However, these contrasting results are arguably due to an effect of expertise and familiarity, such that musicians, who are visually familiar with music performances, are consequently more able to integrate visual information into their judgements when observing such performances.
The absence of a multisensory advantage shown by musicians in our study thus probably reflects the lack of a difference in expertise between musicians and nonmusicians when judging emotions through nonmusical gestures and movements. This is in agreement with previous findings by Lee and Noppeney (2011), who showed a multisensory advantage in musicians for a music clip but not for a speech clip. In fact, studies showing a multisensory advantage for musicians compared to nonmusicians (Lee & Noppeney, 2011, 2014; Petrini et al., 2009; Petrini, Holt, & Pollick, 2010; Petrini, Pollick, et al., 2011; Proverbio, Attardo, Cozzi, & Zani, 2015) have typically tested multisensory ability using clips representing the movements and sound of a musician. This highlights that additional factors, such as motor experience and visual familiarity with the stimulus, are likely to affect multisensory ability.
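The reliability argument made above (Ernst & Banks, 2002; Alais & Burr, 2004) can be illustrated with the standard maximum-likelihood cue-combination model, in which each cue is weighted by its inverse variance and the combined variance is σ²_AV = σ²_A σ²_V / (σ²_A + σ²_V). The sketch below is a hypothetical illustration: the noise values are invented for demonstration and were not estimated from the present data, and the function names are our own.

```python
import math

# Hypothetical illustration of reliability-weighted (maximum-likelihood) cue
# combination. The sigma values used below are invented, not study estimates.


def cue_weights(sigma_a: float, sigma_v: float) -> tuple[float, float]:
    """Return (auditory, visual) weights proportional to inverse variance."""
    rel_a = 1.0 / sigma_a ** 2
    rel_v = 1.0 / sigma_v ** 2
    total = rel_a + rel_v
    return rel_a / total, rel_v / total


def combined_sigma(sigma_a: float, sigma_v: float) -> float:
    """Standard deviation of the optimally combined audiovisual estimate."""
    var_a, var_v = sigma_a ** 2, sigma_v ** 2
    return math.sqrt(var_a * var_v / (var_a + var_v))


def relative_gain(sigma_a: float, sigma_v: float) -> float:
    """Proportional reduction in noise relative to the better single cue."""
    best_single = min(sigma_a, sigma_v)
    return 1.0 - combined_sigma(sigma_a, sigma_v) / best_single


if __name__ == "__main__":
    # Equally reliable cues: the largest possible multisensory benefit.
    print(f"equal cues:        gain = {relative_gain(1.0, 1.0):.3f}")
    # Audition far more reliable than vision: almost no added benefit.
    print(f"auditory dominant: gain = {relative_gain(0.5, 2.0):.3f}, "
          f"auditory weight = {cue_weights(0.5, 2.0)[0]:.3f}")
```

With equally reliable cues the combined estimate is about 29% less noisy than either cue alone, whereas with a strongly dominant auditory cue the gain falls to about 3%. This is the pattern the reliability account would predict for listeners whose auditory processing is already highly precise.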
Across all participants, we found that happiness was identified more accurately than anger. This finding stands in contrast to numerous studies showing that participants identify angry expressions more accurately than happy expressions when listening to voices (Banse & Scherer, 1996), watching single-actor body movement (Pollick et al., 2001), and watching two people interacting (Clarke et al., 2005). Additionally, studies have also found that detection of anger elicits activation in brain regions associated with autonomic defensive behaviour in relation to threat (such as the amygdala and hypothalamus) and may therefore signify an evolutionary benefit of anger detection (Pichon, de Gelder, & Grezes, 2008). Conversely, other studies have found more comparable results to ours. For example, Belin, Fillion-Bilodeau, and Gosselin (2008) showed that nonverbal affect bursts of happiness were better recognized than anger, and Dittrich et al. (1996) found that happy displays of point-light dancers were identified more accurately than angry displays. Furthermore, using the same stimuli as used in our current study, Piwek et al. (2015) also found that happiness was identified more accurately than anger.
One explanation for the more accurate recognition of happiness over anger found in this study concerns our use of multiagent interactions. Indeed, research comparing single and multiagent stimuli has shown that participants express a bias toward perceiving emotions as angry when listening to single-speaker dialogues (Cauldwell, 2000), and demonstrate impaired recognition of happiness when viewing single-actor point-light displays (Clarke et al., 2005). This bias may be explained through consideration of the social expression of emotion. Namely, happiness is most commonly expressed through enthusiastic and dynamic social interaction involving multiple people (Shaver, Schwartz, Kirson, & O'Connor, 1987), to the extent that our perception of happiness may be impaired when the social context is removed (Clarke et al., 2005). In contrast, anger is less often observed in public, and physical and verbal expressions of anger can frequently occur in the absence of a second person (Shaver et al., 1987). Thus, whereas the recognition of happiness in single-agent stimuli may be impaired, the recognition of anger would not be affected. Because we used multiagent stimuli, it is plausible that the perceptual bias against perceiving happiness was reduced in comparison to previous studies using single-agent displays, thus accounting for the more accurate recognition of happiness in our results. As such, our findings add further evidence toward the importance of considering multiagent social context and first- vs. third-person perspective in research on emotion recognition.
Concerning reaction times, musicians, but not nonmusicians, gave slower responses when recognizing emotion in angry clips than in happy clips. However, reaction times did not differ significantly between musicians and nonmusicians overall. A plausible explanation for this finding concerns a more pronounced speed-accuracy trade-off exhibited by musicians. Specifically, musicians are trained to carefully analyze sound, and consequently exhibit longer processing times when responding to auditory stimuli that are ambiguous or difficult to interpret (Chartrand & Belin, 2006; Münzer, Berti, & Pechmann, 2002). Musicians must also learn to carefully interpret visual cues of emotion and intention when performing or rehearsing (Gardiner, 2008); for example, to recognize when another performer will begin a solo, or to ensure they play the final note of a song in synchrony with other instrumentalists (Palmer, 1997). As such, musicians may exhibit comparable speed-accuracy trade-offs when processing both auditory and visual stimuli. In our current study, recognition of anger was less accurate than recognition of happiness, therefore indicating that angry clips were overall harder to interpret. Additionally, numerous studies have consistently indicated that happiness is the most common emotion expressed through music (Juslin & Laukka, 2004; Lindström et al., 2003), and it is therefore plausible that musicians are less familiar with expressions of anger compared to happiness. In this case, the existence of a more pronounced speed-accuracy trade-off would explain why musicians in this current study spent more time processing the angry clips across all sensory conditions. 
However, we found that reaction time increased with years of music training for all emotions and modalities, such that musicians with more training or experience took longer to recognize both happy and angry emotions in all sensory conditions (auditory-only, visual-only, and audiovisual). Yet this slowing did not result in higher levels of accuracy or higher levels of felt emotion in more experienced musicians, therefore indicating that the slower answers of musicians for the angry clips cannot be fully explained by a speed-accuracy trade-off.
Regarding the proportion of trials where emotions were felt as well as perceived, no differences were found between musicians and nonmusicians, indicating that neither music training nor accuracy of emotion recognition affected the felt emotion. We did, however, find that across all participants anger was felt significantly less often than happiness within the auditory-only condition. Here, it is possible that the expressions of anger in our study were perceived with lower intensity or clarity compared to the expressions of happiness and were consequently less likely to induce emotion. However, although this is consistent with our finding that angry expressions were recognized less accurately than happy expressions, it would not explain why the differences in felt emotion were only found for the auditory-only stimuli.
An alternative explanation concerns the importance of contextual factors in the induction of happiness and anger, such that auditory expressions of anger may induce different emotions to visual expressions of anger. Specifically, Deng and Hu (2018) and Seibt, Mühlberger, Likowski, and Weyers (2015) have shown that the induction of anger is dependent on appraisal of the situation. Indeed, visual detection of anger often signifies a direct threat in the proximate environment and may consequently induce anger as an adaptive response (Pichon et al., 2008). Conversely, perception of anger through sound alone can indicate the presence of a threat that cannot be seen, and may therefore be more likely to induce escape-related emotions such as fear (Danesh, 1977). This is consistent with studies showing that regions associated with our own feelings of anger, such as the amygdala (Phan, Wager, Taylor, & Liberzon, 2002), are more responsive to visual than auditory expressions of anger (Costafreda, Brammer, David, & Fu, 2008). In contrast, happiness is often felt automatically in response to both visual (Deng & Hu, 2018; Seibt et al., 2015) and auditory (Johnstone, van Reekum, Oakes, & Davidson, 2006) expressions of happiness, and is less dependent on appraisal of the situation (Deng & Hu, 2018; Seibt et al., 2015). In this context, our study provides further evidence for the importance of contextual factors in the induction of emotion, such that, while happiness may be felt in response to both auditory and visual stimuli, anger is predominantly felt in response to visual stimuli.
Finally, the lack of an effect of music expertise on felt emotion, in contrast to the effect of music expertise found on perceived emotion, supports the hypothesis that a speech-music interaction might occur at an early level of processing. This agrees with reports (Musacchia, Sams, Skoe, & Kraus, 2007; Strait, Kraus, Skoe, & Ashley, 2009) of musical expertise affecting subcortical (brainstem) level processing of vocal signs of emotion. The difference between perceived and felt emotion also supports the previously postulated separation between these processes in music (Gabrielsson, 2001) and extends this differentiation to voice prosody and musical expertise. Moreover, although studies comparing perceived and felt/evoked emotions and differences in brain responses between musicians and nonmusicians are rare, a recent study by Brattico et al. (2016) has shown a neural dissociation between perceiving sadness and happiness and evaluating pleasure-related processes from music, with musicians showing increased activation in areas related to proprioception and salience detection, such as the insula and the anterior cingulate cortex. Future studies could examine whether a neural dissociation between perceived and felt emotion is also present for vocal emotional signals and whether areas involved in processing sound are specifically modulated by musicianship in the brain network subtending emotion perception.
Our study has some limitations. First, many participant characteristics and predispositions besides music practice and training (e.g., socio-educational background, general intellectual level, or personality characteristics; Lima & Castro, 2011b) could predict or be associated with music training and thus differ between musicians and nonmusicians. Our participants were mostly university students and thus the sample was quite homogeneous, and we do replicate findings from studies that did control for some of the aforementioned characteristics (e.g., Lima & Castro, 2011b). However, we did not measure the age at which musicians started their training or the frequency of their practice, control for cognitive abilities, or ask about socio-educational characteristics, and thus we cannot rule out that these factors may be linked to music training and consequently to the effect of music training on emotion recognition (Herholz & Zatorre, 2012; Strait, O'Connell, Parbery-Clark, & Kraus, 2014; Swaminathan & Schellenberg, 2018). Future studies should include these measures and use them as covariates (if not variables of interest) to control for their influence.
The second limitation concerns the two emotions used in our study. Happiness and anger were selected as they are both easy for actors to convey in social interactions, and both are high-intensity emotions, thereby making discrimination between them more difficult (Dittrich et al., 1996; Juslin & Laukka, 2003). However, without multiple emotions to choose from, participants had to engage with a forced-choice scenario where biases in emotion misidentification may have become more problematic. Of particular relevance to the present study, previous research has indicated that anger is misidentified as happiness more frequently than happiness is misidentified as anger (Dittrich et al., 1996; Lima & Castro, 2011b), and it is therefore possible that participants in the current study were biased towards interpreting stimuli as happy. Additionally, we cannot separate the effects of the specific emotions used here from emotional valence effects, as we had one positive and one negative emotion. Thus, future research would benefit from including additional emotions (both positive and negative) to avoid these types of biases and disentangle the contribution of specific emotions from that of emotional valence.
Finally, we did not ask participants to rate the intensity of the emotion they perceived in each display. Although we found interesting results based on the measures we used, inclusion of intensity ratings would have enabled more thorough interpretation of our findings. Most notably, this would have helped determine the factors that influence felt emotions. That is, despite anger and happiness being considered high-intensity emotions, in the absence of intensity ratings, it remains unclear whether, for example, the lack of difference between musicians and nonmusicians in felt emotion was due to low emotional intensity portrayed by the chosen stimuli. Future studies should obtain a measure of emotional intensity to clarify this point.
In conclusion, the current study provides empirical evidence that, through improving auditory capabilities, music training enhances perception and recognition of emotion from others’ social interaction when delivered through sound, but not through vision. This indicates that the overlap between music and language processes is specific to sound, and does not generalize to other communicative channels. Furthermore, we show that music expertise enhances the ability to recognize emotions from others, but not the feeling of such emotions, indicating that the effect of music training is confined to perceptual processes of emotion. Although the literature on the effect of music training on emotion perception from stimuli other than music remains in its infancy, the current findings have promising implications not only theoretically but also in terms of application. Indeed, if emotion recognition in music and speech shares common auditory mechanisms that are affected by music training, music may present a useful rehabilitative device for assisting persons who have difficulty interpreting (rather than feeling) emotions from vocal expression, a symptom present in disorders such as autism (Korpilahti et al., 2007).
Average reaction times for correct answers (correct recognition of expressed emotions) were submitted to a mixed ANOVA, with Musical Experience (musician or nonmusician) as a between-subjects factor, and with Emotion (happy and angry) and Modality (visual-only, auditory-only, and audiovisual) as within-subjects factors. There was no significant main effect for Modality, F(2, 76) = 2.64, p = .08, or Musical Experience, F(1, 38) < 0.001, p = .98. There was no significant main effect for Emotion, F(1, 38) = 3.51, p = .07, but there was a significant interaction between Emotion and Musical Experience, F(1, 38) = 8.16, p = .007, η² = .177 (Figure A1).
Pairwise comparisons with Bonferroni correction showed that for musicians, reaction times were significantly shorter when responding to the happy clips (1.206 s) compared to the angry clips (1.554 s; p = .002; 95% CI [-0.560, -0.138]). No significant differences were found between happy (1.412 s) and angry clips (1.340 s) for nonmusicians (p = .49), or between musicians and nonmusicians for either happy (p = .29) or angry (p = .35) clips. No significant interaction between Modality and Musical Experience, F(2, 76) = 0.54, p = .58, Modality and Emotion, F(2, 76) = 2.42, p = .10, or Modality, Musical Experience, and Emotion, F(2, 76) = 2.51, p = .09, was found.
An additional analysis was conducted on the association between years of music training and reaction time for all six conditions (2 emotions: angry and happy x 3 modalities: auditory-only, visual-only, and audiovisual). Years of music training predicted reaction time in the musicians’ group across conditions, F(1, 18) ≥ 3.76, p ≤ .07; the effect was significant in five of the six conditions (p ≤ .04) and marginal in the remaining one (p = .07). Finally, an analysis was conducted to examine whether there was any association between accuracy of responses and reaction time for musicians and nonmusicians separately. We found only two significant associations, both in the nonmusician group, between accuracy and reaction times for happy auditory-only (r = -.74, p < .001) and for happy audiovisual (r = -.44, p = .05) clips. No significant associations of this type were found for musicians.
A mixed ANOVA, with the same factors as above, was used to analyze the average proportion of responses where participants felt the emotion they had correctly identified. There was no significant main effect for Emotion, F(1, 38) = 0.004, p = .95, or Musical Experience, F(1, 38) = 0.06, p = .81, but there was a significant main effect for Modality, F(2, 76) = 5.61, p = .005, η² = .129, and a significant interaction between Modality and Emotion, F(2, 76) = 6.56, p = .002, η² = .147 (Figure A2). Pairwise comparisons with Bonferroni correction indicated that, for angry clips, the proportion of trials where emotion was felt was lower in the auditory-only clips (.13) than in the visual-only (.24; p = .03; 95% CI [-0.212, -0.007]) and audiovisual clips (.30; p < .001; 95% CI [-0.253, -0.074]), but there was no significant difference between the visual-only and audiovisual clips (p = .24). For happy clips, the proportion of trials where emotion was felt was lower in the visual-only clips (.19) than in the audiovisual clips (.25) but did not reach significance (p = .21; 95% CI [-0.157, 0.022]). No differences were found between the visual-only and auditory-only clips (.24; p = .82), or the auditory-only and audiovisual clips (p = 1.000). Finally, within the auditory-only clips, the proportion of trials where emotion was felt was significantly higher when the expressed emotion was happy (.24) than when it was angry (.13; p = .03; 95% CI [0.011, 0.196]).
No significant interaction between Modality and Musical Experience, F(2, 76) = 0.77, p = .47, Emotion and Musical Experience, F(1, 38) = 0.44, p = .51, or Modality, Musical Experience, and Emotion, F(2, 76) = 0.95, p = .39, was found.
An additional analysis was conducted on the association between years of music training and felt emotion for all six different conditions (2 emotions: angry and happy x 3 modalities: auditory-only, visual-only and audiovisual). Years of music training did not significantly predict the level of felt emotion in the musicians’ group for any of the conditions, F(1, 18) ≤ 2.14, p ≥ .16.
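As a consistency check on the effect sizes reported in this appendix, partial eta squared can be recovered from an F statistic and its degrees of freedom as η²p = (F × df1) / (F × df1 + df2). The sketch below assumes that the reported η² values are partial eta squared, which the text does not state explicitly; the function name is hypothetical and the code is not the analysis code used in the study.

```python
# Consistency check: recover partial eta squared from F and its degrees of freedom.
# Assumption (not stated in the text): the reported eta-squared values are partial.


def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared = (F * df_effect) / (F * df_effect + df_error)."""
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and both df must be positive.")
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# (label, F, df_effect, df_error, value reported in the appendix)
REPORTED = [
    ("RT: Emotion x Musical Experience", 8.16, 1, 38, 0.177),
    ("Felt emotion: Modality", 5.61, 2, 76, 0.129),
    ("Felt emotion: Modality x Emotion", 6.56, 2, 76, 0.147),
]

if __name__ == "__main__":
    for label, f_value, df1, df2, reported in REPORTED:
        recovered = partial_eta_squared(f_value, df1, df2)
        print(f"{label}: recovered = {recovered:.3f}, reported = {reported:.3f}")
```

Under this assumption, all three recovered values match the reported effect sizes to three decimal places.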