This article seeks to unveil quantitative relations between the patterns of movement recurrence of a group of expert clarinetists and the expressive sonic manipulations they employ during their performances. The main hypothesis is that the recurrent ancillary gestures of musicians are closely related to their sounded expressive intentions, and that the expressive content they impose according to the music structure is reflected in their movement patterns. To conduct this multimodal investigation of expressiveness in music, movement and audio analyses of several clarinet performances of excerpts from the classical repertoire are presented and discussed in conjunction. The results show strong correlations between the recurrence patterns of the clarinetists' ancillary movements and expressive manipulations of timing, timbre, and loudness associated with melodic phrasing and with harmonic and dynamic transitions in the performed excerpts.
Music interpretation by expert classical players is the culmination of a long process of learning and practice, during which the musicians acquire great familiarity with the performed musical passages. This intricate process leads to the development of a personal interpretation concept, which can be expected to be quite consistent over different performances of the same piece, especially in the case of expert musicians performing a well-known excerpt of their repertoire. As shown by several studies of musical content information, musicians make use of small and systematic deviations of timing, timbre, loudness, pitch, and articulation in order to convey their so-called expressive intentions (De Poli, Canazza, Drioli, Roda, & Vidolin, 2004; Gabrielsson, 1995; Gabrielsson & Juslin, 1996; Juslin, 1997, 2000). Particular and consistent manipulations of the acoustic material would arise as a consequence of the individuality of the performed expressive intentions. This expressive content is what gives performances their human and individual character, and it is essential to the experience desired by both listeners and musicians.
From the corporal standpoint, the physical movements employed by musicians during music performances are in many cases essential to the sound production of the instrument, but some of these movements also seem to be closely related to the sonic expressive content of the music being played. The movements of the performer that are not essential to sound generation have usually been designated as ancillary gestures (Wanderley, 2002). Davidson (1993, 1994) and Dahl and Friberg (2007) have shown that expressive information is indeed present in these movements, enhancing our comprehension of music expressiveness from a multimodal perspective. Davidson (1993) presented to observers performances of the same piece, played by the same musician in three different manners, under three presentation modes: video only, audio only, and audio and video together. The results clearly revealed the contribution of visual information to perceiving the expressive content of the performances. Dahl and Friberg (2007) also verified that specific emotional intentions such as happy, sad, angry, and fearful were properly communicated through visual-only presentation of performance stimuli.
The process of musical practice described above, aiming at a high level of expertise, is a rather long and repetitive one, requiring a great deal of discipline and dedication. Professional classical musicians must be able to perform very consistently, and they work very hard in this direction. During this long process they play a referential excerpt of their repertoire, like the ones analyzed in the current study, hundreds of times in order to play it consistently and also to develop their personal interpretation concept. Our hypothesis is that this long, repetitive process of musical practice eventually leads to the emergence of patterns, both sonic and gestural. Since these sonic and gestural patterns are acquired concurrently, it is reasonable to expect that they are somehow related. That is, if the players' ancillary gestural patterns are related to their sounded expressive intentions and were also acquired and developed during this musical learning process, they are expected to be recurrent over consecutive performances of familiar excerpts and to be related to their sonic expressive content. For these reasons, we propose to look quantitatively for movement recurrence among consecutive performances, performing pattern recognition in order to provide correlational evidence for this hypothesis by linking the main gestural patterns to the sounded expressive content of the performances and to the score.
Studies by Wanderley, Vines, Middleton, McKay, and Hatch (2005), Caramiaux, Wanderley, and Bevilacqua (2012), and Desmet et al. (2012) have already investigated these gestural motifs, comparing motion segments to primitive shapes, aiming to define their musical significance in a performance. Demos, Chaffin, and Kant (2014) used recurrence quantification analysis (RQA) to identify recurrent motion segments within a performance and their links to local aspects of the music structure. The book edited by Godoy and Leman (2010) also presents a broad collection of studies exploring the relationship between sound and movement in music performances. Similar studies have also been conducted to evaluate the multimodal mechanisms involved in speech communication, such as hand gestures and facial expressions (Barbosa, Dechaine, Vatikiotis-Bateson, & Yehia, 2012), in dance performances, where periodic movements derive from rhythmic patterns (Camurri, Mazzarino, Ricchetti, Timmers, & Volpe, 2004; Naveda & Leman, 2009), and in the perception of music by the listeners (Vines, Krumhansl, Wanderley, & Levitin, 2006; Maes, 2016). This article aims at expanding this multimodal investigation of expressiveness in music by looking for close relationships between performers' consistent physical gestures and their expressive intentions conveyed through sound, bringing additional arguments to support previous evidence that musical expressive intentions might be embodied through reliable patterns of gestures. This could provide means to define a more effective set of expressiveness parameters and models for musical synthesis, analysis, and teaching systems based on both sonic and visual information. Due to the constant development of new interfaces for music composition, performance, and learning, there is a growing interest in such multimodal models and methods.
In previous works we presented a procedure to extract physical gestures of clarinet players during performances of excerpts from masterpieces of the classical repertoire, based on movement segmentation and analysis of recurrence patterns (Teixeira, Loureiro, Wanderley, & Yehia, 2015). The movement consistency of several clarinetists in relation to the music was evaluated during consecutive performances of excerpts from clarinet pieces by Mozart and Brahms. In the current paper their ancillary gestural patterns will be related to expressive acoustic patterns in the performances and the underlying music structure. An audio analysis is conducted over the performances to investigate their sonic expressive content through the examination of timing, timbre, and loudness manipulations by the players, aiming to relate them to these patterns of movement recurrence. Our main hypothesis is that the ancillary gestures employed recurrently by expert musicians are closely related to their sounded expressive intentions, and that the larger expressive acoustic deviations imposed by the players according to the music structure are thus reflected in their patterns of gestural recurrence.
In order to quantitatively evaluate our hypothesis, the results of movement and audio analyses of several clarinet performances of an excerpt from a clarinet sonata by Brahms are correlated in this study. These results are also compared to preliminary data based on performances of a shorter excerpt from a clarinet piece by Mozart, presented in previous works and expanded here in order to broaden the general discussion on our main findings. From the sonic standpoint, note duration manipulation is the most noticeable expressive resource employed by musicians, as demonstrated by some of the most relevant studies on musical performance (De Poli et al., 2004; Gabrielsson, 1995; Gabrielsson & Juslin, 1996). It is also the most directly analyzable, since it is less affected by room, microphone, or instrument characteristics. For these reasons our audio analysis methodology was developed to evaluate timing manipulation. Spectral centroid and RMS energy are then analyzed using a similar procedure in order to expand the results to expressive manipulations of timbre and loudness.
This study is focused on an experiment conducted with ten expert classical clarinet players, all professional performers with at least ten years of formal classical training who had been active members of professional symphony orchestras for a considerable time. The clarinet was chosen because of the need to analyze specifically the musicians' ancillary gestures. On the clarinet, body movements directly related to sound production are usually of small amplitude and are restricted to the player's fingers, jaw, lips, tongue, and chest. This leaves the clarinetists free to employ a variety of expressive movements with their legs, arms, torsos, and heads, which are almost independent of, and easily distinguishable from, the sound-generating movements.
MATERIALS AND PROCEDURE
The musicians performed an excerpt extracted from the first movement of the Sonata for Clarinet and Piano in F minor, Op. 120, No. 1 by Johannes Brahms (see Figure 1). The excerpt was part of the musicians' concert repertoire and was thus very familiar to them. Each clarinetist was simply instructed to play this excerpt three times freely, standing up and without any accompaniment. No further instructions or restrictions in performance style or motion were given to the clarinetists, and they were not aware of the purposes of the experiment. Figure 2 illustrates the experimental setup. The body movements of the musicians were tracked during the performances at a rate of 100 frames per second, using high-end 3D motion capture devices, the NDI Optotrak Certus and the NDI Optotrak 3020. Each device consists of a tracker built with three infrared cameras positioned along one axis, which tracks the spatial positions of active infrared LED markers inside a three-dimensional measurement volume and supports synchronous audio recording. Ten markers were used, placed on the player's knees, waist, shoulders, head, and clarinet. The audio signals were recorded at a sampling rate of 44.1 kHz, with 16 bits per sample, using an AKG C1000 condenser microphone positioned one meter away from the clarinet.
The Optotrak system includes an integrated analog data acquisition unit that allows the recording of any kind of analog voltage signal in perfect synchrony with the spatial signals from the movement markers. Unfortunately, this analog acquisition unit is designed using a general purpose analog to digital converter with a resolution of 12 bits per sample, which is unsuitable for high-quality audio applications. The high levels of noise generated would make the task of audio feature extraction required in this study virtually impossible. To solve this problem, while retaining perfect synchrony between the audio and movement signals, the microphone signal is pre-amplified and split using an audio recording console and then fed simultaneously to the Optotrak analog data acquisition unit and to a high-end digital audio recording interface. A clapperboard is used to generate a start-time reference sound at the beginning of each recorded performance for audio synchronization purposes. The low-quality audio recorded synchronously by the Optotrak system is used only as a timing reference, by detecting the onset of the clapperboard sound and defining it as the start-time of the performance in the Optotrak output data. The same clapperboard sound onset is then detected in the high-quality audio signal recorded asynchronously by the digital audio interface and is also used to define the start-time of the performance in the actual audio signal to be analyzed. This way, due to the highly impulsive nature of the clapperboard sound, an excellent level of synchronization is achieved between the audio and the movement signals and a high-quality audio signal is available to be properly analyzed, as required.
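The alignment step just described can be illustrated with a minimal sketch: detect the highly impulsive clapperboard onset in each recording with a simple amplitude threshold, then take the difference between the detected times as the offset between the two streams. This is a hypothetical illustration with synthetic signals, not the actual acquisition code; the variable names and the 0.5 threshold are assumptions.

```python
import numpy as np

def clap_onset(signal, sr, threshold=0.5):
    """Time (s) of the first sample whose absolute amplitude exceeds
    `threshold` times the signal's peak -- a crude onset detector that
    works well for a highly impulsive clapperboard sound."""
    idx = np.argmax(np.abs(signal) >= threshold * np.max(np.abs(signal)))
    return idx / sr

# Hypothetical example: the same clap captured by both devices, with the
# high-quality recording started 0.25 s earlier than the Optotrak one.
sr = 44100
t = np.arange(sr) / sr
clap = np.exp(-200 * np.maximum(t - 0.10, 0)) * (t >= 0.10)   # impulse at 0.10 s
optotrak_audio = clap
hq_audio = np.concatenate([np.zeros(int(0.25 * sr)), clap])    # clap lands at 0.35 s

offset = clap_onset(hq_audio, sr) - clap_onset(optotrak_audio, sr)  # ~0.25 s
```

Subtracting this offset places both streams on a common start-time reference, which is all the subsequent analysis requires.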
Figure 3 shows a diagram summarizing the data analysis procedure applied to each of the 30 clarinet performances acquired in this experiment. The movement analysis method is presented in greater detail in Teixeira et al. (2015).
Movement representation and segmentation
The movement analysis algorithm is based on the motion of the clarinet bell, which has been the object of other studies (Caramiaux et al., 2012; Wanderley, 2002; Wanderley et al., 2005) and is believed to be an important indicator of expressive movements made by the musician. As described in the introduction, the clarinetists are free to employ ample gestures with the clarinet bell without interfering with the sound-generating movements, and most of them seem to use it as a focal point while executing ancillary gestures. The movement of the clarinet bell is taken relative to a static reference (the center of the Optotrak tracker); it therefore incorporates any motion performed by the musicians with their feet, knees, torso, neck, and arms, and can thus be seen as a general indicator of their movements. Optical flow techniques have previously been used to define a general motion indicator (Barbosa et al., 2012), but our method also allows the precise analysis of a specific point in space (in this case the clarinet bell, well known for its expressive character), including its 3D trajectory, in order to define recurrent gestures and many associated gestural features (Teixeira et al., 2015).
The evolution of the three-dimensional movement of the clarinet bell is analyzed in conjunction with the acoustic data using a scalar representation of motion: its tangential velocity, defined as the linear speed along the trajectory, that is, the magnitude of the velocity vector along the tangent of the path. It is estimated here from the Euclidean distance between the positions of a marker placed on the clarinet bell in two subsequent frames, divided by the time between frames. This unidimensional parameter captures a large amount of information from the musician's movements and can also be used to perform their segmentation, defining movement segments between subsequent local minima in the tangential velocity curve, where the motion direction or character is most likely to undergo a sudden change (Teixeira et al., 2015). These local minima can be easily obtained, after the application of an appropriate smoothing filter to the curves, by detecting points of zero derivative, optionally below a given velocity threshold if higher selectivity is desired. Figure 4 illustrates this step of the movement analysis.
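The two computations above (tangential velocity from frame-to-frame marker positions, and segmentation at local velocity minima) can be sketched as follows. The 100 fps rate matches the setup described earlier; the synthetic trajectory and function names are hypothetical, and smoothing is omitted for brevity.

```python
import numpy as np

def tangential_velocity(positions, fps=100.0):
    """Tangential velocity: Euclidean distance between marker positions
    in consecutive frames, divided by the frame period."""
    return np.linalg.norm(np.diff(positions, axis=0), axis=1) * fps

def local_minima(v):
    """Indices of strict local minima of a velocity curve, used as
    movement-segment boundaries."""
    return np.where((v[1:-1] < v[:-2]) & (v[1:-1] < v[2:]))[0] + 1

# Hypothetical trajectory: motion along x with a 1 Hz speed oscillation,
# so one segment boundary is expected per second of motion.
fps = 100.0
t = np.arange(300) / fps
speed = 0.20 + 0.15 * np.cos(2 * np.pi * t)   # m/s, minima at t = 0.5, 1.5, 2.5
positions = np.zeros((len(t), 3))
positions[:, 0] = np.cumsum(speed) / fps      # integrate the speed along x

v = tangential_velocity(positions, fps)
boundaries = local_minima(v)                  # indices near the speed minima
```

In the real pipeline an appropriate smoothing filter would be applied to `v` before minima detection, and a velocity threshold could discard shallow minima if higher selectivity is desired.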
To search for patterns of recurrence in the movements of the musicians, the tangential velocity signals are then fed to an instantaneous correlation algorithm. The complete description and mathematical formulation of this algorithm can be found in Barbosa et al. (2012). It calculates the correlation coefficient between a pair of signals for each instant in time and also for different time offsets between them, generating a two-dimensional correlation map between the two signals. At each sample step, the algorithm calculates the correlation as a bi-directional function of both past and future samples, with the weighting of values decaying exponentially further away from the correlation point in both directions. There is no explicit window size defined, but the effective size of the moving window can be adjusted through a single constant that controls the rate of decay used by the bi-directional exponential weighting function. In other words, the effective window size is related to the aperture of a bi-directional decaying exponential function centered around the current sample, where the maximum weighting value is employed.
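A minimal sketch of such a map follows: a weighted Pearson correlation evaluated at every instant and every time offset, with a bi-directional exponential window centered on the current sample. This is a simplified reading of the approach, assuming plain weighted correlation; the exact formulation in Barbosa et al. (2012) may differ in detail, and the signals and parameter values here are hypothetical.

```python
import numpy as np

def weighted_corr(x, y, w):
    """Pearson correlation between x and y under sample weights w."""
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

def instant_corr_map(x, y, alpha=0.1, max_lag=10):
    """Correlation between x and y for every instant t and every lag d,
    weighted by exp(-alpha * |k - t|). A larger alpha gives a faster
    decay and thus a smaller effective window."""
    n = len(x)
    k = np.arange(n)
    cmap = np.zeros((2 * max_lag + 1, n))
    for t in range(n):
        w = np.exp(-alpha * np.abs(k - t))            # bi-directional window
        for i, d in enumerate(range(-max_lag, max_lag + 1)):
            lo, hi = max(0, -d), min(n, n - d)        # keep x[j], y[j + d] in range
            cmap[i, t] = weighted_corr(x[lo:hi], y[lo + d:hi + d], w[lo:hi])
    return cmap

# Hypothetical check: y is x delayed by 5 samples, so the map should be
# close to 1 along the +5-lag row for every instant t.
k = np.arange(80)
x = np.sin(2 * np.pi * k / 20)
y = np.sin(2 * np.pi * (k - 5) / 20)
cmap = instant_corr_map(x, y, alpha=0.1, max_lag=8)
```

The row index `max_lag + d` of `cmap` holds the instantaneous correlation at offset `d`, which is how a delayed but otherwise identical gesture shows up as a dark band away from the zero-lag row.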
In this study, correlation maps are used to unveil overall recurrences in a group of motion signals, rather than to measure the correlation between two signals. To do that, a correlation map is calculated for each possible signal pair in the group, and all the resulting maps are added together. Since the focus is on strict reoccurrence, and thus on positive correlations, each map has its negative correlation values truncated to zero before the addition, so that the resulting recurrence map better reflects the overall level of direct correlation in proportion to the pairwise indexes between signals. Negative values correspond to inverse correlations, which do not characterize proper reoccurrence. It is worth noting that, since correlation is not a linear function, we are not strictly summing correlations; adding positive correlation maps is a heuristic used to provide a graphical indication, a map, of the likelihood of movement recurrence.
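The truncate-and-sum heuristic reduces to a few lines. This is a hypothetical minimal version: the tiny 1 x 4 "maps" stand in for the pairwise correlation maps of one player's three performances, and the optional `threshold` argument stands for the empirically adjusted readability threshold.

```python
import numpy as np

def recurrence_map(corr_maps, threshold=None):
    """Sum pairwise correlation maps with negative values truncated to
    zero, normalize the result to one, and optionally zero out values
    below a readability threshold."""
    total = np.sum([np.clip(m, 0.0, None) for m in corr_maps], axis=0)
    total = total / total.max()
    if threshold is not None:
        total = np.where(total >= threshold, total, 0.0)
    return total

# Hypothetical pairwise maps for the three performance pairs of one player.
maps = [np.array([[0.9, -0.4, 0.2, 0.8]]),
        np.array([[0.8, -0.2, 0.1, 0.9]]),
        np.array([[0.7, 0.1, -0.3, 0.7]])]
rmap = recurrence_map(maps, threshold=0.75)   # only columns 0 and 3 survive
```

Only the regions where all three pairs agree on a strong positive correlation survive the thresholding, which is exactly the behavior wanted from a recurrence descriptor.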
To ensure a proper and musically accurate temporal alignment between the signals, in accordance with the structure of the excerpt, the velocity signals are first time-warped (Senin, 2008), using the note onsets of the melody as reference points in the timing model to generate a relative musical time in bars (instead of the absolute time in seconds). For each musician, correlation maps were then calculated for the three possible pairs of clarinet bell tangential velocity signals. These correlation maps had their negative values truncated to zero and were then summed and normalized to one, generating a resulting map that provides a recurrence descriptor for that musician's clarinet bell movement over his/her three performances. In order to highlight regions of interest, that is, regions of high recurrence, an empirically determined threshold is then applied to this map, removing values below 0.75. This threshold level is adjusted manually to improve the readability of the map and to highlight the most prominent regions of recurrence, based also on the general understanding that signals with correlation indexes above 70% are considered strongly correlated. Figure 5 illustrates in detail the computation of this recurrence map for one of the musicians. Its top plot displays the three time-warped clarinet bell tangential velocity curves for this musician during a short music passage. Its bottom plot shows the corresponding map, relative to his three performances, with dark areas indicating high movement recurrence. The regions where he employs recurrent movements are those where all the velocity curves in the top plot are highly correlated and dark areas occur in the corresponding region of the map. In this figure time is shown in bars (relative musical time due to time warping), with the numbers indicating the beginning of each bar.
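As a simplified illustration of the onset-based warp to relative musical time, absolute time can be mapped to a position in bars by piecewise-linear interpolation between note-onset anchor points. This is a sketch under that simplifying assumption, not the full warping procedure used in the study; the anchor values are hypothetical.

```python
import numpy as np

def to_musical_time(t, onset_times, onset_bars):
    """Map absolute time (s) to relative musical time (bars) by linear
    interpolation between note-onset anchor points."""
    return np.interp(t, onset_times, onset_bars)

# Hypothetical anchors: onsets at the downbeats of bars 1-3 of one take.
onset_times = np.array([0.0, 1.2, 2.0])
onset_bars = np.array([1.0, 2.0, 3.0])
bar_pos = to_musical_time(0.6, onset_times, onset_bars)   # halfway through bar 1
```

Resampling every velocity curve onto a common grid of bar positions this way puts all performances on the same musical timeline, so the correlation maps compare like with like regardless of tempo differences.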
Recurrent physical gestures
Regions of interest, consisting of the movement segments (defined between subsequent minima in the velocity curve) that are contained inside the high-recurrence areas, are then defined in each performance based on this map. Inside each of these regions, recurrent sequences of gestures can be unveiled by grouping its constituent movement segments consecutively, forming larger primitive gestural paths according to a visual inspection of the basic geometry of the three-dimensional trajectory of the clarinet bell. This is done using the criteria of planarity and circularity described in greater detail in Teixeira et al. (2015) and based on theories by McNeill (2007). To sum up: changes in the direction or rotational axis of these paths coincide with velocity minima; the paths tend to be elliptical, with starting and end points close to each other and a velocity minimum in between, since the physical balance of the player limits the reach of a body movement in a given direction, leading to a subsequent return movement toward or around an equilibrium position in the opposite direction, as suggested by McNeill (2007) through the concept of the Growth Point; and the paths tend to stay in the same plane or on the same rotational axis, due to the principle of conservation of angular momentum. Based on these assumptions, the whole trajectory of the clarinet bell inside a given region of interest is plotted by the algorithm on a 3D graph, with the points of velocity minima clearly indicated along the path by markers. After that, the points of velocity minima that correspond to actual breakpoints between primitive gestures satisfying the conditions described above are selected as gesture onsets through visual inspection of the graph, while the remaining points of velocity minima are discarded.
Finally, with the appropriate gesture onsets defined for that region of interest, the algorithm plots the individual trajectories of each gesture sequentially, in an array of 2D graphs of equal scale for each performance, illustrating the final sequence of gestures.
An example of a sequence of recurrent gestures obtained with this procedure is given in Figure 6. Each row corresponds to a performance and each column corresponds to a recurrent gesture. These gestures can then be subjected to a detailed parametric and statistical analysis in order to extract many gestural features related to their geometry, prominence, duration, variability, and dimensionality, allowing comparisons between performers, excerpts, and experimental conditions. In Teixeira et al. (2015) we have already shown that most of these recurrent gestures indeed exhibit highly planar round-trip trajectories, with invariable musical location and duration but variable prominence, depending on the degree of musical freedom allowed in the experiments. This is a good indication of their relation to the music structure and to the sonic expressive content sought by the performer. Typical gestures in this case exhibited durations ranging from 0.5 to 4 seconds (mean around 2 seconds), total distances covered ranging from 5 to 75 centimeters (mean around 40 centimeters), and mean velocities ranging from 3 to 35 centimeters per second (mean around 20 centimeters per second). Note that even though velocity time warping is used to compute the recurrence map and locate the regions of interest in the score, the movement segments and the acoustic parameters are defined on the unwarped data, in order to preserve the individual expressive aspects of each performance throughout the analysis.
In the current study, aiming to relate the occurrence of these recurrent ancillary gestures to the expressive timing manipulations employed by the players, normalized bar durations were initially analyzed along the performances. As mentioned in the introduction, timing manipulation is the most notable and directly measurable expressive parameter employed by the musicians. The first step of this analysis is to calculate the interonset intervals (IOIs) based on the note onsets extracted from the audio signals using the Expan tool (Campolina, Loureiro, & Mota, 2009). After that, in each performance, the duration of every bar is obtained by summing the IOI values of its constituent notes. These bar durations are then normalized in relation to the mean value of bar duration in that performance in order to get a relative idea of bar duration manipulation by the player and to compare it between various performances, regardless of the tempo employed in each one.
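The per-performance normalization just described can be sketched as follows. The onset values and the function name are hypothetical; in the study itself the onsets come from the Expan tool rather than being hand-specified.

```python
import numpy as np

def normalized_bar_durations(onsets, notes_per_bar):
    """Sum note IOIs into bar durations and normalize by the mean bar
    duration of the performance, so a value near 1 means an 'average'
    bar regardless of the overall tempo. `onsets` must include the
    onset/offset closing the last note."""
    iois = np.diff(onsets)
    edges = np.cumsum([0] + list(notes_per_bar))
    bars = np.array([iois[a:b].sum() for a, b in zip(edges[:-1], edges[1:])])
    return bars / bars.mean()

# Hypothetical two-bar performance with two notes per bar: the second
# bar is stretched relative to the first.
onsets = np.array([0.0, 0.5, 1.0, 1.6, 2.4])
norm = normalized_bar_durations(onsets, [2, 2])   # bar 2 is ~17% above average
```

Because each performance is normalized by its own mean bar duration, a fast take and a slow take of the same excerpt become directly comparable.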
In order to obtain a general measure of the manipulation of bar duration for all players, the standard deviation of the normalized duration of each bar in the excerpt over the 30 performances is then computed. This is done based on the assumption that bars with weaker expressive content would in general lead to less duration manipulation by the group of players, resulting in normalized bar duration values closer to unity and thus in smaller standard deviations, while bars with stronger expressive content would lead to more substantial and varied duration manipulation in the group, resulting in scattered normalized duration values and thus in larger standard deviations. This accords with our initial hypothesis that musicians make use of deviations in such parameters to convey their expressive intentions (De Poli et al., 2004; Gabrielsson, 1995; Gabrielsson & Juslin, 1996). It is worth noting that expressiveness is a complex phenomenon that is hard to quantify, involving the notion of contrast between various musical passages, among many other factors. Therefore, a high standard deviation value does not guarantee that a bar has stronger expressive content. Nevertheless, it provides a good indication, based on the fact that expressiveness is related to individual aspects of a performance (as discussed in the introduction), and individualities should translate into differences, or variability, in a global data analysis of this sort, involving a large group of musicians.
A similar procedure is then applied to analyze the spectral centroid and RMS energy data extracted from the audio signals in order to evaluate the expressive timbre and loudness manipulations, respectively. First, a mean value of spectral centroid and a mean value of RMS energy is computed for every bar in each performance of the excerpt. After that, these values are normalized in relation to the mean value of that parameter in that particular performance. Finally, in order to obtain a general measure of manipulation for all players, the standard deviation of these normalized parameter values is computed for each bar, over the 30 performances.
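The normalize-then-measure-dispersion computation shared by all three parameters can be sketched in one function. The feature matrix below is hypothetical; the same function would apply to bar durations, mean spectral centroids, and mean RMS energies alike.

```python
import numpy as np

def per_bar_variability(values):
    """values: (performances, bars) matrix of a per-bar acoustic feature.
    Each row is normalized by its own mean (removing global tempo or
    level differences between performances); the per-bar standard
    deviation across performances is then returned in percentage points."""
    normalized = values / values.mean(axis=1, keepdims=True)
    return 100.0 * normalized.std(axis=0)

# Hypothetical toy data: three performances, four bars; only the last
# bar is manipulated very differently by each player.
values = np.array([[1.0, 1.0, 1.0, 1.2],
                   [1.0, 1.0, 1.0, 0.9],
                   [1.0, 1.0, 1.0, 1.5]])
variability = per_bar_variability(values)   # largest for the last bar
```

Bars where the players agree produce small standard deviations, while bars that invite strong, individual manipulation produce the peaks discussed in the results below.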
The movement analysis procedure described in the previous section was conducted for all clarinetists in the current study. In order to analyze the general pattern of movement recurrence for this group of players, a global recurrence map was computed for the excerpt by summing the individual thresholded maps of the ten musicians. The resulting map was again normalized to one and a new empirically determined threshold of 0.4 was applied in this global case. This threshold level was also adjusted manually to highlight the main regions of global recurrence. The lower value of 0.4 is due to the fact that these global recurrences are not as prominent as the individual ones, since the movement patterns are very consistent for each musician but can vary considerably between different players. It is worth noting that in the absence of well-established guidelines for this kind of analysis, the empirical setting of the threshold levels is subjective but somewhat inevitable. Nevertheless, it is done simply to improve the readability of the maps and to unveil the regions of higher recurrence in the performances. This global motion recurrence map is presented in the top plot of Figure 7. It reveals that the main regions of movement recurrence for these musicians occur mostly around four common regions of the excerpt, highlighted on the map and also on the musical score in Figure 1, where the most recurrent clarinet gestures are executed, indicating their strong relation to the music structure.
An audio analysis was conducted over these performances to investigate their sonic expressive content through the examination of timing, timbre, and loudness manipulations by the players, aiming to relate them to these patterns of movement recurrence. The three bottom plots of Figure 7 present the standard deviation values, given in percentage points, for the normalized durations, normalized mean spectral centroids, and normalized mean RMS energies in each bar of the excerpt over the 30 performances in the analysis. This figure shows that the four most prominent peaks in the curve representing the evolution of standard deviation of normalized bar durations occur at bars 8, 12, 16, and 24. A very similar pattern can be observed in the curve for the standard deviation of normalized spectral centroids, with local peaks occurring at bars 8, 12, 16, and 23. Those bars are directly related to the four main regions of movement recurrence in the excerpt, highlighted in the global motion recurrence map. A related but somewhat different pattern is observed in the curve representing the standard deviation of normalized RMS energies. In this plot a peak still occurs at bar 8, but bars 12, 16, and 23 are marked by local minima in the curve instead, and peaks also occur at bars 13 and 19. These results will be discussed in detail in the next section, relative to melodic and harmonic developments of the excerpt and to specific technical and acoustical characteristics of the clarinet.
The general movement recurrence patterns observed in the top plot of Figure 7 suggest that the four highlighted regions of the excerpt, where physical gestures were highly recurrent, correspond to notable musical targets such as melodic phrase endings and harmonic and dynamic transitions, where interpreters usually employ most of their intentional acoustic deviations. This is a strong indication of musical significance in their ancillary gestures. The bottom plots of Figure 7 confirm that the bars in the excerpt that led the players to the most significant timing and timbre manipulations are all directly related to these main regions of movement recurrence in the music, according to the global motion recurrence map. This corroborates the hypothesis that the stronger sounded expressive intentions present in these regions of the excerpt are reflected in the physical gestures of the clarinetists.
Figure 8 presents the full score of the excerpt for a closer examination of the music structure. The first highlighted region is a clear preparation for the phrase ending that takes place in the second highlighted region, emphasized at its end (bar 8) by the wide upward legato leap in the clarinet melody, a minor tenth crossing into the altissimo register, a passage of high technical difficulty. It is also worth noting that this passage builds extra harmonic tension, due to the Neapolitan sixth subdominant chord in this bar, a harmonic aspect that some of the participants might be aware of given their level of expertise. The last three highlighted regions are strongly related to the corresponding melodic phrase endings that take place inside each of them. These phrases conclude at relevant harmonic transitions: into the dominant at bars 12 and 16, evoking a strong feeling of punctuation and harmonic tension, reinforced by the dotted rhythm; and into the tonic at bar 25 in a sudden forte, coming from a diminuendo started at bar 21, evoking a strong feeling of harmonic resolution. The long rests after each target note in bars 12, 16, and 25 also provide breathing points for the clarinetists, emphasizing these phrase endings. This musical analysis of the excerpt is in accordance with the general impression of the players involved in this study.
The analysis of loudness manipulation presented in Figure 7 also reflects the phrase structure just described for the excerpt, but according to a somewhat different pattern. In this case the phrase endings in the last three highlighted regions (at bars 12, 16, and 23) are marked by local minima in the curve of standard deviation of normalized RMS energies, while its peaks tend to occur along the music phrases. This might be explained by the dynamic markings just before these bars. The crescendo into forte before bar 12 induces the musicians to play at the loudest possible volume on the A4 (concert pitch) at the beginning of bar 12. This is the region where the clarinet presents the weakest acoustic power and consequently the most limited dynamic range, which could explain the homogeneity of intensity observed across the performances at this bar. The same loudness homogeneity seems to occur at bar 23, which is preceded by a three-bar diminuendo marking, leading to a soft intensity at the B♭4 that reduces the capability of dynamic variation by the players. Peaks of loudness manipulation occur after the phrase endings, where the dominant is asserted and reasserted (also marked with forte in the case of bar 13), in the cantabile register of the clarinet, where it presents a very high dynamic range. This fact will be further investigated, but it nevertheless points to a clear relation between the expressive acoustic patterns, the music structure, and the occurrence of recurrent clarinet gestures. The only exception to this pattern is in the first region (bar 8), where all curves in Figure 7 show a peak of standard deviation along the phrase, not only the loudness manipulation curve. There is no phrase ending at this point, but there is a notable harmonic progression that prepares the dominant four bars later and also a wide legato leap (a minor tenth) crossing into the highest register of the clarinet.
This might explain its atypical characteristics, leading also to timing and timbre manipulations midway through this phrase, even if they are less prominent than the corresponding loudness manipulations at this bar, which are expected considering the pattern just described.
The four regions in the excerpt where the most recurrent movements occurred for this group of clarinetists are highly related to key musical moments in the performances, with apparent strong sonic expressive content. The results show that the musicians employ significant expressive timing and timbre manipulations at the three phrase endings (the last three of these regions), accompanied by highly recurrent punctuation ancillary gestures. On the other hand, the most prominent loudness manipulations occur along the phrases (mainly right after the phrase endings), and in the first of these regions (bar 8) they are accompanied by recurrent physical gestures, due to notable harmonic and melodic transitions. This indicates that the ancillary physical gestures employed by musicians during performances are closely related to their sounded expressive intentions toward the musical realization. Some of these gestures could also be related to physical constraints such as breathing, so the analysis of additional musical excerpts without long rests could demonstrate more effectively the relation of these movements to sounded expressive intentions. Even so, the individual recurrence maps, presented in Figure 9, show that movements occur around these rests but not always precisely during them, indicating that breathing cannot be the sole cause of all the ample gestures observed. To sum up, the points of major movement recurrence appear to be dictated by the music structure, since at key musical targets almost all musicians employ highly recurrent gestures, while elsewhere their gestures are not so well defined. The audio analysis also corroborates this, indicating that these alleged musical targets indeed coincide with higher expressive deviations in acoustic features. Therefore, in line with our initial assumptions, a gestural recurrence measure could be used to highlight regions of higher expressive interest in the music.
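One simplified way to realize such a gestural recurrence measure is to slide a window along time-aligned movement trajectories (e.g. clarinet-bell elevation) from many performances and count, at each position, how many performance pairs show similar local motion. The windowed-correlation criterion, threshold, and names below are illustrative assumptions under this simplification, not the recurrence-mapping method used in the study.

```python
import numpy as np

def recurrence_profile(trajectories, win=20, thresh=0.7):
    """Count similar performance pairs at each window position.

    `trajectories`: list of equal-length 1-D movement signals, assumed
    aligned to a common score timeline. Two performances "recur" at a
    position when the Pearson correlation of their local windows
    exceeds `thresh`. Peaks in the returned profile mark regions of
    recurrent gestures across the group.
    """
    T = len(trajectories[0])
    n = len(trajectories)
    profile = np.zeros(T - win + 1)
    for t in range(T - win + 1):
        segs = [traj[t:t + win] for traj in trajectories]
        for i in range(n):
            for j in range(i + 1, n):
                r = np.corrcoef(segs[i], segs[j])[0, 1]
                if r > thresh:
                    profile[t] += 1
    return profile
```

Under this sketch, a score region where most players perform the same punctuation gesture yields a profile value close to the number of performance pairs, while idiosyncratic movement elsewhere stays near zero, which is the behavior the recurrence maps in Figures 8 and 9 exploit.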
In previous works we presented related results for a preliminary experiment based on a shorter excerpt extracted from the first movement of the Quintet for Clarinet and Strings in A Major, K. 581 by W. A. Mozart, notated in Figure 10 (Teixeira, Loureiro, Yehia, & Wanderley, 2013). This experiment followed the same basic experimental setup and data acquisition procedure described in detail in the Method Section. The only difference is that in this case eight clarinet players were recorded and each musician performed the excerpt six times, providing a total of 48 performances to be analyzed. The same data analysis procedure described in the Method Section was also applied to this data set, but due to limitations imposed by the audio recording process, only the expressive timing manipulations could be properly analyzed in this case. The results of this preliminary experiment were expanded and adapted to conform to the ones presented in the previous section, and will be briefly discussed here, aiming to broaden the validity of our main findings. They are presented in Figure 11.
The top plot of Figure 11 shows the global motion recurrence map for the eight players analyzed in the preliminary experiment. It indicates that the main region of movement recurrence for this excerpt from Mozart's quintet, highlighted on the map, occurs at its final part. Once again the results of the timing manipulation analysis show that the larger expressive manipulations are employed by the clarinetists where the recurrent gestures are observed, in this case at the final bars of the excerpt. This excerpt consists of a short theme containing a single musical phrase, somewhat limiting the conclusions that can be drawn from its results. Even so, these results corroborate our previous findings, since the performers employed recurrent punctuation gestures at the phrase ending, related to notable expressive timing manipulations at the same musical location. This phrase ending also occurs at a major harmonic progression, a perfect cadence, highlighting it further as a musical target for the players. Therefore we have additional evidence, based on a different excerpt, that the ancillary gestures of clarinetists carry a musical significance, related to the music structure and its sonic expressive content.
This article aimed to expand the scope of multimodal investigation of expressiveness in music, presenting a method to quantitatively analyze performers' physical gestures and their sounded expressive intentions, seeking to corroborate previous evidence that musical expressive intentions might be embodied through reliable patterns of gestures. The main hypothesis was that the ancillary gestures employed recurrently by expert musicians are closely related to their sounded expressive intentions, and that the larger expressive acoustic deviations imposed by them according to the music structure are reflected in their gestural patterns. This reflects the theory that the sonic expressive outcome striven for by the performers would emerge and become consolidated during a long process of musical practice, which would also give rise to these associated gestural patterns. The recurrent ancillary gestures could therefore help the musicians achieve their expressive goals and convey them to the audience.
The results of a motion analysis show that the studied musicians execute highly recurrent physical gestures at four specific regions of an excerpt of a Brahms clarinet sonata, where their movements seem to have a closer relation to expressiveness. An audio analysis was also used in this study to investigate the expressive content of these performances from the sonic standpoint. The results of this audio analysis show that the main timing and timbre manipulations are also employed by the clarinetists at these same regions in the Brahms excerpt. They also show that the loudness manipulations exhibit a somewhat different pattern, but one that still reflects the melodic phrase structure of this excerpt and the dynamic markings indicated by the composer. According to this analysis, the clarinetists employ expressive manipulations of timing and timbre mainly at the end of the music phrases, within notable harmonic progressions. On the other hand, expressive manipulations of loudness are employed by them mainly along the phrases, in accordance with the dynamic markings on the score and with the acoustic peculiarities of the clarinet itself. As crescendos and diminuendos develop toward the phrase endings, the players have diminishing room to impose personal expressive dynamic manipulations on top of the dynamic profile already indicated in the score. Also, the specific note being played on the clarinet has a strong impact on the dynamic range attainable by the musician. These findings were also corroborated by the results of a preliminary experiment based on a shorter excerpt from Mozart's quintet for clarinet and strings. In this case the performers employ expressive timing manipulations mainly at the end of Mozart's theme, precisely where the recurrent physical gestures are observed.
Again, this part of the excerpt contains a phrase ending coinciding with a major harmonic progression, a perfect cadence, which gives rise to these multimodal expressive patterns.
The findings discussed in this paper constitute solid evidence for the hypothesis of a musical significance in the ancillary gestures of musicians, closely related to their sounded expressive intentions and important for their desired musical outcome. A clear correlation has been shown between the recurrence pattern of clarinet bell movements and the main expressive acoustic manipulations employed by the players during the performances, associated with melodic phrasing and harmonic and dynamic transitions in the excerpts. This indicates that the music structure has a direct influence on the occurrence of these physical gestures, even if other constraints could prove to be a factor. It corroborates the theory that these gestural patterns are acquired amid a long process of musical practice, during which the sounded expressive intentions of the performers are developed in a related fashion. A gestural recurrence measure could therefore be used to highlight musical regions of higher expressive interest in a performance.
Studies by Wanderley et al. (2005), Desmet et al. (2012), and Caramiaux et al. (2012) have unveiled local relations between gestures and melody in a performance, but this is the first time that full-scale musical analyses of complex excerpts, including information about their sonic expressive content, are quantitatively coupled with a general and objective gestural pattern, consistent over a large group of players. It is also worth noting that this study focuses on movement recurrence between different performances, aiming to investigate how musical expressive intentions could be embodied during musical practice. It is also possible, and equally important, to analyze motion recurrence within each performance in order to better understand the nature of performers' physical gestures and their local relation to musical structure, as proposed by Demos et al. (2014), who used recurrence quantification analysis (RQA) to evaluate the movements of musicians, with related theoretical and empirical goals.
In the future this research will be extended to additional excerpts, instruments, musicians, and acoustic and gestural features. This can lead to broader conclusions on the ancillary movements of musicians and to a better parametric characterization of the expressive content of performances, unveiling additional relations among its many descriptors. Another interesting perspective is to analyze the consistency of expressive acoustic deviations within musicians, that is, to look for systematic sonic signatures particular to each performer in order to further support our theory that multimodal expressive patterns emerge and become consolidated during the process of musical learning and practice, based also on more detailed individualized data analyses. Ultimately we wish to incorporate the methods presented here into musical synthesis, recognition, analysis, and teaching systems. By understanding the relation between expressive gestural and acoustic features in a performance, we can define a more effective set of expressiveness parameters and models for musical synthesis systems and digital musical instruments, based on sonic and visual information. In pedagogical tools, through comparison with referential performances, teachers could evaluate expressive aspects of students' performances and their evolution over time. The methods and results presented here can also prove invaluable to new theoretical studies in the fields of musicology, signal processing, computational intelligence, embodied interaction, human cognition, and physiology, among others.