In order to be heard over the low-frequency energy of a loud orchestra, opera singers adjust their vocal tracts to increase high-frequency energy around 3,000 Hz (known as a “singer's formant”). In rap music, rhymes often coincide with the beat and thus may be masked by loud, low-frequency percussion events. How do emcees (i.e., rappers) avoid masking of on-beat rhymes? If emcees exploit formant structure, this may be reflected in the distribution of on- and off-beat vowels. To test this prediction, we used a sample of words from the MCFlow rap lyric corpus (Condit-Schultz, 2016). Frequency of occurrence of on- and off-beat words was compared. Each word contained one of eight vowel nuclei; population estimates of each vowel's first and second formant (F1 and F2) frequencies were obtained from an existing source. A bias was observed: vowels with higher F2, which are less likely to be masked by percussion, were favored for on-beat words. Words with lower F2 vowels, which may be masked, were more likely to deviate from the beat. Bias was most evident among rhyming words but persisted for nonrhyming words. These findings imply that emcees use formant structure to implicitly or explicitly target the intelligibility of salient lyric events.
A basic skill in music performance is the ability to make yourself heard. For example, Sundberg's (2001) seminal research on the “singer's formant” showed that professional singers adjust their vocal tracts during vowel production to boost high-frequency energy (~3,000 Hz) so that they can be heard over the predominantly low-frequency energy of a loud orchestra. Making yourself heard can be especially important for the intelligibility of lyrics, which are often misheard (Collister & Huron, 2008; Condit-Schultz & Huron, 2015; Johnson, Huron, & Collister, 2012; see kissthisguy.net for a user-contributed archive). Condit-Schultz and Huron (2015) reported below-average intelligibility of rap lyrics among 12 music genres. This may be due in part to performative aspects of rapping. For example, the average rate of delivery is 4.5 syllables per second (Condit-Schultz, 2016), which is faster than in other music genres (Ding et al., 2017) and may increase the likelihood of unintelligibility. The current study considered another potential barrier to the intelligibility of lyrics in hip-hop music—masking by percussion—and used corpus analysis to investigate whether emcees adapt their performances accordingly.
Lyrics are relatively important in hip-hop music. A recent online study showed that rap lyrics have a greater number of unique words than lyrics from 24 other popular music genres (http://lab.musixmatch.com/largest_vocabulary/). It is typically the lyrics in hip-hop music that are scrutinized by fans (see, e.g., the user-annotated website Genius [formerly Rap Genius]) and researchers (Alim, 2003; Bradley, 2009; Condit-Schultz, 2016; Hirjee & Brown, 2010; Krims, 2001). In fact, explicitly bragging about one's lyrical prowess—particularly the creative and extensive use of rhyme—is a common trope in rap lyrics. A corpus analysis from Hirjee and Brown (2010) showed that, relative to rock lyrics, rap lyrics feature more multisyllable rhymes, longer rhyme chains (reuse of a given rhyme sound), and greater rhyme density (proportion of rhyming syllables). These findings show that rhymes are salient lyric events in rap music for both performers and listeners. Thus, it is reasonable to expect that emcees should prioritize rhyme intelligibility.
About half of rhymed syllables in hip-hop music occur on the beat (Condit-Schultz, 2016). Percussion events also typically occur on the beat. In duple meter, which is nearly universal in hip-hop music, kick and snare drum hits usually occur on strong and weak beats, respectively. In hip-hop music, studio production with prominent on-beat percussion is called “boom bap” (onomatopoeia for the sounds of the kick drum and snare, respectively). Kick and snare drum hits are typically mixed louder in studio recordings than other percussion events (e.g., hi-hat events) and have most of their spectral energy in the low-frequency range (< 1,000 Hz; FitzGerald, Coyle, & Lawlor, 2002). Thus, in hip-hop music, on-beat percussion is prominent and overlaps spectrally with male and female speaking voices, raising the possibility of the masking of lyrics (see Figure 2).
Anecdotal evidence supports masking as an issue in rap studio recording. For example, rap producer Matthew Weiss (2011, online resource) writes “Hip-Hop is all about the relationship between the vocals and the drums … Finding a way to make both the vocals and the snare prominent without stepping on each other will make the rest of the mix fall nicely into place.” Producers use mixing effects such as compression and equalization to accomplish this balance.
Beyond production values, do emcees themselves do anything to avoid masking of on-beat lyrics? One solution may be to simply increase vocal intensity. For example, part doubling is a common practice in rap, e.g., through multitrack recording in the studio or by a “hype man” in live performance. Doubling often occurs for on-beat rhymes, as shown in Figure 1.
To our knowledge, no study has investigated whether emcees, like opera singers, deal with masking by exploiting formant structure. It seems unlikely that emcees would capitalize on the singer's formant, since vowels in singing are usually sustained and vowels in rap are usually transient. Moreover, Sundberg and Romedahl (2009) showed that the singer's formant had little effect on the intelligibility of sung text. However, a “speaker's formant” (a similar but shallower peak in high-frequency energy than the singer's formant; Cleveland, Sundberg, & Stone, 1999) has been observed in individuals whose professions demand high speech intelligibility, such as teachers and actors (Velsvik Bele, 2006). The speaker's formant has also been found in both the sung and spoken vowels of country music singers; a classically trained singer, on the other hand, showed a much more prominent peak for sung vowels but not for spoken ones (Cleveland et al., 1999). The authors suggest that this is because singing is closer to speaking for country singers. It is possible to test empirically whether emcees also develop a speaker's formant in order to be heard over the beat (e.g., by analyzing the formant structure of the same vowel sounds off- and on-beat; more high-frequency energy in the latter would be expected). However, this would be difficult to measure from commercial rap recordings, since isolated vocal tracks are generally not made publicly available (an exception is Jay Z's Black Album [see Figure 2], which was released in full and a cappella versions).
Here we consider another possibility: that emcees capitalize on differences in formant structure between vowels. The acoustic effect of vocal tract resonances is to boost certain frequency bands called formants. For English vowels, the height of formant 2 (F2) varies between ~1,000 and 2,300 Hz, depending on the horizontal position of the tongue. F2 is higher for front vowels like /i/ in ‘beet’ and /I/ in ‘bit’ and lower for back vowels like /u/ in ‘boot’ and /a/ in ‘bought’ (tongue near the back of the mouth; see Table 1 for a summary). The height of F1 varies between ~300 and 800 Hz depending on the vertical position of the tongue. It is well established that F1 and F2 (and the frequency difference between them) can predict speech intelligibility (e.g., Bradlow, Toretta, & Pisoni, 1996; Ferguson & Kewley-Port, 2002; Gahl, 2015; Wright, 2004), including in noisy environments (Kim & Davis, 2014; Summers et al., 1988).
|Vowel (example word)|F1 (Hz)|F2 (Hz)|Rhyming p(on) (n = 1707)|Rhyming p(16th dev.) (n = 1130)|Rhyming p(8th dev.) (n = 729)|Nonrhyming p(on) (n = 3447)|Nonrhyming p(16th dev.) (n = 3114)|Nonrhyming p(8th dev.) (n = 3609)|
Note: Columns 1-3 show the 8 monophthong vowel nuclei used in the study and their first (F1) and second (F2) formant frequency estimates for American males, as reported in Hillenbrand et al. (1995). The remaining columns show the proportion of rhyming (columns 4-6) and nonrhyming (columns 7-9) words as a function of metric position. Each of columns 4-9 may not sum to 1 due to rounding.
The MCFlow rap lyric corpus was used to test the following hypothesis: if emcees are sensitive to the issue of masking, then the metric placement of different vowels should vary systematically. Specifically, emcees should favor vowels with high F2 on beat, since they would not be masked by percussion (see the top row of Figure 2). On the other hand, vowels with low F2 should occur on beat less frequently, since they may be at least partially masked (see the bottom row of Figure 2). It was also expected that F1 should not predict metric placement, since F1 should be entirely masked by on-beat percussion for all English vowels (see the right column of Figure 2). Finally, we investigated whether such a strategy is targeted to salient lyric events by comparing the metrical placement of vowels in rhyming and nonrhyming words. If the strategy is targeted, then a bias should be evident for rhymes only.
Rap lyrics in the Musical Corpus of Flow (MCFlow) corpus (Condit-Schultz, 2016) were analyzed. The corpus contains transcriptions of 124 hip-hop songs that were on the Billboard Hot 100 between 1980 and 2014. Each of these transcriptions was done aurally by the author of the corpus. MCFlow contains only the through-composed rap verses.
Several features are transcribed in two formats: in Humdrum kern notation and as an R object. The transcriptions relevant to this study were:
- IPA: an International Phonetic Alphabet (IPA) transcription of each syllable in each verse of each song.
- Rhyme: identifies each rhyming syllable. Condit-Schultz (2016) acknowledges that the identification of rhymes is subjective and describes a number of complications that arise. In general, his approach to identifying rhymes is a liberal one insofar as the first rhyme in a rhyme chain, which the listener may not yet perceive to be a rhyme, is marked as a rhyme in the transcription. See Condit-Schultz (2016) for further details on the procedure used to identify rhymes.
- Beat: the relative onset time of each syllable, quantized to a 16th-note metric grid. All songs in the corpus are in duple meter and are notated in 4/4 time.
A subset of the 62,466 syllables in the MCFlow corpus was analyzed (n = 13,736).
First, only monosyllabic words (both rhyming and nonrhyming) were analyzed. As in other musical traditions, there is a preference in hip-hop music for the stressed syllables of multisyllable words to fall on the beat (Condit-Schultz, 2016). This raises the possibility of conflicting goals for multisyllable words between stress/meter congruence and the avoidance of masking. For example, consider the two-syllable word “moving.” Prioritizing stress/meter congruence predicts that the stressed syllable “mov-” is more likely to fall on the beat. In fact, some evidence suggests that lyrics are more intelligible when prosodic and metric stress are congruent (Johnson et al., 2014). On the other hand, prioritizing the avoidance of masking predicts that the syllable “-ing” (containing a vowel with high F2) is more likely to fall on the beat. The current hypothesis makes no predictions about whether stress or masking might take precedence in such situations. Therefore, multisyllable words were excluded.
Second, only words containing one of the eight monophthong vowel nuclei in Table 1 were included. Table 1 also reproduces previously published population estimates of F1 and F2 for American male speakers (Hillenbrand, Getty, Clark, & Wheeler, 1995, p. 3103), which were used in the analysis below. Although estimates were also available for female speakers, the sample of songs by female emcees (4 out of 124 songs) was too small to be analyzed separately. Thus, since male and female formant values were strongly correlated (F1: r = .98; F2: r = 1), male values were applied to data from both genders. One other monophthong from the corpus—ə (schwa)—was excluded: F1 and F2 estimates were not available for this vowel in Hillenbrand et al. (1995) or other consulted sources, perhaps because F1 and F2 variability for ə is particularly high and context-dependent (Flemming, 2007).
Third, words were included if the vowel nucleus was minimally preceded and followed by a consonant. There were two reasons for this. First, the population F1 and F2 estimates from Hillenbrand et al. (1995) were measured from consonant-vowel-consonant utterances. Second, this criterion filtered out a number of “noncontent” words (Collister & Huron, 2008), such as “the,” “I,” and “a,” which were expected to be overrepresented among nonrhyming words and underrepresented among rhyming words.
Finally, only words whose absolute deviation from the beat was zero (positions 1, 5, 9, and 13 of the 16th note metric grid), one 16th (positions 2, 4, 6, 8, 10, 12, 14, and 16), or two 16ths/one 8th (positions 3, 7, 11, and 15) were analyzed. Rare instances in the corpus where onsets did not align with a 16th-note grid (e.g., triplet eighth notes) were excluded.
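This inclusion criterion amounts to mapping each 16th-note grid position onto its absolute deviation from the nearest beat. A minimal sketch of that mapping (the function name is ours, not part of the corpus tooling):

```python
def beat_deviation(position: int) -> str:
    """Classify a 16th-note grid position (1-16) by its absolute
    deviation from the nearest beat (positions 1, 5, 9, and 13)."""
    if not 1 <= position <= 16:
        raise ValueError("position must be between 1 and 16")
    offset = (position - 1) % 4          # 16ths past the preceding beat
    deviation = min(offset, 4 - offset)  # nearest beat, before or after
    return {0: "on-beat", 1: "16th deviation", 2: "8th deviation"}[deviation]
```

For example, position 4 anticipates the following beat by one 16th and is classified as a 16th deviation, whereas position 3 sits an 8th note from the nearest beat on either side.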
Table 1 shows the proportion of words containing each of the 8 vowels that were on-beat, 16th note deviations, or 8th note deviations. Figure 3 shows preliminary support for our hypothesis that the higher the F2 of a vowel, the more frequently rhymes containing that vowel should occur in the on-beat position. The y-axis values show the ratio of on-beat to 8th note deviations for each vowel (columns 4 and 6, respectively, in Table 1), where a ratio greater than 1 indicates bias towards the beat. These values were positively correlated with the log-transformed F2 frequencies shown on the x-axis, r(6) = .79, p = .02.
Next, multilevel logistic regression was used to model the effects of vowel formant structure on metric position while holding variation between songs in the corpus constant. The glmer function from R's lme4 library was used for analysis (Bates et al., 2015). For the dichotomous outcome variable Metric Position, words that occurred in positions 1, 5, 9, and 13 of the 16th-note metric grid were classified as on-beat; words in all other positions in the grid were classified as off-beat. To assess the effects of formant structure, estimated F1 and F2 values from Table 1 were log-transformed and entered as fixed effects. To assess our prediction that formant structure effects should be limited to rhymes, the categorical variable Rhyme (rhyming vs. nonrhyming) was entered as a fixed effect interaction term. Song was entered as a random effect with F1 and F2 as random slopes (i.e., the effects of F1 and F2 on metric position were allowed to vary between songs).
Results were generally consistent with our predictions. There was a significant fixed effect of F2 on Metric Position, Z = 2.38, p = .02, SE = .14. The odds ratio (OR) was 1.39, 95% CI (1.06, 1.81), indicating that for every one-unit increase in a vowel's log-transformed F2 (i.e., 1 log Hz), the odds of the word containing that vowel being on beat increased by a factor of 1.39. (F2 values had a range of .85 log Hz.) On the other hand, the fixed effect of F1 on Metric Position was not significant, Z = 1.53, p = .13, SE = .10. This was expected, since all F1 values, regardless of vowel, should be masked by on-beat percussion. Contrary to our expectations, the interactive effect of Rhyme did not reach significance for F1 or F2, ps > .62. However, separate regression analyses on rhyming and nonrhyming words did show more bias for the former, OR = 1.90, 95% CI (1.11, 3.27), than the latter, OR = 1.37, 95% CI (1.03, 1.83). Taken together, these findings suggest that emcees implicitly or explicitly adjust the metric position of (especially rhyming) words depending on the risk of their being masked by on-beat percussion. However, the current correlational data cannot confirm this causal interpretation.
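Because glmer fits a logistic model, its fixed-effect estimates are on the log-odds scale, and the reported odds ratio and Wald 95% CI follow by exponentiation. A sketch (the coefficient below is back-calculated from the reported OR of 1.39, so the resulting interval only approximately reproduces the published one):

```python
import math

beta = math.log(1.39)  # F2 fixed-effect estimate (log-odds scale)
se = 0.14              # reported standard error

odds_ratio = math.exp(beta)            # 1.39
ci_lower = math.exp(beta - 1.96 * se)  # ~1.06
ci_upper = math.exp(beta + 1.96 * se)  # ~1.83 (reported: 1.81)
```

The small discrepancy in the upper bound reflects rounding of the back-calculated inputs.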
Because the effect of F2 on metric position was modest, we considered two potential sources of statistical noise in post hoc analyses. First, our hypothesis assumed that kick- and snare-drum events occur on the beat. But there need not be a percussion event on every beat, nor is percussion restricted to the beat. For the histogram in Figure 4, we transcribed from the original recordings the kick and snare drum hits that occur in the first verse measure of a random subset of half of the songs (n = 62) in the MCFlow corpus. The figure shows fewer percussion events on beat 3 than on the other beats, and off-beat percussion events are fairly common between beats 2 and 4. Therefore, an additional regression analysis was done to determine whether the effect of formant structure on metric position varies throughout a 4/4 bar. F1 and F2 values were again entered as fixed effects. The outcome variable and random effects were identical to those above. In addition, the categorical variable Beat Number (1 vs. 2 vs. 3 vs. 4) was entered as a fixed effect interaction term, where level 1 included positions 1 to 4 in the 16th note grid, level 2 included positions 5 to 8, level 3 included positions 9 to 12, and level 4 included positions 13 to 16. Fixed effects of formant structure were nearly identical to those reported above: F2: Z = 2.19, p = .03, SE = .15; F1: Z = 1.65, p = .1 (ns), SE = .13. However, neither the F1 nor the F2 interaction with Beat Number reached significance, all ps > .13. While these findings did not support variation across the measure as a source of statistical noise, this might still be demonstrated with a more fine-grained analysis that takes into account variation in percussion rhythms on a song-by-song basis.
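The Beat Number coding used in this analysis is a simple partition of the 16th-note grid into four levels; a sketch (the function name is ours):

```python
def beat_number(position: int) -> int:
    """Map a 16th-note grid position (1-16) to its Beat Number level:
    positions 1-4 -> 1, 5-8 -> 2, 9-12 -> 3, 13-16 -> 4."""
    if not 1 <= position <= 16:
        raise ValueError("position must be between 1 and 16")
    return (position - 1) // 4 + 1
```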
A second potential source of statistical noise concerns the amplitude envelope of on-beat percussion events (or perhaps even the envelopes of other instruments, sampled sounds, etc.). The median tempo of songs in the corpus is 96 BPM, or 156 ms between adjacent positions in the 16th note grid. Thus, it seems reasonable that the sustain/decay of on-beat musical events could mask off-beat words that immediately follow. If emcees are sensitive to the masking of words immediately following the beat, then there should be a bias towards higher F2 among these words relative to those that anticipate the beat. To test this, we compared the effect of formant structure on words that followed the beat by one 16th (i.e., positions 2, 6, 10, and 14 in the 16th note grid; n = 2716) with those that anticipated the beat by one 16th (i.e., positions 16, 4, 8, and 12; n = 1528). A regression analysis was done with Metric Position (anticipating vs. following) as the outcome variable, log-transformed F1 and F2 values as fixed effects, and the same random effects as above. The findings supported our prediction. There was a significant fixed effect of F2 on Metric Position, Z = 3.15, p = .002, SE = .16. The odds ratio (OR) was 1.64, 95% CI (1.21, 2.24); i.e., for every one-unit increase in a vowel's log-transformed F2, the odds of the word containing that vowel following the beat rather than anticipating it increased by a factor of 1.64. The fixed effect of F1 was not significant, Z = 1.06, p = .29.
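Both the timing arithmetic and the anticipating/following coding in this analysis can be made explicit. A sketch, assuming 4/4 time with beats on grid positions 1, 5, 9, and 13 (function names are ours):

```python
def sixteenth_ms(bpm: float) -> float:
    """Duration of one 16th note in milliseconds at a given tempo, in 4/4."""
    return 60_000 / bpm / 4  # ms per beat, divided into four 16ths

def off_beat_direction(position: int) -> str:
    """Classify a grid position (1-16) lying one 16th from a beat as
    anticipating or following the nearest beat."""
    offset = (position - 1) % 4
    if offset == 1:
        return "following"     # positions 2, 6, 10, 14
    if offset == 3:
        return "anticipating"  # positions 16, 4, 8, 12
    raise ValueError("position is not one 16th from a beat")
```

At the corpus's median tempo, sixteenth_ms(96) gives 156.25 ms, matching the figure cited above; position 16 anticipates beat 1 of the following bar.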
The current findings suggest that emcees are sensitive to the issue of on-beat masking and may take steps to make their lyrics intelligible over the beat. Specifically, these findings suggest that emcees exploit the formant structure of vowels. Vowels with higher F2, which are less likely to be masked by low-frequency percussion, are favored for on-beat words. Vowels with lower F2, which may be masked by percussion, are more likely to deviate from the beat. These effects were not found for F1, which was expected since F1 should be entirely masked by the beat. These effects were more pronounced for rhyming words but persisted for nonrhyming words. Thus, contrary to our expectations, the observed bias was not limited to the intelligibility of what are presumably the most salient lyric events (rhymes).
It is important to note that the effect size for the relationship between vowel and metric placement of words was modest. One source of statistical noise was identified: although words that anticipated and followed the beat by one 16th were both classified as off-beat in our regression model, the latter were biased towards high F2 vowels relative to the former. We propose that this pattern reflects emcees’ sensitivity to the sustain/decay of an on-beat musical event (e.g., snare hit), which may extend into the first 16th that follows and potentially mask lyrics. A more fine-grained, song-by-song acoustic analysis is needed to further support this interpretation.
There could be a number of other potential sources of statistical noise. For example, the MCFlow corpus quantizes microtiming deviations in syllable onsets to the nearest sixteenth note. Thus, some words may have been identified as on-beat in the corpus that actually anticipated or followed the beat. The current findings suggest that microtiming deviations should be more common among words with lower F2 nuclei to avoid masking by on-beat percussion, and that these words should tend to anticipate rather than follow the beat.
Another possible source of statistical noise that could be explored in future research is whether the bias towards high F2 vowels in on-beat rhymes varies historically. With advances in audio signal processing, such as tools for the equalization and compression of frequency bands in an audio signal, and as these tools became more readily available (e.g., as software plugins), masking issues could increasingly be resolved in production and thus have less of an impact on an emcee's musical choices. In this case, the bias should show a gradual decline over time. Another possibility is that the bias should decline with the stylistic change towards greater rhyme density (proportion of rhyming syllables) that occurred between ~1992 and 2001 (Condit-Schultz, 2016). Having more rhyming syllables might constrain the number of available metric positions to the extent that avoidance of masking could not easily be accommodated.
A causal relationship between meter and vowel cannot be established with the current data but could be verified experimentally. For example, in a rapping task, beat salience in an instrumental backing track could be manipulated over headphones (e.g., low intensity vs. high intensity). Emcees could perform the same verse under low- and high-intensity conditions. Based on the current findings, the formant structure of vowels should vary between intensity conditions. Similar effects have been reported in the speech production literature. For example, formant structure has been shown to vary with the intensity of background noise (a feature of the Lombard effect; Summers et al., 1988) and for spoken vs. shouted speech (Rostolland, 1982).
The current findings may contribute to the question of the functional significance of expressive timing deviations and, more specifically, the extensive use of syncopation in Western popular music. Syncopation is typically interpreted in terms of its perceptual effects on the listener. For example, Temperley (1999) proposed that certain extended uses of syncopation, such as the line “It's been a long cold lon-ely win-ter” from The Beatles' “Here Comes the Sun,” in which all syllables anticipate the beat, are perceived as surface structure transformations that support the listener's mental representation of deep structure (i.e., the actual meter). Others have shown effects of syncopation on listeners' positive affect and wanting to move (i.e., perception of groove; Witek, Clarke, Wallentin, Kringelbach, & Vuust, 2015). A testable alternative hypothesis is that syncopation may, at least in some cases, be driven not by intended perceptual effects but by a singer/emcee's goal of making themselves heard over a beat-driven accompaniment. For example, the hypothesis predicts that the extent of syncopation should be correlated with the intensity of on-beat accompaniment, and that more easily masked (e.g., low F2) vowels should be overrepresented among syncopated lyrics. Our findings are consistent with the latter prediction (see Figure 3) but do not distinguish between off-beat words that are part of a syncopated rhythm (as in the Beatles example above) and words that occur incidentally in an off-beat position (as is likely the case, e.g., with the word “tell” in measure 1 of Figure 1).
The current findings are also broadly consistent with the vocal constraints hypothesis (Russo & Cuddy, 1999; Tierney, Russo, & Patel, 2011), which proposes that musical choices are implicitly or explicitly guided by the limitations of the voice. The hypothesis has been shown to predict similarities and differences in melodic features between human and bird song (Tierney et al., 2011), and to generate novel predictions for statistical trends in vocal and instrumental music (Ammirante & Russo, 2015). While previous studies have focused on fundamental frequency in a melodic context, the current findings suggest that compositional choice may also be guided by the spectral properties of the voice.