Music psychology has a long history, but the question of whether brief music excerpts are representative of whole songs has been largely unaddressed. Here, we explore whether preference and familiarity ratings in response to excerpts are predictive of these ratings in response to whole songs. We asked 643 participants to judge 3,120 excerpts of varying durations, taken from different sections of 260 songs spanning a broad range of genres and time periods, in terms of preference and familiarity. We found that, within the range of durations commonly used in music research, responses to excerpts are strongly predictive of whole-song affect and cognition, with only minor effects of excerpt duration and location within the song. We conclude that preference and familiarity ratings in response to brief music excerpts are representative of responses to whole songs. Even the shortest excerpt duration commonly used in research yields preference and familiarity ratings that are close to those for whole songs, suggesting that listeners can rapidly and reliably assess recognition, preference, and familiarity of whole songs.
Music can evoke affective responses, and this capacity has been studied empirically. Some of the work on eliciting emotions with music employs whole songs (Grewe et al., 2007; Grewe et al., 2009), whereas other studies utilize brief excerpts (De Vries, 1991; Kreutz et al., 2008; Krumhansl, 1997). A related line of research concerns what determines music preference more generally. Again, some studies use whole songs (Schäfer & Sedlmeier, 2010), whereas others use excerpts (Rentfrow et al., 2011). There are good reasons for this diversity of approaches: the average song in popular music is about 3–4 minutes long (Allain, 2014). Because the attention and time of research participants are limited, using whole songs in an experiment curbs the number of different pieces that can be used in any given study; indeed, such studies typically use fewer than 10 distinct pieces. This is a concern because music is inherently complex and varies along many dimensions, so picking only a few pieces runs the risk of undersampling the underlying stimulus space of music (Lundin, 1953; Prince, 2011). In contrast, using brief excerpts as the stimulus material enables experimenters to employ many different music selections in the same study, which in turn allows for a more thorough coverage of this stimulus space. However, this approach runs the risk that the excerpts are not representative of the whole songs from which they were extracted. In other words, in any given study, music researchers face an inherent tradeoff between better coverage of the stimulus space of music and better coverage of individual songs.
Whether the results gained from these different approaches can inform each other depends on the answers to two questions:
How representative are the excerpts of the songs from which they were sampled?
How much do these results depend on the specific way in which the excerpts were sampled?
To our knowledge, the first question has not been studied empirically. Whereas some studies have investigated how quickly participants can make reliable aesthetic judgments about excerpts (Belfi et al., 2018), none have explicitly compared ratings of excerpts with ratings of the whole songs. In other words, the answer to this question remains open. Meanwhile, the field simply relies on the assumption that excerpts are in fact representative of songs. This might be the case, but there are concerns about both internal and external validity. For instance, it is conceivable that the long-term temporal structure of a song matters in a way that cannot, in principle, be captured by brief excerpts. In terms of external validity, most people listen to whole songs, not excerpts, when listening to music. Therefore, results from the study of “music” that relies solely on excerpts could be misleading if this assumption does not hold.
In terms of the second question, there is a wide range of stimulus durations commonly used in music research that utilizes excerpts (also referred to as “selections,” “snippets,” or “clips” in some of these publications). This range spans from around 5 seconds (Krumhansl, 2017; Krumhansl & Zupnick, 2013), to about 9 seconds (Peretz et al., 1998), to a “very short” 14 seconds (Mehr et al., 2018), to 15 seconds (Belfi et al., 2016; Rentfrow et al., 2011), to 30 seconds (McCown et al., 1997), and up to 2 minutes (Zentner et al., 2008), but it is unclear how this choice affects aesthetic judgments. Sometimes, even shorter (e.g., Belfi et al., 2018; Mace et al., 2012) or longer pieces are used (e.g., Vuoskoski et al., 2012; Warrenburg, 2020). In addition, the excerpt is sometimes taken from the chorus (Krumhansl, 2017; Krumhansl & Zupnick, 2013), sometimes from a “highly recognizable part of the song” (Belfi et al., 2016), and sometimes from “melodic lines” of a song (Peretz et al., 1998). At other times, the clips are simply whatever is provided by iTunes (Barrett et al., 2010). This might matter, as the sonic properties of a song can differ across its parts: the introduction might sound different from the chorus, which might in turn sound different from the verse. Often, it is not specified at all how the excerpts were sampled from the song (Kreutz et al., 2008; Rentfrow et al., 2011; Vuoskoski et al., 2012), only that excerpts with otherwise unknown properties were used, rendering it unclear what research participants actually listened to. This is a concern for the potential replicability of this research, which is an increasingly important matter (Camerer et al., 2018; Open Science Collaboration, 2015).
In this study, we aim to address these two related questions in order to quantify how representative excerpts are of their song of origin and whether affective and cognitive judgments are affected by duration or location of the excerpts.
Method
Participants
Our sample consisted of New York University undergraduate students as well as residents of the greater New York City area (age range = 17 to 87 years, mean = 21.3 years, median = 20 years). Participants (n = 643) completed the study in a single, in-person, two-hour session and either received course credit or were compensated with $20. Participants were recruited through the NYU SONA subject pool as well as through flyers, email lists, and advertisements in classes. We used the data from the 638 participants (> 99%) who finished the study for the analyses presented here.
Materials
Song Selection
We selected 260 songs in total, with the goal of achieving a selection that was representative of music styles commonly heard by U.S. listeners, including a wide range of genres, time periods, and popularity. Thus, we included 152 pop songs from the Billboard popular music charts (2 randomly chosen “number-one songs” for each year from 1940 through 2015), 52 diverse songs that were judged by a panel of experts to be obscure (Rentfrow et al., 2011), as well as 56 “iconic” songs from eight broad music genres (classical, country, electronic, jazz, popular, rap/hip-hop, rock, and R&B/soul). For the iconic songs, a lab-wide panel composed of six researchers with varying music backgrounds determined seven sub-genres for each of the eight broad genres (such as “Renaissance” and “Baroque” for classical music; and “Bebop” and “Fusion” for Jazz). For each of these 56 sub-genres, the panel voted on one song to best represent the given sub-genre.
Excerpt Selection and Creation
For each of the 260 songs, we created 12 excerpts (“clips”) with durations of precisely 5s, 10s, and 15s. The clips were “nested” such that clips of all durations had the same starting point; for example, if a clip started at 1:10 minutes into the song, the 15s clip played from 1:10 minutes to 1:25 minutes, the 10s clip from 1:10 minutes to 1:20 minutes, and the 5s clip from 1:10 minutes to 1:15 minutes. We created the clips with a custom-built MATLAB program. We linearly faded the last 10% of all clips to silence (the last 1.5s, 1.0s, and 0.5s for the 15s, 10s, and 5s clips, respectively).
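The clip creation itself was done with custom MATLAB software that is not reproduced in this article. The following is a minimal Python/NumPy sketch of the nesting-and-fading logic described above, operating on a mono waveform array; the function name, the sampling-rate handling, and the synthetic waveform are illustrative assumptions.

```python
import numpy as np

def make_nested_clips(waveform, sr, start_s, durations_s=(15, 10, 5), fade_frac=0.10):
    """Cut nested clips that share one starting point and linearly fade
    the last 10% of each clip to silence, as described in the text."""
    clips = {}
    start = int(round(start_s * sr))
    for dur in durations_s:
        n = int(round(dur * sr))
        clip = waveform[start:start + n].astype(float).copy()
        n_fade = int(round(fade_frac * n))                  # e.g., 1.5 s for a 15 s clip
        clip[-n_fade:] *= np.linspace(1.0, 0.0, n_fade)     # linear fade to silence
        clips[dur] = clip
    return clips

# Example: nested 15 s / 10 s / 5 s clips starting 70 s into a (synthetic) song
sr = 44100
song = np.random.randn(sr * 200)   # stand-in for a decoded song waveform
clips = make_nested_clips(song, sr, start_s=70)
```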
To create these clips, we indexed the structure of all songs according to when the intro (I), “outro” (O), chorus (C), and “verse” (V) sections of the song began, with the “verse” sections including verses, solos, bridges, and any other portions of the song that did not fit into the other categories. Doing so, we derived a standard song structure: beginning with I, then V and C, with the V-C combination repeated various (n) times, until O (the ending section of the song). This structure can be represented as I + (V*C)n + O. For “standard” songs that followed this format, another custom-built MATLAB program randomized which I/O, C, or V section to use for each clip and when in that section the clip should start. We used these target starting points to create the clips but manually ensured that the clips did not begin in the middle of a musical phrase.
For example, if the output of the program indicated starting the chorus clips at 0:23 minutes but this cut off a phrase, we might instead start the clips at 0:24.1 minutes to avoid cutting off the start of the musical phrase, such as when the lyrics begin or at the downbeat of the first measure. We balanced the actual starting points, ensuring that there were an equal number of actual starting points before and after the target starting points given by the MATLAB program, such that the deviations of the actual starting points from the target starting points for each song were normally distributed with a mean of zero. The actual clip starting points were all within a few seconds of the target starting points in either direction. If any of the chorus or verse sections overlapped, we created the first section using the given starting point and then re-drew the starting times for the other sections by running the sampling program again. If the intro or outro was shorter than 15 seconds, we used the first or last 15 seconds of the song, respectively. At the end of this process, none of the clips drawn from I/O, C, or V sections overlapped.
Some songs did not clearly follow this standard structure (the remaining “non-standard” songs). For instance, some songs did not have a chorus or any clearly delineated sections at all. For these songs, we used the temporal distribution of the clips from the songs that had clearly delineated sections to create a statistical distribution of starting points that matched that of the songs with a standard structure. In other words, we found sections of these songs that corresponded to the time points at which the chorus or verse typically started, had there been a chorus or verse.
For each song, 9 of the 12 clips were created in the systematic way detailed above (3 song sections crossed with 3 clip durations). An additional 3 clips per song were chosen subjectively in the following manner: the members of the selection panel independently listened to all of the songs and voted on which 15s section of each song they considered to be the most representative or characteristic of the entire song. The panel then discussed their chosen sections until a consensus was reached among all members. We then created the 15s, 10s, and 5s clips for these representative sections using the same nesting technique described earlier (see Figure 1).
To summarize, we created 12 clips for each of the 260 songs. For standard songs, we picked clips from a chorus, verse, intro, or outro section, as well as a subjectively chosen section, with a duration of 15s, 10s, and 5s for each. For songs with a non-standard structure, we picked clips from sections in the song that statistically corresponded to the typical starting points of chorus, verse, and intro/outro of standard songs.
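The target starting points for the systematically sampled clips were likewise generated by custom MATLAB software. The sketch below illustrates, in Python, one way such target starting points could be drawn from a song annotated with the I + (V*C)n + O structure described above; the annotation format, the function name, and the handling of sections shorter than 15s are illustrative assumptions and simplify the procedure reported in the text.

```python
import random

def draw_target_starts(sections, clip_dur=15, seed=None):
    """Randomly pick one I/O, one C, and one V section, and a target starting
    point within it such that the longest (15 s) clip still fits if possible."""
    rng = random.Random(seed)
    targets = {}
    for label in ("I/O", "C", "V"):
        candidates = [s for s in sections if s["label"] == label]
        sec = rng.choice(candidates)
        latest = max(sec["start"], sec["end"] - clip_dur)
        targets[label] = rng.uniform(sec["start"], latest)
    return targets

# Hypothetical annotation of an I + (V*C)n + O song (times in seconds)
sections = [
    {"label": "I/O", "start": 0,   "end": 12},
    {"label": "V",   "start": 12,  "end": 45},
    {"label": "C",   "start": 45,  "end": 70},
    {"label": "V",   "start": 70,  "end": 110},
    {"label": "C",   "start": 110, "end": 135},
    {"label": "I/O", "start": 135, "end": 160},
]
print(draw_target_starts(sections, seed=1))
```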
Auditory “Palate Cleanser”
As each participant listened to numerous songs and clips (a total of 192) throughout the experiment, we created simple behavioral tasks that participants completed between music presentations to avoid carryover effects from one music selection to the next. This “palate cleansing” was effective: we correlated the music ratings from all 192 trials with themselves, offset by one trial, and this lag-one correlation was statistically indistinguishable from zero for almost all (94%) of the participants. For the remaining participants, the correlation was significant, but these effects were likely driven either by mood (someone being in a temporary up- or down-state) or by someone “clicking through” for a stretch of the task.
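The carryover check described above amounts to a per-participant lag-one correlation of the rating sequence. A minimal sketch of such a check in Python (the original analyses were run in MATLAB; the simulated ratings and the use of a Pearson correlation here are assumptions):

```python
import numpy as np
from scipy.stats import pearsonr

def lag1_carryover(ratings, alpha=0.005):
    """Correlate the rating sequence with itself shifted by one trial
    and test whether that lag-one correlation differs from zero."""
    r, p = pearsonr(ratings[:-1], ratings[1:])
    return r, p, p < alpha

rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=192)   # simulated 7-point ratings for one participant
r, p, significant = lag1_carryover(ratings)
print(f"lag-1 r = {r:.3f}, p = {p:.3f}, carryover? {significant}")
```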
Procedure
Throughout the experiment, participants listened to music (full songs and clips), which they rated in terms of liking and familiarity. They also were asked to indicate whether they recognized each music selection. Each music presentation was followed by a “palate cleanser,” a task requiring a behavioral response that was designed to distract participants from what they had just heard. All music was presented via Audio-Technica ATH-M20x Professional Monitor Headphones using custom-built MATLAB (2016b) software.
The songs used in each session were randomly drawn from all 260 songs, with the following constraints: each participant listened to 6 songs from the Billboard corpus, 3 from the Rentfrow corpus, and 3 “iconic” songs, for a total of 12 songs per participant and experimental session. Participants were also presented with the 144 clips (12 per song) drawn from these 12 songs, as well as 36 clips randomly drawn from the other 248 songs in order to enrich the experience of participants, as piloting suggested that listening to only 12 songs over and over again would be too monotonous. The order of these 192 music selections was effectively randomized (we constrained the order such that a clip from a song could not immediately precede or follow the song from which it came) and alternated with 191 behavioral “palate cleansers,” for a total of 383 trials per participant.
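One simple way to implement the ordering constraint (a clip may not directly neighbor its parent song) is rejection sampling over random shuffles, sketched below in Python; the trial representation is hypothetical, and the actual randomization code used in the experiment may have differed.

```python
import random

def constrained_order(trials, seed=None):
    """Shuffle trials until no clip is immediately adjacent to the full
    song it was cut from. Each trial is a (kind, song_id) tuple."""
    rng = random.Random(seed)
    while True:
        order = trials[:]
        rng.shuffle(order)
        ok = all(not ({a[0], b[0]} == {"song", "clip"} and a[1] == b[1])
                 for a, b in zip(order, order[1:]))
        if ok:
            return order

# 12 songs plus their 144 clips (the 36 extra clips are omitted for brevity)
trials = [("song", i) for i in range(12)] + [("clip", i % 12) for i in range(144)]
order = constrained_order(trials, seed=42)
```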
To operationalize affective (preference) and cognitive (familiarity/recognition) judgments, respectively, participants were asked in each music trial to rate how much they liked the current song or clip on a 7-point Likert scale (“Hate it,” “Strongly dislike it,” “Slightly dislike it,” “Indifferent,” “Slightly like it,” “Strongly like it,” “Love it”) and to rate their familiarity on a 5-point Likert scale in response to the question, “How often have you heard this before?” (“Never,” “Once,” “More than once,” “Multiple times,” “Too many to count”). We used these scales because piloting indicated that these response options constitute “natural” reference points. Participants were also asked to indicate with a binary choice (“Yes” or “No”) whether they recognized the song. Importantly, participants only saw the qualitative labels; we assigned numbers from 1 to 7 (preference) and 1 to 5 (familiarity) to these labels in the analysis.
When instructing the participants, we emphasized that they should answer all questions pertaining to the specific clip or song they just heard, not other portions of the song, or whole genres. All procedures were approved by the New York University Institutional Review Board, the University Committee on Activities Involving Human Subjects (UCAIHS).
Data Analysis
We used MATLAB throughout to analyze the data. As we performed several statistical comparisons in the form of significance tests, we adopted a conservative significance level of alpha = .005 (Benjamin et al., 2018) to reduce the risk of false positive results.
Before performing these tests, we carried out a manipulation check to establish whether our participants used the scales described above properly. For instance, it was conceivable that there were floor effects, ceiling effects, or response biases, or that we had chosen songs that were psychometrically not representative of the full range of music.
To check this, we plotted a histogram of the number of responses as a function of the average preference rating for both clips and songs. Substantial deviations from normality could pose statistical issues and raise the question of whether our participants were engaged in the task or exhibited other biases.
However, our results showed that the overall preference ratings of our chosen music were well calibrated and that participants used the rating scale properly (see Figure 2). The mean song preference rating was 4.23, with a standard deviation of 0.95, while the mean clip preference rating was 4.13, with a standard deviation of 0.87. Neither distribution deviated significantly from a normal distribution, as assessed by a Shapiro-Wilk test: SW = 0.9991, p = .99 for clips, and SW = 0.9958, p = .082 for songs. Thus, our preference rating distributions were not statistically distinguishable from normal distributions.
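A minimal Python sketch of this manipulation check (the original analysis was run in MATLAB); the arrays of per-item mean ratings are simulated stand-ins for the actual data:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
song_means = rng.normal(4.23, 0.95, size=260)    # simulated per-song mean preference ratings
clip_means = rng.normal(4.13, 0.87, size=3120)   # simulated per-clip mean preference ratings

for label, x in [("songs", song_means), ("clips", clip_means)]:
    w, p = shapiro(x)   # test for deviation from normality
    print(f"{label}: mean = {x.mean():.2f}, SD = {x.std(ddof=1):.2f}, "
          f"Shapiro-Wilk W = {w:.4f}, p = {p:.3f}")
```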
Results
The main question we attempted to answer in this study was whether preference and familiarity ratings in response to brief music excerpts are representative of the whole songs from which they were sampled.
To address this question, we correlated the average clip preference rating of participants with their song preference rating. The median Spearman rank correlation was .834 (see Figure 3 for the distribution).
As only 12 points went into any given per-participant correlation, we were concerned about whether such a median correlation could plausibly be obtained by chance. Thus, we randomly shuffled the ratings data 500,000 times. The highest median correlation we obtained across these shuffles was .0724, implying an empirical p value of effectively zero (p < 1/500,000). We therefore concluded that this median correlation was both statistically significant and substantial. Put differently, clip preference ratings did seem to predict song preference ratings well.
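A sketch of this analysis in Python: compute the median over participants of the Spearman correlation between average clip ratings and song ratings, then build a permutation null distribution by shuffling each participant's song ratings. The array shapes, variable names, and the reduced number of shuffles are illustrative assumptions (the reported analysis used 500,000 shuffles and was run in MATLAB).

```python
import numpy as np
from scipy.stats import spearmanr

def median_clip_song_correlation(clip_mean, song_rating):
    """Median over participants of the Spearman correlation between
    per-song average clip ratings and song ratings (12 songs each)."""
    rs = [spearmanr(c, s)[0] for c, s in zip(clip_mean, song_rating)]
    return np.median(rs)

def permutation_null(clip_mean, song_rating, n_perm=1000, seed=0):
    """Null distribution of the median correlation obtained by shuffling
    each participant's song ratings on every permutation."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = [rng.permutation(s) for s in song_rating]
        null[i] = median_clip_song_correlation(clip_mean, shuffled)
    return null

# Simulated data: 100 participants x 12 songs
rng = np.random.default_rng(2)
clip_mean = rng.normal(4, 1, size=(100, 12))
song_rating = clip_mean + rng.normal(0, 0.5, size=(100, 12))
obs = median_clip_song_correlation(clip_mean, song_rating)
null = permutation_null(clip_mean, song_rating, n_perm=200)
print(f"observed median r = {obs:.3f}, max null median r = {null.max():.3f}")
```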
However, there was an unaddressed confound in this analysis: we simply correlated the song preference ratings with the clip preference ratings, regardless of where in the presentation sequence the song appeared. In other words, what might appear as prediction (predicting the song preference rating from the clip preference rating) might in actuality be post-diction. This was not a trivial point, as the two cases were psychologically quite different from the perspective of the participant. In the first case, participants who had just heard a brief excerpt had not yet heard large parts of the song. In the second case, they simply had to realize that the clip was part of a song they had already heard. Thus, we further refined this analysis by splitting clips according to whether they occurred before or after the song (see Figure 4).
Our intuition regarding the different nature of these responses appeared to be correct. The median Spearman correlation for clips presented after the song was significantly higher than that for clips presented before the song, as assessed by a KS-test (Δ = 0.091, D = 0.279, p = 2.68e-22).
Taken together, the correlation between clip and song ratings was high, indicative of high intra-song reliability of appraisal. But did this depend on the length of the clip? Intuitively, it should: from first principles, the longer an organism integrates information, the more reliable its judgment will be (Anderson, 1962). However, it is possible that humans are highly efficient integrators of this information. In other words, the accuracy of judgments might already have saturated at the lower end of the range of excerpt durations commonly used in music psychology research (around 5 s). We explore this question in Figure 5.
Median Spearman correlations between the clip and song preference ratings were .74, .76, and .78 for clips of 5s, 10s, and 15s duration, respectively, if the clip was presented before the song. Median Spearman correlations between the clip and song preference ratings were .84, .86, and .86 for clips of 5s, 10s, and 15s duration, respectively, if the clip was presented after the song.
We performed KS-tests to assess whether these differences were statistically significant (see Table 1).
Comparison of clip durations relative to song | Delta | D | p | Significant?
---|---|---|---|---
5s Before vs. 10s Before | 0.011 | 0.062 | .171 | ns
5s Before vs. 15s Before | 0.034 | 0.088 | .0143 | ns
10s Before vs. 15s Before | 0.023 | 0.061 | .176 | ns
5s After vs. 10s After | 0.014 | 0.050 | .409 | ns
5s After vs. 15s After | 0.014 | 0.050 | .389 | ns
10s After vs. 15s After | 5.5e-6 | 0.039 | .720 | ns
5s Before vs. 5s After | 0.099 | 0.22 | 5.60e-14 | ***
10s Before vs. 10s After | 0.101 | 0.26 | 1.82e-19 | ***
15s Before vs. 15s After | 0.079 | 0.23 | 1.03e-14 | ***
To summarize this pattern of results, none of the differences between clip durations, whether before or after the song, was significant. In contrast, all differences between clips presented before and after the song (for a given clip duration) were statistically significant. We concluded that clip duration was not a significant factor when predicting song preference ratings from clip preference ratings. However, just as in the analysis that did not disaggregate by duration, ratings of clips presented after the song were more strongly correlated with the song preference ratings.
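The comparisons reported in Table 1 (and in Table 2 below) are two-sample Kolmogorov-Smirnov tests on distributions of per-participant clip-song correlations, together with the difference in medians (Delta). A minimal Python sketch of one such comparison, with simulated correlation values standing in for the actual per-participant data:

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_correlation_distributions(r_a, r_b, alpha=0.005):
    """Compare two distributions of per-participant clip-song correlations:
    report the difference in medians (Delta), the KS statistic D, and p."""
    delta = abs(np.median(r_a) - np.median(r_b))
    d, p = ks_2samp(r_a, r_b)
    return delta, d, p, p < alpha

rng = np.random.default_rng(3)
r_before_5s = np.clip(rng.normal(0.74, 0.2, 600), -1, 1)   # simulated correlations
r_after_5s = np.clip(rng.normal(0.84, 0.2, 600), -1, 1)
print(compare_correlation_distributions(r_before_5s, r_after_5s))
```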
Whereas there appears to be no significant difference between clip durations commonly used in music research, one might wonder whether it matters from which part of the song the clip is sampled. For instance, it is conceivable that the correlation between clip and song ratings differs considerably for clips taken from the chorus versus clips taken from other sections of the song. We explored this question in Figure 6.
Median Spearman correlations between the clip and song preference ratings were .72, .76, .75, and .76 for clips from the Intro/Outro, Chorus, Verse, and a subjectively maximally representative portion of the song, respectively, if the clip was presented before the song. The corresponding correlations were .83, .85, .84, and .85 if the clip was presented after the song. We performed KS-tests to assess whether these differences were statistically significant and present the results in Table 2.
Comparison of clip sections relative to song | Delta | D | p | Significant?
---|---|---|---|---
I/O before vs. Chorus before | 0.0363 | 0.0754 | .0510 | ns
I/O before vs. Verse before | 0.0302 | 0.0628 | .1565 | ns
I/O before vs. Representative before | 0.0432 | 0.1120 | 6.06e-04 | *
Chorus before vs. Verse before | 0.0060 | 0.0440 | .5605 | ns
Chorus before vs. Representative before | 0.0069 | 0.0508 | .3755 | ns
Verse before vs. Representative before | 0.0130 | 0.0838 | .0216 | ns
I/O after vs. Chorus after | 0.0225 | 0.0750 | .0530 | ns
I/O after vs. Verse after | 0.0160 | 0.0539 | .3041 | ns
I/O after vs. Representative after | 0.0283 | 0.0982 | .0039 | *
Chorus after vs. Verse after | 0.0065 | 0.0485 | .4346 | ns
Chorus after vs. Representative after | 0.0058 | 0.0469 | .4779 | ns
Verse after vs. Representative after | 0.0124 | 0.0627 | .1571 | ns
I/O before vs. I/O after | 0.1055 | 0.2229 | 2.28e-14 | ***
Chorus before vs. Chorus after | 0.0917 | 0.2137 | 3.19e-13 | ***
Verse before vs. Verse after | 0.0912 | 0.2253 | 1.12e-14 | ***
Representative before vs. Representative after | 0.0906 | 0.2023 | 6.66e-12 | ***
Again, a consistent pattern of results emerged. Broadly speaking, it did not matter from which section of the song the clips were sampled. The only exception was the comparison between clips from the Intro or Outro and the subjectively most representative clips, with the latter being slightly more strongly correlated with the song preference ratings. This is plausible, as the Intro or Outro can differ drastically from the rest of the song in terms of acoustic properties.
Again, there were strong and statistically significant differences between clips presented before versus after the song, with clips presented after the song being consistently more strongly correlated with the song preference ratings, regardless of the song section from which the clip came.
We interpreted the findings above as indicating that once participants encountered a particular song within the experiment, any subsequent clip could serve as a cue to evoke the memory of the feelings that the whole song elicited. In addition, it is possible that participants had encountered the whole song before the study. In that case, a clip could act as a retrieval cue to the memory of the whole song, whenever it was heard (Spivack et al., 2019). In other words, the effects laid out above might well be mediated by recognition. In fact, it is possible that preference ratings are driven not by the acoustic properties of the song per se, but by the entire context and associations that were encoded when first encountering the song prior to the experiment. If this were the case, we would predict that such an effect would boost the correlation between clip and song preference ratings, as these memories could be used to rate the clip consistently with the previously encountered song. Thus, this correlation should be higher for clips where participants recognize the song than for clips where they do not.
There are indications consistent with this model. For instance, we recorded the time it took participants to click the “next trial” button. Restricting this analysis to the 5s clips is fair, as participants could already report their preference ratings while the clip was still playing, but could not continue to the next trial until the clip had finished playing. For these clips, the median reaction time was 7.16s for unrecognized clips and 6.76s for recognized clips. This difference was statistically significant, as evaluated with a Mann-Whitney U test (ranksum = 360980, p = 9.38e-7). Thus, it took longer to rate an unrecognized song than a recognized one, perhaps because the memory serves as a shortcut. This is remarkable, as it takes some time to click the “I recognize the clip” checkbox, whereas it takes no time to simply leave it unclicked (the default); everything else being equal, we would therefore expect the unrecognized clip trials to be completed faster.
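A sketch of this reaction-time comparison using SciPy's Mann-Whitney U test (the original analysis used MATLAB's ranksum, which implements the equivalent rank-sum form of the test); the reaction-time arrays below are simulated placeholders:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(4)
rt_unrecognized = rng.lognormal(np.log(7.2), 0.3, size=2000)   # seconds, simulated
rt_recognized = rng.lognormal(np.log(6.8), 0.3, size=500)      # seconds, simulated

u, p = mannwhitneyu(rt_unrecognized, rt_recognized, alternative="two-sided")
print(f"median unrecognized = {np.median(rt_unrecognized):.2f} s, "
      f"median recognized = {np.median(rt_recognized):.2f} s, U = {u:.0f}, p = {p:.2g}")
```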
Moreover, the clip recognition data themselves carry information about song recognition (see Figure 7).
As one can see, song recognition can be predicted from the number of clips that are recognized and is well described by a logistic function (β0 = 4.12, β1 = 0.64). However, this analysis integrated information over all participants and clips; it was challenging to do on a per-participant level, as recognition rates for quite a few of our songs were close to zero. This is not surprising, given that we took a substantial portion of our song sample from the Rentfrow et al. (2011) corpus, which was deliberately constructed to consist of obscure music. Overall, only 20% of the clips were reported as recognized by our participants. Therefore, instead of calculating the average clip-to-song correlation, we calculated the mean absolute deviation in this analysis (see Figure 8).
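The logistic description of song recognition as a function of the number of recognized clips can be illustrated with a simple curve fit. The exact parameterization used in the original analysis is not specified, so the logistic form, the fitting routine, and the simulated data below are assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, b0, b1):
    """P(song recognized) as a logistic function of the number of recognized clips."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

# Simulated aggregate data: proportion of songs recognized as a function of the
# number of recognized clips (0-12); placeholders, not the actual values.
rng = np.random.default_rng(5)
n_clips_recognized = np.arange(13)
prop_song_recognized = logistic(n_clips_recognized, -3.5, 0.7) + rng.normal(0, 0.02, 13)

(b0, b1), _ = curve_fit(logistic, n_clips_recognized, prop_song_recognized, p0=(0.0, 0.5))
print(f"fitted beta0 = {b0:.2f}, beta1 = {b1:.2f}")
```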
We attempted to tease apart the effects of recognition and whether clips were presented before or after the song. The mean absolute deviation (MAD) was calculated as the mean absolute difference between song and clip ratings. As one can see, the MAD was lowest for recognized clips presented after the song. The prediction accuracy of clips presented before the song but that were recognized was roughly equal to that of unrecognized clips presented after the song. Unrecognized clips that were presented before the song were least predictive of the song preference rating, but still far more predictive than one would expect from random chance. We concluded that both recognition and presentation order seem to play about equally strong roles in increasing song predictability from clips, with short-term and long-term memory serving as plausible mechanisms underlying this effect.
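A sketch of the mean absolute deviation (MAD) computation split by recognition and presentation order, using a hypothetical trial table with illustrative column names and simulated ratings:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
trials = pd.DataFrame({
    "clip_rating": rng.integers(1, 8, 5000),
    "song_rating": rng.integers(1, 8, 5000),
    "recognized": rng.integers(0, 2, 5000).astype(bool),
    "clip_after_song": rng.integers(0, 2, 5000).astype(bool),
})
trials["abs_dev"] = (trials["clip_rating"] - trials["song_rating"]).abs()

# MAD between clip and song preference ratings, by recognition x presentation order
mad = trials.groupby(["recognized", "clip_after_song"])["abs_dev"].mean()
print(mad)
```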
It is somewhat surprising that neither the duration of the excerpt (within a three-fold window of commonly used excerpt durations, from 5 to 15 seconds) nor the song section the excerpt came from appreciably affected the predictability of song preference ratings from excerpt preference ratings (with the exception of the contrast between intro/outro and representative clips). With this in mind, note that we presented the excerpts in random order throughout the experiment. Given this design, it is possible that there was cross-contamination between the excerpts; for instance, if a longer excerpt was played first, that might affect how a later, shorter one was interpreted. It is also possible that if a more representative segment was played first, the intro would remind the listener of that segment, which would artificially level the differences in predictability. This is a valid concern, and one of the key reasons Belfi et al. (2018) adopted a block design. However, such a design is not without its own inherent concerns, such as order effects or temporal autocorrelations. Thus, we addressed this possibility by analyzing the predictability of song ratings based on the very first excerpt of a song (and only if that excerpt was played before the song), as a function of duration and song section. As we could only use the very first instance of the excerpts, we computed the root mean squared error (RMSE) across all participants and songs and then bootstrapped confidence intervals. To match the significance level used in the rest of this publication and to guard against multiple comparison concerns, we computed 99% confidence intervals. As is evident in Figure 9, the concerns about cross-contamination between excerpts were not borne out empirically, as these results closely mirror those of the previous analyses presented above.
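A sketch of the first-exposure analysis described above: compute the RMSE between clip and song preference ratings and bootstrap a 99% confidence interval by resampling trials. A percentile bootstrap and the variable names are assumptions; the original analysis was run in MATLAB.

```python
import numpy as np

def rmse(clip, song):
    return np.sqrt(np.mean((np.asarray(clip) - np.asarray(song)) ** 2))

def bootstrap_rmse_ci(clip, song, n_boot=10000, ci=0.99, seed=0):
    """Percentile bootstrap CI for the RMSE between clip and song ratings."""
    rng = np.random.default_rng(seed)
    clip, song = np.asarray(clip), np.asarray(song)
    n = len(clip)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)             # resample trials with replacement
        stats[i] = rmse(clip[idx], song[idx])
    lo, hi = np.percentile(stats, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return rmse(clip, song), (lo, hi)

# Simulated first-exposure trials (song and clip ratings on a 1-7 scale)
rng = np.random.default_rng(7)
song = rng.integers(1, 8, 1000)
clip = np.clip(song + rng.integers(-2, 3, 1000), 1, 7)
print(bootstrap_rmse_ci(clip, song))
```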
After answering the question of whether preference ratings of songs can be predicted from clip preference ratings, we turned to the question of whether song familiarity could be predicted from clip familiarity. To answer this question, we employed a similar but somewhat modified analysis, as each presentation of music—both clips and songs—changed the reported familiarity of subsequent music presentations of the same kind (see Figure 10).
Figure 10 shows two things: First, familiarity significantly increased as a function of the number of presentations of the same music. Second, the type of music mattered; listening to a song had a much larger impact on reported familiarity than listening to a clip. This is plausible, as the average song was 21 times longer than the average clip. As both observations affect any possible relationship between clips and songs, we restricted our analysis to clips that immediately preceded a song. We believe this was a fair comparison, as the average familiarity of these clips and songs was comparable.
Thus, to answer the question of whether song familiarity can be predicted from clip familiarity, we calculated the Spearman correlation between songs and the corresponding clips that immediately preceded them; the median correlation was .771 (see Figure 11 for the distribution).
Following the logic of the preference ratings analysis, we considered whether there was a difference in median correlation as a function of clip duration. However, as discussed above, only the immediately preceding clip was valid as a predictor of song familiarity. This meant that each correlation was only based on a few points, which made correlations of 1, 0 and -1 much more likely, as evident in Figure 12. Thus, we paired the correlation figures with corresponding mean absolute deviation numbers and distributions.
Consistent with the findings from the preference ratings, the relationship between clip and song familiarity was strong, but there was no significant difference between different clip durations (see Table 3).
Comparison of MAD (clip vs. song familiarity) | Delta | D | p | Significant?
---|---|---|---|---
5s vs. 10s | 0.0266 | 0.0544 | .3085 | ns
5s vs. 15s | 0.0236 | 0.0425 | .6207 | ns
10s vs. 15s | 0.0031 | 0.0246 | .9898 | ns
Intro/Outro vs. Chorus | 0.1097 | 0.1022 | .0032 | *
Intro/Outro vs. Verse | 0.0988 | 0.0891 | .0169 | ns
Intro/Outro vs. Representative | 0.1270 | 0.1287 | 8.586e-5 | **
Chorus vs. Verse | 0.0109 | 0.0283 | .9645 | ns
Chorus vs. Representative | 0.0173 | 0.0728 | .0727 | ns
Verse vs. Representative | 0.0282 | 0.0676 | .1228 | ns
To conclude our analysis, we also considered the relationship between clip and song familiarity as a function of song section from which the clip was sampled. We show these distributions in Figure 13. As the same considerations regarding correlation apply, we also present the mean absolute deviation (MAD).
Like the duration findings, these results are consistent with the corresponding results from the preference ratings; see Table 3 for the results of the hypothesis tests.
Discussion
In this study, we explored whether the psychological responses to excerpts are representative of the songs from which they were sampled. We found that, broadly speaking, this was the case: Both the preference and familiarity ratings of a song could be well predicted from excerpt preference and familiarity ratings. This pattern of results is remarkably consistent. Both song preference and familiarity ratings are well predicted from clips, regardless of their duration and regardless of the section of the song from which they were sampled, with the exception of a significant difference between clips from the Intro or Outro and clips from the subjectively most representative part of the song.
One strength of this research is that it was conducted with high statistical power, as we used about an order of magnitude more participants in this study than most existing experimental studies on music perception and cognition. The only exceptions are music studies that rely on survey research or participants on Amazon mTurk, which come with their own host of inherent problems (Buchanan & Scofield, 2017). In the age of questionable replicability, power is a considerable concern (Open Science Collaboration, 2015; Wallisch, 2015).
A limitation of this work is that we were unable to determine how the concordance between clip and song ratings rises as a function of clip duration, as the agreement was already high at the shortest duration we used (5 seconds). This is perhaps not surprising given research on “thin slices” in music perception: participants are able to recognize snippets of songs as short as 300 ms with a 25% success rate (Krumhansl, 2010). In general, the temporal fidelity of the human auditory system is exquisite; people are able to distinguish the human voice from other sound sources with exposure durations as short as 2 ms (Suied et al., 2013).
Thus, the shortest duration at which song rating and recognition can be perfectly predicted from clip responses will lie somewhere below 5 seconds, perhaps considerably so. However, the question of the clip duration at which predictions about song judgments dip below perfect (given the limits of reliability imposed by the song ratings themselves) is perhaps mostly of academic interest. An exposure duration of 5 seconds already allows the experimenter to rapidly present excerpts from many songs, and it might take participants a while to make judgments about what they are hearing. For instance, presenting a rapid stream of clips with a duration of 1 second might still yield reliable judgments about the songs, but might be too exhausting for participants if the stream is too long. Nevertheless, determining the exact slope of this “rise of the temporal kernel” could be of interest and thus constitutes an area for future research.
Another potential limitation of this research is that we used a fully randomized design when determining the order of clips and songs. Such an approach has many advantages, as it eliminates concerns regarding temporal autocorrelations and secular response biases. However, one downside of such a design is that if a shorter clip is presented after a longer one, participants could conceivably remember, and respond to, what they heard before rather than the currently presented short clip. If this effect were strong, it might limit the interpretability of the conclusion that duration did not matter. Similar concerns apply to the conclusion that the song section an excerpt came from did not matter. However, this would require prodigious feats of memory on the part of the participants, who would have to keep track of all of this information across presentations from different song sections throughout the study. Indeed, our analysis presented in Figure 9 shows that this concern is empirically unfounded; even if one only considers first exposures of song excerpts, where no such contamination between excerpts is possible, we find the same results as in the other analyses.
To summarize, we believe this research has implications for the study of music perception and cognition as a whole. First, it shows that excerpts as short as 5 seconds are already representative of the song as a whole in terms of preference and familiarity ratings. This means that studies on the psychology of music can potentially present many more stimuli than used in previous work without compromising how well a song is represented by the clip. Second, there are also theoretical implications for the psychology of music. For instance, the sonic properties of a song are presumably somewhat different between the chorus, the verse, and the intro. Remarkably, this does not seem to matter much: judgments about how much someone likes a given song seem to be largely independent of the location in the song from which the excerpt was sampled.
Of course, simply recognizing the song could be enough to determine how much one likes it. This is plausible, given that the accurate recognition of genre is very fast (Gjerdingen & Perrott, 2008; Mace et al., 2012), and people presumably know how much they like a given genre. Our results suggest that while song recognition did play a role, it did not account for all of the effects we observed. However, one caveat is that we might not have achieved a fair apples-to-apples comparison, as so many clips were unrecognized. As this study was not designed to address this question specifically, we recommend doing so in future research, with a more balanced sample of music in terms of popular recognition.
The strong correlation between clip and song preference ratings, independent of where in the song the clip was sampled and independent of its duration, raises the question of what actually determines one’s preference for a given song. What exactly underlies this preference, whether invariant sonic properties that are present throughout the entire song or judgments of genres as a whole, should be explored in future research. It also speaks to one of the central questions of Gestalt psychology, namely whether “the whole is other than the sum of its parts” (Koffka, 1935). At least in terms of some psychological qualities of popular music, as measured by preference and familiarity ratings, that does not seem to be the case, as responses to the parts seem, overall, to be a good proxy for responses to the whole. Indeed, this research suggests that popular music is closer to auditory textures (Ellis et al., 2011; McDermott & Simoncelli, 2011), in contrast with popular movies, which are driven by narrative and plot development (Wallisch & Whritner, 2017).
Finally, this research could have practical implications. Music industry platforms like iTunes, Amazon, and Pandora already provide prospective buyers with a 30-second excerpt of each song, presumably chosen by expert judgment. Instead, these previews could simply be a 5-second clip randomly picked from a song segment other than the Intro or Outro.
Author Note
We would like to thank Lucy Cranmer, Warren Ersly, and Ted Coons for helpful comments on a prior version of this manuscript. We would also like to thank the Dean’s Undergraduate Research Fund (DURF) at New York University for financial support of this project and Andy Hilford for providing the space for our lab to run participants.