Musical timbre is often described using terms from non-auditory senses, mainly vision and touch, but it is not clear whether crossmodality in timbre semantics reflects multisensory processing or simply linguistic convention. If multisensory processing is involved in timbre perception, the mechanism governing the interaction remains unknown. To investigate whether timbres commonly perceived as “bright-dark” facilitate or interfere with visual perception (darkness-brightness), we designed two speeded classification experiments. Participants were presented with consecutive images of slightly varying (or identical) brightness along with task-irrelevant auditory primes (“bright” or “dark” tones) and were asked to quickly identify whether the second image was brighter or darker than the first. Incongruent prime-stimulus combinations produced significantly more response errors than congruent combinations, but choice reaction time was unaffected. Furthermore, responses in a deceptive identical-image condition indicated a subtle, semantically congruent response bias. Additionally, in Experiment 2 (which also incorporated a spatial texture task), measures of reaction time (RT) and accuracy were used to construct speed-accuracy tradeoff functions (SATFs) in order to critically compare two hypothesized mechanisms for timbre-based crossmodal interactions: sensory response change vs. shift in response criterion. Results of the SATF analysis are largely consistent with the response criterion hypothesis, although they do not conclusively rule out sensory change.
In his seminal 1844 Treatise on Instrumentation, Hector Berlioz describes the sound of the low-register clarinet as “coldly threatening” and infused with “dark…rage.” Descriptions like these are hardly poetic outliers in the lexicon of musical timbre; they are the convention. Timbre is thoroughly multimodal: we make sense of sound by way of comparison to other sensory experiences. The crossmodal nature of timbre semantics has been documented in numerous languages, cultural and music-stylistic contexts, and time periods (for review, see Wallmark & Kendall, 2018). But although Berlioz’s descriptions appear intuitive to many of us, on closer consideration, the cognitive basis for describing musical sounds in tactile and visual terms remains somewhat mysterious. How can timbre, a property of auditory perception, be “cold” or “warm,” “dark” or “bright”? Are these “just words,” lacking any direct claim on multisensory processing, or do they represent structural connections or “weak synesthesia” between sensory channels that have historically solidified into semantic conventions? If so, then what mechanism links non-auditory experiences to auditory ones?
Crossmodal correspondences refer to perceptual interactions across seemingly independent sensory domains (Marks, 1978, 2004; Spence, 2011). When people are asked to quickly identify a visual stimulus, for example, response times are generally shorter when the visual stimulus is accompanied by an irrelevant sound, and responses are facilitated further (evidenced by greater speed and accuracy) when the visual and auditory stimuli are conceptually or semantically related (e.g., a “high” pitch accompanying classification of a spatially “high” visual stimulus). Interactions of this sort have been extensively documented between pitch height/loudness and visual elevation (e.g., Ben-Artzi & Marks, 1995a; Bernstein & Edelstein, 1971), brightness (Klapetek, Ngo, & Spence, 2012; Marks, 1987; Marks, Hammeal, Bornstein, & Smith, 1987), and physical size (Eitan, Schupak, Gotler, & Marks, 2014; Gallace & Spence, 2006). However, to the best of our knowledge, this paradigm has not yet been used to test for crossmodal correspondences in the perception of timbre. Wallmark (2019a) explored automatic associations, or “semantic crosstalk,” between common adjectival labels for timbre (e.g., “BRIGHT” and “DARK”) and corresponding sounds, reporting weak interactions; however, that study examined only the linguistic correlates of crossmodal interaction, not the interrelationship of different sensory or decisional systems themselves. Moreover, the mechanisms governing crossmodal interactions involving timbre remain unknown.
In this article, we report the results of two experiments that examine the extent to which task-irrelevant crossmodally congruent or incongruent tones affect responses in a visual choice task. The first experiment asks whether and how the timbre of natural-instrument and synthesized sound (“dark” vs. “bright”) affects the identification of visual stimuli varying in brightness. The second experiment, which incorporates a spatial texture identification task (“smooth” vs. “rough”) using simple waveforms as auditory stimuli, and a response deadline, critically compares two competing explanations for these hypothesized interactions by examining the speed-accuracy tradeoff functions that characterize the results.
Timbre and Crossmodal Perception
To date, most evidence for crossmodal processes in timbre cognition is linguistic (Saitis, Weinzierl, von Kriegstein, Ystad, & Cuskley, 2020). Well documented is the widespread application to timbre of terms from the visual and tactile modalities in particular (for reviews, see Saitis & Weinzierl, 2019; Winter, 2019). In an interlanguage comparative study of English and Greek speakers, for instance, Zacharakis, Pastiadis, and Reiss (2014, 2015) found that timbre is commonly conceptualized in terms of luminance (“bright”), texture (“smooth”), and mass (“dense”). Similar findings have appeared in crossmodal studies in music perception (Eitan & Rothschild, 2010), written discourse of orchestration (Kendall & Carterette, 1993a; Wallmark, 2019b), social tagging in online music streaming services (Ferrer & Eerola, 2011), and interviews with professional musicians (Reymore & Huron, 2020) and studio engineers (Porcello, 2004). However, the presence of semantic crossmodal relations in the written and spoken discourse of timbre provides only partial and circumstantial evidence for genuine crossmodal interactions; labeling a tone “bright” might indicate, but does not necessarily indicate, a mechanistic link between the processing of auditory and visual brightness. The two domains may share a common lexical or semantic basis, with or without also sharing a sensory, perhaps amodal, one (e.g., Melara & Marks, 1990).
The strongest evidence for crossmodal correspondences in timbre perception comes from studies that examined facilitation and interference of timbre in visual and texture response tasks without explicit semantic mediation. Using an implicit association paradigm, Adeli, Rouat, and Molotchnikoff (2014) reported that participants associated the sound of a sine wave with rounded, smooth shapes and a square wave with jagged, rough shapes. This particular audio-visual association also appears in reverse: Parise and Pavani (2011) asked participants to vocalize the vowel /a/ in response to jagged compared to rounded shapes and found greater energy in the high-frequency partials produced in response to jagged shapes. Consistent with this evidence, Giannakis (2006) examined the acoustical determinants of audio-visual interactions and found that three dimensions reliably mapped onto visual texture perception: spectral centroid to contrast, inharmonicity to coarseness/granularity, and sensory dissonance to pattern periodicity. The same basic set of acoustical features reliably predicted visuo-tactile intensity in a semantic task (Wallmark, 2019a). Aside from visual texture, moreover, manipulations of spectral centroid have been shown to modulate tactile roughness perception: tones with stronger high-frequency energy increased judgments of the roughness of touched surfaces (Guest, Catmur, Lloyd, & Spence, 2002). It has long been known, too, that pitch height—comparable in the frequency domain to spectral centroid in the timbral domain—interacts with judgments of visual brightness: higher sounds are associated with bright and lower sounds with dim lights, colors, and shapes (Klapetek et al., 2012; Marks, 1987; Marks et al., 1987; Martino & Marks, 2000). Taken together, these findings suggest crossmodal assimilation (the integration or addition of information across sensory channels) in a manner consistent with conventional sound symbolic associations (Sidhu & Pexman, 2018).
Another way to experimentally account for the influence of language in such crossmodal interactions is to explore differences in associative patterns between populations at varying stages of cognitive development (Bonetti & Costa, 2019; Maurer, Gibson, & Spector, 2012) and between populations with differing degrees of exposure to musical discourse and training (McAdams, Douglas, & Vempala, 2017; Siedenburg & McAdams, 2018). Recent evidence suggests that crossmodal associations involving timbre develop fairly early in childhood. In a study of audio-visual and audio-tactile associations among 3- to 5-year-olds, Wallmark and Allen (2020) reported that most preliterate children associated a sine wave with the sensation of fine-grit sandpaper and a sawtooth wave with coarse-grit sandpaper. Associations of sounds to visual brightness, on the other hand, showed a strong effect of age. Adults exhibited audio-tactile associations consistent with those of children and audio-visual associations consistent with those of older but not younger preschoolers (sine wave: dark/smooth, sawtooth: bright/rough), with no effects of music training. In sum, findings to date suggest that timbre processing may affect sensory responses in other domains and/or bias decisional processes in a way that is roughly consistent with conventional semantic associations.
Crossmodal Interactions and the Speed-Accuracy Tradeoff Function (SATF)
Multisensory interactions are widely studied using the paradigm of speeded classification, in which participants’ responses to a stimulus in one sensory domain are experimentally manipulated by introducing an informationally irrelevant stimulus in another domain (for review, see Algom & Fitousi, 2016). Because correct performance in the task is by definition unrelated to the secondary domain—e.g., auditory information is not relevant to visual responding—interactions of this kind are unexpected. Effects of the task-irrelevant stimulus in this paradigm are measured in participants’ reaction time (RT) and accuracy, and the two are often inversely correlated: response accuracy suffers when response speed increases (Ben-Artzi & Marks, 1995a; Nickerson, 1973; Posner, Nissen, & Klein, 1976). Speed-accuracy tradeoffs are ubiquitous, not only in human behavior but across species, reflecting the loss of information when time for signal-processing is limited (for review, see Heitz, 2014). The relation of accuracy to response time may thus provide valuable clues to the underlying mechanism at play when perceivers make decisions in one sensory domain while simultaneously processing an irrelevant stimulus in another.
Crossmodal interactions may be either facilitative or interfering. That is, an irrelevant stimulus in a secondary modality may increase speed and/or accuracy of responses made to stimuli in the primary modality (facilitation) or decrease speed and/or accuracy (interference). Moreover, two different kinds of mechanism may underlie these interactions. First, either facilitation or interference may arise because of a shift in sensory response: in this account, auditory information could summate with or subtract from visual information, modifying the visual stimulus’s effectiveness by increasing or decreasing a perceptible attribute. For example, Stein, London, Wilkinson, and Price (1996) reported that weak flashes of light were judged as brighter when accompanied by a burst of noise, and they postulated that sensory signals in the auditory pathway add to signals in the visual pathway, much as signals can add when one light is added to another (energy summation), thereby amplifying perceived brightness (see also, Odgaard, Arieh, & Marks, 2004). Alternatively, crossmodal facilitation or inhibition could largely or wholly reflect a shift in decisional criterion. By this account, the sound-induced increases in judgments of brightness reported by Stein et al. could have resulted instead because the auditory stimulus changed the response criteria used in judging brightness, biasing perceivers to report greater levels of brightness when the sound was present rather than absent.
Arieh and Marks (2008) compared the predictions of these two hypotheses by measuring speed-accuracy tradeoff functions (SATFs) in speeded responses to discriminate lights of different colors in the presence and absence of simultaneous pulses of sound. Whereas the hypothesis of energy summation predicts that crossmodal facilitation involves (a) increases in response speed together with (b) increases in accuracy (or increases in either one and no change in the other), measured by d’ of signal detection theory, the hypothesis of decisional shift predicts that increases in response speed are accompanied by decreases in d’: an increase in correct identification responses would be offset by a corresponding increase in false-positive responses. Consequently, the energy-summation hypothesis predicts that when a congruent sound accompanies the visual target, the SATF (a plot of accuracy against response time) will lie above the SATF obtained when an incongruent sound accompanies the visual target, that is, will show greater accuracy at a given RT. Alternatively, the decisional-criterion hypothesis predicts that an irrelevant sound shifts only the location of the response criterion without affecting the rate that information accrues; therefore, a decrease in response time would be accompanied by a corresponding increase in errors. Arieh and Marks (2008) reported SATFs that overlapped when sounds did and did not accompany the visual targets, suggesting that the addition of task-irrelevant noise facilitated visual responses by shifting the decisional criterion without affecting the sensory response.
The Present Study
In two experiments, we investigated the crossmodal influence of auditory timbre on visual identification and, by implication, on visual discrimination. In Experiment 1, we used a speeded-response paradigm in which participants identified a shift in brightness between two consecutively presented target squares of subtly contrasting brightness levels (darker, brighter, or the same). Responses to visual targets were primed on different trials with a large number of heterogeneous natural-instrument and synthesized tones shown in a previous study (Wallmark, 2019a) to be considered very “dark” or very “bright.” (For the sake of clarity, “priming” in this context can be understood as the facilitating and/or interfering effects of timbre on the visual responses.) Our logic was simple: if “dark” or “bright” musical timbres are commonly conceptualized and so labeled due to symbolic (and possibly arbitrary) linguistic conventions—that is, if timbres are crossmodal “in name only”—then we should not expect auditory primes to affect response speed and/or accuracy in a visual brightness identification task. Conversely, as we hypothesized, if reaction speed and/or accuracy are systematically influenced by the degree of congruency between task-irrelevant sounds and brightness—e.g., a “bright” priming sound presented alongside a shift to a darker target square would slow responses and/or increase errors relative to a “dark” prime—then this interaction would indicate some degree of crossmodal interference in the response. Correspondingly, incorporating a modified version of the deception paradigm of Meier, Robinson, Crawford, and Ahlvers (2007) and Bhattacharya and Lindsen (2016), in which participants compare baseline brightness to a target brightness and provide forced-choice responses of darker or brighter when in fact the two stimuli are identical, we aimed to evaluate the potential assimilative biasing effect of auditory primes on brightness discrimination. That is, we wanted to know whether people perceive a second, identical gray square as darker or brighter above chance levels after being primed with “dark” or “bright” sounds.
Experiment 1, then, sought to determine whether ecologically valid musical timbres commonly labeled as “bright” or “dark” would interfere crossmodally with visual brightness and darkness judgments. Building on this hypothesis, in Experiment 2 we shifted aims to critically compare two hypothesized mechanisms governing audio-visual correspondences in timbre perception, shift in sensory response vs. decisional criterion, using a paradigm similar to that of Experiment 1. To create speed-accuracy tradeoff functions (SATFs), it was necessary to increase the proportion of response errors to a level somewhere between complete accuracy and chance performance; hence, Experiment 2 set a response deadline of 500 ms. Furthermore, to reduce the high degree of perceptual variability associated with a large number of auditory stimuli in a manner that corresponds more closely to earlier studies on SATFs (e.g., Arieh & Marks, 2008), auditory stimuli here consisted of simple waveforms. We also introduced a visual texture identification task using visual stimuli analogous to those of the brightness task (close-up photos of sandpaper patches of slightly contrasting grit sizes). In addition to asking whether sounds with “dark,” “bright,” “smooth,” and “rough” waveforms had an assimilative effect on responses to brightness and texture when matched or mismatched with visual stimuli and whether these timbre primes would bias response in a deception paradigm, we plotted and analyzed SATFs to compare the competing explanatory mechanisms (sensory change, shift in response criterion) of timbre-based crossmodal interactions.
Experiment 1: Effects of Musical Timbre on Visual Brightness Identification
Method
Participants
Participants were 140 undergraduate students (100 females; M age = 19.77, SD = 1.61). Relative musical background was assessed using the Goldsmiths Musical Sophistication Index perceptual and musical training subscales (Gold-MSI; Müllensiefen, Gingras, Musil, & Stewart, 2014), M Gold-MSI (perceptual + training) = 38.68, SD = 12.06, min = 14, max = 67. All participants reported normal hearing and normal or corrected-to-normal vision. Students received course credit in an introductory psychology class for participating. All experiments were approved by the Southern Methodist University (SMU) Institutional Review Board. Data were collected in conjunction with a larger preregistered music cognition experiment (https://osf.io/6juyn).
Stimuli
Visual stimuli: Three visual stimuli were created using graphic design software. The first, baseline gray stimulus consisted of a black square (640 x 640 pixels) set to 50% opacity (hex color code #7F7F7F). The other two were slightly darker (60% opacity; hex code #666666) and slightly brighter (40% opacity; #999999) versions of the baseline square. The transition screen between stimuli consisted of a white (#FFFFFF) square with a fixation cross at its center (Times New Roman font, 24-point size; see procedure below). Identification of the visual stimuli was validated in a separate control experiment (see Supplementary Materials accompanying this article at mp.ucpress.edu).
Auditory stimuli: The 32 auditory stimuli (timbre primes) consisted of 16 natural-instrument signals and 16 synthesized signals. Natural-instrument signals were drawn from the McGill University Master Samples (MUMS; Opolko & Wapnick, 1987), and synthesized signals were software instrument presets in Apple GarageBand. Synthesized signals were included to mitigate possible effects on responses of instrument identification and instrument-specific semantic associations (Golubock & Janata, 2013). All auditory primes were 1.5 s in duration, pitched at D♯4, sampled at 44.1 kHz, and normalized to 65 dB SPL (A-weighted).
The parameters of the auditory stimuli were based on results of a perceptual study of timbral brightness (Wallmark, 2019a, Experiment 2a). In that study, 93 natural-instrument and synthesized signals were rated by participants on a 7-point bipolar semantic differential scale (very dark–very bright). The highest and lowest rated eight signals for each stimulus type (instrument and synthesized) were selected for the present study as exemplars of “dark” and “bright” timbres. These signals produced weak semantic crosstalk in the earlier study. Table 1 presents a complete list of signals used in the experiment.
Table 1. Experiment 1 Stimuli

| Natural “dark” | Natural “bright” | Synthesized “dark” | Synthesized “bright” |
| --- | --- | --- | --- |
| archlute | alto shawm aasf | Antarctic sun | big pulse waves |
| tenor baroque recorder | bass viol | deep sub bass | bright synth brass |
| bassoon | cornetto | FM piano | chip tune lead |
| tenor crumhorn | English horn | heavy sub bass | icy synth lead |
| oboe | classical guitar | infinity pad | monster bass |
| tenor sax (growl) | horn (mute) | soft square lead | paper-thin lead |
| trombone (mute) | tenor viol | starlight vox | percussive square lead |
| tuba | trumpet (harmon mute) | synth e-bass | soft saw lead |

Note: Natural-instrument signals taken from the MUMS library; synthesized signals derived from the GarageBand software instrument library. See Wallmark (2019a).
Procedure
The experiment was conducted in a quiet room on iMac computers (27” display, 2560 x 1440 pixel resolution) running DirectRT software (Jarvis, 2016). Auditory stimuli were presented at a subjectively determined comfortable loudness through Bose SoundTrue headphones. Participants were first presented with the baseline square. After 1 s, this square was replaced by a transition square with fixation cross, at which time participants heard a 1.5-s auditory prime. Following the prime, the fixation cross and transition square were replaced by a target square in one of three relations to the baseline square: (1) Brighter than baseline (Brighter-target), (2) Darker than baseline (Darker-target), or (3) Same as baseline (Same-target).
Similar to the procedure of Bhattacharya and Lindsen (2016) and Meier et al. (2007), the participants were instructed to respond as accurately and quickly as possible to the target square by pressing a key corresponding to whether they judged the target to be darker (left-arrow) or brighter (right-arrow) than the baseline, while disregarding the sounds. The participants were informed that the change in brightness from baseline to target was very small, but nevertheless detectable by most people; however, the Same-target stimulus was actually identical to the baseline. Thus, in 2/3 of trials the light level did shift (by ± 10% opacity), but in the remaining 1/3, participants were forced to respond darker or brighter even though the baseline and test squares were identical. The procedure is diagrammed in Figure 1.
Prior to beginning, participants received four practice trials in order to familiarize themselves with the procedure. Natural-instrument and synthesized stimulus blocks were presented separately in a counterbalanced order, each with its own practice. In the main task, each of the three visual targets (Darker, Brighter, Same) was presented in 16 randomly ordered trials: 8 trials with “dark” auditory primes and 8 with “bright” primes. Hence, each block consisted of 48 trials (96 total for the 2 blocks of stimuli). The experiment took approximately 15 minutes.
Results
Analyses were conducted using R Version 3.4.4 (R Core Team, 2004–2016; see Supplementary Materials for data and R scripts at mp.ucpress.edu). Following Whelan (2008), outlier thresholds of < 100 ms and > 2000 ms were applied to RT data, resulting in the exclusion of 549 responses (out of 12,315 total, or about 4%). Data were also scanned for participants with total error rates worse than chance (> 50%), which resulted in eliminating the data of one participant. Table 2 summarizes the resulting measures of RT and accuracy, collapsed across participants to produce median correct RTs, interquartile ranges of correct RTs, and average error.
Table 2. Experiment 1 Median Reaction Time (RT) of Correct Responses, Interquartile Range (IQR) of RTs, and Error Rate by Visual Target, Auditory Prime, and Crossmodal Congruency

| Visual target | Prime | Congruency | RT (ms) | IQR (ms) | Error |
| --- | --- | --- | --- | --- | --- |
| Brighter | “bright” | congruent | 635 | 221 | 3% |
| Darker | “dark” | congruent | 690 | 238 | 9% |
| Brighter | “dark” | incongruent | 633 | 210 | 13% |
| Darker | “bright” | incongruent | 675 | 212 | 15% |
| Same | “bright” | — | 844 | 446 | — |
| Same | “dark” | — | 825 | 439 | — |

Note: Congruency and error do not apply to the deceptive Same-target condition since the visual target remained unchanged and no objectively “correct” response was possible.
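For concreteness, the trimming and descriptive steps described above might look like the following in R. This is a minimal sketch, not the study's actual scripts: the data frame `trials` and its column names (`participant`, `target`, `prime`, `rt`, `correct`) are illustrative assumptions.

```r
# Hypothetical long-format data: one row per trial, with columns
# participant, target, prime, rt (ms), and correct (1 = correct, 0 = error)
library(dplyr)

trials <- read.csv("exp1_trials.csv")  # illustrative file name

trials <- trials %>%
  filter(rt >= 100, rt <= 2000) %>%   # RT outlier thresholds (Whelan, 2008)
  group_by(participant) %>%
  filter(mean(correct) >= 0.5) %>%    # drop worse-than-chance participants
  ungroup()

# Median correct RT, IQR of correct RTs, and error rate by condition (cf. Table 2)
trials %>%
  group_by(target, prime) %>%
  summarise(rt_med = median(rt[correct == 1]),
            rt_iqr = IQR(rt[correct == 1]),
            error  = mean(1 - correct),
            .groups = "drop")
```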
First, we wanted to know whether visual target brightness and auditory prime interacted with one another to affect RT and response error. Results from the Same-target condition were excluded from this first analysis because there were no objective congruency and accuracy dimensions in this condition. Since RTs exhibited a right-skewed distribution, they were log-transformed. We computed two linear mixed-effects models (LMMs) predicting log-RT and error from visual target (two levels: Brighter-target, Darker-target) and auditory prime (two levels: “bright” and “dark”). Cross-participant variability was modeled as random effects. Additionally, in order to statistically control for the potential influence of stimulus type (natural vs. synthesized) and musical sophistication (Gold-MSI), these variables were included as covariates in our models (see full analyses of the marginal effects of these covariates in the Supplementary Materials at mp.ucpress.edu). LMMs were created with the lme4 package (Bates, Mächler, Bolker, & Walker, 2015), and significance levels of main fixed effects and interactions were calculated using Type II Wald chi-square tests in the car package (Fox & Weisberg, 2010).
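Under those assumptions, the two models might be specified roughly as below in lme4 syntax; `stim_type` and `gold_msi` are illustrative names for the stimulus-type and musical-sophistication covariates, not the authors' actual code.

```r
library(lme4)
library(car)  # Anova() for Type II Wald chi-square tests

shift <- subset(trials, target != "Same")  # exclude deceptive Same-target trials

# Log-RT model on correct responses, with by-participant random intercepts
rt_mod <- lmer(log(rt) ~ target * prime + stim_type + gold_msi +
                 (1 | participant),
               data = subset(shift, correct == 1))
Anova(rt_mod, type = "II")

# Logistic binomial LMM on response error
err_mod <- glmer(correct == 0 ~ target * prime + stim_type + gold_msi +
                   (1 | participant),
                 family = binomial, data = shift)
Anova(err_mod, type = "II")
```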
The reaction time LMM accounted for 29% of total variance in choice RT, R2 = .29, p < .00001. The two-way interaction between visual target and auditory prime was not statistically significant, Wald χ2(1) = 0.93, p = .34, indicating that crossmodal congruency did not influence choice RT (Figure 2, panel A). However, the analysis revealed main fixed effects of both visual target, Wald χ2(1) = 201.46, p < .00001, and prime, Wald χ2(1) = 14.85, p = .0001. Correct responses to Brighter visual targets were an average of 49 ms faster than responses to Darker targets (634 ms vs. 683 ms), and “bright” primes elicited 7 ms faster responses than “dark” primes (655 ms vs. 662 ms).
Figure 2. Experiment 1 results. (A) Nonsignificant interaction between visual target and auditory prime on choice reaction time (RT); (B) significant interaction between visual target and auditory prime on response error. Error bars = 95% CI.
The logistic binomial LMM (LLMM) on response error accounted for half of total variance, R2 = .50, p < .00001. In contrast to the RT analysis, the error analysis revealed a statistically significant interaction between visual target and auditory prime, Wald χ2(1) = 214.04, p < .00001 (Figure 2, panel B). Crossmodally congruent responses were on average 8% more accurate than incongruent responses (6% vs. 14% error). Consistent with the RT analysis, moreover, main fixed effects were found for both visual target, Wald χ2(1) = 26.42, p < .00001, and prime, Wald χ2(1) = 6.65, p = .01; that is, Brighter targets elicited fewer errors than Darker (8% vs. 12%), and “bright” primes led to fewer errors than “dark” (9% vs. 11%).
Next, to explore whether and how “dark” and “bright” auditory primes biased responses in the deceptive Same-target condition, we analyzed the data in two ways. First, we globally determined whether the observed frequencies of the two responses (darker vs. brighter) differed significantly from the expected frequency (roughly 50% each, as established in the control experiment). Collapsed across participants, preliminary chi-square goodness-of-fit tests (α-level Bonferroni corrected for multiple comparisons; .05/2 = .025) revealed that participants judged the Same target as brighter 4% more frequently after priming with “bright” timbres, χ2(1) = 9.42, p = .002, Cohen’s w = .07. Same targets were also judged as darker 2% more frequently after priming with “dark” timbres, χ2(1) = 4.65, p = .03, Cohen’s w = .05, though this effect was not statistically significant after correction.
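A minimal sketch of such a goodness-of-fit test and its effect size follows; the counts are invented for illustration, and the 50/50 expected split follows the control experiment.

```r
# Pooled Same-target responses after "bright" primes (hypothetical counts)
obs <- c(brighter = 1160, darker = 1080)

gof <- chisq.test(obs, p = c(0.5, 0.5))  # test against a 50/50 split
gof

# Cohen's w effect size: sqrt(chi-square / N)
sqrt(unname(gof$statistic) / sum(obs))

# Bonferroni-adjusted alpha for the two prime conditions
.05 / 2
```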
To correct for repeated measurements and statistically control for stimulus type and musical sophistication, as in the previous analyses, we next computed a LLMM on responses to visual targets in the Same-target condition using auditory prime as the predictor variable. No significant relationship between prime and response choice in the Same-target condition was observed, Wald χ2(1) = 1.98, p = .16. Hence, though biasing effects of auditory priming were evident when responses were collapsed, they did not survive the LLMM analysis.
In conclusion, the results of Experiment 1 indicated that incongruent pairings of timbre with visual brightness increased the proportion of response errors in an assimilative way, while hardly affecting reaction speed. Accordingly, the speed-accuracy tradeoff in these data was nonsignificant, r(126) = –.14, p = .12, 95% CI [–.30, .04]. Moreover, Brighter visual targets produced faster correct responses and lower error rates than did Darker targets, irrespective of prime; likewise, “bright” auditory primes produced faster and less error-prone responses than did “dark” primes. Finally, in the Same-target condition, auditory primes slightly increased the number of responses congruent with the prime (i.e., participants were more likely to respond to a Same-target square as brighter than darker immediately after hearing an irrelevant “bright” timbre); this effect, however, was not statistically significant after accounting for other sources of variability.
What the data do not easily answer, however, is the question: Do auditory primes modify the sensory response to the luminance of the target, that is, its underlying brightness, or do the primes mostly or wholly affect the decisional criterion by producing or modifying a response bias? Given the large number of heterogenous auditory stimuli and the lack of a response deadline (which kept error rates relatively low), this first experiment was not suited for SATF analysis. Consequently, to explore the psychological mechanisms underlying crossmodal interactions involving timbre, we designed Experiment 2 to try to answer this question using a procedure and stimuli more amenable to comparison of SATFs.
Experiment 2: Effects of Auditory Timbre on Visual Brightness and Spatial Texture: A Speed-Accuracy Analysis
Method
Participants
Ninety undergraduates were recruited to participate (54 females; M age = 19.73, SD = 2.14). M Gold-MSI (perceptual + musical training subscales) = 35.39, SD = 10.32, min = 19, max = 60. All participants reported normal hearing and normal or corrected-to-normal vision. None participated in Experiment 1. The students received course credit for their contribution. The protocol was approved by the SMU IRB. Data were collected in conjunction with another preregistered study.
Stimuli
Visual targets (brightness and texture): The three targets for the visual brightness condition were identical to those used in Experiment 1. Roughly analogous to these equally spaced brightness stimuli, three images for spatial texture were created from black-and-white-photographed close-ups of sandpaper of slightly differing grit sizes: baseline, medium roughness was 100-grit 3M wood sandpaper; low roughness was 150-grit; and high roughness was 50-grit. These visual texture stimuli were similar to those used by Ho, Landy, and Maloney (2006) in a comparable study of texture discrimination, and were validated prior to use in a separate control experiment (see Supplementary Materials at mp.ucpress.edu).
Auditory primes: Auditory stimuli in Experiment 2 were four waveforms. All were 1.5 s in duration, pitched at 311 Hz (D♯4), and normalized in loudness to 65 dB SPL (A-weighted). A 50-ms linear amplitude ramp was applied at both ends of each signal. In the visual brightness task, the “bright” timbre prime was a square wave with a spectral centroid of 3640 Hz. The “dark” timbre was created by low-pass filtering the original square-wave signal, thus lowering the spectral centroid to 1222 Hz. It is well established that the spectral center of gravity (i.e., the proportion of high-to-low-frequency energy in a signal) is the main perceptual determinant of timbral “brightness,” “sharpness,” or “nasality” (e.g., Beauchamp, 1982; Kendall & Carterette, 1993b; Saitis & Siedenburg, 2020; Schubert & Wolfe, 2006; von Bismarck, 1974).
Using the same parameters as the “dark-bright” timbral stimuli, in the visual texture task the “smooth” timbre consisted of a sine wave (spectral centroid = 311 Hz) and the “rough” timbre of a sawtooth wave (spectral centroid = 6330 Hz). The same signals were previously tested in adult musicians, nonmusicians, and preschool-aged children (Wallmark & Allen, 2020). Magnitude spectra for the four signals are plotted in Supplementary Materials Figure 1. Semantic associations for all Experiment 2 auditory stimuli were confirmed in a separate control experiment (see Supplementary Materials).
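Signals of this general kind can be sketched in R as below. The synthesis chain is not detailed in the text beyond the filtering description above, so the filter order and cutoff here are assumptions chosen only to lower the spectral centroid, not the study's actual settings.

```r
library(signal)  # butter() and filter() for the low-pass stage

sr <- 44100                  # sample rate (Hz)
f0 <- 311.13                 # D#4 fundamental (Hz)
t  <- seq(0, 1.5, by = 1 / sr)

bright <- sign(sin(2 * pi * f0 * t))         # square wave ("bright")
lp     <- butter(4, 2000 / (sr / 2), "low")  # hypothetical cutoff and order
dark   <- as.numeric(filter(lp, bright))     # filtered square ("dark")

# 50-ms linear onset/offset ramps
n    <- round(0.05 * sr)
ramp <- c(seq(0, 1, length.out = n),
          rep(1, length(t) - 2 * n),
          seq(1, 0, length.out = n))
bright <- bright * ramp
dark   <- dark * ramp

# Spectral centroid: amplitude-weighted mean frequency of the magnitude spectrum
centroid <- function(x, sr) {
  mag  <- abs(fft(x))[1:(length(x) %/% 2)]
  freq <- (seq_along(mag) - 1) * sr / length(x)
  sum(freq * mag) / sum(mag)
}
centroid(bright, sr)  # higher centroid for the unfiltered square wave
centroid(dark, sr)    # lowered by the low-pass filter
```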
Procedure
The speeded identification procedure in Experiment 2 was similar to that of Experiment 1, but with a few crucial differences. Most importantly, in order to construct SATFs, we aimed to increase response errors to a level somewhere between complete accuracy and chance performance (~25% error rate). An average error rate of 10%, as in Experiment 1, would be inadequate for SATF analysis because the corresponding RTs fall too close to the asymptotic minimum in RT. In Experiment 1, error rates of roughly 25% occurred in responses between 400 and 500 ms; hence, to encourage very rapid responding, we used a deadline procedure in which participants had to respond on each trial within 500 ms. If no response was registered in the first 500 ms following the presentation of the target, the trial ended and the message “Please respond faster!” appeared on the screen for 1.5 s. This interval was longer than the usual immediate transition between trials, adding to the overall length of the experiment. We anticipated that this extra wait time would incentivize fast responding.
In the auditory priming condition, participants were first shown a baseline gray or medium-texture square for 1 s. They were then presented with a fixation cross for 1 s, and 500 ms after the onset of the cross, the timbre prime was presented. Thus, unlike Experiment 1, in which timbre primes appeared only during the fixation cross, Experiment 2 included a stimulus-onset asynchrony (SOA) of 500 ms. The SOA was introduced to offset the 500-ms deadline procedure because auditory information is processed more slowly than visual information in comparable experimental paradigms (Chen & Spence, 2010; Donohue, Appelbaum, Park, Roberts, & Woldorff, 2013). After the SOA, participants were presented one of the three visual targets: Darker, Brighter, or Same target (brightness condition); or Smoother, Rougher, or Same target (texture condition). Participants were instructed to respond as quickly and accurately as possible to the target by pressing the arrow key corresponding to the target’s relationship to the baseline, while disregarding the sound. The procedure is outlined in Figure 3 (texture task only).
Brightness and texture blocks were presented in a counterbalanced order. Before each, participants were given a practice run consisting of four trials in order to familiarize them with the procedure. In the main task, each of the baseline-target pairs was presented 15 times in 3 possible relations to the timbre prime: congruent (e.g., Brighter target, “bright” sound), incongruent (Brighter target, “dark” sound), and Same (Same target, “dark” sound). Additionally, a sound-off control condition was included (e.g., Brighter target, no prime). Each combination was thus represented by 5 trials, and the order of trials was randomized within each block (45 trials per block, 90 trials total). The experiment took approximately 15 minutes.
Results
Only responses made before the 500-ms deadline were recorded for analysis, resulting in the elimination of 2,727 responses out of 13,230, or 21%. Additionally, data from 15 blocks were eliminated because performance was worse than chance (> 50% error), indicating confusion with response key assignment or random guessing.
Following these global trims of the data, the overall error rate was 18%, close to halfway between complete accuracy and chance, and thus, following the suggestion of Arieh and Marks (2008), well suited for SATF analysis. The texture task elicited lower overall error rates than did brightness (14% vs. 22%), suggesting that the perceptual differences in texture were more discriminable than those in brightness.
Mean correct RT, standard deviation, and error rate for all combinations of visual target and auditory prime are shown in Table 3. As in Experiment 1, we used a linear mixed model (LMM) approach to explore the effects of target and prime on RT and response error, with variability associated with repeated measurements and musical sophistication treated as random effects. Owing to the deadline, the distributions of RTs in this experiment were more normal than in Experiment 1; therefore, no log transformation was employed in this analysis. Because visual and auditory stimuli varied over the conditions, we computed separate models for visual brightness and visual texture.
Table 3. Experiment 2 Mean Reaction Time (RT), Standard Deviation of RTs, and Error Rate in Each Congruency Condition

| Visual target | Prime | Congruency | M (ms) | SD (ms) | Error |
| --- | --- | --- | --- | --- | --- |
| **Brightness** | | | | | |
| Brighter | “bright” | congruent | 382 | 76 | 21% |
| Darker | “dark” | congruent | 380 | 81 | 20% |
| Brighter | “dark” | incongruent | 390 | 65 | 21% |
| Darker | “bright” | incongruent | 378 | 86 | 26% |
| Brighter | none | control | 389 | 69 | 17% |
| Darker | none | control | 403 | 71 | 25% |
| Same | none | control | 381 | 74 | — |
| Same | “bright” | prime | 368 | 80 | — |
| Same | “dark” | prime | 375 | 79 | — |
| **Texture** | | | | | |
| Rougher | “rough” | congruent | 375 | 64 | 14% |
| Smoother | “smooth” | congruent | 391 | 65 | 13% |
| Rougher | “smooth” | incongruent | 384 | 63 | 15% |
| Smoother | “rough” | incongruent | 396 | 64 | 15% |
| Rougher | none | control | 388 | 63 | 12% |
| Smoother | none | control | 411 | 56 | 14% |
| Same | none | control | 395 | 69 | — |
| Same | “rough” | prime | 384 | 78 | — |
| Same | “smooth” | prime | 382 | 78 | — |

Note: Mean and standard deviation are reported here due to the more normal distribution of this deadline-constrained dataset compared to Experiment 1, which did not include a deadline and was thus markedly right-skewed. (Those descriptive statistics are reported as medians and IQRs in Table 2.) Additionally, and for the same reason, the LMM analysis reported below does not use a log transformation, whereas the parallel analysis in Experiment 1 does. Congruency and error do not apply to the deceptive Same-target condition since the visual target remained unchanged.
The brightness RT model explained 17% of total variance, R2 = .17, p < .00001. We found a main fixed effect of prime, Wald χ2(2) = 12.29, p = .002, and an interaction between visual target and prime, Wald χ2(2) = 7.54, p = .02. RTs were on average 13 ms longer in the sound-off control condition (396 ms vs. 383 ms). Post hoc comparisons (Bonferroni) revealed that “bright” primes elicited significantly faster correct responses (380 ms) than sound-off (396 ms), p = .002, but no other differences between primes were found.
The texture RT LMM accounted for 19% of variance, R2 = .19, p < .00001. Main fixed effects were observed for visual target, Wald χ2(1) = 47.60, p < .00001, and prime, Wald χ2(2) = 21.86, p < .00001. Rougher targets were correctly identified 17 ms faster than Smoother (382 ms vs. 399 ms), and the sound-off control generated slower responses by 13 ms compared to auditory prime conditions (400 ms vs. 387 ms). Similar to the brightness model, the analysis revealed a significant interaction, Wald χ2(2) = 7.33, p = .03: responses to incongruent pairs were slightly slower than congruent (by 7 ms), but this difference was not statistically significant in post hoc comparisons (see Supplementary Materials Figure 2).
The fit of the logistic LLMM on response error for the texture trials was poor, R2 = .06, p < .00001. Rougher targets generated slightly lower error rates than Smoother (13.6% vs. 14%), Wald χ2(1) = 4.65, p = .03, but no other significant associations were found. Likewise, the brightness LLMM (R2 = .07, p < .00001) produced null results.
In sum, these results indicate that in conditions of limited sensory information with a response deadline, response speed in the visual identification task was sensitive to the presence of a task-irrelevant sound, although not to whether the auditory timbre and the visual brightness/texture were congruently or incongruently related.
To next assess the effects of the four auditory primes on visual judgments in the Same-target condition, response distributions were examined using chi-squared goodness of fit tests (Bonferroni adjusted α; .05/2 = .025 per block). Here we wanted to know whether the primes significantly influenced visual responses in the deceptive condition in directions that were either assimilative (e.g., “bright” sounds increasing the identification of the Same target as brighter) or, perhaps, contrastive (e.g., “bright” sounds increasing the identification of the Same target as darker), given the expected probabilities determined with the sound-off Same-target control (see Supplementary Materials at mp.ucpress.edu).
In the brightness judgments, responses associated with the “bright” auditory prime did not differ from expected, χ2(1) = 0.31, p = .57, nor did responses associated with the “dark” prime, χ2(1) = 2.88, p = .09; similarly, in the texture judgments, “smooth” primes yielded responses like those with the sound-off control, χ2(1) = 0.11, p = .74. However, the “rough” prime yielded 7% more rougher responses than expected, χ2(1) = 11.01, p = .001.
To next explore whether the effect of the four primes differed from one another (as opposed to chance) while also accounting for variability across participants, we used LLMMs to model responses as the outcome given the auditory prime as predictor. Participants and musical sophistication were again modeled as random effects. The visual brightness LLMM showed no main fixed effects, ps > .05. However, the texture model revealed a significant main fixed effect of prime, Wald χ2(1) = 10.69, p = .001: consistent with the chi-squared test, then, results indicated that “rough” primes were significantly more likely than “smooth” primes to induce rougher responses. Together, these findings indicate that, with a 500-ms deadline for responding, the “rough” prime crossmodally shifted participants’ visual texture identifications in an assimilative way.
Finally, LMM analyses on RTs were carried out to determine whether either timbre prime differentially slowed or speeded the responses in the Same-target condition. Both models (brightness and texture) produced nonsignificant results, indicating that responses made within the 500-ms deadline did not differ statistically across priming conditions (M RT = 358 ms).
Speed-Accuracy Tradeoffs
The data from Experiment 2 revealed an overall speed-accuracy tradeoff, r(55) = –.62, p < .00001, 95% CI [–.76, –.43], leading us to investigate further how the speed-accuracy tradeoff functions (SATFs) might shed light on competing explanations for the crossmodal correspondences observed in Experiment 1. In particular, examining the relation between slope and intercept of the speed-accuracy relations in different stimulus conditions may help decide between sensory and decisional accounts of crossmodal facilitation (Arieh & Marks, 2008; Lappin & Disch, 1972). If adding an irrelevant stimulus in a different sensory modality affects only the response criterion (e.g., affects how much sensory information must accumulate before the response is made), then measures of RT and accuracy obtained with and without the irrelevant stimulus, or with different irrelevant stimuli, should fall along a single speed-accuracy function. A change in the speed-accuracy function induced by an irrelevant auditory stimulus, however, would indicate a sensory effect, be it enhancement or suppression. Thus, Experiment 2 asks, does the relation between RT and accuracy make it possible to determine whether the timbral quality of an irrelevant sound affects visual response versus response criterion?
Adapting the method of Arieh and Marks (2008) for this final analysis, we investigated the tradeoff between speed and accuracy of the responses in three different conditions per task: prime vs. no prime (sound-on vs. sound-off control), congruent vs. incongruent prime, and Same-target prime vs. all others. First, for each condition, we computed the median RT for each participant across repeated trials that met the 500-ms deadline. Next, we measured each participant’s accuracy using the sensitivity index d’ of signal detection theory (Green & Swets, 1966); we use d’ here because, compared to percent correct, d’ tends to be more linearly related to the sensory magnitudes that underlie discrimination. Calculations of d’ were corrected for extreme values (e.g., 0% false alarms or 100% hit rates) following the adjustment described by Hautus (1995). For each set of brightness and texture trials, we fit a linear mixed effects model regressing sensitivity (d’) on median RT, with participants modeled as random effects, as shown in Figure 4 (texture version only; brightness plots, analysis details, and R code can be found in the Supplementary Materials at mp.ucpress.edu).
Figure 4. Sensitivity (d’) versus median reaction time (SATF) for texture identification. Note: Nomenclature for plot titles is visual target-auditory prime; e.g., Smoother-rough refers to incongruent trials in which a Smoother target is presented with a “rough” sound. Shaded areas: standard error.
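Concretely, constructing one SATF point per participant and condition might look like the sketch below. The `trials2` data frame and its hit/miss/false-alarm/correct-rejection columns are illustrative assumptions, not the study's published code.

```r
library(dplyr)
library(lme4)

# d' with the log-linear correction of Hautus (1995): add 0.5 to each cell
# so that 0% and 100% rates remain computable
dprime_ll <- function(hit, miss, fa, cr) {
  H <- (hit + 0.5) / (hit + miss + 1)  # corrected hit rate
  F <- (fa  + 0.5) / (fa  + cr  + 1)   # corrected false-alarm rate
  qnorm(H) - qnorm(F)
}

# One SATF point per participant and condition: median RT and d'
satf <- trials2 %>%                    # trials2: hypothetical Experiment 2 data
  filter(rt <= 500) %>%                # deadline-compliant responses only
  group_by(participant, condition) %>%
  summarise(rt_med = median(rt),
            d      = dprime_ll(sum(hit), sum(miss), sum(fa), sum(cr)),
            .groups = "drop")

# SATF: sensitivity regressed on median RT, participants as random effects
satf_mod <- lmer(d ~ rt_med + (1 | participant), data = satf)
summary(satf_mod)
```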
The focus of the present analysis is not on the slope and the intercept of each condition separately, but rather on how average slope and intercept covary across conditions. For both brightness and texture, then, we conducted three statistical tests asking: (1) whether the average slope and intercept for the six conditions containing the auditory prime differ from the corresponding values when there was no prime; (2) among the conditions containing the prime, whether the average slope and intercept for the two congruent conditions (e.g., “bright” timbre paired with Brighter target) differ from the corresponding values in the two incongruent conditions (“bright” timbre paired with Darker target); and (3) whether the average slope and average intercept for the Same-target priming conditions differ from the corresponding values in the other conditions. Table 4 summarizes these results for d’ and RT (together with analogous results for response bias, as discussed below), as implemented in R with the lmerTest package (Kuznetsova, Brockhoff, & Christensen, 2017).
Table 4. Estimates of the Differences in the Average Slope and Average Intercept for Sensitivity (d’) and Response Bias (c)

| Task block | Comparison | d’ intercept | d’ slope | c intercept | c slope |
| --- | --- | --- | --- | --- | --- |
| Brightness | (1) Sound-on vs. control (sound-off) | 0.73 (0.45) | 0.001 (0.001) | –1.71 (0.44)* | 0.003 (0.001)** |
| Brightness | (2) Congruent vs. incongruent | –0.32 (0.52) | 0.001 (0.001) | 0.05 (0.52) | 0 (0.001) |
| Brightness | (3) Same-target vs. other conditions | –0.5 (0.44) | –0.007 (0.001)** | –1.93 (0.43)** | –0.002 (0.001) |
| Texture | (1) Sound-on vs. control (sound-off) | –0.99 (0.43)* | –0.001 (0.001) | 2.51 (0.43)** | –0.01 (0.001)** |
| Texture | (2) Congruent vs. incongruent | 0.18 (0.53) | 0 (0.001) | 0.21 (0.53) | 0.001 (0.001) |
| Texture | (3) Same-target vs. other conditions | –0.93 (0.41)* | 0.008 (0.001)** | –0.93 (0.41)* | 0.001 (0.001) |

Note: * p < .05, ** p < .01. Standard errors of model estimates are shown in parentheses.
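As an illustration, comparison (1) can be coded as an interaction contrast in lmerTest, where the intercept and slope differences in Table 4 correspond to the main effect of, and the interaction with, the sound-on contrast. This is a sketch with assumed variable names, continuing the hypothetical `satf` frame above.

```r
library(lmerTest)  # lmer() with Satterthwaite tests for fixed effects

# Contrast sound-on conditions against the sound-off control
satf$sound_on <- satf$condition != "control"

cmp <- lmer(d ~ rt_med * sound_on + (1 | participant), data = satf)
summary(cmp)
# 'sound_onTRUE'        row: difference in average intercept
# 'rt_med:sound_onTRUE' row: difference in average slope
```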
Results show that in the brightness task the average slope varied significantly in the Same-target conditions versus the other conditions, slope being lower in the priming conditions compared to all other conditions. Similarly, in the texture task, both average slope and average intercept varied significantly in the Same-target conditions versus the others. This result is not surprising, however, because there is no objectively “correct” response on Same-target trials, probably a source of considerable perceptual confusion. In the texture task, moreover, the average intercept in the sound-on trials was significantly lower than the sound-off control. All other intercepts and slopes between sound-on and sound-off control conditions, and between congruent and incongruent matches, were substantially similar for brightness and texture tasks. In sum, for brightness, these outcomes suggest that the presence of task-irrelevant sounds likely produced a lowering of the response criterion without changing the rate of information accrual, that is, without a change in sensory responses. For visual texture, however, the change in intercept between sound-on and sound-off suggests a shift in visual response in the presence of task-irrelevant tones (independent of crossmodal congruency).
Finally, as mentioned earlier, to supplement our analysis of SATFs, we also calculated a standard measure of response bias. Bias refers to a participant’s tendency to use one response option more than the other, given constant sensory information. To measure bias and thereby supplement the measures of sensitivity, we calculated values of c (criterion; see Macmillan & Creelman, 1990), which we could regress against corresponding values of median RT, just as we did for d’. For any given data point (hit rate, H, and false-alarm rate, F), the value of d’ equals z(H) – z(F), while the value of c equals –0.5[z(H) + z(F)]. Thus, for example, increasing both H and F by one standard deviation (1 unit of z in each) leaves d’ unchanged, reflecting the constancy of sensitivity, but decreases c by 1 unit, reflecting the downward shift in response criterion, that is, the result of a change in response bias.
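A small numerical example (rates invented for illustration) makes the relation concrete:

```r
z <- qnorm

H1 <- 0.80; F1 <- 0.20                 # hit and false-alarm rates
z(H1) - z(F1)                          # d' = 1.68
-0.5 * (z(H1) + z(F1))                 # c  = 0 (neutral criterion)

# Raise both rates by one z-unit: sensitivity is unchanged, criterion drops
H2 <- pnorm(z(H1) + 1); F2 <- pnorm(z(F1) + 1)
z(H2) - z(F2)                          # d' = 1.68, as before
-0.5 * (z(H2) + z(F2))                 # c  = -1 (more liberal responding)
```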
In principle, the way that average intercept and/or slope of the function relating c to RT varies across the experimental conditions may also help distinguish between decisional and sensory effects on performance. As shown in Table 4, across conditions, average intercept and average slope of the function relating c to RT varied significantly between sound-on trials and sound-off control trials. The Same-target condition had a significantly lower intercept than all others in both tasks. As with the relation of d’ to RT, that is, congruent and incongruent matches produced similar patterns of response bias.
Discussion
Two experiments asked whether the timbre of a musical note (here, an acoustic prime) affects the subsequent visual perception of, in the first experiment, brightness (dark-bright dimension) and, in the second experiment, both brightness and spatial texture (smooth-rough dimension), akin to the ways that other attributes of auditory stimuli, such as pitch and loudness of tones, affect responses to visual stimuli. Experiment 1 sought to provide evidence of auditory-visual (crossmodal) interactions using a conventional paradigm of speeded identification, whereas Experiment 2 used a variant of that procedure to make it possible to analyze the tradeoff between speed (RT) and accuracy (errors) in the responses. Our main questions were: Do the auditory primes affect responses to visual stimuli? And, if so, do the primes do this by modifying sensory responses (sensory hypothesis) or decisional processes (decisional hypothesis)?
To the best of our knowledge, our results provide the first tentative evidence that perceptual judgments in the visual domain may be systematically influenced by timbre processing in a manner consistent with crossmodal semantic associations. First, results of Experiment 1 suggest that when people are primed with timbres that are crossmodally incongruent with the direction of brightness shift between two gray squares—e.g., a “bright” musical sound presented immediately before seeing a darker gray target square—response accuracy in a visual choice task diminishes significantly (errors increase from 6% to 14%). This result is consistent with other findings implicating crossmodal assimilation or interference in timbre semantics using the same auditory stimuli, but with a much larger effect: whereas incongruent pairs of timbres and adjectives increased error by only 1–2% (Wallmark, 2019a), incongruent pairs of timbres and levels of visual brightness increased error by 8% (present study).
Interestingly, choice RT was not significantly affected by degree of congruency between timbres and visual brightness; in fact, responses to incongruent combinations were slightly (albeit not significantly) faster than responses to congruent combinations, which is surprising given that mismatches of this kind in other domains tend to increase both RT and errors. Relatedly, Wallmark (2019a) reported effects of crossmodal congruency on error rates but not RT in a speeded adjective identification task using the same auditory stimuli. Our data do not offer a plain explanation for this pattern: further research is clearly needed to investigate why semantic congruency is associated with accuracy but not processing speed. Additionally, in Experiment 1, Brighter visual targets and “bright” auditory primes produced significantly faster correct responses than Darker targets and “dark” primes, irrespective of crossmodal congruency. We theorize that faster RTs in response to brighter stimuli (in both modalities) were the result of differences in intensity/arousal—luminance for visual targets (Carreiro, Haddad, & Baldo, 2011) and spectral center of gravity for auditory primes (Caclin, Giard, Smith, & McAdams, 2007; Wallmark, 2019a)—though future studies using a broader gradient of luminance levels for the targets and, perhaps, fully controlled synthetic tones are required to confirm this intuition.
Consistent with the audio-visual literature, Experiment 2 found that presenting task-irrelevant tones (crossmodally congruent and incongruent) sped visual responding relative to a sound-off control (e.g., Bernstein & Edelstein, 1971; Nickerson, 1973). However, there were no effects of congruency on accuracy or RT. This discrepancy between results in the two experiments may be a result of the deadline procedure in Experiment 2. Note that Experiment 1 did not use a deadline, and most responses in Experiment 1 fell between 500 and 1000 ms: it is possible that this extra time was important in producing the crossmodal interaction.
In addition to examining the degree of crossmodal congruency between timbre and both visual brightness and spatial texture, our study investigated whether timbre would produce an assimilative bias in visual responses consistent with semantic connotations. To do this, we used a forced-choice deception paradigm in which participants were told all visual target stimuli differed from baseline even though some of the targets did not (Bhattacharya & Lindsen, 2016; Meier et al., 2007); we then compared frequency distributions of the primed responses. In Experiment 1, “bright” timbre primes led to 4% more frequent brighter responses and “dark” primes to 2% more frequent darker responses, although the nominal statistical significance of these differences did not survive linear mixed-effects models incorporating other sources of variability. More definitively, in Experiment 2 we found that “rough” timbre primes shifted responses to the identical (Same) target by 7% in the rougher direction compared to the expected probabilities; “bright,” “dark,” and “smooth” primes, however, did not bias responses. Taken together, these results indicate that in certain conditions timbre may interact with visual processing in an assimilative manner that could be predicted from conventional semantic associations. That is, musical timbres that people are likely to label “rough” may increase judgments of roughness of the visual percepts.
Overall, the present results suggest that crossmodal associations in timbre perception may be processed automatically (for discussion, see Spence & Deroy, 2013). According to Moors and De Houwer (2006), crossmodal correspondences are automatic if they are goal-independent (i.e., not susceptible to cognitive control), nonconscious, and fast, attributes that characterize our data. Additionally, individual differences in musical sophistication did not significantly affect reaction speed or accuracy, which suggests that these associations are independent of exposure to musical discourse. We should not rule out, however, the possibility that relative consistency in everyday (i.e., nonspecialized) descriptors of sound timbre may influence post-perceptual responses to visual stimuli. In their semantic coding hypothesis, Martino and Marks (1999, p. 747) theorize that failures of selective attention (as exhibited here in congruency effects on error rates in Experiment 1) and, possibly, assimilative biasing effects on visual discrimination (effects of semantically congruent tones on identification of Same visual targets) “may reflect post-sensory (semantic) mechanisms created as a result of sensory and linguistic experience…congruency effects reflect the coding or recoding of perceptual information into a common, abstract semantic representation that captures the synesthetic relation between dimensions.”
It is also possible that these audio-visual interactions reflect emotional mediation, or the matching of valence and arousal dimensions across modalities (for review, see Spence, 2020). Affective dimensions of music can modify the evaluation of images (Logeswaran & Bhattacharya, 2009) and colors (Palmer, Schloss, Xu, & Prado-León, 2013); the expression of basic emotions (“happy” music, happy face)—or perhaps their valence, arousal, and potency dimensions (Osgood, Suci, & Tannenbaum, 1957)—thus appears, in certain instances, to mediate crossmodal correspondences (though for conflicting evidence see Marin, Gingras, & Bhattacharya, 2012; Wallmark & Allen, 2020; Whiteford, Schloss, Helwig, & Palmer, 2018). Note that complex musical stimuli are affectively richer than the isolated tones used here, so the applicability of these other findings to the present investigation may be somewhat limited. Weinreich and Gollwitzer (2016; Experiment 1) showed that brief pleasant/aversive sounds biased participants’ affective responses to unfamiliar Chinese ideographs, so perhaps the interactions documented in our study reflect a similar transfer of valence. However, the emotion mediation account is far from settled (Spence, 2020), and further research is needed to determine what role it might play in timbre perception and semantics.
Another important question for future research is the degree to which low-level crossmodal phenomena (here, effects of timbre on visual brightness/roughness) inform higher-level cognition of audio-visual stimuli (i.e., actual music and complex visual information). In this regard, the interaction of musical timbre processing and visual perception is perhaps most clear and ubiquitous in multimedia. In film, for instance, formal and semantic congruencies between soundtrack and diegetic images and action are focal to how audiences make sense of a narrative (e.g., Cohen, 2013; Iwamiya, 2013; Lipscomb & Kendall, 1994). Paradigms examining perceptual influences of timbre in film music could thus extend the current results in rich and ecologically relevant ways. A related question, moreover, concerns metaphorical projections of crossmodal concepts between domains (e.g., Marks et al., 1987; Winter, 2019). Consider: Would the “dark” deeds of a villain be perceived as darker if accompanied by a “dark” timbral palette, and perhaps less so if set to “bright” sounds? Would a protagonist’s “rough” day be facilitatively underscored by “rough” sounding instruments? Crucial to empirically probing these more speculative questions is a fuller understanding of the mechanism(s) responsible for producing these interactions, specifically, whether timbral qualities directly modify how we see (sensory hypothesis) or how we make decisions based on what we see (decisional hypothesis).
Explanatory Mechanisms
Experiment 2 aimed to determine whether the patterns of the responses to the visual stimuli, considering both accuracy and response time, were consistent with the sensory hypothesis or with the criterial hypothesis (see Arieh & Marks, 2008). The sensory hypothesis says that timbre can modify the underlying visual percepts, e.g., by contributing positively or negatively to overall brightness/roughness. For clarity of exposition, consider the four curves in Figure 5, panel A. The solid curves on the left and right show, respectively, theoretical distributions of perceptual responses to Darker and Brighter visual stimuli in a neutral, baseline condition (e.g., no sound). The scale along the x-axis represents the magnitude of the internal brightness response, and the solid vertical line represents the location of the response criterion along the brightness dimension. Under these conditions, we assume that the location of the criterion indicates no bias in the responses. Given the addition of an auditory prime, however—in this example, a “bright” tone—the sensory hypothesis states that the (“bright”) prime will add a fixed amount of brightness to the sensory effects of both the Brighter and the Darker visual target stimuli, thereby shifting both distributions by equal amounts to the right (as shown by dotted curves in panel A) and consequently increasing the proportions of brighter responses to both visual stimuli, but without changing sensitivity (d’). Note that we are also assuming here that the location of the criterion in panel A does not change, that is, that the prime affects perceptual representations, but not criterion.
Figure 5. A signal-detection model for how the timbre of a sound could affect visual identification by modifying sensory responses and/or decisional criteria. Panels A and B: In panel A, the distributions of sensory responses to the two visual stimuli are shifted toward greater values in the presence of a sound (dashed lines) relative to no sound (solid lines), while the location of the single response criterion (vertical line) remains constant. In panel B, the distributions of sensory responses to the visual stimuli remain fixed, while the addition of the sound affects only the location of the single response criterion. Panel C: In panel C, which models speeded identification, the distributions of responses to the two visual stimuli increasingly separate over time after onset of stimulation. As the separation increases, the two response criteria needed in a speeded-identification task can converge while maintaining high levels of accuracy.
Alternatively, panel B shows the predictions of a second hypothesis, namely, that auditory stimuli affect only the location of the criterion for responding. When the prime is “bright,” the criterion would move down (leftward along the brightness axis), leading to a greater proportion of brighter responses. The overall predicted result is thus the same as under the sensory hypothesis: increases in the proportions of brighter responses to both visual stimuli, although now as a result of a shift in criterion. For a conceptually similar analysis in a different sensory system, pain, see Rollman (1977, Figure 2).
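To make the equivalence of these predictions concrete, the following minimal sketch (with illustrative parameters of our own choosing, not values derived from the present data) computes the predicted response proportions under each model; note that an equal sensory shift (panel A) and a corresponding criterion shift (panel B) yield identical hit and false-alarm rates, leaving d’ unchanged:

```python
# Minimal sketch of the two models in panels A and B (illustrative parameters,
# not fitted to data). Under an equal sensory shift (panel A) or a criterion
# shift (panel B), the proportion of "brighter" responses rises for BOTH
# targets, and d' = z(H) - z(F) is identical in every case.
import numpy as np
from scipy.stats import norm

mu_dark, mu_bright, criterion = -0.5, 0.5, 0.0  # baseline (no sound)

def predict(mu_d, mu_b, crit):
    """P("brighter") for each target, plus d' and c, for unit-variance Gaussians."""
    h = 1 - norm.cdf(crit, loc=mu_b)   # hit rate (Brighter target)
    f = 1 - norm.cdf(crit, loc=mu_d)   # false-alarm rate (Darker target)
    dprime = norm.ppf(h) - norm.ppf(f)
    c = -(norm.ppf(h) + norm.ppf(f)) / 2
    return h, f, dprime, c

shift = 0.3  # brightness added by a "bright" prime (arbitrary)
print(predict(mu_dark, mu_bright, criterion))                  # baseline
print(predict(mu_dark + shift, mu_bright + shift, criterion))  # panel A: sensory shift
print(predict(mu_dark, mu_bright, criterion - shift))          # panel B: criterion shift
```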
The graphical representations in panels A and B of Figure 5 do not indicate, however, how the sensory distributions may change over time after the onset of the visual stimulus, or how the locations of the response criterion/criteria may change over time up to the point that participants initiate their responses (the criteria may start to change following the presentation of the auditory prime). That sensory information itself can increase over time is suggested by several kinds of evidence, such as the presence of temporal integration, where performance improves with increasing stimulus duration, and, of course, by the speed-accuracy tradeoff function (SATF) itself, which shows that rapid responding is less accurate than slower responding (Ben-Artzi & Marks, 1995a; Nickerson, 1973; Posner et al., 1976). Consider a choice RT task in which participants make one of two responses depending on whether the stimulus is, say, dark gray or light gray; in this case, the SATF reflects the accrual of more accurate information with greater delay in responding (or, conversely, greater proportions of errors when responses are made too quickly). Note that in unspeeded identification tasks (e.g., Experiment 1), only one response criterion is needed, as the response is determined on each trial by whether the sensory information falls above or below that criterion. In speeded identification tasks, especially those using a deadline (e.g., Experiment 2), however, the perceiver has, at each moment in time until responding, three possible options: make one identification response, make the other identification response, or wait until more information has accumulated.
To model these three options in speeded identification, it is plausible to assume that participants implicitly use two response criteria, as shown in the top, middle, and bottom graphs in panel C. These graphs show how, over time (top to bottom), the theoretical distributions of responses to Darker and Brighter targets become increasingly discriminable (overlap less and less). The two vertical lines represent criteria for responding darker or brighter: observations falling below the lower criterion evoke responses of darker and observations falling above the higher criterion evoke responses of brighter, while, importantly, observations falling between the two criteria lead the perceiver to wait longer—essentially, until an observation lies either below the low or above the high criterion. As target stimuli grow more distinguishable over time (i.e., as the two distributions increasingly diverge), the proportions of correct identifications increase even though, as we assume in the present example, the criteria themselves are likely to converge.
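The following minimal sketch (again with illustrative, arbitrary parameters) simulates this two-criterion account: as the target distributions separate and the criteria converge, the probability of a correct identification rises with response time, tracing out an SATF:

```python
# Minimal sketch of panel C (illustrative parameters, not fitted to data):
# the target distributions separate over time while the two criteria converge,
# so the probability of a correct identification rises with response time.
import numpy as np
from scipy.stats import norm

for t in [0.2, 0.5, 1.0]:                # time since visual onset (s)
    sep = 1.5 * t                        # half-separation of the two means grows
    crit = max(0.0, 0.8 - 0.6 * t)       # criteria at +/-crit converge toward 0
    # For a Brighter target (mean +sep, unit variance):
    p_correct = 1 - norm.cdf(crit, loc=sep)    # falls above the high criterion
    p_error = norm.cdf(-crit, loc=sep)         # falls below the low criterion
    p_wait = 1 - p_correct - p_error           # between criteria: keep waiting
    print(f"t={t:.1f}s  P(brighter)={p_correct:.2f}  "
          f"P(darker)={p_error:.2f}  P(wait)={p_wait:.2f}")
```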
How, then, might the presence of a “bright” or “dark” auditory prime modify the pattern of responses to the visual stimulus? One possibility is by making the visual stimuli more discriminable at any moment in time, for example, by increasing the brightness of the Brighter target and/or by decreasing the brightness (increasing the darkness) of the Darker target, thereby increasing the value of d’. The overall effect of these changes would be an increase in discriminability at a given RT, hence a change in the speed-accuracy relation. Another possibility, however, would be for the prime to increase (or decrease) by equal amounts the brightness of both the Brighter and Darker targets. In this case, despite the sensory changes, sensitivity (d’) would not change at a given RT. That is, given equal effects on the brightness of both visual targets, the speed-accuracy function would remain unchanged—much as the SATF would be unaffected if the auditory primes affected only decisional processes.
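A brief numeric illustration of the contrast between these two sensory scenarios, using arbitrary unit-variance Gaussian response distributions:

```python
# Illustration (arbitrary values, unit-variance Gaussians): a prime that pushes
# the two target means apart raises d' at a given moment, whereas an equal
# shift of both means leaves d' unchanged.
from scipy.stats import norm

def dprime(mu_dark, mu_bright, crit=0.0):
    hit = 1 - norm.cdf(crit, loc=mu_bright)  # "brighter" to Brighter target
    fa = 1 - norm.cdf(crit, loc=mu_dark)     # "brighter" to Darker target
    return norm.ppf(hit) - norm.ppf(fa)

print(dprime(-0.5, 0.5))   # baseline: d' = 1.0
print(dprime(-0.8, 0.8))   # means pushed apart: d' = 1.6 (SATF shifts)
print(dprime(-0.2, 0.8))   # both shifted by +0.3: d' = 1.0 (SATF unchanged)
```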
In sum, findings consistent with the sensory hypothesis (a shift in the SATF) would be inconsistent with the decisional, criterion-shift model. The results of Experiment 2 suggest that the presence of task-irrelevant sounds may shift sensitivity to visual texture, but not sensitivity to brightness. This contrast may reflect differences in the perceptual salience of the visual and auditory stimuli in the brightness and texture tasks. Additionally, both intercepts and slopes of response bias c differed significantly between the sound-on and sound-off (control) conditions in both tasks (brightness and texture): bias decreased as a function of RT when no sound was present but not when auditory primes were present. Taken together, these results indicate that tones may have affected both sensory response (to visual texture stimuli) and bias (to texture and brightness). Patterns of d’ and c also differed significantly in the Same-target condition compared to the other conditions, although this was expected given the perceptual ambiguity of that condition.
SATFs with crossmodally congruent vs. incongruent visual-sound pairs did not differ significantly, suggesting that crossmodal correspondences involving timbre may be governed by a shift in decisional criterion. However, as sketched above, findings consistent with the criterion-shift model (all data points falling along a single SATF) could also be consistent with one version of the sensory hypothesis. This interpretation has been explored in analogous contexts of auditory classification (Ben-Artzi & Marks, 1995b) and verbal priming in flavor perception (Brewer et al., 2013). Therefore, although the results of Experiment 2 comparing crossmodal congruency support the decisional hypothesis, we cannot conclusively rule out the sensory hypothesis. This difficulty in interpretation is compounded by the noisiness of the SATFs, likely an effect of perceptual ambiguity in the visual choice task. Future research using simpler visual identification tasks (e.g., black/white squares instead of subtly varying shades of gray) may help clarify the mechanisms governing crossmodal interactions involving timbre.
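As an illustration of how such a comparison can be implemented, the following sketch (hypothetical values, not our data) fits a standard exponential-approach SATF, d’(t) = λ(1 − e^(−β(t − δ))), to binned d’ estimates for one condition; closely overlapping fitted curves across congruent and incongruent conditions would favor the criterion-shift account:

```python
# Minimal sketch (hypothetical data; not the authors' analysis): fitting an
# exponential-approach SATF, d'(t) = lam * (1 - exp(-beta * (t - delta))),
# with asymptote lam, rate beta, and intercept delta, to one condition.
import numpy as np
from scipy.optimize import curve_fit

def satf(t, lam, beta, delta):
    # Clip at zero so d' is 0 before the intercept delta.
    return lam * (1 - np.exp(-beta * np.clip(t - delta, 0, None)))

rt_bins = np.array([0.3, 0.45, 0.6, 0.8, 1.0])     # mean RT per bin (s)
dprime_bins = np.array([0.2, 0.7, 1.0, 1.3, 1.4])  # hypothetical d' per bin
params, _ = curve_fit(satf, rt_bins, dprime_bins, p0=[1.5, 3.0, 0.2])
print(dict(zip(["lam", "beta", "delta"], params)))
```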
Conclusion
The present study sought to extend our understanding of timbre semantics, perception, and multisensory interactions through two experiments, both of which used variants of speeded identification. We investigated whether timbre primes would facilitate or interfere with visual responses in brightness and spatial texture tasks when the timbre was semantically congruent or incongruent with the visual targets. Without a response deadline and using pre-validated natural instrument and synthesized tones as stimuli (Experiment 1), we found evidence for audio-visual interaction: incongruent audio-visual pairs produced 8% higher error rates than congruent pairs, suggesting that timbral “brightness” adds to visual brightness (assimilation). Although “bright” tones and Brighter visual targets produced significantly faster correct responses than “dark” tones and Darker targets, no effect of crossmodal congruency on choice reaction time was detected. Additionally, modest evidence was found that timbres increase response bias in a semantically congruent way when participants identify visual stimuli (e.g., when a “rough” sawtooth wave accompanies the second of two identical spatial textures, the “rough” sound increases the probability of judging the second texture as rougher). Together, these convergent results imply that the words we conventionally use to describe qualities of musical sound may partake in multisensory processing. Indeed, if the two systems (auditory and visual) were unrelated, or if crossmodal semantic conventions were merely symbolic, there would be little theoretical basis (aside, perhaps, from emotional mediation) for interactions of this kind. Perhaps using the words “bright” or “rough” to talk about a quality of musical timbre, then, is not just an arbitrary linguistic convention, but something a bit more “literal.” Perhaps we see brightness and spatial texture not just through the eyes, but also (albeit faintly) through timbrally attuned ears.
Next, to explore the mechanism by which timbre affects visual perception, we compared speed-accuracy tradeoff functions (SATFs) to determine whether task-irrelevant timbres shift the visual sensory response or the decisional criterion (e.g., Arieh & Marks, 2008). The presence of auditory primes shifted sensory response (texture task) and response bias (brightness and texture tasks) compared to the sound-off control condition. However, results comparing crossmodal congruency were largely consistent with the decisional hypothesis, namely, that sounds shifted the location of the response criterion without changing sensory sensitivity (evident in similar SATFs between incongruent and congruent conditions), although at least one alternative account could accommodate our results under a sensory model (see also Brewer et al., 2013).
Our novel findings broaden understanding of the cognitive linguistics of music and the multisensory interactions that can accompany musical experience. Adjectives for timbre are often considered imprecise and uninformative; as Van Elferen (2017, p. 615) explains, “the fact that timbres are hollow, or warm, or bronze does not tell us very much at all beyond a subjective assessment.” Our results suggest otherwise: conventional semantic associations are generalizable beyond individual idiosyncratic differences, and in exhibiting interactions with other sensory domains, may loosely reflect a design feature of human perception and cognition (see Evans & Green, 2006). Musicologist Simon Frith (1998, p. 263) notes that “music is, in fact, an adjectival experience.” Our results lend empirical credence to this claim.
Author Note
We wish to thank Associate Editor Peter Pfordresher and three anonymous reviewers for their insights and guidance with the manuscript; the ACTOR Project, for facilitating numerous workshops and panels at which earlier versions of this project were presented; Benjamin Tabak for assistance with participant recruitment; and research assistants at the SMU MuSci Lab (Jay Appaji, Grace Kuang, Lea Hobbie, Jessie Henderson, and Camille Van Dorpe) for invaluable help with data gathering.