Aesthetic perception involves inputs from various senses, and the aesthetic perception of one sensory modality can be affected by information from another sensory modality. This phenomenon can be explained by crossmodal associations, with semantic and spatial correspondence playing important roles. However, the effects of both semantic and spatial congruency on aesthetic perception remained unclear. Here, a pre-registered experiment was conducted to investigate the influences of semantically and spatially congruent sounds on aesthetic perception of the pictures presented simultaneously and whether these influences were mediated by processing fluency. Participants were asked to evaluate the liking, aesthetic value, and processing fluency of pictures presented simultaneously with sounds under four conditions (spatial congruency/incongruency and semantic congruency/incongruency). The results indicated that the semantically congruent sounds increased the processing fluency, thereby leading to higher liking and aesthetic evaluation of pictures. However, a significant effect of spatial congruency was not observed. The null finding may suggest that the effect of spatial congruency was likely too small to be practically meaningful in aesthetic perception with unlimited response time, which might have caused participants to respond after the influence of spatial congruency had gradually faded. Another possible explanation is that, in comparison to spatial cues, the manipulation of semantic congruency better aligns with individuals’ learned expectations. Therefore, semantic cues that are more in line with expectations may be more easily observed to influence subjective evaluations.
Introduction
People live in a multisensory world. When perceiving the environment, events provide information to multiple sensory modalities simultaneously (Miyamoto et al., 2023; Sun & Sekuler, 2021; Wallace et al., 2020; Williams et al., 2022). Art is created, perceived, and appreciated by the human brain. Aesthetic perception also often involves multiple senses, and the aesthetic perception of one sense can be influenced by information from other senses, particularly the effects of auditory stimuli on visual stimuli (Albertazzi et al., 2015; Föcker et al., 2022; Lindborg & Friberg, 2015). One aspect of art, known as “visual music”, focuses on how visual arts can be influenced by music and how music and visual art can be combined (Brougher et al., 2005). Previous studies have demonstrated that impressionist music can make paintings appear less dynamic but more aesthetically pleasing (Limbert & Polzella, 1998). Additionally, Koning and van Lier (2013) reported that jazz can increase people’s liking for abstract painting, while classical music can increase people’s liking for figurative painting.
One way to explain how different modalities influence each other’s aesthetic perception is understanding crossmodal associations, a widespread phenomenon that refers to the associations between stimuli from different modalities (Escobar et al., 2023; Iosifyan et al., 2022; Q. J. Wang & Spence, 2015). For example, Albertazzi et al. (2015) found that participants could form strong associations between the features of music and pictures regardless of their background or expertise. There are different mechanisms underlying crossmodal associations, such as semantic and spatial correspondence.
Semantic congruency, which refers to the coherence between visual and auditory stimuli pairs in terms of meaning, plays a crucial role in multisensory perception. The majority of studies have investigated the influence of semantic congruency by presenting matched or mismatched pictures and sounds (e.g., pictures and sounds of the same animal) (Hein et al., 2007; Roberts et al., 2024; Williams & Störmer, 2024). Previous studies have shown that the perception of visual information can be influenced by the semantic congruence of auditory information (Chen & Spence, 2010). This effect may be due to the fact that semantically congruent audiovisual stimuli are more likely to facilitate multisensory integration, whereas semantically incongruent audiovisual stimuli are less likely to achieve such integration (Q. Li et al., 2020).
In visual search tasks, visual targets accompanied by semantically congruent sounds have been shown to be more easily identified than targets accompanied by semantically incongruent sounds (Iordanescu et al., 2008). Additionally, Kvasova et al. (2024) indicated that semantically congruent sounds effectively guide attention towards corresponding visual stimuli in real-world scenes, whereas incongruent sounds weaken this effect. The Unity Assumption refers to an observer’s assumption, or belief, that two or more unisensory cues belong together (Jertberg et al., 2024; Spence, 2007). Chen and Spence (2017) suggested that some findings on semantic congruency can be explained by the Unity Assumption. They further proposed four factors that contribute to the Unity Assumption: experimenter instructions, redundant information, crossmodal correspondence, and semantic congruency.
Redundant information is another important factor that influences multisensory perception. Stimuli from different senses occasionally convey information about identical characteristics, such as spatial and temporal features, leading to redundant information. Among these, spatial congruency, defined as the alignment of stimuli from different modalities in space, is considered a key mechanism underlying crossmodal associations. It has been demonstrated to enhance multisensory integration and influence the perception of stimuli (Dufour, 1999; Y. Li et al., 2021). A study by Fleming et al. (2020) explored the effect of audiovisual spatial congruency in the presence of competing stimuli. The results revealed that spatial congruency between auditory and visual stimuli improved multisensory integration, leading to faster reaction times and enhanced neural responses. This suggests that spatial congruency plays a key role in multisensory perception, particularly in complex environments.
However, few studies have simultaneously explored the effects of semantic and spatial congruency. It also remains unclear how semantic and spatial congruency between sounds and pictures influence aesthetic perception. Previous studies have demonstrated that crossmodal associations are related to the perceived value of stimuli (Koning & van Lier, 2013; Q. J. Wang et al., 2019). For example, Crisinel et al. (2012) showed that when the auditory pitch of background music matched the taste of food, participants’ evaluations of the food were significantly improved. This suggests that semantically and spatially congruent sounds may also form crossmodal associations with the simultaneously presented pictures, enhancing their aesthetic perception and evaluation.
The effects of congruency are likely related to processing fluency. Processing fluency is defined as the ease with which individuals process the perceptual features or meaning of a target object, where the ease of processing a stimulus’s surface features specifically is called perceptual fluency (Reber, 2012). Research has shown that when the shape of a food item matches the typeface of its name (i.e., congruent), consumers would have more favorable evaluations and higher purchase intentions (S. Li et al., 2020). This effect occurs because congruence between food shape and name typeface makes the food easier to process, thus promoting positive feelings toward the product. Furthermore, Mandler (2014) proposed that individuals prefer objects that can be easily processed. Therefore, the congruency of stimuli’s characteristics leads to higher evaluations and positive emotions by improving processing fluency (Littel & Orth, 2013; Lunardo & Livat, 2016).
In sum, the present study aimed to investigate the influences of semantically and spatially congruent sounds on the aesthetic perception of pictures and whether the influences were mediated by processing fluency. A pre-test was conducted to select semantically congruent sounds and pictures as the experimental materials. Then, participants were asked to rate the aesthetic evaluations (liking and aesthetic value) and processing fluency of pictures presented simultaneously with sounds under four conditions (spatial congruency/incongruency and semantic congruency/incongruency). As mentioned above, processing fluency is a subjective feeling and can influence people’s liking through the subjective sense of ease (Reber et al., 2004). Forster et al. (2013) believed that subjective feeling of fluency determines the liking of paintings rather than objective fluency. Given that the aim of the present study is to investigate the effects of semantic and spatial congruency on individuals’ subjective liking and aesthetic value, as well as to examine the mediating role of processing fluency, we chose to measure subjective fluency to align with previous research exploring the impact of processing fluency on subjective liking (Forster et al., 2015; Landwehr & Eckmann, 2020; Mayer & Landwehr, 2018). Based on the literature reviewed above, we proposed the following hypotheses:
H1: Semantically and spatially congruent sounds, compared to semantically and spatially incongruent sounds, will increase the aesthetic evaluations of the pictures presented simultaneously.
H2: The effects of semantic and spatial congruency will be mediated by processing fluency.
Methods
Hypotheses, experimental design, procedure, method, and data analysis plans were preregistered at the Open Science Framework at https://osf.io/7w4uh. The study was approved by the Human Research Ethics Committee of the Department of Psychology of Soochow University and performed in accordance with the ethical standards laid down in the Declaration of Helsinki. The data and materials for this study are available at https://osf.io/89x5k/.
Participants
In the present study, participants were recruited from the student population at Soochow University in China. A total of 35 young adults (mean age = 20.83 ± 1.93 years, ranging from 18 to 27 years; 20 females) were recruited to take part in the formal experiment, and all reported normal or corrected-to-normal vision and hearing without color blindness. Each participant signed the informed consent form before the experiment started and was compensated with 17-20 Chinese Yuan for participation.
We used G*Power 3.1 software (Faul et al., 2007) to conduct a power analysis. We used a 2 × 2 within-subjects design where a sample size of 35 could detect the effects with the effect size f ≥ 0.30, statistical power = 0.80, and alpha = 0.05.
Materials
In the present study, animal art pictures generated by the artificial intelligence (AI) drawing tool Midjourney were selected as the experimental materials and cropped to 1536 × 1536 pixels. We chose AI-generated animal art pictures to better focus on our goal of exploring the effects of congruency on aesthetic perception, as simple animal pictures may not effectively capture aesthetic elements. Each picture used in the experiment was paired with two sounds: a semantically congruent sound (the sound of the animal in the picture) and a semantically incongruent sound (the sound of other animals). The semantically incongruent sound was randomly selected from the pool of sounds excluding the semantically congruent sound for the corresponding picture. These sounds were collected on YouTube and http://www.findsounds.com, and standardized to a sample rate of 44.1kHz by Adobe Audition 2020.
A pre-test was conducted to select the materials for the formal experiment. Participants were first asked to evaluate the pleasantness, familiarity, and liking of the individually presented sounds. Then they evaluated the congruency between each picture and the simultaneously presented sound (both the semantically congruent sound and the semantically incongruent sound). We finally selected 23 picture-sound sets, each consisting of one picture and two sounds (a semantically congruent sound and a semantically incongruent sound). These sounds differed significantly in semantic congruency, t(44) = 20.721, p < .001, Cohen’s d = 6.11. Moreover, there were no significant differences between these sounds in pleasantness, familiarity, or liking (pleasantness: t(44) = -0.123, p = 0.807; familiarity: t(44) = 0.412, p = 0.770; liking: t(44) = -0.017, p = 0.120).
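As a consistency check on the reported effect size, Cohen's d for two independent groups can be recovered from the t statistic. The sketch below assumes that the 23 congruent and 23 incongruent sounds were compared with an independent-samples t-test using a pooled standard deviation (an assumption that matches the 44 degrees of freedom but is not stated in the text); the function name is illustrative.

```python
import math

def cohens_d_from_independent_t(t, n1, n2):
    """Cohen's d implied by an independent-samples t statistic (pooled SD)."""
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)

# Assumption: 23 sounds per group, as implied by t(44).
d_congruency = cohens_d_from_independent_t(20.721, 23, 23)
print(round(d_congruency, 2))  # 6.11
```

Under this assumption the conversion reproduces the reported d = 6.11.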
The formal experiment was run on a computer with a 27-inch display (144 Hz refresh rate, 2560 × 1440 pixel resolution). Stimulus presentation and data collection were controlled by E-Prime 3.0. Participants were instructed to sit 80 cm from the screen. When the image was presented in the center of the screen, it subtended a visual angle of 22.69° × 31.85°; when the image was presented on the left or right side of the screen, it subtended a visual angle of 20.12° × 29.71°. Auditory stimuli were presented via headphones (Logitech USB Headset H340).
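For readers who wish to relate stimulus size to visual angle, the standard conversion is θ = 2·arctan(s / 2d), where s is the physical extent of the stimulus and d is the viewing distance. The sketch below illustrates the formula only; the 20 cm stimulus extent is a hypothetical value chosen for illustration, and only the 80 cm viewing distance is taken from the text.

```python
import math

def visual_angle_deg(size_cm, distance_cm):
    """Visual angle (in degrees) subtended by an extent of size_cm viewed from distance_cm."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))

VIEWING_DISTANCE_CM = 80.0   # reported viewing distance
EXAMPLE_SIZE_CM = 20.0       # hypothetical stimulus extent, not reported in the text

example_angle = visual_angle_deg(EXAMPLE_SIZE_CM, VIEWING_DISTANCE_CM)
print(round(example_angle, 2))
```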
Design and Procedure
A 2 (Spatial congruency: picture and sound on the same side or on different sides) × 2 (Semantic congruency: picture and sound from the same animal or from different animals) within-subjects design was used. The experiment was conducted in a soundproof environment. The formal experiment included three phases: a baseline task, a learning task, and an evaluation task. There were 23 trials in the baseline task, 69 trials in the learning task, and 92 trials in the evaluation task, with all stimuli presented in random order. Before the evaluation task, participants completed 4 practice trials.
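The trial counts follow directly from the design: 23 pictures in the baseline task, 23 pictures × 3 repetitions in the learning task, and 23 pictures × 2 × 2 conditions in the evaluation task. The sketch below builds an evaluation trial list under stated assumptions. How the picture side was assigned is not described in the text, so a random choice per trial is assumed here, and all names are illustrative.

```python
import itertools
import random

N_PICTURES = 23                      # picture-sound sets selected in the pre-test
LEVELS = ("congruent", "incongruent")

def build_evaluation_trials(seed=0):
    """Return a shuffled list of evaluation trials for the 2 x 2 design."""
    rng = random.Random(seed)
    trials = []
    for picture, spatial, semantic in itertools.product(range(N_PICTURES), LEVELS, LEVELS):
        # Assumption: the picture side is drawn at random on every trial.
        picture_side = rng.choice(("left", "right"))
        if spatial == "congruent":
            sound_side = picture_side
        else:
            sound_side = "right" if picture_side == "left" else "left"
        trials.append({
            "picture": picture,
            "spatial": spatial,
            "semantic": semantic,
            "picture_side": picture_side,
            "sound_side": sound_side,
        })
    rng.shuffle(trials)              # stimuli were presented in random order
    return trials

evaluation_trials = build_evaluation_trials()
n_learning_trials = N_PICTURES * 3   # each pair learned three times
print(len(evaluation_trials), n_learning_trials)  # 92 69
```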
In the baseline task (see Fig. 1A), the pictures without accompanying sounds were presented in the center of screen. Participants were asked to rate the liking (1 = very much dislike; 7 = very much like), aesthetic value (1 = not at all; 7 = very much) and processing fluency of the pictures on a 7-point scale. The processing fluency was measured with the item (“Do you think the painting is easy to process?”) (1 = very difficult to process; 7 = very easy to process) (Graf et al., 2018; Lunardo & Livat, 2016).
In the learning task (see Fig. 1B), participants were asked to learn the association between pictures and semantically congruent sounds, learning each pair three times. The pictures were presented in the center of the screen, accompanied by a semantically congruent sound played through both the left and right channels of the headphones. After the presentation of each picture and its semantically congruent sound, participants were asked to rate their learning effect (“whether you are familiar with the animal in the picture and their sound.”) (1 = very unfamiliar; 7 = very familiar).
In the evaluation task (see Fig. 1C), the picture was presented on either the left or the right of the screen, accompanied by a semantically congruent or incongruent sound played through the left or right channel of the headphones. Participants were asked to rate the pictures on the same items as in the baseline task.
Results
First, we performed a paired-sample t-test on the familiarity of the learning tasks. The results revealed that the familiarity after the second learning (M = 6.02) was significantly higher than after the first learning (M = 5.54), t(34) = -6.150, p < 0.001, Cohen’s d = 1.04; and the familiarity after the third learning (M = 6.29) was also significantly higher than after the second learning, t(34) = -5.835, p < 0.001, Cohen’s d = 0.99. These results indicated that participants could increase their familiarity with pictures and sounds through the learning tasks.
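The reported effect sizes are consistent with the common conversion for paired designs, d = |t| / √n. The sketch below assumes this conversion was the one used (the text does not say) and that n = 35, as implied by t(34); the function name is illustrative.

```python
import math

def cohens_d_from_paired_t(t, n):
    """Magnitude of Cohen's d implied by a paired-samples t statistic (d = |t| / sqrt(n))."""
    return abs(t) / math.sqrt(n)

d_first_vs_second = cohens_d_from_paired_t(-6.150, 35)
d_second_vs_third = cohens_d_from_paired_t(-5.835, 35)
print(round(d_first_vs_second, 2), round(d_second_vs_third, 2))  # 1.04 0.99
```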
The difference values between the evaluation task ratings and baseline task ratings were calculated as the dependent variable (evaluation task ratings minus baseline task ratings, see Table S1), and a 2 (Spatial congruency: spatial congruency and spatial incongruency) × 2 (Semantic congruency: semantic congruency and semantic incongruency) repeated measures analysis of variance (ANOVA) was conducted on these difference scores. The results indicated that the main effect of Semantic congruency on liking was significant (see Fig. 2A), F(1,34) = 23.247, p < 0.001, ηp2 = 0.406. Participants exhibited higher liking ratings under the condition of semantic congruency (M = 0.29) than under the condition of semantic incongruency (M = -0.1). The main effect of Spatial congruency (ηp2 = 0.015, 90% CI: [0, 0.134]) and the interaction term were not significant, both Fs < 1.9, ps > .05. We followed up on the nonsignificant interaction by calculating Bayes factors to determine whether there was evidence for a null effect of spatial congruency at each level of semantic congruency (Berry et al., 2023; Schönbrodt & Wagenmakers, 2018). Bayes factors were computed in R using the “BayesFactor” package (Morey & Rouder, 2022). The results supported the null hypothesis, suggesting that there were no differences in liking ratings between the spatial congruency and incongruency conditions under the semantic congruency condition, BN(0,1) = 0.25; the results also suggested that there were no differences in liking ratings between the spatial congruency and incongruency conditions under the semantic incongruency condition, BN(0,1) = 0.26.
The results indicated that the main effect of Semantic congruency on aesthetic value was significant (see Fig. 2B), F(1,34) = 19.483, p < 0.001, ηp2 = 0.364. Participants exhibited higher aesthetic value ratings under the condition of semantic congruency (M = 0.28) than under the condition of semantic incongruency (M = -0.04). The main effect of Spatial congruency (ηp2 = 0.001, 90% CI: [0, 0.053]) and the interaction term were not significant, both Fs < 0.1, ps > .05. The Bayes factors indicated that there were no differences in aesthetic value between the spatial congruency and incongruency conditions under the semantic congruency condition, BN(0,1) = 0.25; the results also suggested that there were no differences in aesthetic value between the spatial congruency and incongruency conditions under the semantic incongruency condition, BN(0,1) = 0.25.
The results also indicated that the main effect of Semantic congruency on processing fluency was significant (see Fig. 2C), F(1,34) = 30.063, p < 0.001, ηp2 = 0.469. Participants exhibited higher processing fluency ratings under the condition of semantic congruency (M = 0.54) than under the condition of semantic incongruency (M = -0.55). The main effect of Spatial congruency (ηp2 = 0.013, 90% CI: [0, 0.127]) and the interaction term were not significant, both Fs < 0.8, ps > .05. The Bayes factors indicated that there were no differences in processing fluency between the spatial congruency and incongruency conditions under the semantic congruency condition, BN(0,1) = 0.25; the results also suggested that there were no differences in processing fluency between the spatial congruency and incongruency conditions under the semantic incongruency condition, BN(0,1) = 0.25.
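The partial eta squared values for the three main effects of Semantic congruency can be recovered from the F statistics and their degrees of freedom, since ηp2 = (F × df1) / (F × df1 + df2). The sketch below is a consistency check on the values reported above, not a reanalysis of the data; the function and variable names are illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# F(1, 34) values reported for the main effect of Semantic congruency.
reported_f = {"liking": 23.247, "aesthetic value": 19.483, "processing fluency": 30.063}

eta_by_measure = {
    measure: round(partial_eta_squared(f_value, 1, 34), 3)
    for measure, f_value in reported_f.items()
}
print(eta_by_measure)
# {'liking': 0.406, 'aesthetic value': 0.364, 'processing fluency': 0.469}
```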
The present study further examined whether the effects of semantic congruency on liking and aesthetic value were mediated by processing fluency to test the H2. All tests were performed in SPSS Version 25 using path analysis-based mediation with the Hayes PROCESS macro (Model 4; Hayes, 2017). This model was chosen because many studies have demonstrated its reliability in measuring the mediating effects of variables (e.g. Sahmurova et al., 2022; Y.-X. Wang et al., 2019). Indirect effects were tested using bias-corrected bootstrapping with 5,000 bootstrap samples.
As shown in Fig. 3A, the mediation effect of processing fluency between semantic congruency and liking was tested. The results indicated that the direct effect of semantic congruency was not significant (Effect = -0.210, SE = 0.161, 95% CI: [-0.528, 0.108], including zero), but the indirect effect of semantic congruency was significant (Effect = -0.210, SE = 0.080, 95% CI: [-0.390, -0.074], not including zero). Semantic congruency had a significant effect on processing fluency (β = -1.110, SE = 0.203, 95% CI: [-1.511, -0.709]); processing fluency had a significant effect on liking (β = 0.189, SE = 0.061, 95% CI: [0.068, 0.310]). These findings suggested that the relationship between the semantic congruency and liking was completely mediated by the processing fluency.
Finally, as shown in Fig. 3B, the mediation effect of processing fluency between semantic congruency and aesthetic value was tested. The results indicated that the direct effect of semantic congruency was not significant (Effect = -0.085, SE = 0.138, 95% CI: [-0.358, 0.187], including zero), but the indirect effect of semantic congruency was significant (Effect = -0.234, SE = 0.079, 95% CI: [-0.414, -0.100], not including zero). Semantic congruency had a significant effect on processing fluency (β = -1.110, SE = 0.203, 95% CI: [-1.511, -0.709]); processing fluency had a significant effect on aesthetic value (β = 0.211, SE = 0.052, 95% CI: [0.107, 0.315]). These findings suggested that the relationship between the semantic congruency and aesthetic value was completely mediated by the processing fluency.
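In a simple mediation model, the indirect effect is the product of the path from the predictor to the mediator (a) and the path from the mediator to the outcome controlling for the predictor (b). The sketch below first checks that the reported path coefficients multiply to the reported indirect effects, and then illustrates a bootstrap confidence interval for a × b. Note the assumptions: the bootstrap shown is a plain percentile bootstrap rather than the bias-corrected variant used in the analysis, it resamples cases without modelling the repeated-measures structure, and it runs on synthetic data with hypothetical path values (a = -1.0, b = 0.5), not on the study data.

```python
import random

# Consistency check on the reported coefficients (indirect effect = a * b).
indirect_liking = -1.110 * 0.189      # reported indirect effect: -0.210
indirect_aesthetic = -1.110 * 0.211   # reported indirect effect: -0.234

def _mean(values):
    return sum(values) / len(values)

def _cov(u, v):
    mu, mv = _mean(u), _mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / (len(u) - 1)

def indirect_effect(x, m, y):
    """a * b, with a from M ~ X and b from Y ~ X + M (closed-form least squares)."""
    sxx, smm, sxm = _cov(x, x), _cov(m, m), _cov(x, m)
    sxy, smy = _cov(x, y), _cov(m, y)
    a = sxm / sxx
    b = (smy * sxx - sxy * sxm) / (sxx * smm - sxm ** 2)
    return a * b

def percentile_bootstrap_ci(x, m, y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the indirect effect (not bias-corrected)."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(indirect_effect([x[i] for i in idx],
                                         [m[i] for i in idx],
                                         [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[int(n_boot * alpha / 2)]
    upper = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Synthetic illustration only: hypothetical paths a = -1.0 and b = 0.5.
data_rng = random.Random(42)
x = [i % 2 for i in range(300)]                    # two-level predictor
m = [-1.0 * xi + data_rng.gauss(0, 1) for xi in x]  # mediator
y = [0.5 * mi + data_rng.gauss(0, 1) for mi in m]   # outcome

ci_lower, ci_upper = percentile_bootstrap_ci(x, m, y)
print(round(indirect_liking, 3), round(indirect_aesthetic, 3))
print(ci_lower < 0 and ci_upper < 0)  # whether the interval excludes zero
```

An interval that excludes zero is the criterion applied to the indirect effects reported above.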
Discussion
The present study investigated the influences of semantically and spatially congruent sounds on the aesthetic perception of pictures and whether the influences were mediated by processing fluency. To the best of our knowledge, this is the first study which simultaneously explored the effects of semantic and spatial congruency from the perspective of aesthetic perception.
Several key findings emerged from the present study. First, as expected, we found a significant effect of semantic congruency on aesthetic perception. Specifically, compared to semantically incongruent sounds, semantically congruent sounds increased participants’ aesthetic evaluation of the pictures. Iosifyan et al. (2022) explored the effects of congruency between sounds and pictures from the perspective of embodied representations on aesthetic perception. The results indicated that embodied sounds increased the aesthetic evaluations of congruent pictures. We extended the findings to semantic congruency. Recently, Frame et al. (2023) reported that participants could independently evaluate the pleasure of painting while music was presented. However, they did not consider the influence of semantic congruency, which demonstrates the importance of semantic congruency for multisensory perception in aesthetic perception.
Second, the results of the mediation analysis demonstrated that processing fluency fully mediated the relationship between semantic congruency and aesthetic perception. Specifically, individuals could process information about the pictures more fluently under the condition of semantic congruency, which led to higher liking and aesthetic evaluation of the pictures. This finding aligns with results from studies in other fields (Littel & Orth, 2013; Lunardo & Livat, 2016).
It is important to note that the present study focused on subjective fluency and aesthetic perception, based on the finding that the subjective experience of fluency influences subjective judgments to a greater extent than objective fluency (Forster et al., 2013). In contrast, objective fluency is usually measured by objective indicators such as accuracy or reaction time (Graf et al., 2018; Reber et al., 2004). We recommend incorporating subjective processing fluency into the theoretical framework of the Unity Assumption when the focus of a study is on individuals’ subjective judgments. The results of the present study suggest that congruency does not directly influence subjective judgments. Instead, it exerts its effect through subjective processing fluency as a mediator. However, this recommendation should be applied cautiously: although subjective fluency influences subjective judgments more strongly than objective fluency and differs from it, the two are still related. For example, an early study found a similar impact of objective fluency on liking by manipulating the presentation duration of simple shapes in a single exposure paradigm (Reber et al., 1998).
We believe that the familiarity may also modulate the effect of semantic congruency. To manipulate semantic congruency, we used common animal pictures generated by AI, which likely increased the familiarity of experimental materials. Hein et al. (2007) found that cortical audio-visual integration sites in a frontotemporal network are differently activated depending on semantic congruency and stimulus familiarity, indicating potential differences and similarities between semantic congruency and familiarity. However, relevant research still remains insufficient.
Furthermore, it should be noted that although participants learned each pair of pictures and semantically congruent sounds three times during the learning task, repeated exposure during the subsequent evaluation task may still lead to an increase in familiarity. Previous studies have shown that the effects of processing fluency and familiarity are difficult to disentangle. People can evaluate the perceived familiarity of the information based on the ease with which they can process the relevant input (Schwarz et al., 2007). Therefore, fluency can serve as a cue for familiarity, as previously seen materials are easier to process (Schwarz, 2004). Future research could attempt to separate the effects of familiarity. For instance, in a reinforcement learning study, the researchers separated the effects of visual novelty and value uncertainty by introducing new stimuli in each block (Cockburn et al., 2022). Such manipulation is also suitable for separating the effects of familiarity: by introducing new stimuli or directly manipulating the familiarity of stimuli in the blocks, the effect of familiarity can be separated, allowing for a pure assessment of the impact of semantic congruency and processing fluency on aesthetic perception.
One might argue that the results of the present study could be explained by an alternative theory, namely emotional response, which could replace processing fluency as the explanatory factor. In this view, congruent stimuli evoke more positive emotions, leading to higher aesthetic perception (Cheung et al., 2019; Lin et al., 2022; Mattila & Wirtz, 2001). In fact, processing fluency and emotional response are not independent. According to Reber et al. (1998), higher processing fluency leads to greater liking and more positive emotions. A possible direction for future research would be to integrate emotional response into the theoretical framework of the present study to explore whether processing fluency influences aesthetic perception by inducing more positive emotions.
Finally, the data did not support the hypothesis that spatially congruent sounds would increase the aesthetic perception of the pictures, and we did not observe a significant influence of spatial congruency on processing fluency. The null finding may be due to the fact that participants had unlimited response time in the rating stages. Spence (2013) suggested that the spatial rules in multisensory perception appear to be a much more task-dependent phenomenon than is often realized. The effects of spatial congruency are more easily observed when the task requires a speeded overt orienting response (e.g., an eye movement). Since the present study focused on subjective aesthetic perception, we did not restrict participants’ response time, which may have resulted in participants responding after the effect of spatial congruency had gradually diminished. Moreover, virtual reality (VR) has been shown to effectively augment spatial attention (Knobel et al., 2022). Future research could explore the effect of spatial congruency by using VR and by limiting response time.
Additionally, a possible alternative theoretical explanation is that the semantic information in the experiment is something individuals have learned over the long term (i.e., animals and their sounds), whereas the spatial information is not as explicit. For instance, we may hear the same animal’s sound from various directions in real life, rather than just simply from the left or right. Although the images were presented at the center of the screen and sounds were played from both the left and right channels of headphones in the learning phase, it is unclear whether this manipulation can influence the long-term learned expectations in the latter evaluation task. Therefore, we believe that participants likely had higher expectations of semantic congruency in the present study, which led to more easily observable effects. When cues are inconsistent with their expectations, this often leads to negative evaluations and responses, as individuals tend to prefer cues that match their expectations (Eklund & Helmefalk, 2022; Errajaa et al., 2018). When the cues align with expectations, the accompanying positive effects are likely to be transferred to the overall evaluation (Fiske, 2014; Krishna et al., 2010). Future research may benefit from using more immersive stereo sounds to manipulate spatial congruency, better aligning it with people’s learned expectations, or altering the materials of semantic congruency to further investigate the impact of spatial congruency and semantic congruency on aesthetic perception.
There are several limitations to the present study. First, the present study used AI-generated animal pictures as the experimental materials, which were selected to capture aesthetic elements and explore participants’ aesthetic perception. However, these pictures still differ from real images. Future research should use real art images to explore the effects of semantic and spatial congruency on aesthetic perception. Second, the present study did not consider the influence of complexity. Previous studies have demonstrated that incongruent presentations of sound and movement can lead to increased aesthetic ratings, probably due to the increased complexity (Howlin et al., 2020). It would be worthwhile to consider complexity in future research.
Author Contributions
Zefei Chen: Formal analysis, Methodology, Writing - original draft preparation.
Yujia Liu: Data curation, Writing - original draft preparation.
Xinrui Wang: Writing - review & editing.
Lei Wu: Conceptualization, Supervision.
Jianping Huang: Conceptualization, Supervision, Writing - review & editing.
Competing Interests
The authors declare that they have no conflict of interest.
Funding
This research was supported by the National Natural Science Foundation of China (Grant No. 32471128), the Jiangsu Province Higher Education Teaching Reform Research Project (Grant No. 2023JSJG265), and the Key Project of Jiangsu Province School Aesthetic Education Research Planning (Grant No. 20240190).
Data Accessibility Statement
All data and materials for this study are available at https://osf.io/89x5k/.