The capacity to experience inner speech has been hypothesized to impact cognitive functions like object recognition, abstract thought, and metacognition. However, little is known about how individual differences in the propensity to experience inner speech affect these functions, a relevant topic to address given the wide variation in this trait within the general population. Here, we contribute novel evidence addressing this gap and provide a new tool to measure these individual differences in German-speaking participants. In a first study, we validate the IRQ-G, a German version of the Internal Representations Questionnaire (Roebuck & Lupyan, 2020), offering scales that index the capacity to manipulate visual representations (Representational Manipulation), and the tendencies to experience thoughts in the form of visual and verbal information (Visual Imagery and Internal Verbalization, respectively). In a second study, we observe that participants with higher Internal Verbalization scores respond faster in a word-picture verification task where they match words to images but are more slowed by increases in semantic similarity between the items, relative to those with lower scores. We interpret this as a reflection of more categorical representations being cued in those with more language-reliant thinking during object recognition. In a third study, we present exploratory evidence based on a complex categorization task that higher Internal Verbalization relates to slower determinations of response confidence when participants group images that rarely occur in the same context. We additionally offer exploratory evidence that condensed forms of inner speech hamper object recognition but aid abstract thought. Together, our studies highlight the complex and diverse effects of individual differences in inner speech tendencies on noncommunicative cognitive functions and broaden the methodological toolkit of researchers pursuing this line of inquiry in the German-speaking world.
Introduction
It is a very common human experience to think in a verbal code. Verbal thinking, also referred to as inner speech, internal verbalization, inner voice, among many other terms, can be defined as a subjective experience of language happening without audible articulation (Alderson-Day & Fernyhough, 2015). For example, we might think of the items we need to get at a supermarket and hear their names in our mind’s ear while perusing the aisles. Some of us, however, might experience only mental images of the items, without any inner voice; others might experience no imagery at all (Heavey & Hurlburt, 2008). Similarly, while some might covertly hear full grammatical sentences like “I need to get some eggs”, others might just hear “eggs” in a more condensed form (Alderson-Day et al., 2018; Fernyhough, 2004; Grandchamp et al., 2019). Thus, inner speech is at the same time a core aspect of human cognition and a variable, multifaceted phenomenon.
Irrespective of the conscious experience of inner speech, linguistic representations have been claimed to subserve various noncommunicative cognitive functions, ranging from low-level perception to higher-level abstract thinking and metacognition (for reviews, see, e.g.: Borghi & Fernyhough, 2023; Gleitman & Papafragou, 2005; Lupyan, 2016; Lupyan et al., 2020; Nedergaard, Wallentin, et al., 2023; Perszyk & Waxman, 2018). Typically, the effect of language on these aspects of cognition has been assessed either by experimentally manipulating the amount of verbal information available to participants (e.g., Baldo et al., 2005; Dils & Boroditsky, 2010; Gilbert et al., 2008; Lupyan, 2009; Lupyan & Thompson-Schill, 2012; Maier & Abdel Rahman, 2019; Tullett & Inzlicht, 2010; Winawer et al., 2007) or by comparing the performance of language-impaired participants to that of language-unimpaired controls (e.g., Baldo et al., 2010, 2015; Cohen-Dallal et al., 2022; Koemeda-Lutz et al., 1987; Langland-Hassan et al., 2017, 2021; Lupyan & Mirman, 2013). Much less attention has been devoted to the effects of naturally occurring individual differences in inner speech use on these functions. This is noteworthy because, beyond the anecdotal remarks made above, there is empirical evidence pointing to a large degree of variability in people’s propensity to internally verbalize: some individuals report experiencing inner speech almost all of the time, while others seem to lack this experience completely, with average prevalence estimates ranging from about a quarter of the time to around three-quarters of the time, depending on the method used (Heavey et al., 2019; Heavey & Hurlburt, 2008; Hurlburt et al., 2013, 2016).
One reason why individual differences in inner speech tendencies have been somewhat neglected in this area of research might be the paucity of easily deployable and reliable tools to measure them. On the one hand, researchers can use Descriptive Experience Sampling (Hurlburt & Akhter, 2006), a well-established method that requires participants to wear a beeper as they go about their daily activities, jot down their inner experiences upon hearing randomly played beeps, and finally discuss these experiences with the experimenter in an expositional interview. However, this is a resource-intensive procedure, and it has not escaped theoretical criticisms (e.g., Hurlburt & Schwitzgebel, 2007). On the other hand, researchers can use self-report questionnaires where participants indicate the extent to which they agree with statements about their inner lives or how often they experience what is stated, e.g., “How frequently do you talk to yourself in your inner voice?”. These are much more efficient and easier to use, but only recently has a questionnaire been produced that can reliably measure general inner speech propensity, namely the Internal Representations Questionnaire (IRQ; Roebuck & Lupyan, 2020). Scales measuring somewhat similar constructs have been produced, but they either focus on specific manifestations of inner speech (the Varieties of Inner Speech Questionnaire, McCarthy-Jones & Fernyhough, 2011), conflate private (external) speech and internal speech (the Self-Talk Scale, Brinthaupt et al., 2009), have significant psychometric limitations (the Nevada Inner Experience Questionnaire, Heavey et al., 2019), or index only a subset of inner speech experiences, such as those related to oneself (the Inner Speech Scale, Siegrist, 1995) or those with a positive or negative valence (the Self-Talk Inventory, Calvete et al., 2005).
The IRQ is composed of four subscales, of which one measures the capacity to manipulate auditory and visual information (Representational Manipulation) and three measure people’s tendencies to experience different modes of thinking (Internal Verbalization, Orthographic Imagery, and Visual Imagery). The focus of the questionnaire, however, is the Internal Verbalization scale, which was designed to index the tendency to experience thoughts in the form of language and to use language to guide one’s thinking (Roebuck & Lupyan, 2020). As may be inferred from this description, the scale was developed based on the assumption that noncommunicative cognitive functions can be augmented by (inner) language. Fittingly, Roebuck and Lupyan (2020) used their newly developed tool to provide evidence for this idea from an individual-differences perspective, while also testing the scale’s predictive validity.
Specifically, Roebuck and Lupyan (2020) investigated whether Internal Verbalization scores affected performance on a word-picture verification task where participants saw a word cue followed by a picture target (or a picture cue followed by a word target) and indicated via keypress whether they matched. Some of the trials had matching stimuli (e.g., the word “snail” and the picture of a snail), and some had nonmatching stimuli, in which case the similarity between the cues and the targets varied continuously in semantic, phonological, and orthographic dimensions (e.g., SNAIL-WHALE: relatively high semantic similarity, high phonological similarity, and medium orthographic similarity, as indexed by human ratings). The authors hypothesized that stronger tendencies to experience inner speech would be linked to more robust and/or automatic activation of phonological representations from images, leading to faster response times when the cues and the targets matched and slower response times when the items were more phonologically related. They also hypothesized that more inner speech would lead to reduced semantic interference, that is, relatively faster response times when the items were more semantically related. This pattern would follow from the label-feedback hypothesis (Lupyan, 2012a), which posits that words can transiently modulate mental representations by selectively activating features that are most diagnostic of the categories they cue. This top-down effect of language on mental representations, in turn, would stem from the fact that verbal labels are invariant to the specific details of their referents, e.g., “drum” can refer to both a toy drum and an orchestral bass drum, thereby eliciting more categorical expectations to which subsequent stimuli can be matched (Edmiston & Lupyan, 2015; Lupyan, 2012b).
Surprisingly, results showed that participants with higher Internal Verbalization scores responded more slowly than those with lower scores when the cue was a picture and the target was a word, regardless of whether the items matched. In line with expectations, however, participants with higher Internal Verbalization scores were found to be slower than those with lower scores when the items were more phonologically similar, regardless of cue-target type. Moreover, in line with the authors’ predictions and the label-feedback hypothesis, participants who reported more inner speech were less slowed by semantic similarity, relative to those with less inner speech, although this was restricted to when the cue was a word and the target was a picture. Taken together, these findings suggest that individual differences in inner speech propensity influence how people recognize objects, but the complex pattern of results remains difficult to interpret and has yet to be reproduced in other samples. Furthermore, the IRQ has scarcely been validated for use outside the English-speaking world, posing an obstacle to researching this topic in other regions.
Additionally, while the IRQ has been validated for a lower-level recognition task, it is unclear whether it captures individual differences in inner speech that might impact higher-order cognition where controlled processes and conscious deliberation play a bigger role, as only limited evidence focusing on memory performance has been accumulated so far (Nedergaard & Lupyan, 2024; Schmidt & Kirwan, 2024). In this regard, it has been argued that inner speech serves as a “cognitive tool” for thinking and problem-solving, either directly through the retrieval, shaping, and/or maintenance of task-relevant representations (e.g., Baldo et al., 2005, 2010; Christie & Gentner, 2014; Cole et al., 2013; Emerson & Miyake, 2003; Kompa, 2021; Kray et al., 2008, 2009; Miyake et al., 2004; van ’t Wout & Jarrold, 2020; Vygotsky, 2012) or indirectly as a modulator of affect and/or motivation (e.g., Carlson et al., 2005; Gade & Paelecke, 2019; Nedergaard, Christensen, et al., 2023). Of these higher-order functions, abstract thinking has received special emphasis in the work of some researchers, who argue that language is particularly helpful when thinking about abstract concepts, i.e., those that cannot be grasped by the senses (e.g., Borghi et al., 2017, 2019; Dove, 2011, 2014, 2018). For example, Dove (2014) proposes that (inner) language extends our cognitive capacities by providing an independent amodal combinatorial code that can support conceptual processing when it goes beyond perceptual experience. Furthermore, based on complex categorization tasks, verbal labels have been argued to provide a way to link items in more abstract contexts. For example, in an early study involving individuals with aphasia, Koemeda-Lutz et al. (1987) submitted that participants’ difficulties in classifying images based on single features like color stemmed from the fact that this kind of categorization relies more on linguistic information: grouping red bricks and red cherries would be made possible through the verbal label “red”. An analogous account was offered in a more recent study to explain detrimental effects of language deficits on single-feature-based categorization (Lupyan & Mirman, 2013).
Over and above abstract thought, inner speech has been theorized to support other forms of higher-order cognition such as our capacity to reflect on our mental states, that is, to engage in metacognition. Some have attributed this effect to the linguistic structure of inner speech (e.g., Bermudez, 2003, 2018; Clark, 1998), while others have focused more on perceptual aspects of inner speech as the key to its metacognition-enhancing effects (e.g., Carruthers, 2011; Jackendoff, 2007, 2011). For example, Clark (1998) argues that we can hold our thoughts and cognitive profiles as objects for further scrutiny in inner speech due to its modality-neutral and context-invariant nature, which was determined by the communicative functions of public language. On the other hand, Carruthers (2011) maintains that inner speech allows us to know our mental states because of its sensory character, just as the sensory input we receive from others allows us to interpret their mental states.
However, until recently, no empirical evidence had been provided for the idea that inner speech supports metacognition. Among the first to bridge this gap were Langland-Hassan and colleagues (2017). In their study, participants with aphasia who had inner speech deficits were found to be less accurate than controls when judging how confident they were in their response during a categorization task, specifically when the context of categorization was more abstract, i.e., when the answer was not based on perceptual features or thematic associations among the items but rather on taxonomic (“fruit”), functional (“energy source”) or affordance-based (“makes a noise”) categories. The results were interpreted as reflecting the importance of experiencing auditory verbal imagery to cue oneself that one has correctly identified the grouping criterion in the trial, e.g., “fruit” (Langland-Hassan et al., 2017).
Building on this early study, more recent work by Langland-Hassan and collaborators (2021) investigated the role of inner speech in both abstract thought and metacognition while also taking a methodological step forward by developing a measure called Trial Concreteness. The measure was based on human ratings and indexes how many perceptual features and thematic associations are available to group images in a specific trial, consistent with theories that see conceptual representations as being context-variant (e.g., Yee & Thompson-Schill, 2016). In a novel task manipulating Trial Concreteness, trials with high scores were those in which a target and a matching image were judged to occur together more often in the same setting and/or to be more visually similar to each other, relative to the target and the other choice images in the trial. The opposite was true for trials with low Trial Concreteness scores. After norming the task, the authors administered it to individuals with aphasia and controls, asking them to select the image from a four-image bottom array that would go “the best” with another image at the top. Upon making their choice, participants were then asked to rate how confident they were that they had the correct response, using a scale from 1 (“I am guessing”) to 4 (“I am very confident”).
The results showed that participants with aphasia were disproportionately slower than controls when Trial Concreteness was lower, corroborating previous suggestions that linguistic labels are more important when fewer salient properties are available for categorization (Koemeda-Lutz et al., 1987; Lupyan, 2009; Lupyan et al., 2012; Lupyan & Mirman, 2013). Furthermore, Langland-Hassan and colleagues (2021) found that individuals with aphasia were slower than controls when indicating how confident they were in their answers, finding additional correlations between participants’ language performance in the Western Aphasia Battery-Revised (Kertesz, 2006) and their confidence response times. These results were seen as reflecting the power of covert labels to serve as cues for successful task performance and were taken as consistent with theories that view inner speech as the main way one’s thoughts can become objects of awareness (e.g., Bermudez, 2003; Carruthers, 2011; Jackendoff, 1996). Thus, albeit limited in scope, evidence suggests that normal interindividual variation in inner speech use might also impact how people reflect on their confidence states and how they categorize images in more abstract contexts, but this hypothesis remains to be tested.
The Current Study
Using the tools reviewed above, this study aims to investigate the effects of individual differences in inner speech tendencies on object recognition in the context of matching pictures and words, as well as abstract thought and metacognition in the context of a complex categorization task. Specifically, to assess these functions, we use a version of the word-picture verification task employed by Roebuck and Lupyan (2020) where words and images with different types of relations are matched, and a version of the categorization task developed by Langland-Hassan and colleagues (2021) where images are grouped in contexts varying in abstract thinking demands and where metacognitive response confidence is indicated. Moreover, to be able to measure inner speech propensity in German-speaking samples, we translate and validate a German version of the IRQ (Roebuck & Lupyan, 2020), the IRQ-G. In pursuing these goals, we pay heed to the body of evidence suggesting that not everyone experiences inner speech (e.g., Hurlburt et al., 2013), we provide a new instrument for researchers to study this interpersonal variation in German speakers, and we expand previous research on the interplay between language and cognition.
Preregistered Research Questions and Hypotheses
Concerning the questionnaire validation (Study 1.1), we first probed whether the IRQ-G would show the same four-factor structure as the original IRQ, predicting that it would. Secondly, we asked whether each scale of the IRQ-G would be internally consistent and stable over time, expecting that this would be the case, in line with the English scales. Thirdly, we asked whether the Internal Verbalization scale would significantly correlate with other measures of inner speech as indexed by the Varieties of Inner Speech Questionnaire – Revised (VISQ-R; Alderson-Day et al., 2018), anticipating stronger correlations with the factors measuring dialogic and evaluative / critical inner speech, similar to the English original.
Regarding the word-picture verification task (Study 1.2), we first asked whether participants with higher Internal Verbalization scores would identify objects faster than those with lower scores, formulating no specific hypotheses due to the mismatch between Roebuck and Lupyan’s (2020) expectations and results. We also asked whether participants with higher Internal Verbalization scores would be less slowed by semantic similarity and more slowed by phonological similarity, relative to those with lower scores. We expected this would be the case if more inner speech relates to more robust or automatic activation of phonological representations and enhanced activation of category-diagnostic features associated with words and/or images.
Finally, regarding the complex categorization task (Study 1.3), we asked whether participants with higher Internal Verbalization scores would be faster and more accurate when categorizing pictures in more abstract contexts, relative to participants with lower scores. We predicted that this would be the case if inner speech provides a way to link items when fewer salient features are available to guide response (at least when it comes to response times, as no effects on accuracy were found by Langland-Hassan et al., 2021). Moreover, we asked whether higher Internal Verbalization scores would be linked to higher levels and/or faster indications of response confidence, relative to lower Internal Verbalization scores. We expected this would be the case if inner speech provides an important resource for metacognitive thinking (also at least when it comes to response times, as no effects were found for confidence levels themselves in Langland-Hassan et al., 2021).
Preregistered Exploratory Questions
Firstly, using the VISQ-R (Alderson-Day et al., 2018), we explored whether specific forms of inner speech would affect performance on any of the tasks. Secondly, using the IRQ-G factors other than Internal Verbalization, we explored the impact of other modes of thinking on the word-picture verification task. Lastly, we explored the effects of executive functions, namely inhibitory control, cognitive flexibility, and working memory, on the complex categorization task.
Non-preregistered Exploratory Question
Following a suggestion by Langland-Hassan and colleagues (2021), we further explored whether the effects of inner speech on abstract thought vary according to how abstractness is defined by separately analyzing the subcomponents of Trial Concreteness, namely visual similarity and common setting ratings.
Method
Transparency and Openness
This study was preregistered. The preregistration, sharable experiment files and stimuli, supplementary materials, and analysis scripts are available at the following Open Science Framework (OSF) repository: https://osf.io/u4zck/.
Ethics Statement
This study was approved by the Ethics Committee of the University of Vienna (reference number 00864). All participants gave their written consent before providing any data, in accordance with the Declaration of Helsinki (World Medical Association, 2013).
Study 1.1 – IRQ-G Questionnaire Validation
Participants
The sample size for the questionnaire validation, specifically for the confirmatory factor analysis discussed below, was set at 360 participants based on power analyses and literature recommendations (Bentler & Chou, 1987). Initially, 377 participants filled out a German version of the IRQ as well as a German version of the VISQ-R in exchange for monetary compensation. We set out to exclude participants who failed two attention-check items that had been added to the IRQ (i.e., agreeing with “Kühe bellen” [“cows bark”] and disagreeing with “Elefanten sind größer als Ratten” [“elephants are larger than rats”]), but no participant met this criterion. In addition, we used the “TIME_RSI” metric from SoSci Survey (Leiner, 2024) to exclude participants who were too fast or too slow, using a cut-off value of 2.5. This led to the exclusion of three participants. Furthermore, participants who failed to answer all questions in the IRQ questionnaire were also excluded (n = 13). In the end, 361 participants were included in the factor analyses. Some demographic data were missing for 13 of these participants, however, so the demographic information provided herein pertains to the remaining 348 individuals (267 females / 79 males / 2 other; mean age = 23.6 years, SD = 3.7, range = 18–35). Recruitment was carried out in the Viennese metropolitan area through the Vienna CogSciHub Study Participant Platform (https://spp.cogsci.univie.ac.at/). All but 11 participants were currently studying at tertiary educational institutions, hence their educational attainment was at least level three in the framework of the International Standard Classification of Education (UNESCO Institute for Statistics, 2012). All participants reported having German as their first language, with 69% speaking the “Austrian” variety, and 31% speaking a different German variety. Although all participants reported growing up monolingually, 90% of them indicated proficiency in at least one additional language by the age of 15, most commonly English. All participants reported having normal or corrected-to-normal vision, no acute or chronic psychiatric or neurological condition with alteration of consciousness, and no developmental reading disorder.
Procedure
The two inner speech questionnaires and a set of demographic questions were administered online via the SoSci Survey System of the University of Vienna (Leiner, 2024, available at https://sosci.univie.ac.at/). Participants were asked to use a laptop or a desktop computer and to be in a quiet environment to complete the surveys. The average time to complete the surveys was 11 minutes.
Translation
Two independent translators whose native language was German (AT, who spoke the Austrian variety and JW, who spoke the German variety) first translated the 36 items in the original IRQ from American English into German. Then, a professional translator whose native language is American English provided the back-translation of the same items, that is, he translated the new German translation of the items back into English. The result of this back-translation was then compared to the original IRQ, and any discrepancies were discussed among the translators and an Austrian-German-speaking linguist (CA). After that, the translated version of the items was critically reviewed and checked for cultural appropriateness by two native German speakers (one speaking the Austrian variety and one speaking the German variety) who were familiar with the topic. Finally, a small group of German-speaking pilot participants answered the translated questionnaire and provided feedback on their understanding of the questions and overall response experience.
Factorial Validity
To assess whether the nature and number of dimensions underlying the IRQ-G correspond to those in the original IRQ, a confirmatory factor analysis (CFA) was conducted based on the four-factor solution reported by Roebuck and Lupyan (2020), with 36 indicators (two of which are reverse-coded) loading on latent variables reflecting 1) the tendency to experience thoughts in the form of language (termed Internal Verbalization); 2) the tendency to see images in one’s mind’s eye (termed Visual Imagery); 3) the tendency to visualize written language (termed Orthographic Imagery); and 4) the capacity to mentally manipulate different types of visual and auditory representations (termed Representational Manipulation). The response to each item was given on a fully labelled 5-point Likert scale, from “stimme gar nicht zu” [“disagree completely”] to “stimme voll und ganz zu” [“agree completely”]. There were no missing values. The data were not normally distributed, so robust maximum likelihood was used for parameter estimation. As in Roebuck and Lupyan’s study, the variances of all latent variables were set to 1 (variance standardization method), which allowed the loadings of all manifest variables to be freely estimated, and all latent variables were allowed to correlate. Measurement errors were assumed to be uncorrelated, and each indicator was assumed to load on only one latent factor. The model was overidentified with 588 degrees of freedom. Model fit was considered adequate based on the following fit indices and cut-off values: robust Root Mean Square Error of Approximation (RMSEA) – close to 0.06 or smaller; robust Comparative Fit Index (CFI) – close to 0.95 or larger; and Standardized Root Mean Square Residual (SRMR) – close to 0.08 or smaller (Hu & Bentler, 1999).
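For illustration, this model specification can be sketched in R as follows (a minimal sketch assuming the lavaan package; the item names are placeholders rather than the actual IRQ-G item codes, and only four indicators per factor are shown):

library(lavaan)

# Four-factor CFA; iv1, vi1, oi1, rm1, etc. are placeholder item names
irq_model <- '
  internal_verbalization        =~ iv1 + iv2 + iv3 + iv4
  visual_imagery                =~ vi1 + vi2 + vi3 + vi4
  orthographic_imagery          =~ oi1 + oi2 + oi3 + oi4
  representational_manipulation =~ rm1 + rm2 + rm3 + rm4
'

# estimator = "MLR" requests robust maximum likelihood; std.lv = TRUE fixes
# all latent variances to 1 (variance standardization), freeing all loadings
fit <- cfa(irq_model, data = irq_data, estimator = "MLR", std.lv = TRUE)

# Robust fit indices to compare against the cut-offs reported above
fitMeasures(fit, c("cfi.robust", "rmsea.robust", "srmr"))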
Convergent Validity
To determine how well the Internal Verbalization factor of the IRQ-G correlates with instruments measuring similar constructs, participants answered the VISQ-R (Alderson-Day et al., 2018), a measure that has been frequently used to research the form and quality of inner speech and whose German version was graciously provided by Miriam Gade. Spearman’s rho correlation coefficient was used to assess the relationship between Internal Verbalization and the five factors indexed by the VISQ-R, namely Dialogic, Condensed, Other People, Evaluative / Critical, and Positive / Regulatory inner speech (see below for more details).
Reliability
To determine the degree to which the items within each subscale of the IRQ-G are homogeneous, that is, produce similar scores, we computed both the commonly reported Cronbach’s α as well as McDonald’s ω, as the latter does not have the often unreasonable assumption that every indicator of a factor is equally precise in reflecting the latent construct (Hayes & Coutts, 2020). In addition, to assess how consistent the scores of each subscale are over time (test-retest reliability), we computed the intraclass correlation coefficient for repeated IRQ-G measures obtained within approximately three months of each other, using a two-way mixed effects model based on a single measurement and absolute agreement.
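These coefficients can be obtained as sketched below (assuming the psych and irr packages; the data frame names are our placeholders, with one column per item for the consistency measures and one column per measurement occasion for the ICC):

library(psych)
library(irr)

# Internal consistency of one subscale (items in columns)
alpha(verbalization_items)   # Cronbach's alpha
omega(verbalization_items)   # McDonald's omega

# Test-retest reliability: two-way model, absolute agreement, single measurement
icc(data.frame(t1 = verbalization_time1, t2 = verbalization_time2),
    model = "twoway", type = "agreement", unit = "single")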
Study 1.2 – Behavioral Tasks
Participants
Of the 361 participants who contributed to the questionnaire validation phase, those whose Internal Verbalization mean scores were above the 75th percentile or below the 25th percentile were invited to the next phase of the study, where they would perform behavioral tasks. The selection of participants with extreme scores follows Roebuck and Lupyan (2020). Sixty-nine participants completed the word-picture verification task, but two were excluded due to technical or experimenter errors (16 males, 51 females; mean age = 23.67 years, SD = 3.82, range = 18–34). Ninety-four participants completed the complex categorization task, but only 67 completed all executive function tasks (see “Executive Function Tasks” below for more information) and were therefore included in the models (15 males, 52 females; mean age = 23.72 years, SD = 4.13, range = 18–34). In these subsamples, the language profile was similar to the larger questionnaire sample, with all participants being native German speakers, mostly of the Austrian variety (70% of those who did the word-picture verification task; 71% of the participants who did the complex categorization task). Most participants also reported mastering at least one other language before the age of 15 (85% and 84% for the word-picture verification and the complex categorization task, respectively), most commonly English. Five of the 67 participants in each subsample reported being left-handed, with the remaining 62 being right-handed.
Word-Picture Verification Task
Materials
To construct the materials for this task, 20 target labels were paired with four cue labels, with “cue” meaning the stimulus participants saw first in a trial and “target” referring to the stimulus participants saw afterward, that is, the stimulus that was responded to. One cue was more phonologically (rhyme) related to the target, e.g., “Krone” [“crown”] and “Zitrone” [“lemon”]; one cue was more semantically related to the target, e.g., “Krone” and “Ring” [“ring”]; one cue was broadly unrelated to the target, e.g., “Krone” and “Fenster” [“window”]; and one cue was identical to the target, e.g., “Krone” and “Krone”. Each cue and target had four different picture exemplars (e.g., a plain golden crown, a plain silver crown, a crown with rubies, a crown with diamonds) and four different word exemplars, which were formed by varying font (Calibri or Times New Roman) and capitalization (uppercase or sentence case). The pictorial exemplars were obtained with Google Images searches and the THINGS database (https://things-initiative.org/). When necessary, the images were resized and had their backgrounds removed to ensure they all had a white background and measured 800 × 600 pixels.
The stimuli were distributed across eight experimental lists, with each list displaying four different exemplars of each cue and target (two of the four words and two of the four pictures). Half the lists started with a block of trials in the word-picture condition (when the cue is a word and the target is a picture), and half, with a block of trials in the picture-word condition. Each cue exemplar appeared once per list. For cues that did not match the target, the cue concept appeared four times in total. When the cue was the same as the target, however, the concept was repeated more frequently: four times as a cue and 16 times as a target. These 16 target appearances were paired with each of the four types of cues – phonologically related, semantically related, unrelated, and identical – four times each. The trials where the cue and the target matched were never compared to the trials where they did not match, so the different number of concept repetitions in these conditions does not affect the results. The items were counterbalanced across lists so that each cue and target exemplar would be presented first once, e.g., each exemplar of “crown” appeared once as the first occurrence of the target concept “crown” across the eight experimental lists. Similarly, items were counterbalanced across lists so that each target appeared first in each similarity condition (phonological, semantic, unrelated, or identical) an equal number of times. For example, across the eight lists, the first appearance of the target “crown” occurred twice with a phonologically related cue, twice with a semantically related cue, twice with an unrelated cue, and twice with an identical cue. See “counterbalancing_scheme.xlsx” in the OSF repository for more details. Moreover, the items were pseudorandomized using Mix (van Casteren & Davis, 2006) to obey the following constraints: a distance of at least 33 trials between the same concept; a maximum of four repetitions of the same response key or the same similarity condition; and a probability of occurrence of any pair of consecutive response keys that is no more likely than what would be expected by chance. Finally, to ensure that participants pressed the match and nonmatch keys equally often, 320 filler items were added, with 75% of them having identical cues and targets and 25% having unrelated cues and targets. The complete set of experimental lists can be found in the project’s OSF repository.
Measures
Semantic and phonological similarity ratings between cues and targets were obtained with a rating task, following Roebuck and Lupyan (2020). This was a simple online survey administered on the JATOS platform (Lange et al., 2015) based on a jsPsych (de Leeuw, 2015) experiment. In the task, 49 participants responded on a scale from 1 (completely different) to 7 (identical) how similar each word pair was in terms of “Bedeutung” [“meaning”] and “Klang” [“sound”]. The mean rating values related to the former were taken as the measure of semantic similarity in the analyses, while the mean rating values related to the latter were taken as the phonological similarity measure. Due to belatedly noticed cultural/linguistic issues, five items1 were replaced after the rating task had been completed. For these new items, a separate sample of 109 participants provided ratings in the same manner as described above.
The main measure of individual differences in inner speech propensity was the mean of the item ratings composing the Internal Verbalization scale of the IRQ-G (i.e., the sum of the ratings divided by the number of items). The content of the scale can be seen in Supplementary Material S1. In addition, for exploratory analyses, measures of specific forms of inner speech were obtained by taking the mean score of each of the five subscales of the VISQ-R: Dialogic – how much one’s internal speech occurs in the form of a conversation; Condensed – how abbreviated one’s inner speech is (as opposed to occurring in complete grammatical sentences); Other People – how much one hears other people’s voices in their mind’s ear; Evaluative / Critical – how much a person evaluates their behavior through inner speech / how much they experience negative emotions because of their inner speech; and Positive / Regulatory – how much one’s inner speech engenders positive emotions / outcomes. The complete list of items in the German VISQ-R and the original VISQ-R can be found in Supplementary Material S2.
Procedure
The task was presented with PsychoPy 2022.2.5 and was performed individually on a desktop computer in the Babelfisch psycholinguistics laboratory of the University of Vienna. Electroencephalographic data were recorded while participants performed the task, but these data are not discussed herein. Participants were asked to indicate whether an image and a word matched as accurately and as quickly as possible using the keyboard (key “k” for match, key “f” for non-match). They read the task instructions and did four practice trials before moving on to the main part of the task. The trial structure (Figure 1) was as follows: first, a fixation cross appeared in the center of the screen for 500 milliseconds; then, a cue appeared in the center of the screen for another 500 milliseconds; finally, the target appeared in the center of the screen and remained visible until the participant responded. The task was divided into eight blocks of 80 trials each (640 in total, 320 experimental trials), with breaks between blocks that lasted as long as the participant wished, typically around one to three minutes. During these breaks, the experimenter interacted with the participant to help them stay alert and engaged in the task. Participants were also offered snacks and coffee to further prevent fatigue, as the two behavioral tasks were performed consecutively by most participants (23 had to do them in two separate sessions due to technical issues). When the sessions involved both tasks, the complex categorization task was always performed first. The average duration of the word-picture verification task was around 50 minutes, while the total session duration varied between approximately 75 and 110 minutes (including EEG preparation, reading of the informed consent form, and execution of both behavioral tasks).
Design and Data Analysis
This experiment had a mixed design with between-participant (Internal Verbalization) and within-participant (phonological and semantic similarities, cue-target type) variables. There were three continuous independent variables: phonological similarity ratings, semantic similarity ratings, and Internal Verbalization; one dichotomous independent variable: cue-target type (word-picture – when the cue is a word and the target is a picture, and picture-word – when the cue is a picture and the target is a word); two continuous dependent variables: reaction times when cues and targets corresponded to the same concept (matchRT), and reaction times when cues and targets corresponded to different concepts (nonmatchRT); and one dichotomous dependent variable, namely accuracy when cues and targets did not match.
The exploratory analyses included as independent variables the similarity ratings plus the five scales of the VISQ-R, all of which were continuous. Only nonmatchRT and accuracy were included as dependent variables in the exploratory analyses.
We performed all modeling in R (Version 4.3.1; R Core Team, 2021) using mixed-effects models with varying intercepts for participants and targets, as well as random slopes for the effects of interest, fitted with lmerTest (Kuznetsova et al., 2017) and lme4 (Bates et al., 2015). We aimed for parsimonious mixed models by comparing models with a more complex random-effect structure to simpler models based on the Akaike information criterion (AIC), in order to maximize power while maintaining acceptable Type I error rates (see Matuschek et al., 2017 for a similar approach). Following Roebuck and Lupyan (2020), we excluded 1.6% of the trials for having reaction times that were either too long (> 1500 ms) or too short (< 150 ms). Only correct trials were included in the analysis of reaction times. We standardized the continuous predictors, sum-coded the categorical predictor (cue-target type = -0.5 for picture, +0.5 for word), and log-transformed the reaction times.
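As a hedged illustration, the trial exclusion and coding steps could look as follows (trials, rt, and the other column names are placeholders we introduce here, not the authors’ actual variable names):

library(dplyr)

trials <- trials %>%
  filter(rt > 150, rt < 1500) %>%   # drop too-fast and too-slow trials
  mutate(
    sem_sim  = as.numeric(scale(semantic_similarity)),       # standardize
    phon_sim = as.numeric(scale(phonological_similarity)),
    verbal   = as.numeric(scale(internal_verbalization)),
    cue_type = if_else(cue_target_type == "picture", -0.5, 0.5),  # sum coding
    log_rt   = log(rt)              # log-transform reaction times
  )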
Complex Categorization Task
Materials and Measures
The materials for this task were kindly provided by Peter Langland-Hassan and consist of 74 sets2 of five full-color high-resolution images obtained through internet searches. The items were normed by Langland-Hassan and colleagues (2021)3, resulting in a Trial Concreteness score for each of them based on the sum of two types of human ratings: 1) visual similarity (“VIS” ratings), referring to how visually similar the target and the matching image are, relative to the target and the other choice images; and 2) common setting (“COM” ratings), referring to how frequently the target image and the matching image occur together in a common setting, relative to the other choices.
Executive Function Tasks
To tease apart effects of language from those of executive functions on the complex categorization task, three executive function tasks were administered, namely an operation span task (tapping working memory), a Simon task (inhibitory control), and a Berg card sorting task (cognitive flexibility).
The operation span task worked as follows: participants were shown a simple mathematical equation like 4+3=7 and were asked to indicate whether it was true or false; then, they were presented with a letter for recall. Each trial comprised between three and seven such operation-letter pairs (the set size), and each set size occurred three times, for a total of 75 operation-letter pairs overall. The measure from this task that was included in the analyses was the mean edit-distance score, which refers to the number of target letters participants had to recall minus the number of changes in letter position and content that were needed to transform what they recalled into the target sequences (Gonthier, 2023).
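To make the scoring concrete, a minimal sketch of the per-trial computation is given below, assuming recall is stored as a letter string; adist() computes the Levenshtein distance in base R, and the function name is our own:

# Per-trial score (after Gonthier, 2023): number of target letters minus the
# number of edits needed to turn the recalled sequence into the target
edit_distance_score <- function(target, recalled) {
  nchar(target) - drop(adist(recalled, target))
}

edit_distance_score("FKQ", "FKQ")  # perfect recall: 3 - 0 = 3
edit_distance_score("FKQ", "FQK")  # two substitutions: 3 - 2 = 1

The participant-level measure is then the mean of these per-trial scores.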
In the Simon task, participants were asked to respond to the color of a circle by pressing either “r” on the left of the keyboard if the circle was yellow or “p” on the right of the keyboard if the circle was blue. The circle appeared on either the left or the right side of the screen, with equal probability for all position-color combinations. When the position of the circle matched the position of the correct response key (e.g., left and “r”), the trial was congruent; when the positions did not match (e.g., left and “p”), the trial was incongruent. The trial structure was as follows: first, a fixation cross appeared in the center of the screen for 500 ms; then, the colored circle appeared on either the left or the right of the screen and remained there until participants responded. After a practice block of 16 trials, participants completed a total of four blocks of 96 experimental trials each. The key measure extracted from this task was the Simon effect, computed by subtracting each participant’s mean reaction time on congruent trials from their mean reaction time on incongruent trials (Welch & Seitz, 2013).
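The Simon effect can be derived per participant as sketched here (again with placeholder column names):

library(dplyr)
library(tidyr)

simon_effects <- simon_trials %>%
  filter(correct) %>%
  group_by(participant, congruency) %>%
  summarise(mean_rt = mean(rt), .groups = "drop") %>%
  pivot_wider(names_from = congruency, values_from = mean_rt) %>%
  mutate(simon_effect = incongruent - congruent)  # larger = more interference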
In the Berg sorting task, participants saw four cards differing in quantity, color, and shape at the top of the screen and a new card below them, which they were tasked with matching to one of the four cards. The correct answer was based on a classification rule that was undisclosed to the participants and that changed after 10 cards had been correctly and consecutively sorted. Participants were supposed to discover each rule through a process of trial-and-error based on the feedback they received regarding the accuracy of their answer. This feedback was provided in written form (“Korrekt” [“correct”] or “Falsch” [“false”]) for 500 ms after each attempt. In total, participants completed 64 trials. The key measure from this task was the number of perseverative errors each participant made, that is, the number of times participants continued to sort cards according to a rule after it was no longer valid (Miles et al., 2021).
All three executive function tasks were completed online on participants’ computers and were implemented on the JATOS platform (Lange et al., 2015). We used translated versions of tasks created with jsPsych (de Leeuw, 2015) by Luthra and Todd (2019), Eben and colleagues (2020), and Vékony (2022).
Procedure
Participants performed the task individually on a desktop computer in the Babelfisch psycholinguistics laboratory of the University of Vienna. They were asked to select an image from a four-image bottom array that went best with a target image at the top and were also asked to rate on a scale from 1 (“I am guessing”) to 4 (“I am very confident”) how confident they were that they had the correct answer. Response was given via mouse click. Participants completed four practice trials before moving on to the main part of the task. A trial started with the display of the five images. After an image was clicked, the rating scale appeared (see Figure 2). Once participants had made their choice, they were unable to select another image, but they could change their response confidence rating. The task was presented on the JATOS (Lange et al., 2015) platform based on a revised version of the jsPsych (de Leeuw, 2015) experiment used in Langland-Hassan and colleagues (2021), which was kindly shared by Peter Langland-Hassan. In total, there were 70 experimental trials presented within a single block, which took participants on average 12 minutes to complete.
Design and Data Analysis
This task had a mixed design with between-participant (Internal Verbalization) and within-participant (Trial Concreteness) variables. There were two continuous independent variables: Internal Verbalization and Trial Concreteness; two continuous dependent variables: response times to select an image (imageRT) and response times to indicate response confidence (confidenceRT); one dichotomous dependent variable: accuracy in selecting the correct image; and one ordinal dependent variable: level of response confidence. In addition, three between-participant continuous predictors relating to executive functions were added as control variables, namely the Simon effect, the mean edit-distance score in the operation span task, and the number of perseverative errors in the Berg sorting task.
For exploratory analyses, the component parts of Trial Concreteness, VIS and COM ratings (both continuous), were taken as independent variables in addition to Internal Verbalization. In other models, the VISQ-R scales were included as independent variables in combination with Trial Concreteness. The dependent variables were the same as those described above.
As in the analyses for the word-picture verification task, all modeling was performed in R (Version 4.3.1; R Core Team, 2021) using mixed-effects models with varying intercepts for participants and items. By-participant random slopes for the effect of Trial Concreteness were kept if the AIC of a model including them was lower than that of a model without them, again to fit the most parsimonious model balancing power and Type I error rates. The lmerTest (Kuznetsova et al., 2017), lme4 (Bates et al., 2015), and ordinal (Christensen, 2022) packages were used to fit models predicting response times, accuracy, and confidence levels, respectively. Only trials with correct responses were included in the analyses of response times and confidence measures. Due to the nature of the task, no outliers were removed. All predictors were standardized, and response times were log-transformed.
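For the ordinal confidence levels, the cumulative link mixed model could be specified along the following lines (a sketch with placeholder names; ordinal::clmm requires the outcome to be an ordered factor):

library(ordinal)

cat_data$confidence <- factor(cat_data$confidence, levels = 1:4, ordered = TRUE)

m_confidence <- clmm(
  confidence ~ Internal_Verbalization * Trial_Concreteness +
    (1 | participant) + (1 | item),
  data = cat_data
)
summary(m_confidence)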
Results
Study 1.1 – Validation of the IRQ-G
Translation
The first translations of the IRQ into German by the two independent translators (AT and JW) had some minor disparities, which were jointly discussed and resolved. In turn, the back-translation of this German version into English by a professional translator included some additions and minor changes in meaning compared to the original items (see Supplementary Material S4). For example, the German translation of the original item “I hear a running summary of everything I am doing in my head” was “Ich höre in meinem Kopf eine laufende Zusammenfassung von allem, was ich tue” [literally “I hear in my head a running summary of everything I am doing”], which was back-translated as: “I constantly hear in my mind an ongoing synopsis of all my activities”. These discrepancies were discussed by AT and JW together with CA, a German-speaking linguist, and were deemed to reflect the translator’s creativity rather than flaws in the German translation. In the next phase of the translation, two native German speakers familiar with the topic critically reviewed the items and suggested a few modifications to make them sound more natural. These were discussed with the translators until a final version was agreed upon by everyone on the team. This version was then presented to a small group of German-speaking pilot participants (both Austrians and Germans), who indicated that the questionnaire was generally easy to understand. Supplementary Material S1 shows the original IRQ and the final German translation.
Factorial Validity
Goodness of fit indices showed that the four-factor model based on Roebuck and Lupyan (2020) did not fit the data well: χ2 (588) = 1279.89, p < 0.0001, robust CFI = 0.768, robust RMSEA = 0.059 (90% CI = 0.054–0.063), SRMR = 0.084. Additionally, large standardized residuals and modification indices were observed, suggesting localized areas of inadequate fit in the model solution. Moreover, some of the indicator variables were poorly explained by their purported latent factor, as demonstrated by low squared standardized factor loadings (R2). Finally, some items were found not to be sufficiently related to their putative latent variable, as demonstrated by nonsignificant factor loadings. Parameter estimates for this solution can be seen in Supplementary Material S3.
Given these results, we proceeded to iteratively respecify the model by taking into account modification indices, expected parameter change values, standardized residuals, proportion of explained variance, and factor loadings in a theoretically sound manner. Figure 3 shows a path diagram with the summary of the last solution achieved by this procedure. As illustrated in the figure, the model includes only three factors: the Orthographic Imagery variable did not contain enough high-quality indicators to be retained. In addition, five of the eight original items in the Representational Manipulation scale, two of the 12 original items in the Internal Verbalization scale, and two of the 10 items from the Visual Imagery scale were dropped for not being sufficiently explained by their respective latent variables or, in one case, for cross-loading issues. The direction and magnitude of the factor loadings for the remaining items indicate that they are all significantly related to their respective latent variables in the expected manner. Goodness of fit indices suggest that the new three-factor model fits the data well: χ2 (183) = 284.99, p < 0.0001, robust CFI = 0.947, robust RMSEA = 0.041 (90% CI = 0.032–0.050), SRMR = 0.059.
Convergent Validity
The correlations reported in this section are based on the final German version of the Internal Verbalization scale after the CFA, which excludes two of the items present in the original IRQ (see Supplementary Material S1). Conversely, the items composing the five German VISQ-R scales are the same as those in the original VISQ-R.
Similarly to the results in Roebuck and Lupyan (2020), Internal Verbalization showed the highest correlation with the Dialogic scale (ρ = 0.46, p < 0.0001), suggesting that the former indexes to a certain extent how often people have silent conversations with themselves (as opposed to inner monologues). This is to be expected given that some items from the Internal Verbalization factor refer to “a conversation”, e.g., “I think about problems in my mind in the form of a conversation with myself”, and the Dialogic factor contains statements like “When I am talking to myself about things in my mind, it is like I am having a conversation with myself”. The second highest correlation was with the Evaluative / Critical scale (ρ = 0.39, p < 0.0001), which includes items such as “I think in inner speech about what I have done, and whether it was right or not”. This might be due to the presence of items in the Internal Verbalization scale relating to social situations, such as “When thinking about a social problem, I often talk it through in my head”. The third highest correlation was with the Positive/Regulatory inner speech scale (ρ = 0.28, p < 0.0001), which includes statements such as “When I think to myself in words about upsetting things, I can easily change topics in my mind and talk to myself about other things”. This weak correlation might be due to the presence of the item “My inner speech helps my imagination” in the Internal Verbalization scale, among other reasons, as it too refers to a positive effect of inner speech on one’s mental state. Another small positive correlation was found with the Other People scale (ρ = 0.24, p < 0.0001), which features items like “I hear other people’s actual voices in my head, saying things that they actually once said to me”. This relationship might stem from items such as “When thinking about a social problem, I often talk it through in my head” from the Internal Verbalization scale, as hearing other people’s voices is more likely when the content of inner speech is an interpersonal situation. Finally, a small negative correlation was observed with the Condensed factor of the VISQ-R (ρ = -0.21, p < 0.0001), which contains statements like “I think to myself in words using brief phrases and single words rather than full sentences”, suggesting that, to some limited extent, the Internal Verbalization scale indexes more expanded forms of inner speech.
Reliability
Internal Consistency
Based on both Cronbach’s α and McDonald’s ω, all scales that remained after the CFA show good internal consistency, with the lowest values being found for the Representational Manipulation scale (α = 0.78, CI = 0.74–0.82; ω = 0.80, CI = 0.77–0.84), and the highest values being found for the Visual Imagery scale (α = 0.83, CI = 0.80–0.85; ω = 0.83, CI = 0.80–0.86). Internal Verbalization had values in-between: α = 0.80 (CI = 0.77–0.83) and ω = 0.81 (CI = 0.77–0.83).
Test-Retest
The intraclass correlation coefficient (ICC) associated with each subscale of the IRQ-G indicates that participants’ scores were moderately stable over time, with more consistent scores being linked to Visual Imagery (ICC = 0.74, CI = 0.68–0.80) and Internal Verbalization (ICC = 0.73, CI = 0.67–0.79), and less stable scores being linked to Representational Manipulation (ICC = 0.61, CI = 0.53–0.68).
Study 1.2 – Behavioral Tasks
The models presented below contain the random-effect structure associated with the lowest AIC value. Some estimates and their associated standard errors, t-values, and p-values are not explicitly reported below but can be found in the project’s OSF repository. We present estimates on the log scale for models fitted on log-transformed reaction time data.
Word-picture Verification Task
Participants’ mean Internal Verbalization score was 3.87 (SD = 0.8, range = 1.5–4.9). Excluding outliers and collapsing across conditions, participants took on average 540 ms (SD = 180) to indicate whether the targets matched the cues. They gave correct responses 98% (SD = 13) of the time on average.
Based on visual inspection and correlation analyses of both match and non-match trials, there was no evidence of a speed-accuracy trade-off in this task, so here we focus on reaction times. Accuracy-related results can be found in the OSF repository. The models presented below echo the analytical decisions made by Roebuck and Lupyan (2020).
Effects of Cue-Target Type and IRQ Factors on MatchRT
To assess the effects of cue-target type and the IRQ factors on matchRT, the following model was fitted:
log(matchRT) ~
Internal_Verbalization*Cue_target_type +
Representational_Manipulation*Cue_target_type +
Visual_Imagery*Cue_target_type +
(Cue_target_type|participant) + (1|target)
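In lmerTest syntax, this formula corresponds to a call along these lines (match_trials is our placeholder for the data frame of correct matching trials):

library(lmerTest)  # lmer() with Satterthwaite-approximated p-values

m_match <- lmer(
  log(matchRT) ~ Internal_Verbalization*Cue_target_type +
    Representational_Manipulation*Cue_target_type +
    Visual_Imagery*Cue_target_type +
    (Cue_target_type | participant) + (1 | target),
  data = match_trials
)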
Participants with higher Internal Verbalization scores responded faster than those with lower scores when the items matched (b = -0.04, SE = 0.021, t = -2.04, p = 0.046). No main effects were observed for cue-target type or the other IRQ factors, and there were no significant interactions (all p > 0.4).
Effects of Cue-Target Type and Similarity Ratings on NonmatchRT
To assess the effects of cue-target type and phonological and semantic similarity ratings on nonmatchRT, we fitted the following model:
log(nonmatchRT) ~ Cue_target_type*Semantic_similarity + Cue_target_type*Phonological_similarity +
(Cue_target_type + Semantic_similarity|participant) +
(1|target)
Participants were faster in the picture-word condition (b = 0.03, SE = 0.01, t = 2.49, p = 0.02), relative to the average reaction time across all cue-target types. In addition, participants were slower both when the items were more phonologically related (b = 0.02, SE = 0.00, t = 9.87, p < 0.0001) and when they were more semantically related (b = 0.03, SE = 0.00, t = 10.17, p < 0.0001). Moreover, participants were particularly slowed by phonological similarity in the picture-word condition, as indicated by an interaction between phonological similarity and cue-target type (b = -0.01, SE = 0.00, t = -2.09, p = 0.04). Conversely, participants were especially slowed by semantic similarity in the word-picture condition, as shown by an interaction between semantic similarity and cue-target type (b = 0.03, SE = 0.00, t = 7.40, p < 0.0001). Figure 4 illustrates these results.
Effects of Internal Verbalization and Similarity Ratings on NonmatchRT
To evaluate whether and how Internal Verbalization, phonological similarity, and semantic similarity affected nonmatchRT, we fitted separate models for each cue-target type, i.e., one for the picture-word condition and another for the word-picture condition. The final syntax for the latter was:
log(nonmatchRT) ~
Semantic_similarity*Internal_Verbalization +
Phonological_similarity*Internal_Verbalization +
(Phonological_similarity + Semantic_similarity||participant) + (1|target)
For the picture-word condition, a similar model was fitted, but the by-participant random slopes for the effect of phonological similarity were removed because their variances were estimated to be very close to zero, impeding model convergence.
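The double bars in the random-effect term above instruct lme4 to estimate the random slopes without correlations among them; the convergence problem noted for the picture-word condition can be diagnosed along these lines (again a sketch; dat_picture_word is a hypothetical data frame):

library(lme4)

m <- lmer(log(nonmatchRT) ~ Semantic_similarity*Internal_Verbalization +
            Phonological_similarity*Internal_Verbalization +
            (Phonological_similarity + Semantic_similarity||participant) +
            (1|target),
          data = dat_picture_word)

isSingular(m)   # TRUE indicates a (near-)zero variance component
VarCorr(m)      # shows which random-slope variance has collapsed to ~0
# If the by-participant slope for Phonological_similarity is ~0, drop it
# and refit, as described above.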
Participants with higher Internal Verbalization scores were generally faster than those with lower scores both in the word-picture condition (b = -0.05, SE = 0.02, t = -2.64, p = 0.01) and in the picture-word condition (b = -0.05, SE = 0.02, t = -2.21, p = 0.03). Furthermore, regardless of cue-target type, participants were slower when phonological similarity increased (word-picture condition: b = 0.02, SE = 0.00, t = 4.96, p < 0.0001; picture-word condition: b = 0.02, SE = 0.00, t = 8.29, p < 0.0001) as well as when semantic similarity increased (word-picture condition: b = 0.05, SE = 0.00, t = 12.53, p < 0.0001; picture-word condition: b = 0.01, SE = 0.00, t = 4.15, p < 0.0001). Finally, in the word-picture condition, higher semantic similarity slowed the response of individuals with stronger inner speech propensity to a larger degree than it did those with weaker verbalization tendencies, yielding an interaction between Internal Verbalization and semantic similarity ratings (b = 0.01, SE = 0.00, t = 2.32, p = 0.02). Conversely, there was no interaction between Internal Verbalization and phonological similarity in the word-picture condition (b = -0.003, SE = 0.003, t = -1.04, p = 0.30). No significant interactions were observed in the picture-word condition (all p > 0.1). See Figure 5 for a graphical representation of these results.
Effects of Other IRQ Factors and Similarity Ratings on NonmatchRT
To assess possible effects of Visual Imagery and Representational Manipulation on nonmatchRT, we fitted separate models for each cue-target type and IRQ factor, as follows:
log(nonmatchRT) ~ Semantic_similarity*IRQ_factor +
Phonological_similarity*IRQ_factor + (?|participant) +
(1|target)
The “?” indicates different random slopes for the different models: for the word-picture condition, both semantic similarity and phonological similarity were included, while only semantic similarity was kept in models focusing on the picture-word condition.
Higher similarity ratings were associated with slower reaction times regardless of cue-target type, but there were no main effects of Visual Imagery (picture-word condition: b = -0.01, SE = 0.02, t = -0.42, p = 0.68; word-picture condition: b = 0.01, SE = 0.02, t = 0.66, p = 0.51) or Representational Manipulation (picture-word condition: b = -0.00002, SE = 0.02, t = -0.001, p = 0.99; word-picture condition: b = 0.004, SE = 0.02, t = 0.16, p = 0.87). Furthermore, no significant interactions were observed between these IRQ factors and either semantic or phonological similarity, regardless of cue-target type.
Effect of Specific Inner Speech Forms on NonmatchRT
To explore the effects of specific forms of inner speech, as indexed by the VISQ-R, the following model was fitted:
log(nonmatchRT) ~ Semantic_similarity*Evaluative +
Phonological_similarity*Evaluative +
Semantic_similarity*Dialogic +
Phonological_similarity*Dialogic +
Semantic_similarity*Positive +
Phonological_similarity*Positive +
Semantic_similarity*Other_People +
Phonological_similarity*Other_People +
Semantic_similarity*Condensed +
Phonological_similarity*Condensed +
(Semantic_similarity + Phonological_similarity||participant) +
(1|target)
Once again, higher phonological and semantic similarity ratings were significantly associated with slower reaction times, regardless of cue-target type. In addition, in the word-picture condition, two effects were observed involving the Condensed subscale: on the one hand, participants reporting more abbreviated or abstract inner speech took longer to respond overall (b = 0.06, SE = 0.02, t = 2.69, p = 0.01); on the other hand, the same individuals were less slowed by increases in semantic similarity, as reflected by an interaction between Condensed scores and semantic similarity ratings (b = -0.01, SE = 0.004, t = -2.23, p = 0.03). No other significant effects were found (all p > 0.4).
Complex Categorization Task
Across conditions, participants took on average 6.13 seconds (SD = 5.47) to select an image, picking the correct one 86% (SD = 34) of the time on average. To indicate response confidence, participants took 2.36 seconds (SD = 1.63) on average, choosing “I am very confident” 59% of the time, “I am quite confident” around 25% of the time, “I am not confident” around 12% of the time, and “I am guessing” about 4% of the time.
Trial Concreteness, Internal Verbalization, and Executive Function Effects on Response Times and Accuracy
To see whether Internal Verbalization facilitates categorization in more abstract contexts regardless of executive function capacity, the following model was fitted on response times to choose an image (imageRT) and accuracy:
Response_variable ~
Trial_Concreteness*Simon_Effect +
Trial_Concreteness*Number_Perseverative_Errors_Card_Sorting_Task +
Trial_Concreteness*Operation_Span_Edit_Score +
Trial_Concreteness*Internal_Verbalization + (?|participant) + (1|target)
The “?” represents the fact that the final model for accuracy did not include random slopes, while the best model for imageRT included random slopes for the effect of Trial Concreteness.
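Because imageRT is continuous while accuracy is binary, the two response variables call for different model families; the sketch below shows the general shape of such fits, assuming lme4/lmerTest and a hypothetical data frame cat_dat, with some executive-function predictors omitted for brevity.

library(lmerTest)

# Response times: Gaussian mixed model on log-transformed imageRT,
# with by-participant random slopes for Trial_Concreteness.
m_rt <- lmer(log(imageRT) ~ Trial_Concreteness*Internal_Verbalization +
               Trial_Concreteness*Simon_Effect +
               (Trial_Concreteness|participant) + (1|target),
             data = cat_dat)

# Accuracy: binomial GLMM on the 0/1 responses, random intercepts only.
m_acc <- glmer(accuracy ~ Trial_Concreteness*Internal_Verbalization +
                 Trial_Concreteness*Simon_Effect +
                 (1|participant) + (1|target),
               data = cat_dat, family = binomial)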
As illustrated in Figure 6 panels a) and b), when the context was more concrete, participants were both faster to choose an image (b = -0.29, SE = 0.04, t = -6.95, p < 0.0001) and more accurate (b = 1.32, SE = 0.16, z = 8.16, p < 0.0001). In addition, participants were more accurate when their inhibitory control was better, as indexed by a lower Simon effect (b = -0.20, SE = 0.09, z = -2.27, p = 0.02). Other executive function measures and Internal Verbalization, however, did not predict imageRT or accuracy, regardless of the level of abstractness.
Trial Concreteness, Internal Verbalization, and Executive Function Effects on Time to Indicate Response Confidence and Confidence Level
To estimate the effects of Internal Verbalization and Trial Concreteness on metacognitive judgments of confidence, models were fitted on both confidenceRT and confidence levels, as follows:
Confidence_variable ~
Trial_Concreteness*Simon_Effect +
Trial_Concreteness*Number_Perseverative_Errors_Card_Sorting_Task +
Trial_Concreteness*Operation_Span_Edit_Score +
Trial_Concreteness*Internal_Verbalization + (1|participant) + (1|target)
Trial Concreteness was a significant predictor of confidenceRT (b = -0.06, SE = 0.01, t = -4.62, p < 0.0001; see Figure 6) as well as confidence level (b = 1.32, SE = 0.16, z = 8.06, p < 0.0001), with more concrete trials being linked to faster indications of confidence and higher confidence levels. No significant effects were found involving Internal Verbalization or any of the executive function tasks.
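The formula above does not commit to a model family for the four-point confidence ratings; the reported z-statistics are compatible with, for instance, a cumulative link (ordinal) mixed model. Purely as a hypothetical illustration, assuming the ordinal package (the models actually used are documented in the OSF repository):

library(ordinal)

# Treat the 4-point confidence scale as an ordered factor.
cat_dat$confidence <- factor(cat_dat$confidence,
                             levels = c("guessing", "not_confident",
                                        "quite_confident", "very_confident"),
                             ordered = TRUE)

# Hypothetical cumulative link mixed model for confidence levels.
m_conf <- clmm(confidence ~ Trial_Concreteness*Internal_Verbalization +
                 (1|participant) + (1|target),
               data = cat_dat)
summary(m_conf)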
Non-preregistered Exploratory Analyses – Bayes Factors
We conducted a Bayes Factor analysis to estimate the relative evidence for models fitted on imageRT with an interaction between Internal Verbalization and Trial Concreteness over a null model without this interaction, as this was the effect most important to our hypotheses. To explore the impact of the choice of priors on the Bayes Factors, we considered alternative models with mean zero and different prior standard deviations for the interaction effect. The results show anecdotal evidence for the null model when the prior standard deviation of the interaction term was between 0.0001 and 0.01 (BF10 = 0.99 for SD = 0.0001; BF10 = 0.96 for SD = 0.0005; BF10 = 0.97 for SD = 0.001; BF10 = 0.83 for SD = 0.01), moderate evidence for the null model when the prior standard deviation was 0.05 (BF10 = 0.27) or 0.1 (BF10 = 0.14), very strong evidence for the null model when the standard deviation was 0.5 (BF10 = 0.03) or 1 (BF10 = 0.01), and extreme evidence for the null model when the standard deviation was 2 or higher (BF10 = 0.007 for SD = 2; BF10 = 0.001 for SD = 10; BF10 = 0.0003 for SD = 50). Thus, once larger effect sizes were made more probable by increasing the standard deviation of the prior, more and more evidence was obtained in favor of a model without the interaction between Internal Verbalization and Trial Concreteness, supporting the absence of this effect when more theoretically interesting effect sizes are assumed.
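One way to implement such a prior-sensitivity analysis is to fit the alternative and null models with mean-zero normal priors of varying standard deviation on the interaction coefficient and compare them via bridge sampling; the sketch below, assuming brms and hypothetical object names, illustrates the idea (our exact code is available in the OSF repository).

library(brms)

# Null model: no Trial_Concreteness x Internal_Verbalization interaction.
# In a full analysis, proper priors would also be set on the remaining
# parameters so that the bridge-sampling estimates are well defined.
m_null <- brm(log(imageRT) ~ Trial_Concreteness + Internal_Verbalization +
                (1|participant) + (1|target),
              data = cat_dat, save_pars = save_pars(all = TRUE))

bf_for_sd <- function(prior_sd) {
  m_alt <- brm(log(imageRT) ~ Trial_Concreteness*Internal_Verbalization +
                 (1|participant) + (1|target),
               data = cat_dat,
               prior = set_prior(paste0("normal(0, ", prior_sd, ")"),
                                 coef = "Trial_Concreteness:Internal_Verbalization"),
               save_pars = save_pars(all = TRUE))
  bayes_factor(m_alt, m_null)   # BF10 for the interaction, via bridge sampling
}

lapply(c(0.001, 0.01, 0.1, 1, 10), bf_for_sd)   # sweep the prior SDs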
Effects of Specific Inner Speech Forms
We explored how different forms of inner speech, as indexed by the VISQ-R, impact complex categorization, using the following model:
Dependent_variable ~
Trial_Concreteness*Simon_Effect +
Trial_Concreteness*Number_Perseverative_Errors_Card_Sorting_Task +
Trial_Concreteness*Operation_Span_Edit_Score +
Trial_Concreteness*Evaluative +
Trial_Concreteness*Condensed +
Trial_Concreteness*Dialogic +
Trial_Concreteness*Positive +
Trial_Concreteness*Other_People + (?|participant) + (1|target)
The “?” indicates that the by-participant random slopes for the effect of Trial Concreteness were sometimes absent (in models predicting accuracy, confidence level, and confidenceRT) and sometimes present (in the model predicting imageRT), depending on what the most parsimonious solution was.
The results show once again that participants took less time to choose an image when Trial Concreteness was higher (b = -0.28, SE = 0.04, t = -6.89, p < 0.0001). In addition, participants reporting more evaluative/critical inner speech were faster overall (b = -0.08, SE = 0.04, t = -2.34, p = 0.02). No other predictor was significantly linked to imageRT.
When it comes to accuracy, there were beneficial effects of Trial Concreteness (b = 1.35, SE = 0.16, z = 8.18, p < 0.0001) and inhibitory control (b = -0.23, SE = 0.09, z = -2.52, p < 0.01), with participants being more accurate when their inhibition was better (lower Simon effect) and when the trial was more concrete. Furthermore, there was a significant interaction between Condensed and Trial Concreteness, indicating that people who reported more abbreviated inner speech were more accurate than those reporting more expanded inner speech, particularly when the trial was more abstract (b = -0.17, SE = 0.07, z = -2.48, p = 0.01; see Figure 7).
In turn, confidenceRT and confidence levels were both affected by Trial Concreteness (b = -0.06, SE = 0.01, t = -4.64, p < 0.0001; b = 1.33, SE = 0.16, z = 8.04, p < 0.0001, respectively), and confidenceRT was additionally affected by cognitive flexibility, with shorter reaction times being linked to more perseverative errors (less flexibility) (b = -0.05, SE = 0.02, t = -2.04, p = 0.046). However, no significant effects were observed involving any of the VISQ-R measures.
Non-preregistered Exploratory Analyses – Visual Similarity vs. Common Setting Ratings
To explore how Internal Verbalization interacted with the different component parts of Trial Concreteness, namely visual similarity (VIS) and common setting association (COM) ratings, the following model was fitted:
Dependent_variable ~
VIS*Internal_Verbalization + COM*Internal_Verbalization + Operation_Span_Edit_Score*VIS +
Operation_Span_Edit_Score*COM +
Simon_Effect*VIS + Simon_Effect*COM +
Number_Perseverative_Errors_Card_Sorting_Task*COM +
Number_Perseverative_Errors_Card_Sorting_Task*VIS +
(?|participant) + (1|target)
The “?” represents the fact that by-participant random slopes for the effect of Trial Concreteness were included in the most parsimonious model predicting imageRT but were absent in the other models.
Both VIS and COM ratings were linked to shorter imageRT (COM: b = -0.24, SE = 0.04, t = -5.52, p < 0.0001; VIS: b = -0.11, SE = 0.04, t = -2.47, p = 0.02) and higher accuracy (COM: b = 1.09, SE = 0.17, z = 6.34, p < 0.0001; VIS: b = 0.53, SE = 0.14, z = 3.65, p < 0.001). In addition, a larger Simon effect, that is, worse inhibitory control, was linked to lower accuracy overall (b = -0.21, SE = 0.09, z = -2.40, p = 0.007) as well as to a higher impact of common setting associations, yielding a significant interaction between Simon effect and COM (b = -0.19, SE = 0.08, z = -2.35, p = 0.02). In other words, varying how frequently the items co-occurred made a bigger difference to individuals with better inhibition. Similarly, there was an interaction between VIS and the number of perseverative errors in the card sorting task, indicating that more visual similarity was more beneficial (and less visual similarity was more detrimental) to the performance of participants with better cognitive flexibility (b = -0.12, SE = 0.06, z = -2.21, p = 0.03).
When it comes to metacognitive judgments of confidence, more abstract contexts, as measured by COM ratings, were linked to longer times to indicate confidence overall (b = -0.06, SE = 0.01, t = -4.45, p < 0.0001). Moreover, a modulatory effect of Internal Verbalization was found, with COM-indexed abstractness increasing confidenceRT to a higher degree for participants with stronger verbalization tendencies, relative to those with weaker tendencies (b = -0.01, SE = 0.01, t = -1.98, p = 0.047; see Figure 8). In other words, individuals with more inner speech took longer to indicate confidence, in particular when the target and the matching image occurred together less frequently compared to the target and the alternatives. When looking at confidence levels, more abstract contexts were associated with less confidence, regardless of how abstractness was measured, but participants with fewer perseverative errors in the card sorting task, that is, those with higher cognitive flexibility, were more confident than those with more perseverative errors when the trial was more abstract according to COM ratings (b = 0.11, SE = 0.05, z = 2.20, p = 0.03). In addition, there was a trend for participants with higher Internal Verbalization scores to be more confident than those with lower scores when the trial was more abstract according to COM ratings (b = -0.09, SE = 0.05, z = -1.96, p = 0.050).
Discussion
This study sought to probe the effects of individual differences in inner speech propensity on object recognition in the context of matching words and pictures, and on abstract thought and metacognition in the context of a complex categorization task, further aiming to validate a German version of the IRQ (IRQ-G) to be able to measure this interindividual variation in German-speaking samples. To the best of our knowledge, only two previous studies had investigated how inner speech affects object recognition from an individual-differences perspective (Nedergaard & Lupyan, 2024; Roebuck & Lupyan, 2020), and no study had used this approach to study how inner speech affects abstract categorization and metacognitive judgments of response confidence.
IRQ-G Questionnaire Validation
Our expectations concerning the psychometric properties of the IRQ-G were mostly fulfilled. Specifically, our analyses of the IRQ-G: a) reproduced three of the four factors reported in the English version, namely Internal Verbalization, Visual Imagery, and Representational Manipulation, failing to reproduce the Orthographic Imagery scale; b) showed that the Internal Verbalization scale significantly correlates with VISQ-R scales indexing different forms of inner speech, most strongly (and positively) with the Dialogic and Evaluative scales, and most weakly (and negatively) with the Condensed scale, broadly in line with the pattern found for the original IRQ; c) showed that all three scales of the IRQ-G are moderately stable over time, with the most reliable being Visual Imagery and Internal Verbalization, echoing the findings based on the original IRQ; and d) indicated that all subscales have high internal consistency, i.e., contain items that deliver homogeneous scores, also consistent with the original questionnaire.
Thus, the IRQ-G offers valid and reliable measures of the tendencies to experience thoughts in the form of verbal and visual information as well as the capacity to manipulate visual representations. There are, however, a couple of differences compared to the original IRQ that are worth discussing in more detail. The first is that we could not obtain a factor linked to orthographic imagery. We speculate that the failure to reproduce this dimension might be related to differences between the English and the German orthographic systems: compared to the transparency of the latter, the opaqueness of English orthography might increase the adaptiveness of thinking in the form of written language as a means to separate phonological and orthographic representations of words and thus ensure correct spelling. Nonetheless, it is relevant to note that the theoretical significance of an orthography-related factor was dubious from the beginning, as it emerged unexpectedly and correlated with quite dissimilar constructs, such as how sensitive a person is to the perception of others (Roebuck & Lupyan, 2020). Hence, speculations concerning this factor should be considered cautiously. The second is that the correlations between the German Internal Verbalization scale and the German VISQ-R scales were weaker than the correlations involving the original Internal Verbalization scale. We attribute this to the fact that the English original was compared to an older version of the VISQ (McCarthy-Jones & Fernyhough, 2011), whereas we used a newer, revised version of the instrument (Alderson-Day et al., 2018). With different items, different correlation strengths are likely to be observed. Lastly, we found weaker correlations between the IRQ-G scales than those found between the English scales; e.g., instead of the .47 correlation between Internal Verbalization and Visual Imagery, the German scales had a correlation of .11. We believe that the different compositions of the scales can once again explain at least some of the disparity. For example, the items that were not kept in the German version of the Internal Verbalization scale both referred to memories, either explicitly or implicitly, i.e., “My memories often involve conversations I’ve had” and “If I am walking somewhere by myself, I frequently think of conversations that I’ve recently had”. Because memories often include visual imagery (e.g., Greenberg & Knowlton, 2014), these items possibly drove some of the correlation found between the English scales, which was lost in the German version.
Inner Speech and Object Recognition
Next, we tested whether and how inner speech tendencies affected performance in a word-picture verification task measuring object recognition. Our investigation showed that participants with higher Internal Verbalization scores were faster than those with lower scores across all analyzed conditions, i.e., when the items matched, when they did not match, when the cue was a word and the target was a picture, and when the cue was a picture and the target, a word. Moreover, contrary to our expectations, individuals across all levels of Internal Verbalization were similarly affected by phonological relatedness. Following the predictions in Roebuck and Lupyan (2020), we might interpret the overall faster performance by individuals reporting stronger verbal thinking propensities as evidence that they activate phonological representations from stimuli in a more robust or automatic manner, which would lead them to arrive faster at a decision based on phonological information. However, the fact that phonological interference was not significantly worse for participants with higher Internal Verbalization scores casts some doubt on that interpretation. Another explanation, which is linked to the label-feedback hypothesis, might be that people with more language-reliant thinking are more effective at processing the stimuli because they selectively activate the most category-defining perceptual and semantic features and abstract over irrelevant information (Lupyan, 2012b). This language-induced warping of the conceptual space would mean that representations of objects sharing the same category are brought closer together and those of objects from different categories are pulled further apart (Goldstone et al., 2001; Lupyan, 2012a), thus making it easier to respond “match” in matching trials and “nonmatch” in nonmatching trials.
Furthermore, in contrast to Roebuck and Lupyan’s (2020) finding of decreased semantic interference for individuals with higher Internal Verbalization scores when they matched words to images, participants in our study who had higher Internal Verbalization scores suffered more interference when matching words to images, i.e., they were more slowed by semantic relatedness, relative to those with lower scores. Expanding on the argument presented above, we believe this discrepancy might be explained by the fact that the semantically related items in our study were more strongly and taxonomically related than the items in Roebuck and Lupyan (2020). To understand why this could be the case, one needs to start by considering that when concepts are taxonomically related, e.g., PIG and GOAT, they share many semantic features (e.g., four legs and fur) as well as category nodes (e.g., mammal and animal), while the opposite is true of thematically related concepts like PIG and FARM (Abdel Rahman & Melinger, 2007; Estes et al., 2011; Rose & Abdel Rahman, 2016). As a result, the activation of taxonomically related concepts converges on competitors that share semantic attributes (e.g., HORSE, COW); conversely, the activation of thematic associates does not converge on related concepts because of the scarcity of their shared attributes (Abdel Rahman & Melinger, 2009, 2019). In turn, this difference in representational overlap has been argued to underlie the different directions of semantic similarity effects involving thematic vs. taxonomic relations, namely facilitation (e.g., Alario et al., 2000; Costa et al., 2005; De Zubicaray et al., 2013) and interference (e.g., Piai et al., 2012; Schriefers et al., 1990; Shao et al., 2013), respectively (for a complete account of the mechanisms behind these effects, see Abdel Rahman & Melinger, 2019). In a similar vein, closely related concepts such as CAT and DOG have various and strongly related features in common, whereas more loosely related concepts like CAT and SPIDER share fewer and broader features, leading the former to induce more competition and therefore more semantic interference than the latter (Rose et al., 2019).
Crucially, we believe that producing words in inner speech might lead to increased representational overlap between the items via facilitated access to their categorical nodes, e.g., APPLE and GRAPE would be approximated by “fruit”, and that this would differentially affect the processing of taxonomic vs. thematic and closely vs. loosely related items due to the structural differences outlined above. Specifically, when items are thematically or loosely related, the inner-speech-facilitated access to the categorical nodes of the objects would make it easier to distinguish them because the nodes would be more non-overlapping than overlapping. For example, the items from Roebuck and Lupyan’s study SHOE - TOE and SNAIL - WHALE, which received relatively high semantic similarity scores, would be linked to more distinct superordinate categories like “footwear” and “body part” in the former case and “mollusk” and “mammal” in the latter case, compared to shared categories like “animal” for SNAIL - WHALE, making the dissimilarity between the concepts more evident. In contrast, when processing concepts like TRAM - BUS or JUICE - WINE, two pairs that received high semantic similarity scores in our stimuli, shared superordinate categories like “public transport” and “vehicle” for TRAM - BUS, and “beverage” and “fruit-derived product” for JUICE - WINE would be more prevalent, making it relatively more difficult to distinguish the objects. This proposal aligns well with suggestions in the literature that the representations of two things that have a common label become more similar and that labels highlight taxonomic relations between objects (Davis & Yee, 2019; LaTourrette & Waxman, 2020; Lupyan, 2012b; Markman & Hutchinson, 1984; Plunkett et al., 2008; Waxman & Booth, 2003; Waxman & Braun, 2005; Waxman & Gelman, 1986; Waxman & Hall, 1993; Waxman & Kosowski, 1990; Welder & Graham, 2001). Moreover, our explanation is consistent with theories that allow contextual and internal factors to dynamically modulate conceptual and lexical activation patterns during language production (Abdel Rahman & Melinger, 2019). Thus, we contend that the different direction of the Internal Verbalization effect in our study likely stems from differences in the type of semantic relationship in the stimulus materials and does not necessarily pose challenges to the validity of the label-feedback hypothesis, which we believe can also account for our findings.
Compared to the Internal Verbalization scale, our exploratory analyses revealed that the VISQ-R Condensed scale had the opposite effect on performance. Specifically, people who reported experiencing inner speech in a more abbreviated form took longer to respond and were less slowed by semantic similarity when matching word cues to picture targets. This pattern is in line with the negative correlation found between the Condensed and the Internal Verbalization factors and suggests that the assumed increase in semantic interference from larger feature overlap depends on the presence of more expanded forms of inner speech rather than on telegraphic or abbreviated forms. A possible explanation for the beneficial effect of condensed inner speech on semantic interference might relate to a potentially larger engagement of cognitive control mechanisms when fewer inner speech episodes are experienced, an idea consistent with the observation by Grandchamp and colleagues (2019) of increased activation in an area associated with cognitive control (the dorsomedial prefrontal cortex) for individuals who reported fewer verbal episodes during a mind wandering task.
Notably, the interaction between semantic similarity and both inner speech scales occurred exclusively when the cue was a word and the target was a picture. To understand this finding, two factors should be considered. First, because of the long interstimulus interval (around 1 second), performance may have been largely determined by target-related processes (Stadthagen-Gonzalez et al., 2009). Second, differences in inner speech tendencies are likely more apparent when pictures are being processed rather than words, as the latter are already a verbal stimulus whereas the former can be processed in various ways (Stadthagen-Gonzalez et al., 2009). Corroborating this suggestion, Roebuck and Lupyan (2020) also report interactions between Internal Verbalization and similarity measures exclusively in the word-picture condition.
Relatedly, while semantic interference was stronger in the word-picture condition, phonological interference was stronger in the picture-word condition. We reason that the level of information that is accessed first when processing the target might determine in which condition the similarity effects are strongest. When the target is a picture, semantic information becomes available first, so responses are more affected by semantic factors; when the target is a word, phonological information becomes available first, so phonological factors affect responses more (Grainger & Holcomb, 2009; Indefrey, 2011). This account is partly in line with Roebuck and Lupyan (2020), who also found stronger effects of semantic similarity in the word-picture condition, although they found similar effects of phonological similarity for both cue-target types.
In turn, unlike in Roebuck and Lupyan (2020), self-reported propensity to experience inner speech did not significantly influence how participants in our sample processed phonologically related items. Concerning this discrepancy, the most relevant aspect to consider might be the fact that orthography and phonology often did not match in the rhyming English items (e.g., ‘plane’ and ‘train’), whereas they almost always matched in the German items (e.g., ‘Schuh’ and ‘Kuh’). When the phonological form does not follow straightforwardly from the written word, individual differences in activating phonological representations could play a bigger role, allowing for interaction effects between Internal Verbalization and phonological similarity to emerge.
Finally, Visual Imagery and Representational Manipulation did not significantly affect object recognition performance. This finding is in line with Roebuck and Lupyan (2020) based on the analyses of nonmatching trials but diverges from them when it comes to matching trials. In the latter case, Roebuck and Lupyan found that participants with higher Visual Imagery scores were slower to respond when the cue was a word and the target was a picture, relative to those with lower scores. They also found overall faster response times for participants with higher Representational Manipulation scores, regardless of cue-target type. We think this partial discrepancy might be due to the exclusion, following the confirmatory factor analysis, of some of the original IRQ items from the German version of the scales; for example, the items referring to auditory representations were all missing from the Representational Manipulation factor of the IRQ-G. This suggests that the previously observed effect of this dimension might have been driven by the capacity to manipulate auditory rather than visual information, as the items related to the manipulation of visual representations were retained in the IRQ-G. In any case, further research is necessary to elucidate the roles played by these modes of thinking in object recognition among English and German speakers.
As pointed out by a reviewer, one interesting question is whether our results would align more closely with those reported by Roebuck and Lupyan (2020) if participants’ scores had been calculated using all items of the IRQ, rather than just those that survived the confirmatory factor analysis. To address this question, we redid the analyses to include all items from each of the three selected scales (12 for Internal Verbalization, 10 for Visual Imagery, and 8 for Representational Manipulation). The results were the same: Internal Verbalization was linked to faster reaction times in both matching and nonmatching trials across cue-target types; higher Internal Verbalization scores were linked to increased semantic interference in the word-picture condition; and the other IRQ factors showed no significant main effects or interactions. The effects of semantic and phonological similarity were also the same. For details, see “extra_analysis_IV_all_items.R” in the project’s OSF repository. These consistent results suggest that our analyses are robust to some degree of variability in scale composition, indicating that, at least in this task, the dropped items exert effects similar to those retained after the CFA, and lending further weight to the interpretations we have offered so far.
Overall, these results highlight the importance of taking into account differences in languages, types of semantic relationships, and forms of inner speech when studying the effects of verbal thinking on cued object recognition. Accordingly, our findings offer a valuable complement to the picture presented by Roebuck and Lupyan (2020) and help to consolidate underexplored pathways for future research.
Inner Speech and Abstract Thought
In addition to the object recognition task, we tested the effects of individual differences in inner speech propensity on a complex categorization task manipulating abstract thinking demands. Contrary to our expectations, Internal Verbalization did not significantly affect participants’ performance. This was confirmed by a Bayes factor analysis, which indicated that, for a wide range of theoretically interesting effect sizes, there was strong evidence in favor of a null model predicting reaction times without an interaction between Trial Concreteness and Internal Verbalization, relative to an alternative model including this interaction. This is in contrast to findings showing that limited access to linguistic representations, whether due to brain injury, articulatory suppression, or brain stimulation, relates to deficits in categorization performance in more abstract contexts (Langland-Hassan et al., 2021; Lupyan, 2009; Lupyan et al., 2012; Lupyan & Mirman, 2013). Similarly, our results differ from those of Fini and colleagues (2022), who present evidence that inner speech suppression via syllable repetition disproportionately slows the processing of abstract concepts during the classification of visual word stimuli as abstract or concrete.
On the other hand, our findings are somewhat aligned with those of Nedergaard, Borghi and Kapielska (2023). In their study, participants were asked to select the odd one out from a three-stimulus array, with the target concept varying in its degree of abstractness. To check for hypothesized effects of inner speech, participants engaged in interleaved interference tasks in which they solved 1-back matching problems involving either auditory or visual stimuli, in addition to a no-interference baseline condition. Notably, despite not finding interactions with the interference tasks, when the authors restricted their analyses to concepts at the higher end of the abstractness continuum, they found that participants responded similarly across all trials when the stimuli were images but not when the stimuli were words. In the latter case, participants were slower and less accurate when responding to the most abstract concepts (those classified as philosophical and spiritual, e.g., BELIEF) compared to the least abstract ones (those classified as physical-spatiotemporal or quantities, e.g., ACCELERATION), reflecting the concreteness effect the authors expected. These results, along with those of Fini and colleagues (2022), suggest that the availability of inner speech might support abstract thinking differently depending on whether words or images are being processed, with inner speech possibly being more beneficial for word-based tasks. This could be due to the potentially more complex and varied ways in which pictures map onto abstract concepts, compared to how abstract ideas relate to words. That said, these two studies differ from ours in the way inner speech was operationalized, as well as in the tasks and stimuli used. Thus, future research is needed to clarify the role of stimulus modality in complex categorization tasks involving abstract thought.
In turn, our pre-registered exploratory analyses based on the VISQ-R revealed that scores in the Condensed scale were related to behavior, with participants being more accurate in more abstract trials when they reported experiencing more abbreviated inner speech. A tentative explanation for this finding is that a more condensed representation of a concept would already suffice to provide the links between the items (as participants did not need to rehearse information or make grammar- or form-based judgments) and that individuals who stay at this more abstract level when processing the stimuli are more efficient at arriving at a solution compared to those who hear more complete sentences. This idea is in line with Kompa and Mueller’s (2020, 2022) suggestion that engaging in “sparser” forms of inner speech might be computationally cheaper and therefore lead to reduced cognitive load when more detailed representations (whether syntactic, phonological, or articulatory) are not necessary to complete a task. Our account is also broadly consistent with Guy Dove’s proposal that abstract thought relies more on abstract linguistic codes (e.g., Dove, 2009) and with research assigning a prominent role to language in higher-order thinking more generally (e.g., Baldo et al., 2005; Gentner & Asmuth, 2019; Inagaki & Hatano, 2003; Kaltefleiter et al., 2021; Newton & de Villiers, 2007; de Villiers, 2007; Wallace et al., 2017). What our findings make clear, in any case, is that more insight may be gained about the benefits of inner speech to abstract thinking if we consider the individual contributions of specific forms of inner speech as opposed to more general inner speech propensity. This conclusion is closely aligned with Borghi and Fernyhough’s (2023) recommendation to do more research into how different forms of inner speech intersect with abstract thought, a hitherto underexplored topic.
Furthermore, our analysis of the VISQ-R showed that participants with more evaluative and critical inner speech took less time to choose an image regardless of how abstract the trial was. In line with Gade and Paelecke’s (2019) findings of increased conflict resolution capacity for individuals with more evaluative and motivational inner speech, we tentatively interpret this result as a reflection of an improved ability to focus on relevant aspects that link the images together (and to ignore irrelevant ones) by individuals who can avail themselves of performance-related commentaries (e.g., “that’s right”, “that’s not right”).
Lastly, control analyses of the effects of executive functions on this task revealed that better inhibition capacity, as indexed by a lower Simon effect, was associated with higher chances of selecting the correct image in each trial, regardless of its degree of abstractness. This effect is perhaps unsurprising given that the task requires participants to ignore features that are not shared by the items to focus on things that do connect them, which resembles what participants need to do in the Simon task, namely ignore irrelevant location information to focus on the color of the stimuli.
Inner Speech and Metacognition
Moreover, we found that individual differences in inner speech tendencies did not significantly influence metacognitive measures of response confidence when Trial Concreteness was taken as a whole. However, when we analyzed abstractness by breaking down Trial Concreteness into its component ratings, significant effects of inner speech emerged. Specifically, when common setting ratings were lower, meaning the target and its matching image appeared together less frequently compared to the target and the other choices in the trial, individuals experiencing more inner speech were a) slower to indicate how confident they were that they had the correct answer and b) more likely to indicate high confidence in their decision, relative to those with less inner speech, although the latter effect was only marginally significant.
These results are broadly in line with Langland-Hassan and colleagues (2017), who found that participants with aphasia had specific deficits in determining their response confidence when categorizing images lacking thematic associations or perceptual similarities. Thus, one might offer a similar account to that in Langland-Hassan and collaborators for our results, namely that one gets an additional cue to response success by hearing a verbal label in inner speech that indicates how the items connect, which is particularly useful when the correct answer cannot be based on contextual information or visual imagery. If this were the case, however, one would also expect Internal Verbalization to be linked to faster or more accurate selection of the matching image in more abstract trials, but no such effects were observed.
A post-hoc account of these results is that people with higher Internal Verbalization scores engage in more explicit and deliberate decision-making processes when judging their mental states, which leads them to have more confidence in their response. In turn, this effect would be more prominent in more abstract trials because of the higher degree of uncertainty they involve, as evidenced by, for example, the overall slower response times and lower confidence levels. This suggestion is based on Kompa and Mueller’s (2022) proposal that engaging in language production-comprehension loops in inner speech enhances deliberation by allowing repeated explicit assessment and reformulation of the steps involved in decision processes until solutions become sufficiently informative, relevant, or clear. The fact that the Dialogic scale had the highest correlation with Internal Verbalization provides further indirect support for this idea, as dialogic forms of inner speech would be particularly suitable for engaging in the inner speech loops described by Kompa and Mueller.
Finally, concerning the control measures of executive function, higher cognitive flexibility, as indexed by fewer perseverative errors in a card-sorting task, was related to faster indication of response confidence, regardless of the degree of context abstractness. Seeing that cognitive flexibility refers to, among other things, the capacity to consider multiple possibilities or angles when solving a task (Miles et al., 2021), it might be reasonable to suppose that the effect results from people being faster at reaching a metacognitive decision when they take into account fewer possibilities or alternative pathways to their response.
Limitations and Future Directions
There are several limitations to the present study. First, the participants in our samples were all young adults and mostly female, which means they were not very representative of the general population. Thus, the extension of our findings to more diverse populations would be desirable (Ramsey & Hewitt, 2005). Second, we made exclusive use of self-report questionnaires to index inner speech experiences. Although there are many advantages in using such tools compared to more resource-intensive methods, they also present unique challenges, including being bound by limitations in people’s memory, introspection, and metacognition capabilities, being vulnerable to desirability biases, and having questions amenable to multiple interpretations (McCarthy-Jones & Fernyhough, 2011; Schwarz, 1999). Future research could additionally employ an in-the-moment method of assessing inner speech and compare the results with those produced by the IRQ-G to better understand what is being measured and how different operationalizations of the construct affect behavior. Third, participants completed the questionnaires and the executive function tasks online, meaning that experimental control over the conditions in which these activities were performed could not be guaranteed. We attempted to minimize this problem by excluding participants with abnormal response patterns and by explicitly reminding participants before the start of each task that they should be in a distraction-free, quiet environment, using a laptop or a desktop computer instead of a cell phone, but it is possible that some participants ignored these instructions. Hence, it might be beneficial to replicate our findings using measures obtained in more controlled settings. Fourth, we have only superficially addressed a possible role that cultural differences between the American and the Austrian contexts might have played in producing some of the surprising results we observed. It is conceivable, for example, that societal discourse around inner speech differs between the two groups, thus potentially affecting the connotations of the questionnaire items. Furthermore, the two groups most likely differ in their familiarity and associations with some of the objects presented in the tasks (for example, BEER, PLANE, and DEER, which are concepts featured in both our stimuli and Roebuck and Lupyan’s). Nevertheless, the large individual variability that can be expected to exist within each group likely overshadows any between-group differences in these variables, particularly when considering that both countries fall under the WEIRD (Western, educated, industrialized, rich, democratic) category, making cultural differences less relevant than they might have otherwise been. Finally, the evidence provided herein is correlational in nature, which means we cannot determine the causal direction of the relationships we observed between inner speech and the measures of object recognition and complex categorization. Research coupling questionnaire evidence with neuromodulation techniques, for example, would allow this shortcoming to be addressed.
Despite these limitations, this study makes important contributions. First, it offers a new tool for researchers in the German-speaking world to investigate individual differences in modes of thinking. Second, it corroborates a preliminary report of modulatory effects of individual differences in inner speech on the processing of words and pictures, highlighting the potential importance of this interindividual variation when accounting for performance in language comprehension or production tasks and beyond. Third, it presents novel evidence that experiencing more condensed versions of inner speech might change how people recognize objects and how they categorize objects in abstract contexts. Lastly, it provides first indications that individual differences in inner speech might affect metacognitive processes involved in judging response confidence. Future research can build on these contributions by investigating the brain mechanisms underpinning the effects of inner speech and by confirming the effects related to condensed inner speech and metacognition with a more targeted approach.
Conclusion
We have translated and validated a new tool, the IRQ-G, for researchers to further explore the effects of interindividual variation in different modes of thinking among German speakers. Using this instrument and another questionnaire measuring specific forms of inner speech, we have shown that individual differences in inner speech tendencies differentially impact lower-level object recognition and higher-level abstract categorization. While general inner speech propensity benefited recognition performance, condensed forms of inner speech hindered object recognition. On the other hand, condensed inner speech facilitated abstract thinking, while general inner speech propensity was unrelated to performance. Finally, preliminary evidence was obtained that general inner speech propensity might increase metacognitive deliberation when judging response confidence in more abstract contexts.
Together, our results highlight the nuanced and varied ways inner speech influences noncommunicative functions, arguing against monolithic views of both inner speech and the effects of language on cognition. Instead, our findings illustrate the value of shifting the conversation from whether or not (inner) language affects extralinguistic processes to the narrower question of which tasks are affected by it and which variety of inner language is at play. Furthermore, the observed behavioral outcomes linked to survey-based measures of individual variability in inner speech underscore that these measures may be a useful complement or alternative to more costly, experimental methods of inner speech manipulation, such as articulatory suppression. Finally, while we have assessed inner speech propensity as a trait, our results may provide grounds for applied interventions aimed at altering inner speech characteristics to enhance task performance, insofar as the findings lend support to the idea that what a person says or hears in their mind influences how they process information. We strongly encourage replications of our findings to deepen our understanding of when and how individual differences in inner speech shape cognition.
Author Contributions
Conception and design: PB, JM. Data acquisition: PB. Data analysis: PB. Original draft: PB. Editing and revision: PB, JM. Supervision and resources: JM. All authors approved the final version of the manuscript for submission.
Acknowledgements
We want to thank Aisha Futura Tüchler and Jakob Weickmann for their diligent translation work. We also wish to thank Sandra Regen, Carina Auzinger, and Peter Traunmüller for their help with data collection, language-related questions, and experimental set-up. Open access funding provided by the University of Vienna.
Funding Information
We do not have any funding sources to disclose.
Competing Interests
We have no known conflict of interest to disclose.
Data Accessibility Statement
The analysis scripts, anonymized datasets, sharable experiment files and stimuli, complementary results and other supplementary material can be found on the project’s OSF page at https://osf.io/u4zck/.
Footnotes
First, the target “Kartoffel” [“potato”] was replaced by “Krone” [“crown”] because in Austrian German “potato” is commonly referred to as “Erdapfel”. Consequently, instead of “Kartoffel” – “Pantoffel” [“slippers”], we had “Krone” – “Zitrone” [“lemon”]; instead of “Kartoffel” – “Pelikan” [“pelican”], we had “Krone” – “Fenster” [“window”]; and instead of “Kartoffel” – “Schnitzel” [“Schnitzel”, a type of fried breaded cutlet], we had “Krone” – “Ring” [“ring”]. Second, the cue “Kohl” [“cabbage”], which was paired with the target “Lauch” [“leek”], was replaced with “Brokkoli” [“broccoli”], as cabbage is often referred to as “Kraut” in Austria. Finally, the cue “Schornstein” [“chimney”], which was paired with the target “Schwein” [“pig”], was replaced with “Stein” [“rock”] because chimney is often referred to as “Rauchfang” in Austria.
Due to cultural and linguistic differences between the American / English-speaking original context and the Austrian / German-speaking target context, we excluded 10 items from the original 84 items in Langland-Hassan and collaborators (2021), namely those corresponding to the linking words “memory”, “identification”, “right”, “net”, “wax”, “rare”, “save”, “kick”, “turn” and “slow”.
Details of the norming procedure can be found in Langland-Hassan and colleagues (2021). Succinctly, hundreds of participants selected a matching image for each target and provided a linking word that best described how the two were related. Then, two other large groups of participants provided the visual similarity and common setting ratings.