A long-standing debate in social psychology concerns whether the cognitions reflected on implicit measures are unconscious. Research by Hahn et al. (2014) documented that people can prospectively predict the patterns of their results on Implicit Association Tests (IATs) toward five pairs of social groups. The present article presents a meta-analysis of 17 published and unpublished exact replication studies conducted by, or under the close supervision of, the original author. Replicating Hahn et al., participants in all 17 studies accurately predicted the patterns of their IAT results (meta-analytical within-subject effect size: b = .44; corrected average within-subject correlation r = .56). This prediction accuracy effect was smaller for online (b = .27; corrected r = .37) than for lab (b = .47; corrected r = .61) studies, as well as for general-public (b = .27; corrected r = .36) as opposed to student samples (b = .47; corrected r = .60). Moreover, predictions fully explained implicit-explicit relations, and they seemed to reflect unique insight into participants’ own cognitions beyond mere knowledge of normatively expected patterns of implicit responses. This pattern of results remained the same across samples, settings, countries (Canada, the US, and Germany), and languages (English vs. German). Further analyses suggested that the lower prediction accuracy in online samples partly reflects a suppression effect from higher consistency between traditional explicit evaluations and predictions: controlling for explicit evaluations (which exerted a negative unique effect on IAT scores beyond IAT score predictions) substantially reduced the difference between online and lab studies. Together, the results strengthen the hypothesis that the cognitions reflected on implicit evaluations are largely accessible to conscious awareness.
In 2014, Hahn et al. published a paper showing that participants were able to predict the patterns of their results on five IATs toward different social groups. These findings challenged more traditional views that conceptualize the cognitions reflected on implicit1 evaluations as unconscious attitudes to which people have no introspective access (Greenwald & Banaji, 1995; Lai et al., 2013; McConnell et al., 2011; Nosek et al., 2002). Instead, they favored interpretations by other dual-process models that assume that the cognitions reflected on implicit evaluations are in principle introspectively accessible (Fazio, 2007; Gawronski et al., 2006; Gawronski & Bodenhausen, 2006, 2011). According to these models, dissociations between implicit and explicit evaluations arise because people tend to base their answers to explicit questions on different information than the spontaneous cognitions that surface on implicit measures, which they often reject as invalid (Fazio, 2007; Gawronski et al., 2006; Gawronski & Bodenhausen, 2006, 2011). Since the publication of these studies, several successful conceptual replications have shown that Hahn et al.’s (2014) findings generalize to other attitudinal domains (Goedderz & Hahn, 2024; Morris & Kurdi, 2023; Rahmani Azad et al., 2023). At the same time, our lab has conducted several direct replications of the original paradigm in the domain of social groups, some of which are published while others remain unpublished. Some of these are pilot studies that tested whether the effects would hold in different settings (e.g., online), different languages (i.e., German vs. English), and with culturally different samples (e.g., in the US vs. Canada vs. Germany), and may never be published individually. Such unpublished studies, however, can lead to biased estimates of effect sizes and, in the worst case, perpetuate false-positive findings (Murayama et al., 2014; Nuijten et al., 2015).
In the present research, we address this issue with a preregistered meta-analysis of all published and unpublished studies that directly replicated the prediction paradigm introduced by Hahn et al. (2014) in the domain of social groups. In doing so, we pursue three main goals. First, by including all published and unpublished studies, we aim to provide a less biased estimate of the average effect size of the prediction accuracy effect in Hahn et al.’s (2014) paradigm. Second, we test the generalizability of the original findings to different samples and settings by examining differences in effect sizes across study characteristics. Third, and last, we replicate and meta-analyze additional analyses proposed by Hahn et al. (2014) that address theoretically relevant questions for research on awareness and implicit evaluations. Specifically, we examine the extent to which predictions explain unique variance in IAT score patterns over and above traditional explicit measures (i.e., thermometer ratings). Further, we investigate whether participants have unique insight into their own patterns of IAT results or whether predictions are culturally normative and interchangeable across participants from the same sample. Finally, we analyze whether participants are better at predicting their own relative evaluations of different social groups than at communicating the relative strength of their evaluations of one social group in comparison to other participants in the sample, a process we call “social calibration” (Goedderz & Hahn, 2024; Hahn & Goedderz, 2020). We explain each of these points and the initial results found by Hahn et al. (2014) in more detail next, after a quick summary of the different theoretical models that they address.
Awareness and Implicit Attitudes – Theoretical Framework and Specific Questions
Different classical theoretical models of the implicit-explicit distinction make different predictions about whether the cognitions reflected on implicit measures are consciously accessible. On the one hand, there are models based on Greenwald and Banaji’s (1995) conceptualization of implicit social cognition as “introspectively unidentified traces of experience” (Greenwald & Banaji, 1995, p. 8; Lai et al., 2013; McConnell et al., 2011; Nosek et al., 2002). Specifically, when indirect attitude measures were first developed, some researchers argued that explicit measures tap into consciously accessible attitudes while implicit measures capture unconscious attitudes. This idea was met with enthusiasm by other researchers and the public media, who began to use the terms “implicit attitudes”, “unconscious biases”, or even “unconscious racism” interchangeably (Basu, 2018; BBC News, 2017; Haider et al., 2011, 2014; Quillian, 2008; The Guardian, 2018). From this perspective, the fact that implicit and explicit measures are often only weakly correlated (r = .24 in a meta-analysis by Hofmann et al., 2005) is interpreted as evidence that people are unable to report on the cognitions reflected on implicit measures (Nosek, 2005; Nosek et al., 2002).
On the other hand, several dual-process models propose that the cognitions reflected in implicit measures differ from explicit reports because people consider other information when they have the time and resources to think about a deliberate answer. For instance, the MODE model (Fazio, 1990; Fazio & Olson, 2003) proposes that when people are motivated and have the opportunity to deliberate, they will report different evaluations on explicit measures than they show on implicit measures. The Associative-Propositional Evaluation model (APE; Gawronski & Bodenhausen, 2006, 2011) hypothesizes that implicit and explicit measures will diverge when people do not consider their spontaneous reactions to be valid bases for their deliberate evaluations. For instance, a person may sense that they have a more negative spontaneous feeling toward a Black person than toward a White person. However, when asked directly and given time to think about an honest answer, that person may draw on different information. For example, they may think about Black friends they have, that they genuinely believe that all men are created equal, and that they have egalitarian worldviews. In this scenario, the person would probably show biases against Black people on an implicit measure but would not report such biases on an explicit measure. Importantly, however, the reason for this discrepancy would not be a lack of awareness of the spontaneous reactions. Instead, it would reflect the fact that the person does not consider their spontaneous reactions to be the only valid basis for their general evaluation of Black people.
Based on these opposing theoretical considerations, Hahn et al. (2014) empirically tested whether people are generally aware of their (biased) evaluations of social groups or not. In these studies, participants were asked to first explicitly rate how they felt toward different social groups on “feeling thermometer” scales. They then went on to predict how they would score on five IATs measuring their reactions toward the social categories Black, Latino, Asian, Child, and Celebrity in contrast to Regular (non-Celebrity) White Adults. Afterwards, they completed all five respective IATs. This study design enabled the researchers to investigate several questions regarding participants’ awareness of the cognitions captured on implicit evaluations. We discuss these questions and the evidence from the original studies next.
Are People Able to Predict Their IAT Scores?
The main focus of Hahn et al.’s (2014) studies was to investigate whether people are generally aware of the cognitions reflected on their implicit evaluations. To test this, they used a within-subject design in which participants predicted and completed five IATs, and examined whether participants were able to prospectively predict the patterns of their IAT results. Their reasoning for this particular design was as follows: To investigate whether people know about their own evaluative reactions toward different targets, participants have to predict how their reaction toward one attitude object differs from their reaction toward another attitude object. This can only be analyzed using within-subject correlations between predictions and implicit evaluation scores across several attitude objects per participant (see also Hahn & Goedderz, 2020). Results showed that participants predicted the patterns of their IAT scores with significant accuracy, with an average within-subject correlation of r = .54 (corrected r = .66)2 across four studies. Results further showed that this prediction accuracy was independent of (1) whether implicit evaluations were described as true attitudes or culturally learned associations (Studies 1 and 2), (2) whether participants were told to specifically predict their behavior on an IAT (e.g., “which block would be easier for you?”, Study 1) or their “implicit attitudes” (Studies 2-4), and (3) how much explanation they received about the IAT or how much experience they had with the task (Study 4). Overall, these findings provide initial evidence that people are able to report on the cognitions reflected in their implicit evaluations, suggesting that these cognitions are generally consciously accessible.
Do People Consider Other Information for Traditional Explicit Reports Than What is Reflected on Their Implicit Measures?
Another goal of Hahn et al.’s (2014) studies was to empirically test the theoretical considerations of models such as the APE model (Fazio, 2007; Gawronski & Bodenhausen, 2006, 2011), which postulate that different information factors into explicit and implicit measures. These models hypothesize that the degree to which implicit and explicit measures are correlated depends on how much people rely on their spontaneous gut reactions for their explicit reports. The studies supported this idea. First, correlations between IAT scores and explicit thermometer ratings tended to be low, while correlations between participants’ predictions and their IAT score patterns were always considerably higher. This supports the notion that participants can generally have insight into the cognitions reflected on their IAT scores but nonetheless often report other information on traditional explicit measures. Second, the relationship between explicit thermometer ratings and IAT scores was entirely explained by participants’ predictions in all studies. In line with the APE model (Gawronski & Bodenhausen, 2006, 2011), this shows that beyond a first spontaneous reaction, people consider additional information for explicit reports that is not captured on implicit measures. Together, these findings support the hypothesis that implicit-explicit correlations often tend to be low not because people are unaware of the cognitions reflected on their implicit evaluations. Instead, the data are more compatible with the notion that people do have access to these cognitions but rely on additional information for their explicit reports.
Do People Predict Their Own Evaluations or a Descriptively Normative Pattern?
One explanation for the finding that participants in Hahn et al.’s (2014) studies accurately predicted their IAT score patterns is that they had unique insight into the cognitions reflected on their implicit evaluations. Another possibility is that the IAT score patterns toward the social groups followed a descriptively normative pattern and participants were accurate because they predicted what they assumed made most sense in their cultural context, e.g., because they inferred these patterns from previous observations or societal debates. For example, an American citizen may assume that the average other American citizen will have negative reactions toward Black people and Latinos, neutral to somewhat negative reactions toward Asians, somewhat positive reactions toward Celebrities, and very positive reactions toward Children. If participants’ own patterns of IAT results were in line with these assumptions, the participants in Hahn et al.’s (2014) studies would not have predicted what they believed to be their own evaluative patterns toward the social groups but rather what they believed to be the culturally shared, descriptively normative evaluation of the social groups in their context.
To investigate this idea, Hahn et al. (2014) used two approaches. First, they reexamined the data of their studies by pairing each participant’s IAT responses with a random other participant’s predictions from the same sample, and vice versa. They argued that if all participants only predicted a descriptively normative pattern, then any other participant’s predictions should predict their IAT results as well as their own predictions do. In contrast, if participants predicted their own patterns of evaluations, their own predictions should explain unique variance in their IAT results over and above the randomly paired other participants’ predictions. Results supported the latter explanation: The random other participants’ predictions showed lower correlations with participants’ own patterns of IAT results than their own predictions did. Additionally, participants’ own predictions explained unique variance in their own patterns of IAT results over and above the random other participants’ predictions.
Second, they tackled the question experimentally. In one study, they additionally asked participants to predict how the average student at their university would score on the respective IATs. Their reasoning was that if participants had unique insight into their own cognitions reflected on implicit evaluations, their own predictions should explain the patterns of their own IAT results over and above what they believed the IAT score results for the average student at their institution would be. In line with this reasoning, results indeed showed that participants’ own predictions explained variance in the pattern of their own IAT results over and above their assumed pattern of results for the average student.
These results suggest that participants in Hahn et al.’s (2014) studies reported their own evaluations rather than what they believed to be cultural consensus, at least to some degree. Nonetheless, both approaches also showed that a significant proportion of every participant’s unique IAT score pattern was also predicted by random others and by their idea of what an average participant would show.
Taken together, the studies by Hahn et al. (2014) suggest that accurate predictions may be a combination of unique insight and cultural knowledge of descriptively normative responses, with the former playing a somewhat larger role.
Do People Know Where Their IAT Scores Rank in Comparison to Other People?
Thus far, the main analyses in Hahn et al.’s (2014) studies focused on within-subject correlations. Using within-subject analyses in a multilevel design allowed the researchers to estimate whether participants are able to accurately say how their own reactions on the IATs differ between, e.g., a Black/White IAT, a Latino/White IAT, and a Child/Adult IAT. However, Hahn et al. (2014) pointed out that most previous research investigating whether people know the cognitions reflected on their implicit evaluations relied on between-subject analyses. Those between-subject implicit-explicit correlations tend to be low (Hofmann et al., 2005), which could be interpreted as indicating that participants do not know their implicit evaluations. Opposing this interpretation, Hahn et al. (2014) argued that this level of analysis answers a different question: namely, whether participants know where their results on the IAT rank in comparison to other participants in the same sample. As such, a low correlation in a between-subject analysis could show that participants do not know whether they have stronger or weaker biases than other people in the sample, or simply that participants use the prediction scale labels differently. Hahn and Goedderz (2020, see also Goedderz & Hahn, 2024) summarized these two perspectives on awareness, which correspond to the two levels of analysis, as “introspective awareness” (within-subject analyses) vs. “social calibration” (between-subject analyses). They argue that the two types of analyses answer different questions about the kind of awareness people have of the cognitions reflected on their implicit evaluations.
Following this reasoning, as an additional analysis, Hahn et al. (2014) examined the between-subject correlations between predictions and IAT scores, computed per target-pair IAT and averaged across the five IATs per study. These averaged between-subject correlations were significantly different from zero in all four studies. However, they were lower than the within-subject correlations. That is, participants seemed to be more accurate in predicting their own pattern of IAT results than in ranking whether their biases were milder or stronger in comparison to other participants in the sample: they were aware of their biases, but somewhat miscalibrated with one another concerning estimates of their strength (Goedderz & Hahn, 2024; Hahn & Goedderz, 2020). At the same time, correlations between explicit thermometer ratings and the IATs did not seem to differ in size between the within-subject and between-subject analyses.
Summary of the Original Findings
The studies by Hahn et al. (2014) provided initial evidence that people may be aware of the cognitions reflected on their implicit evaluations. Participants in these studies were able to accurately predict the patterns of their IAT scores even beyond normatively expected evaluative patterns and even though they reported other evaluations on traditional explicit scales. These findings speak against older conceptualizations of implicit evaluations as capturing unconscious mental contents. Instead, they favor theoretical models that assume that people are generally aware of the cognitions reflected on their implicit evaluations but consider other information when they have time to think about a deliberate answer. Lastly, while participants were able to accurately sense their own biases toward different social groups, they seemed to be less accurate in sensing where their biases ranked in the sample distribution. In the current article we aim to shed further light on whether these findings are reliable and robust.
The Need for Replications
Recent developments in scientific rigor highlight the importance of replications for scientific progress (Nosek et al., 2022). First, replications ensure that an original study is not based on a random false positive by showing that the result is reproducible when the original design is followed directly with a similar sample and setting (Murayama et al., 2014; Simmons et al., 2011). Second, a direct replication using a different sample (e.g., users of a survey platform vs. university students, participants in different countries) or a different setting (online vs. laboratory) can speak to the generalizability of the finding beyond the original studies (Henrich et al., 2010). The original paper by Hahn et al. (2014) already contained four studies; the authors thus replicated their initial finding three times while also showing that slight changes in the design did not significantly change the prediction effect. However, these studies were all conducted with undergraduate students at the same US university in the same laboratory. This raises the question of whether the findings replicate in other samples and settings. That is, there could be something specific about the students at this particular US university at a particular time that made them more aware of their implicit biases. For instance, the topic of implicit biases may be so salient in the United States that Americans already pay more attention to their biased reactions, or undergraduate students may be a very specific population that is highly sensitive to the topic of implicit biases. Additionally, a laboratory setting may increase pressure on participants to “admit” to biases in the specific set-up of the study and thus inflate how much awareness people are willing to report. Conversely, the laboratory setting could also lead to underestimates of awareness if the presence of an experimenter makes participants unwilling to admit to biases of which they are aware.
To ensure that the effects reported by Hahn et al. (2014) are not a random false positive or a specific effect of the investigated group of undergraduates at a US university in a laboratory setting, replications with different samples and in different settings are needed.
The Current Meta-Analysis
The current meta-analysis reviews published and unpublished replications in different samples and settings, all conducted by, or under the close supervision of, the original author of the Hahn et al. (2014) studies.3 The present meta-analysis has three main goals. First, by including published and unpublished studies with varying effect sizes, we aim to provide future research that wishes to replicate Hahn et al.’s (2014) paradigm with a more accurate estimate of the original prediction accuracy effect size. Second, we systematically investigate whether the original findings replicate in different samples and settings by running subgroup analyses for different study characteristics to assess the generalizability of the previous findings. Third, and finally, we systematically investigate whether the different results and theoretical considerations suggested by Hahn et al. (2014) hold across all studies. Specifically, beyond the meta-analytical effect of the prediction accuracies across studies, we also examine (1) whether predictions explain variance in IAT score patterns beyond explicit thermometer ratings, (2) whether participants have unique insight into the cognitions reflected on their implicit evaluations beyond normative patterns, and (3) whether participants are better at predicting the patterns of their IAT scores than at placing their evaluations accurately in the sample distribution.
Method
Transparency and Openness
All materials, analysis codes, supplemental materials, and the preregistration are openly accessible at the OSF repository at https://osf.io/mejzp/.
Data Inclusion
All published and unpublished studies that used the prediction procedure put forward in Hahn et al.’s (2014) studies were considered for the present meta-analysis. The considered studies were all conducted or supervised by the original first author of the Hahn et al. (2014) article. We preregistered a list of criteria for the inclusion or exclusion of studies (see “registrations” tab at https://osf.io/mejzp/). These criteria were intended to ensure that the examined studies were as comparable as possible in their procedural aspects while allowing other characteristics to vary between studies (e.g., setting and sample characteristics). In total, we collected data from 26 studies4 with an overall sample size of 5180 participants. We retained only studies that used the original five social group pairs from Hahn et al. (2014): Black/White, Asian/White, Latino/White, Child/White Adult, and Celebrity/White Regular Adult. We therefore excluded five studies that used other target pairs, for instance baked goods or occupational groups (e.g., Goedderz & Hahn, 2024; see discussion section). We further excluded three studies because the prediction slides did not show the same pictures (or any pictures) as the upcoming IATs, and because the predictions and IATs in these studies contained only five (instead of ten) pictures and words per category. Another two studies were dropped because the procedure of the implemented IATs differed slightly due to programming errors. We thus kept 17 studies with a total sample size of 3201 (a table listing all considered studies and the respective exclusion criterion can be found in the supplemental materials). Nine of these studies contained one or more experimental conditions that altered the prediction procedure or otherwise diverged from our preregistered inclusion criteria. As preregistered, we retained these studies but included in the analyses only those conditions that followed our preregistered inclusion criteria, excluding conditions that differed from them. We further excluded participants with missing data on any of the central variables for our main analyses (predictions, explicit thermometer ratings, IAT scores), participants who did not finish the study, and participants who failed attention or seriousness checks where applicable (e.g., in online studies). Following recommendations by Greenwald et al. (2003), we deleted trials ≥ 10,000 ms before calculating IAT scores and excluded participants who responded ≤ 300 ms on more than 10% of the trials in any of the five IATs. In line with the original publication by Hahn et al. (2014), as a final step we dropped participants who had participated in a study with the prediction paradigm before, keeping the data from their first participation. The final sample thus consisted of 17 studies with a total of 1734 participants. An overview of the final set of studies, their initial and retained sample sizes after data cleaning, and central demographic characteristics is given in Table 1.
StudyID | Study Code | Year | Total N | Final N | Status | Setting | Sample Group | Language | Mean Age | Age SD | % Female | % White | Dominant Citizenship
---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | UOBSGQG2020 | 2020 | 65 | 65 | Unpublished | Online | Students | German | 30.02 | 11.39 | 61.5 | 83.1 | Germany (94%)
2 | UOBSGQU2020 | 2020 | 79 | 57 | Unpublished | Online | General population | English | 34.18 | 8.10 | 45.6 | 68.4 | USA (100%)
3 | UOBSGQO2020 | 2020 | 61 | 59 | Unpublished | Online | General population | English | 32.88 | 13.24 | 64.4 | 64.4 | UK (71%)
4 | ULBSGDG2019 | 2019 | 81 | 72 | Unpublished | Lab | Students | German | 24.40 | 7.05 | 69.4 | 76.4 | Germany (82%)
5 | ULESGDG2019 | 2019 | 290 | 66 | Unpublished | Lab | Students | German | 22.18 | 3.90 | 72.7 | 78.8 | Germany (88%)
6 | UOBSGIU2019 | 2019 | 126 | 94 | Unpublished | Online | General population | English | 38.26 | 11.68 | 50.0 | 77.7 | USA (93%)
7 | ULESGIG2019 | 2019 | 220 | 71 | Unpublished | Lab | Students | German | 23.79 | 6.58 | 84.5 | 83.1 | Germany (94%)
8 | ULESGIG2018_a | 2018 | 248 | 118 | Unpublished | Lab | Students | German | 22.53 | 3.88 | 78.8 | 84.7 | Germany (92%)
9 | ULESGIG2018_b | 2018 | 318 | 95 | Unpublished | Lab | Students | German | 22.85 | 3.34 | 80.0 | 87.4 | Germany (96%)
10 | ULESGDG2018 | 2018 | 256 | 74 | Unpublished | Lab | Students | German | 23.23 | 4.64 | 79.7 | 85.1 | Germany (96%)
11 | PLESGIG2016 | 2016 | 243 | 125 | Published | Lab | Students | German | 23.50 | 6.02 | 78.4 | 84.0 | Germany (94%)
12 | ULBSGDG2015 | 2015 | 65 | 65 | Unpublished | Lab | Students | German | 25.14 | 7.93 | 84.6 | N/A | Germany (95%)
13 | PLESGDG2015 | 2015 | 205 | 95 | Published | Lab | Students | German | 23.26 | 4.00 | 77.9 | 89.5 | Germany (96%)
14 | ULESGDC2014 | 2014 | 253 | 65 | Unpublished | Lab | Students | English | 18.48 | 1.25 | 63.1 | 61.5 | Canada (72%)
15 | PLESGDC2013 | 2013 | 150 | 75 | Published | Lab | Students | English | 22.40 | 5.21 | 65.3 | 40.0 | Canada (49%)
16 | ULBSGPU2012 | 2012 | 111 | 110 | Unpublished | Lab | Students | English | 19.25 | 1.58 | 50.0 | 78.2 | USA (88%)
17 | PLESGPU2011 | 2009-2012 | 430 | 428 | Published | Lab | Students | English | 19.16 | 1.61 | 60.5 | 79.9 | USA (N/A*)
Note. Study Codes were created to capture important information about the studies. The abbreviations are as follows: U/P = Unpublished/Published; L/O = Laboratory/Online; B/E = Basic paradigm/Experimental manipulations in some conditions; SG/OG = Targets were Social Groups/Other Groups (only studies with social groups were included in the present meta-analysis); D/I/Q/P = Study was programmed in DirectRT/Inquisit/Qualtrics/Python; C/G/U/O = Data were collected in Canada/Germany/USA/Other country. All Study Codes end with the year of data collection; if all other criteria resulted in the same Study Code, “_a” or “_b” was added to distinguish the studies. N/A indicates that these data were not collected and were hence not available. *Study 17 was run in the United States at the University of Colorado. Citizenship was not specifically assessed, so precise percentages are missing.
Materials
All materials were almost identical to the materials used in Hahn et al. (2014) and were created by the first author of Hahn et al. (2014) or the first author of the current paper. In all studies included in the present meta-analysis, participants first completed explicit thermometer ratings, then predictions, and finally the IATs. Variations within each of these three measures included (1) which words were used in the IATs (with four English and four German-language sets that often varied only by individual words), (2) which pictures were used in the predictions and IATs, (3) whether participants completed trial predictions before completing the actual predictions, (4) whether predictions were completed on 7-point scales (13 studies), 11-point scales (2 studies), or 7-point sliders (3 studies, two of which were reported in Hahn et al., 2014), as well as (5) the language and (6) the precise wording of the instructions in the prediction task. A table specifying these variations and all instructions and stimuli that were used can be viewed in the materials file at https://osf.io/mejzp/.
Explicit evaluations. To assess explicit evaluations toward the five social group pairs, participants rated their feelings toward each group on standard thermometer scales. The scales ranged from 0 (very cold feelings) to 100 (very warm feelings) and were combined with a depiction of a thermometer colored green or blue on one end (cold feelings) and red on the other end (warm feelings). Participants were shown each social group separately and asked to indicate how warmly or coolly they felt toward this social group. The social groups were “Black people”, “Latinos/Latinas”, “Asian people”, “White people”, “Celebrities”, “Regular people (non-celebrities)”, “Children”, and “Adults”. For better comparison with the predictions and IAT scores, the final explicit thermometer ratings were computed by subtracting participants’ ratings for the target groups from their ratings for the respective comparison group. Positive scores thus indicate more positive explicit evaluations toward the social categories White, Regular, and Adult than toward Black, Latino, Asian, Celebrity, or Child.5
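As a minimal illustration of this difference score (the column names and rating values below are hypothetical):

```python
import pandas as pd

# Hypothetical thermometer ratings (0-100) for one participant, one row per
# target group (Black, Latino, Asian, Celebrity, Child).
ratings = pd.DataFrame({
    "therm_target": [40, 55, 60, 70, 90],
    "therm_comparison": [75, 75, 75, 70, 65],  # White / Regular / Adult
})

# Positive scores indicate warmer feelings toward the comparison group.
ratings["therm_diff"] = ratings["therm_comparison"] - ratings["therm_target"]
print(ratings["therm_diff"].tolist())  # [35, 20, 15, 0, -25]
```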
Predictions. Participants were asked to predict how they would score on the five upcoming IATs. Before doing so, they read an introduction briefly explaining the concept of “implicit attitudes” and introducing the IAT as a method developed to measure them.6 These explanations described “implicit attitudes” as a construct psychologists were interested in that may differ from the attitudes participants had just indicated (in the thermometer ratings). The introduction further encouraged participants to think of implicit attitudes as a person’s first spontaneous reactions toward different attitude targets, which may differ from what a person would think or say when they have time to reflect on their attitudes toward the same targets. Across all studies, there were several versions of these instructions in English and German with minor variations. All instructions can be found in the materials file.
After this, participants completed, depending on the study, one or two trial predictions toward dogs vs. cats and/or insects vs. flowers to become familiar with the prediction procedure. They then completed the critical predictions toward the five upcoming IAT social group pairs (Black/White, Latino/White, Asian/White, Celebrity/Regular, Child/Adult). The prediction slides were structured as follows: In the top part, all pictures that were used in the upcoming IATs were depicted, sorted such that all pictures of the target group appeared on one side and all pictures of the comparison group on the other. A prompt in the center asked participants to indicate what they thought their “implicit attitude” toward these social categories was. The bottom part showed the prediction scale (depending on the study, a 7-point scale, an 11-point scale, or a slider) ranging from “a lot more positive toward [TARGET GROUP]” to “a lot more positive toward [COMPARISON GROUP]”. The direction of the scale matched the depiction of the pictures above, such that if the target group was presented on the left-hand side, positive reactions toward the target group were also indicated on the left end of the scale, and vice versa. The exact wording used in all studies can be found in the materials file.
Implicit evaluations. Implicit evaluations were assessed using evaluative Implicit Association Tests (IATs; Greenwald et al., 1998) following the procedure used in the studies by Hahn et al. (2014). The studies used different software for implementing the IATs: they were programmed in Python, Inquisit, DirectRT, or an adapted version of a JavaScript-based program in Qualtrics developed by Carpenter et al. (2019) (the Study Codes in Table 1 indicate which program was used for each study). In every study, participants completed five evaluative IATs in individually randomized order toward five different social groups with the same comparison group (non-celebrity White Adults). The IATs used the labels “Black vs. White”, “Latino vs. White”, “Asian vs. White”, “Celebrity vs. Regular”, and “Child vs. Adult”. Participants were instructed to position their fingers on a left and a right key on their keyboard (depending on the study, the left key was “A” or “E”, and the right key was “5” (on the number pad) or “I”) and to sort pictures and words according to the assignments at the top of the screen. The categories “Bad” and “Good” were each represented by 10 negative or positive words. The specific words differed slightly between studies and can be found on OSF. The social categories were represented by 10 pictures (five male, five female) per target category (Black, Latino, Asian, Celebrity, Child). The comparison category (White, Regular, Adult) was represented by 10 pictures (five male, five female) per IAT (50 different pictures in total). The pictures differed slightly between studies and can be clustered into three sets, which can be found in the materials section on OSF. Which set of pictures was used in each study is listed in Table S1 in the materials file.
Participants were instructed to respond as quickly as possible while making as few mistakes as possible. If participants pressed the wrong key, they saw a red “X” and could only proceed by pressing the correct key. The latency for wrong trials was measured from trial onset until the correct response key was pressed. All IATs used a 250 ms interstimulus interval. Before completing the five target IATs, every participant completed an initial 20-trial word-sorting block in which they sorted positive and negative words to the left and right sides. After that, every IAT consisted of four blocks. Block 1 consisted of 20 trials in which participants sorted pictures of the target and the comparison group. In Block 2, participants sorted both pictures and words for 40 trials. Block 3 consisted of 40 trials in which the pictures had to be sorted to reversed sides. In Block 4, participants again sorted both pictures and words with the picture sides reversed, such that pictures that had shared a response side with, e.g., negative words in Block 2 now shared a side with positive words, and vice versa.
Following recommendations by Greenwald et al. (2003), we calculated an IAT D score for each IAT by subtracting the mean reaction time of the compatible block (in which positive words were paired with the comparison groups) from the mean reaction time of the incompatible block (in which positive words were paired with the target groups) and dividing the difference by their inclusive standard deviation. We did this separately for the first and second halves of the combined blocks (Blocks 2 and 4), yielding two D scores that we averaged into a final D score. Higher D scores indicate faster reactions when positive words were paired with the comparison groups (White, Regular, or Adult) and negative words were paired with the respective target group (Black, Asian, Latino, Celebrity, or Child).
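To make the scoring concrete, the following is a minimal Python sketch of the D-score logic described above. The function name and array inputs are hypothetical, not the scripts used in the studies; the actual computation was applied separately to the first and second halves of Blocks 2 and 4 and the two resulting scores were averaged.

```python
import numpy as np

def d_score(compatible_rts, incompatible_rts):
    """One IAT D score from two blocks of response latencies in ms (sketch)."""
    compatible = np.asarray(compatible_rts, dtype=float)
    incompatible = np.asarray(incompatible_rts, dtype=float)

    # Drop overly slow trials (>= 10,000 ms), per Greenwald et al. (2003).
    compatible = compatible[compatible < 10_000]
    incompatible = incompatible[incompatible < 10_000]

    # Mean latency difference divided by the "inclusive" standard deviation,
    # i.e., the SD computed across the pooled trials of both blocks.
    pooled_sd = np.std(np.concatenate([compatible, incompatible]), ddof=1)
    return (incompatible.mean() - compatible.mean()) / pooled_sd

# Final score: average of the two half-block D scores, e.g.,
# final_d = (d_score(comp_1st, incomp_1st) + d_score(comp_2nd, incomp_2nd)) / 2
```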
Procedure
All included studies followed the basic prediction paradigm introduced by Hahn et al. (2014). Participants first completed the explicit thermometer ratings in block-randomized orders: (1) ethnic groups in randomized order, (2) celebrities followed by regular people, and (3) children followed by adults. Next, participants completed the prediction procedure, with the prediction slides presented in random order for each participant. Then, participants completed the five IATs toward the social group pairs in randomized order. Participants concluded by answering demographic questions.
Analyses
Prediction Accuracy. To calculate how accurately participants predicted the patterns of their IAT results in each study, we ran a multi-level model predicting participants’ person-standardized IAT scores from their person-standardized predictions on level 1. The random slopes in this analysis are equivalent to a correlation coefficient per participant between their IAT scores and predictions. To examine the mean prediction accuracy in each sample, we examined the slope on level 2 (fixed effect), which equals the average size of the random slopes. Because the distributions of the within-subject correlations per person were left-skewed and deviated significantly from a normal distribution in all 17 studies (see supplement for details), we also calculated a corrected average (for a comparable procedure see Hahn & Goedderz, 2020; Rahmani Azad et al., 2023). The corrected average was calculated by Fisher z-transforming the distribution of correlations, calculating the mean of this distribution, and backtransforming this mean into a correlation. We call this corrected average the “backtransformed” or “corrected r” throughout this paper. This second analytic approach was not preregistered, but we considered it important for accounting for the skewness in the data when estimating effect sizes.
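As an illustration of the corrected average, the Python sketch below computes per-person within-subject correlations across the five IATs and backtransforms the mean Fisher z value. The function name and the wide-format input arrays are hypothetical, not the scripts used in the reported analyses.

```python
import numpy as np

def corrected_average_r(iat_scores, predictions):
    """Backtransformed average within-subject correlation (sketch).

    Assumes two arrays of shape (n_participants, 5): one row per
    participant, one column per target-pair IAT.
    """
    rs = np.array([
        np.corrcoef(iat_row, pred_row)[0, 1]       # within-subject r
        for iat_row, pred_row in zip(iat_scores, predictions)
    ])
    z = np.arctanh(np.clip(rs, -0.999, 0.999))     # Fisher z-transform
    return np.tanh(z.mean())                       # backtransform the mean z
```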
To estimate the average effect size of the prediction accuracy across studies, we ran two random-effects models. In the first model, as preregistered, we weighted the estimates of the fixed effects from the multi-level analysis with the inverse-variance method. In a second model, we weighted the z-transformed average correlations with the inverse-variance method.
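To illustrate the inverse-variance weighting in these random-effects models, here is a minimal Python sketch of a DerSimonian-Laird estimator. The reported analyses were run with standard meta-analysis tooling, so this only shows the weighting logic rather than the exact pipeline; the function name and inputs are hypothetical.

```python
import numpy as np

def random_effects_meta(estimates, variances):
    """DerSimonian-Laird random-effects meta-analysis (sketch).

    `estimates` are per-study effect sizes (e.g., fixed effects or
    z-transformed mean correlations); `variances` their sampling variances.
    """
    y = np.asarray(estimates, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                  # inverse-variance weights
    y_fixed = np.sum(w * y) / np.sum(w)          # fixed-effect pooled mean
    q = np.sum(w * (y - y_fixed) ** 2)           # Cochran's Q
    df = len(y) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                # between-study variance
    w_star = 1.0 / (v + tau2)                    # random-effects weights
    mu = np.sum(w_star * y) / np.sum(w_star)     # pooled estimate
    se = np.sqrt(1.0 / np.sum(w_star))
    i2 = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0
    return mu, se, q, i2
```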
Implicit-Explicit Relationship. To analyze how strongly participants’ explicit evaluations were related to their implicit evaluations, we ran a multi-level model per study regressing person-standardized IAT scores onto person-standardized explicit thermometer difference scores on level 1 and examined the level-2 fixed effect. The distribution of the within-subject correlations was again skewed (though less so than for the prediction accuracy; see supplement for details) and deviated significantly from a normal distribution in all 17 studies. Thus, we also calculated a corrected average for the implicit-explicit relationship as described above, by Fisher z-transforming the distribution of correlations, calculating the mean of the distribution, and backtransforming this mean into a correlation.
To further investigate whether the relation between explicit and implicit evaluations could be explained by participants’ predictions, in a third analysis we regressed participants’ person-standardized IAT scores simultaneously onto their person-standardized predictions and explicit thermometer difference scores on level 1 and examined the level-2 fixed effects.
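As a sketch of this simultaneous level-1 model, the following Python code fits a mixed model with random slopes per participant on simulated toy data. The column names, simulated effect sizes, and the use of statsmodels are all illustrative assumptions; the original analyses were not necessarily run with this library.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format toy data: 200 participants x 5 IATs, with
# person-standardized predictors (all names and effects hypothetical).
rng = np.random.default_rng(1)
n, k = 200, 5
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n), k),
    "pred_z": rng.standard_normal(n * k),
    "therm_z": rng.standard_normal(n * k),
})
df["iat_z"] = 0.45 * df["pred_z"] + rng.standard_normal(n * k) * 0.9

# Level 1: IAT scores regressed simultaneously on predictions and
# thermometer ratings; level 2: random slopes per participant.
model = smf.mixedlm(
    "iat_z ~ pred_z + therm_z",
    data=df,
    groups=df["participant"],
    re_formula="~pred_z + therm_z",
)
fit = model.fit()
print(fit.fe_params)  # fixed effects = average within-person slopes
```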
We meta-analyzed the obtained fixed effect estimates from both multi-level models and the z-transformed correlation in three random-effects models using inverse-variance weighting to investigate their average sizes across studies.
Predicting the Predictions: Simultaneous Regression of Predictions Onto IAT Scores and Explicit Thermometer Ratings. We preregistered an additional analysis, not explained above, that Hahn et al. (2014) conducted only in their Study 4. This analysis examines the unique within-subject relationships between participants’ predictions and their explicit and implicit evaluations, controlling for the respective other type of evaluation. The purpose was to see to what degree participants’ predictions were based on the same information that went into their explicit evaluations beyond the spontaneous reactions reflected on implicit evaluations, and, vice versa, to what degree their predictions uniquely incorporated the spontaneous reactions reflected on implicit measures in ways that are not reflected on traditional explicit measures. To this end, we regressed participants’ person-standardized predictions simultaneously onto their person-standardized IAT scores and explicit thermometer difference scores on level 1 and examined the fixed effects on level 2. We meta-analyzed the two fixed effects from this analysis (for IAT scores and for explicit thermometer ratings) with two random-effects models using inverse-variance weighting to estimate their average effect sizes across studies.
Prediction Accuracy Beyond Normative Prediction Patterns That are Shared with Other Participants. To investigate whether participants’ predictions are related to their patterns of IAT results beyond a descriptively normative tendency, we adopted the analytic approach described earlier, in which Hahn et al. (2014) predicted participants’ IAT scores from another person’s predictions in the same sample. In their study, they paired each person with one other person from the same sample. To ensure that the obtained results are not bound to one specific random pairing, we iterated this procedure 1000 times (see also Rahmani Azad et al., 2023 for similar analyses). Specifically, we ran a multi-level model in which every participant’s person-standardized IAT scores were predicted by another participant’s person-standardized predictions on level 1 and examined the level-2 fixed effect, which indicates how accurately, on average, another person in the sample predicted a participant’s pattern of IAT scores. We repeated this analysis 1000 times, such that every participant’s IAT scores were predicted 1000 times by another person’s predictions, and averaged the 1000 fixed effects. In a second step, we took the same 1000 random pairings and entered them into a multi-level model in which participants’ IAT scores were predicted simultaneously by their own predictions and the random other persons’ predictions. We again averaged the fixed effects on level 2 to estimate whether participants’ own predictions explained variance in their patterns of IAT scores over and above the other persons’ predictions.
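The following Python sketch shows one way to generate such random re-pairings. The helper name is hypothetical, and we cannot confirm that the original scripts excluded self-pairings in exactly this way; treat it as an illustration of the iteration logic.

```python
import numpy as np

def random_pairings(n_participants, n_iter=1000, seed=0):
    """Yield permutations assigning each participant another participant's
    predictions, re-drawing until nobody is paired with themselves (sketch)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        perm = rng.permutation(n_participants)
        while np.any(perm == np.arange(n_participants)):
            perm = rng.permutation(n_participants)
        yield perm

# Usage: for each `perm`, refit the multi-level model predicting IAT scores
# from predictions[perm] (alone, then alongside participants' own
# predictions) and average the 1000 resulting fixed effects.
```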
To estimate the meta-analytical effects of the described analyses across studies, we ran three random-effects models on the averaged fixed effects obtained from both analyses, weighting them using the inverse-variance method.7
Between-Subjects Analysis. To examine the degree to which participants knew how much bias they would show in comparison to others, we also examined between-subjects correlations per social category. To this end, we standardized predictions and IAT scores by social category and ran a multilevel model predicting IAT scores from predictions by IAT type. We examined the fixed effect, which is equivalent to the average between-subjects correlation between predictions and IAT scores across IAT types. To estimate the meta-analytical effect of this analysis, we ran a random-effects model weighting the fixed effects derived from this analysis using the inverse-variance method.
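Because the fixed effect in this model is equivalent to the average between-subjects correlation, the quantity can be illustrated directly. In this hypothetical Python sketch, toy data are correlated within each IAT type across participants and the per-category correlations are averaged; all names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy long-format data: 100 participants x 5 IAT types.
rng = np.random.default_rng(2)
categories = ["Black", "Latino", "Asian", "Celebrity", "Child"]
df = pd.DataFrame({
    "iat_type": np.tile(categories, 100),
    "pred": rng.standard_normal(500),
})
df["iat_d"] = 0.3 * df["pred"] + rng.standard_normal(500)

# Between-subjects correlation per IAT type, averaged across the five
# types; this is what the fixed effect in the multilevel model estimates.
avg_r = np.mean([
    np.corrcoef(g["pred"], g["iat_d"])[0, 1]
    for _, g in df.groupby("iat_type")
])
print(round(avg_r, 3))
```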
Results
Prediction Accuracy
Multi-Level Analysis. Estimates from the multi-level models showed that participants in all 17 studies were able to accurately predict the patterns of their IAT results, with effects ranging from a minimum of b = 0.25, 95% CI [0.14, 0.36], t(92) = 4.52, p < .001 to a maximum of b = 0.63, 95% CI [0.54, 0.71], t(324) = 14.41, p < .001. The meta-analytical effect in a random-effects model was b = 0.44, 95% CI [0.39, 0.49] and was significantly different from zero, Z = 16.97, p < .001 (see Figure 1). Effect sizes varied systematically between studies as indicated by the significant Cochran’s Q statistic for heterogeneity, Q(16) = 77.35, p < .001, with I² = 79%, 95% CI [68%, 87%].
Note. Estimates in each study are calculated on within-subject standardized scores, once per participant, and aggregated across participants in a multi-level analysis regressing IAT scores on IAT score predictions (Panel A) and IAT scores on explicit thermometer ratings (Panel B). The meta-analytical effect weights the estimates of the fixed effects using the inverse-variance method. Note that the confidence intervals in these figures may differ slightly from those reported in the multi-level analyses because the meta-analysis calculates confidence intervals using degrees of freedom based on the sample size, whereas the multi-level models based confidence intervals on the Satterthwaite approximation of degrees of freedom.
To examine this heterogeneity in effect sizes, we ran subgroup analyses for the study setting (Online vs. Lab), the sample (General Public vs. Students), the study language (English vs. German), and the status of publication (Published vs. Unpublished). The meta-analytical effects for each subgroup can be found in Figure 2.
Note. Backtransformed correlations are calculated by regressing within-subject standardized IAT scores on IAT score predictions (Panel A) and similarly standardized IAT scores on explicit thermometer ratings (Panel B), Fisher z-transforming these within-subject correlations, calculating their mean for each study, and backtransforming that mean into a correlation. The meta-analytical effect weights the backtransformed correlations using the inverse-variance method.
Results showed that the prediction accuracy effect was significantly higher in studies that were conducted in the laboratory (b = 0.48, 95% CI [0.44, 0.52]) than in studies that were conducted online (b = 0.28, 95% CI [0.22, 0.34]), Q(1) = 29.45, p < .001. Further, studies with student samples showed higher prediction accuracy effects (b = 0.47, 95% CI [0.43, 0.51]) than studies conducted on the general public (b = 0.27, 95% CI [0.20, 0.34]), Q(1) = 23.46, p < .001. Though differences were less pronounced for the publication status of the studies, prediction accuracy effects were larger for published studies (b = 0.51, 95% CI [0.47, 0.55]) than for unpublished studies (b = 0.41, 95% CI [0.35, 0.48]), Q(1) = 6.57, p = .010. Effects did not significantly differ for studies that were conducted in English (b = 0.41, 95% CI [0.32, 0.51]) as opposed to studies that were conducted in German (b = 0.45, 95% CI [0.39, 0.51]), Q(1) = 0.42, p = .515.
Backtransformed Correlations. Because the distribution of the within-subject correlations was left-skewed, with many high correlations and few low or negative correlations, we additionally examined backtransformed (corrected) correlations, which account for the skewness in the data (for details see the Analyses section). Examining backtransformed correlations corroborated the findings from the multi-level model approach. Again, in all 17 studies participants were able to accurately predict the patterns of their IAT results, with a larger overall meta-analytical effect of r = 0.56, 95% CI [0.51, 0.61], Z = 16.28, p < .001 (see Figure 2). Effect sizes again varied systematically between studies as indicated by the significant Cochran’s Q statistic for heterogeneity, Q(16) = 69.10, p < .001, with I² = 77%, 95% CI [63%, 85%]. The meta-analytical effects for the different subgroups can be found in Figure 3.
Note. The effects in Panel A are based on two separate multi-level models per study predicting IAT scores from IAT score predictions or explicit thermometer ratings. All scores are standardized within-subjects per participant and aggregated across participants in the multi-level analysis. The resulting fixed effects were entered into a meta-analysis using inverse-variance weighting. The effects in Panel B are based on the distribution of within-subject correlations per person calculated between IAT scores and IAT score predictions or between IAT scores and explicit thermometer ratings. The correlations were Fisher z-transformed, and the mean of the resulting distribution was backtransformed into a correlation. This corrected average correlation was meta-analyzed using inverse-variance weighting.
Results mirrored the analysis with multi-level estimates. The prediction accuracy effect was significantly higher in laboratory settings (r = 0.61, 95% CI [0.58, 0.64]) than in online settings (r = 0.37, 95% CI [0.28, 0.44]), Q(1) = 35.58, p < .001, and higher in student samples (r = 0.60, 95% CI [0.56, 0.64]) than in the general public (r = 0.36, 95% CI [0.27, 0.45]), Q(1) = 24.61, p < .001. Again, differences were less pronounced for the publication status of the studies, with larger prediction accuracy effects for published studies (r = 0.63, 95% CI [0.59, 0.67]) than for unpublished studies (r = 0.56, 95% CI [0.46, 0.60]), Q(1) = 5.71, p = .017. Effects again did not differ significantly between English studies (r = 0.54, 95% CI [0.43, 0.63]) and German studies (r = 0.58, 95% CI [0.52, 0.63]), Q(1) = 0.49, p = .483.
Implicit-Explicit Relationship
Multi-Level Analysis. Explicit thermometer ratings were inconsistently related to the pattern of IAT results, with effect sizes ranging from b = -0.03, 95% CI [-0.17, 0.12], t(58) = -0.39, p = .697 to b = 0.42, 95% CI [0.31, 0.53], t(64) = 7.73, p < .001. The meta-analytical effect in a random-effects model was significantly different from zero, b = 0.24, 95% CI [0.17, 0.30], Z = 7.28, p < .001, and substantially lower than the meta-analytical effect for the prediction accuracy (see Figure 1). Effect sizes varied systematically between studies as indicated by the significant Cochran’s Q statistic for heterogeneity, Q(16) = 92.95, p < .001, with I² = 83%, 95% CI [74%, 89%].
Subgroup analyses showed that explicit thermometer ratings were more strongly related to IAT patterns in laboratory settings (b = 0.29, 95% CI [0.25, 0.34]) than in online settings (b = 0.02, 95% CI [-0.05, 0.08]), Q(1) = 47.85, p < .001. Effects were also stronger for student samples (b = 0.28, 95% CI [0.24, 0.33]) than for the general public (b = 0.00, 95% CI [-0.08, 0.07]), Q(1) = 41.73, p < .001. Although the difference between studies conducted in German (b = 0.30, 95% CI [0.25, 0.35]) and studies conducted in English (b = 0.15, 95% CI [0.03, 0.27]) was smaller, it was significant, Q(1) = 5.07, p = .024, mainly due to the large variation in estimates in the English-language studies, which included both lab and online studies. There were no differences between published (b = 0.27, 95% CI [0.17, 0.36]) and unpublished studies (b = 0.23, 95% CI [0.15, 0.31]), Q(1) = 0.44, p = .505.
Backtransformed Correlations. Because the distributions of the within-subject correlations between explicit thermometer ratings and IAT scores were again skewed and deviated from normality in all 17 studies, we again examined backtransformed correlations to account for the skewness in the data. Results replicated the findings from the multi-level-model analyses. The meta-analytical effect was slightly larger than the meta-analytical effect based on the multi-level estimates, r = 0.32, 95% CI [0.24, 0.39], Z = 7.29, p < .001, but still substantially smaller than the prediction accuracy effect based on backtransformed correlations (see Figure 2). Effect sizes varied systematically between studies, as indicated by the significant Cochran’s Q statistic for heterogeneity, Q(16) = 81.28, p < .001, with I² = 80%, 95% CI [69%, 87%].
Analogous to the multi-level analysis, subgroup analyses showed that explicit thermometer ratings were more strongly related to IAT patterns in laboratory settings (r = 0.38, 95% CI [0.33, 0.44]) than in online settings (r = 0.03, 95% CI [-0.07, 0.12]), Q(1) = 40.32, p < .001, and more strongly for student samples (r = 0.37, 95% CI [0.31, 0.43]) than for the general public (r = 0.02, 95% CI [-0.09, 0.12]), Q(1) = 32.62, p < .001. Effects again differed to a smaller degree between studies conducted in German (r = 0.39, 95% CI [0.33, 0.45]) and studies conducted in English (r = 0.22, 95% CI [0.06, 0.36]), Q(1) = 4.40, p = .036. Effects did not differ between published (r = 0.35, 95% CI [0.23, 0.46]) and unpublished studies (r = 0.31, 95% CI [0.20, 0.40]), Q(1) = 0.28, p = .598.
Predictions vs. Explicit Thermometer Ratings
In all 17 studies, predictions were more strongly related to participants’ patterns of IAT results than explicit thermometer ratings, a pattern that was even more pronounced in the simultaneous model predicting IAT score patterns from predictions and explicit thermometer ratings. While in the simultaneous model predictions remained a significant predictor in all 17 studies (effects ranged from b = 0.27, 95% CI [0.14, 0.41], t(86) = 4.06, p < .001 to b = 0.58, 95% CI [0.48, 0.68], t(318) = 11.33, p < .001), explicit thermometer ratings only remained significantly (positively) related to IAT score patterns in two studies (effects ranged from b = -0.19, 95% CI [-0.35, -0.03], t(62) = -2.39, p = .020 to b = 0.16, 95% CI [0.02, 0.29], t(115) = 2.33, p = .022; for an overview of all effects see Table 2). The meta-analyses on both effects in the simultaneous models corroborated these results and showed that the meta-analytical effect of the prediction accuracy remained significant, b = 0.45, 95% CI [0.41, 0.49], Z = 21.08, p < .001, while the meta-analytical effect of the explicit thermometer ratings did not significantly differ from zero, b = 0.00, 95% CI [-0.05, 0.05], Z = -0.04, p = .966. Both effects showed significant Cochran’s Q statistics for heterogeneity, Qpredictions(16) = 48.71, p < .001, I²Predictions = 67%, 95% CI [45%, 80%], QThermometer(16) = 53.53, p < .001, I²Thermometer = 70%, 95% CI [51%, 82%].
| Study ID | Study | Predictions: Simple Model | Predictions: Controlling for Explicit Thermometer Ratings (Simultaneous Regression) | Explicit Thermometer Ratings: Simple Model | Explicit Thermometer Ratings: Controlling for Predictions (Simultaneous Regression) |
|---|---|---|---|---|---|
1 | UOBSGQG2020 | .32*** | .41*** | .09 | -.18* |
2 | UOBSGQU2020 | .26*** | .35*** | .03 | -.14 |
3 | UOBSGQO2020 | .29*** | .41*** | -.03 | -.19* |
4 | ULBSGDG2019 | .32*** | .27*** | .25*** | .11 |
5 | ULESGDG2019 | .40*** | .32*** | .34*** | .13 |
6 | UOBSGIU2019 | .25*** | .38*** | -.01 | -.21** |
7 | ULESGIG2019 | .37*** | .33*** | .29*** | .08 |
8 | ULESGIG2018_a | .47*** | .51*** | .24*** | -.05 |
9 | ULESGIG2018_b | .43*** | .46*** | .25*** | -.04 |
10 | ULESGDG2018 | .51*** | .49*** | .36*** | .07* |
11 | PLESGIG2016 | .50*** | .45*** | .36*** | .10* |
12 | ULBSGDG2015 | .63*** | .58*** | .40*** | .10 |
13 | PLESGDG2015 | .49*** | .45*** | .34** | .06 |
14 | ULESGDC2014 | .50*** | .41*** | .42*** | .16* |
15 | PLESGDC2013 | .47*** | .51*** | .16*** | -.07* |
16 | ULBSGPU2012 | .52*** | .52*** | .25*** | .00 |
17 | PLESGPU2011 | .54*** | .56*** | .21*** | -.02 |
Note. Relationships are calculated on standardized scores within-subjects, once per participant, and then aggregated across participants in a multi-level analysis. Estimates from simple models were calculated in two separate models (1) predicting IAT scores from predictions and (2) predicting IAT scores from explicit thermometer ratings. Estimates from the simultaneous regression were calculated by simultaneously predicting IAT scores from predictions and explicit thermometer ratings.
* indicates significance at the p < .05 level, ** at the p < .01 level, and *** at the p < .001 level.
Interestingly, no subgroup analyses for the prediction-accuracy-beyond-explicit-ratings effect in the simultaneous model revealed significant differences between groups (Setting: Q(1) = 3.18, p = .074; Sample: Q(1) = 3.16, p = .076; Language: Q(1) = 0.39, p = .533; Publication status: Q(1) = 3.51, p = .061). In contrast, the meta-analytical effect for explicit thermometer ratings beyond predictions differed in the subgroup analyses depending on the study setting (Q(1) = 28.21, p < .001) and sample (Q(1) = 21.24, p < .001). Explicit thermometer ratings were negative unique predictors of IAT score patterns when studies were conducted online (b = -0.18, 95% CI [-0.25, -0.11]) or on the general public (b = -0.18, 95% CI [-0.26, -0.10]). In contrast, there was no (negative or positive) effect of explicit thermometer ratings beyond predictions in the lab (b = 0.04, 95% CI [-0.00, 0.08]) and student (b = 0.03, 95% CI [-0.01, 0.07]) samples. That is, while raw prediction accuracy was lower in online than in lab samples (and in general-population as opposed to student samples), this difference disappeared when controlling for explicit thermometer ratings. In other words, explicit thermometer ratings showed a suppression effect on prediction accuracy in the online and general-public samples, but not in the lab and student samples. Once this suppression effect was controlled for, the difference between the lab and online settings was no longer significant (p = .074). The explicit thermometer effects also differed significantly, to a smaller degree, for the language subgroups (Q(1) = 3.94, p = .047), again mainly due to the large variation in estimates in the English-language studies, which included both lab and online studies. Effects did not differ by publication status (Q(1) = 0.25, p = .616). An overview of all meta-analytical effects of the prediction accuracy and explicit thermometer ratings for all subgroups can be found in Figure 4.
Note. The effects are based on a multi-level analysis per study in which IAT scores were simultaneously predicted from IAT score predictions and thermometer ratings. All scores were standardized within-subjects per participant and aggregated across participants in the multi-level analysis. The resulting fixed effects were entered into a meta-analysis using the inverse-variance method of weighting.
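The suppression pattern described above can be illustrated with a small simulation. The data-generating process below is hypothetical (all coefficients invented for illustration): predictions blend the automatic evaluations that drive IAT scores with explicit evaluations, and controlling for the explicit component increases the unique prediction-accuracy coefficient while turning the explicit coefficient negative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process (all coefficients invented):
# predictions blend the automatic evaluations that drive IAT scores
# with explicit evaluations that are only weakly related to them.
automatic = rng.normal(size=n)
explicit = 0.2 * automatic + rng.normal(size=n)
prediction = 0.6 * automatic + 0.6 * explicit + rng.normal(size=n)
iat = automatic + 0.5 * rng.normal(size=n)

def std(x):
    return (x - x.mean()) / x.std()

y = std(iat)
X_simple = np.column_stack([np.ones(n), std(prediction)])
X_both = np.column_stack([np.ones(n), std(prediction), std(explicit)])

b_simple = np.linalg.lstsq(X_simple, y, rcond=None)[0]
b_both = np.linalg.lstsq(X_both, y, rcond=None)[0]

print(f"prediction alone:      b = {b_simple[1]:.2f}")
print(f"prediction | explicit: b = {b_both[1]:.2f}")  # rises: suppression
print(f"explicit | prediction: b = {b_both[2]:.2f}")  # turns negative
```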
Predicting the Predictions: Simultaneous Regression of Predictions onto IAT Scores and Explicit Thermometer Ratings
In all 17 studies, predictions were significantly related to both participants’ IAT score patterns and their patterns of thermometer ratings in a simultaneous model. The effects for IAT scores ranged from b = 0.21, 95% CI [0.10, 0.32], t(71) = 3.82, p < .001 to b = 0.49, 95% CI [0.40, 0.58], t(72) = 11.24, p < .001, with a meta-analytical effect of b = 0.34, 95% CI [0.30, 0.38], Z = 15.98, p < .001. The effects for thermometer ratings ranged from b = 0.16, 95% CI [0.02, 0.29], t(115) = 2.33, p = .022 to b = 0.57, 95% CI [0.45, 0.68], t(60) = 9.61, p < .001, with a meta-analytical effect of b = 0.44, 95% CI [0.39, 0.49], Z = 17.66, p < .001. Both effects showed significant Cochran’s Q statistics for heterogeneity, QIAT(16) = 81.15, p < .001, I²IAT = 80%, 95% CI [69%, 87%], QThermometer(16) = 76.40, p < .001, I²Thermometer = 79%, 95% CI [67%, 87%]. An overview of the fixed-effect estimates across studies and the meta-analytical effects can be found in Figure 5.
Note. Estimates in each study are calculated on standardized scores within-subjects, once per participant, and aggregated across participants in a multi-level analysis simultaneously regressing IAT score predictions on IAT scores (Panel A) and thermometer ratings (Panel B). The meta-analytical effect weights the estimates of the fixed effects with the inverse-variance method. Note that the confidence intervals in these figures may differ slightly from those reported in the multi-level analyses because in the meta-analysis confidence intervals are calculated using degrees of freedom based on the sample size, while in the multi-level model confidence intervals were based on the Satterthwaite approximation of degrees of freedom.
Subgroup analyses showed that prediction patterns were less strongly related to participants’ IAT score patterns when studies were conducted online (b = 0.28, 95% CI [0.23, 0.33]) than when they were conducted in the laboratory (b = 0.35, 95% CI [0.30, 0.40]), but this difference fell just short of significance, Q(1) = 3.73, p = .053. All other subgroup analyses for the IAT score effect showed no significant differences between groups (Sample: Q(1) = 2.20, p = .138; Language: Q(1) = 1.97, p = .160; Publication status: Q(1) = 2.13, p = .144). As a mirror image, thermometer ratings were less strongly related to participants’ prediction patterns in studies conducted in the laboratory (b = 0.42, 95% CI [0.37, 0.48]) than in studies conducted online (b = 0.51, 95% CI [0.45, 0.57]), but here too the effect fell just short of significance, Q(1) = 3.70, p = .055. These results echo the suppression effect above in that they suggest there may be more consistency between IAT score predictions and thermometer ratings in online as opposed to lab settings. Future research is needed to determine whether participants generally tend to make predictions that are more consistent with their explicit evaluations in online studies than in the lab, and whether this explains why the prediction accuracy effect rises when controlling for thermometer ratings in the analyses. For the thermometer-rating effects, subgroup analyses again showed significant differences by language, Q(1) = 5.28, p = .022. These effects did not differ by type of sample (Q(1) = 1.22, p = .269) or publication status (Q(1) = 0.34, p = .559). All meta-analytical effects with IAT score predictions as the dependent variable can be found in Figure 6.
Note. The effects are based on a multi-level analysis per study in which IAT score predictions were simultaneously predicted from IAT scores and thermometer ratings. All scores were standardized within-subjects per participant and aggregated across participants in the multi-level analysis. The resulting fixed effects were entered into a meta-analysis using the inverse-variance method of weighting.
Prediction Accuracy Beyond Normative Patterns Based on Other Participants’ Predictions
The average prediction accuracy of the randomly paired other participants ranged from b = 0.13, 95% CI [0.02, 0.24] to b = 0.44, 95% CI [0.38, 0.50], indicating that in all 17 studies the randomly paired other participants’ prediction patterns were, on average, positively related to participants’ own patterns of IAT results. In 16 of these 17 studies, participants’ own predictions descriptively showed higher accuracies than the random other participants’ predictions. This difference was significant in 12 of the 17 studies, as indicated by 95% CIs of the random other prediction accuracies that did not include the average participants’ own prediction accuracy (see Figure 7). The meta-analytical effect corroborated this finding and showed that across all studies, participants’ own prediction patterns were significantly more strongly related to their patterns of IAT scores (b = 0.44, 95% CI [0.39, 0.49], Z = 17.38, p < .001) than the average other participants’ prediction patterns (b = 0.33, 95% CI [0.28, 0.38], Z = 13.51, p < .001).
Note. The effects of participants’ own predictions are based on a multi-level analysis per study in which participants’ IAT scores were predicted from their own IAT score predictions and are the same as the effects reported in the prediction accuracy subsection. The effects for random other participants’ predictions are the average of 1000 iterations of a multi-level analysis predicting participants’ IAT scores from a randomly paired other participant’s predictions. All scores were standardized within-subjects per participant and aggregated across participants in the multi-level analysis. The resulting fixed effects were entered into a meta-analysis using the inverse-variance method of weighting.
These effects replicated in the simultaneous model in which we predicted participants’ patterns of IAT scores simultaneously from their own prediction patterns and the randomly paired others’ prediction patterns. In 16 of the 17 studies, the randomly paired other participants’ prediction patterns explained variance in participants’ own IAT score patterns over and above participants’ own prediction patterns (effects ranged from b = 0.10, 95% CI [-0.01, 0.21] to b = 0.31, 95% CI [0.23, 0.38]). However, again in 16 of the 17 studies, participants’ own prediction accuracies descriptively outperformed the random other participants’ prediction accuracies (effects ranged from b = 0.24, 95% CI [0.20, 0.29] to b = 0.54, 95% CI [0.50, 0.58]). The meta-analytical effects supported this pattern of findings, showing that across all studies, participants’ own prediction patterns significantly explained variance in their own patterns of IAT results (b = 0.36, 95% CI [0.32, 0.40], Z = 16.85, p < .001) over and above the randomly paired others’ prediction patterns (b = 0.21, 95% CI [0.18, 0.24], Z = 15.95, p < .001).
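The random-pairing baseline can be sketched as follows, assuming a participants × targets matrix of within-person standardized scores (hypothetical input; per-person OLS slopes stand in here for the article’s multi-level models):

```python
import numpy as np

rng = np.random.default_rng(42)

def own_vs_other_accuracy(iat, pred, n_iter=1000):
    """Compare own-prediction accuracy against a random-pairing baseline.

    iat, pred: (n_participants, n_targets) arrays of within-person
    standardized IAT scores and IAT score predictions. Returns the
    average own-prediction slope and the mean slope across n_iter
    random pairings. With standardized scores, the per-person slope
    equals the within-person correlation.
    """
    def mean_slope(x, y):
        return np.mean(np.sum(x * y, axis=1) / np.sum(x * x, axis=1))

    own = mean_slope(pred, iat)
    others = []
    for _ in range(n_iter):
        # Re-pair participants at random; occasional self-pairings are
        # possible under a plain permutation and left in for simplicity.
        perm = rng.permutation(len(pred))
        others.append(mean_slope(pred[perm], iat))
    return own, float(np.mean(others))
```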
Subgroup analyses showed that the average random other participants’ prediction accuracies in the simple model were higher when studies were conducted in the laboratory (b = 0.38, 95% CI [0.36, 0.41]) than online (b = 0.16, 95% CI [0.11, 0.20]), Q(1) = 70.32, p < .001, and higher for student samples (b = 0.37, 95% CI [0.34, 0.40]) than for the general public (b = 0.15, 95% CI [0.09, 0.20]), Q(1) = 49.99, p < .001. The difference by publication status was just significant (Q(1) = 3.92, p = .048), with higher accuracies in published studies (b = 0.39, 95% CI [0.35, 0.43]) than in unpublished studies (b = 0.32, 95% CI [0.25, 0.43]). There were no significant differences by the studies’ language (Q(1) = 2.36, p = .125). This pattern of results replicated in the simultaneous model. Effects were larger for studies conducted in the laboratory (b = 0.23, 95% CI [0.20, 0.25]) than for studies conducted online (b = 0.13, 95% CI [0.08, 0.18]), Q(1) = 13.12, p < .001, and higher for student samples (b = 0.22, 95% CI [0.20, 0.25]) than for the general public (b = 0.12, 95% CI [0.07, 0.18]), Q(1) = 10.76, p < .001. Effects did not differ significantly depending on the studies’ language (Q(1) = 2.51, p = .113) or publication status (Q(1) = 0.00, p = .946). The effect of participants’ own prediction patterns on their IAT score patterns in the simultaneous model was also larger in laboratory studies (b = 0.39, 95% CI [0.35, 0.43]) than in online studies (b = 0.27, 95% CI [0.24, 0.29]), Q(1) = 23.34, p < .001, and larger for student samples (b = 0.38, 95% CI [0.34, 0.43]) than for the general public (b = 0.25, 95% CI [0.23, 0.27]), Q(1) = 29.02, p < .001. This effect was also larger for published studies (b = 0.42, 95% CI [0.37, 0.47]) than for unpublished studies (b = 0.34, 95% CI [0.29, 0.39]), Q(1) = 4.80, p = .028, but did not differ significantly depending on the studies’ language, Q(1) = 0.00, p = .949. Overall, across all subgroups, both participants’ own predictions and the random other participants’ predictions predicted participants’ own IAT score patterns, but participants’ own predictions were consistently more strongly related to their own patterns of IAT results than a randomly paired other participant’s predictions (see Figure 8).
Note. The simple model effects are based on a multi-level analysis per study in which participants’ IAT scores were separately predicted from their own IAT score predictions or from 1000 iterations of randomly paired other participants’ predictions. In the simultaneous model, effects represent the average of 1000 iterations of a multi-level analysis simultaneously predicting participants’ IAT scores from their own predictions and a randomly paired other participant’s predictions. All scores were standardized within-subjects per participant and aggregated across participants in the multi-level analyses. The resulting average effects were entered into a meta-analysis using the inverse-variance method of weighting.
Between-Subjects Analysis
The average between-subject correlations ranged from b = 0.08, 95% CI [-0.03, 0.25], t(4) = 1.39, p = .238 to b = 0.34, 95% CI [0.14, 0.54], t(4) = 4.65, p = .010, with a meta-analytical effect of b = 0.22, 95% CI [0.19, 0.26], Z = 12.06, p < .001. An overview of the between-subject effects across studies and the meta-analytical between-subject effect can be found in Figure 9.
Note. The within-subject effects are based on a multi-level analysis predicting IAT scores from IAT score predictions. In this analysis, all scores were standardized within-subjects per participant and aggregated across participants in the multi-level analysis. The between-subject effects are based on a multi-level analysis predicting IAT scores from IAT score predictions with scores standardized between-subjects per target group and aggregated across target groups in the multi-level analysis. The resulting fixed effects were entered into a meta-analysis using the inverse-variance method of weighting.
Effects varied substantially between studies, as indicated by a significant Cochran’s Q statistic for heterogeneity, Q(16) = 38.34, p = .001, I² = 58%, 95% CI [29%, 76%]. Despite this heterogeneity in effect sizes, none of the subgroup analyses showed systematic differences between the defined groups (Setting: Q(1) = 0.00, p = .999; Sample: Q(1) = 0.27, p = .604; Language: Q(1) = 0.39, p = .533; Publication status: Q(1) = 0.00, p = .978). The meta-analytical between-subject effects for each subgroup can be found in Figure 10.
Note. The within-subject effects are based on a multi-level analysis predicting IAT scores from IAT score predictions. In this analysis, all scores were standardized within-subjects per participant and aggregated across participants in the multi-level analysis. The between-subject effects are based on a multi-level analysis predicting IAT scores from IAT score predictions with scores standardized between-subjects per target group and aggregated across target groups in the multi-level analysis. The resulting fixed effects were entered into a meta-analysis using the inverse-variance method of weighting.
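As the figure notes above describe, the within- and between-subject analyses differ only in how scores are standardized before the same regression is run. A minimal sketch of the two standardizations, assuming a participants × targets score matrix (illustrative only; not the authors’ analysis code):

```python
import numpy as np

def standardize_within_person(scores):
    """Standardize a (participants x targets) matrix within each person.

    Removes each person's own mean and scale, so subsequent regressions
    reflect the *pattern* across targets (introspective awareness).
    """
    m = scores.mean(axis=1, keepdims=True)
    s = scores.std(axis=1, keepdims=True)
    return (scores - m) / s

def standardize_between_subjects(scores):
    """Standardize the same matrix within each target group (column).

    Compares people against each other per target, so relationships also
    depend on shared labeling and comparison standards (social calibration).
    """
    m = scores.mean(axis=0, keepdims=True)
    s = scores.std(axis=0, keepdims=True)
    return (scores - m) / s
```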
Discussion
The main goal of the present meta-analysis was to examine whether the findings by Hahn et al. (2014) that people are able to accurately predict the patterns of their IAT results would replicate across different samples and settings, and to provide a more accurate estimate of the average effect size of the prediction accuracy. To this end, we reanalyzed 17 published and unpublished studies that followed the original prediction paradigm closely. Results replicated Hahn et al.’s (2014) findings and showed that in all 17 studies, participants were able to accurately predict the patterns of their IAT results, with an average within-subject relationship across studies of b = .44 (corrected average correlation: r = .56). This finding further strengthens Hahn et al.’s (2014) claim that the cognitions reflected on implicit evaluations are consciously accessible and reportable. Further, in all studies predictions were more strongly related to IAT score patterns than explicit thermometer ratings, with an average within-subject relationship between thermometer ratings and IAT scores of b = 0.24 (corrected average correlation: r = .32). This highlights that people are willing and able to report on the cognitions reflected on their implicit evaluations even though they may report different evaluations on traditional explicit measures. Predictions further remained a significant predictor explaining variance in IAT score patterns over and above thermometer ratings in a simultaneous model, with an average meta-analytical effect size of b = 0.45. At the same time, thermometer ratings only remained a significant (positive) predictor of IAT score patterns in two of the 17 studies, and the meta-analytical effect of thermometer ratings in this model dropped to b = 0.00. This finding is in line with theorizing by dual-process models such as the APE model (Gawronski & Bodenhausen, 2006) or the MODE model (Fazio, 1990, 2007), which propose that people may be well aware of their automatic cognitions, and that these may partially inform their explicit evaluations, but that additional (propositional) information is considered for their final deliberate answer.
An analysis in which IAT score predictions were simultaneously predicted from IAT scores and traditional explicit measures further showed that both contributed in all studies, with the contribution of explicit measures being somewhat larger. We believe this result reflects the greater methodological overlap between IAT score predictions and traditional explicit measures, as well as the fact that much of the other propositional information that goes into traditional explicit measures (e.g., specific experiences, beliefs, values) is to some degree still reflected in IAT score predictions. That is, the same participants essentially indicated their feelings towards different groups twice, and the two types of evaluations hence showed substantial overlap. Importantly, however, the parts of participants’ IAT score prediction patterns that were inconsistent with their explicit thermometer ratings could be predicted by their IAT scores in all 17 studies. In sum, the meta-analysis confirmed that participants report different evaluations when they are asked to predict IAT scores than when they report on their “feelings” towards different groups, and these IAT score predictions reflected their actual IAT scores with high accuracy.
An often-voiced concern with the original findings is whether participants are indeed introspectively aware of their biases or whether they merely infer their own pattern of biases from expectations about descriptive norms (Morris & Kurdi, 2023). In line with the idea that IAT score patterns are partially shared among people, in all 17 studies randomly paired other participants’ prediction patterns were significantly related to participants’ own IAT score patterns. Importantly, however, in 16 of the 17 studies participants’ own prediction patterns were more strongly related to their own IAT scores, and participants’ own prediction patterns explained variance in their own IAT score patterns over and above the randomly paired other participants’ prediction patterns in 16 of the 17 studies. The meta-analytical effects supported these findings and showed that across all studies, even though the randomly paired other participants’ prediction patterns partially explained variance in participants’ own IAT score patterns with an effect size of b = 0.33 (b = 0.21 in the simultaneous model), participants’ own prediction patterns were more strongly related to their own IAT score patterns, b = 0.44 (b = 0.36 in the simultaneous model). These results suggest that participants’ predictions may be a combination of introspective insight and cultural knowledge, with unique insight playing a slightly larger role.
As pointed out earlier, these findings all rely on within-subject correlations between participants’ person-standardized predictions and their person-standardized IAT results and thus indicate the degree to which participants are aware of their own automatic reactions toward different target groups. As such, the findings indicate how well people know their own reactions if comparison and labeling standards are not taken into consideration. Hahn et al. (2014) showed that between-subject correlations between predictions and IAT scores standardized per target group were considerably smaller than these within-subject correlations. We replicated these findings in all 17 studies and found that the average between-subject correlation was descriptively always smaller than the average within-subject correlation, with a meta-analytical between-subject effect of b = .22 compared to a meta-analytical within-subject effect of b = .44 (corrected average r = .56; see Figure 9). In line with theorizing by Hahn and Goedderz (2020), this further highlights the importance of distinguishing between the concepts of introspective awareness and social calibration. Hahn and Goedderz (2020) have proposed to use the term introspective awareness to refer to a person’s ability to sense and report on their own cognitions toward different targets, while the term social calibration may be used to describe a person’s ability and willingness to apply labels to their own cognitions in accordance with culturally shared conventions. The present study is not in a position to make further claims about the different processes involved in introspective awareness and social calibration, but it suggests that it is worthwhile to distinguish between the two concepts. If researchers continue to primarily inspect between-subject correlations to assess awareness, they may conclude that their participants lack awareness when really they are just poorly calibrated (Goedderz & Hahn, 2024).
Subgroup Analyses
The overall pattern of results as it pertains to the different theoretical considerations replicated in all examined subgroups. Nonetheless, it is noteworthy that effect sizes differed substantially between some of them. Participants were overall less accurate in online studies than in the laboratory (average within-subjects relationships b = .28 vs. b = .48; corrected average correlations: r = .37 vs. r = .61). There may be several reasons for this. First, studies conducted online could show lower effect sizes than studies conducted in the laboratory due to less motivated and/or focused participants and an overall less-controlled environment. Second, in the specific case of predicting reactions on socially sensitive topics, a laboratory environment may make participants feel more encouraged to report on their implicit biases because they more strongly believe that the researcher will find them out either way. Another, compatible explanation is suggested by the present results. In addition to lower prediction accuracies, explicit thermometer ratings were also substantially less related to IAT scores in online settings, and when controlling for predictions they even showed negative relationships with IAT scores. In contrast, once thermometer ratings were controlled for, the online-lab difference in the relationship between predictions and IAT scores decreased substantially (to b = .39 vs. b = .46). This suggests that higher consistency between explicit thermometer ratings and predictions in online samples might exert a suppression effect on prediction accuracy. In other words, online participants’ explicit thermometer ratings diverged more strongly from their IAT scores than lab participants’ did. Because explicit thermometer ratings seem to partially inform participants’ predictions, online predictions were consequently also less related to their IAT score patterns. Once we controlled for these explicit evaluations, predictions explained more variance in IAT score patterns in online samples, closer to the prediction accuracy found in the other samples.
We found the same pattern of results for student samples as opposed to the general public. It is important to note that most online studies were also conducted on the general public, such that only one study in the present meta-analysis was conducted online on a student sample. Results for this study (Study 1) are descriptively more in line with the other online studies than with the other student samples. Note, for instance, that thermometer ratings are unrelated to IAT scores only in the online studies and that the relationship between explicit thermometer ratings and the IATs turns negative in the simultaneous model only in the online studies (see Table 2); the online student sample shows both of these patterns as well. These results lead us to believe that the setting may be more important than the sample. More studies in different settings using more diverse samples are needed to support this speculation.
Beyond this notable difference between online and lab samples, no other meaningful differences emerged between subsamples. Perhaps most strikingly, German participants did not differ from Canadian and US-American participants on any measure, with the exception of a higher correspondence between implicit and traditional explicit evaluations. In contrast to the English-speaking world, where “implicit bias” has been a matter of continuous public debate for at least two decades, this construct had not reached public discourse outside the academy in Germany to a similar degree when these studies were run (between 2015 and 2020), although general discussions about diversity beyond “implicit bias” have been equally prevalent for several decades. Additionally, the biggest immigrant groups in German society come from Central-Eastern and Eastern Europe, as well as the Middle East and Turkey, while the proportion of the population that identifies as Black, East-Asian, and Latino/-a has historically been much lower than in the US, Canada, and the UK.8 This lower familiarity with the groups in question for this paradigm may partly explain the higher implicit-explicit correlations. German participants seem to have more readily based their explicit evaluations on spontaneously activated knowledge while considering fewer pieces of additional propositional information, perhaps because less other information (from experience or public discourse) was available to them. Importantly, however, the lack of exposure to the construct of “implicit bias” or to the groups does not seem to have made it harder for them to accurately predict the patterns of their IAT scores. This further contradicts the notion that accurate prediction of IAT scores merely reflects cultural knowledge in American participants: if this were true, then participants with less exposure to cultural information about “implicit bias” and the specific groups in question (i.e., German participants) should be worse at predicting their IAT scores. These interpretations remain exploratory and speculative and need to be corroborated by more targeted research and analyses.
Lastly, unpublished studies showed somewhat lower effect sizes than published ones. A notable proportion of these unpublished studies were run online as pilot studies to see whether the paradigm could be moved online to save resources, trying out different recruitment platforms, programming languages, countries, and samples. Hence, the lower effect sizes in unpublished studies can in large part be explained by the lower effects in online samples discussed above. While we have so far concluded that the present paradigm cannot be run online without some notable sacrifices in data precision, we hope that our decision to publish these hitherto unpublished samples in the current format will help future researchers gauge what effect sizes to expect if they try to replicate or extend the present findings in different modalities.
In sum, the present subgroup analyses suggest that Hahn et al.’s (2014) patterns of results replicate across modalities, two languages, three countries, and general-public and student populations, with some notable differences in patterns when the paradigm is administered online. Future research with even more languages and cultures, conducted by independent researchers, is needed to corroborate these effects.
Theoretical Implications – Do Implicit Measures Reflect “Implicit Bias”?
The present meta-analysis shows that people are able to accurately predict the patterns of their IAT scores, and we have interpreted this as an indication that people are generally capable of gaining awareness of the cognitions reflected on implicit evaluations. One central debate in this light is whether these cognitions are conscious at all times or remain unconscious until we ask people to predict their IAT scores in a psychological study, as well as what these results mean in light of newer calls to differentiate between “implicit bias” and “bias on implicit measures” (Gawronski et al., 2022), or calls to abolish the “implicit” terminology altogether (Corneille & Hütter, 2020).
One central argument we have tried to make in discussions around implicit measures and awareness is that there appears to be a misconception of the term “consciousness” as a trait of a cognition (Hahn & Goedderz, 2020). In this trait conception, a cognition is either categorically inaccessible or categorically always conscious, such that it would have to be consciously rejected to not enter a judgment. In contrast to such ideas, research indicates that conscious awareness depends not just on accessibility but also on attention, such that even accessible cognitions will only enter awareness if attention is paid to them (Dehaene & Naccache, 2001; Hahn & Goedderz, 2020; Hofmann & Wilson, 2010). Under this conception, awareness is better seen as a state of a cognition, such that cognitions can enter awareness when attention is paid to them rather than residing there at all times (Hahn & Goedderz, 2020).
Applying this insight to the studies at hand, one can observe that, on the most minimal level, the IAT captures first reactions towards group-categorized stimuli, in this case pictures of members of social groups. Whether or not a person is aware of their biased reaction will hence fundamentally depend on whether they have paid attention to this or a similar reaction in the past. Confirming these ideas, we found that people were both less surprised at IAT feedback (Goedderz & Hahn, 2022) and acknowledged harboring more racial bias (Hahn & Gawronski, 2019) when they were asked to pay attention to their reactions to a set of pictures of people with different racial backgrounds than when they were not.
These studies question the idea that people are chronically aware of their biases but consciously reject them. If this were the case, then IAT score feedback should never be surprising, and an attention manipulation should not lead to a sudden rise in the acknowledgment of bias. Instead, these studies suggest that many participants in the studies reported here may have gained awareness through the prediction task (Goedderz & Hahn, 2022; Hahn & Gawronski, 2019; Hahn & Goedderz, 2020). In other words, many people may in fact be unaware of the biases that are found on implicit measures before they are asked to pay attention to their reactions to the stimuli, such that the IAT may in fact pick up “unconscious” reactions for some, but not all, people (Hahn & Goedderz, 2020). These insights lead to many new questions regarding why so many people who live in a diverse world never notice their own biased reactions before being asked to reflect on them in a psychological study. One possibility is that many people may be unwilling to pay attention to their biased reactions. Another is that the specific biases captured by the IAT in its contrastive form do not play out in the real world in the same way. Future research is needed to investigate these questions.
This also has implications for Gawronski et al.’s (2022) call to distinguish between “implicit bias” and “bias on implicit measures”. The authors call upon researchers to use the term “implicit bias” only to describe truly unconscious biases, and since results on implicit measures have been shown to be predictable, these measures seem to be capturing a different construct. It is true that our studies have shown repeatedly that implicit measures do not assess categorically unknowable cognitions. As discussed above, however, neither do they indicate that all people at all times know what an IAT towards specific pictures would show. In other words, once one re-conceptualizes awareness as a state, many people might show “implicit bias” (i.e., temporarily unconscious influences of race) on the IAT while others might not. The former group would consist of people who, for one reason or another, have not paid much attention to their biased reactions before or while completing an IAT, while the latter would consist of people who have. That is, our studies do not indicate that the IAT never picks up unconscious reactions for any person at any time. They simply show that it is possible to observe the reactions and become aware of them. In light of these arguments, we also find it somewhat unlikely that future research will develop instruments that capture categorically inaccessible biases: our studies and theorizing indicate that un-noticed biases in behavior (“unconscious” at a certain point in time) can probably enter awareness with the right attention manipulations at other times. Future research will have to tell whether distinguishing between “implicit bias” (as categorically unconscious bias) and “bias on implicit measures” will prove helpful in distinguishing between public debates around unnoticed race-categorization effects in behavior and social-cognitive research.
Lastly, all of these developments have led some authors to conclude that one should abolish the “implicit” terminology (Corneille & Hütter, 2020). We agree with many of the criticisms concerning the inconsistency with which the terms are used, save for the conclusion that one should abolish the terminology altogether. There is still little consensus about what exactly implicit measures capture, and hence we have chosen to stick with using “implicit” to denote measurement outcomes (except in the title, where we reference the original article). We hope that our research may contribute to resolving debates around implicit measures and implicit bias.
Limitations
There are several limitations to the present meta-analysis. First, it only includes studies conducted by or under close supervision of the original first author. While this ensured that the procedures were maximally comparable across studies, internal meta-analyses may provide biased estimations of effect sizes (Vosgerau et al., 2019). To minimize this risk, we preregistered all criteria for the inclusion and exclusion of studies and of participants within studies, as well as our analytical strategies (see https://osf.io/mejzp/). Though this may reduce the risk of biased estimates due to selective reporting, specifics of the procedure may also impact the size of the effect (Simons, 2014). To enable other researchers to closely replicate the effects reported in this meta-analysis, all materials are openly accessible on OSF.
Evidence by Morris and Kurdi (2023) shows that the overall prediction effect replicates – with effect sizes comparable to our online effects – when other researchers follow the prediction paradigm, even with deviations from the exact procedures used in this paper. We are hence optimistic that our overall conclusions will hold when studies are conducted by other researchers. Future studies are necessary to extend the conclusions of this research to other targets and instruments.
Second, we deliberately focused this meta-analysis on studies that precisely followed the specific prediction paradigm by Hahn et al. (2014) with exactly the same targets. This is because we have found that effect sizes do in fact change in theoretically meaningful ways when the paradigm is expanded to different measures and targets (e.g., Goedderz & Hahn, 2024; Morris & Kurdi, 2023; Rahmani Azad et al., 2023) or when aspects of the procedure are changed. For instance, Goedderz and Hahn (2024) found that participants showed similar awareness of their reactions towards baked goods but were much more calibrated (they more accurately predicted how much more or less they liked certain baked goods than other people in the sample). Morris and Kurdi (2023) extended the paradigm with minor variations in procedure and analysis to 50 different targets and found generally similar effect sizes, but also some variation comparable to our online samples. They found larger effect sizes when using the Affect Misattribution Procedure (AMP; Payne et al., 2005) in a second study, and traditional explicit measures continued to predict IAT scores even after controlling for predictions (but positively, not negatively as in our online samples). Procedural changes the authors made include using IATs with both pictorial and non-pictorial stimuli, and work in our lab suggests that whether IATs are predicted with or without pictures makes a difference (data under review, mentioned in Goedderz & Hahn, 2022; Hahn & Goedderz, 2020). Most of the patterns in our results, however, replicated across numerous studies. Hence, to be able to discuss the meaning behind variations in results found across different measurement instruments, targets, and procedures, we thought it important to publish authoritative effect sizes for the original paradigm in both online and lab settings without these additional variations. All this additional evidence and the only minor variations in results make us optimistic that the theoretical considerations of this paper will generalize to other attitudinal domains and other implicit measures, but more research is needed to understand the differences as well.
Conclusion
A long-standing debate in research on implicit evaluations is whether they capture unconscious mental content (Greenwald & Banaji, 1995). In contrast to this conceptualization, Hahn et al. (2014) found that participants were able to accurately predict the patterns of their IAT results, suggesting that participants were aware of the cognitions reflected in their IAT scores. The present meta-analysis reanalyzed 17 published and unpublished studies replicating Hahn et al.’s (2014) prediction paradigm. All patterns of results replicated across the examined studies and showed that participants were able to accurately predict the patterns of their IAT results even though they often reported different evaluations on traditional explicit measures. Results further indicated that participants had unique insight into their own automatic cognitions beyond knowledge about normatively shared patterns. While participants were quite accurate in reporting their patterns of IAT results toward different target groups, they were less accurate in labeling their cognitions in accordance with conventions in the sample. While effect sizes were smaller for online studies conducted on the general public than for lab studies run on university students, the overall pattern of results remained unchanged throughout the examined subgroups. Together, these findings provide further evidence that the cognitions reflected in implicit evaluations are consciously accessible. We hope this meta-analysis can guide researchers who want to study awareness of implicit evaluations in two important ways: first, by providing meta-analytical effect size estimates, and second, by contributing to a better theoretical understanding of how to study awareness in research on implicit evaluations.
Disclosure Statement
The research presented in this manuscript is the author’s own and has no connection to the German Institute for Development Evaluation.
Author Contributions
AG conceptualized and designed this meta-analysis in close collaboration with AH, prepared and documented all data sets, conducted all analyses, interpreted the results, created the OSF repository, and prepared all versions of this article as lead author.
ZRA contributed to data analysis, preparation of data and figures, and interpretation of the results.
AH contributed to the conceptualization and design of this meta-analysis, interpretation of the results, and all drafts and revisions of this manuscript as senior author. AH is the corresponding author.
Funding
The research reported in this paper was supported by a grant from the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) awarded to Author 3 (HA 8167/2-1, “Self-Insight into Attitudes: Distinguishing Introspective from Social Self-Awareness in Research on Implicit Evaluations”).
Competing Interests
All authors declare no conflicts of interest.
Supplemental Materials
Supplemental Materials are available on the OSF repository at https://osf.io/mejzp/
Data Accessibility Statement
Open Data Repository with all data, scripts, preregistrations and materials: https://osf.io/mejzp/
A preprint of this article has been published at https://doi.org/10.31234/osf.io/frwcy
Footnotes
With the exception of the title, we use the terms implicit and explicit to refer to measurement outcomes, called “measures” henceforth. Accordingly, we use the term “implicit evaluation” when we refer to an evaluation that is inferred from an indirect computerized measurement instrument, and the term “explicit evaluation” when we refer to an evaluation that is stated on a direct self-report measure (Hahn & Gawronski, 2018; De Houwer et al., 2009). This terminology differs from Hahn et al. (2014), who used the term “implicit” to describe the underlying attitude rather than the measurement outcome. Hence, we make an exception to our measurement-focused usage of the terms in the title because we wanted to reference the original article. The instructions to participants developed by Hahn also speak of “implicit attitudes” as constructs.
The distributions of the correlations per person were left-skewed, with many high correlations but few low or negative correlations. To account for the skew in the distribution, a corrected average was computed by Fisher-z-transforming all correlations, computing the average of this distribution, and back-transforming this average into a correlation (see also Hahn & Goedderz, 2020). The backtransformed correlation reported here was calculated for the present article and is not part of the original paper by Hahn and colleagues (2014), who reported a similarly sized median correlation of r = .65.
There have been replications and extensions of Hahn et al.’s (2014) paradigm (e.g., Goedderz & Hahn, 2024; Morris & Kurdi, 2023; Rahmani Azad et al., 2023). Because these extensions often find meaningful or potentially meaningful variations in their effect sizes, we decided to focus the current meta-analysis on exact replications only, so that these variations may be meaningfully quantified in the future. We will return to this choice and its reasons in the discussion section.
We included the four original studies reported in Hahn et al. (2014) in this meta-analysis. Because effects did not differ significantly between different manipulations, we collapsed the data across the four studies and treated them as one study (Study 17). All meta-analytical effects hold when the original studies are not included in the meta-analyses (see supplemental materials).
Study 1 in Hahn et al. (2014) did not ask participants about “regular people” and “adults” separately, such that “White people” were always used as the comparison group. This imprecision was fixed starting with Study 2.
The instructions hence always described “implicit attitudes” as a construct despite our critical view on this idea, to simplify communication and to stick to the materials developed by Hahn et al. (2014).
To our knowledge, there are no conventions on how to run a meta-analysis on effects derived from bootstrapping analyses. We thus decided to apply the same method to the average effects of the bootstrapping analyses as we used for the effects from the standard multi-level analyses.
For example, freely available data for 2021 from the German Federal Statistical Office (Statistisches Bundesamt) (2022, 2023, p. 133) suggest that sub-Saharan African, South American, and East Asian backgrounds each account for far less than 1% of the German population. The Federal Statistical Office of Germany does not collect data on racial identification.