Cognitive Control and the Implicit Association Test: A Replication of Siegel, Dougherty, and Huber (2012)

The implicit association test (IAT) is widely used to measure evaluative associations towards groups or the self but is influenced by other traits. Siegel, Dougherty, and Huber (2012, Journal of Experimental Social Psychology) found that manipulating cognitive control via false feedback (Study 3) changed the degree to which the IAT was related to cognitive control versus evaluative associations. We conducted two replications of this study and a mini meta-analysis. Null-hypothesis tests, meta-analysis, and a small telescope approach demonstrated weak to no support for the original hypotheses. We conclude that the original findings are unreliable and that both the original study and our replications do not provide evidence that manipulating cognitive control affects IAT scores.

The Implicit Association Test (IAT; Greenwald et al., 1998) is one of the most widely cited indirect measures in psychology (Azar, 2008). It compares the strength of associations between concepts and attributes and is often used to measure sensitive topics, such as evaluative associations about racial groups (e.g., the association between Black Americans and negativity or positivity).
Despite widespread use, researchers have questioned the validity of the IAT as a measure of personally held associations (Arkes & Tetlock, 2004; Oswald et al., 2013). Salience asymmetries among stimuli (Rothermund & Wentura, 2004), environmental associations (Karpinski & Hilton, 2001), base rates (Johnson & Chopik, 2019), task switching skill (Mierke & Klauer, 2003), faking responses (Röhner & Ewers, 2016), and cognitive control (Siegel et al., 2012) may all impact IAT scores, independent of personally held associations. If these results are robust, they raise questions about how IAT scores should be interpreted. The goal of this study was to replicate research on whether the last of these factors, cognitive control, impacts IAT scores about racial associations. Specifically, the studies reported here attempt to replicate one finding within this literature: that the role of cognitive control in the IAT can be manipulated.
The IAT was designed to indirectly measure associations by bypassing potential biases in self-report methods (Greenwald et al., 1998). This is accomplished by instructing participants to respond quickly, reducing their ability to control their responses. In the evaluative race IAT, for example, participants see faces of Black and White people and evaluative words (e.g., war, love). Faces and words are presented sequentially and participants must categorize them by pressing one of two keys. The task is difficult because different types of stimuli are categorized with the same key, causing response interference when categories are paired with concepts incompatible with a participant's associations (e.g., Black faces with positive words). Stronger prepotent responses lead to more interference, longer response times, and a larger IAT effect.
Cognitive control is typically thought to encompass a person's ability to override prepotent responses and switch between tasks (Botvinick et al., 2001; Hammond & Summers, 1972). Although the structure of the IAT encourages fast responses, it does not fully prevent individuals from exerting cognitive control over their behavior and thus is still susceptible to controlled influences (for a review, see Blair, 2002). For example, individuals with high levels of cognitive control can mask stereotypic racial associations in other tasks. Payne (2005) used a weapon identification task and found participants were faster to identify guns when primed with faces of Black versus White men. This relationship was attenuated for participants with high levels of cognitive control.
These priming data are consistent with studies showing correlations across unrelated IATs (McFarland & Crouch, 2002), indicating that a common source of variance impacts IAT scores. For example, responses are faster in the IAT when preceded by a trial of the same judgment (e.g., two evaluative words in succession), suggesting task-switching may account for shared method variance across tasks (Mierke & Klauer, 2003). Because task-switching is an important component of cognitive control (Altmann & Gray, 2008; Meiran, 2000), individuals with higher levels of cognitive control may be able to respond to a race IAT in ways that avoid revealing their evaluative associations. Siegel et al. (2012) argued explicitly that the relative contribution of association strength and cognitive control to IAT scores can be changed with task instructions. For example, students were told that the race IAT measures racial attitudes versus categorization ability (Study 2) or that they had high versus low levels of prejudice (Study 3). The authors posited that when race was salient, concerns about appearing biased should increase the degree to which IAT performance is driven by cognitive control. Specifically, they predicted that making race more salient should 1) increase the extremity of IAT scores, 2) increase the relationship between cognitive control and IAT scores, and 3) decrease the relationship between evaluative racial associations and IAT scores.
Correspondence: mdougher@umd.edu. Citation: Johnson, D. J., Ampofo, D., Erbas, S. A., Robey, A., Calvert, H., Garriques, V., Hatch, J., Gulbransen, L., Iqbal, R., Lewis, M., Stern, E., & Dougherty, M. (2021). Cognitive Control and the Implicit Association Test: A Replication of Siegel, Dougherty, and Huber (2012). Collabra: Psychology, 7(1). https://doi.org/10.1525/collabra.27356

Manipulating Cognitive Control
Study 3 provided the strongest evidence for these hypotheses. Students' (N = 98) cognitive control was measured using the Stroop task and their racial associations with the affect misattribution procedure (AMP; Payne et al., 2005). They were then randomly assigned to receive false feedback about their prejudice before completing the race IAT, with the goal of increasing the degree to which cognitive control impacts IAT scores. As predicted, students told they had high levels of prejudice showed more extreme IAT scores, stronger correlations between Stroop and IAT scores, and weaker correlations between AMP and IAT scores.
Below, we report the results of two replications of Siegel et al. (2012) Study 3. These replications were motivated by a re-analysis of the original study data using Bayesian methods, which can be found in the online preregistration. Following our confirmatory tests, we also report a meta-analysis and a small telescope analysis of the original and replication results.

Current Studies
We conducted two direct replications of Study 3 (Siegel et al., 2012) with the same instructions, stimuli, analyses, and student population. While the original report conceptually replicated the effect across studies, inconsistencies in methods make these studies difficult to compare (Pashler & Harris, 2012). In contrast, direct replications provide an opportunity to understand changes in effect sizes, which provide information about the reliability of an effect (Simons, 2014).
We collected larger sample sizes (N = 148, Study 1a; N = 218, Study 1b; versus N = 98 in the original report) to increase statistical power. We used data from Study 3 of Siegel et al. (2012) to conduct a sensitivity power analysis (Green & MacLeod, 2016) of our key hypotheses (Stroop/AMP interactions with condition). While the original study had .90 power to detect an effect size of β = .60, our analyses have .90 power to detect effects of size β = 0.50 (Study 1a) and β = 0.40 (Study 1b). Note that the effect size Study 1b is able to detect is a conservative estimate, given the increased reliability of the measures.

Method
All measures, manipulations, and exclusions are disclosed. Materials, data, and analyses can be found at https://osf.io/973ez/. The Study 1b preregistration can be found at https://osf.io/j9v7e.

Participants
Participants were undergraduates from the University of Maryland who received course credit for their participation. In Study 1a and 1b we sought to collect at least 50% more data than the original report (N = 98). As noted in our preregistration of Study 1b, we continued data collection until near the end of the semester to maximize statistical power. We did not examine the data until after data collection was complete. This resulted in a sample of 148 for Study 1a and 250 for Study 1b.
Following our pre-registration and consistent with the original report (Siegel et al., 2012), we excluded color-blind and Black participants. Demographic information was not recorded for Study 1a. Participants in Study 1b were 66% women. Most (61%) were White, 25% Asian, 8% Hispanic, and 6% from some other group. These exclusions removed 42 Black participants from Study 1b. However, all participants' data are analyzed in the Exploratory Analyses section.

Procedure
We replicated the procedures from Study 3 of Siegel et al. (2012) as closely as possible, with the exception that we increased the number of participants to increase statistical power (Study 1a and 1b) and the number of trials on all tasks (Study 1b) to improve measurement reliability. All tasks were administered on computers in the lab. Participants completed a measure of cognitive control (the Stroop task) followed by a measure of racial associations (the race AMP; Payne et al., 2005). They were then given false feedback about their AMP performance. In the high-prejudice condition, participants (n = 72 in Study 1a, n = 105 in Study 1b) were told their results "indicate that you have high levels of racial prejudice." In the low-prejudice condition, participants (n = 76 in Study 1a, n = 113 in Study 1b) were told their results "indicate that you have low levels of racial prejudice." High-prejudice feedback was intended to make concerns about appearing biased salient and increase the degree to which IAT performance was driven by cognitive control.
Following the feedback, participants were told they would complete a second measure of racial bias to validate their results. They then completed the race IAT. Participants in Study 1b also completed a measure of motivation to avoid prejudice (Plant & Devine, 1998). As per our preregistration, we did not make explicit hypotheses about this measure. As such, we present analyses of this measure in the Exploratory Analyses section.

Stroop task
Participants indicate the font color of a color word (red, blue, green) displayed in a congruent or incongruent colored font by button press (Stroop, 1935). Responding correctly in incongruent trials requires overcoming response interference with cognitive control (MacLeod, 1991). Participants completed 100 trials (Study 1a) or 200 trials (Study 1b) with 80% of trials congruent. Split-half reliabilities for all measures were obtained via bootstrapping with 10,000 random splits (Parsons et al., 2019). We report the bootstrapped estimate and 95% CI in brackets. The reliability of the Stroop task was .69 [.61, .76] in Study 1a and .76 [.72, .81] in Study 1b. As per our pre-registration and the original report, we addressed skew in response times by taking the inverse response time for each trial. Stroop scores were calculated by subtracting the mean inverse response time of the incongruent condition from the mean inverse response time of the congruent condition. Higher scores in this inverse metric represent more cognitive control.
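As an illustration of the scoring rule just described, the following Python sketch computes a Stroop score from trial-level data. The data layout (a list of `(condition, rt)` pairs), the function name `stroop_score`, the example values, and the use of seconds as the response-time unit are assumptions made for this example and are not taken from the study materials.

```python
from statistics import mean


def stroop_score(trials):
    """Return mean inverse RT (congruent) minus mean inverse RT (incongruent).

    `trials` is a list of (condition, rt) pairs, where condition is
    "congruent" or "incongruent" and rt is a positive response time.
    The layout and units are hypothetical.
    """
    inverse_rts = {"congruent": [], "incongruent": []}
    for condition, rt in trials:
        # Inverse transform, used to reduce the skew of response times.
        inverse_rts[condition].append(1.0 / rt)
    return mean(inverse_rts["congruent"]) - mean(inverse_rts["incongruent"])


# Made-up trials for one participant (response times in seconds).
example_trials = [
    ("congruent", 0.50),
    ("congruent", 0.50),
    ("incongruent", 0.80),
    ("incongruent", 0.80),
]
score = stroop_score(example_trials)  # 1/0.50 - 1/0.80 = 0.75
```

The subtraction follows the order stated in the text: the incongruent mean is subtracted from the congruent mean.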

Race AMP
Participants indicate whether a Chinese ideograph is pleasant or unpleasant by button press. Before each trial, participants are primed with either Black faces, White faces, or a gray box (control) for 200 ms. Although they are told to ignore these faces when judging the ideographs, racial associations affect these judgments, providing an indirect measure of associations (Payne, 2005). In addition, because the AMP does not involve task-switching or response competition, it is arguably less likely to be affected by cognitive control (Siegel et al., 2012).

Race IAT
Participants categorize pictures of Black and White faces as well as positive and negative words by pressing one of two keys (Greenwald et al., 1998). Pictures and words are mapped onto the same key, creating stereotype-compatible trial blocks (e.g., Black/Bad) and stereotype-incompatible blocks (e.g., Black/Good). Across the typical seven blocks of the IAT, participants completed 120 trials (Study 1a) or 240 trials (Study 1b) in the compatible and incompatible blocks. The reliability of the IAT was .78 [.72, .84] in Study 1a and .87 [.82, .91] in Study 1b. IAT scores were computed using the algorithm recommended by Greenwald et al. (2009). Higher scores indicate stronger associations between positive words and White faces than Black faces.
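The bootstrapped split-half reliabilities reported for the Stroop task and the IAT can be sketched roughly as follows. This is a simplified illustration, not the procedure of Parsons et al. (2019): the function names, the per-participant list layout, the Spearman-Brown correction, and the use of the mean of the split estimates with a percentile interval are all assumptions.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_reliability(data, score=mean, n_splits=10_000, seed=0):
    """Return (estimate, lower, upper) across random split halves.

    `data` holds one list of trial-level values per participant
    (a hypothetical layout); `score` turns a list of trials into a score.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_splits):
        first, second = [], []
        for trials in data:
            shuffled = list(trials)
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            first.append(score(shuffled[:half]))
            second.append(score(shuffled[half:]))
        r = pearson_r(first, second)
        # Spearman-Brown correction to full test length (assumed here).
        estimates.append(2 * r / (1 + r))
    estimates.sort()
    lower = estimates[int(0.025 * (n_splits - 1))]
    upper = estimates[int(0.975 * (n_splits - 1))]
    return mean(estimates), lower, upper
```

For a difference score such as the Stroop effect, the split would need to be made within each trial type so that both halves contain congruent and incongruent trials; that detail is omitted from this sketch.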

Directionality of Measures
Positive scores on the AMP and IAT indicate stronger associations between White versus Black faces and positivity, whereas negative scores indicate the opposite. However, the predictions in the original report (Siegel et al., 2012) are about how cognitive control relates to reducing interference that arises from racial associations, regardless of which group is preferred. These predictions are about the extremity of associations, rather than the direction of those associations. Following our pre-registration, we took the absolute value of AMP and IAT scores such that a score of zero indicates equally positive associations between Black and White faces. Positive scores indicate stronger preferences for one group over the other, regardless of which group is preferred. However, we also conducted an analysis examining the raw IAT and AMP scores in the Exploratory Analyses section.

Confirmatory Analyses
Descriptive statistics for all variables are provided in Table 1. Correlations for all variables are provided in Table 2.

Study 1a
We used multiple regression in both studies to test whether making race salient increases the extremity of IAT scores and changes the strength of the relationship between the Stroop or AMP scores and IAT scores. Following Siegel et al. (2012) we specified a model regressing IAT scores on feedback, Stroop scores, AMP scores, and the interaction between feedback and the latter two predictors. Continuous variables were grand mean centered and standardized such that a score of 1.0 indicated a value one standard deviation above the mean. Feedback was effect coded with -.5 indicating low-prejudice and .5 indicating high-prejudice.
In contrast with the original report, the high-prejudice feedback manipulation did not result in more extreme IAT scores, β = -0.06, t(137) = -0.33, p = 0.741. The relationship between Stroop and IAT scores also did not increase in the high-prejudice condition, β = 0.10, t(137) = 0.61, p = 0.542. We did find some support for the hypothesis that the relationship between AMP and IAT scores would decrease in the high-prejudice condition, β = -0.43, t(137) = -2.38, p = 0.019. However, simple slope analyses demonstrated that the relationship between the IAT and the AMP was significant in neither the low-prejudice feedback group, β = 0.20, t(137) = 1.92, p = 0.057, nor the high-prejudice feedback group, β = -0.23, t(137) = -1.56, p = 0.120. Figure 1 (top row) shows the Stroop-IAT and AMP-IAT correlations.
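The regression model described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' analysis script: the function names are hypothetical, the fit uses ordinary least squares through the normal equations, and the outcome is assumed to be standardized in the same way as the continuous predictors.

```python
from statistics import mean, stdev


def standardize(values):
    """Grand-mean center and scale to unit standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def design_row(feedback, stroop_z, amp_z):
    """One design-matrix row; feedback is -0.5 (low) or +0.5 (high prejudice)."""
    return [1.0, feedback, stroop_z, amp_z,
            feedback * stroop_z, feedback * amp_z]


def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(rows, outcome):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, outcome)) for i in range(k)]
    return solve(xtx, xty)


def simple_slopes(main_effect, interaction):
    """Slope of a predictor within each feedback condition under +/-0.5 coding."""
    return {"low": main_effect - 0.5 * interaction,
            "high": main_effect + 0.5 * interaction}
```

With this coding, the two simple slopes differ by exactly the interaction coefficient. That matches the Study 1a AMP results reported in this section: 0.20 in the low-prejudice group and -0.23 in the high-prejudice group, a difference of -0.43.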

Exploratory Analyses
Our confirmatory analyses followed the procedures used in Siegel et al. (2012) to take the absolute value of the AMP and IAT scores. This was done to estimate a measure of the extremity of evaluative racial associations, rather than the direction of those associations. However, this method has some limitations. For example, participants may be more motivated to control pro-White associations than pro-Black associations, which would suggest that changes in IAT score would only occur for certain participants. In addition, this procedure artificially reduces the variability in these measures, which will impact statistical inferences. Another limitation with the analyses in Siegel et al. (2012) is the exclusion of Black participants from analyses. This procedure is common in stereotyping and prejudice research, under the assumption that Black participants' responses are qualitatively different from White participants' responses. However, evidence for this is not very strong. For example, Black and White participants do not significantly vary in racial bias within a simulated shooting task (Correll et al., 2002). In addition, this practice reduces generalizability, providing inferences that can only be safely made to WEIRD populations (Henrich et al., 2010).
To avoid these issues, we reran the analyses in Study 1a and 1b on the untransformed variables. In Study 1b, where Black participants were removed (rather than excluded, Study 1a), we include their data. Finally, we also conducted exploratory analyses on the relationship between motivation to control prejudice, the false feedback manipulation, and IAT scores.

Analyses on Untransformed Variables
We first reran the main analyses in Study 1a without transforming variables or removing Black participants. This analysis did not show that the high-prejudice feedback manipulation increased anti-Black associations, β = 0.26, t(137) = 1.53, p = 0.127. The relationship between Stroop and IAT scores also did not increase in the high-prejudice condition, β = -0.15, t(137) = -0.88, p = 0.383, nor did the relationship between AMP and IAT scores decrease, β = -0.12, t(137) = -0.72, p = 0.475.
Participants also completed measures of internal motivation (reliability = .81 [.77, .84]) and external motivation (reliability = .83 [.80, .86]) to avoid prejudice (Plant & Devine, 1998). Past work has typically shown that people who are internally motivated show weaker anti-Black associations on the IAT, whereas those who are more externally motivated display stronger anti-Black associations (Hausmann & Ryan, 2004; Ito et al., 2015). We ran a regression model to test if we could replicate this finding using motivation scores, false feedback, and their interactions as predictors.

Evaluation of Replication Results
Figure 1. Scatter plot of the correlations between the race IAT and the race AMP or Stroop task
The top row shows data from Study 1a. The bottom row shows data from Study 1b. Separate correlations were calculated for the high (blue) and low (red) prejudice feedback conditions. Stroop effect was multiplied by 1000 to make the axis easier to read.

While our confirmatory analyses used null-hypothesis significance testing to evaluate our replication attempts, there are multiple ways to determine how the results of replication studies should be compared to original studies (Asendorpf et al., 2013; Open Science Collaboration, 2012). We provide two additional ways to examine our results: mini meta-analysis (Goh et al., 2016) and Simonsohn's (2015) small telescope approach. The latter is a framework for interpreting replication studies based on effect size estimates and sample sizes from the original study. Replications that obtain effect sizes significantly smaller than the effect the original study had 33% power to detect are classified as "informative failures" and indicate that the effect was too small for the original study to have reliably detected.
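The logic of the small telescope approach can be sketched as follows. This is an illustration under simplifying assumptions: a normal (z) approximation instead of the t statistics reported in this article, a two-sided alpha of .05 for the original study, a one-sided test for the replication, and effects coded so that the predicted direction is positive. All function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def detectable_effect(standard_error, power, alpha=0.05):
    """Effect detected with the given power in a two-sided z test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return standard_error * (z_alpha + z_power)


def rescale_detectable_effect(effect, from_power, to_power, alpha=0.05):
    """Convert an effect detectable at one power level to another level."""
    ratio = (detectable_effect(1.0, to_power, alpha)
             / detectable_effect(1.0, from_power, alpha))
    return effect * ratio


def informative_failure(replication_effect, replication_se, benchmark,
                        alpha=0.05):
    """Test whether a replication effect is significantly below the benchmark.

    Returns (z, one-sided p, flag). Assumes the predicted direction is
    positive; flip the signs of the inputs for a negative prediction.
    """
    z = (replication_effect - benchmark) / replication_se
    p = STD_NORMAL.cdf(z)
    return z, p, p < alpha


# Rescale the effect the original study could detect with .90 power (.60)
# to the effect it could detect with 33% power.
benchmark = rescale_detectable_effect(0.60, from_power=0.90, to_power=0.33)
```

Under these assumptions the rescaled benchmark comes out at roughly 0.28, in line with the value of 0.28 reported for the original study in this section.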
We chose to focus on the two key interactions from Study 3 of Siegel et al. (2012): the interactions between feedback condition and Stroop or AMP scores. Figure 2 presents forest plots of fixed- and random-effect meta-analyses of the condition by Stroop interaction. Figure 3 presents the same analyses for the condition by AMP interaction. Both plots show the estimate for the effect as reported in Siegel et al. (2012) Study 3. Focusing on the condition by Stroop interaction (i.e., that the relationship between Stroop and IAT scores would increase when participants were given false feedback about prejudice), neither the fixed-effect (β = 0.15, z = 1.58, p = .113) nor the random-effect meta-analysis (β = 0.16, z = 1.28, p = .201) shows significant support for this hypothesis. However, neither replication study is an informative failure. The effect size for Study 1a did not significantly differ from the small telescope benchmark for the original study (β = 0.28), t(137) = -1.03, p = .307, nor did the effect size from Study 1b, t(137) = -1.89, p = .060.
In terms of the condition by AMP interaction (i.e., that the relationship between AMP and IAT scores would decrease when participants were given false feedback about prejudice), the fixed-effect analysis provided significant support (β = -0.21, z = -2.35, p = .018), although the random-effect analysis did not (β = -0.26, z = -1.64, p = .101). Using the small telescope approach, Study 1a would be seen as a significant replication, consistent with the null-hypothesis significance testing result. However, Study 1b is an informative failure; the effect size for Study 1b was significantly closer to zero than the small telescope benchmark for the original study (β = -0.28), t(211) = 2.22, p = .027.
Note 1. Additional analyses (available online) demonstrated that Stroop task score did not predict IAT score, nor did it interact with the motivation variables.
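The fixed- and random-effect pooling used in this section can be sketched as follows. This is a minimal illustration with inverse-variance weights; the DerSimonian-Laird estimator for the between-study variance is an assumption, since the estimator used for the reported analyses is not stated here. The function names are hypothetical and the input values in the usage note are made up.

```python
from statistics import NormalDist


def fixed_effect(effects, ses):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return pooled, (1 / sum(weights)) ** 0.5


def random_effect(effects, ses):
    """Random-effect pooling with a DerSimonian-Laird tau^2 (assumed)."""
    weights = [1 / se ** 2 for se in ses]
    pooled, _ = fixed_effect(effects, ses)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    re_weights = [1 / (se ** 2 + tau2) for se in ses]
    re_pooled = (sum(w * e for w, e in zip(re_weights, effects))
                 / sum(re_weights))
    return re_pooled, (1 / sum(re_weights)) ** 0.5, tau2


def z_test(estimate, se):
    """Two-sided z test of a pooled estimate against zero."""
    z = estimate / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))
```

For example, with two made-up effects of 0.2 and 0.4 that each have a standard error of 0.1, both models pool to 0.3, but the random-effect standard error is larger (0.1 versus about 0.07) because the estimated between-study variance is added to each study's sampling variance. This is why a pooled effect can be significant under the fixed-effect model and not under the random-effect model.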

Discussion
We tested three hypotheses from Siegel et al. (2012) relating to the impact of cognitive control on IAT scores. With the exception of some support in Study 1a for the hypothesis that the relationship between AMP and IAT scores should decrease when race is made salient, we found little support for the original findings despite our close replication design, larger sample sizes, and increased reliability. These results decrease our confidence in our original results that cognitive control impacts IAT performance or that its influence on the IAT can be manipulated by false feedback.

Facets of Cognitive Control
The Stroop task is thought to measure response inhibition, one of three facets of cognitive control: response inhibition, task switching, and working memory updating (Ito et al., 2015). It is possible that only certain aspects of cognitive control relate to IAT performance. For example, in an experiment testing the three types of cognitive control, task-switching was most clearly related to the size of IAT scores (Klauer et al., 2010). Similarly, Ito et al. (2015) used a latent variable approach to test whether general, working memory updating, or task switching control predicted IAT scores. They found those with higher levels of task switching control displayed weaker anti-Black associations. General cognitive control, which the Stroop task mapped onto, was not related to IAT scores.
What do these findings mean for the validity of the IAT as a measure of personally held associations? Our work suggests that manipulations of cognitive control, when operationalized similarly to Siegel et al. (2012), may not reliably impact IAT scores. However, given other research demonstrating the sensitivity of IAT scores to different external factors (e.g., Johnson & Chopik, 2019; Karpinski & Hilton, 2001; Rothermund & Wentura, 2004), the conclusion that the IAT mostly reflects one's personal evaluative associations regarding race is also unwarranted.

Limitations
One limitation of the current study design concerns how measures were coded in this study. The IAT and AMP are relative measures, where positive scores indicate stronger associations between White faces and positivity. In contrast, negative scores indicate stronger associations between Black faces and positivity. Measuring extremity by taking the absolute value of these scores can mask trends in the data. For example, a participant who shows a small positive bias towards Whites on the AMP and a participant who shows a small negative bias towards Whites on the AMP would appear identical in the current analyses. This may explain why the positive correlation between the IAT and AMP disappears when using the transformed measures, as shown in the correlation table and in the Exploratory Analyses section.
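The masking described above can be illustrated with a small made-up example. The scores below are hypothetical and chosen only to show that a strong positive correlation between raw scores can vanish once absolute values are taken; they are not data from either study.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical raw scores for four participants (positive = pro-White).
amp_raw = [-2.0, -1.0, 1.0, 2.0]
iat_raw = [-1.0, -1.0, 2.0, 2.0]

r_raw = pearson_r(amp_raw, iat_raw)  # about 0.95

# Extremity scores discard the direction of the preference.
amp_abs = [abs(v) for v in amp_raw]
iat_abs = [abs(v) for v in iat_raw]
r_abs = pearson_r(amp_abs, iat_abs)  # 0.0
```

In the raw scores, participants who favor one group on the AMP favor the same group on the IAT. After the transformation, only the size of the preference remains, and in this example the sizes are unrelated across the two measures.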
Another limitation is that making racial bias salient may be insufficient to manipulate cognitive control in the IAT. One way to address this issue would be to manipulate cognitive control more directly and verify its effectiveness through a manipulation check. For example, an experimenter might instruct participants to focus on inhibiting their default responses, and test the effectiveness of this manipulation by comparing performance on a response inhibition or task-switching measure.

Conclusion
We were unable to replicate results demonstrating that the effect of cognitive control on IAT scores can be manipulated (Siegel et al., 2012). Reanalyses of the original and replication data using meta-analysis and a small telescope approach demonstrated, at best, weak support for the hypotheses laid out in the original report. However, we note this conclusion is specific to how cognitive control was measured (Stroop task) and manipulated (false feedback). More broadly, this study demonstrates the importance of direct replication for establishing the robustness of results. In order for psychology to become more self-correcting (Jussim et al., 2016), we encourage researchers to test and publish replications of their past work.