Processing the Word Red and Intellectual Performance: Four Replication Attempts

Colors convey meaning and can impair intellectual performance in achievement situations. Even the processing of color words can exert similar detrimental effects. In four experiments, we tried to replicate previous findings regarding the processing of the word “red” (as compared to a control color) on cognitive test scores. Experiments 1 and 2 (Ns = 69 and 104) are direct replications of Lichtenfeld, Maier, Elliot, and Pekrun (2009). Both experiments failed to uncover a red color effect on verbal reasoning scores among high school students and undergraduates (Cohen’s d = 0.04 and –0.23). Experiments 3 and 4 (Ns = 103 and 1,149) failed to identify an effect of processing red on general knowledge test scores (Cohen’s d = 0.19 and 0.01) among undergraduates and adults. Together, these results do not corroborate the assumption that processing the word red impairs intellectual performance.

Biologically inherited and socially learned color associations can affect psychological functioning and observed behaviors (cf. Elliot & Maier, 2014). According to color-in-context theory (Elliot & Maier, 2012), colors may have different, potentially even opposite effects, depending on the prevalent situational conditions. For example, in an achievement context, red color tends to convey a negative meaning due to its implicit association with failure and danger and, thus, induces avoidance motivation. In contrast, in mating contexts, the same color has a positive meaning, evoking approach motivation because red is typically associated with romance and sexual desire. Several empirical findings have supported key propositions in this respect: viewing red impaired cognitive performance on standardized achievement tests (Elliot, Maier, Moller, Friedman, & Meinhardt, 2007) and reduced risky choices to avoid financial losses (Gnambs, Appel, & Oeberst, 2015), whereas it increased interpersonal attraction (Lehmann, Elliot, & Calin-Jageman, 2018) and signaled social status (Wu, Lu, Dijk, Li, & Schnall, 2018). After several studies demonstrated the effect of viewing the color red in achievement contexts (for reviews see Elliot, 2015, and Maier, 2014), it has been suggested that even simply processing the word red is sufficient to yield effects that are comparable to actually seeing a red stimulus (Lichtenfeld, Maier, Elliot, & Pekrun, 2009). Referring to the well-known Stroop effect indicating a close connection between color words and actual color representations (e.g., De Houwer, 2003; Richter & Zwaan, 2009) and to supportive neuropsychological evidence (e.g., Teichmann, Grootswagers, Carlson, & Rich, 2019), Lichtenfeld and colleagues (2009) demonstrated in four experiments that presenting the word red before a reasoning test resulted in significantly lower test scores as compared to reading the word gray or green.
The observed effects of Cohen's d = 0.57, 0.73, 0.64, and 0.99 suggested a substantial impact of processing color words, despite the subtle color manipulations. For example, in two experiments the authors manipulated a small copyright notice including seven words (10 point font size) at the bottom of the cover pages of the test booklet. In another experiment, an example item containing the word red or gray was placed before a reasoning test. In all experiments, reading the word red consistently led to poorer test performance as compared to reading another color word. These findings could have important implications for psychological and educational assessments. If reading the word red in an exam item influences subsequent test performance, this might bias estimates of students' proficiency and even threaten test fairness if students are differentially affected by color cues. Thus, it is important to scrutinize whether these effects can be robustly substantiated.
Despite the substantial effects previously triggered by the word red, the effect sizes reported in Lichtenfeld et al. (2009) were highly uncertain. The respective confidence intervals included large effects up to Cohen's d = 1.60 as well as vanishingly small effects close to zero (see Table 1). Thus, the available findings encompass effects of clear practical importance as well as effects that are unlikely to impact applied assessments. Similar recent replications of viewing the color red have failed to find consistent effects on, for example, perceived attractiveness (Lehmann & Calin-Jageman, 2017; Peperkoorn, Roberts, & Pollet, 2016) or cognitive test performance (Gnambs, 2019). These results suggest that red color effects might be overestimated in published studies and that actual effects are more modest. Therefore, the present study sought to replicate and extend the study by Lichtenfeld and colleagues (2009). Importantly, we sought to narrow the range of compatible effect sizes to derive a more precise estimate of red color effects. Thus, we present two direct replications and two conceptual replications (cf. Schmidt, 2009) of Experiment 2 in Lichtenfeld et al. (2009) testing the hypothesis that respondents reading the word red before working on an intelligence test would solve fewer items correctly as compared to respondents reading a control color word. The data, including the statistical syntax to reproduce our findings, can be found at https://osf.io/5ckvd; methodological details on all four experiments are available in the electronic supplemental material.

Experiment 1

Power Analysis
The sample size was determined based upon a priori power analyses to identify an effect of Cohen's d = 0.70 (as in Experiment 2 in Lichtenfeld et al., 2009) for a one-tailed t-test at a significance level of 5% and a power of 80%. This resulted in a minimum sample size of 52.
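The sample size reported above can be approximated with a short calculation. The following is a minimal sketch, not the authors' actual procedure (which is documented in their supplemental material): it assumes two equally sized groups, uses the normal approximation for a one-tailed two-sample t-test, and adds a small-sample correction of z_alpha² / 4 per group (commonly attributed to Guenther). The function names `n_per_group` and `total_n` are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a one-tailed, two-sample t-test.

    Assumption: equal group sizes and a normal approximation with a
    small-sample correction term; dedicated power software may differ
    slightly for other inputs.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    z_beta = NormalDist().inv_cdf(power)       # quantile for the target power
    n_normal = 2 * (z_alpha + z_beta) ** 2 / d ** 2
    # Correction for using the t rather than the normal distribution
    return ceil(n_normal + z_alpha ** 2 / 4)


def total_n(d, alpha=0.05, power=0.80):
    """Total sample size across both conditions."""
    return 2 * n_per_group(d, alpha, power)


if __name__ == "__main__":
    print(total_n(0.70))  # effect size targeted in Experiment 1
    print(total_n(0.50))  # effect size targeted in Experiments 2 and 3
```

Under these assumptions the sketch returns a total of 52 participants for d = 0.70 and 102 for d = 0.50, in line with the minimum sample sizes stated for the first three experiments.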

Method
Sixty-nine students (41 female and 28 male) from two upper secondary schools ("Gymnasium") in Austria with a median age of 17 years (Min = 16, Max = 19) were randomly assigned to a red color (n = 30) and a gray color (n = 39) condition. The participants were informed that they were about to work on a short intelligence test. Then the verbal analogy subtest of the Intelligence Structure Test 2000 R (Liepmann, Beauducel, Brocke, & Amthauer, 2007), which was also used by Lichtenfeld and colleagues (2009), was administered in both groups. The test included 20 multiple-choice items showing a word pair and the first word of a second pair (e.g., "fast : slow = young : ?"). For each item, five response options were presented, one of which correctly completed the analogy (e.g., "quick, long, tall, tardy, old"). The number of correct answers was the dependent variable. Missing responses were scored as incorrect. Before the actual test, two example items were presented to explain the logic of the test. Following Experiment 2 in Lichtenfeld and colleagues (2009), the second example item included the experimental manipulation: "animal : hound = plant : ?" with five response options "branch, red/gray-alder, root, tree, organism". The manipulation was implemented by presenting the correct solution either as "red-alder" (red color condition) or as "gray-alder" (gray color condition; both of these trees are quite common in Austria). Moreover, a one-sentence description below the item explained that "red/gray-alder" was the correct solution. The study was not preregistered.

Statistical Analyses
We expected that reading the word red in the example item presented before the analogy test would result in lower test scores as compared to reading the word gray. This hypothesis was tested with a one-sided t-test for independent groups using an alpha level of 5%. The color effect was quantified using Cohen's d coded in such a way that positive effects fell in line with our hypothesis and indicated lower scores in the red condition. Because the original study considered the respondents' sex as a potential moderator, we replicated these analyses and also report the results of a 2 (color: red versus gray) × 2 (sex: girls versus boys) analysis of variance (ANOVA). To test the directed hypothesis for the main effect of color, we made use of the equivalence between the F distribution with one degree of freedom and the t² distribution to derive the one-sided p-value. The effect size for these analyses was partial eta squared (ηp²). Finally, to determine the replication success we conducted an equivalence test (Lakens, 2017) and examined whether the observed effect could be distinguished from the effect of d = 0.73 reported in Lichtenfeld et al. (2009). The smallest effect size of interest for this analysis was set to the effect size that the original study had a power of 33% to detect (cf. Simonsohn, 2015), that is, at a d of 0.47. A significant test result would indicate an observed effect that was equivalent to the original effect.

Note: The effect sizes were coded in such a way that positive values indicate lower scores in the red color condition as compared to the control condition. a The effect was reported as "F(1, 47) = 3.91, p ≤ .05" in Lichtenfeld et al. (2009, p. 1274), which is not strictly significant (p = .054) at the conventional alpha level of 5%.
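The conversion from an F test with one numerator degree of freedom to a one-sided p-value, as described in the statistical analyses, can be written out as follows. This is a sketch of the standard relationship and not a reproduction of the authors' syntax; the symbols F_obs (observed F value), ν (error degrees of freedom), and p_F (the usual ANOVA p-value) are introduced here for illustration.

```latex
% An F statistic with 1 numerator df is the square of a t statistic
F_{1,\nu} = t_{\nu}^{2}

% The ANOVA p-value therefore equals the two-sided t-test p-value
p_{F} = P\left(F_{1,\nu} \ge F_{\mathrm{obs}}\right)
      = P\left(|t_{\nu}| \ge \sqrt{F_{\mathrm{obs}}}\right)

% One-sided p-value for the directed hypothesis
p_{\text{one-tailed}} =
\begin{cases}
  p_{F} / 2     & \text{if the effect is in the hypothesized direction} \\
  1 - p_{F} / 2 & \text{otherwise}
\end{cases}
```

Halving is appropriate only when the observed mean difference points in the predicted direction (lower scores in the red condition); an effect in the opposite direction yields a one-sided p-value above .50.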

Results
Participants in the red color condition solved fewer analogy items correctly (M = 11.20, SD = 2.68) as compared to participants in the gray color condition (M = 11.30, SD = 2.47). However, an independent samples t-test (one-tailed) showed no significant (p < .05) difference between the two color conditions, t(67) = 0.14, p = .443, and an effect size close to zero, d = 0.04, 95% CI [-0.44, 0.51]. Moreover, an equivalence test, t(67) = -1.90, p one-tailed = .969, indicated that the observed effect was not equivalent to the original effect and, thus, showed no replication success. Following Lichtenfeld and colleagues (2009), the analyses were repeated considering the respondents' sex as a potential moderator. The ANOVA showed no significant main effects for the color condition, F(1, 65) = 0.21, p one-tailed = .326, significant. In summary, these analyses failed to support the hypothesis that reading the word red would impair analogy test performance.
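The effect size and confidence interval reported above can be approximately recovered from the summary statistics alone. The sketch below is an illustration under stated assumptions, not the authors' analysis script: it uses the pooled-standard-deviation form of Cohen's d, a common large-sample approximation for the standard error of d, and a normal-theory interval. The function names are hypothetical, and results computed from rounded means and standard deviations can deviate slightly from values computed on the raw data.

```python
from math import sqrt
from statistics import NormalDist


def cohens_d(m_control, sd_control, n_control, m_red, sd_red, n_red):
    """Cohen's d with pooled SD; positive values mean lower scores in the red condition."""
    pooled_var = (
        (n_control - 1) * sd_control ** 2 + (n_red - 1) * sd_red ** 2
    ) / (n_control + n_red - 2)
    return (m_control - m_red) / sqrt(pooled_var)


def d_confidence_interval(d, n1, n2, level=0.95):
    """Approximate normal-theory confidence interval for d (an assumption of this sketch)."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


def t_from_d(d, n1, n2):
    """Independent-samples t statistic implied by d and the group sizes."""
    return d / sqrt(1 / n1 + 1 / n2)


if __name__ == "__main__":
    # Experiment 1 summary statistics: gray (control) versus red condition
    d = cohens_d(11.30, 2.47, 39, 11.20, 2.68, 30)
    low, high = d_confidence_interval(d, 39, 30)
    print(round(d, 2), round(low, 2), round(high, 2), round(t_from_d(d, 39, 30), 2))
```

With the Experiment 1 values, the sketch gives d of about 0.04 and interval bounds close to the reported [-0.44, 0.51]. The implied t statistic (about 0.16) differs slightly from the reported 0.14, plausibly because the published means are rounded to two decimals.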

Experiment 2

Power Analysis
Following the social science replication project (Camerer et al., 2018), we aimed at identifying a smaller effect that was only 75% of the original effect size. Thus, the sample size was determined based upon a priori power analyses to identify an effect of Cohen's d = 0.50 for a one-tailed t-test at a significance level of 5% and a power of 80%. This resulted in a minimum sample size of 102.

Method
One hundred and four Austrian university students (35 female, 68 male, 1 without information on sex) with a median age of 22 years (Min = 18, Max = 57) were randomly assigned to a red color (n = 53) and a gray color (n = 51) condition. The procedure was identical to the first experiment, with participants completing the questionnaire and intelligence test voluntarily during the first 15 minutes of a class period. The study was not preregistered.

Results
In contrast to our hypothesis, participants in the red color condition solved more analogy items correctly (M = 12.10, SD = 2.82) as compared to participants in the gray color condition (M = 11.50, SD = 3.09). An independent samples t-test (one-tailed) showed no significant (p < .05) difference between the two color conditions, t(102) = -1.14, p = .872, and an effect size in the wrong direction, d = -0.23, 95% CI [-0.61, 0.16]. Again, an equivalence test, t(102) = -2.68, p one-tailed = .996, showed no replication success. Follow-up analyses for moderating effects of sex revealed no significant main effects for the color condition,

Experiment 3

Power Analysis
As in Experiment 2, the sample size was determined based upon a priori power analyses to identify an effect of Cohen's d = 0.50 for a one-tailed t-test at a significance level of 5% and a power of 80%. This resulted in a minimum sample size of 102.

Method
One hundred and seven students (81 female and 26 male) from a German university with a median age of 21 years (Min = 19, Max = 44) were randomly assigned to a red color (n = 56) and a green color (n = 51) condition. In line with previous research (Gnambs, Appel, & Batinic, 2010; Gnambs, Appel, & Kaspar, 2015), it was expected that reading the word red would impair performance on an indicator of crystallized intelligence. Therefore, participants were administered a short version of the General Knowledge Test - German (GKT-D; Lynn, Wilberg, & Margraf-Stiksrud, 2004). The test included 37 items from different domains that were answered in open response fields. The experimental manipulation required respondents to write down the word red (or a control color word) to allow for a deeper processing of the color word. Thus, the wording of item 19 of the GKT-D was changed to ask either about the color of a ripe tomato (correct answer: red) or about the color of a ripe cucumber (correct answer: green). The number of correct responses after the experimental manipulation was the dependent variable. Missing responses were scored as incorrect. The study was not preregistered.

Experiment 4

Power Analysis
Although Lichtenfeld and colleagues (2009) reported effect sizes between Cohen's d = 0.57 and 0.99 for their color manipulations, other research on behavioral priming has typically identified substantially smaller effects. For example, meta-analytic estimates for action and goal priming using incidentally presented words have been about d = 0.35 (Weingarten et al., 2016). In order to increase statistical power to detect even such a small effect, the present study used a more conservative effect size estimate of d = 0.30 (i.e., less than half the effect reported in Lichtenfeld et al., 2009). Moreover, to guard against type II error, the power was set to 95%. An a priori power analysis estimated a required sample size of N = 1,180 (for details see the supplemental material).

Method
The study was conducted as an unproctored, web-based test. A sample of N = 1,149 participants from a German online access panel (596 female, 552 male, and 1 without information on sex) with a median age of 38 years (Min = 16, Max = 85) were randomly assigned to a red color (n = 563) or a gray color (n = 586) condition. The respondents were administered a short knowledge test measuring crystallized intelligence from the Berlin Test of Fluid and Crystallized Intelligence - Short Scale (Schipolowski, Wilhelm, & Schroeders, 2013). The test included 12 multiple-choice items with four response options each (one of which was correct). The number of correct answers was the dependent variable. Missing responses were scored as incorrect. The experimental manipulation was implemented in a similar way as in Experiment 2 of Lichtenfeld et al. (2009). Before the knowledge test, the following example item explaining the logic of the test was presented: "Which of these trees is a leaf tree?" with four response options "Nordmann-fir, red/gray-alder, Sargent-spruce, mountain-pine". The experimental manipulation again consisted of the correct solution shown either as "red-alder" (red color condition) or "gray-alder" (gray color condition). In addition, a description below the item (one sentence) explained that "red/gray-alder" was the correct solution. To enforce processing of the color word, respondents had to give the correct response to the manipulated example item before being able to proceed to the knowledge test. Because Lichtenfeld and colleagues (2009) assumed that worries about test performance would mediate the color effect on that performance, three worry items (e.g., "I am not satisfied about my performance in the test.") based on Morris, Davis, and Hutchings (1981) were presented after the knowledge test with seven-point response scales from 1 (does not apply at all) to 7 (strongly applies). The study was preregistered at https://doi.org/10.23668/psycharchives.2102.

Results
Participants in the red color condition answered fewer knowledge items correctly (M = 8.71, SD = 2.22) as compared to participants in the gray color condition (M = 8.76, SD = 2.22). However, an independent sample t-test (one-tailed) showed no significant (p < .05) difference between the two color conditions, t(1147) = 0.35, p = .365, and an effect size close to zero, d = 0.02, 95% CI [-0.10, 0.14]. Furthermore, an equivalence test, t(1147) = -7.88, p one-tailed > .999, indicated that the observed effect was not statistically equivalent to the original effect. Examining respondents' sex revealed a significant sex difference, F(1, 1144) = 13.64, p two-tailed < .001, CI [0.00, 0.01], was significant. In conclusion, despite the high power of the study, we found no support for an effect of reading the word red on knowledge test performance or self-reported worries.

Discussion
Colors are assumed to convey meaning that can influence cognitive functioning in achievement contexts (cf. Elliot, 2015; Elliot & Maier, 2014). Even the mere processing of color words without actually seeing any color stimuli has been reported to exert such effects (Lichtenfeld et al., 2009). Thus, reading the word red supposedly impairs performance on verbal and numeric intelligence tests. The present study tested this assumption by trying to replicate Experiment 2 in Lichtenfeld et al. (2009). In two direct replications and two conceptual replications, red color effects were examined for different outcomes (i.e., verbal analogy and general knowledge tests) covering different age groups (from high school students to adults). However, across the four experiments no effect of processing red was observed (see Table 1). Notably, the largest effect size of Cohen's d = -0.23 fell in the wrong direction. Despite the large effects identified in the original study and the substantially larger sample sizes in our replication attempts, we were unable to corroborate that simply processing the word red impairs intellectual performance.
These replication failures do not necessarily invalidate the basic premise of red color effects in achievement situations (Elliot et al., 2007) or color-in-context theory (Elliot & Maier, 2012). Rather, they raise doubts regarding the robustness and generalizability of previously reported results. For example, it could be the case that red color impairs intellectual performance, but these effects are so small (and practically negligible) that they require huge sample sizes to be reliably identified. After all, the original studies reported rather imprecise effect estimates (see Table 1) ranging from substantial red color effects (exceeding d = 1.00) to negligible effects close to zero. Our results suggest that the latter seems more likely. On the other hand, Elliot (2015, 2019) emphasized that only a precise combination of luminance, chroma, and hue is expected to produce intellectual impairment. According to this line of reasoning, simply reading the word red should not elicit any cognitive effects. Moreover, subtle differences in the experimental procedure by Lichtenfeld et al. (2009) and our replications could have introduced unknown confounds (e.g., regarding the precise instructions or respondent incentives; see supplementary materials for details) that led to different results. Therefore, we concur with Elliot (2019) that further high-quality studies are needed "to serve as a cornerstone on which a solid empirical foundation can be built" (p. 16). A large-scale collaborative research project including different research teams (cf. Moshontz et al., 2018) might help devise a study to uncover robust color effects with sufficient power. On a positive note, the present results suggest that color effects are unlikely to exert systematic biases in applied settings.
If red color effects require highly standardized settings to be observable, a robust impact on, for example, school exams or cognitive testing in personnel selection seems improbable, at least in the case of purely text-based color effects. Thus, it would seem premature to recommend considering such effects in standard psychological and educational assessment.

Data Accessibility Statement
The data for all experiments, including the statistical syntax to reproduce our findings, can be found at https://osf.io/5ckvd; methodological details on all four experiments are available in the electronic supplemental material.