Interpreting a failure to replicate is complicated by the fact that the failure could be due to the original finding being a false positive, unrecognized moderating influences between the original and replication procedures, or faulty implementation of the procedures in the replication. One strategy to maximize replication quality is involving the original authors in study design. We (17 labs, N = 1,550 participants after exclusions) experimentally tested whether original author involvement improved replicability of a classic finding from Terror Management Theory (Greenberg et al., 1994). Our results were non-diagnostic of whether original author involvement improves replicability because we were unable to replicate the finding under any conditions. This suggests that the original finding was either a false positive or that the conditions necessary to obtain it are not fully understood or no longer exist. Data, materials, analysis code, preregistration, and supplementary documents can be found on the OSF page: https://osf.io/8ccnw/
A substantial proportion of replications in recent large-scale efforts have failed to support the original finding (Nosek et al., 2022). Fewer than 40% of 100 replications from three top psychology journals were considered successful across a variety of criteria (Open Science Collaboration, 2015). Likewise, 13 of 21 social science experiments published in the journals Nature or Science were interpreted as replicating successfully, based on observing statistical evidence (p < .05) in the same direction as the original study (Camerer et al., 2018). Using similar criteria, the “Many Labs” projects conducted high-powered replications of non-randomly sampled studies, yielding variable rates of success: Many Labs 1 (Klein et al., 2014): 10/13, Many Labs 2 (Klein et al., 2018): 14/28, Many Labs 3 (Ebersole et al., 2016): 3/10, Many Labs 5 (Ebersole et al., 2020): 2/10. There are many possible contributors to failures to replicate, including publication bias or p-hacking in original research, unidentified moderators that differ between the original and replication, and failures to effectively implement the replication studies. Shortcomings in implementation can occur when knowledge of key features of the study methods is not transferred effectively to replication teams.
We sought to evaluate whether involving original authors in selection and design could improve replication success. To do so, we selected an important theory that is also considered sensitive to the procedures necessary to elicit its effects: Terror Management Theory (TMT; Greenberg et al., 1986, 1994). TMT considers the psychological consequences of being reminded of one’s impending death. This topic has spawned hundreds of publications, some with more than 1,000 citations. TMT states that as humans evolved self-awareness, they also came to know that their death is inevitable. To avoid preoccupation with thoughts of death or feelings of meaninglessness, one must manage the potential terror caused by this knowledge (Becker, 1973, 1975). Greenberg, Pyszczynski, and Solomon (1986) proposed that self-esteem is the buffer to these intrusive thoughts, and that the purpose of self-esteem is to manage terror related to mortality. TMT has been applied to understand human activities such as religion (e.g., belief in an afterlife grants literal immortality, alleviating mortality terror; Jonas & Fischer, 2006), cultural identity (e.g., belief in being part of a greater good that will persist after death; Greenberg et al., 1990), and inter-group conflict (Burke et al., 2010). TMT experts also indicated that there was substantial nuance required in implementing a successful TMT study, and at least one doubted it could be captured in a “Many Labs” style project at all. These nuances include how the experimenter delivers the experimental script (tone, manner), precise design of materials, accounting for contextual details, and other aspects which were refined over years. However, there has never been a systematic investigation of the replicability of TMT findings.
To investigate whether author involvement could improve replicability, we engaged experts in three phases of research design. First, we conducted a community-based search for an important area of research that requires expertise for effective implementation and had experts willing to contribute to replication. TMT met these criteria.
Second, we consulted with TMT experts – Tom Pyszczynski and Sheldon Solomon – to identify a seminal study appropriate for replication (Jeff Greenberg later contributed to the design of the study materials). Candidate studies had to take 30 minutes or less, be administrable on paper or computer, and preferably have two or fewer between-subjects conditions to maximize statistical power. We sought studies that had room for researcher flexibility in design, were theoretically central to TMT, and had reasonably high expectations of replicability. In consultation with experts, we selected Study 1 from Greenberg, Pyszczynski, Solomon, Simon, and Breus (1994). This paper is highly influential (1,252 citations on Google Scholar as of January 14, 2022), and Study 1 used a prototypical mortality salience manipulation: writing about one’s own death, which 79.8% of studies in the field of TMT have used (Burke et al., 2010).
Third, original authors were instrumental for designing the “Author Advised” version of the replication protocol. The goal was to leverage all possible expertise to design a study with the greatest chance of replicating the original effect. All parts of the description, implementation, and analysis plan for this Author Advised protocol were reviewed by at least one original author before data collection began.
To investigate whether author involvement increases replicability, we recruited 21 teams and randomly assigned them to either administer the Author Advised protocol or to develop and administer their own “In House” protocol to samples that they recruited. In House protocols were designed without contact with the original authors, other teams participating in the project, or other domain experts. Each lab was responsible for collecting data from at least 80 participants to ensure a high degree of statistical power when aggregating across sites.1 At some labs, when separate principal investigators were identified to administer independent replications, both In House and Author Advised protocols were conducted.
With this design, we examined the following questions:
Can we successfully replicate a central finding supporting TMT? We replicated Study 1 of Greenberg et al. (1994) in 17 labs with 1,578 participants (1,550 after individual-level exclusions). At minimum, this provides a high-powered test, for all but very small effect sizes, of this important finding.
Does original author involvement improve replicability? The selected study (Greenberg et al., 1994, Study 1) left ample room for non-experts to miss potentially critical aspects of implementation. For example, the materials described in the original paper deviated somewhat from those in the eventual Author Advised replication implementation, and the paper contained little of the advice regarding lab context and setup. We tested whether the Author Advised protocol was more effective than In House protocols at eliciting statistically significant effects and larger effect magnitudes.
Does a standardized, author-reviewed protocol produce less variability across data collection sites? We hypothesized that a standardized implementation of the Author Advised protocol across sites would result in less variation in results compared to the In House protocols because the In House protocols would likely have substantial variability in procedures.
Method
Study 1 of Greenberg et al. (1994) provided evidence that reminders of death induce worldview defense, and that this effect was stronger when the reminders of death were comparatively subtle (i.e., thinking about one’s own death) than when reminders of death were more emotionally salient (i.e., thinking about one’s own death and reflecting on one’s deepest feelings about dying). In the original study, a total of 59 introductory psychology students in the United States were randomly assigned to one of five conditions. Our replication focused on the two conditions that were most effective in the original paper: the condition labeled “subtle own death salient” and the “TV salient” (control) condition. The outcome of interest was evaluations of pro- vs. anti-American essay authors. Specifically, participants in the death salient condition reported a greater preference for the pro-American essay author over the anti-American essay author, compared to participants in the control condition.
Participants in both conditions were told that the study session was composed of two separate studies. In the “first study,” participants completed two filler measures and the mortality salience or control manipulation. Participants in the condition described as “subtle own death salient” wrote about the emotions they experienced when thinking about their own death, and about what would happen to their physical body as they were dying and once they were dead. Participants in the “TV salient” condition received similar writing prompts, but instead described the emotions they would experience while watching television, and what they thought happened to their physical body as they watched television. Participants then completed the Positive and Negative Affect Schedule (PANAS; Watson et al., 1988).
In a so-called “unrelated second study” during the same session, participants read two brief essays ostensibly written by international students. One essay was pro-American and the other was anti-American. For the dependent variables, participants answered three questions evaluating each essay’s author, and two questions evaluating the essay itself. The present replication focused on evaluations of the author because that showed the strongest effect in the original study. A composite rating of the author was created by subtracting the mean of the three anti-American items from the mean of the three pro-American items. Participants in the “subtle own death salient” condition showed a greater preference for the pro-American author over the anti-American author (Mean Diff = 12.25) than did participants in the “TV salient” (control) condition (Mean Diff = 1.64), t(53) = 4.87, p < .001, Cohen’s d = 1.34 (Greenberg et al., 1994).
Notable Differences from Original. The replication included just two of the five conditions from the original (“subtle own death salient” and “TV salient”) and focused on just the evaluations of the essay authors (disregarding any evaluations of the essays themselves). Because this mortality salience condition may not be considered particularly subtle, and is described in the original paper as their usual mortality salience induction, we refer to it simply as the mortality salience condition from here on. The replication studies were administered in the United States in the Fall of 2016 and Spring of 2017.
Procedure for Author Advised Protocol. All materials and instructions used in the study are available on the OSF page (https://osf.io/bq4n4/), including the detailed procedures and advice provided to Author Advised sites. For the replication, we made several changes to the original protocol as requested by the original authors:
The Anti-American essay was adjusted to be more forceful and extreme, to ensure it conflicted with the average participant’s worldview.
The filler-task portion of the study was extended to include additional surveys and increase the delay between the mortality salience induction and essay evaluation.
The procedure was designed to put participants into a relaxed, experiential mood. The expert instruction packets specified precise language and included recommendations such as selecting relaxed research assistants to administer the study, using a covered box for handing in packets to ensure a feeling of confidentiality, separating participants into different cubicles or rooms, and having experimenters act and dress casually.
Additional demographic items were added to the end of the survey to apply exclusion criteria.
The Author Advised protocol was always administered in person, in individual cubicles or small groups. When participants first entered the room, they were read a prepared script that included a cover story describing the session as two separate studies, along with statements assuring participants of the anonymity of their responses to reduce demand characteristics. Participants were instructed to complete the first half of the survey and return the completed form to a covered box; they were then given the second packet of materials.
The first materials packet started by reassuring participants that their responses were confidential, and that they should respond naturally: “On the following pages you will find a series of personality, attitude and judgment questionnaires. There are no right or wrong, or good or bad answers; rather different responses reflect different personalities, attitudes and judgment styles. Please respond honestly and naturally to each question and complete the questionnaires in the order that they appear in the packet. Your responses to these questions are completely anonymous and will be used for research purposes only.” Then, participants completed a 23-item “personality inventory” included as a distractor, in which participants answered “Yes” or “No” to items such as, “Does your mood often go up or down?” They also completed a 12-item measure modeled after the Personal Need for Structure Scale (Thompson et al., 2001), which was also included primarily as a distractor. Participants responded on a 6-point scale (1 – strongly disagree to 6 – strongly agree) to items like “It upsets me to go into a situation without knowing what I can expect from it.” After this, participants completed the mortality salience or TV (control) induction.
Prior to the induction text, participants were again reminded to respond naturally: “On the following page there are a couple of open-ended questions. Please respond to them with your first, natural response. We are just looking for people’s gut-level reactions to these questions.” Participants in the mortality salience condition then responded to two open-ended items disguised as “The Projective Life Attitudes Assessment.” The first item asked participants to “Please briefly describe the emotions that the thought of your own death arouses in you,” while the second item asked participants to “Jot down, as specifically as you can, what you think will happen to you physically as you die and once you are physically dead.” Participants in the TV (control) condition responded to nearly identical items, which instead asked about their emotions experienced while watching television and what happens to them as they watch television. Participants then completed the PANAS-X (Watson & Clark, 1994), a 60-item measure of current emotional state in which they indicated the degree to which they currently felt each of 60 emotions on a 5-point scale (1 – very slightly or not at all to 5 – extremely), and the Morningness-Eveningness Questionnaire (Horne & Ostberg, 1975), a 19-item scale assessing the degree to which participants perform better in the morning or evening. Example items included: “What time would you get up if you were entirely free to plan your day?” and, “What time would you go to bed if you were entirely free to plan your evening?” The purpose of these two scales was to provide “filler” time between the induction and the essay evaluations. This concluded the “first half” of the study and the first materials packet.
After the participant handed in the first packet, they were provided the second packet of materials, matched by participant number. Participants read a cover story indicating that the university had collected writing samples of impressions of America from international students attending the university. Participants then read two essays, in counterbalanced order across participants. One essay was relatively pro-U.S. and the other was relatively critical of the U.S. Participants rated each essay and its author on five items, each using a 9-point scale (1 = “not at all” to 9 = “extremely”). Example items include “To what extent do you think the essay makes valid points?” and “How intelligent do you think the person who wrote this essay is?”
Procedures for In House Protocols. The In House protocols were created independently by each participating lab. Instructions provided to these labs are available on the OSF page (https://osf.io/drfg2/). To the best of their ability, and using only the original paper, relevant literature, and any other publicly available resources, each lab designed their own replication protocol of the assigned study. The labs were prohibited from contacting the original authors, other experts, and the other participating labs. The only outside review was by the project leaders to confirm that the correct study was being replicated and that the resulting data collection would yield all variables necessary for the basic analysis plan.
The In House protocols differed substantially, both from one another and from the standardized Author Advised protocol. For example, 7 In House labs used the PANAS as a filler task between the manipulation and the dependent variables; the other 5 labs included no filler task at all. None of the In House labs used the same set of filler tasks as the Author Advised version, which included the PANAS and a second unrelated questionnaire. All 12 In House labs collected data using a computer instead of pencil-and-paper, a design explicitly discouraged by the original authors for the Author Advised version. We summarize major attributes of the designs of all In House sites in Table 2. All materials used at In House sites and videos documenting their implementation are available at http://osf.io/8ccnw/ to facilitate review of the varying procedural implementations.
Study Administration. Within each lab, participants were randomly assigned to either the mortality salience or control condition of the given protocol. At three universities, both protocols were administered by independent groups of researchers (to maintain blinding to the Author Advised materials). At those sites, the groups recruited participants separately, but randomized participants between conditions in their given protocol. Each collaborating lab was responsible for recruiting at least 80 participants for their assigned protocol. Participants were prohibited from participating in the study more than once.
Provisions for Quality Control
As in previous iterations of the “Many Labs” projects (e.g., Ebersole et al., 2016, 2020; Klein et al., 2014, 2018), we employed quality control and accountability standards for each lab’s data collection. Each site was required to film a mock session of data collection, in which a mock participant walked through the steps as if actually taking the study. These videos allowed more thorough documentation of the lab spaces and any idiosyncratic differences in protocol. In addition, sites were responsible for ensuring that all experimenters in direct contact with participants were blind to the participant’s condition assignment. Collaborators also completed an experimenter survey assessing their expertise, expectations, and motivations for joining the collaboration. The survey and videos are available at https://osf.io/8ccnw/.
At the end of the Author Advised protocol, participants filled out demographic information. Three of these items were included at the request of original authors: importance of American identity, country of birth, and race. In House labs were asked to collect similar demographic information (see https://osf.io/drfg2/), but these items were described minimally to avoid influencing design decisions, and the American identity item was omitted from those instructions entirely. As a result, demographic data from In House sites were not entirely comparable to demographic data from Author Advised sites.
Participants
Twenty-one labs from 18 universities participated in the project and collected data from a total of 2,281 participants. Several overall exclusions were made, each as specified by our preregistration: https://osf.io/4xx6w. First, data from four labs were excluded for collecting fewer than 60 participants.2 This excluded 158 participants. Next, some In House sites began data collection before the analysis plan was pre-registered due to their own deadlines. Data collected before the pre-registration were excluded from the confirmatory tests reported in this manuscript. This excluded a further 545 participants.3
From the remaining 17 labs and 1,578 participants, in adherence to our pre-registration, we excluded from all analyses participants who either failed to complete all 6 ratings of the essay authors, or who failed to complete both writing prompts within the mortality salience or control conditions (i.e., the between-subjects manipulation). The latter exclusion criterion applied only to participants from Author Advised sites, because the necessary data were not always available for In House sites. Thus, the usable sample included 1,550 participants (see Table 1 for a breakdown by lab).
Of these, 841 participants (54.26%) were women and 442 participants (28.52%) were men; the remaining participants did not respond to the item, were asked about gender in a non-standard way, or chose a different response. The mean age was 19.91 years (SD = 2.49). Participant-reported race was 946 (61.03%) White, 252 (16.26%) Asian, 112 (7.23%) Black or African American, 4 (0.26%) American Indian or Alaska Native, 7 (0.45%) Native Hawaiian or Pacific Islander, and 124 (8.00%) another category. The remaining participants did not report their race or their responses could not be recoded to match these categories.
Table 1
Location | Site Identifier | Author Advised (AA) or In House (IH) | N |
1. Brigham Young University – Idaho, ID | Byui | AA | 81 |
2. Ithaca College, NY | Ithaca | IH | 81 |
3. Occidental College, CA | Occid | AA | 88 |
4. Pace University, NY | Pace_expert | AA | 106 |
5. Pace University, NY | Pace_inhouse | IH | 58 |
6. Pacific Lutheran University, WA | Plu | IH | 60 |
7. The College of New Jersey, NJ | Cnj | AA | 135 |
8. University of California, Riverside, CA | Riverside | AA | 107 |
9. University of Florida, FL | Ufl | IH | 98 |
10. University of Illinois at Urbana- Champaign, IL | Illinois | IH | 89 |
11. University of Kansas, KS | Kansas_inhouse | IH | 75 |
12. University of Pennsylvania, PA | Upenn | IH | 86 |
13. University of Wisconsin, Madison, WI | Uwmadison_expert | AA | 79 |
14. University of Wisconsin, Madison, WI | Uwmadison_inhouse | IH | 68 |
15. Virginia Commonwealth University, VA | Vcu | AA | 103 |
16. Wesleyan University, CT | Wesleyan_inhouse | IH | 95 |
17. Worcester Polytechnic Institute, MA | Wpi | IH | 141 |
Note: Location numbers rather than site names will be used on subsequent tables. N reported here is after Exclusion Set 1 is implemented (see analysis plan). The In House site at Pace University is retained because it provided more than 60 participants in total.
Table 2
Protocol | 2 | 5 | 6 | 9 | 10 | 11 | 12 | 14 | 16 | 17 | AA |
Filler task(s) before mortality salience: | X | X | X | X | X | X | X | X | |||
Rosenberg Self-Esteem Scale | X | X | X | X | X | X |
Neuroticism scalea | X | X | X | X | |||||||
TIPIb | X | X | |||||||||
BFI or shortened BFIc | X | X | |||||||||
Dot taskd | X | ||||||||||
Filler task after mortality salience: | X | X | X | X | X | X | X | ||||
Mood scale | X | ||||||||||
PANAS | X | X | X | X | X | X | X | X | |||
MEQe | | | | | | | | | | | X |
Picture presented after mortality salience | X | X | |||||||||
Essays attributed to student author | X | X | X | X | X | X | X | X | X | ||
Computer data collection | X | X | X | X | X | X | X | X | X | X | |
Pencil/paper data collection | | | | | | | | | | | X |
Completed in the lab | X | X | X | X | X | X | X | ||||
Completed online | X | X | X | X | |||||||
College student sample | X | X | X | X | X | X | X | X | X | X | X |
Used unchanged essays from Greenberg et al. (1994) | X | X | X | X | X | X |
a From the Eysenck Personality Inventory. b Ten Item Personality Inventory. c Big Five Inventory. d Participants instructed to place 10 dots among shapes. e Morningness-Eveningness Questionnaire.
Note: This table summarizes the major differences between the protocols developed by each In House site (identified by numbers at the top corresponding to Table 1) and compares them to the version administered identically across all Author Advised sites (“AA” column).
Analysis Plan
The primary finding of interest from Greenberg et al. (1994) was that participants who underwent the mortality salience treatment showed a greater preference for the pro-US essay author over the anti-US essay author as compared to the control condition. To assess whether the replication results support the original, within each lab we followed an analysis plan similar to that of the original article, as specified by our pre-registration (https://osf.io/4xx6w). Scores from the three items evaluating the authors of the anti-American essays were averaged (α = 0.89) and then subtracted from the average of the three items evaluating authors of the pro-American essays (α = 0.89). An independent-samples t-test was then conducted comparing scores from the mortality salient (MS) condition with scores from the “TV salient” (control) condition. We then analyzed these individual results meta-analytically to obtain an aggregate effect size across all labs. Supplemental exploratory (non-preregistered) analyses treating these as two separate dependent variables are available in the online supplement (https://osf.io/xtg4u/), and those outcomes do not qualify the conclusions offered here.
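For concreteness, a minimal sketch of this per-lab computation in R follows; the data frame dat and column names such as pro1–pro3, anti1–anti3, and condition are hypothetical placeholders, not the project’s actual variable names:

    # Composite: mean of the three pro-US author items minus the mean
    # of the three anti-US author items.
    dat$composite <- rowMeans(dat[, c("pro1", "pro2", "pro3")]) -
      rowMeans(dat[, c("anti1", "anti2", "anti3")])

    # Independent-samples t-test: mortality salience (MS) vs. TV control.
    t.test(composite ~ condition, data = dat, var.equal = TRUE)

    # Standardized mean difference with the small-sample correction
    # used for Hedges' g.
    ms <- dat$composite[dat$condition == "MS"]
    tv <- dat$composite[dat$condition == "TV"]
    df <- length(ms) + length(tv) - 2
    sp <- sqrt(((length(ms) - 1) * var(ms) +
                (length(tv) - 1) * var(tv)) / df)
    g  <- (1 - 3 / (4 * df - 1)) * (mean(ms) - mean(tv)) / sp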
In addition to the overall exclusions detailed previously in the “Participants” section, original authors requested some additional exclusions based on participant responses. Original authors were not entirely in agreement about which exclusions should be implemented. Therefore, as indicated in the pre-registration, we repeated our analyses under three different exclusion criteria: a minimal set of exclusions (Exclusion Set 1) and two stricter sets recommended by original authors (Exclusion Sets 2 and 3).
Because Exclusion Sets 2 and 3 were specifically recommended by original authors and could be considered part of their expertise, we did not inform the In House labs about these exclusions prior to data collection. This masking was intended to avoid influencing their decisions about data collection procedures, which had to be made independently of any outside help. As a result, however, none of the In House labs collected the data required to apply these stricter exclusions, because the necessary items were not present in the original target article.
Therefore, analyses were repeated for each of the three exclusion sets for Author Advised participants. However, Exclusion Set 1 was used consistently across all analyses for In House participants. The exclusion sets consisted of:
Exclusion Set 1: Exclude participants who did not complete both writing prompts and all six items evaluating the essay authors. This yields 1,550 participants (n = 699 Author Advised, n = 851 In House). This aggregate sample size gives us 95% power to detect a mortality salience condition effect of d = 0.18 in an independent samples t-test.4
Table 3. Results by site under Exclusion Set 1.
Location | Author Advised (AA) or In House (IH) | N (TV) | N (MS) | Mean (TV) | Mean (MS) | SD (TV) | SD (MS) | Hedges' g | p |
1 | AA | 41 | 40 | 1.77 | 2.65 | 2.06 | 2.06 | 0.42 | .06 |
2 | IH | 39 | 42 | 0.12 | 0.06 | 1.86 | 2.67 | -0.03 | .9 |
3 | AA | 42 | 46 | 0.33 | 0.52 | 1.66 | 2.26 | 0.09 | .66 |
4 | AA | 53 | 53 | 1.33 | 1.18 | 1.89 | 2.05 | -0.08 | .68 |
5 | IH | 34 | 24 | 2.2 | 0.11 | 5.93 | 5.33 | -0.36 | .17 |
6 | IH | 30 | 30 | 0.32 | -0.18 | 1.79 | 2.14 | -0.25 | .33 |
7 | AA | 59 | 76 | 1.18 | 1.57 | 1.97 | 1.69 | 0.22 | .22 |
8 | AA | 52 | 55 | 0.79 | 0.73 | 1.68 | 1.72 | -0.04 | .84 |
9 | IH | 38 | 60 | 0.87 | 0.92 | 1.24 | 1.18 | 0.04 | .83 |
10 | IH | 45 | 44 | -0.79 | 0.61 | 1.78 | 2 | 0.73 | <.001 |
11 | IH | 40 | 35 | 1.32 | 1.21 | 3.03 | 3.04 | -0.03 | .88 |
12 | IH | 45 | 41 | -0.16 | 0.2 | 1.75 | 2.23 | 0.17 | .42 |
13 | AA | 39 | 40 | 0.94 | 0.91 | 1.72 | 2.16 | -0.02 | .94 |
14 | IH | 36 | 32 | -0.74 | -0.93 | 2.05 | 1.77 | -0.1 | .69 |
15 | AA | 42 | 61 | 1.33 | 1.3 | 1.15 | 1.99 | -0.01 | .94 |
16 | IH | 53 | 42 | -0.56 | -0.22 | 1.82 | 1.46 | 0.2 | .32 |
17 | IH | 70 | 71 | 0.54 | 0.51 | 1.95 | 1.61 | -0.02 | .92 |
Exclusion Set 2: All prior exclusions, and further exclude participants who did not identify as White or who indicated they were born outside the United States. This reduces the N to 1,229 participants (n = 378 Author Advised, n = 851 In House). This aggregate sample size gives us 95% power to detect a mortality salience condition effect of d = 0.21.
Exclusion Set 3: All prior exclusions, and further exclude participants who responded lower than 7 on the American Identity item (“How important to you is your identity as an American?” 1 - not at all important; 9 - extremely important). This further reduces the usable N to 1,076 participants (n = 225 Author Advised, n = 851 In House). This aggregate sample size gives us 95% power to detect a mortality salience condition effect of d = 0.22.
All data handling, exclusions, and computation of results within sites followed our pre-registered (prior to data collection) analysis plan on the OSF (https://osf.io/4xx6w). All results are reported with two-tailed tests of significance.5
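As a point of reference, the power figures above can be approximated with the ‘pwr’ R package cited in footnote 4; a minimal sketch for Exclusion Set 1, assuming an even split of the 1,550 participants across conditions:

    library(pwr)
    # Solve for the smallest effect detectable with 95% power given
    # n = 775 per condition, alpha = .05, two-tailed. Leaving d
    # unspecified makes pwr.t.test() solve for it (about d = 0.18).
    pwr.t.test(n = 775, power = 0.95, sig.level = 0.05,
               type = "two.sample", alternative = "two.sided")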
Results
Deviations from Pre-registered Analytic Plan
Our pre-registered analytic plan specified the use of a three-level meta-analysis, conducted with the metaSEM R package (Cheung, 2014), to control for the clustering of effect sizes when independent teams ran both In House and Author Advised versions of the protocol at the same university. However, during data analysis we discovered that these models failed to optimize to a solution (OpenMx status1 = 5), which means that their output cannot be interpreted (see https://openmx.ssri.psu.edu/wiki/errors). This is likely because we did not have enough data within each cluster, as almost all sites ended up conducting only one study (i.e., administering only the Author Advised or In House protocol, not both). As such, we had to drop the clustering variable. The results reported below therefore come from a more common univariate meta-analysis, the model that most closely mirrors our originally planned analysis.
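A sketch of this model switch using the metaSEM package is below; the data frame site_effects and its columns g (per-site Hedges’ g), v_g (sampling variance), and university (cluster ID) are hypothetical names, not the project’s actual objects:

    library(metaSEM)
    # Pre-registered three-level model with sites nested in universities;
    # in our data this failed to optimize (OpenMx status1 = 5).
    m3 <- meta3(y = g, v = v_g, cluster = university, data = site_effects)
    summary(m3)

    # Fallback: univariate random-effects meta-analysis without clustering.
    m1 <- meta(y = g, v = v_g, data = site_effects)
    summary(m1)  # grand mean effect, tau^2, and Q test of heterogeneity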
Research Question 1: Meta-analytic Results across All Labs (Random Effects Meta-analysis).
The most basic question is whether we observed the predicted effect of mortality salience on preference for pro- vs. anti-American essay authors. To assess this, we conducted a random-effects meta-analysis. This analysis produces the grand mean effect size across all sites and versions. Regardless of which exclusion criteria were used, we did not observe the predicted effect, and the confidence interval was quite narrow: Exclusion Set 1: Hedges’ g = 0.07, 95% CI = [-0.03, 0.17], SE = 0.05, Z = 1.32, p = .187. Exclusion Set 2: Hedges’ g = 0.09, 95% CI = [-0.03, 0.21], SE = 0.06, Z = 1.49, p = .135. Exclusion Set 3: Hedges’ g = 0.09, 95% CI = [-0.04, 0.22], SE = 0.07, Z = 1.30, p = .193. Forest plots showing the effects for individual sites and the aggregate are available in Figures 1, 2, and 3 for Exclusion Sets 1, 2, and 3, respectively. These results indicate that, in the aggregate, we failed to replicate the predicted mortality salience effect.
There may have been a mortality salience effect at some sites and not others, so we next examined how much variation was observed among effect sizes (i.e., heterogeneity). This variation did not exceed what would be expected by chance (i.e., sampling variance) regardless of exclusion rule: Exclusion Set 1: Q(16) = 19.35, p = .251; Exclusion Set 2: Q(16) = 21.72, p = .152; Exclusion Set 3: Q(16) = 18.57, p = .292. This result suggests that any observed differences in effect size between sites are likely due to random noise, as opposed to true differences in effect size.
In sum, we observed little evidence for an overall effect of mortality salience in these replications. Additionally, overall results suggest that there was little or no heterogeneity in effect sizes across sites. This lack of variation suggests that it is unlikely we will observe an effect of Author Advised versus In House protocols or other moderators such as differences in samples or TMT knowledge. Even so, the plausible moderation by Author Advised/In House protocol is examined in the following section.
Research Question 2: Moderation by Author Advised/In House protocol
A covariate of protocol type (In House vs. Author Advised) was added to the random-effects model to create a mixed-effects meta-analysis. This is our primary model of interest, and the model most similar to the three-level mixed-effects meta-analysis that we pre-registered as our primary outcome.
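A sketch of this mixed-effects model in metaSEM, reusing the hypothetical site_effects data frame from above with a 0/1 indicator author_advised for protocol type:

    # Mixed-effects meta-analysis with protocol type as a moderator.
    # The intercept estimates the In House mean effect; the slope
    # tests the Author Advised vs. In House difference.
    mix <- meta(y = g, v = v_g, x = author_advised, data = site_effects)
    summary(mix)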
This analysis again produces an overall grand mean effect size; these estimates were again near zero and relatively precise across all three Exclusion Sets: Exclusion Set 1: Hedges’ g = 0.07, 95% CI = [-0.03, 0.17], SE = 0.05, Z = 1.33, p = .182. Exclusion Set 2: Hedges’ g = 0.11, 95% CI = [-0.02, 0.24], SE = 0.07, Z = 1.71, p = .087. Exclusion Set 3: Hedges’ g = 0.13, 95% CI = [-0.03, 0.29], SE = 0.08, Z = 1.57, p = .117.
Again, significant heterogeneity was not observed: Exclusion Set 1, Q(16) = 19.35, p = .251; Exclusion Set 2, Q(16) = 21.72, p = .152; Exclusion Set 3, Q(16) = 18.57, p = .292. This again suggests that any observed differences in effect size between sites are likely due to random noise, as opposed to true differences in effect size.
Critically, protocol version did not significantly predict replication effect size regardless of which exclusion criteria were used. Exclusion Set 1: b = 0.01, 95% CI = [-0.09, 0.11], SE = 0.05, Z = 0.20, p = .842; Exclusion Set 2: b = 0.05, 95% CI = [-0.08, 0.18], SE = 0.07, Z = 0.82, p = .410; Exclusion Set 3: b = 0.07, 95% CI = [-0.09, 0.23], SE = 0.08, Z = 0.86, p = .391. The Author Advised version did not produce significantly larger effect sizes when compared with the In House versions.
Research Question 3: Effect of Standardization
Finally, we examined whether In House protocols displayed greater variability in effect size than Author Advised protocols. We outlined this hypothesis in our pre-registration, but the methods for testing it are exploratory.
As an initial test, we conducted separate meta-analyses for the In House and Author Advised labs. For each, we conducted both a fixed-effects (with variance between labs constrained to zero) and a random-effects meta-analysis, and then compared the two models with a chi-squared difference test to assess whether the fit significantly changed. If the random-effects model fit significantly better than the fixed-effects model, this would indicate that allowing for variability in effect sizes between sites improved the model.
In this case, neither In House nor Author Advised labs showed a significant benefit of the random-effects model over the fixed-effects model across any of the Exclusion Sets: In House labs (Exclusion Set 1 only): χ²(1) = 0.67, p = .41; Author Advised labs: Exclusion Set 1: χ²(1) = 0.00, p = 1; Exclusion Set 2: χ²(1) = 0.00, p = 1; Exclusion Set 3: χ²(1) = 0.00, p = 1. Overall, this evidence indicates that neither In House nor Author Advised labs showed significant variability in effect size across sites, despite the fact that In House labs were unambiguously more variable in their procedural implementation. This does not mean the variances were equal, but based on the present evidence we cannot conclude that they were different.
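A sketch of this comparison, again with the hypothetical site_effects data frame: metaSEM’s RE.constraints argument can fix the between-site variance at zero to obtain the fixed-effects model, and the two nested fits can then be compared with a likelihood-ratio (chi-squared difference) test:

    # Random-effects vs. fixed-effects fit within one protocol type.
    ih <- subset(site_effects, protocol == "IH")
    re <- meta(y = g, v = v_g, data = ih)
    fe <- meta(y = g, v = v_g, data = ih, RE.constraints = 0)
    anova(re, fe)  # chi-squared difference test on 1 df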
Follow-Up Exploratory (Non-preregistered) Analyses
Results reported in this section were not pre-registered and should be considered exploratory.
Researcher expectations and characteristics. A total of 23 researchers from 17 participating sites completed an experimenter survey about their motivations and expertise. This survey was administered during data collection, and although no researcher had access to overall project-wide results, about one third of the researchers reported looking at or analyzing their own site’s data prior to completing the survey. Psychology research experience ranged from 0 to 28 years (M = 9.78, SD = 9.50). Five (22%) indicated they had “a lot” of TMT knowledge, eight (35%) indicated “some” knowledge, four (17%) indicated little knowledge, five (22%) indicated zero knowledge, and one (4%) did not respond to the question. One researcher indicated that they were an expert in TMT, but their site did not reach the minimum sample size specified by the preregistration.
When asked what outcome they wanted to happen, ten (43%) indicated that they hoped for the project to successfully replicate the TMT effect, nine (39%) indicated no preference, and two (9%) hoped the project would result in a failure to replicate, with two (9%) researchers leaving the question blank. On average, the teams estimated a 56% chance of successful replication with a wide range of estimates from 20% to 95% (SD = 23.15).6
Results for expert-only protocols. To provide a test of the replicability and average effect size of TMT under ideal circumstances, one could focus only on the effect sizes within the Author Advised protocols. Effect sizes are descriptively larger among these sites but still not statistically significant. Exclusion Set 1: Hedges’ g = 0.08, 95% CI = [-0.07, 0.23], SE = 0.08, Z = 1.05, p = .292; Exclusion Set 2: Hedges’ g = 0.17, 95% CI = [-0.04, 0.37], SE = 0.10, Z = 1.59, p = .111; Exclusion Set 3: Hedges’ g = 0.19, 95% CI = [-0.10, 0.49], SE = 0.15, Z = 1.28, p = .200.7
Results for TMT-knowledgeable sites. Five principal investigators indicated having “a lot” of knowledge about TMT. One might expect these sites to have greater success at replicating the mortality salience effect. However, when restricting analyses to these five sites, results were not statistically significant under any exclusion set: Exclusion Set 1: Hedges’ g = 0.02, 95% CI = [-0.18, 0.23], SE = 0.10, Z = 0.20, p = .843; Exclusion Set 2: Hedges’ g = 0.03, 95% CI = [-0.18, 0.25], SE = 0.11, Z = 0.30, p = .766; Exclusion Set 3: Hedges’ g = -0.03, 95% CI = [-0.27, 0.22], SE = 0.13, Z = -0.20, p = .842.
Evaluations of pro- and anti-US authors. Overall, participants preferred the pro-US author (M = 5.68, SD = 1.83) to the anti-US author (M = 4.98, SD = 4.98). This difference was statistically significant, t(1549) = 11.80, p < .001.
Results for participants who preferred the pro-US author. The hypothesis that mortality salience would cause a participant to become more favorable to the pro-US author as compared to the anti-US author relies on the participant perceiving the pro-US stance as more similar to their own worldview (and/or the anti-US stance as threatening to their worldview). Original authors anticipated that the essays from the original study might not serve this function in the replication, which ran in 2016. There was particular concern that, in the months leading up to and following the 2016 US Presidential election of Donald Trump, the generally more liberal-leaning student bodies on college campuses might feel less patriotic and not identify with the pro-US worldview. For this reason, the anti-US essay from the original study was made more extreme in the Author Advised version of the replication.
This change successfully increased participant preferences for the pro-US author over the anti-US author among Author Advised replications as compared to In House replications, χ²(2) = 76.82, p < .001. Among In House replications, 48% of participants preferred the pro-US essay author, 42% preferred the anti-US essay author, and 10% had no preference. Among Author Advised replications, 68% of participants preferred the pro-US essay author, 22% preferred the anti-US essay author, and 10% had no preference.
Is the anticipated effect stronger among participants who preferred the pro-US author? We restricted the analysis to only participants at Author Advised sites who preferred the pro-US author. Under this restriction, there was still no significant difference between the mortality salience and control groups in their preference for the pro-US author over the anti-US author. Exclusion Set 1: Hedges’ g = 0.15, 95% CI = [-0.03, 0.33], SE = 0.09, Z = 1.60, p = .110; Exclusion Set 2: Hedges’ g = 0.13, 95% CI = [-0.12, 0.37], SE = 0.12, Z = 1.02, p = .306; Exclusion Set 3: Hedges’ g = 0.07, 95% CI = [-0.23, 0.37], SE = 0.15, Z = 0.46, p = .649.8
Discussion
We conducted a high-powered investigation of the replicability of a classic finding supporting Terror Management Theory (Greenberg et al., 1994; Study 1). With 17 labs contributing usable data from 1,550 participants, we observed little evidence that priming mortality salience increased worldview defense compared to a control condition. We intended to use this paradigm to evaluate whether experts advising on the research protocol would increase effect sizes and overall replication success compared to independent In House replication attempts. However, neither the Author Advised nor In House protocols successfully replicated the original finding. Moreover, we did not observe greater variability in effect sizes across sites using In House protocols despite their procedural variability compared to the standardized Author Advised protocol. With these protocols, in the context of these labs and time in history, we find little support for this key finding of TMT.
This failure to replicate is quite precisely estimated, but it does not mean that the original effect was necessarily a false positive. It could be that changes in history have substantially altered the observability of the effect (original d = 1.34; replication estimates ranged from 0.07 to 0.13 in the confirmatory tests). Or the effect size may be much smaller than prior studies have indicated, in which case future studies would need much larger sample sizes than the norm in order to study it appropriately. This null effect also does not necessarily mean that this method of invoking mortality salience is ineffective; it could be effective on other outcomes or in other circumstances. It could even be that the expert-advised modifications to the direct replication, such as making the anti-US essay more extreme, resulted in a smaller effect size. In any of these cases, we would need evidence for why the effect did not occur under these circumstances, and the theory would need to be updated to accommodate this boundary condition. Finally, the literature on TMT consists of a network of evidence using a variety of procedures and testing a variety of claims. A single failure to replicate, no matter how precisely estimated, does not overturn all of that prior work. The present evidence does, however, provide an important challenge for TMT to address. The study developed for this report was designed with feedback from experts who suggested the original study on which it was based because of its centrality to TMT (Nosek & Errington, 2020b). Moreover, the study was highly powered and used a preregistered analysis plan to maximize the diagnosticity of the statistical inferences (Nosek et al., 2018). In addition, we are aware of two other large-scale, pre-registered replications of TMT studies. Both of these efforts failed to find support for the effect being replicated (Sætrevik & Sjåstad, 2019; Schindler et al., 2021). Effective counterevidence to this challenge to the reliability of TMT would be new evidence showing that the finding can be reliably replicated under other conditions, along with direct evidence of a boundary condition that the present study inadvertently identified (Nosek & Errington, 2020a).
For example, at least one original author proposed prior to the study that the timing of the replication—September 2016 to May 2017, the period leading up to and following the election of Donald Trump as President of the United States—might result in a failure to observe the mortality salience effect on worldview defense. In essence, students at U.S. universities, which tend to be more liberal, may have been experiencing perpetually threatened worldviews, which would decrease the likelihood of demonstrating the mortality salience effect. While it remains possible that idiosyncrasies of the time period decreased the effect, we sought to address this concern in three ways: (1) the anti-US essay was made more extreme in its criticism, making it less likely to appeal to most Americans, (2) the strictest exclusion criteria selected for only highly patriotic Americans, and (3) we analyzed and reported exploratory results including only participants who indicated a preference for the pro-US author. Nevertheless, it would be possible to collect new evidence under similar circumstances to test this speculation directly.
It is also possible that this effect requires more expertise than the present design could deliver—for instance, one original author indicated that he was not comfortable endorsing any replication attempt that he did not directly supervise. Advice provided to Author Advised labs was sometimes subtle, making it difficult to ensure correct implementation. For example, labs were told to induce a casual, relaxed mindset in participants by, among other things, selecting laid-back research assistants, dressing casually, and using less formal lab space. Each site recorded a video of data collection to try to document these subtleties, but it is difficult to assess how successfully this context was induced. However, in exploratory analyses, even highly TMT-knowledgeable sites failed to replicate the mortality salience effect.
The failure to replicate undermined our goal of testing for an expertise effect on replicability. We selected this study in the belief that the finding was likely replicable, but subtle, so that we could study the extent to which expertise matters. Without any effect, we cannot evaluate whether expertise matters. We did observe procedural differences between the In House and Author Advised protocols, including features that were deemed critical by the original authors. For example, original authors agreed that in-lab administration would produce larger effects, but some In House protocols were web-based. Because these differences did not matter in this case, we cannot conclude much about the role of expertise for replicable phenomena. Future studies examining this role of expertise may benefit from using multiple original effects crossed between labs, although this may prove an intensive undertaking.
Perhaps most importantly, the present project underscores the need for replication even in areas of research with large bodies of published evidence. If expert advice on central findings of large literatures is insufficient to ensure replicability, then there is still a good deal of work to do to establish replicability of at least some apparently robust findings. Improvements to descriptions and sharing of methods, power of research designs, preregistration of studies and analysis plans, and peer review by experts, may all contribute to improving the robustness and reliability of research findings and their theoretical explanation.
Author Contributions
R. A. Klein, C. L. Cook, C. R. Ebersole, B. A. Nosek, and K. A. Ratliff conceived and designed the study idea. R. A. Klein, C. L. Cook, C. R. Ebersole, B. A. Nosek, and K. A. Ratliff developed study materials. P. Ahn, C. R. Chartier, C. D. Christopherson, S. Clay, B. Collisson, J. T. Crawford, R. Cromar, D. Dudley, G. Gardiner, C. L. Gosnell, J. Grahe, C. Hall, I. Howard, J. A. Joy-Gaba, M. Kolb, A. M. Legg, C. A. Levitan, A. D. Mancini, D. Manfredi, J. Miller, G. Nave, L. Redford, I. Schlitz, K. Schmidt, J. L. M. Skorinko, D. Storage, T. Swanson, L. M. Van Swol, L. A. Vaughn, B. Wiggins, and A. J. Brady adapted materials for their individual site and collected data. R. A. Klein and J. Hilgard analyzed the data. R. A. Klein, C. L. Cook, C. R. Ebersole, C. Vitiello, B. A. Nosek, and K. A. Ratliff drafted the report. All authors reviewed, edited, and approved the manuscript for submission.
Acknowledgements
We thank Jeff Greenberg, Tom Pyszczynski, Sheldon Solomon, and Armand Chatard for helping develop and review materials.
Funding
This project was supported in part by a Replication Studies grant (no. 401.18.053) from the Netherlands Organization for Scientific Research (NWO; www.nwo.nl), a Consolidator Grant (IMPROVE) from the European Research Council (ERC; grant no. 726361; https://erc.europa.eu/), a French National Research Agency “Investissements d’avenir” grant (ANR-15-IDEX-02), the John Templeton Foundation, Templeton World Charity Foundation, Templeton Religion Trust, and Arnold Ventures. Those funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests
B. A. Nosek is Executive Director of the Center for Open Science, a non-profit organization with a mission to increase openness, integrity, and reproducibility of research.
Data Accessibility Statement
All the stimuli, presentation materials, de-identified participant data, and analysis scripts can be found on the project page: https://osf.io/8ccnw/
Footnotes
1. Data collection was cancelled at one site (N = 19 participants) when it became clear that they would fall far short of the data collection target. Data collected at that site were dropped from the project entirely. Four additional sites completed data collection, but fell short of the required N according to our pre-registered exclusion criteria, as detailed in the analysis section. Data from these sites are available on the OSF project page, but are excluded from the analyses reported in this manuscript.
2. We originally posted a pre-print accidentally omitting this exclusion: https://psyarxiv.com/vef2c/. In a critique, Chatard, Hirschberger, and Pyszczynski (2020) noted this as a deviation from the pre-registration. We revised the present manuscript with these comments in mind and ensured strict adherence to the pre-registration.
3. Whether these participants are included or excluded has little impact on the results. When these participants are included, some confidence intervals are somewhat narrower, and some results show effect sizes that are slightly smaller. The results presented without these exclusions may be viewed on the OSF page, and all data are available for reanalysis: https://osf.io/8ccnw/
4. Power calculated with the ‘pwr’ R package (Champely, 2020), assuming equal distribution of participants between the mortality salience and control conditions and a two-tailed test with alpha of .05.
5. We did not specifically pre-register whether tests would be one- or two-tailed, and some have argued that one-tailed tests would be more appropriate (e.g., Chatard et al., 2020). However, we had always intended to use two-tailed tests; this is consistent with all prior Many Labs projects (Ebersole et al., 2016, 2020; Klein et al., 2014, 2018) and the pre-registered sample analysis code (https://osf.io/4xx6w).
6. Including only sites that had not looked at any data, researchers still estimated a 56% chance of successful replication.
7. For Exclusion Sets 1 and 3 we report the random-effects model. We report the fixed-effects model for Exclusion Set 2 because the optimizer could not reach a solution for the random-effects model (OpenMx error code = 5, indicating that the result should not be interpreted).
8. The random-effects meta-analysis failed to converge because the heterogeneity parameter, τ, was estimated to be essentially zero. To get the model to converge, we constrained τ to zero, creating a fixed-effects meta-analysis.