A majority of statistically educated scientists draw incorrect conclusions based on the most commonly used statistical technique: null hypothesis significance testing (NHST). Frequentist results are often claimed to be incorrectly interpreted as Bayesian outcomes, which suggests that a Bayesian framework may better fit the inferences researchers frequently want to make (Briggs, 2012). The current study set out to test this proposition. First, we investigated whether there is a discrepancy between what researchers think they can conclude and what they want to be able to conclude from NHST. Second, we investigated to what extent researchers want to incorporate prior study results and their personal beliefs in their statistical inference. Results show the expected discrepancy between what researchers think they can conclude from NHST and what they want to be able to conclude. Furthermore, researchers were interested in incorporating prior study results, but not their personal beliefs, into their statistical inference.

Null hypothesis significance testing (NHST) is used in most scientific disciplines, including Psychology (Rucci & Tweney, 1980), Economics (McCloskey & Ziliak, 1996) and Medical Sciences (Chavalarias et al., 2016; Goodman, 1999). In NHST, an alternative hypothesis (for example, there is a mean difference between a treatment group and a control group) is tested against a null hypothesis (for example, there is no mean difference between a treatment group and a control group). The measured test statistics (e.g., t-statistics, F-statistics) indicate the difference between one’s data and the null model prediction. The philosophical underpinning of NHST is called frequentism, and allows researchers to draw conclusions that are based on the average performance of these test statistics for a hypothetical infinite repetition of experiments. Thus, a p-value is the probability of obtaining the observed test statistics or more extreme ones, assuming the model assumptions (e.g., linearity, independence) are met and the null hypothesis is true (Greenland et al., 2016).
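For concreteness, the p-value defined above can be computed directly from the t distribution. The numbers below are hypothetical and only illustrate the mechanics of a two-sided, two-sample t-test:

```python
from scipy import stats

# Two-sided p-value: the probability of a test statistic at least as
# extreme as the one observed, assuming H0 and the model assumptions hold.
t_obs = 2.7
df = 38          # e.g., two groups of 20 subjects: df = 20 + 20 - 2
p_value = 2 * stats.t.sf(abs(t_obs), df)  # about 0.01
```

Note that the p-value says nothing directly about the probability that either hypothesis is true; it is a statement about the data under the null model.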

Despite the central role of NHST in the scientific process, the results from these techniques are misinterpreted by a majority of statistically educated scientists (Falk & Greenbaum, 1995; Haller & Krauss, 2002; Hoekstra et al., 2014; Lyu et al., 2020; Oaks, 1986). For instance, Oaks (1986) presented a scenario to Psychology researchers and students and asked them about their endorsement of six false statements regarding a significant p-value (see Table 1). These statements were: (1) You have absolutely disproved the null hypothesis; (2) You have found the probability of the null hypothesis being true; (3) You have absolutely proved your experimental hypothesis; (4) You can deduce the probability of the experimental hypothesis being true; (5) You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision; and (6) You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

Table 1. The six presented statements taken from Haller & Krauss (2002).

| Number | Statement |
| --- | --- |
| 1 | You have absolutely disproved the null hypothesis (that is, there is no difference between the population means). |
| 2 | You have found the probability of the null hypothesis being true. |
| 3 | You have absolutely proved your experimental hypothesis (that there is a difference between the population means). |
| 4 | You can deduce the probability of the experimental hypothesis being true. |
| 5 | You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. |
| 6 | You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. |

None of the six presented statements are valid interpretations of a significant p-value.1 On average, 2.5 incorrect statements were endorsed, independent of the participants’ statistical background. In a subsequent study, these findings were replicated even when participants were given a prior explanation of the correct interpretation of NHST (Falk & Greenbaum, 1995). Misinterpretations of NHST were even found among statisticians and researchers who teach statistics (Haller & Krauss, 2002; Lecoutre et al., 2003). In addition, these misconceptions of NHST have been found in numerous statistical textbooks (Gigerenzer, 2004), such as Introduction to Statistics for Psychology and Education:

[a significant test result] … is the probability that an observed difference is real. … If the probability is low, the null hypothesis is improbable … (Nunnally et al., 1975, p. 194)

The most common misconception, which was endorsed by 73% of methodology instructors and 68% of psychology students, was “You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision” (Haller & Krauss, 2002). Thus, a majority of statistically educated researchers misinterpret frequentist statistics as the probability of one’s hypothesis being true. That is, researchers misinterpret p-values as the probability of some hypothesis being true given the data (P(H0|D)), when in fact p-values are the probability of the observed data, or data more extreme, given the hypothesis (e.g., for a right-tailed test, P(≥D|H0)). This is a crucial difference, as a given set of data might lead to the rejection of the null hypothesis but could also lead to the rejection of some alternative hypothesis (Wagenmakers et al., 2011). The probability of the alternative hypothesis cannot explicitly be taken into account in a frequentist framework, but it can be in a Bayesian framework (Dienes, 2011; Etz & Vandekerckhove, 2018; Falk & Greenbaum, 1995).
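The difference between these two conditional probabilities can be made concrete with Bayes’ rule. The numbers below are purely hypothetical and chosen only to show that P(H0|D) need not equal P(D|H0):

```python
# Hypothetical illustration: the data have probability 0.01 under H0 and
# 0.30 under H1, and both hypotheses are equally plausible a priori.
p_d_h0, p_d_h1 = 0.01, 0.30
p_h0 = 0.5
p_h1 = 1 - p_h0

# Bayes' rule: P(H0|D) = P(D|H0) * P(H0) / P(D)
p_d = p_d_h0 * p_h0 + p_d_h1 * p_h1
p_h0_d = p_d_h0 * p_h0 / p_d  # about 0.032, not 0.01
```

Even in this simple two-hypothesis setup, the posterior probability of H0 depends on the prior and on how well H1 accounts for the data, quantities that a p-value does not involve.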

The relevance of directly testing the alternative hypothesis can be illustrated by the court case of Sally Clark (see e.g., Rouder et al., 2016; Wagenmakers et al., 2018). Sally Clark was accused of murdering her two children. The judge and jury had to weigh two competing explanations for the death of the two infants: either Sally Clark murdered her own children, or both children died from sudden infant death syndrome (SIDS). Based on the estimate that the probability of SIDS happening twice in a row was roughly 1 in 73 million, Sally Clark was convicted of double murder. This exemplifies NHST reasoning, in which only a single hypothesis is taken into account (H0: the children died because of SIDS), and an alternative is accepted by default if this single hypothesis is found to be unlikely. As H0 was very improbable, the jury did not accept the explanation of SIDS occurring twice. However, a crucial comparison was overlooked: the chances of multiple SIDS against the chances of multiple homicide. Only looking at the probability of two SIDS in a row is not informative, as the probability of the alternative explanation, a mother murdering her two infant children, is not taken into account (and, as it happens, this explanation is even less likely). From a Bayesian perspective, what matters is the relative likelihood of both hypotheses. According to this reasoning, the relative plausibility of SIDS versus murder, given the two infant children dying, actually favors SIDS as an explanation by a factor of 9 (see Hill, 2005, for details and assumptions for this calculation).

Despite considerable statistical training, many scientists revert to the habit of misinterpreting the conditional probabilities of the data given a certain hypothesis as the probability that the hypothesis is true. In order to be able to make such claims, statistical inference following a Bayesian framework is necessary (Gigerenzer, 2004). For an additional list of p-value misinterpretations, we refer the interested reader to Badenes-Ribera and colleagues (2015). The goal of the present study is to delve into the reasons why researchers incorrectly endorse (some of) these statements. Could it be that researchers want to be able to make these statements once they obtain a positive result? Are researchers perhaps Bayesians at heart?

In Bayesian statistics, the prior p(H) is combined with the likelihood p(D|H) to arrive at a posterior belief p(H|D). By dividing likelihoods of two rival hypotheses, it is possible to calculate the relative probability of the data under each hypothesis. The ratio of the probability of the data given the alternative hypothesis and the probability of the data given the null hypothesis is called a Bayes factor. The Bayes factor is the Bayesian way of quantifying statistical evidence and it is quite different from p-values. A Bayes factor can quantify how much our observed data shifts the balance of evidence from one hypothesis (e.g., the null hypothesis H0) to another (e.g., the alternative hypothesis H1; for more details see Dienes, 2011):

$\frac{P(H_1|D)}{P(H_0|D)} = \frac{P(D|H_1)}{P(D|H_0)} \times \frac{P(H_1)}{P(H_0)}$

where the quantity on the left is called the posterior odds, the quantity on the right is called the prior odds, and the quantity in the middle is the Bayes factor (BF10). The subscript indicates that the Bayes factor quantifies the relative evidence provided by the data for the alternative hypothesis (H1) over the null hypothesis (H0). The Bayes factor can be thought of as an updating factor: it reflects the change in belief about the relative likelihood of two hypotheses after the data have been observed (see e.g., van Ravenzwaaij & Etz, 2020).
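As a numerical sketch of this updating rule (all numbers below are hypothetical), the posterior odds are simply the prior odds multiplied by the Bayes factor:

```python
def update_odds(prior_odds, bf10):
    # posterior odds = Bayes factor * prior odds
    return bf10 * prior_odds

def odds_to_prob(odds):
    # convert odds in favor of H1 into a posterior probability of H1
    return odds / (1 + odds)

# hypothetical example: skeptical 1-to-4 prior odds for H1, data give BF10 = 19
posterior = update_odds(0.25, 19)   # 4.75
p_h1 = odds_to_prob(posterior)      # about 0.83
```

The same Bayes factor thus yields different posterior probabilities for researchers who start with different prior odds, which is exactly the separation between evidence and belief that the equation expresses.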

The BF is directly interpretable in relation to one’s hypotheses. For example, a BF10 = 19 means that the observed data are 19 times more likely under the alternative hypothesis than under the null hypothesis (Rouder et al., 2009). In addition to being able to directly test one’s hypotheses against each other, there are other advantages of using Bayes factors, such as being able to quantify evidence in favor of the null hypothesis, being able to employ sequential testing without the need to correct for multiple testing, and being able to compare strength of evidence across different studies (van Ravenzwaaij & Wagenmakers, 2019; Wagenmakers et al., 2018).

There are two methods to define one’s prior probability: a subjective and an objective approach (Wagenmakers, 2007). In a subjective Bayesian approach, one incorporates their own beliefs about a possible parameter or hypothesis into the prior probability distribution, possibly informed by prior study results (for more details see Kruschke, 2014). In an objective Bayesian approach, one uses a predefined prior instead. One possibility is to use a default prior distribution that is comparatively uninformative in the sense that it allocates probability density to a wide range of possible parameter values (Rouder et al., 2009).

It has been proposed that current frequentist statistical practice is beset by a mismatch between what it can provide and what researchers desire from it (Morey et al., 2016, but see Lakens, 2019). Combined with the abundance of misinterpretations of frequentist statistics, a Bayesian framework might provide researchers with a more appropriate tool for conducting statistical inference (Gigerenzer, 2004, 2018). In this study, we attempt to find empirical evidence for these claims.

In a first task, we examine whether there is a difference between what researchers think they can conclude from statistical results and what they want to conclude from them. First, we tested whether researchers endorse false statements regarding NHST. Second, we investigated how much researchers would like to be able to make such statements after conducting statistical inference. Demonstrating a discrepancy between what traditional tests do and what researchers want them to do shows that the standard tools for statistical inference do not (completely) match the researchers’ needs. Our study goes one step further and attempts to demonstrate not only the mismatch between what traditional tests do and what researchers want them to do, but also researchers’ awareness of this mismatch. We expect that researchers will score higher on the items indicating they want to be able to draw the conclusions in the six statements than on the items indicating they can draw the conclusions in the six statements.

In a second task, we examined to what extent researchers think they typically incorporate two types of subjective priors, their own beliefs and prior study results, into their statistical analyses, and to what extent they want to incorporate these two types of subjective priors into their statistical analyses. We hypothesize that researchers feel uncomfortable with using their own beliefs because it may lead to different results depending on the person, thus losing the appearance of objectivity.

### Participants

The study was emailed to the corresponding authors of all articles published in 2015, 2016, and 2017 in the following journals:

1. Journal of Experimental Psychology: General

2. Psychological Science

3. Journal of Abnormal Psychology

4. Journal of Consulting and Clinical Psychology

5. Journal of Experimental Social Psychology

6. Journal of Personality and Social Psychology

These journals were chosen to represent a sample of researchers in diverse fields of psychology (experimental, social, neuro-, and clinical), a sampling strategy previously used by Cramer et al. (2016). Participants who did not respond after two weeks received a reminder. After checking for duplicates and invalid email addresses, 1282 unique addresses were left. In total we obtained 117 participants, for a response rate of 9%. Based on our preregistered power calculations, this is sufficient to reliably obtain a Bayes factor higher than 10 for an underlying effect size of 0.5, but falls somewhat short for reliably obtaining a Bayes factor higher than 10 for an underlying effect size of 0.2 (see https://osf.io/r75qd/). Respondents identified as faculty members (55%), graduate students (18%), post-doctoral researchers (12%), external researchers (12%), and other (3%).

### Materials and Procedure

This study made use of a within-subjects design, consisting of two tasks that required answering a multiple-choice questionnaire. After indicating their academic position, participants proceeded with the tasks. The verbatim email text and the questionnaire can be found at https://osf.io/r75qd/.

In the first task, participants were asked to read a hypothetical research scenario and its statistical results (“Please carefully read the text. Afterwards, indicate for each of the statements your confidence that it is true or false. ‘False’ means that the statement does not follow logically from the information above. Also note that several or none of the statements may be correct.

Suppose you have a treatment that you suspect may reduce symptoms of migraine. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further, suppose you use a simple independent means t-test and your result is (t(18) = 2.7, p = 0.01).”).2

Afterwards, participants were asked to indicate their confidence in the truthfulness of six statements, taken from Haller & Krauss (2002; see Table 1), about the observed statistical results on a scale from 1 (confident it is false) to 9 (confident it is true). Moreover, participants were asked to indicate how much they would like to be able to make each statement, on a scale from 1 (not at all interested) to 9 (very interested) (“For all following six statements shown, we would like to ask you to indicate to what extent you would want to be able to draw this kind of conclusion, provided the statistical techniques you use would be suitable to draw this kind of inference. In other words: is this a useful kind of statement for a researcher to make, provided they are able to? If this statement is true or false, does not matter here. Keep in mind that there are no right or wrong answers.”).

There was no missing data and we did not exclude any participant. A visual inspection of the frequency distribution of the collected scores for Task 1 indicated that the scores did not seem to follow a normal distribution: Scores corresponding to confidence ratings about the truthfulness of the statement were heavily skewed to the right for each of the six statements (see Figure 1). This suggests that many researchers were confident that they cannot draw the proposed conclusion. In contrast, scores corresponding to participants’ desire to be able to make each of the six statements were mostly skewed to the left (see Figure 2). This indicates that many researchers did wish to be able to draw these conclusions.

Figure 1. Proportion of scores for confidence ratings about truthfulness (C) for Statements 1 through 6 (Task 1). Red dashed vertical lines indicate mean score proportion.
Figure 2. Proportions of scores for wanting (W) to be able to make Statements 1 through 6 (Task 1). Red dashed vertical lines indicate mean scores.

The histograms for Task 2 show that the data for this task are less strongly skewed (see Figure 3).

Figure 3. Proportion of scores on “Currently Do” and “Wish” questions about incorporating previous study results and personal beliefs in the analysis (Task 2). Red dashed vertical lines indicate mean scores.

Because of the violation of the normality assumption, we deviated from the preregistration document in which we specified we would conduct a Bayesian t-test. Instead, we conducted a non-parametric alternative to the Bayesian t-test: the Bayesian Signed Rank Sum test (van Doorn et al., 2017). Other than this deviation, the analyses were conducted as planned.

To quantify the statistical evidence, we computed Bayes factors. For the first and second task, we conducted a set of two-sided Bayesian Signed Rank Sum tests, each with a default folded Cauchy effect size prior width of r = √2/2 (i.e., 0.707 or “medium”; for details see Rouder et al., 2009). Gibbs sampling (Geman & Geman, 1984; van Ravenzwaaij et al., 2018) was used to sample from the posterior distribution 100,000 times. According to the classification scheme suggested by Jeffreys (1998), we considered a Bayes factor of 10 (in favor of or against the alternative hypothesis) as strong evidence and 1/3 ≤ BF ≤ 3 as inconclusive evidence. In the present context, Bayes factors quantify the likelihood of the data under the two-sided alternative relative to the likelihood of the data under the null hypothesis. Our analysis for Task 1 resulted in six Bayes factors, each of which quantifies whether there is a discrepancy between what respondents think can be concluded and what they would like to be able to conclude.

For the second task, the data is a range of scores from 1 to 9 on two statements which quantify whether respondents typically take into account their own prior beliefs and/or prior study results and whether they would like to take into account their own prior beliefs and/or prior study results. The two resulting Bayes factors quantify evidence for the discrepancy between what respondents typically take into account and what they would like to take into account. Bayes factors pointing towards the null hypothesis are indicative of no discrepancy, whereas Bayes factors pointing towards the alternative hypothesis are indicative of a discrepancy. All planned analyses and the associated R-code can be found in our preregistration document at https://osf.io/r75qd/.

In Task 1, most participants were confident to varying degrees that the statements were false, as indicated by the distributions of scores and means that were well below 5 (see Figure 1). In comparison to the results of Haller & Krauss, performance on interpreting p-values (Statement 5: You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision) appears to have improved. Similar to Haller & Krauss, Statements 4 (mean = 3.17), 5 (mean = 3.81), and 6 (mean = 3.57) were more often rated as correct than Statements 1 (mean = 2.84), 2 (mean = 2.43), and 3 (mean = 2.79). This is in line with Haller & Krauss, who found Statements 1-3 to be answered mostly correctly (10-34% incorrect among scientific psychologists and psychology students) in comparison to Statement 4 (33-59% incorrect) and Statement 5 (67-78% incorrect). We see this trend in our data as well, which suggests that Statements 1-3 might be inherently easier to flag as incorrect than the later statements about the probability of H1 (Statement 4) or the probability of Type I errors (Statement 5). In line with our hypothesis that there is a difference between what researchers think they can conclude from statistical results and what they want to conclude from them, the means for Wish statements are generally higher than those for Can statements. Note that the comparison between our study and Haller & Krauss’s study can only be made indirectly: whereas they looked at proportions of incorrect answers, we looked at averages across the entire scale.

Bayes factors for the first task can be found in Table 2. All six Bayes factors indicate overwhelming evidence in favor of there being a difference between what researchers think they can conclude and what researchers want to be able to conclude. Inspection of the means indicates that scores on what researchers want to be able to conclude are higher than scores on what they think they can conclude for all six statements. In addition to the individual BFs, we also calculated a meta-analytic Bayes factor (Rouder & Morey, 2011), resulting in an overwhelming BF10 = 3.84e+89.

Table 2. BF10 for Task 1.

| Test (X vs. Y) | BF10 |
| --- | --- |
| W1 vs. C1 | 2.84 × 10^6 |
| W2 vs. C2 | 3.28 × 10^6 |
| W3 vs. C3 | 2.37 × 10^6 |
| W4 vs. C4 | 1.20 × 10^5 |
| W5 vs. C5 | 6.00 × 10^5 |
| W6 vs. C6 | 1.18 × 10^7 |

Bayes factors for the second task can be found in Table 3. The first Bayes factor indicates overwhelming evidence in favor of there being a difference in the extent to which people take into account prior study results and the extent to which they wish to do so. The second Bayes factor indicates inconclusive evidence for a difference in the extent to which people take into account a priori beliefs and the extent to which they want to be able to. Inspection of the means indicates that scores on what researchers want to be able to do are higher than scores on what they typically do in both cases.

Table 3. BF10 for Task 2.

| Test | BF10 |
| --- | --- |
| Wish Study vs. Presently Do Study | 4.75 × 10^4 |
| Wish Belief vs. Presently Do Belief | 2.58 |

NHST is without a doubt the most common variant of hypothesis testing used in a majority of quantitative scientific disciplines (Goodman, 1999; Hoekstra et al., 2006; McCloskey & Ziliak, 1996; Rucci & Tweney, 1980). Despite considerable statistical training, many scientists misinterpret the outcomes of NHST (Falk & Greenbaum, 1995; Haller & Krauss, 2002; Lyu et al., 2020; Oaks, 1986). It has been proposed that these common misinterpretations might emerge because there is a discrepancy between what researchers can conclude and what they wish to conclude from their statistical analyses (Gigerenzer, 2004; Morey et al., 2016). The results of the current study suggest that this proposed discrepancy exists. However, the results show mixed evidence for the notion that a Bayesian framework better describes how researchers want to use inferential statistics.

In Task 1, participants were presented with a research scenario investigating a treatment that may reduce symptoms of migraine, including statistically significant results from a simple independent means t-test. On average, participants indicated that they wish to make each of the six statements, even though they realize they are not able to under NHST. Specifically, participants wish to “absolutely disprove the null hypothesis” (Statement 1) and wish to find the probability of the null hypothesis being true (Statement 2); they wish to absolutely prove the experimental hypothesis (Statement 3) and find the probability of the experimental hypothesis being true (Statement 4). Finally, participants wish to know, if they decide to reject the null hypothesis, the probability of making a wrong decision (Statement 5), and they wish to have a reliable experimental finding in the sense that if the experiment were repeated a great number of times, the obtained statistical quantity would inform the number of times a significant result would be obtained (Statement 6).

So would Bayesian statistics offer what researchers seem to want, based on the results of our study? Bayes factors quantify the relative likelihood of the data under one hypothesis (e.g., the null hypothesis) versus another (e.g., the alternative hypothesis). As such, one can quantify evidence in favor of the alternative or in favor of the null hypothesis. This cannot be done within frequentist NHST, as only the null hypothesis is explicitly taken into consideration, making it impossible to assign a probability to any hypothesis. Bayesian inference allows for some, but not all, of the previously presented conclusions to be drawn, given a prior belief. Specifically, Bayes factors enable drawing the conclusions from Statement 2 (we can find the probability of the null hypothesis being true, given a prior belief) and Statement 4 (we can find the probability of the alternative hypothesis, given a prior belief). When used for making decisions (see e.g., Aczel et al., 2020), Bayes factors can be used to make Statement 5 (we can find the probability of making a wrong decision when rejecting the null hypothesis, given a prior belief).

Although the conclusions from Statement 1 and Statement 3 cannot be drawn under either a frequentist or a Bayesian framework (we can never disprove the null or alternative hypothesis absolutely), we are able to assign a concrete number to the relative evidence in favor of one hypothesis over another provided by the data. As a result, a researcher is at liberty to decide that a relative likelihood of 1000 to 1 constitutes compelling evidence for one hypothesis over another (possibly influenced by their prior belief in the plausibility of both hypotheses) and collect data until the Bayes factor is either 1000 or 1/1000. Thus, contrary to NHST, Bayesian inference does allow one to draw conclusions that approximate those drawn in Statements 1 and 3.

Statement 6 is not applicable to a Bayesian framework, as significant results pertain to NHST specifically. However, if a researcher were to conduct a great number of experiments, and in each of these continuously sampled evidence until a BF10 of 10 or 1/10 was reached, then they would hit BF10 = 10 ten times as often as BF10 = 1/10 if the alternative hypothesis were true, and they would hit BF10 = 1/10 ten times as often as BF10 = 10 if the null hypothesis were true (Rouder, 2014; Tendeiro et al., 2019). If they were to do so until hitting a BF10 of 100 or 1/100 instead, they would hit BF10 = 100 a hundred times as often as BF10 = 1/100 if the alternative hypothesis were true, and they would hit BF10 = 1/100 a hundred times as often as BF10 = 100 if the null hypothesis were true. As such, Bayes factors allow one to draw conclusions about the probability of drawing the wrong conclusion in the long run based on the strength of evidence. Put simply, given a great number of experiments, there is a link between the size of the BF and the expected proportion of BFs that provide evidence in favor of the correct model. However, a single Bayes factor does not allow for predicting, say, the size of the Bayes factor in a replication attempt.
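This long-run property can be checked by simulation. The sketch below uses the simplest case, two point hypotheses, where the sequential likelihood ratio is the Bayes factor; the hypotheses, effect size, and thresholds are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2024)

def sample_to_threshold(true_mu, mu1=0.5, bound=10.0):
    """Draw unit-variance normal data until the Bayes factor (here the
    likelihood ratio for the point hypotheses H1: mu = mu1 vs H0: mu = 0)
    reaches bound or 1/bound; return True if the upper bound was hit."""
    log_lr, log_bound = 0.0, np.log(bound)
    while abs(log_lr) < log_bound:
        x = rng.normal(true_mu, 1.0)
        # log-likelihood-ratio increment for N(mu1, 1) versus N(0, 1)
        log_lr += mu1 * x - mu1**2 / 2
    return log_lr >= log_bound

# Under H0 (true_mu = 0), BF10 = 1/10 should be reached roughly ten
# times as often as BF10 = 10.
runs = 500
upper_hits = sum(sample_to_threshold(0.0) for _ in range(runs))
lower_hits = runs - upper_hits
```

With these settings, the lower threshold is hit far more often than the upper one when the null hypothesis is true, which is the calibration property described above (slight deviations from exactly 10:1 arise because the log-likelihood ratio can overshoot the boundary).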

The misfit between what researchers can conclude versus what they want to conclude from a frequentist test might explain the common misinterpretation of frequentist NHST in line with a more intuitive Bayesian interpretation (Gigerenzer, 2004; Haller & Krauss, 2002; Lyu et al., 2020; Oaks, 1986). The results of Task 1 indicate that, indeed, researchers wish to make inferences in line with a Bayesian framework, such as finding the probability of the null and alternative hypothesis given the data and prior belief.

In a second task, participants were presented with a hypothetical scenario about the effectiveness of a new drug against depression. We asked participants whether they would like to be able to incorporate previous findings, and their own personal beliefs into their statistical analyses. Our results indicate that many researchers feel partial to being able to specify their own prior distribution. On the one hand, participants seem to want the ability to incorporate prior study results into their analysis more than they are presently able to. On the other hand, our results are ambiguous with respect to a mismatch between researchers wanting to incorporate their own personal beliefs into their statistical inference versus their ability to do so, which suggests that they might be more comfortable with a subjective prior based on previous research or an objective prior.

Whether this is reflective of an actual preference, or of proximity to the techniques researchers are currently using, is an open empirical question. It might well be that incorporating one’s own beliefs into study outcomes is seen as problematic, given that objectivity is often propagated as a scientific virtue. We would like to point out, though, that subjective beliefs may well be informed by knowledge about previous study results, so the distinction between the two may not be as clear-cut.

Researchers’ potential hesitance to take their subjective beliefs into account in the analysis might be one of the reasons why Bayesian statistics are not used instead of NHST as often as one might expect, given the amount of attention paid to such techniques. The most common Bayesian way of thinking about probability is to define the probability of an event as the degree of belief that we assign to the truth of that event. Thus, probabilities do not exist in the world, but rather in the thoughts and assumptions of the researcher (Navarro, 2015), and are hence necessarily subjective. Our study indicates that this Bayesian definition of probability might be less appealing than the frequentist definition of probability, which in the context of test statistics is the average performance for an infinite repetition of hypothetical experiments. However, we should be cautious about concluding too much from these findings alone. Importantly, this result suggests that the exact way we arrive at a prior distribution is a crucial factor in whether researchers accept the Bayesian notion of subjectivity (i.e., if the prior distribution is well founded on previous studies, it is accepted).

In our study, we could not replicate the finding that a majority of psychological researchers endorses false statements about p-values (Falk & Greenbaum, 1995; Haller & Krauss, 2002; Hoekstra et al., 2014; Oakes, 1986). One reason for this might be increased awareness of the misuse of statistics since the term p-hacking was coined (Simonsohn et al., 2014). Given our low response rate (9%), we should be careful about the conclusions we draw from our study. Possibly, those who did participate were on average more knowledgeable about statistical inference than those who did not. If true, our participants might be more likely to know about the limitations of NHST and might feel more comfortable with Bayesian statistics, although neither the invitation email nor the survey mentioned Bayesian statistics.

We asked participants about their willingness to include prior beliefs or prior studies in the context of a study analyzing the effect of a new drug against depression. This result might be limited to a clinical context, in which participants may be more careful about including their prior beliefs (as people’s well-being is at stake), so we must be careful about generalizing the outcome to other fields. Our participants had diverse backgrounds (experimental, social, neuro-, and clinical psychologists), and non-clinical researchers might have been hesitant to include prior beliefs in a clinical research context simply because they did not feel qualified to do so. Because we did not ask participants to indicate their field of expertise, future studies will need to examine whether comfort with including prior beliefs varies with a researcher’s expertise in the domain of the research question. We also did not ask participants what kind of statistical inference they typically employ in their own work. As such, the results of the present study do not allow linking participants’ responses to their own practices, but we believe this to be a fruitful avenue for future studies.

Finally, future studies should include open-ended response options. It would be interesting to learn from researchers under which circumstances they would like to incorporate a prior and what information they would like to base such a prior on.

### General conclusion

Our study suggests that there is a gap between what researchers would like to conclude and what they can conclude from their statistical analyses. Researchers seem interested in making inferences in line with a Bayesian framework, such as finding the probability of the null and alternative hypotheses given the data and prior. The difference between what researchers can and want to conclude might be one explanation for the previously found misinterpretations of statistical results (Haller & Krauss, 2002; Lyu et al., 2020). However, researchers seem ambivalent about what information to include in their prior distribution. On the one hand, we did not find compelling evidence that researchers wish to take their own prior beliefs into account when analyzing their data. On the other hand, we did find overwhelming evidence that researchers are interested in including information from previous studies in their analyses. This suggests that the source from which the prior distribution is derived (i.e., outcomes of previous studies) is a crucial factor in whether researchers accept the Bayesian notion of subjective probability.

Our results should be interpreted with a great deal of caution. First, the response rate was low (9%), and there is no way to tell whether the resulting sample is still representative of researchers. The question of what researchers want to conclude from their statistical analyses is complex to answer, as it depends, among other things, on the field of study, their statistical education, and the specific research question. The questionnaires we used can only give some indication of what people really want, so more detailed questionnaires or qualitative studies are needed.

Allowing for these caveats, our study permits two conclusions: (1) there was a clear difference between what researchers think can be concluded and what they would like to be able to conclude from the fictional results paragraph presented in our experiment, and (2) researchers seem to want to be able to incorporate prior results, but not personal beliefs, into statistical inference. Thus, our study provides modest support for the notion that researchers sympathize with some elements native to the philosophy of objective Bayesianism. Moreover, a subjective Bayesian approach is deemed more acceptable if the prior distribution is built upon previous research.

MH drafted the manuscript, prepared the material, designed the study and analyzed the data; JM prepared the material, designed the study, gathered the data and analyzed the data; RH drafted the manuscript and designed the study; DR drafted the manuscript and designed the study.

The authors declare no competing interests. Don van Ravenzwaaij is a Section Editor at Collabra: Psychology. He was not involved in the peer review of this article.

All the stimuli, participant data, preregistration and analysis script can be found on the paper's project page on the open science framework: https://osf.io/r75qd/.

1.

A p-value quantifies the probability of obtaining the data at hand or more extreme data given that the null hypothesis is true. Statements 1 and 3 are false, because nothing can be proved absolutely. Statements 2 and 4 are false, because p-values do not speak to the probability of either the null or the alternative hypothesis being true. Statement 5 is false, because the probability of making a wrong decision would be the probability of the null hypothesis being true (Statement 2). Statement 6 is false, because reliably obtaining qualitatively similar results would imply the p-value relates to the probability of the alternative being true, which is not the case.
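As a concrete illustration of the definition above, the following sketch (with made-up numbers, not the stimulus used in the study) computes a two-sided p-value directly as a tail probability of the null t-distribution.

```python
from scipy import stats

# A p-value is the probability, under the null hypothesis, of
# observing a test statistic at least as extreme as the one
# obtained. Illustrative numbers only.
t_obs, df = 2.1, 38

# Two-sided p-value for a t-statistic: twice the upper tail area
# of the t-distribution with the given degrees of freedom.
p_value = 2 * stats.t.sf(t_obs, df)
print(f"p = {p_value:.3f}")
```

Note what this number is and is not: it conditions on the null hypothesis being true, so it cannot by itself give the probability that the null (or the alternative) hypothesis is true, which is precisely the misreading the false statements invite.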

2.

Note that due to experimenter error, the df reads 18 instead of 38. The discrepancy was commented on by a single participant. As both values for the df lead to a p-value of 0.01, the critical quantity for purposes of this study, we do not think this error has affected the study outcomes. Interestingly, this error was also made in the original study by Haller & Krauss (2002) and has since been repeated by several authors (van Ravenzwaaij et al., 2019).

Aczel, B., Hoekstra, R., Gelman, A., Wagenmakers, E.-J., Klugkist, I. G., Rouder, J. N., Vandekerckhove, J., Lee, M. D., Morey, R. D., Vanpaemel, W., Dienes, Z., & van Ravenzwaaij, D. (2020). Discussion Points for Bayesian Inference. Nature Human Behaviour, 4(6), 561–563. https://doi.org/10.1038/s41562-019-0807-z
Badenes-Ribera, L., Frías-Navarro, D., Monterde-i-Bort, H., & Pascual-Soler, M. (2015). Interpretation of the p value: A national survey study in academic psychologists from Spain. Psicothema, 27(3), 290–295.
Briggs, W. M. (2012). It is time to stop teaching frequentism to non-statisticians (arXiv:1201.2590). arXiv preprint.
Chavalarias, D., Wallach, J. D., Li, A. H. T., & Ioannidis, J. P. A. (2016). Evolution of reporting P values in the biomedical literature, 1990-2015. JAMA, 315(11), 1141–1148. https://doi.org/10.1001/jama.2016.1952
Cramer, A. O. J., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., Waldorp, L. J., & Wagenmakers, E.-J. (2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review, 23(2), 640–647. https://doi.org/10.3758/s13423-015-0913-5
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6(3), 274–290. https://doi.org/10.1177/1745691611406920
Etz, A., & Vandekerckhove, J. (2018). Introduction to Bayesian inference for psychology. Psychonomic Bulletin & Review, 25(1), 5–34. https://doi.org/10.3758/s13423-017-1262-3
Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory & Psychology, 5(1), 75–98. https://doi.org/10.1177/0959354395051004
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721–741. https://doi.org/10.1109/tpami.1984.4767596
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033
Gigerenzer, G. (2018). Statistical rituals: The replication delusion and how we got there. Advances in Methods and Practices in Psychological Science, 1(2), 198–218. https://doi.org/10.1177/2515245918771329
Goodman, S. N. (1999). Toward evidence-based medical statistics. 1: The P value fallacy. Annals of Internal Medicine, 130(12), 995–1004.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3
Haller, H., & Krauss, S. (2002). Misinterpretations of significance: A problem students share with their teachers. Methods of Psychological Research, 7(1), 1–20.
Hill, R. (2005). Reflections on the cot death cases. Significance, 2(1), 13–16. https://doi.org/10.1111/j.1740-9713.2005.00077.x
Hoekstra, R., Finch, S., Kiers, H. A. L., & Johnson, A. (2006). Probability as certainty: Dichotomous thinking and the misuse of p values. Psychonomic Bulletin & Review, 13(6), 1033–1037. https://doi.org/10.3758/bf03213921
Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157–1164. https://doi.org/10.3758/s13423-013-0572-3
Jeffreys, H. (1998). The theory of probability. OUP Oxford.
Kruschke, J. K. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press. https://doi.org/10.1016/b978-0-12-405888-0.00008-8
Lakens, D. (2019). The practical alternative to the p-value is the correctly used p-value. https://doi.org/10.31234/osf.io/shm8v
Lecoutre, M.-P., Poitevineau, J., & Lecoutre, B. (2003). Even statisticians are not immune to misinterpretations of Null Hypothesis Significance Tests. International Journal of Psychology, 38(1), 37–45. https://doi.org/10.1080/00207590244000250
Lyu, X.-K., Xu, Y., Zhao, X.-F., Zuo, X.-N., & Hu, C.-P. (2020). Beyond psychology: Prevalence of p value and confidence interval misinterpretation across different fields. Journal of Pacific Rim Psychology, 14. https://doi.org/10.1017/prp.2019.28
McCloskey, D. N., & Ziliak, S. T. (1996). The standard error of regressions. Journal of Economic Literature, 34(1), 97–114.
Morey, R. D., Romeijn, J.-W., & Rouder, J. N. (2016). The philosophy of Bayes factors and the quantification of statistical evidence. Journal of Mathematical Psychology, 72, 6–18. https://doi.org/10.1016/j.jmp.2015.11.001
Navarro, D. (2015). Learning statistics with R. Lulu.com.
Nunnally, J. C., Durham, R. L., Lemond, L. C., & Wilson, W. H. (1975). Introduction to statistics for psychology and education. McGraw-Hill Book.
Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. Wiley.
Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review, 21(2), 301–308. https://doi.org/10.3758/s13423-014-0595-4
Rouder, J. N., & Morey, R. D. (2011). A Bayes factor meta-analysis of Bem’s ESP claim. Psychonomic Bulletin & Review, 18(4), 682–689. https://doi.org/10.3758/s13423-011-0088-7
Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E.-J. (2016). Is there a free lunch in inference? Topics in Cognitive Science, 8(3), 520–547. https://doi.org/10.1111/tops.12214
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225–237. https://doi.org/10.3758/pbr.16.2.225
Rucci, A. J., & Tweney, R. D. (1980). Analysis of variance and the “second discipline” of scientific psychology: A historical account. Psychological Bulletin, 87(1), 166–184. https://doi.org/10.1037/0033-2909.87.1.166
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547. https://doi.org/10.1037/a0033242
Tendeiro, J., Kiers, H., & van Ravenzwaaij, D. (2019). Mathematical Evidence for the Adequacy of Bayesian Optional Stopping. https://doi.org/10.31234/osf.io/9t2e7
van Doorn, J., Ly, A., Marsman, M., & Wagenmakers, E.-J. (2017). Bayesian Latent-Normal Inference for the Rank Sum Test, the Signed Rank Test, and Spearman’s ρ. arXiv.
van Ravenzwaaij, D., Albers, C., Derksen, M., & Hoekstra, R. (2019). Citing is easy, reading is hard. Mindwise. https://mindwise-groningen.nl/citing-is-easy-reading-is-hard/
van Ravenzwaaij, D., Cassey, P., & Brown, S. D. (2018). A simple introduction to Markov Chain Monte-Carlo sampling. Psychonomic Bulletin & Review, 25(1), 143–154. https://doi.org/10.3758/s13423-016-1015-8
van Ravenzwaaij, D., & Etz, A. (2020). Simulation Studies as a Tool to Understand Bayes Factors. https://doi.org/10.31234/osf.io/27ndb
van Ravenzwaaij, D., & Wagenmakers, E.-J. (2019). Advantages Masquerading as ‘Issues’ in Bayesian Hypothesis Testing: A Commentary on Tendeiro and Kiers (2019). https://doi.org/10.31234/osf.io/nf7rp
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/bf03194105
Wagenmakers, E.-J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Love, J., … Epskamp, S. (2018). Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. Psychonomic Bulletin & Review, 25(1), 35–57.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi. Journal of Personality and Social Psychology, 100(3), 426–432. https://doi.org/10.1037/a0022790
This is an open access article distributed under the terms of the Creative Commons Attribution License (4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.