If cognitive biases are universal and consistent across domains, they should be considered when planning for real-life decision-making. Yet results are mixed when biases are explored in medical settings. Further, most studies have focused on the effect of a single bias on a decision choice, rather than on how various biases may interact in a complex decision-making process. We performed three preregistered experiments in which trained medical students (∑N = 224) read hypothetical mental health patient descriptions. Participants made initial diagnoses, explored follow-up information and could adjust their diagnoses. We tested whether there was an anchoring bias in which the first-presented symptoms had a larger impact on the decision, whether there was a confirmation bias in the selection of follow-up requests, and whether confidence increased during the decision process. We found that confidence increased for participants who did not change their decision or seek disconfirming information. Two of the experiments indicated a confirmation bias in the selection of follow-up requests. There was no indication that confirmation bias increased when confidence was high, and no support for the order of symptom presentation leading to an anchoring bias. We conclude that the biases may be difficult to demonstrate for complex decision-making processes in applied medical settings.
1.1. Bias in decision-making
Decision-making is a complex mental process which is vulnerable to errors. Errors in diagnostic decisions may have grave consequences, as mistaken treatment directly impacts the health and well-being of patients and indirectly impacts the efficiency of health-care systems. Studies have indicated (Berner & Graber, 2008; Graber, 2013) that medical professionals make the wrong diagnosis in about 10-15% of cases. It may be difficult to establish what constitutes a diagnostic error (Zwaan & Singh, 2015). Mental health diagnoses may in particular be “wicked problems” (Hannigan & Coffey, 2011; Hogarth et al., 2015), as the problem space is poorly defined, treatments give ambivalent feedback that provides poor grounds for learning, and ground truth may be difficult to establish (Hogarth et al., 2015).
Cognitive psychology may be relevant for understanding and preventing errors caused by the clinician’s thinking process (Blumenthal-Barby & Krieger, 2015; Featherston et al., 2020; Saposnik et al., 2016; Whelehan et al., 2020). A taxonomy of medical errors has been developed that distinguishes between no-fault errors, systematic errors and cognitive errors (Graber et al., 2002). The latter category has been found to be most frequent (Croskerry, 2009; Graber et al., 2005; Graber & Carlson, 2011). This covers faults in data gathering, knowledge, reasoning, and verification. Errors in reasoning due to cognitive biases are an important subcategory.
Heuristics are mental shortcuts that allow clinicians to make quick decisions. These tend to be sufficiently accurate when thorough analyses of all possibilities and probabilities are inexpedient or simply impossible (Gigerenzer, 2008; Graber et al., 2002; Ludolph & Schulz, 2018; Todd & Gigerenzer, 2001). Such shortcuts can be necessary and useful in dynamic medical settings. However, heuristics entail systematic biases that may lead to misinterpretation and oversimplification of information and may limit the search for and evaluation of information (Crowley et al., 2013; Tversky & Kahneman, 1974).
1.2. Bias in medical diagnosis
Heuristics in diagnostic reasoning may be understood within dual process theories (Morewedge & Kahneman, 2010; Pelaccia et al., 2011). This implies that a fast and automatic “system 1” processing of diagnostic information is subject to a number of biases, while a more effortful, analytic “system 2” reasoning is needed to avoid them (see Croskerry, 2009; Payne, 2011; van den Berge & Mamede, 2013). Some have argued that system 2 reasoning also can contribute to errors (Norman et al., 2017). Overlearning, dysrationalia override, fatigue, distractions and time constraints may lead to “system 1” being dominant for clinicians (see also Croskerry, 2009; Norman, 2009; Norman & Eva, 2010).
While the degree of uncertainty and errors in diagnostic decisions may vary across medical fields, errors may occur within any specialty (Croskerry, 2003). Decades of research have contributed to understanding how heuristics can lead to errors, and may provide information on how the errors may be reduced. There have been attempts at “debiasing” medical decision-making (for reviews, see Ludolph & Schulz, 2018; Thompson et al., 2023). There is sparse research on the use of heuristics in mental health diagnosis, but some studies have indicated some of the same phenomena as in general decision-making. Among the biases that have been found to influence decision-making in medical research are anchoring, confirmation bias and excessive confidence.
1.2.1. Anchoring bias
Research on decision making (Curley et al., 1988; Lund, 1925) has shown that the information presented first tends to have a larger influence than information presented later. Tversky and Kahneman (1974) identified anchoring bias as a mechanism where a belief is established based on initial information, and that belief is not adequately adjusted by subsequent information. Anchoring may be caused by serial-position effects (Ebbinghaus, 1913), where the primacy of the information presented first gives it additional emphasis. The belief-updating model (Hogarth & Einhorn, 1992) has been suggested to account for various order effects.
Crowley and colleagues (2013) argued that anchoring could influence medical diagnoses if the clinician “locks onto” salient symptoms presented early in the diagnostic process, leading to an initial diagnosis that influences the final diagnosis. Studies have shown “first impressions” to have a large influence on the final diagnostic decision (Bösner et al., 2014; see e.g., Kostopoulou et al., 2017). Similar studies of anchoring in mental health diagnosis have shown inconsistent findings (Ellis et al., 1990; Friedlander & Stockman, 1983; Richards & Wierzbicki, 1990). Nevertheless, several studies where participants assess written descriptions of mental health cases have shown that diagnoses tend to be compatible with the first-presented symptoms (Cunnington et al., 1997; Richards & Wierzbicki, 1990). For instance, Parmley (2006, p. 84) manipulated the order of symptom presentation in case descriptions and predicted that participants would “fail to alter their initial diagnosis even when new information presented at time two has disconfirming evidence”; about a third of the clinicians indeed failed to alter their initial diagnosis when presented with disconfirming evidence.
1.2.2. Confirmation bias
We tend to seek or interpret information in ways that can corroborate or support our current beliefs, expectations or assumptions. Information that is inconsistent with our beliefs may be ignored or toned down. The more general mechanism has been called a congruence bias (Baron et al., 1988), while the more specific term confirmation bias (Nickerson, 1998) is often used to describe situations where new information is interpreted as supporting one’s current beliefs. The two terms have been used interchangeably in much of the empirical literature (Beattie & Baron, 1988; Xiao et al., 2021). In a diagnostic setting, such a mechanism could lead to closing the exploratory phase prematurely: accepting a diagnosis before it has been fully verified, or neglecting plausible alternatives (Croskerry, 2002, 2009; Eva, 2001; Parmley, 2006). Studies of mental health diagnoses have shown confirmation bias in selecting which information to gather (Martin, 2000; Mendel et al., 2011) and in interpreting information (Croskerry, 2002; Eva, 2001; Oskamp, 1965; Parmley, 2006).
Most studies of confirmation bias in diagnostic decision-making start out by indicating an incorrect diagnosis and examine whether participants are able to switch to the correct diagnosis when provided with additional information (Mendel et al., 2011). While this approach may make it easier to establish a confirmation bias, an approach where participants are allowed to arrive at an initial diagnosis based on ambiguous information may have higher ecological validity. Such an approach can also be used to examine how congruence influences the ongoing decision process.
1.2.3. Overconfidence
In addition to evaluating the evidence, clinicians also need meta-cognition about how certain they are about their evaluation (Moore & Healy, 2008). Confidence in a decision may be affected by the perceived qualitative and quantitative aspects of available information, and by the existence of plausible alternatives (see Eva, 2001; Martin, 2000; Oskamp, 1965). Croskerry (2002) defined overconfidence bias as a universal tendency to believe that we know more than we do, or to place too much faith in opinions rather than evidence. Overconfidence may lead to diagnostic errors, for instance by leading to applying unsuitable heuristics (Berner & Graber, 2008). The phenomenon may vary across cultures (Yates, 2010). It may be difficult to say when the confidence in a decision is “excessive” or “unfounded”. In the current approach we operationalize “overconfidence” as an increase in confidence when no further clarifying information is provided.
1.2.4. Interaction of biases in applied decision-making
Confirmation bias may be closely associated with the anchoring effect, and these two mechanisms may interact to compound errors (Croskerry, 2002, 2003). First, clinicians may “lock onto” salient symptoms early in the diagnostic process, which leads them towards a preliminary diagnosis (anchoring). Subsequent processes of seeking and interpreting additional information may be biased towards this initial hypothesis, while alternative plausible explanations may be ignored (confirmation).
Some studies (Martin, 2000; Oskamp, 1965) have indicated that confirmation bias may interact with the degree of confidence to influence how inconclusive symptom information is interpreted in diagnostic decisions.
Croskerry (2002) specified that overconfidence may be augmented by anchoring. “Locking onto” salient information early in an encounter with a patient may make the clinician confident that this information is particularly important. In turn this may affect the formation and rigidness of a preliminary diagnosis. Clinicians who feel confident about their diagnostic decision may be more biased in their search and interpretation of additional information (see Martin, 2000; Oskamp, 1965). Confidence may thus influence a clinician’s decision process, and may act as both an effect and a cause for cognitive biases.
1.3. Research gap
There is a substantial empirical basis for cognitive biases like anchoring, confirmation and overconfidence. Given that they have been argued to be relevant for diagnostic decisions (Graber et al., 2002), it is surprising that results have been inconsistent when testing how biases influence mental health diagnostics (Ellis et al., 1990; Friedlander & Stockman, 1983; Oskamp, 1965; Parmley, 2006; Richards & Wierzbicki, 1990). It remains unclear how biases contribute to diagnostic errors, and there are few empirical studies addressing this (Norman et al., 2017).
It has been argued (Saposnik et al., 2016) that most studies on the topic are of low quality. There is a need for experiments that control for some of the confounding factors from previous studies, such as the balance between the severity of the cases and the amount of information provided simultaneously (Richards & Wierzbicki, 1990). Further, few studies have examined how the biases relate to both the seeking and the interpretation of information that affect diagnostic decision processes (with some exceptions, see Martin, 2000; Mendel et al., 2011).
There appears to be no controlled investigation of how a diagnostic decision process is affected by the co-occurrence of cognitive biases. It would be of value to combine the testing of both anchoring and confirmation bias in one experimental design, and to model decision-making as a sequential process where information is gathered and confidence may vary over time.
Finally, as only a minority of studies use medically trained personnel (Blumenthal-Barby & Krieger, 2015), it would be of value to establish similar effects in a reasonably realistic problem field in which the participants had relevant training. Relatedly, it has been identified as a gap in the literature that few studies use clinical guidelines to determine diagnostic or treatment errors (Saposnik et al., 2016).
1.4. Current study
1.4.1. Aims for the current study
The current study seeks to develop an experimental paradigm for testing the interaction of anchoring, confirmation and confidence on seeking information, evaluating information, and making diagnostic decisions. To achieve this, we designed a basic experimental procedure that measures information gathering, decision choices, and confidence in those choices. The experiment can measure (1) whether participants prefer diagnoses that match the first-presented symptoms, (2) whether participants seek information to corroborate the assumption they currently hold, (3) whether levels of confidence predict corroborating information seeking, and (4) whether diagnosis and confidence change across the diagnostic process.
1.4.2. Hypotheses
Based on previous research on anchoring, we expected that the order in which symptoms are presented will affect the choice of a preliminary diagnosis. Our first hypothesis was thus that (H1) participants will be more likely to select the preliminary diagnosis congruent with the first-presented symptoms in a case vignette, rather than selecting the diagnosis congruent with the later-presented symptoms.
Based on previous research on congruence, we expected that participants would primarily seek out information that appeared to confirm their existing diagnostic beliefs. Our second hypothesis was thus that (H2) participants would request more information related to the diagnosis they already preferred, rather than seeking out information that may support an alternate diagnosis.
Furthermore, we expected such confirmatory styles of information gathering to correspond to higher levels of confidence in one’s existing diagnostic beliefs. The third hypothesis was thus that (H3) more confirmatory information would be requested when confidence is higher.
We expected that those who do not change their mind or explore other alternatives would feel more certain about their diagnostic decision. The final hypothesis was thus that (H4) participants who prefer the same diagnosis throughout the case exploration and only request confirming information should increase in confidence between their first and final diagnostic decision.1
1.4.3. Procedure and preregistration
All four hypotheses were explored in three experiments. Hypotheses remained unchanged, and were preregistered ahead of each experiment’s data collection. The experiment procedure was tested in classroom experiments in order to efficiently collect data from participants with medical training. The experiments were run consecutively, each building on the analysis of the preceding experiments to allow for iterative improvements in experiment design. This allowed us to better control for competing explanations and to further investigate earlier findings. It also led to removing some details in the case descriptions and making a few changes in the materials for Experiments 2 and 3 to make the manipulation more effective (see details below). See online materials for preregistrations and experiment details (https://osf.io/dn4rv/).
2. Methods
2.1. Participants
Participants were advanced medical students with extensive education in somatic and mental health issues, recruited from a university hospital in Norway. The three experiments recruited participants from three different lectures. Most of the students present in class participated, constituting 71, 56 and 91 students, respectively. In accordance with the preregistration, Experiment 3 was completed in two separate data collection sessions, as the initial session provided a lower turnout than required.
Demographic variables were not collected in any of the experiments to preserve a sense of anonymity. However, we judged the student population from which the sample was drawn to be predominantly female (about 75%), and in their mid-twenties. A lottery was conducted immediately after each data collection, in which about 5% of the participants won a gift card for a lunch meal at a campus cafe.
2.2. Experiment overview
Each experiment was conducted in an auditorium during a break in the lecture that the participants attended. The lecturer introduced the experimenters to the classes, and described the project as an investigation of decision-making under uncertainty. The experimenters informed the students that participation was voluntary and anonymous, and that they could withdraw from the study at any time without consequences. As the experiment came in the form of an online survey, participation was possible through laptop computers, tablets and smartphones. No personal information such as names or email addresses was collected. The online questionnaires (in Google Forms) were accessed through the university internet connection, causing the IP address to be the same for all participants.
All three experiments had the same overall structure, where two fictitious patient cases were evaluated (see Table 1 for an overview of the steps in responding to each of the two cases in each experiment). Completion of the experiment took about 10 minutes.
Table 1. Overview of the experiment procedure steps for each of the two cases.
Step | Experiment procedure step |
1. | Introduction and description of the experiment procedure |
2. | Introduction of the two diagnostic options (A and B), and presentation of the ICD-10 diagnosis criteria for both options |
3. | First half of initial case description, randomized to present symptoms either in support of A or B |
4. | T1a choice (Experiment 3 only): State confidence in one of the two diagnoses: A or B |
5. | Second half of initial case description, presenting the symptom information absent from step 3 (supporting either B or A) |
6. | T1b choice: State confidence in one of the two diagnoses: A or B |
7. | T1 request: Select one of four options to further explore one of the diagnoses: A1, A2, B1, or B2 |
8. | T2 choice: State confidence in one of the two diagnoses: A or B |
9. | T2 request: Select one of four options to further explore one of the diagnoses: A1, A2, B1, or B2 |
10. | T3 choice: State final confidence in one of the two diagnoses: A or B |
11. | End of the first case, repeat step 2-10 for the second case where A and B are replaced with C and D |
12. | Four follow-up questions about what the participant assumed the research hypotheses to be (in Experiment 3 only) |
13. | Short oral debrief at the end of experiment |
2.3. Materials and experiment procedure
We developed text descriptions of two hypothetical patient cases. Both featured symptom information that could be read to support one of two mental health diagnoses as listed in the ICD-10 diagnosis manual. The cases were developed similarly to those used by Parmley (2006), though considerably shorter. A clinical psychologist verified that the materials were clinically relevant for mental health diagnosis. A medical doctor in charge of the university’s medical training evaluated the experiment procedure as a relevant test of the diagnostic approach. The first case presented a choice between (A) dementia and (B) depressive episode. The second case presented a choice between (C) bipolar mood disorder and (D) borderline personality disorder. The participants were instructed to base their decisions on the ICD-10 criteria, rather than on any prior knowledge they had about the diagnoses or assumptions about epidemiology.
Except for a few minor changes, the material remained unchanged for all three experiments, and all participants received the same cases in the same order (see Table 1). The two diagnostic options and the corresponding ICD criteria were presented before each of the case descriptions (step 2 in Table 1). Each case consisted of an initial vignette describing a hypothetical patient. This vignette first (step 3) presented all the symptoms that supported one of the diagnoses, and then (step 5) presented all the symptoms supporting the opposite diagnosis.2 In addition, “neutral” symptoms that are compatible with both diagnoses were included to avoid the contrast between the other pieces of information becoming too obvious. The full case descriptions are available as supplemental materials online (https://osf.io/dn4rv). The initial case descriptions were intended to present equally persuasive arguments for both of the diagnoses, without conclusively supporting either of them.3 4 Participants were subsequently (step 6) asked to decide on a tentative diagnosis (instructed to “select a diagnosis and indicate your confidence in it”). The response was made by clicking on a 10-point scale on which the extreme ends represented the highest degree of certainty for the case’s two diagnoses. In Experiment 3, participants also had to answer the same question when they were halfway through the initial case description (step 4), after only symptoms supporting one of the diagnoses had been presented. This was done as a manipulation check for whether the first half of the symptoms led to a decision compatible with the information presented so far, and to check whether being forced to make an early decision would enhance a confirmation bias.
After indicating their initial diagnosis (step 7), the participants were asked to select one of four options for getting more information about the symptoms (such as request A2: “Reduced language skills may indicate dementia. Ask the patient about her language use and verbal skills.”). Two of the options were explicitly marked as seeking more information about symptoms for one of the diagnoses, while the remaining two were marked as seeking information about the other diagnosis. After selecting an option (step 8), participants received additional information (between 33 and 80 words) relevant to the diagnosis they had selected. However, the additional information was worded in a way that did not conclusively point to either of the diagnoses (such as “[The patient] has thought of herself as polite and articulated but has recently been told by her family that she can be wicked, vulgar and condescending. [The patient] says that this only happens when she talks about topics that upset her.”). After receiving the follow-up information, the participants were again instructed to indicate their confidence in either diagnosis, by responding on the same 10-point scale as in step 6. This was followed (step 9) by a second opportunity to seek follow-up information, choosing among the same four options as before.5 After receiving the second follow-up information, participants were then (step 10) asked to set their final diagnosis in the same way as before. Steps 2-10 were then repeated for the second case in the experiment.
For Experiment 3, four debrief questions were included at the end of the questionnaire (step 12). The first two questions explored the participants’ thoughts about the aim of our study, while the latter two asked about the strategy they had used in their decision-making. The aim of these questions was to check whether participants may have guessed the research hypotheses, and to test whether this had affected their responses. Additionally, we wanted to explore whether the participants were aware of their own decision-making strategy. Participant answers were scored separately by three coders and compared for inter-rater reliability. The coders initially scored 1.8% of the participants differently; these disagreements were resolved by discussion.
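As a simple illustration of the agreement check this implies, the R snippet below computes the proportion of participants on whom the coders initially disagreed; the toy ratings and column names are ours, not the actual coding data.

```r
# Toy stand-in for the three coders' scores (the real ratings are not reproduced here)
codes <- data.frame(coder1 = c(1, 0, 1, 1, 0),
                    coder2 = c(1, 0, 1, 0, 0),
                    coder3 = c(1, 0, 1, 1, 0))
disagree <- apply(codes, 1, function(x) length(unique(x)) > 1)
mean(disagree)   # proportion of participants initially scored differently by the coders
```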
After completing the online experiment (step 13), the participants were quickly debriefed about the purpose of the experiment, any questions the participants had were answered in plenary, and the gift cards were distributed. Due to time constraints only a short debrief was given verbally in class, while a more thorough debrief and results summary was sent by email.
Data preparation was done in a Google Sheet synchronized to Open Science Framework to provide transparency and version history for all data transformations, and shows the calculation of the indices described below. Statistical analyses were performed in RStudio. Since the preregistrations had used past literature to make directed hypotheses, all the tests were one-tailed. Where averages indicated an inverse effect, two-tailed exploratory analyses were done. All hypotheses were tested with Wilcoxon signed-rank tests, with a standard alpha cut-off of p < .05.6 All data (https://osf.io/t3zh4/) and analysis files (https://osf.io/nye24/) are available online.
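As an illustration of this analysis approach, the R sketch below runs a one-tailed Wilcoxon signed-rank test against a chance-level constant and computes a matched-pairs rank-biserial correlation as the effect size. It is a minimal sketch with simulated data and names of our own choosing, not an excerpt from the shared analysis files.

```r
# Minimal sketch of the directional tests, assuming a numeric vector `x` holding one
# index value per participant and a chance-level reference constant `mu0`.
one_tailed_wilcoxon <- function(x, mu0) {
  test <- wilcox.test(x, mu = mu0, alternative = "greater", exact = FALSE)
  # Matched-pairs rank-biserial correlation: (T+ - T-) / (T+ + T-) over non-zero differences
  d <- x - mu0
  d <- d[d != 0]
  r <- rank(abs(d))
  r_rb <- (sum(r[d > 0]) - sum(r[d < 0])) / sum(r)
  c(V = unname(test$statistic), p = test$p.value, r_rb = r_rb)
}

# Example with simulated data: count of confirmatory requests (0-4) tested against 2
set.seed(123)
one_tailed_wilcoxon(x = rbinom(100, size = 4, prob = 0.55), mu0 = 2)
```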
More detailed descriptions of the experiment procedures and text descriptions of the cases in the original Norwegian and translated to English are available in the preregistrations for each of the experiments (https://osf.io/dn4rv/ registrations).
3. Results
3.1. Tests of anchoring
Hypothesis H1 stated that despite reading all symptoms before making a diagnosis, anchoring would lead the first-presented symptoms to have a larger effect on the decision than the last-presented symptoms. An index was created to reflect how often (for the two cases) the initial diagnosis (at T1b choice) matched the symptoms presented first in the descriptions (thus having a value of 0, 1 or 2). The null-hypothesis of no effect of symptom order would predict that by chance participants would select the diagnosis matching the first-presented symptoms in one of the two cases. H1 was assessed with a one-sample one-tailed Wilcoxon signed-rank test of whether the T1b choice matched the first-presented symptoms for more than one case.
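To make this operationalization concrete, the sketch below shows how such an anchoring index could be computed and tested in R; the logical columns coding whether the T1b choice matched the first-presented symptoms are hypothetical stand-ins for the actual data.

```r
# Hypothetical per-participant coding of whether the T1b choice matched the diagnosis
# implied by the first-presented symptoms in each of the two cases (toy values).
dat <- data.frame(first_match_case1 = c(TRUE, FALSE, TRUE,  TRUE, FALSE, TRUE),
                  first_match_case2 = c(FALSE, FALSE, TRUE, FALSE, TRUE,  TRUE))

# Anchoring index: 0, 1 or 2 first-symptom-congruent initial diagnoses per participant
anchor_index <- rowSums(dat)

# H1: one-tailed test of whether the index exceeds the chance level of 1
wilcox.test(anchor_index, mu = 1, alternative = "greater", exact = FALSE)
```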
3.1.1. Test of H1 in Experiment 1
For Experiment 1 the number of initial diagnoses matching the first symptoms (M = 1, SD = .74) was identical to the reference constant of 1, thus failing to show a significant difference (W = 370.5, p = .503, rrb = 0).
3.1.2. Test of H1 in Experiment 2
Similarly for Experiment 2, the number of initial diagnoses matching the first symptoms (M = 1, SD = 0.71) was identical to the reference constant of 1, thus failing to show a significant difference (W = 203, p = .505, rrb = 0).
3.1.3. Test of H1 in Experiment 3
Experiment 3 included a manipulation check by asking participants to make an additional preliminary diagnosis after reading the first half of the symptoms in each case (T1a choice). This was done in order to test whether participants, when they had only read symptoms indicating one of the diagnoses, in fact selected that diagnosis. A further aim was to test whether forcing participants to make a decision at an early point could lead to anchoring or confirmation bias (on the subsequent T1 request). The number of diagnoses at T1a matching the first symptoms (M = 1.53, SD = 0.60) was higher than the reference constant of 1, indicating that the participants were consistent with the symptom information that had been presented so far.
As in Experiments 1 and 2, the H1 test of anchoring in Experiment 3 used responses from the diagnostic decision made after reading the full initial case description (the T1b choice). The number of diagnoses matching the first symptoms (M = 0.63, SD = 0.69) was lower than the constant of 1, and thus not significant in the one-tailed test of exceeding 1 (W = 308, p = 1, rrb = -0.6). Exploratory two-tailed testing indicated that the effect would have been significant (p < .001) for a non-directional H1 prediction.
3.1.4. Test of H1 pooled across experiments
In an exploratory follow-up analysis, data from all three experiments (N = 218) was collapsed to make a more robust assessment of H1. The one-tailed test remained non-significant despite higher power (W = 2684, p = 0.999, rrb = -0.27).7
3.2. Tests of confirmation bias
Hypothesis H2 stated that a confirmation bias would lead participants to seek out information that could support the diagnosis they already preferred. An index was created to reflect the number of times the participants requested information (on the T1 and T2 requests) that could support the diagnosis they had stated to prefer (on the T1b and T2 choices, respectively). The null-hypothesis of no confirmation bias predicted that participants would be equally likely to investigate the preferred diagnosis as the alternate diagnosis, and would thus seek confirmatory information at two of the four possible opportunities (across the two cases). H2 was thus assessed with a one-sample one-tailed Wilcoxon signed-rank test of whether participants made more than two confirmatory requests.
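A corresponding sketch for the confirmation index, again using simulated stand-in data rather than the actual request codings:

```r
# Hypothetical coding: TRUE if a request targeted the diagnosis preferred at the
# immediately preceding choice; two requests per case and two cases give four values.
set.seed(42)
requests <- matrix(runif(4 * 50) < 0.55, ncol = 4)   # 50 simulated participants
confirm_index <- rowSums(requests)                   # 0-4 confirmatory requests each

# H2: one-tailed test of whether participants made more than two confirmatory requests
wilcox.test(confirm_index, mu = 2, alternative = "greater", exact = FALSE)
```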
3.2.1. Test of H2 in Experiment 1
For Experiment 1, the average number of requests for confirming information (M = 1.93, SD = 0.76) was slightly lower than the reference constant of 2, and thus did not support the hypothesis (W = 183.5, p = .791, rrb = -0.16). Exploratory two-tailed testing indicated that the effect would not have been significant (p = .431) for a non-directional H2 prediction.
3.2.2. Test of H2 in Experiment 2
For Experiment 2, the average number of requests for confirming information (M = 2.23, SD = 1.04) was higher than the reference constant of 2, but failed to meet our significance criteria (W = 426, p = .062, rrb = 0.28).
3.2.3. Test of H2 in Experiment 3
For Experiment 3, the average number of requests for confirming information (M = 2.16, SD = 0.95) was somewhat higher than the reference constant of 2, and met our significance criteria (W = 956, p = .047, rrb = 0.24).
3.2.4. Test of H2 pooled across experiments
An exploratory test for H2 that collapsed participants across all three experiments showed a significant effect (W = 4221, p = .048, rrb = 0.16). It should be noted that the effect was small, reflecting that there were on average 2.11 (of 4 possible) requests for confirmatory information.
3.3. Tests of confidence leading to confirmation
Hypothesis H3 stated that higher confidence in the diagnostic choice should lead to more confirmatory information seeking. Both diagnostic choices for both cases were converted to a confidence rating (between 1 and 5). H3 predicted that at both time points (the T1b and T2 choices), confidence would be higher for decisions that preceded requests for confirmatory information than for decisions that preceded requests for dissenting information (the null-hypothesis predicted no difference). H3 was tested with one-tailed dependent samples Wilcoxon signed-rank tests for whether confidence was higher for confirmatory than for dissenting requests.
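The sketch below illustrates one plausible way to derive a diagnosis and a 1-5 confidence rating from the bipolar 10-point response and to run the paired H3 test; the exact scale mapping is our assumption and may differ from the scoring used in the shared analysis files.

```r
# Assumed mapping of the bipolar 10-point response: 1-5 favour diagnosis A and 6-10
# favour diagnosis B, with the scale end points expressing the highest confidence (5).
score_response <- function(resp) {
  data.frame(diagnosis  = ifelse(resp <= 5, "A", "B"),
             confidence = ifelse(resp <= 5, 6 - resp, resp - 5))
}
score_response(c(1, 5, 6, 10))   # confidence 5, 1, 1, 5

# H3: paired one-tailed test of confidence preceding confirmatory vs. dissenting requests
conf_confirm <- c(2, 3, 2, 4, 1, 3)   # hypothetical per-participant mean confidence
conf_dissent <- c(3, 2, 2, 3, 2, 2)
wilcox.test(conf_confirm, conf_dissent, paired = TRUE,
            alternative = "greater", exact = FALSE)
```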
3.3.1. Test of H3 in Experiment 1
For Experiment 1, the confidence on diagnoses preceding confirmatory requests (M = 2.43, SD = 0.85) was somewhat lower than the confidence on diagnoses preceding dissenting requests (M = 2.64, SD = 0.88). The effect was thus in the opposite direction of the H3 prediction (W = 622, p = .966, rrb = -0.27). Exploratory two-tailed testing indicated that the effect would not have been significant (p = .068) for a non-directional H3 prediction.
3.3.2. Test of H3 in Experiment 2
For Experiment 2, the confidence on diagnoses preceding confirmatory requests (M = 2.38, SD = 0.9) was identical to the confidence on diagnoses preceding dissenting requests (M = 2.38, SD = 1.08), and thus not significant (W = 380.5, p = .445, rrb = 0.03).
3.3.3. Test of H3 in Experiment 3
For Experiment 3, the confidence on diagnoses preceding confirmatory requests (M = 2.19, SD = 0.75) was higher than the confidence on diagnoses preceding dissenting requests (M = 2.08, SD = 0.85), but did not reach significance (W = 1281.5, p = .13, rrb = 0.16).
3.3.4. Test of H3 across all three experiments
When collapsing participants across all three experiments, the exploratory test for H3 remained non-significant (W = 6423.5, p = .618, rrb = -0.03).
3.4. Test of decision process influencing confidence
Hypothesis H4 stated that having a consistent decision and only seeking confirming information should be associated with increased confidence. An index was calculated for the change in confidence from the first (T1b choice) to the last (T3 choice) decision for each case (the index could vary from -3 to +3). Participants who preferred the same diagnosis on all three diagnostic questions for a case (T1b, T2 and T3 choices), and requested congruent information at both opportunities for the case (T1 and T2 requests), were considered to have consistent choice preferences throughout the case evaluation. Since relatively few participants met our criteria for consistency (across experiments, 31%, n = 68), this was not tested separately for each experiment but pooled across the three experiments. H4 was tested with a one-sample one-tailed Wilcoxon signed-rank test of whether confidence increased more among the consistent than among the inconsistent participants. This showed that the consistent participants increased their confidence (M = .65, SD = 1.4) compared to the sample average (M = .26, SD = 0.91), which was a significant change in the predicted direction (W = 1057, p < .001, rrb = 0.53).8
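The following sketch illustrates one reading of this H4 test, in which the consistent participants' confidence change is tested against the full-sample mean change used as the reference constant; the simulated data and the exact form of the comparison are our assumptions (the actual computation is in the shared analysis files).

```r
# Simulated stand-in data; the shared analysis files contain the actual computation.
set.seed(1)
n <- 60
same_diag_all3   <- runif(n) < 0.45            # same diagnosis chosen at T1b, T2 and T3
confirm_req_both <- runif(n) < 0.55            # confirmatory requests at both opportunities
delta_conf <- sample(-3:3, n, replace = TRUE,
                     prob = c(1, 2, 4, 8, 6, 3, 1) / 25)   # confidence change, T1b to T3

# One reading of the H4 test: consistent participants' confidence change tested against
# the full-sample mean change used as the reference constant.
consistent <- same_diag_all3 & confirm_req_both
wilcox.test(delta_conf[consistent], mu = mean(delta_conf),
            alternative = "greater", exact = FALSE)
```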
3.5. Follow-up analyses after removing non-naive participants
Experiment 3 included questions about what the participants thought the research hypotheses were. These were used in preregistered follow-up analyses that tested whether the hypotheses listed above would be supported when excluding participants that appeared to have correctly guessed the hypotheses. After reviewing the responses, we excluded 19 participants who appeared to have fully or partly guessed the central research questions or any of the hypotheses. This left 71 participants for follow-up analyses. The analyses of this sample still showed no significant effects for the three hypotheses (H1: W = 202.5, p = 1, rrb = -0.59, H2: W = 661, p = .135, rrb = 0.19, H3: W = 817, p = .185, rrb = 0.14). Testing H4 in Experiment 3 for the same sample remained significant (W = 287, p < .009, rrb = 0.52).
4. Discussion
4.1. Biases in diagnostic decision-making
4.1.1. Summary of results
The aim of the current study was to develop an experimental procedure for studying the interaction of anchoring, confirmation and confidence across the decision process. Further, we wanted to test the experiment in moderately sized samples of decision-makers with relevant expertise. To this end, we performed three classroom experiments where advanced medical students were asked to diagnose two hypothetical mental health cases. Across the three experiments, we found (H1) no support for anchoring, (H2) marginal support for confirmation bias, (H3) no support for confidence increasing confirmatory information seeking, and (H4) support for confidence increasing when being consistent throughout the decision process. The results for all hypotheses are summarized in Figure 2, and each is discussed in more detail below.
4.1.2. No indication of anchoring
We expected that (H1) the first-presented symptoms would create an anchor for reading the rest of the case description. This would lead participants to favor the diagnosis matching the first-presented symptoms rather than the diagnosis matching the last-presented symptoms. However, this was not supported in any of the three experiments. Experiments 1 and 2 both showed that the initial diagnosis matched the first-presented symptoms equally often as it matched the last-presented symptoms. Experiment 3 showed a non-predicted effect of the initial diagnosis more often matching the last-presented symptoms. This is in contrast to previous studies that have found anchoring for medical decision-making in similar designs that manipulated the order of symptom presentation (see e.g., Cunnington et al., 1997; Richards & Wierzbicki, 1990).
The absence of an anchoring effect led us to re-examine the stimulus materials. Richards and Wierzbicki (1990) argued that it can be challenging to create case materials that are so well balanced that the mere order of symptoms can tip the scales in favor of a given diagnosis, while avoiding imbalances due to differences in description length or symptom severity. When designing the current materials, we also attempted to strike a balance between describing symptoms that pointed towards a specific diagnosis, while being sufficiently ambiguous to not conclusively eliminate the competing diagnosis. If our case descriptions were not informative or balanced between the diagnoses, this may have prevented the experiments from producing an anchoring effect. We counterbalanced symptom order (i.e., for half the participants the first-presented symptoms indicated diagnosis A, while for the other half the first-presented symptoms indicated diagnosis B), which should minimize effects of materials being imbalanced. Further, examining the data shows that the initial decisions had a roughly normal distribution centered around having low confidence in one of the diagnoses, which indicates that the case descriptions were fairly well balanced.
Instead of showing an anchoring effect in preferring the diagnosis that matched the first-presented symptoms, Experiment 3 showed the opposite pattern, of preferring the diagnosis matching the last-presented symptoms. This could indicate a recency effect (Botto et al., 2014; Murdock, 1962; Tzeng, 1973). Such an effect could lead to last-presented symptoms being more available in working memory, and thus having a larger impact on the decision (see similar effects in e.g., Bergus et al., 1995; Tubbs et al., 1993; see also the “serial anchoring” effect discussed by Bahník et al., 2019). However, the impact of the last-presented symptoms in our Experiment 3 could also be due to demand characteristics caused by a study design variation in this experiment (see further discussion below).
Previous attempts to conceptually replicate studies of bias in medical decision-making have had mixed results (Ellis et al., 1990; Friedlander & Stockman, 1983). It is possible that the inherent complexity makes it more difficult to replicate anchoring effects in real-life expert decision-making than for naive lab participants making abstract decisions. This could be due to the decision being affected by various prior assumptions, strategies and preferences (Hutton & Klein, 1999). Nevertheless, our manipulation check (the T1a choice) indicated that participants were sensitive to the presented information, as they favored the diagnosis consistent with the information that had been presented thus far.
4.1.3. Confirmatory information seeking
We operationalized confirmation bias as requesting follow-up information that could confirm the diagnosis preferred on the preceding decision (rather than information that could confirm the opposing diagnosis). We expected (H2) that there would be more requests for confirming, rather than dissenting, information. This test of confirmation bias reached significance in Experiment 3 and when collapsed across the three experiments, while Experiment 2 showed a similar trend that did not meet our significance criterion.
The study as a whole thus indicated confirmation bias for diagnostic decision-making, in terms of clinicians seeking information that could support the diagnosis they already hold, rather than seeking information that could falsify their assumption. This finding is compatible with other studies (Martin, 2000; Mendel et al., 2011; Oskamp, 1965; Parmley, 2006). This indicates that the previously identified confirmation bias phenomenon (Jonas et al., 2001; Schulz-Hardt et al., 2000) extends to novel experimental settings. Further, this supports the claim that the confirmation bias could be relevant for the mental health domain. Clinicians should thus be aware that confirmation bias can lead to non-optimal information gathering and decision-making (Blumenthal-Barby & Krieger, 2015).
However, while the argument can be made that the current study as a whole found indications of a confirmation bias, it should be noted that the effect sizes were small: two experiments showed effects close to the preregistered significance threshold, while one experiment did not show a significant effect. This is in contrast with previous research (Martin, 2000; Mendel et al., 2011; Oskamp, 1965; Parmley, 2006) that has indicated the confirmation bias to be a robust effect that should reliably reproduce. Nevertheless, most of the studies on biases in diagnostic decision-making also appear to have small effects, which resemble our pooled results.
A possible reason for the weak effect in our study could be that we by design used materials that gave ambiguous feedback on the follow-up requests. This may have confused or annoyed the participants, which could motivate them to use a more analytical mode of thinking (see Croskerry, 2009).
We further do not know the participants’ motivation for making their follow-up requests. We assumed that participants sought support for their current assumption when their request targeted symptoms consistent with their preferred diagnosis (i.e., modus ponens). However, some participants could have asked about a symptom associated with their preferred diagnosis with the intention of discounting that diagnosis if the symptom turned out to be absent (i.e., modus tollens). Such a falsification strategy would nevertheless be registered as confirmatory in our operationalization. Similarly, participants could also ask about symptoms associated with the non-preferred diagnosis in order to support their preferred diagnosis if those symptoms were absent (thus a confirmatory strategy that would not be registered as confirmatory in our operationalization). We should note that similar mechanisms may be present in most studies of confirmation bias, and may be difficult to fully discount in purely behavioral designs.
4.1.4. Confidence did not lead to confirmatory information seeking
We expected (H3) that confidence would be higher for the diagnostic decision that was followed by requests for confirmatory information, than for decisions that were followed by requests for dissenting information. If so, this could indicate a mechanism for confirmation bias. However, although there was an overall tendency for seeking confirmatory information (see H2), there was no support in any of the experiments for increased confirmatory information seeking when participants were more certain of their decision. This null-finding thus fails to support a previous finding (Martin, 2000) of higher confidence leading to more confirmatory information seeking. The absence of such an effect may indicate that confirmatory information seeking is not motivated by the degree of confidence in the preferred decision.
It should be noted that we used a novel response mode in the current experiments, where participants indicated their graded confidence between either diagnosis on a 10-point scale. It is possible that participants’ use of the scale did not represent their actual degree of confidence. However, note that most participants indicated low confidence in their decisions (see Figure 1), which may be expected given the ambiguous case descriptions and follow-up information. This may somewhat support our response mode as a measure of confidence.
4.1.5. Increased confidence when not exploring alternatives
We expected (H4) that participants who favored a single diagnosis (at T1b, T2 and T3 choices) and only sought information consistent with that diagnosis (at T1b and T2 requests) would show an increase of confidence in their decision. This was supported (when pooled across experiments), in the sense that the consistent participants increased their confidence during the decision process. As the follow-up information was designed to be ambiguous and should not provide the participants with any additional conclusive information, the increase in confidence could be said to indicate overconfidence (Oskamp, 1965). Similar effects have been seen in a study that presented more conclusive information (Martin, 2000), showing that experts were less confident than novices. While subject to cultural variation (Yates, 2010), overconfidence has previously been shown in Norwegian settings (Bratvold et al., 2020; Jørgensen et al., 2004). We are not familiar with previous studies of overconfidence for medical decisions in Scandinavia.
4.2. Limitations and suggestions
4.2.1. Alternative experiment designs
The current study had a novel experiment design in order to test the combination of anchoring, confirmation and confidence on seeking and evaluating information throughout the decision process. To ensure sufficient samples of trained participants, the experiment was designed to be short so that it could be completed in a break between two lectures. Providing participants with only a limited amount of information about each case may have made it easier to deliberately process the information. This may have resulted in less heuristic processing than in other experiments with more comprehensive patient information, which may account for deviating results (see e.g., Mendel et al., 2011). One may argue that providing more information would be a more valid representation of real-life mental health decisions. It could also be that a longer experiment with more patient cases would be a more robust test for the biases.
The case descriptions were designed to be ambiguous, and the follow-up requests were designed not to provide any conclusive information. It could be that anchoring would have emerged if the beginning of the patient case had given a clearer indication of a given diagnosis. However, it may have stretched the validity of the case to first clearly indicate one diagnosis only for it later to be contradicted. Another approach could have been to let the follow-up requests provide feedback that resolved the participants’ uncertainty. Participants may have used the requests more actively in such a situation, which may have shown a confirmation bias more clearly. However, this would not fit with the current aim of exploring a multi-stepped decision process.
A possible reason for the lack of an anchoring effect in the current study may be that participants did not commit to a decision after reading the first-presented symptoms. To test for this, Experiment 3 asked participants to make a preliminary diagnosis (T1a choice) after reading only the first-presented symptoms. However, this did not have the expected effect of making the diagnosis after reading the full vignette (T1b choice) more in line with the first-presented symptoms. On the contrary, while in Experiments 1 and 2 the decisions after reading the full vignette (T1b choice) were evenly distributed between the two diagnoses, in Experiment 3 there was a strong preference for the diagnosis that matched the last-presented symptoms. This may be due to a demand characteristics effect (Orne, 2006; Rosenthal, 1963; Strohmetz, 2008): When participants in Experiment 3 were asked to diagnose again after being given additional information, they may have believed that they were expected to change their diagnosis. Another approach to this issue could have been to present audio or video recordings of patient cases, so that participants had to make evaluations “on the spot” without being able to re-read or reconsider previous information.
4.2.2. Possible participant bias
Participants’ knowledge or assumptions about the research hypotheses may change their responses towards or away from the hypothesized results. This was indicated in informal debrief conversations after Experiments 1 and 2. A debriefing survey at the end of Experiment 3 indicated that about a fifth of the participants correctly guessed one or more of the research hypotheses. Removing these participants did not change the significant results from Experiment 3. Nevertheless, such artefacts may have impacted the results in Experiments 1 and 2, as well as in other published literature where demand characteristics are not controlled for (Orne, 2006).
4.2.3. Statistical power
The sample sizes of our three experiments were restricted by practical concerns (the number of medical students at our local university). The three studies separately (average n = 72) had sufficient power to detect effects of Cohen’s d = 0.36 or larger (given a one-sided alpha of .05 and power of .8). Alternatively, pooling all participants (n = 224) gives sufficient power to detect effects of d = 0.21. The studies on decision making in mental health cited above often fail to provide sufficient information to calculate effect sizes for anchoring, confirmation bias and overconfidence, and typically had low power. Some of the studies (e.g., Richards & Wierzbicki, 1990) have described their effects to be between weak and moderate (thus corresponding to d between 0.2 and 0.5). If so, the current study is more or less sufficiently powered to detect effects of this magnitude, at least when pooled across experiments. It is possible that the examined biases have weaker effects than previously assumed when tested in somewhat realistic problem fields in which the decision-maker has relevant expertise. If so, the current study may be underpowered, and higher-powered studies combined with salient manipulations may be necessary to demonstrate the biases in real-world settings.
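As a rough check of these sensitivity figures, the following R snippet uses a t-test approximation to confirm that the stated sample sizes provide at least 80% power for the stated effect sizes; the Wilcoxon tests actually used are slightly less efficient, so the exact figures may differ somewhat.

```r
# Rough t-test approximation of the sensitivity figures quoted above; the Wilcoxon
# tests actually used are slightly less efficient, but both checks stay above 80% power.
power.t.test(n = 72,  delta = 0.36, sd = 1, sig.level = .05,
             type = "one.sample", alternative = "one.sided")$power
power.t.test(n = 224, delta = 0.21, sd = 1, sig.level = .05,
             type = "one.sample", alternative = "one.sided")$power
```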
4.3. Implications and future research
This is the first study investigating the interaction of anchoring, confirmation and overconfidence bias on mental health diagnostic processes for trained clinicians. The current experiment procedure and materials may inspire similar explorations of medical decision-making. A transparent research process with preregistration, open materials and open data may assist the planning of future studies. To improve ecological validity and measurement specificity, one may consider expanding the number of patient cases, symptom information or follow-up requests. Another approach could be to measure meta-knowledge and strategy during the decision process, similar to what we attempted at the end of Experiment 3. A relevant research program for exploring whether the biases can still be shown in ecologically valid settings could be to first use more salient manipulations to demonstrate proof-of-concept for the biases, and then gradually make the manipulation more subtle.
The current study failed to find anchoring and only weakly indicated confirmation bias. This suggests that reproducing the biases in real-life decision-making may be more challenging than past literature has indicated. While several studies have detected these biases in similar settings (Mendel et al., 2011; Parmley, 2006; Richards & Wierzbicki, 1990), there have also been studies failing to demonstrate them (Christensen et al., 1995; Ellis et al., 1990; Weber et al., 1993). Above we argue that the complexity of the medical diagnostic process may yield more opportunities for biases than simpler one-off decisions (see similar arguments in e.g., Blumenthal-Barby & Krieger, 2015; Saposnik et al., 2016). However, such complexity also allows for different processes and interactions to be in effect. It could be that biases previously shown in simpler situations are only relevant for a subset of these processes, or that they may be countered by other processes. For comparison, cognitive biases such as “loss aversion” have been shown to be moderated by factors such as domain knowledge, experience and education (Mrkva et al., 2020).
The literature has shown that cognitive biases can be produced in certain settings when using specific manipulations. Norman and Eva (2010, p. 97) argued that some previous demonstrations of biases may “induce error for the sake of determining if the biases exist”. Alternative approaches such as “naturalistic decision making” (Klein, 2015) and “ecological rationality” (Gigerenzer, 2008) have expressed doubts about the extent to which such biases will be detrimental to experts’ real-life decision-making. Recent developments in psychological science (Ioannidis et al., 2014) have emphasized how research practices such as analysis flexibility, selective publication and conceptual (rather than direct) replications may have generated and propagated false positive findings. This makes it difficult to know whether published effects generalize to other settings. Such practices may have contributed to overestimating the reliability of anchoring and confirmation biases, and may make it difficult to identify the necessary conditions for producing the effects. This may partly explain why the current results deviate from the majority view in the literature.
Author Contributions
The theoretical framework and the experiment designs were developed by BS, TF and ØKB. TF and ØKB developed the case descriptions and did data collection for Experiments 1 and 2. VTS did data collection for Experiment 3. BS managed the project. All authors contributed to writing the first draft of the manuscript, while BS wrote and edited the current version of the manuscript. Results from Experiments 1 and 2 have previously been part of a master thesis by TF & ØKB, supervised by BS (http://hdl.handle.net/1956/17805).
Acknowledgements
Thanks to our collaborators at the University of Bergen Faculty of Medicine: Anders Bærheim, Øystein Hetlevik, and Thomas Mildestvedt. Thanks to the participants in the three experiments.
Conflicts of interest
All authors declare that they have no conflicts of interest in the publication of this research.
Data Accessibility Statement
All data (https://osf.io/t3zh4/) and analysis files (https://osf.io/nye24/) are available online.
More detailed descriptions of the experiment procedures and text descriptions of the cases in the original Norwegian and translated to English are available in the preregistrations for each of the experiments (https://osf.io/dn4rv/registrations).
Footnotes
1. This hypothesis can only be tested on participants who show this response pattern. As this applied to only 12, 20 and 36 participants in the three experiments, the test is underpowered for each experiment separately and was therefore run pooled across experiments.
2. Experiments 1 and 3 presented two symptoms for each of the diagnoses. In an attempt to make the manipulation stronger, Experiment 2 presented two symptoms for the first diagnosis and only one for the other diagnosis. Preceding experiments were also used to adjust the wording of the cases so that the two diagnoses would be selected equally often.
3. Figure 1 illustrates the extent to which this was successful. When Experiment 1 indicated a preference for one diagnosis independent of symptom order, the case descriptions for Experiments 2 and 3 were edited to adjust for this.
4. In order to counterbalance for effects of the order in which the diagnoses were listed (rather than the intended effect of the order of presenting the symptoms), two different versions of each experiment were made, with the two diagnoses for each case in the order ABCD and BADC. Approximately half of the class (assigned by seating) was asked to answer the ABCD form, while the other half was asked to answer the BADC form. This counterbalancing variation was collapsed in the analysis, as the operationalization of responses only counted whether the selected diagnosis matched the first- or the second-presented symptoms.
5. It was possible to request the same follow-up information twice, but this only happened for 1.82% of the cases across all participants. This indicates that the vast majority of participants were invested in and attentive to the task.
6. Note that the preregistration specified that hypotheses would be tested with t-tests. Since the assumptions for Student’s t-test may not be fulfilled for the current study, we here instead present results for Wilcoxon signed-rank tests. This mostly does not affect the significance criteria for the tests, but for H2 the t-test was p = 0.51 in Experiment 2 (compared to W p = .062), and p = .052 in Experiment 3 (compared to W p = .047). Both statistics are included in the online results output files.
7. With two-tailed testing, this effect would have been significant (p = .003), due to Experiment 3 showing a preference for the last-presented symptoms on the T1b choice.
8. Note that this way of testing H4 deviates from the preregistration. The preregistration specified that H4 would be evaluated by testing whether the participants with consistent responses increased their confidence (as opposed to remaining stable or decreasing). However, such an approach would not be sensitive to whether the subset’s confidence increased more than that of the other participants. The preregistered analysis also found a significant result, although with a larger effect size (t(67) = 3.82, p < .001, d = 0.46).