The impact of the COVID-19 pandemic on psychological well-being led to a proliferation of online psychological interventions, along with the publication of studies assessing their efficacy. The aim of the present work was to assess the scientific quality of studies addressing online psychological interventions for common mental health problems, comparing studies published during the COVID-19 pandemic to equivalent control articles published in the four years before the pandemic. To this end, we developed and applied a quality checklist to both samples of articles (N = 108). Overall, we found that the methodological quality of many studies on psychological interventions was poor both before and during the pandemic. For instance, across both samples of articles, 33% of the studies lacked a control group of any kind, and less than 5% of studies used blinding of any sort. Within this context, we found that studies conducted during the pandemic were published faster, but showed a decrease in key indicators such as the randomized allocation of participants to experimental groups, pre-registration, or data sharing. We conclude that the low overall quality of the available research on online psychological interventions deserves further scrutiny and should be taken into consideration when making informed decisions on therapy choice, policy making, and public health, particularly in times of increased demand and public interest such as the COVID-19 pandemic.
The COVID-19 pandemic created an exceptional demand for answers from the scientific community to address a wide array of critical, time-sensitive challenges. One of these challenges has been the deterioration of mental health and well-being, and the need for treatment or prevention of these pervasive problems via online interventions. However, past experience suggests that expediting scientific production may be problematic (e.g., Jung et al., 2021; Kataoka et al., 2021; Keenan et al., 2021). Here, we address whether the general context of the COVID-19 pandemic may have had an impact on the quality of research in the specific area of online interventions for common mental health problems such as depression, anxiety, or stress.
Concerns regarding the quality of research produced in response to the pandemic were first raised in the medical-clinical research field, where methodological quality scores were reported to decline (Candal-Pedreira et al., 2022; Jung et al., 2021; Quinn et al., 2021). More specifically, meta-scientific reports identified several issues, such as a decreased use of randomized clinical trials (RCTs) in favour of observational designs (Joshy et al., 2022), smaller sample sizes (Candal-Pedreira et al., 2022), poorer adherence to publication standards (Quinn et al., 2021), a lower rate of pre-registered studies, a lack of transparency in research protocols and data analysis (Kapp et al., 2022), and a higher risk of publication and reporting bias (Candal-Pedreira et al., 2022). There is also evidence that these studies underwent faster peer review, which is not a direct indicator of quality but has been reported to be statistically associated with lower methodological quality (Allen, 2021; Horbach, 2021; Jung et al., 2021; Kapp et al., 2022). Based on these results, some authors have underscored that special care should be taken when interpreting the scientific output produced during the COVID-19 pandemic (Candal-Pedreira et al., 2022; Jung et al., 2021; Khatter et al., 2021).
The bulk of research regarding this meta-scientific question has been conducted in biomedical fields, to monitor the quality of pharmacological interventions and their scientific output during the pandemic, but research in other areas with widespread consequences for public health has been almost entirely overlooked. In particular, methodological assessments of the scientific production in the field of mental health are scarce, despite the well-documented negative psychological consequences of the COVID-19 pandemic (Gruber et al., 2021; Usher, Bhullar, et al., 2020; Usher, Durkin, et al., 2020) and the considerable increase in the number of therapeutic interventions made available to alleviate such consequences. To our knowledge, only Nieto, Navas, and Vázquez (2020) have conducted a systematic review of the evidence regarding the impact of the pandemic on the quality of research on mental health. Nieto et al. pointed out that the reviewed papers may not meet standards of validity, generalizability, reproducibility, and replicability (see, for example, Eignor, 2013). They identified issues such as the use of convenience samples, the lack of a priori power analyses, or poor adherence to open science recommendations. Nevertheless, the research quality of the mental health interventions that proliferated during the pandemic (mostly online, due to lockdown constraints) remains virtually unexplored.
The anxiety related to the health threat of the pandemic, compounded by social distancing requirements and mobility restrictions, fostered a rapid proliferation of online tools to meet the urgent need for psychological support. While online (i.e., “tele-health”) psychological interventions had emerged in recent years as alternatives to face-to-face interventions (Torre et al., 2018), it was during the pandemic that these approaches experienced a dramatic surge in popularity (Ho et al., 2021; Sammons et al., 2020), along with corresponding research studies aimed at assessing their effectiveness (e.g., Holmes et al., 2020; Palayew et al., 2020; Ruiz-Real et al., 2020). These interventions often involve the use of applications on electronic devices, SMS services associated with different institutions, or the delivery of classic face-to-face therapy through video-conference platforms. Gaining knowledge about the methodological quality of such research is critical to ensure that patients and users receive appropriate mental healthcare and support. Hence, the goal of the present study was to assess the quality of research on online interventions for mental health during the COVID-19 pandemic, compared to a control set of studies published before the pandemic. To this end, given the lack of adequate tools available for this purpose, we developed a quality checklist designed to address key methodological features of a heterogeneous set of research designs before and after the COVID-19 outbreak. We focused on methodological issues previously identified as problematic in this field (Cybulski et al., 2016; Fraley & Vazire, 2014; Spring, 2007; Tackett et al., 2017; Tackett & Miller, 2019).
Methods
Literature Search
We searched for primary research articles published in journals included in three scientific databases widely recognized throughout the community (Scopus, Web of Science, and PubMed). The general inclusion criteria for the selection of articles were: (1) peer-reviewed original research articles published in English in indexed journals, (2) studies that tested the effects of online interventions for common mental health problems and well-being, (3) studies that used quantitative measures for the target dependent variables, namely mental health outcomes, and (4) studies using at least one within-group comparison (before vs. after the intervention) or one between-group comparison (outcomes for experimental vs. control groups). We allowed ample heterogeneity in research designs in order to collect sufficiently large samples (especially for the period after the outbreak, given the short time elapsed). Observational studies, research articles focused on qualitative analyses, case studies, published protocols, reviews, and meta-analyses were excluded from our search. We also excluded grey literature (i.e., evidence not published in indexed journals that could provide data with null or negative results that might not otherwise be disseminated; see, for example, Paez, 2018). This decision was made because we aimed to explore the quality of the articles that would effectively contribute knowledge usable in society (e.g., for the implementation of public policies on mental health, users’ decision-making, …).
For the COVID-19 sample, we searched for articles published during the pandemic. In addition to the general inclusion criteria listed above, articles were selected if they were published after 2018 and included in their title, abstract, or keywords a reference to COVID-19 (“covid-19”, “coronavirus”, “sars-cov-2”, or “pandemic”), to their interventional nature (“intervention”, “program”, “training”, “treatment”, or “therapy”), to the online format of the interventions (“online”, “web”, “app”, “eHealth”, “e-health”, “telehealth”, “tele-health”, “videoconference”, “videocall”, or “digital”), and to the expected psychological variables targeted by the interventions (“anxiety”, “depression”, “stress”, “distress”, “worry”, “mental health”, “mood disorder”, “coping”, “fear”, or “loneliness”). The number of records found for articles published during the pandemic was 2,278. After reading titles and abstracts, 1,924 articles were excluded, and the remaining 354 articles were further evaluated for eligibility. We then filtered out studies that reported interventions that started before the pandemic, case studies, studies that applied a decommissioning strategy, studies that did not report psychological outcomes, studies without interventions, studies that were not exclusively online (e.g., studies that evaluated the transition of a treatment from in-person to the online modality), qualitative studies, and study protocols. Articles not explicitly mentioning COVID-19 were also excluded, as these may mostly represent papers that were already in the pipeline or submitted for publication before the pandemic. Of the 354 candidate studies, 56 finally made up the sample of articles that were analyzed (see PRISMA flowchart below).
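For illustration, the four groups of search terms above (combined with OR within each group and AND across groups, plus the publication-year restriction) could be assembled into a Scopus-style query as in the following R sketch. This is an illustrative reconstruction, not the exact string submitted to the databases; TITLE-ABS-KEY and PUBYEAR are Scopus field codes.

```r
# Illustrative reconstruction of the COVID-19 sample search string
# (not the exact query used in the study).
covid_terms   <- c("covid-19", "coronavirus", "sars-cov-2", "pandemic")
interv_terms  <- c("intervention", "program", "training", "treatment", "therapy")
online_terms  <- c("online", "web", "app", "eHealth", "e-health", "telehealth",
                   "tele-health", "videoconference", "videocall", "digital")
outcome_terms <- c("anxiety", "depression", "stress", "distress", "worry",
                   "mental health", "mood disorder", "coping", "fear", "loneliness")

# Wrap each group of terms as ("term1" OR "term2" OR ...)
or_block <- function(terms) paste0("(", paste0('"', terms, '"', collapse = " OR "), ")")

query <- paste0(
  "TITLE-ABS-KEY(",
  paste(vapply(list(covid_terms, interv_terms, online_terms, outcome_terms),
               or_block, character(1)),
        collapse = " AND "),
  ") AND PUBYEAR > 2018"
)
cat(query)
```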
For the control sample, we searched for articles published before the pandemic outbreak (between 2016 and 2020), with otherwise identical keywords and criteria as for the COVID-19 sample, except for omitting pandemic-related terms from the search. Because this approach yielded too many records, we added a constraint to narrow the results down to papers addressing types of interventions already present in the COVID-19 sample. This was intended not only to facilitate the literature search, but also to improve the comparability between the two groups of articles. The added keywords represented the most common interventions found in the abstracts of the COVID-19 sample of articles. To this end, we restricted our search to articles including at least one of the following terms in their title, abstract, or keywords: “mindfulness”, “relax”, “sms”, “CBT”, “iCBT”, “psychoeducation”, “support”, “meditation”, “coach”, “EMDR”, “acceptance and commitment”, “MBCT”, “MBI”, “MBSR”, or “heartfulness”. A total of 467 articles were found in the Scopus database. After reading titles and abstracts, 260 articles were excluded. After applying the same filters as above, we retained 205 articles that were eligible for evaluation. Of these, 52 articles made up the final control sample (see PRISMA flowchart below).
Note that three databases (Scopus, Web of Science, and PubMed) were used to search for the COVID-19 articles, whereas only Scopus was used in the search for the sample of control articles. In the first case, we wanted to maximize the number of COVID-19 articles and therefore expanded the search as much as possible. In the second case, the objective was to match the COVID-19 articles, and a search in Scopus was sufficient for this purpose, since most journals are represented in all three databases.
We conducted a power analysis using G*Power version 3.1.9.7 (Faul et al., 2007) to determine the minimum effect size detectable with a sample of 108 articles (56 articles published during the pandemic, 52 articles published before the pandemic). For a two-tailed Wilcoxon-Mann-Whitney test, our sample size yields 90% power to detect an effect size of d = 0.645 and 80% power to detect an effect size of d = 0.557, assuming a Gaussian parent distribution. Therefore, this sample size allowed us to detect medium-to-large effects with reasonable power, which we considered adequate given our time and financial resources.
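For readers without access to G*Power, the sensitivity analysis can be approximated in R with the pwr package by shrinking the sample sizes with the asymptotic relative efficiency (ARE = 3/π) of the Wilcoxon-Mann-Whitney test relative to the t-test under a Gaussian parent distribution, which approximates the correction G*Power applies with its A.R.E. method. The sketch below is an approximation under these assumptions and reproduces the reported minimum detectable effects only up to rounding.

```r
# Approximate sensitivity analysis for a two-tailed Wilcoxon-Mann-Whitney test,
# mimicking G*Power's A.R.E. method (effective n = n * 3/pi under normality).
library(pwr)

n_covid   <- 56
n_control <- 52
are       <- 3 / pi   # asymptotic relative efficiency vs. Student's t-test

detectable_d <- function(power) {
  # Smallest standardized effect detectable by a two-tailed t-test
  # with the ARE-shrunk sample sizes
  pwr.t2n.test(n1 = n_covid * are, n2 = n_control * are,
               sig.level = 0.05, power = power,
               alternative = "two.sided")$d
}

detectable_d(0.80)  # roughly 0.56 (reported: d = 0.557)
detectable_d(0.90)  # roughly 0.64 (reported: d = 0.645)
```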
Development of a Checklist for Quality Assessment
The article search described above returned studies with a large variety of methodologies. Previous quality checklists focus almost exclusively on specific designs, in which methodological characteristics such as blinding and other features uncommon in psychology carry great weight: for example, the RoB-2 (Sterne et al., 2019) is specific to randomized trials, while the ROBINS-I (Sterne et al., 2016) and the Newcastle-Ottawa scale (Wells et al., 2013) target non-randomized studies. In the present context, we decided to develop a specific checklist tailored to assessing the methodological quality of studies with wider methodological variability. Although the development of the checklist was not the primary goal of our study, we aimed to capture the heterogeneity in research designs present in the field of online psychological interventions. We took as a reference previous work on the assessment of research quality (Ferrero et al., 2021; Jung et al., 2021; Sterne et al., 2019). Items were organized into the following clusters: design features, blinding, statistical analysis, replicability, and reporting (Table 1). Following the definition proposed by Nosek and Errington (2020), by replicability we refer to the possibility of repeating the same study. In this sense, the checklist included three items to check whether an independent research team would have sufficient information to re-run the same study on an online intervention. We did not attempt to compute a single quality score for each study by merging information from different items, as such composite scores have been criticized extensively in the literature due to their heterogeneous nature (Jüni et al., 1999; Mazziotta & Pareto, 2013; Peduzzi et al., 1993).
Table 1. Domains of the quality assessment checklist: design features, blinding, statistical analysis, replicability, and reporting & publishing. The full list of items within each domain is provided in the coding protocol (Supplementary Material 1).
Coding Procedure
Two independent judges (authors C.R.P. and M.B.) assessed the checklist items for each article. The coding protocol was refined throughout 12 meetings (average duration 2.5 h) over five months, during which the two coders iterated between independent pilot coding and discussion to add or remove variables and to fine-tune their definitions, following the recommendations by Wilson (2019). Most importantly, during these meetings the two raters never discussed the specific coding of any particular study, only general problems encountered when applying the checklist. This ensured that the assessment of each judge remained as independent as possible while allowing the progressive refinement of the checklist. Once the two judges had completed their independent assessments, they resolved disagreements through discussion until agreement was reached on each case.
Of note, coders were not blind to the article category. While we acknowledge that coding would benefit from blind coders, this would have added an extra layer of complexity to the quality assessment process (i.e., having an independent researcher thoroughly pre-process article characteristics before coding) that was deemed unfeasible. The coding protocol can be found in Supplementary Material 1.
Data Analysis
The dataset containing the scoring of each article was created, stored, and manipulated using Microsoft Excel 16.0.15028.20160 and Google Spreadsheets. Statistical analyses were performed using Jamovi (The jamovi project, 2021) and RStudio (Posit team, 2023), with the packages tidyr (Wickham, Girlich, et al., 2022) and dplyr (Wickham, François, et al., 2022) for data wrangling. For descriptive and inferential statistical analyses, we used pastecs (Grosjean et al., 2018), stats (R Core Team, 2023), MVN (Korkmaz et al., 2021), asht (Fay, 2022), summarytools (Comtois, 2022), gmodels (Warnes et al., 2022), MASS (Ripley et al., 2022), and vcd (Meyer et al., 2023). The ggplot2 package (Wickham, 2016) was used for data visualization. Continuous variables are reported as mean and SD or median, as appropriate, and categorical variables are reported as proportions (%). Continuous variables were compared using Student's t-test (or the Mann-Whitney U test when its assumptions were not met), and categorical variables were compared using the χ2 test, Fisher's exact test, or the Kruskal-Wallis test, depending on compliance with statistical assumptions. All contrasts were two-tailed. Cohen's d and odds ratios (OR) are presented as effect sizes for continuous and categorical variables, respectively. However, it is worth noting that, as some of our continuous variables did not meet normality assumptions, Cohen's d may misrepresent the difference between the medians of the two distributions.
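As an illustration of this analysis pipeline, the minimal sketch below compares one continuous indicator and one dichotomous indicator between the two groups of articles using base R. The data frame and column names (articles, group, accept_days, preregistered) and the toy data are hypothetical placeholders, not the variables of our actual scripts (which are available in the OSF repository).

```r
# Toy data standing in for the coded articles (hypothetical, for illustration)
set.seed(1)
articles <- data.frame(
  group         = rep(c("covid", "control"), times = c(56, 52)),
  accept_days   = c(rpois(56, 60), rpois(52, 90)),
  preregistered = sample(c("yes", "no"), 108, replace = TRUE)
)

# Continuous indicator: Mann-Whitney U test when normality is not met
wilcox.test(accept_days ~ group, data = articles)

# Dichotomous indicator: chi-squared test with and without continuity
# correction (the "cc" entries in Table 3), or Fisher's exact test
tab <- table(articles$group, articles$preregistered)
chisq.test(tab, correct = FALSE)
chisq.test(tab, correct = TRUE)
fisher.test(tab)

# Odds ratio (COVID vs. control rows) with a Wald 95% CI; a Haldane-Anscombe
# correction (add 0.5 to every cell) is applied when any cell is zero
m  <- tab + if (any(tab == 0)) 0.5 else 0
or <- (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])
se <- sqrt(sum(1 / m))                        # standard error of log(OR)
ci <- exp(log(or) + c(-1, 1) * 1.96 * se)
c(OR = or, lower = ci[1], upper = ci[2])
```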
Results
Out of the 108 studies included in the analyses, 41.7% of the articles were produced in research centers or universities in North America, 27.8% in Europe, 18.5% in Asia, 9.3% in Oceania, and 0.9% (one article) in South America. Most articles included samples of participants drawn from the general population in Western countries. The most represented journals were Mindfulness (Springer, IF = 3.8; n = 8), JMIR Mental Health (JMIR Publications, IF = 6.33; n = 8), International Journal of Environmental Research (Springer, IF = 1.48; n = 4), Frontiers in Psychology (Frontiers Media, IF = 4.23; n = 4), and Internet Interventions (Elsevier, IF = 5.36; n = 3). Impact Factors refer to 2022. Results for each category of the quality assessment checklist are presented below, separately for each group (published before the pandemic and during COVID-19). Summary statistical results for dichotomous variables are shown in Table 3.
Design Features
In terms of design features, statistically significant differences between COVID-19 and control articles were found in the use of pre-registration, randomization of participants, and the use of RCTs, with control articles scoring higher than COVID-19 articles in all cases. No statistically significant differences were found in the use of equivalent experimental groups, control groups, active control groups, or research guidelines (e.g., CONSORT; Schulz et al., 2010) (Figure 2, Table 3). It should be noted that only 71 of the 108 studies (66%) used a control group at all. This proportion was numerically higher for articles published before the pandemic (73%) than during the pandemic (55%), although the difference did not reach statistical significance (Figure 2, Table 3).
No statistically significant differences were found in final sample size (MCOVID-19 = 267.38, SDCOVID-19 = 824.66, MdnCOVID-19 = 75.50; MCONTROL = 153.85, SDCONTROL = 204.46, MdnCONTROL = 62; Mann-Whitney U = 0.53, p = 0.597, 95% CI = [0.42; 0.63]; Cohen's d = 0.186, 95% CI = [-0.19; 0.56]) or in the sample size ratio between the experimental and control groups (MCOVID-19 = 1.06, SDCOVID-19 = 0.62, MdnCOVID-19 = 1; MCONTROL = 1.02, SDCONTROL = 0.316, MdnCONTROL = 0.945; Mann-Whitney U = 0.56, p = 0.336, 95% CI = [0.43; 0.69]; d = 0.101, 95% CI = [-0.358; 0.558]). However, the attrition rates of both the experimental groups (MCOVID-19 = 14.90, SDCOVID-19 = 18.84, MdnCOVID-19 = 7.62; MCONTROL = 28.85, SDCONTROL = 22.84, MdnCONTROL = 25.69; Mann-Whitney U = 0.30, p < 0.001, 95% CI = [0.21; 0.40]; d = 0.67, 95% CI = [0.26; 1.07]) and the control groups (MCOVID-19 = 11.43, SDCOVID-19 = 16.45, MdnCOVID-19 = 1; MCONTROL = 21.18, SDCONTROL = 19.76, MdnCONTROL = 16.13; Mann-Whitney U = 0.32, p < 0.001, 95% CI = [0.21; 0.46]; d = 0.553, 95% CI = [0.03; 1.02]) did differ, with articles published during COVID-19 showing significantly lower attrition rates (Figure 3).
Blinding
No statistically significant differences were found in the reported blinding of participants, interveners, or data analysts. However, it is noteworthy that, overall, the reported use of blinding of any sort (and, by implication, of double-blind designs) was extremely rare in both samples of articles: 5.66% during the pandemic and 0% in the historical controls (Figure 4). We assume that there is no tradition of reporting these characteristics, given the nature of the interventions and studies within clinical psychology.
Statistical Analysis
Articles published during the pandemic relied significantly more often on the analysis of complete cases, without accounting for the percentage of the sample that did not complete the intervention. For the remaining checklist items, no statistically significant differences were found, although it is noteworthy that articles published during COVID-19 showed the highest proportion of “no information/not applicable” responses (Figure 5, Table 3).
Replicability
Articles published during the pandemic were assessed to be less replicable by other researchers. No statistically significant differences were found concerning the use of replicable and validated dependent variables (Figure 6, Table 3). The difference was therefore driven by other aspects, chiefly the specificity of the interventions' descriptions, which often did not provide sufficient detail to enable consistent application by other researchers or clinicians.
Reporting and Publishing
Articles published during COVID-19 were more likely to be published in Open Access than articles published before the pandemic, and they underwent a shorter acceptance time (Mann-Whitney U = 0.32, p < 0.01, 95% CI = [0.21; 0.44]; d = 0.58, 95% CI = [0.12; 1.04]) (Figures 7 and 8). No statistically significant differences were found in terms of full reporting of key results, conflicts of interest, or the types of funding received (Figure 7).
The twenty-seven conflicts of interest explicitly reported in the articles related to the authors' roles as developers of the online apps being assessed, income from books on psychopathology and interventions, and/or collaborations with, or royalties from, private companies.
Table 3. Summary statistics for the dichotomous checklist variables (COVID-19 vs. control articles).

| Variable | N | χ² | p | χ² (cc) | p (cc) | OR | 95% CI |
|---|---|---|---|---|---|---|---|
| *Design features* | | | | | | | |
| **Pre-registered** | 108 | 6.40 | 0.011 | 5.40 | 0.020 | 0.34 | [0.15, 0.80] |
| **Randomization** | 108 | 14.26 | < .001 | 12.80 | < .001 | 0.21 | [0.09, 0.48] |
| **RCT** | 108 | 16.30 | < .001 | 14.80 | < .001 | 0.19 | [0.09, 0.44] |
| Use of equivalence experimental groups | 108 | 1.55 | 0.213 | 0.95 | 0.330 | 0.50 | [0.17, 1.50] |
| Use of control groups | 108 | 2.40 | 0.122 | 1.81 | 0.179 | 0.53 | [0.23, 1.19] |
| Use of active control groups | 71 | 1.16 | 0.280 | 0.71 | 0.400 | 0.60 | [0.23, 1.53] |
| Use of guidelines | 108 | 1.25 | 0.264 | 0.80 | 0.372 | 0.60 | [0.25, 1.47] |
| *Statistical analysis* | | | | | | | |
| Power analysis | 108 | 1.19 | 0.274 | 0.799 | 0.372 | 0.65 | [0.29, 1.42] |
| Sociodemographic equivalence (statistical) | 74 | 1.99 | 0.159 | 1.35 | 0.245 | 0.50 | [0.19, 1.32] |
| Sociodemographic features accounted for | 71 | 0.0482 | 0.826 | 0 | 1 | 0.90 | [0.33, 2.40] |
| Baseline scores equivalence (statistical) | 65 | 0.244 | 0.621 | 0.0471 | 0.828 | 1.32 | [0.44, 3.96] |
| Baseline scores accounted for | 75 | 2.49 | 0.287 | 2.49 | 0.287 | 2.40 | [0.55, 10.4] |
| Adequate statistical analysis | 108 | 1.62 | 0.202 | 0.78 | 0.377 | 2.87 | [0.53, 15.5] |
| **Exclusive complete-cases analysis** | 101 | 9.39 | 0.002 | 8.15 | 0.004 | 3.85 | [1.59, 9.31] |
| *Reported use of blinding* | | | | | | | |
| Of participants | 108 | 2.87 | 0.091 | 1.22 | 0.268 | 6.87ª | [0.35, 136] |
| Of interveners | 108 | 0.89 | 0.345 | 0.19 | 0.664 | 2.89 | [0.29, 28.7] |
| Of data analysts | 108 | 0.08 | 1 | 5.01e-31 | 0.772 | 1.26 | [0.27, 5.90] |
| *Replicability* | | | | | | | |
| **Replicable interventions** | 108 | 5.88 | 0.015 | 4.86 | 0.027 | 0.33 | [0.13, 0.83] |
| Replicable DV | 108 | 0.937 | 0.333 | 1.11e-30 | 1 | 0.35ª | [0.014, 8.84] |
| Validated DV | 108 | 0.392 | 0.531 | 0.0669 | 0.796 | 0.62 | [0.14, 2.75] |
| *Reporting & publishing* | | | | | | | |
| Fully reported | 108 | 1.98 | 0.160 | 1.35 | 0.245 | 2.00 | [0.75, 5.31] |
| **Open Access** | 108 | 14.6 | < .001 | 12.9 | < .001 | 6.91 | [2.36, 20.2] |
| Open Data | 108 | 1.15 | 0.284 | 0.464 | 0.496 | 2.45 | [0.45, 13.2] |
Note. χ² (cc) = χ² test with continuity correction. OR estimates were calculated by comparing rows (COVID-19/control), and the 95% CI refers to the OR. Domain headers are shown in italics; variables for which a significant effect was observed are shown in bold. ªHaldane-Anscombe correction applied.
Discussion
The COVID-19 pandemic that started in early 2020 has become a large-scale natural experiment that can reveal valuable information about how research works under unusually high pressure. It is well known that the generalized context of increased stress levels after the COVID-19 outbreak led to a higher prevalence of mental health issues in the population (see, for example, Seyed et al., 2021). This, combined with a surge in online communication due to mobility constraints, led to the proliferation of online mental health interventions in the market (see, for example, Sammons et al., 2020). Many of these interventions urgently sought empirical evidence supporting their efficacy, thereby creating a research context in which: (1) the need to obtain information immediately led to speeding up the research process, increasing the risk of mistakes; (2) a rapid peer review process may have contributed to overlooking important details; (3) an already competitive research environment driven by bibliometric indicators was intensified by the urge to publish while the topic remained timely; (4) effective and early solutions were urgently required in a moment of social emergency; and (5) contextual variables may have had a negative effect on researchers' performance (e.g., the stress of being in a highly demanding situation, or work-life balance problems). All these factors have been associated with a negative impact on methodological research quality (Allen, 2021; Bommier et al., 2021; Kapp et al., 2022; Park et al., 2021).
Consequently, we wondered whether the methodological quality of articles addressing the efficacy of online mental health interventions may have decreased during the pandemic compared to before it. This question was additionally motivated by previous reports underlining a decrease in the methodological quality of other health-related research fields in the wake of the COVID-19 outbreak (see Jung et al., 2021). To address this issue, we developed a quality checklist and applied it to a selection of studies published during the COVID-19 pandemic and a set of comparable control studies. Overall, for many indicators, we found poor adherence to best methodological standards across the board, without significant differences between the two samples of articles. In addition, for some indicators we detected signs of lower methodological quality in the articles on online mental health interventions conducted after the COVID-19 outbreak. Specifically, there was a decrease in the use of randomized groups, RCT designs, and pre-registration (which helps reduce questionable practices such as HARKing; Kerr, 1998; Lakens, 2019). In addition, although the difference was not statistically significant, we detected a numerical decline in the use of active control groups in favour of waiting-list control groups. Overall, 25 variables were measured: 17 showed no significant differences, and 4 indicated poorer research quality for research published during the pandemic (pre-registration, randomization, use of RCT designs, and replicability). Additionally, attrition rates were lower in research published during the pandemic, and open-access publishing was more common in these studies.
The decline in the use of randomized groups could have compromised the quality of the evidence after the pandemic. The lack of randomization increases the chance that confounding baseline variables produce systematic differences between groups, in addition to, or instead of, the treatment under test (Reeves, 2008). Although randomization is not a silver bullet guaranteeing perfectly matched groups (Sella et al., 2021), its absence makes it more likely that studies incur range restriction and overestimate effects. Besides inflating the apparent efficacy of treatments, the overestimation of effects leads to an increase in false-positive rates, contributing to the replicability crisis (Fraley & Vazire, 2014; Spring, 2007; Tackett et al., 2017; Tackett & Miller, 2019). Furthermore, the lack of adequate control groups may make it impossible to draw valid conclusions from a study, since the effect of applying an intervention cannot be compared with its absence; in other words, it poses problems in terms of internal validity (Joy et al., 2005). Beyond the comparison between COVID-19 and control articles, we would like to point out that the generalized lack of active control groups (only 35 of 108 studies included one) is an alarming result in itself. What is more, 37 of the 108 studies did not include any kind of control group at all. In clinical research, the inclusion of an active control group, although not sufficient, is often necessary to distinguish improvements due to the treatment from other factors such as the passage of time or general improvement based on the expectation of treatment (e.g., Boot et al., 2013).
Articles published during the pandemic also showed a preference for complete-case analysis (as opposed to intention-to-treat analysis). The loss of participants mid-treatment breaks randomization and yields a smaller final sample shaped by variables that were not measured or observed but may be related to the outcome. When the remaining sample is particularly motivated by the intervention (a circumstance that attrition during the intervention can produce), effect sizes may be overestimated (Cuijpers et al., 2010; May et al., 1981; Peduzzi et al., 1993; Sackett & Gent, 1979). Thus, the conclusions that can be drawn about the efficacy of the interventions may be biased (Salim et al., 2008). Another distinguishing feature between the COVID-19 and control articles is the time to acceptance for publication in journals. Articles published during the pandemic had shorter acceptance times than articles published before the pandemic. This is consistent with converging evidence that editorial and peer review processes were faster during the pandemic, which was associated with fewer editorial recommendations for substantial changes or additional experiments and with a more conciliatory and cooperative tone (Horbach, 2021; Kapp et al., 2022). Although this is not necessarily an indication of lower quality, it may reflect more lenient standards in certain studies (see, for example, Horbach, 2021), which may ultimately lead to overlooking methodological problems and to a lack of transparency (Allen, 2021; Jung et al., 2021; Kapp et al., 2022). An alternative explanation for the shorter review times is that reviewers were more motivated given the situation and prioritized these reviews over other tasks, compressing the review in calendar time even though the time spent on the review itself remained equivalent. With our present data, we cannot rule out these alternative explanations.
Articles published during COVID-19 were more often published in open access, but this was not accompanied by an increase in making datasets available to the scientific community. This limitation was shared with articles published before the pandemic: neither before nor during the pandemic was there a tradition of publishing data in repositories (Vanpaemel et al., 2015). This has implications for the replicability debate. Sharing datasets is associated with the strength of the evidence obtained and the quality of statistical reporting of the results (Claesen et al., 2023; Wicherts et al., 2011). It also helps prevent or resolve statistical errors, generate new research questions, and establish collaborations between researchers (Wicherts, 2016). Open Science initiatives thus seem to pay more attention to sharing the article than to sharing the data, generating a 'marketing-oriented open science' phenomenon, partly because of the response cost for individual researchers (e.g., preparing research byproducts to be publicly available without extra resources from their institutions), as suggested by some studies (see Fernández Pinto, 2020; Scheliga & Friesike, 2014; Tenopir et al., 2020; Vines et al., 2014). Given the design of the present study, we cannot tell whether the greater proportion of open-access publications was due to the contextual conditions caused by the pandemic or to the pre-existing trend toward this publishing modality: open-access publishing was already on the rise before the pandemic began, so the causal link cannot be established at present (Belli et al., 2020; Miguel et al., 2016; Tennant et al., 2016).
Beyond the potential before-after differences highlighted above, a secondary but important finding of the present study relates to the generalized problems in research on online interventions for mental health. Of particular relevance are the lack of control groups, absent in about a third of all studies, and the even rarer use of active control groups; the reduced use of randomization to form treatment groups; the lack of sample size calculations supported by a priori power analyses; and the lack of pre-registration of study designs. We also found that the studies in our sample often did not use (or failed to report) blinding strategies, did not usually match groups on socio-demographic variables, did not describe the implemented interventions in sufficient detail to allow replication, and did not provide sufficient details about the data treatment and analyses performed, among other limitations that jeopardize the generalizability of their results.
It is also relevant to highlight several methodological limitations of the work presented in this article. Regarding the design of the checklist, it is possible that we failed to cover all the content necessary to fully assess methodological quality in such a heterogeneous area of research. Furthermore, we encountered difficulties in the evaluation of some items, such as the adequacy of the data analysis: while there is agreement regarding the inadequacy of certain analyses, there is considerably less agreement concerning the correctness of others. In many cases, it was difficult to know whether the reported analysis was ideal for the question posed in the study simply because of insufficient information (e.g., lack of precision in justifying or describing the chosen technique). This has potential implications, as Simmons et al. (2011) point out that flexibility in the selection and reporting of statistical analyses is related to a higher incidence of false positives. As Scheel (2022) indicates, "most psychological research findings are not even wrong" because of their critical underspecification. Some analytical decisions are clearly incorrect (e.g., using separate t-tests for the experimental and control groups), while others are more open to discussion, which made it impossible to be categorical on this item. Another limitation of the checklist is that it is only sensitive to reported information. That is, some of the included studies may have met the methodological standards assessed in our checklist but simply failed to report them. For instance, blinding of participants may have been implemented but not mentioned in the article. Conversely, other information may have been concealed, such as the existence of undeclared conflicts of interest. Yet, failing to report what would seem important information about a research study would be, in itself, an indicator of poor scholarship. Beyond financial conflicts of interest (which are the easiest to detect if reported), non-financial conflicts such as personal experience, academic competition, and the theoretical approach of the researchers could also influence the research process. This issue is not assessed by our checklist because of the inherent difficulties in conceptualizing it. Nevertheless, the fact that such conflicts are not explicitly stated does not mean they are not occurring; rather, they remain an important factor to consider (Bero, 2017).
It is also important to note that the present study cannot establish a causal link between changes in research quality and the pandemic. However, we can hypothesize that the difficulties in working conditions generated by the public health crisis probably played a role. There may have been difficulties in recruiting samples, with those participants available being more committed to the study, hence the reduction in attrition; a more agile and immediate research process may have been necessary, hence the reduced RCT planning and pre-registration. These are speculative but plausible explanations of the patterns observed in this study. Sakamoto et al. (2022) found that 91.1% of a sample of researchers from different fields changed the way they managed their work routine because of the pandemic. Many of the problems mentioned above (pressure to increase the number of publications, difficulties in planning research, conflicting interpretations, and challenges in finding financial support) may have been exacerbated by the pandemic (Fuentes, 2017; Mohan et al., 2020; Zarea Gavgani, 2020). However, there is no systematic literature on all these factors, and their study needs to be expanded.
The present study has revealed an initial but admittedly incomplete picture of the research quality of the burgeoning field of online mental health interventions. Future research should, for example, develop more complete checklists and other evaluation tools that more faithfully capture the characteristics of research addressing online psychological treatments. These could be complemented by interviews and surveys of the authors of the same articles to understand in greater depth which steps were actually taken. We believe that it is necessary to continue working on the evaluation of the methodological quality of online interventional studies, since they are becoming one of the main approaches used in psychology, especially for the treatment of the most frequent mental health problems. Moreover, it is worth exploring how other relevant and more subtle research issues could be included, such as HARKing (hypothesizing after the results are known) and 'spinning' (when the narrative of an article does not correspond to the results obtained and emphasizes the positive results while disguising the negative ones; see, for example, Chiu et al., 2017). It is equally necessary to analyze the context in which research is conducted and the rules of the game that govern it, many of which favour questionable research practices, in order to propose more appropriate practices (Bakker et al., 2012).
Conclusions
Articles addressing the efficacy of online interventions for common mental health problems in clinical psychology were assessed in terms of their methodological quality in two time windows, right before and after the COVID-19 outbreak. We observed a decline affecting various aspects of the research process. Articles published during COVID-19 showed less frequent use of pre-registration, less frequent randomized allocation of participants, fewer RCTs, and more frequent use of complete-case analyses, as well as shorter acceptance times in journals. Most of the detected problems are not new. In fact, our results also revealed important generalized shortcomings in this area of clinical research on online psychological interventions, some of which may have been exacerbated during the pandemic. Given the important consequences for the well-being and mental health of large sections of the population, the present results should motivate a more careful consideration of methodological issues in the field of online psychological interventions in general, and during times of increased societal stress and urgency in particular.
Contributions
Contributed to conception and design: SSF, MAV, LM, MB
Contributed to acquisition of data: CRP, MB
Contributed to analysis and interpretation of data: CRP, MAV, LM, SSF, MB
Drafted and/or revised the article: CRP, SSF, MAV, LM, MB
Approved the submitted version for publication: CRP, SSF, MAV, LM, MB
Acknowledgements
Many thanks to Gonzalo García-Castro for his invaluable classes in the use of ggplot2, and for making the stays in Barcelona so pleasant. Academia is a kinder place because he is in it.
Funding Information
The project was funded by ‘Ayudas Fundación BBVA a Equipos de Investigación Científica SARS-CoV-2 y COVID-19’, grant nº #093_2020.
MAV is funded by AEI / UE (grant CNS2022-135346)
SSF is funded by the AGAUR Generalitat de Catalunya (2021 SGR 00911).
MB is funded by the Serra Hunter programme / Generalitat de Catalunya.
CRP is co-financed by the funds of the Recovery, Transformation, Resilience and Next Generation - EU plan of the European Union. Exp. INVESTIGO 2022-C23.I01.P03.S0020-0000031- Psychiatry Area.
Competing Interests
The authors declare there are no competing interests.
Data Accessibility Statement
The database, the analysis scripts, and the raw results can be found on the project page on the Open Science Framework (“COREVID: COronavirus Research EVIDence evaluation”): https://osf.io/t9gqj/. DOI: 10.17605/OSF.IO/T9GQJ