This study explores the clinical utility of the Clinical Outcomes in Routine Evaluation–Outcome Measure (CORE-OM) and the Clinical Outcomes in Routine Evaluation–10 (CORE-10). This is an exploratory, naturalistic, longitudinal study conducted in two psychotherapy services in Ecuador that offer training for psychology students. These services provide free or affordable short-term psychotherapy based on a constructivist approach. In total, 259 adult clients presenting non-severe mental disorders and symptoms and/or relational problems were included; 147 were women (57%), and ages ranged from 18 to 66 years with a mean of 28.70. At first contact we collected socio-demographic data and responses to the Clinical Outcomes in Routine Evaluation–Outcome Measure, the Outcome Questionnaire-45.2 and the Schwartz Outcome Scale-10-E. Data were analyzed with statistical estimation using bootstrap 95% confidence intervals to assess the clinical utility and psychometric adequacy of the measures for similar services and clients. The measures showed good acceptability and adequate internal consistency, similar to findings from the United Kingdom and Spain. There were strong correlations between all scale scores of the CORE-OM except for risk. Adequate convergent validity was found for the CORE-OM with the other two measures. There were no significant gender, age, or education effects on initial scores. Comparing help-seeking and non-help-seeking populations, large effect sizes were found for the CORE-OM total, non-risk, and risk scores, and a medium effect size for the CORE-10 scores. Cut-off scores for the CORE-OM and CORE-10 were 1.26 and 1.51, respectively. The measures showed sensitivity to change, with large effect sizes for all scores except risk, which showed a medium effect size. The CORE-OM and CORE-10 are suitable for clinimetric use, the former for broad assessment of psychological state and the latter for tracking changes in psychotherapy. Recommendations are made for assessment structure designs and data interpretation for routine use, particularly in Latin America.
Introduction
Psychological assessment instruments have historically been developed and analyzed from the standpoint of psychometric theory (Fava & Belaise, 2005); however, other frameworks related to the clinical utility of measures have emerged in recent decades (Fava et al., 2012; Feinstein, 1983). One of these is clinimetrics, an umbrella framework with conceptual, methodological and practical levels. A central feature of psychometrics is that it looks for unidimensional measurement across items, focusing on factor-analytic construct validity, which may fit poorly with the complexity of many clinical phenomena, particularly those of mental health. The clinimetric focus is on clinical utility rather than on homogeneity or unidimensionality (Cosci, 2021).
In clinimetrics, assessment of validity can include clinical, construct, biological, predictive, incremental, and concurrent validity. Clinical validity is the extent to which instruments can usefully measure severity and reflect the impact of therapy, whether for planning, for using aggregated data, or for tracking changes in individual clients and potentially improving interventions. Clinical utility as acceptability to clients is achieved by avoiding repetitive or synonymous items, keeping the number of items small and aiming to cover the main treatment challenges rather than focusing on one postulated latent construct of differences between clients. Such measures cannot cover the entirety of mental health issues that clients bring to therapies; however, their shortness makes them acceptable to clients given the frequency of use across therapies (Applegate, 1987; Bech, 1984; Carrozzino et al., 2021; Foster & Cone, 1995). Clinimetric evaluation requires sensible attention to the specific aim and clinical context being studied, considering therapists’, clients’ and services’ characteristics; the format of any measure (length, wording, understandability, response format); transferability of scores across gender, age, and socioeconomic status, among others; and reliability when used routinely with the same group throughout a therapeutic process (Bech et al., 1978; Evans & Carlyle, 2021; Feinstein, 1986). In the psychotherapy field several measures have been created for this purpose. For example, the Clinical Outcomes in Routine Evaluation System (CORE; Evans et al., 2002) comprises several measures. All the CORE measures can be used without licence payments and downloaded from https://www.coresystemtrust.org.uk/, which enables their use in settings with scarce resources such as some found in Latin America (LA). The first measure developed from this system was the CORE-OM, OM referring to Outcome Measure. The CORE-OM comprises 34 items covering four domains (Well-being, Problems, Functioning and Risk) that do not fall out as neat factors but provide a global score of psychological distress and a score of risk to oneself and others (Evans, 2012). Although the CORE-OM provides valuable information, it could be considered item-onerous: because of its length, it could feel burdensome or demanding if completed on a sessional basis; in fact, it was created to be used before and after therapy and has typically been used in that way. Shorter, quick and easy-to-use variants such as the CORE-10 (Barkham et al., 2013) complement the CORE-OM and can be used on a session-by-session basis, with repeated assessments for weekly/sessional change monitoring.
Session-by-session assessment of psychological therapies has expanded noticeably in the last two decades (e.g., Cooper et al., 2021; Duncan et al., 2020; Sim et al., 2021). Session-by-session change measurement can help identify patterns and predictors of change that may escape conventional pretreatment-posttreatment measurement, as the latter does not usually assess changes between subsequent sessions (Paz et al., 2021). To assess sessional change using CORE-10 scores, trajectory plots such as cat’s cradle plots (further described in the Methods section), multilevel modeling, and other ways of analyzing what becomes a dataset of time series of varying lengths are particularly helpful. Both assessment approaches, i.e., pretreatment-posttreatment and session-by-session, are valuable and to some extent complementary for tracking psychotherapeutic changes and outcomes.
Tracking psychotherapeutic changes and outcomes is a relevant endeavor for the generation of Practice Based Evidence (PBE), i.e., the collection and analysis of evidence generated in clinical practice (Evans et al., 2003; Margison et al., 2000). From the standpoint of PBE, research and clinical practice are intertwined in a loop, continuously informing each other, and personalized feedback based on the collected data can contribute significantly to enhancing psychotherapy quality and outcomes (Miller et al., 2016). Data can be explored to assess outcomes within a service by comparing subgroups of clients (Evans et al., 2003), individual change can be tracked, and results can be compared between services. The overall philosophy of PBE is expanded clinimetrically, with practitioners as the intended audience, in Evans & Carlyle (2021).
One element of PBE explorations is Routine Outcome Monitoring (ROM). When conducting ROM, all clients of a service are invited to participate and to complete one or more outcome measures on a routine basis. To understand the data, both the psychometric properties of the measures and their clinimetric utility must be understood, but this is limited, particularly in LA, by a shortage of reports about psychotherapy outcome measures generally. Compounding that regional deficit, most psychometric publications from LA focus mainly on symptom-specific measures rather than global measures (Paz et al., 2021). Exploration of the psychometric properties and clinical utility of global outcome measures is needed to identify valid systems for measuring change and global treatment outcomes that can be widely used and distributed in the region.
The objective of this study is to investigate the clinical utility of the CORE-OM and CORE-10. To achieve this, we examined their psychometric comparability (including acceptability and psychometric properties) and evaluated their suitability as clinimetric measures, with a focus on sensitivity to change, the ability to reflect the magnitude of change, and the identification of clinically significant differences. We believe this is the first paper to bring together the traditions of PBE and of clinimetrics in psychotherapy. This paper also (1) draws on data from LA, where the scarcity of both public and private funds has limited the generation of evidence on psychological interventions (Paz & Evans, 2019), (2) highlights the practicalities of the clinimetric PBE approach, and (3) suggests where parameters must draw on local evidence and how PBE designs can be adapted to local needs.
Method
Design
This is an exploratory, naturalistic, longitudinal study presenting the first exploration of the clinical utility of two global measures of psychological distress (the CORE-OM and the CORE-10 embedded within the CORE-OM) in a help-seeking population in Ecuador, using the scores both first-versus-last and session-by-session to describe outcomes and trajectories of change within psychological interventions.
Participants
Participants were clients attending two clinics offering psychological services for non-severe mental disorders and symptoms (e.g., anxiety or depression) and relational problems. Participation was voluntary, was offered by the therapist, and informed consent was obtained from those willing to participate. Clients under 18 years are below the age of consent in Ecuadorian law, so they were excluded from this study.
Ethics
This study was approved by the Ethics Committee of the Universidad San Francisco de Quito, Ecuador (ref. 2017-113E).
Measures
The primary measure was the CORE-OM. Two other measures, the Outcome Questionnaire-45.2 (OQ-45.2) and Schwartz Outcome Scale-10 (SOS-10) were included at first contact only as a check on convergent validity.
CORE-OM
The CORE-OM (Evans et al., 2000, 2002) is a 34-item self-report questionnaire. Items are scored on a Likert scale from 0 (“never”) to 4 (“always or almost always”). Items cover four domains: subjective well-being, problems, functioning, and risk. Good internal reliability, sensitivity to change, test–retest stability, convergent validity in relation to other measures and discrimination between clinical and non-clinical populations have all been reported for the original version (Evans et al., 2002). The Spanish translation of the CORE-OM developed and used in Spain showed similar psychometric properties to the original version (Trujillo et al., 2016). In Ecuador, Paz et al. (2020) conducted a psychometric exploration in a non-help-seeking population (i.e., people not seeking psychotherapy or using psychotropic medication), with results similar to the original and translated versions. In our data, total score reliability as Cronbach’s alpha was .94 [.93, .95] (95% confidence intervals are shown as [xx, yy] throughout the paper).
CORE-10
The CORE-10 (Barkham et al., 2013) comprises ten items from the CORE-OM. The authors developed this abbreviated measure to be used on a session-by-session basis, reducing the time and burden of completing and scoring a lengthier measure. The items were selected to enable measurement of depression while retaining broader coverage of psychological distress: six items were taken from the problems domain, three from functioning and one from risk. No well-being items were selected, owing to their high correlation with the problems domain. This short version has shown good practicality and psychometric properties (Barkham et al., 2013). In our data, Cronbach’s alpha was .81 [.78, .84]. While multiplication by ten is commonly suggested for easier score comprehension (Barkham et al., 2013), we present the mean score throughout the paper to facilitate comparisons with CORE-OM scores.
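To make the two conventions concrete, the following minimal R sketch (using a purely hypothetical vector of item responses) shows the mean-item score used in this paper and the multiplied-by-ten score suggested by Barkham et al. (2013):

```r
# Minimal sketch of the two CORE-10 scoring conventions discussed above,
# using a hypothetical vector of ten item responses (each scored 0-4).
core10_items <- c(2, 3, 1, 2, 4, 3, 2, 1, 2, 3)

mean_score     <- mean(core10_items)  # mean-item score, the convention used in this paper
clinical_score <- 10 * mean_score     # the "multiply by ten" convention of Barkham et al. (2013)

mean_score      # 2.3
clinical_score  # 23
```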
OQ-45.2
The OQ-45.2 (Lambert et al., 1996) is a 45-item self-report questionnaire constructed to assess treatment outcomes in mental health settings. Items are rated on a five-point Likert scale scored 0 to 4. Both the original version in English (Lambert et al., 2010) and the Spanish version tested in the Chilean population (von Bergen & De la Parra, 2002) have demonstrated acceptable psychometric properties. In our data, Cronbach’s alpha was .91 [.89, .93].
SOS-10-E
The SOS-10 (Blais et al., 1999) is a brief, 10-item self-report scale measuring psychological health and well-being. It has shown satisfactory psychometric properties for the original English version (Blais et al., 1999), for the Spanish version (SOS-10-E; Rivas-Vazquez et al., 2001), and when used in Ecuador (Paz et al., 2020). Data in this study showed a Cronbach’s alpha of .93 [.92, .94].
Procedures
The participating services, the Red Ecuatoriana de Psicología por la Diversidad LGBTI (REPsiD) and the Centro de Psicología Aplicada (CPA) of the Universidad de Las Américas, are both located in Quito, the capital of Ecuador, a city with a population of almost 3 million people. Both are private services with constructivist foundations that provide therapist training programs, with a therapist and a co-therapist (in training) present in each therapy session. At the REPsiD there is no charge for the services, while at the CPA the cost is low. More information about the training models these centers use is given in Zúñiga-Salazar et al. (2021) and Valdiviezo-Oña et al. (2022), respectively.
Participants completed the CORE-OM, the OQ-45.2 and the SOS-10-E before the first session, and the CORE-OM on a session-by-session basis. Participants were enrolled from January 2018 until the pre-planned termination of recruitment in March 2020.
Analyses
Our epistemological position is pragmatic, and the statistical approach is one of pre-planned description with estimation drawing on bootstrap 95% confidence intervals (CIs) around observed sample statistics; bootstrap CIs were used to avoid distributional issues. Analyses were exploratory and descriptive, designed (1) to explore the comparability of the psychometric findings of the help-seeking sample with the original United Kingdom (UK) reports (Evans et al., 2000, 2002) and the Spanish version (Trujillo et al., 2016); and (2) to explore utility for a help-seeking population in Ecuador.
The analyses aimed to follow the Problem, Plan, Data, Analysis, Conclusions and communication (PPDAC) approach (MacKay & Oldford, 2000; Spiegelhalter, 2020) to base our course of action on a problem-oriented, solution-seeking process. The Problem was to assess the psychometric comparability (acceptability, psychometric properties) of the CORE-OM and CORE-10 and their suitability as clinimetric measures, focusing on sensitivity to change, reflection of the magnitude of change and clinically significant differences. The Plan was to collect and analyse routine data, and the Data were item-level data from the measures as described. The Analyses followed those of the UK (Evans et al., 2002), Spanish (Trujillo et al., 2016) and Ecuadorian non-help-seeking studies (Paz et al., 2020). Acceptability was indicated by omission rates; internal reliability was evaluated using Cronbach’s alpha (Cronbach, 1951) and McDonald’s Omega (ω). For correlations between questionnaire scores and between scores and age, we report Pearson correlations with bootstrap 95% CIs using the CECPfuns package (Evans, 2021). Gender and education effects are reported as mean differences and Hedges’ g effect sizes, again with bootstrap CIs.
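For readers wishing to reproduce this kind of estimation, the following R sketch illustrates a percentile bootstrap 95% CI for Cronbach’s alpha. It is an illustration only, not the CECPfuns code used for the analyses, and `items` and `core_om_items` are hypothetical object names standing for a complete-case data frame of item responses.

```r
# Cronbach's alpha from a data frame of complete item responses (one column per item).
cronbach_alpha <- function(items) {
  k <- ncol(items)
  item_vars <- apply(items, 2, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Percentile bootstrap CI, resampling clients (rows) with replacement.
boot_alpha_ci <- function(items, B = 1000, probs = c(.025, .975)) {
  n <- nrow(items)
  boots <- replicate(B, cronbach_alpha(items[sample(n, replace = TRUE), ]))
  quantile(boots, probs)   # percentile method, as used throughout the paper
}

# set.seed(1); boot_alpha_ci(core_om_items)   # hypothetical data frame of 34 CORE-OM items
```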
To have reference scores to distinguish between help-seeking and non-help-seeking Ecuadorian populations, Clinically Significant Change (CSC) criteria were computed using Jacobson & Truax’s (1991) criterion c, which is the variance-weighted midpoint between the means of the functional and dysfunctional populations (see Evans et al., 1998). It is worth noting that the CSC cut-off is not a precise criterion nor an indicator of whether someone should receive treatment; rather, it provides a logical differentiation between typical help-seeking and non-help-seeking scores and a base from which to compute the rate of clinically significant change. Data from the non-help-seeking (students and general population) sample of Paz et al. (2020) were used, and 95% bootstrap CIs for the CSC (from the getBootCICSC function in the CECPfuns package) are reported. These CIs enable us to compare our observed cut-off points with those reported for other languages. The Reliable Change Index (RCI) of Jacobson et al. (1984) was also computed, as it allows estimation of when change might have arisen from unreliability of the measure alone. In line with the earlier papers, a 95% RCI interval is used for the CORE-OM and 90% for the CORE-10; the wider interval provides some compensation for the lower internal reliability of the latter measure. A Jacobson plot (Figure 2) is included to summarise the breakdown of change into five categories: no reliable change, reliable deterioration, reliable improvement while staying above the CSC cut-off, reliable improvement while staying below the cut-off, and clinical and reliable improvement. We also include violin plots (Figure 3) showing the distribution of participants’ aggregate CORE-10 scores (from the items embedded in the CORE-OM) on each occasion. In addition, to show individual change, we show cat’s cradle plots (Figure 4), which are trajectory graphs mapping scores on the y axis against assessments on the x axis.
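Both reference quantities can be computed from summary statistics alone. The R sketch below implements criterion c and the RCI criterion as defined above; argument names are illustrative.

```r
# Jacobson & Truax criterion c: variance-weighted midpoint between the clinical
# (help-seeking) and non-clinical (non-help-seeking) means.
csc_cutoff <- function(m_clin, sd_clin, m_nonclin, sd_nonclin) {
  (sd_clin * m_nonclin + sd_nonclin * m_clin) / (sd_clin + sd_nonclin)
}

# RCI criterion: smallest change unlikely to arise from measurement unreliability alone.
rci_criterion <- function(sd_baseline, reliability, ci_width = 0.95) {
  z <- qnorm(1 - (1 - ci_width) / 2)              # 1.96 for 95%, about 1.64 for 90%
  s_diff <- sd_baseline * sqrt(2) * sqrt(1 - reliability)
  z * s_diff
}
```

For example, plugging the CORE-OM total means and SDs from Table 4 into csc_cutoff() gives approximately 1.27, close to the 1.26 [1.22, 1.31] reported in the Results (the small difference reflects rounding of the tabled values).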
Most analyses are reported for the CORE-OM total score, the non-risk score (i.e., the 28 items not assessing risk), the risk score and the CORE-10 total score, though the violin plot and the cat’s cradle plots are reported only for the CORE-10 score. For the mean change effect size we report Cohen’s dz, i.e., the mean change divided by the SD of the change values. All analyses were conducted using R version 4.1.0 (R Core Team, 2021), and bootstrap CIs used the percentile method with 1,000 bootstrap replications.
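As a minimal sketch of the effect size just defined (with `change` standing for a hypothetical vector of last-minus-first scores, one per client):

```r
# Cohen's dz with a percentile bootstrap CI. The tables report the magnitude of dz,
# so a negative value (improvement on the CORE scales) appears as its absolute value.
cohens_dz <- function(change) mean(change) / sd(change)

boot_dz_ci <- function(change, B = 1000, probs = c(.025, .975)) {
  boots <- replicate(B, cohens_dz(sample(change, replace = TRUE)))
  quantile(boots, probs)
}
```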
Results
Demographics
Out of 259 participants, 147 were women (57% [51, 63]), 109 were men (42% [36, 48]) and gender was missing for three participants (1%). Ages ranged from 18 to 66 (M = 28.70 [27.57, 29.94], SD = 9.44). See Figures S1, S2 and S3 in the Supplementary material for more information regarding age. In relation to education, just three participants (1.1%) reported completing only elementary school (six years of education), 38 (15%) had completed high school (12 years), 189 (72.9%) reported completing higher education, 21 (8%) had completed a post-baccalaureate cycle (technical training), and for 3% this information was missing. Regarding work status, 154 participants reported being in work (59% [53, 65]) with a mean age of 31.12 [29.63, 32.63], 104 were not working (40% [34, 46]) with a mean age of 25.21 [23.70, 26.99], and this information was missing for one participant. The number of sessions attended ranged from 1 to 20 with a mean of 6.3 and a median of 5. As supplementary material (Table S1) we have included demographic information for our sample and for the general populations of Quito and Ecuador (Instituto Nacional de Estadísticas y Censos de Ecuador, 2010) within the age range of our sample (18 to 66). Compared to the populations of Quito and Ecuador, the mean age of our sample is lower (Quito M = 36.65, SD = 13.01; Ecuador M = 36.72, SD = 13.19), the proportion of women is higher (Quito = 52.1%; Ecuador = 51.0%), as is the proportion of persons who attained higher education (Quito = 35.4%, Ecuador = 22.2%), and a notably lower proportion of the clients in our sample were employed (Quito = 71.4%, Ecuador = 63.9%).
Acceptability
Of the 259 participants, 228 (88%) completed all 34 CORE-OM items pretreatment, and 259 (100%) completed at least 31 items which allowed item data to be prorated for an overall score. Items 3 (omission rate = 2.1%) “I have felt I have someone to turn to for support when needed”, 26 (1.3%) “I have thought I have no friends”, and 17 (1.3%) “I have felt overwhelmed by my problems” were the most omitted.
As noted, the CORE-OM has scores for risk and non-risk as well as the total score, and a CORE-10 score can be computed from the 10 items embedded within the CORE-OM. For the 28 non-risk items, 231 (89.2%) clients completed all items and 259 (100%) completed at least 26 of them; 255 completed all the risk items (98.5%; no prorating allowed); 246 completed all the CORE-10 items (95%) and 259 (100%) completed at least 9 items. For the SOS-10-E, 243 completed all the items (93.8%) and 251 (96.9%) completed at least 8 items (the authors allow prorating of up to two missing items); 170 (65.6%) completed all OQ-45.2 items and 241 (93.1%) completed the minimum necessary 41 items.
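A minimal R sketch of one way to implement the prorating rule described above (the thresholds are those just reported; for service use the official CORE scoring guidance should be followed):

```r
# Prorated mean scoring: return a score only if enough items were completed, otherwise NA.
# `responses` is a hypothetical numeric vector with NA for omitted items.
prorated_mean <- function(responses, n_items, min_completed) {
  stopifnot(length(responses) == n_items)
  if (sum(!is.na(responses)) < min_completed) return(NA_real_)  # too many omissions
  mean(responses, na.rm = TRUE)                                 # mean of completed items
}

# prorated_mean(core_om_responses, n_items = 34, min_completed = 31)  # CORE-OM total
# prorated_mean(core10_responses,  n_items = 10, min_completed = 9)   # CORE-10
```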
Internal Consistency
Cronbach’s alpha and McDonald’s Omega were used to evaluate the internal consistency for non-risk, risk, CORE-OM total, and CORE-10 scores. These were: .94[.93, .95] for the CORE-OM total score, .85[.81, .88] for the risk score, .94[.92, .95] for the non-risk score and .81[.78, .84] for the CORE-10 score. Omega coefficients were: .94 for the CORE-OM total score, .86 for the risk score, .92 for the non-risk score and .82 for the CORE-10 score.
Comparability: Reliability versus UK and Spain Studies
Figure 1 shows the Cronbach’s alpha coefficients for Ecuador, the UK and Spain. Dotted lines show referential data (UK and Spain); horizontal lines are for any gender in Ecuador. Values for the CORE-OM total and non-risk scores are comparable with those from the UK and Spain, but the coefficient for the risk score is higher, and that for the CORE-10 score lower, in our study than in the UK study.
Validity: Correlations Between CORE Scores
As expected, all CORE scores were strongly associated, with the lowest correlations involving the risk score (Table 1).
Table 1. Correlations between CORE scores.

Score | Risk [95% CI]a | Non-risk [95% CI]a | CORE-OM total [95% CI]a
---|---|---|---
Non-risk | .62 [.54, .68] | |
CORE-OM total | .74 [.69, .79] | .99 [.98, .98] |
CORE-10 | .67 [.59, .72] | .94 [.93, .95] | .95 [.93, .96]
Note: a Bootstrap CIs used the percentile method with 1,000 bootstrap replications. b Risk refers to the total score only of the items from the CORE-OM that assess risk to self or to others; Non-risk refers to the total score of all items from the CORE-OM that do not assess risk to self or to others.
Validity: Convergent
Correlations with the SOS-10-E and OQ-45.2 total scores were also strong, supporting convergent validity (Table 2). Correlations with the SOS-10-E were negative, as higher SOS-10-E scores represent lower levels of distress. Again, the risk score showed the lowest correlations (SOS-10-E = −.49, OQ-45.2 = .62). See Spearman correlations in the supplementary materials (Table S2).
Table 2. Convergent validity: correlations of CORE scores with the SOS-10-E and OQ-45.2.

Questionnaire | M (SD) | CORE-OM total r [95% CI]a | Risk r [95% CI]a | Non-risk r [95% CI]a | CORE-10 r [95% CI]a
---|---|---|---|---|---
SOS-10-E | 35.6 (13.0) | -.80 [-.85, -.72] | -.49 [-.58, -.39] | -.81 [-.86, -.74] | -.76 [-.81, -.70]
OQ-45.2 | 73.9 (22.5) | .90 [.86, .93] | .62 [.52, .71] | .90 [.86, .93] | .86 [.82, .90]
Note: a Bootstrap CIs used the percentile method with 1,000 bootstrap replications. b Risk refers to the total score only of the items from the CORE-OM that assess risk to self or to others; Non-risk refers to the total score of all items from the CORE-OM that do not assess risk to self or to others.
Gender, Age and Education Effects
Despite the moderately tight confidence intervals, there were no statistically significant gender differences (see Table S3 in the Supplementary materials). The mean gender difference is relatively small for all scores. This is a very common socio-operational issue: gender differences that may be clear in non-help-seeking population samples are often attenuated or removed in help-seeking samples because of an element of self-selection, as generally the higher people’s scores are, the more likely they are to seek help. Loess smooth regression (see Figure S4 in the Supplementary materials) showed no statistically significant relationship between age, gender and the CORE-OM or CORE-10 total scores. Similarly, there were no statistically significant associations between educational level and the CORE-OM and CORE-10 total scores (see Figures S5 and S6).
Scores Distribution Split by Gender
For referential use we show distribution statistics for the CORE scores in Table 3. For distributional information on more percentiles see Supplementary materials (Table S4 for help-seeking sample and Table S5 for non-help-seeking sample).
Table 3. 95th percentiles and maxima of CORE scores, split by gender.

Score | Men: 95th percentile [95% CI]a | Men: Max | Women: 95th percentile [95% CI]a | Women: Max
---|---|---|---|---
CORE-OM total | 2.92 [2.65, 3.12] | 3.20 | 2.76 [2.56, 3.06] | 3.41
Risk | 2.50 [2.33, 3.00] | 3.16 | 2.17 [1.97, 3.15] | 4.00
Non-risk | 3.03 [2.86, 3.21] | 3.39 | 2.88 [2.71, 3.22] | 3.75
CORE-10 | 2.90 [2.70, 3.30] | 3.60 | 2.87 [2.70, 3.18] | 3.30
Note: Max = Maximum score. a Bootstrap CIs used the percentile method with 1,000 bootstrap replications. b Risk refers to the total score only of the items from the CORE-OM that assess risk to self or to others; Non-risk refers to the total score of all items from the CORE-OM that do not assess risk to self or to others.
Validity: Help-seeking versus Non-help-seeking (students and community) Populations
Table 4 shows means, mean differences and Hedges’ g effect sizes comparing the help-seeking sample with the non-help-seeking sample and with the student subsample of the non-help-seeking sample. Large effect sizes were found for the CORE-OM total, non-risk and risk scores, and a medium effect size for the CORE-10.
Table 4. Comparison of CORE scores between the help-seeking sample, the non-help-seeking sample, and the student subsample of the non-help-seeking sample.

Score | Help-seeking: Mean [95% CI]a | Help-seeking: SD | Non-help-seeking: Mean [95% CI]a | Non-help-seeking: SD | Help-seeking vs. non-help-seeking: Mean difference [95% CI]a | Help-seeking vs. non-help-seeking: Hedges’ g [95% CI]a | Students: Mean [95% CI]a | Students: SD | Help-seeking vs. students: Mean difference [95% CI]a | Help-seeking vs. students: Hedges’ g [95% CI]a
---|---|---|---|---|---|---|---|---|---|---
CORE-OM total | 1.70 [1.61, 1.78] | 0.67 | 0.93 [0.89, 0.96] | 0.52 | 0.77 [0.68, 0.86] | 1.37 [1.21, 1.58] | 1.04 [0.99, 1.08] | 0.54 | 0.66 [0.57, 0.75] | 1.13 [0.95, 1.32]
Risk | 0.76 [0.66, 0.87] | 0.84 | 0.28 [0.25, 0.30] | 0.42 | 0.48 [0.38, 0.60] | 0.89 [0.72, 1.08] | 0.32 [0.28, 0.36] | 0.46 | 0.44 [0.33, 0.55] | 0.73 [0.56, 0.91]
Non-risk | 1.90 [1.81, 1.98] | 0.69 | 1.07 [1.03, 1.11] | 0.57 | 0.83 [0.74, 0.92] | 1.38 [1.21, 1.58] | 1.19 [1.14, 1.24] | 0.59 | 0.71 [0.61, 0.81] | 1.13 [0.95, 1.31]
CORE-10 | 1.74 [1.66, 1.82] | 0.71 | 1.37 [1.34, 1.39] | 0.47 | 0.37 [0.29, 0.47] | 0.71 [0.53, 0.88] | 1.45 [1.41, 1.49] | 0.47 | 0.30 [0.19, 0.39] | 0.53 [0.35, 0.72]
Note: a Bootstrap CIs used the percentile method with 1,000 bootstrap replications. b Risk refers to the total score only of the items from the CORE-OM that assess risk to self or to others; Non-risk refers to the total score of all items from the CORE-OM that do not assess risk to self or to others.
Cut-off Scores
The CSC cut-off score for the CORE-OM is 1.26 [1.22, 1.31] for all participants, 1.25 [1.18, 1.32] for men and 1.28 [1.22, 1.34] for women. For the risk score, it is 0.43 [0.40, 0.48] for all participants, 0.46 [0.40, 0.52] for men and 0.42 [0.36, 0.47] for women. For the non-risk score it is 1.44 [1.39, 1.49] for all participants, 1.42 [1.35, 1.50] for men and 1.46 [1.40, 1.53] for women. In the UK, Barkham et al. (2006) and Connell et al. (2007) reported a cut-off score of 1.0 for the CORE-OM total score, whereas Evans et al. (2002) reported cut-offs of 1.19 for men and 1.29 for women. In South Africa, Young (2009) reported 1.19 for men and 1.29 for women, while in Spain, Trujillo et al. (2016) reported cut-offs of 1.06 for men and 1.13 for women. For the CORE-10 our data show a cut-off point of 1.51 [1.47, 1.55] for all participants, 1.59 [1.53, 1.65] for men and 1.61 [1.56, 1.66] for women, whereas Barkham et al. (2013) reported a cut-off point of 1.1 for the whole sample. These values are useful for distinguishing between help-seeking and non-help-seeking populations.
Sensitivity to Change
In Table 5 we present mean differences between the first and last completion of the measures and effect sizes assessed through Cohen’s dz. Large effect sizes were obtained for all scores except risk, which showed a medium effect size.
Table 5. Sensitivity to change: mean first-to-last differences and effect sizes.

Score | Mean difference (SD) | 95% CIa | Cohen’s dzb [95% CI]a
---|---|---|---
CORE-OM total | -0.57 (0.60) | [-0.65, -0.48] | 0.95 [0.82, 1.11]
Risk | -0.43 (0.72) | [-0.54, -0.33] | 0.61 [0.49, 0.73]
Non-risk | -0.60 (0.64) | [-0.69, -0.50] | 0.93 [0.81, 1.09]
CORE-10 | -0.57 (0.65) | [-0.67, -0.48] | 0.87 [0.73, 1.04]
Note: a Bootstrap CIs used the percentile method with 1,000 bootstrap replications. b Mean change divided by the standard deviation of the change values. c Risk refers to the total score only of the items from the CORE-OM that assess risk to self or to others; Non-risk refers to the total score of all items from the CORE-OM that do not assess risk to self or to others.
Reliable Change
Using the full pretreatment CORE-OM score, 169 clients (93%) scored in the dysfunctional range and 13 (7%) in the functional range; for the CORE-10 scores the counts were 135 (74%) and 47 (26%), respectively. Using the full posttreatment CORE-OM score, 76 clients (42%) scored in the dysfunctional range and 107 (58%) in the functional range; for the CORE-10 scores the counts were 58 (32%) and 125 (68%), respectively.
Figure 2 shows the Jacobson plot of change considering the CORE-OM and the CORE-10 scores. The RCI for the CORE-OM using the usual 95% inclusion width is 0.40, while the RCI for the CORE-10 with a 90% width (following the calculation from the UK to compensate for possible reliability bias) is 0.62. Table 6 displays the distribution of clients based on the intersection of the categories of clinical change and reliable change.
Table 6. Clinical change by reliable change categories for the CORE-OM and CORE-10 scores.

Score | Clinical change | Reliable deterioration | No reliable change | Reliable improvement | Total
---|---|---|---|---|---
CORE-OM | Clinically deteriorated | 0.0% (0) | 0.5% (1) | | 0.5% (1)
CORE-OM | Stayed low | 0.0% (0) | 6.6% (12) | 0.0% (0) | 6.6% (12)
CORE-OM | Stayed high | 1.1% (2) | 34.1% (62) | 26.9% (49) | 62.1% (113)
CORE-OM | Clinically improved | | 4.4% (8) | 26.4% (48) | 30.8% (56)
CORE-OM | Total | 1.1% (2) | 45.6% (83) | 53.3% (97) | 100.0% (182)
CORE-10 | Clinically deteriorated | 0.0% (0) | 1.6% (3) | | 1.6% (3)
CORE-10 | Stayed low | 0.0% (0) | 23.1% (42) | 1.1% (2) | 24.2% (44)
CORE-10 | Stayed high | 1.1% (2) | 25.8% (47) | 9.3% (17) | 36.3% (66)
CORE-10 | Clinically improved | | 12.6% (23) | 25.3% (46) | 37.9% (69)
CORE-10 | Total | 1.1% (2) | 63.2% (115) | 35.7% (65) | 100.0% (182)
Note: Some cells are blank because no one can simultaneously show reliable improvement and clinical deterioration, or vice versa.
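For readers who want to reproduce this kind of cross-tabulation with their own data, the following R sketch classifies first/last score pairs using a CSC cut-off and an RCI criterion expressed in score units (e.g., 1.26 and 0.40 for the CORE-OM total score here); the boundary conventions and object names are illustrative only.

```r
# Cross-tabulate clinical change (relative to the CSC cut-off) against reliable change
# (relative to the RCI criterion). `first` and `last` are vectors of client scores.
classify_change <- function(first, last, cutoff, rci) {
  change <- last - first
  reliable <- ifelse(change <= -rci, "Reliable improvement",
              ifelse(change >=  rci, "Reliable deterioration", "No reliable change"))
  clinical <- ifelse(first >= cutoff & last <  cutoff, "Clinically improved",
              ifelse(first <  cutoff & last >= cutoff, "Clinically deteriorated",
              ifelse(first >= cutoff, "Stayed high", "Stayed low")))
  table(clinical, reliable)
}

# classify_change(first_core_om, last_core_om, cutoff = 1.26, rci = 0.40)
```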
Sessional Change: Aggregate Scores against Occasions
Figure 3 shows an extended violin plot intended to give a comprehensive picture of the CORE-10 scores from all 259 clients for all sessions at which scores were completed. Individual scores are easily seen in the late sessions, which were attended by only a few of the starting cohort, but the sheer number of scores from early completions leads to overprinting where more than one client had the same score in the same session. This is handled by both transparency and jittering. The transparency means that where scores overlap despite the jittering in the earlier sessions, the concentration of scores is conveyed by the darkness of the points. That still makes the distribution hard to assess for those early sessions with many scores, which is handled by the violin shapes giving a smoothed distribution of scores (in the width of the shape) for each session. The means are shown by black points, and it is apparent that they drop steadily from the scores at the first session (the horizontal reference line). The precision of estimation of each sessional mean is given by the vertical black lines (joined across sessions by the dotted lines). Violin plotting and estimation of CIs are each of little use when the cell size drops low, so they are not shown for sessions with n less than 20. The actual counts per session are given across the bottom of the plot. This plot cannot show individual trajectories as the cat’s cradle plot does, but it does give a very comprehensive picture of the data from the two clinics. At first glance, clients appear to improve if they attend more sessions but then to deteriorate at the end of treatment; however, this must be interpreted with caution, since the number of clients diminishes as the number of sessions increases and participants stem from two different subsamples.
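For illustration, the main layers of such a plot can be reproduced with ggplot2 roughly as follows. This is a sketch only, assuming a hypothetical long-format data frame `sessional` with columns client_id, session and core10, and it omits the bootstrap CI bars of the published figure.

```r
library(ggplot2)
library(dplyr)

# Violins and means only for sessions with at least 20 scores, as in Figure 3.
dense <- sessional %>% group_by(session) %>% filter(n() >= 20) %>% ungroup()

ggplot(sessional, aes(x = factor(session), y = core10)) +
  geom_violin(data = dense, scale = "width", fill = NA) +    # smoothed distributions
  geom_jitter(width = 0.15, alpha = 0.3, size = 0.8) +       # individual scores
  stat_summary(data = dense, fun = mean, geom = "point") +   # sessional means
  geom_hline(yintercept = mean(sessional$core10[sessional$session == 1]),
             linetype = "dashed") +                          # first-session reference line
  labs(x = "Session", y = "CORE-10 score")
```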
Sessional Change: Individual Scores against Occasions
In Figure 4 we show a plot depicting the trajectory of change of the clients against the sessions they attended. The plot shows whether each client improved, deteriorated, or did not change reliably. As can be seen, the trajectories of change are diverse, but most clients show improvement.
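A companion sketch for this kind of trajectory plot, using the same assumed `sessional` data frame as above (a colour aesthetic mapped to a reliable-change category could be added to mark improvement or deterioration, as in Figure 4):

```r
library(ggplot2)

# One line per client across the sessions they attended ("cat's cradle" style).
ggplot(sessional, aes(x = session, y = core10, group = client_id)) +
  geom_line(alpha = 0.3) +
  geom_point(alpha = 0.3, size = 0.6) +
  labs(x = "Session", y = "CORE-10 score")
```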
Conclusions
One key distinction between a clinimetric and a psychometric understanding of measures is that the former stresses utility for a clinical purpose, not necessarily assuming uniform psychometric properties but only properties sufficient for that purpose, whereas the latter stresses psychometric properties, often particularly factorial cleanliness. The CORE-OM, CORE-10, OQ-45.2 and SOS-10 are all primarily designed for clinical utility. However, that raises the question of utility for what clinical purpose. We will return to that issue, but start by reviewing the simple psychometric exploration of the CORE measures, which was intended to check that nothing in their psychometric performance would rule out use of the measures.
Psychometrics
Our findings indicate that both measures show acceptable or better psychometric properties for assessing psychological distress in help-seeking Ecuadorian samples. This study complements the Paz et al. (2020) report on the CORE-OM for non-help-seeking participants and provides several findings that point towards the clinimetric utility of both the CORE-OM and the CORE-10. Regarding reliability, internal consistency was excellent for the CORE-OM (Cronbach’s α = .94) and good for the CORE-10 (Cronbach’s α = .81), values similar to those found in the UK and in Spain. Regarding convergent validity, we found high correlations between both measures and the SOS-10-E and the OQ-45.2. In another validity check, both measures showed strong mean differences between help-seeking and non-help-seeking samples, with no significant mean gender differences. This enabled us to report cut-off scores that can be useful for other services in Ecuador.
The final and vital psychometric validity check was of sensitivity to change, and both measures showed strong mean improvement in psychotherapy, the CORE-OM more so than the CORE-10. However, as well as the group mean aggregate changes, the data show marked diversity in individual change trajectories (Figure 4). This brings us to the clinimetric rather than the psychometric aspects of this study.
Clinimetrics
Though this study involved participants completing the full CORE-OM at every session, we now believe this is onerous for clients and, like Barkham et al. (2013), we recommend using the full CORE-OM before therapy and when there are planned endings but the CORE-10 session-by-session in between.
Sessional assessment is often promoted because it provides valuable information, not otherwise collectable, from the many clients who opt out of therapy without prior notice. This is important in LA not least because statistically significantly higher opt-out rates have been shown for racial minorities and for less well educated and lower income clients (Wierzbicki & Pekarik, 1993). This may make using sessional data particularly important in LA given the continent’s ethnic diversity, low educational levels for at least some subpopulations, and the fact that it is one of the most economically unequal regions in the world (Lopez-Calva et al., 2015).
Change and outcome, i.e., final scores, must always be considered in the light of pretreatment scores. This paper complements Paz et al. (2020) and compares these pretreatment scores in help-seeking clients with the earlier findings for non-help-seeking samples. The two studies combine to provide not only the CSC cutting points reported above but also the referential data in Tables 3 and 4. We believe that pretreatment scores below the CSC should not be a sole basis for excluding people from therapy, given that many who were below the CSC before therapy still showed reliable improvement, far more than the 2.5% of those below the CSC that would be expected from the definition of the RCI.
However, using shorter measures sessionally is about more than having some first/last change score for clients who do not come to a planned ending; its primary advantage is the insight it gives into trajectories of change. This brings up the crucial issue of clinimetric utility: insight for what purpose? Services, and managerial systems particularly, often use first/last change data to compare rates of reliable improvement and of clinically significant and reliable change (“recovery” in some terminologies). While there is some merit in this, it is vulnerable to some challenges: it can overvalue the cutting points and lose statistical power and precision in dichotomising/categorising, and there is a particular problem when using very short, broad-coverage measures like the CORE-10 in that they inevitably have lower internal reliabilities, hence larger RCIs, and hence lower reliable change rates than longer or unifocal measures. This could be seen clearly in our data (see Figure 2 and Table 6), and it seems likely that the reliable and clinically significant change (RCSC) paradigm may need to be replaced with alternative ways of categorising change when the use of shorter sessional measures, together with rates of unplanned endings, impinges on reliable change statistics.
Turning to the positive clinimetric aspects of using shorter measures sessionally, the trajectory plots above, though dense, show how complex change can be across two to twenty sessions in these services. In many “global north” countries aggregation of very large service change datasets has led to the explosion of methods of “embedded change management” or “feedback informed treatment” in which sessional score change is used to redirect therapies. We are looking at these developments in Ecuador (Valdiviezo-Oña et al., 2023). We will continue to build datasets from Ecuador and, we hope, more widely in LA to see if such models can be applied in our settings. Pending those explorations and respecting real world complexity, an essential thread in the PBE and clinimetric approaches, we encourage other researchers and clinicians to evaluate change with such graphs as well as simple first/last summaries.
Like any study, our work has limitations, particularly the size of the dataset which, as the confidence intervals show, cannot define cut-off scores very precisely, so these values should be regarded as preliminary. We will continue to recruit data from these and, we hope, other services, and encourage other researchers and practitioners from the region to conduct studies with the CORE-OM and CORE-10, both to increase total dataset size and to support comparisons across different LA countries and services. Though open to anyone, these two services cater particularly for college students, LGBTIQA+ people and their relatives. Therefore, even though there is diversity amongst these participants, generalizability regarding age, context, socioeconomic status, ethnic origin, and other variables needs more data. Additionally, more data would be beneficial for conducting sensible explorations of client response profiles and treatment duration.
Another issue is that the CORE-10 scores here came not from the standalone instrument but from the ten items embedded within the CORE-OM. The services are now using the standalone CORE-10 on a sessional basis from the second session onwards, and we will report on the data when the datasets are large enough. Having more data will allow us to report other types of analyses, for example subgrouping by therapy duration and by trajectory of change, that are unsuitable for the current sample of 259 participants with its markedly skewed distribution of therapy durations.
Another limitation of our study arises from the differences observed in our sample when compared to the studies conducted by Trujillo et al. (2016) for data in Spain, Evans et al. (2002) for the CORE-OM, and Barkham et al. (2013) for the CORE-10 in the UK. In particular, the age range of our participants is somewhat narrower than that observed in the Spanish study, while the age ranges in the UK studies were considerably broader. Furthermore, the mean age of our participants is notably lower than the mean ages reported in all these prior studies. It is worth noting, however, that, as in these studies, a majority of the participants in our study were women. Despite these limitations, we believe the clinimetric PBE model has much to offer, in LA and globally. In Table 7 we provide several recommendations of assessment structures for using the CORE-OM and CORE-10 with respect to clinimetric utility. Some of the content in Table 7 has already been presented in other works (e.g., Barkham et al., 2013; Evans et al., 2002); here we present a summary. We recommend that CORE-OM scores should not be compared with CORE-10 scores, but where change on one measure has to be compared with change on the other, the CORE-10 scores should be used, whether from the standalone measure or from the ten items within the CORE-OM.
Table 7. Recommended assessment structures for using the CORE-OM and CORE-10, by aim, with their clinimetric utility.

Aim of the assessment | Assessment structure | Score to use | Clinimetric utility
---|---|---|---
Screening or pretreatment assessment: descriptive analyses and decision support | CORE-OM | Full, Non-risk and Risk scores | Comparing individuals’ scores both within the service and with other services; exploring effects such as gender or other sociodemographic variables. Using the CORE-OM gives better internal reliability and coverage of issues than using just the CORE-10.
Outcome (pre-post): detailed analysis of change, but only with planned endings | CORE-OM + CORE-OM | Full, Non-risk and Risk scores | Assessing first/last change and RCI categorisation of change.
Outcome (pre-post): analysis of change regardless of planned or unplanned ending | CORE-OM + CORE-10 | CORE-10 scores, whether from the items within the CORE-OM or the standalone CORE-10 | Assessing first/last change and calculating the RCI; the lower reliability of the CORE-10 means a wider RCI.
Tracking trajectories of change (session-by-session): analysis of trajectories with full data | CORE-OM + CORE-10 (sessional) + CORE-OM | CORE-10 scores, as above | Looking at the change trajectory; information can be used to provide feedback and make decisions about the treatment.
Note: RCI = Reliable Change Index
Finally, freely accessible software and open-source tools for collecting and analyzing outcomes and trajectories of change in psychotherapy are particularly important in low- and middle-income countries such as those in LA. Software created using R (https://www.r-project.org/) and tools built on it, such as formr (https://formr.org/) and Shiny (https://shiny.rstudio.com/), can be used to systematize and automate data collection and to facilitate feedback for therapists and clients. These tools should, at a minimum, offer an easy-to-understand virtual environment for clients; automatically calculate scores and generate plots of the trajectory of change for each client; instantly show or send the information to therapists; indicate whether the client’s score classifies them as part of the functional/non-clinical or dysfunctional/clinical population; and provide reminders of pending or uncompleted assessments. MarBar (https://www.marbarsystem.com/) is one of the few free software packages available in Spanish for routine data collection, showing trajectories of change and outcome information for therapists and services. Moreover, open-code resources such as the CECPfuns R package (Evans, 2021) aim to make analyses more accessible to practitioners and to generate plots easily for those not used to statistical tools. Some Shiny apps (https://shiny.psyctc.org/) are also now available to clinical and research professionals, providing an easy way to analyse data.
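As a purely illustrative sketch of the kind of therapist-facing feedback described (not the code of any of the tools named above), a minimal Shiny app might let a therapist pick a client and view their CORE-10 trajectory against a locally chosen cut-off; the data frame and column names are the hypothetical ones used in earlier sketches.

```r
library(shiny)
library(ggplot2)

cutoff <- 1.51  # CORE-10 cut-off reported in this study; services should use local values

ui <- fluidPage(
  selectInput("client", "Client", choices = unique(sessional$client_id)),
  plotOutput("trajectory")
)

server <- function(input, output) {
  output$trajectory <- renderPlot({
    d <- sessional[sessional$client_id == input$client, ]
    ggplot(d, aes(x = session, y = core10)) +
      geom_line() + geom_point() +
      geom_hline(yintercept = cutoff, linetype = "dashed") +  # cut-off reference line
      labs(x = "Session", y = "CORE-10 score")
  })
}

# shinyApp(ui, server)
```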
Contributions
Contributed to conception and design: JVO, CE, CP
Contributed to acquisition of data: CP
Contributed to analysis and interpretation of data: JVO, CE, CP
Drafted the article: JVO, CP
Revised the article: JVO, CE, CP
Approved the submitted version for publication: JVO, CE, CP
Acknowledgments
We thank all therapists, trainees and clients involved in the data collection process.
Funding
This study was carried out within the project PSI.CPE.21.03, funded by the Dirección General de Investigación y Vinculación of the Universidad de Las Américas.
Competing Interests
The authors declare that they have no conflict of interest.
Data Accessibility Statement
The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions. In line with the requirements of the ethics committee that approved this research, only de-identified participant data in an encrypted file can be made available, along with a data dictionary, to suitably qualified researchers who (a) obtain ethical approval for their proposed analysis; (b) pre-register their statistical analysis plan; (c) provide a signed data-sharing contract which enables data storage and analysis for a time-limited period.