In this paper, we present a model for comparing groups on scale score outcomes. The model has several features that make it well suited to the analysis of scale scores. Because it is based on ordinal regression, it can capture the shape of the data even when the data are highly discrete or display marked ceiling or floor effects. Additionally, the model incorporates hierarchical modelling to create accurate summaries of the differences in scale scores across groups. Statistically, the model assumes the data are ordinal, and hierarchically estimates the entire distribution of each group using factor smooths. Substantively, the model can: estimate location-based, dispersion-based and ordinal descriptive estimates for each group; estimate the uncertainty about these estimates; and perform pairwise comparisons of the different estimates. The estimation method is Bayesian; however, we have created a GUI-based application that users may install on their computer to run the model, reducing the barrier to applying the method. The application takes in the raw data and user input, runs the model, and returns multiple model-based graphical summaries of patterns in the data, as well as tables for more precise reporting. We also share code that allows users to extend the model to additional research contexts.

When studying an outcome variable across multiple groups, it is common practice to analyse the data using analysis of variance (ANOVA) followed by pairwise comparisons with adjustments for multiple comparisons. This practice is sometimes part of the preliminary analyses and sometimes part of the core output of a research endeavour. We write this paper because we believe there are more informative and modern approaches for studying group differences. The approach we lay out is specific to scale scores, i.e. averaged scores of multiple items where each item has the same Likert-type response scale – as such data are ubiquitous in psychological research and practice.

The basis of the approach we present stems from the following beliefs:

The response scale of the data should be given serious consideration during analysis and interpretation of results.

The approach should allow for rich descriptions, communicated using a variety of effect size measures that are relevant to the work of researchers. Additionally, the uncertainty about effect size estimates should be presented (American Psychological Association, 2020, p. 88).

The approach should use modern methods of studying group differences that increase estimation efficiency (Davis-Stober et al., 2018).

We now elaborate on the practices that stem from each of these beliefs.

**Attention to response scale.** In this paper, we focus on scale scores as defined above. Although there are methodological discussions about the adequacy of such scores (e.g. Bauer & Curran, 2015; Flake & Fried, 2020; Grice, 2001; McNeish & Wolf, 2020; Widaman & Revelle, 2022), the practice of creating them persists and will likely continue to. Moreover, there are situations where researchers only have access to the scale scores. Scale scores have peculiar features. First, they are doubly bounded, e.g. usually between 1 and 5 or 1 and 7 depending on the number of response categories and variable coding. Second, scale scores can only take on specific values, and these values are determined by the number of response categories and items. For example, when one averages three items, and each item has five response categories coded as $\{0, 1, 2, 3, 4\}$, the resulting scale scores can only take on 13 ($3 \times (4 - 0) + 1$) distinct values: $0, 0.\overline{3}, 0.\overline{6}, 1, \ldots, 3.\overline{6}, 4$.^{1} Given these features, scale scores can take on a wide variety of distributions. The data may be heaped at specific values, e.g. many respondents selected response patterns that average to a score of 2. Alternatively, the data may exhibit strong ceiling or floor effects. Often, the normal distribution does not adequately describe scale scores. Additionally, if we are to consider the response scale thoughtfully, then researchers should be able to identify points on the response scale that are of substantive interest or clinical relevance. For example, when working with depression scale scores bounded at 1 and 5, with higher scores signifying higher levels of depression, one might identify a score of 3.5 as a threshold for high depression. It would then be of substantive interest to identify the proportion of cases exceeding this threshold in the entire sample, and in each individual group. We present a model that attends to these peculiarities of scale scores.
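As a quick check of this counting rule, the attainable scale scores for a given number of items and integer item coding can be enumerated directly. A small Python sketch (exact fractions avoid floating-point rounding; the function name is ours, purely illustrative):

```python
from fractions import Fraction

def possible_scale_scores(n_items, lowest, highest):
    """Distinct averages of n_items items, each coded with consecutive
    integers from lowest to highest: item sums step by 1, so the
    averages step by 1/n_items."""
    sums = range(n_items * lowest, n_items * highest + 1)
    return [Fraction(s, n_items) for s in sums]

scores = possible_scale_scores(3, 0, 4)   # three items coded 0-4
print(len(scores))                        # 13 distinct values
print([float(s) for s in scores[:3]])     # 0, 1/3, 2/3, ...
```

The count matches the $3 \times (4 - 0) + 1$ rule; nine items on a 1–5 scale would give $9 \times (5 - 1) + 1 = 37$ distinct values.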

**Effect size measures.** It is standard to focus on location differences (e.g. the mean difference) when characterizing differences between groups. However, differences in dispersion may also be of substantive interest (e.g. Raudenbush & Bryk, 1987), so we prefer a modelling approach that can reliably estimate both location and dispersion differences between groups. In addition to location and dispersion, we also consider measures of dominance, specifically the probability that values from one group exceed values from another group (McGraw & Wong, 1992). This measure goes by many names, e.g. common language effect size (McGraw & Wong, 1992) or probability of superiority (PS; Grissom, 1994) – we use the PS label in this paper. Cliff (1993) argues that measures like the PS are often more germane to the work of psychologists than differences in location, and Brooks et al. (2014) provide empirical evidence that the PS is easier to understand than more traditional measures of effect size. Additionally, as mentioned earlier, particular points on a scale score may be clinically relevant, such that it is substantively interesting to estimate, for each group, the proportion of cases above or below a particular value on the response scale – an additional effect size measure. Finally, differences between groups at different quantiles may be substantively interesting (Wenz, 2019). We therefore desire models that can reliably estimate all of the aforementioned effect size measures, while capturing the uncertainty about all of these estimates. Like Gelman et al. (2012), we recommend using graphs for communicating the resulting effect sizes, and we build measures of uncertainty into the different graphical displays.

**Estimation efficiency.** For testing differences on an outcome between multiple groups, we prefer a hierarchical modelling approach (Gelman et al., 2012). Rather than simply adjusting inference to account for multiple comparisons, hierarchical models shift both point estimates and intervals, adopting a sceptical view of differences between groups. Given the sample sizes common in psychological research, Davis-Stober et al. (2018) showed that methods such as hierarchical models improve the overall accuracy of estimates. We note that we do not believe hierarchical modelling solves the problem of multiple comparisons. As Greenland & Hofman (2019) put it, “multiple comparisons controversies are about context and costs”, and relate to decision making. For example, a Bonferroni correction may be justified when there is little reason to believe differences exist between groups and there is a very high cost to false-positive claims. In this paper, we concern ourselves with the simpler aim of summarizing patterns in data on the basis of useful models – hierarchical models excel at this. We recommend Greenland & Hofman (2019) as an entry point into the literature on making decisions using information from models, especially in relation to multiple comparisons.

**Bayesian analysis.** A final practice we adopt is Bayesian model estimation. We acknowledge that most members of our target audience are not familiar with Bayesian modelling. However, familiarity with Bayesian modelling is not required to adopt the approaches we recommend. We provide a Windows application that, once installed, performs all the analyses and produces the relevant graphical outputs – we demonstrate the use of the application in this paper. The application is based on R and Stan (Carpenter et al., 2017) – we share all application code. We also share code based on the brms package (Bürkner, 2017) that allows for running the models directly in R. Familiarity with brms or Stan should allow for modifications to the models we present. Our motivations for adopting a Bayesian approach are similar to those presented by Kruschke (2013). Specifically, Bayesian approaches allow for directly assessing the evidence in favour of a position (e.g. the presence of an effect), and provide a wealth of information relative to comparable frequentist methods. We exploit these benefits of Bayesian statistics in the model we recommend. For readers interested in learning more about Bayesian methods, van de Schoot et al. (2014) provide a gentle introduction, Kruschke (2013) demonstrates a Bayesian analogue of the $t$-test, and McElreath (2020) is an easy-to-read textbook on applied Bayesian data analysis.

**On the limitations of linear regression.** In this paper, we recommend an alternative to linear regression/ANOVA approaches for group comparisons of scale scores. When linear regression is applied to group comparisons, there is an implicit assumption that the data are normally distributed within each group, i.e. that a mean and standard deviation suffice to describe each group’s outcome data. However, as established by the central limit theorem, this normality assumption is often practically unimportant for comparing group means. And violations of the more consequential assumption that the different groups have the same variance can be ameliorated by replacing the standard $t$ and $F$ tests with their Welch analogues. However, these claims of robustness are limited to comparisons of group means. As a simple example, assuming the data are normal implies the mean and median are identical. But scale scores can easily be skewed such that the median differs notably from the mean; a model that assumes the data are normal would then provide incorrect inference about the median or other quantiles. What we offer in this paper is the opportunity to learn more from data by accurately estimating a variety of effects. The model we recommend can estimate the following effect sizes with scale scores: mean, variance and quantile differences, probability of superiority, and the proportion of the sample beneath/above a threshold, along with inference about these effects. The linear regression model is also capable of estimating these effects, but the scale scores would have to be normally distributed for linear regression to yield accurate estimates of any of these quantities besides the mean. In summary, linear regression applied to scale scores is not ‘broken’; we simply present an opportunity to learn more from data.

In the next section of this paper, we describe the model we propose at a conceptual level; the full description of the model is in Appendix A. Then we present a data analysis example – most practising psychologists will get the most benefit from the section on data analysis, and we expect that the reader who understands the results from the data analysis section will be able to interpret the results of the analysis with their own data. Then we end with a discussion section where we review the work we have done, and provide some recommendations for future directions.

## Ordinal regression for psychological scale scores

In this paper, we assume the scale scores are bounded data that take on discrete values. In cases where many items have been averaged to create the scale scores and the scale scores are largely bell-shaped with very few cases at the bounds, we expect a normal distribution to adequately capture the data. However, this is an ideal scenario, and scale scores often exhibit a variety of features that suggest the normal and other candidate parametric distributions are inadequate:

Heaping: Several respondents may select combinations of response categories across the items that average to the same scale score, such that the scale scores are notably heaped at that value.

Notably discrete distributions: Scale scores are sometimes created from as few as three items. When this happens, the distribution of the data is likely to be highly discrete.

Ceiling and floor effects: The measured trait may be relatively low/high in the population such that the resulting scale scores are skewed, with a non-negligible proportion of cases at the scale score bounds.

In each of these situations, insights based on the normal distribution (standard approach) or parametric distributions for bounded data such as the beta distribution (e.g. Moberg et al., 2009; Zou et al., 2010) could potentially mislead an investigator. Hence we prefer a robust approach that can account for the wide variety of features scale scores may exhibit.

The most common approach to robust comparison of two groups is the Mann-Whitney test, as it places no restrictions on the distribution of either group. Cumulative link logistic regression (the most common form of ordinal regression) with a lone binary predictor is practically equivalent to the Mann-Whitney test (Whitehead, 1993). By swapping the Mann-Whitney test for ordinal regression, we arrive at a robust method that can be extended to additional research contexts, and can be enriched with tools from the general regression framework. The idea of using ordinal regression models with continuous data is not new (e.g. Liu et al., 2017; Manuguerra & Heller, 2010), with two implementations in R: the orm() function in the rms package (Harrell, 2021) and the ocm() function in the ordinalCont package (Manuguerra et al., 2020). But we have yet to see these models applied in psychology. The basic model, then, is ordinal regression to test group differences. Results of extensive simulation studies in Liu et al. (2017) suggest that ordinal regression applied to continuous data is relatively robust under a wide variety of data conditions.

In the remainder of this section, we briefly elaborate on the three core building blocks underlying the model we present. Complete details about the model are in Appendix A.

**Why ordinal regression works for scale scores.** Unlike linear regression, ordinal regression treats the data as discrete and, in its simplest form, estimates the proportion of cases with each distinct value. Given this approach, ordinal regression is capable of capturing a wide variety of data shapes, including distributions exhibited by scale scores – making it a robust form of regression. As an example, given a scale score composed of 3 items, each with response categories 1–5, there will be at most 13 ($3 \times (5 - 1) + 1$) distinct scale score values. Hence, a typical ordinal regression model would estimate 13 proportions representing the proportion of cases with each distinct value.^{2} In this paper, we estimate separate sets of proportions for each group in the data, such that the model accurately captures the distribution of each group. Hence, assuming four groups in the example above, the model would estimate 52 ($13 \times 4$) proportions – 13 per group, reflecting the distribution of each group. These proportions are then used to compute the different aforementioned effect sizes – details in Appendix A.
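To illustrate how a set of proportions feeds the effect sizes, here is a toy calculation with made-up uniform proportions over the 13 distinct values from the example above – this is a sketch of the arithmetic, not the paper's estimation code:

```python
import numpy as np

values = np.arange(3, 16) / 3        # the 13 distinct scores 1, 4/3, ..., 5
props = np.full(13, 1 / 13)          # hypothetical estimated proportions (sum to 1)

mean = float(np.sum(values * props))               # expected scale score
cdf = np.cumsum(props)
median = float(values[np.searchsorted(cdf, 0.5)])  # first value with CDF >= 0.5
prop_above_2 = float(props[values > 2].sum())      # mass strictly above 2

print(mean, median, prop_above_2)
```

Any quantile, threshold proportion, or moment of a group's distribution is recoverable this way from that group's proportions, which is what makes the proportions a sufficient basis for all the effect sizes discussed earlier.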

**The role of Bayesian hierarchical modeling.** Estimating a separate set of proportions for each group leads to a model with a large number of parameters, especially when there are several groups to compare or several distinct values of the scale scores. Estimating a large number of parameters can cause overfitting – especially with smaller datasets – such that the conclusions reached on the basis of the model have a high degree of uncertainty. As a solution, we employ hierarchical modeling, which shrinks the group-specific proportions toward a common set of proportions. This reduces the chance of overfitting the data, thus increasing the accuracy of the resulting effect sizes (Davis-Stober et al., 2018; Gelman et al., 2012). Although one can estimate the hierarchical ordinal model we develop outside of a Bayesian framework, as previously stated, we adopt a Bayesian approach for the wealth of information it provides about the resulting effect sizes. Finally, a common concern with Bayesian models is the choice of prior. The model we develop mostly relies on weakly informative priors (Gelman et al., 2008; Lemoine, 2019), i.e. priors that limit the parameters to realistic estimates. In the absence of problem-specific information, we believe these choices will be reasonable for most applications.
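The qualitative effect of this pooling can be shown with a deliberately oversimplified calculation: additive smoothing of each group's observed proportions toward the pooled proportions. This is not the paper's hierarchical model, and `kappa` is a hypothetical pooling strength chosen only for illustration:

```python
import numpy as np

# Observed category counts: a small group (n = 10) with extreme observed
# proportions, and a large group (n = 100) near the pooled pattern.
counts = np.array([[8, 2, 0],
                   [40, 35, 25]], dtype=float)

pooled = counts.sum(axis=0) / counts.sum()   # pooled proportions across groups
n = counts.sum(axis=1, keepdims=True)        # group sizes
kappa = 10.0                                 # hypothetical pooling strength

# Each group's proportions are pulled toward `pooled`;
# the small group moves much further than the large one.
shrunk = (counts + kappa * pooled) / (n + kappa)

print(counts / n)   # raw proportions
print(shrunk)       # partially pooled proportions
```

The small group's extreme observed proportions are pulled well toward the pooled pattern, while the large group barely moves – the same behaviour, in caricature, that the hierarchical model produces.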

**Estimation rather than testing.** The method we present allows for estimation of a wide range of summaries of group differences on the basis of a single, highly flexible model; hence our approach is rooted in an *estimation* paradigm. An alternative approach is to estimate different models that correspond to different hypotheses about the data, after which the models are compared to identify which hypothesis best reflects patterns in the data – this would reflect a *testing* paradigm. For example, Schnuerch et al. (2022) recently developed an ordinal regression approach for comparing two groups on a single ordinal item; the approach tests four competing hypotheses about how the groups are ordered on the ordinal item. We believe testing-based approaches are valuable, as tests allow for moving beyond tabulating research findings to reaching conclusions on questions (Morey et al., 2014). However, we do not believe that most research endeavours in psychology are capable of robust tests of theory. Most studies provide findings that are somewhat reflective of the idiosyncrasies of the particular study, for reasons such as sample specificity and choice of study protocols. Hence, we recommend that researchers describe their findings extensively and report them clearly. Conclusive results may then be obtained on the basis of meta-analysis of clearly reported study findings or meta-studies (Baribault et al., 2018).

## Data analysis

### Data example used in tutorial

In this paper, we analysed differences in perceived discrimination scale scores among students of colour at a large Southern public university. Investigating and acquiring a nuanced understanding of group differences is pivotal to providing culturally responsive care, promoting equity, and reducing health disparities (Alvidrez et al., 2019; Leseth, 2015; *National Minority Health and Health Disparities Research Framework*, 2021). Group differences in discriminatory experiences are particularly important, given discrimination’s association with a number of negative outcomes, such as lower self-esteem, poorer well-being, and decreased mental health (Carter et al., 2019). The extent to which individuals experience and perceive discrimination varies across racial-ethnic groups (Daniller, 2021; Hackett et al., 2020; Lee et al., 2019). Reports of varying frequencies of discrimination are complemented by reports of differing mean scale scores (e.g. Cokley et al., 2017). Another important factor in understanding discriminatory experiences across groups is how one’s race, combined with other social identities (e.g. gender), influences the nature and type of discrimination one experiences. Specifically, scholars have long asserted that the intersections of one’s social identities (e.g. race and gender) can determine the type of discrimination one may experience (Cole, 2009; Essed, 1991; Watson-Singleton et al., 2021). Though some scholars have begun to incorporate intersectional approaches in their research (e.g. M. G. Williams & Lewis, 2019), more research that compares experiences across groups is needed.

We were interested in differences in the reported levels of perceived discrimination at the intersection of gender and ethnoracial groups among students of colour: African American women ($n=41$), African American men ($n=20$), Asian American women ($n=70$), Asian American men ($n=50$), Latine women ($n=59$) and Latine men ($n=37$). The group sample sizes are unbalanced, with some groups relatively small; hence one can expect benefits from adopting a hierarchical approach for these data.

To obtain the scale scores, the responses to nine items measuring everyday discrimination (D. R. Williams et al., 1997) were averaged. Each item had 5 response categories: never (1), rarely (2), about half the time (3), often (4) and always (5). The resulting scale scores were bounded at 1 and 5 (see Figure 1), with increasing scores representing increasing levels of perceived discrimination. Certain response values were heaped in the data, e.g. 11 respondents had a scale score of 1, meaning these respondents responded ‘never’ to all 9 items, and 37 respondents had a scale score of 2. Additionally, the scale scores are right-skewed, and the normal distribution does not adequately describe the data.

### Data preparation

We begin by importing the data into the application. The application only accepts CSV files, with variable names in the top row and data in the remaining rows. The CSV file may include several variables, but only two variables are required for analysis: a group ID variable and the scale score outcome. The group ID variable must consist of whole numbers ($1, 2, \ldots, g$, where $g$ is the number of groups), i.e. there should be as many unique whole numbers as there are groups. For example, if there are four groups, the data for the grouping variable must be whole numbers with values 1, 2, 3, 4 representing the different groups.
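If the grouping variable in a raw file is stored as text labels rather than whole numbers, it must be recoded before import. A minimal sketch of such recoding (the function name and labels are ours, purely illustrative):

```python
def recode_groups(labels):
    """Map group labels to consecutive whole numbers 1..g,
    numbering groups in order of first appearance."""
    mapping = {}
    ids = []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping) + 1
        ids.append(mapping[lab])
    return ids, mapping

ids, mapping = recode_groups(
    ["Latine women", "African American men", "Latine women", "Asian American women"])
print(ids)       # [1, 2, 1, 3]
print(mapping)   # each label assigned a whole number 1..g
```

The recoded column can then be written back to the CSV alongside the scale score outcome before loading the file into the application.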

### Model diagnostics

We analyze the data following the steps in Figure 2. On running the analysis, we check that the model has been estimated (converged) without problems. There are two sets of checks for model convergence: (i) checks that the nature of the data combined with the model did not pose a problem for Stan – the Bayesian software used to fit the model; and (ii) checks that ensure parameter estimates have converged and can be interpreted. For the first set of checks, the application returns warnings in the *Instructions* tab about divergences and maximum tree depth; these metrics are unique to Stan. And the application provides instructions for refitting the model that will likely eliminate these warnings (*Adapt delta* and *Max treedepth* options in Figure 2). The second set of checks are in the *Parameter checks* tab of the application with guidelines for assessing parameter convergence for the three most important model parameters. The checks are based on commonplace Bayesian parameter convergence assessments: $\widehat{R}$ (preferably $\leq 1.05$), effective sample size (preferably $>1000$), parameter histogram (preferably single-peaked) and traceplots (showing random oscillations about a stationary point). For the current analysis, all checks were satisfactory – we expect both sets of checks to be satisfactory for most applications.
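For readers curious about what the $\widehat{R}$ diagnostic measures, a simplified (non-split) potential scale reduction factor can be computed directly from posterior draws. This sketch follows the classic Gelman-Rubin formula rather than the rank-normalized split version that modern Stan reports, and the simulated chains are purely illustrative:

```python
import numpy as np

def rhat(chains):
    """Basic (non-split) potential scale reduction factor.
    chains: array of shape (n_chains, n_draws)."""
    m, n = chains.shape
    between = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance B
    within = chains.var(axis=1, ddof=1).mean()      # within-chain variance W
    var_plus = (n - 1) / n * within + between / n   # pooled variance estimate
    return float(np.sqrt(var_plus / within))

rng = np.random.default_rng(1)
good = rng.normal(size=(4, 1000))   # four well-mixed chains
bad = good.copy()
bad[0] += 5                         # one chain stuck in a different region

print(rhat(good))   # close to 1, below the 1.05 guideline
print(rhat(bad))    # well above 1.05: estimates should not be trusted
```

Intuitively, $\widehat{R}$ compares between-chain and within-chain variability; when the chains agree, the ratio approaches 1.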

### Brief guide to interpretations

We devote the next few subsections to different outputs that help us understand the sample as a whole and differences between the groups. For this demonstration, we report sample medians and pairwise median differences; however, the application additionally permits estimation of sample means and pairwise mean differences, sample quantiles and pairwise quantile differences, and sample standard deviations and pairwise standard deviation ratios.^{3} To facilitate the presentation of our results, we report the 95% quantile interval – *there is a 95% chance* OR *one can be 95% confident* a quantity (e.g. a sample median or the median difference between two groups) falls in this interval.^{4} As seen in Figure 2, the user can set their desired interval, which is used across all outputs. For certain quantities, we report the probability of direction (Makowski et al., 2019) – the probability that a quantity is either positive or negative, depending on which direction has the higher probability. For readers who are not familiar with Bayesian intervals or the probability of direction, we hope the interpretations below illustrate how these quantities may be read. Note that all subsequent interpretations are contingent on the model being correct – no model ever is, but we treat the interpretations as acceptable given the expected adequacy of the model for scale scores. Additionally, the application always displays tables that allow for higher precision than depicted in the plots.

To illustrate the advantage of our approach over linear regression, we also performed a linear regression analysis with the six groups as predictors of the scale scores.

### Reporting group medians

Figure 3 contains the median of the overall sample and the group medians. The figure is based on several figures in Gelman et al. (2012). From this figure, we can (i) estimate the centre of the overall sample and for each group, and (ii) informally compare each group to the centre of the overall sample. Sample interpretations are:

We are 95% confident the overall sample median falls between 1.89 and 2.03; our best guess is 1.96 points.

We are 95% confident the median for Latine men falls between 1.65 and 1.96; our best guess is 1.82 points.

Informally, we can compare the interval of each group to the overall sample median (dashed vertical line in Figure 3). We can conclude with almost 95% confidence that on average, Latine men reported lower levels of perceived discrimination than the overall sample median. If the group’s 95% interval completely excluded the sample median, then we could make this claim with >95% confidence.

Given that the data are skewed, the linear regression model can only accurately estimate the group means, not the group medians, which may be a more reasonable measure of central tendency for these data. We present the group medians returned by ordinal regression alongside the means computed by both ordinal regression and linear regression in Figure 4. Broadly, the medians are not as high as the means due to the right-skew in the scale scores; hence the linear regression results would suggest that an individual most representative of their group had higher perceived discrimination than is actually the case. Additionally, the estimates of the group means produced by ordinal regression are shrunken towards the overall average, reflecting a built-in scepticism of differences between groups. The linear regression intervals are also symmetric, which can be problematic for describing the uncertainty about group means for bounded data. If the data were extremely skewed such that the group means were close to the bounds of the scale scores, the intervals (especially with small samples) could suggest implausible values, e.g. a 95% upper limit larger than the scale score maximum. The ordinal regression intervals cannot produce such a result, as these intervals fully reflect the bounded, discrete features of the scale scores.

One may be tempted to use the intervals in Figures 3 and 4 to compare the groups to each other e.g. comparing the interval for African-American women to the interval for African-American men – this is an overly conservative approach (Schenker & Gentleman, 2001). We address pairwise median comparisons next.

### Pairwise median difference comparisons

We report the findings of pairwise median difference comparisons in Figure 5. We interpret the cell in the upper left corner as an example:

African-American men reported higher levels of perceived discrimination than African-American women on average – we make this claim with 77% confidence. There was a 95% chance that the median difference fell between $-0.28$ and 0.73 points, and our best guess for the median difference was 0.18 points on the 1–5 point scale.

One may perform the same comparison for any other pairing. Some readers may be concerned about multiple comparisons. As earlier stated, all pairwise comparisons in the table are simultaneously valid for summarizing patterns in the data.

We also used linear regression to explore pairwise mean differences. Since these contrasts cover all pairs, we applied Tukey’s honestly significant difference procedure to the linear regression contrasts. For ordinal regression, we explored the pairwise group differences at the 20th, 50th (median, as in Figure 5) and 80th percentiles, and also at the mean. The results are in Figure 6. From the ordinal regression results, it is clear that differences between members of different groups are smaller for cases that are low within their groups (20th percentile comparison). These differences often increased for cases that are higher within their group. When comparing cases that were typical within their respective groups (median), there was more than a 95% chance of a non-zero gap for two contrasts: Latine men to both African-American women and men. When comparing cases that were high within their respective groups (80th percentile), there was more than a 95% chance of a non-zero gap for four contrasts: both Latine men and Asian-American women to both African-American women and men. On the other hand, linear regression is only capable of pairwise mean differences. Notably, the adjusted linear regression intervals are wider than the ordinal regression intervals for mean differences. This is because the pairwise (mean) differences are shrunken in the ordinal regression approach as implemented, resulting in less uncertainty in the pairwise comparisons.

So far, we have reported graphs that show the average patterns in the data. Next, we turn to graphs that report findings about actual persons in the sample.

### Comparing the sample to thresholds on the response scale

Figure 7 allows for comparing the entire sample to different points on the response scale. For example, 2 represents “Rarely” on the individual item response scale, and a research team might deem any scale scores above 2 to be cause for concern. Accordingly, the following finding would be relevant:

We are 95% confident that between 41 and 53 out of every 100 people had a score above 2 points on the response scale – our best guess is that 47% of respondents had a scale score above 2 points.

The interval limits of 41% and 53% were obtained from a table produced by the application. The next output (Figure 8) answers the same question as the earlier output, but for each group. As an example:

There was substantial variability in the proportion of the sample that reported a score above 2 points by group, ranging from about 30% of Latine men (95% interval [20%, 47%]) on the low end to about 65% of African American men (95% interval [46%, 81%]) on the high end.

The intervals were obtained from a table produced by the application. Figure 8 does not show the uncertainty intervals for each group – doing so can make it difficult to observe the group lines, especially as the intervals are likely to overlap each other. But the application includes an option that allows the user to show the uncertainty intervals.

Both Figures 7 and 8 can be important for understanding groups, especially when there are clinically or practically significant thresholds on the response scale. In the absence of such thresholds, each investigator is welcome to assess the overall sample and individual groups at any value of interest to the investigator.
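The raw-data analogue of these proportions is easy to compute; a small sketch with made-up scores and integer group IDs (the model-based versions in Figures 7 and 8 additionally carry uncertainty intervals and shrinkage):

```python
import numpy as np

# Hypothetical scale scores and group IDs
scores = np.array([1.00, 1.67, 2.33, 3.00, 2.00, 4.33, 1.33, 2.67])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3])
threshold = 2.0   # e.g. "rarely" on the item response scale

overall = float((scores > threshold).mean())
by_group = {int(g): float((scores[groups == g] > threshold).mean())
            for g in np.unique(groups)}

print(overall)    # 0.5: half this toy sample scored above 2
print(by_group)   # per-group proportions above the threshold
```

Note the strict inequality: a respondent scoring exactly at the threshold is not counted as exceeding it, which matters for heaped scale scores where many cases sit exactly on a round value.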

### Pairwise probability of superiority

A natural question for group comparisons is: if we come across members of two groups, which member will have the higher score on the outcome of interest? The probability of superiority (PS) answers this question, and we report the findings of pairwise PS in Figure 9. We interpret the cell in the upper left corner as an example:

We are 73% confident that African-American men had higher levels of perceived discrimination than African-American women. If we compared 100 random pairs of African-American women and men on their level of perceived discrimination, we are 95% confident that African-American men would report the higher score somewhere between 41% of the time on the low end and 68% of the time on the high end – our best guess is that African-American men would report the higher score 54% of the time.

Unlike the median difference, which describes groups on average, this result tells us about the actual persons in our sample.

### Custom comparisons

An investigator may be interested in specific comparisons beyond individual pairwise comparisons of groups. For example, one may want to compare a single group against a combination of other groups, or two sets of groups against each other. For reporting, we used quantile dotplots and complementary CDF plots (Fernandes et al., 2018). The application requires that the practitioner identify the groups they wish to compare and provide labels for the groups. The application also produces a table with the estimate of the group difference, its 95% interval and the probability of direction. As an example, we compared the 80th percentiles of African-American and Latine respondents; see Figure 10 for results. Sample interpretations include:

There was about a 99% chance that the difference between African American and Latine respondents at the 80th percentile was greater than 0 points – so one can be somewhat confident that high-scoring African-Americans reported higher levels of perceived discrimination than high-scoring Latine respondents on average.

If we assume absolute differences greater than 0.25 points on the response scale to be non-trivial, there was an almost 80% chance that the difference between the two groups at the 80th percentile exceeded this threshold.

Based on the estimate and intervals produced by the application, there was a 95% chance that African-Americans reporting a relatively high level of perceived discrimination (80th percentile) reported between 0.070 and 0.76 points higher than Latine respondents who reported a relatively high level of perceived discrimination (80th percentile). Our best guess was that African-Americans reported 0.39 points higher than Latine respondents at the 80th percentile.

The application additionally permits custom comparisons with mean differences and standard deviation ratios. Another use case for the custom comparisons is to produce the charts in Figure 10 for any two focal groups.

## Discussion

In this paper, we have presented a method for group comparisons when the response data are scale scores. The method should remain adequate under a variety of data features that can make traditional approaches less suitable for scale scores. Additionally, we have provided an application that allows psychologists to apply the method to their own data, while producing outputs they can use to describe their data exhaustively. In addition to group comparisons of scale scores, the methods and application may be applied to test scores created by averaging multiple binary items. In summary, the methods and application would be ideal for group comparisons of any composite score that is created from items that share the same discrete response scale.

The method we have presented is capable of accurately estimating any quantity that can be calculated from a distribution function: means, quantiles, standard deviations and other measures of dispersion; as well as group comparisons for these quantities. This is because the distribution of each group is (hierarchically) estimated. The cost of estimating the entire distribution of each group is some reduction in efficiency (i.e. wider uncertainty intervals). For example, were one to be solely interested in a comparison of group standard deviations, one could devise an alternative approach that was more efficient than the method we have presented. However, it is unlikely that this alternative could simultaneously perform accurate quantile comparisons. Said colloquially: we have presented a method that does a lot of things relatively well, despite not being the best technique for any one of those things. If a researcher intends to understand differences between groups from multiple perspectives, we recommend the method in this paper as it allows the researcher to *exhaust* the information in the data.

We have also provided code that allows researchers to extend the method to additional data types or data analytic plans. Simple extensions include accounting for covariates (continuous covariates can be entered into models using flexible functions such as splines), accounting for clustered data, and improving the graphs we have used for presenting results. We also welcome the use of ordinal models on scale scores for other commonplace analyses in psychological research, such as moderation analysis with scale scores. We note that depending on the analysis, the ordinalCont package (Manuguerra et al., 2020) may be sufficient for the researcher’s needs.

Finally, researchers intending to adopt the methods here while planning their studies may have questions about sample size planning. For high-stakes studies, the team statistician may be able to program a power analysis using simulation (e.g. Kurz, 2019; Sigal & Chalmers, 2016). Alternatively, one can practice *optional stopping* (Rouder, 2014), that is, recruit participants and continuously re-analyse the data until the team decides to stop recruiting or is unable to recruit additional participants.

## Acknowledgement

We are grateful to Leydi Johana Chaparro-Moreno for her helpful comments and feedback. We would also like to thank two anonymous reviewers and Associate Editor Don van Ravenzwaaij for their helpful comments.

## Competing Interests

The authors declare no competing interests.

## Data accessibility statement

All materials including dataset, desktop application and code are available at: https://osf.io/r7qgc. The OSF wiki page describes all files in the folder.

## Contributions

Contributed to conception and design: 1st author

Contributed to acquisition of data: 2nd author

Contributed to analysis and interpretation of data: 1st author

Drafted and/or revised the article: 1st author, 2nd author

Approved the submitted version for publication: 1st author, 2nd author

## Appendices

#### Appendix A. Model details

**Preliminary notes on notation.** The model notation for the normal distribution is location-scale rather than location-variance. Unknown parameters are represented with Greek letters; unknown scalars are in regular font, unknown vectors are bolded, and unknown matrices are bolded and capitalized.

##### Statistical model

Assume $n_i$ items with Likert response scales, each with $C$ response categories; the items are averaged to create scale scores $Y$ with theoretical range $[1, C]$. We transform $Y$ to integers, $Z$, using the formula: $Z = n_i \times (Y - 1) + 1$. $Z$ has theoretical range $[1, J]$, where $J = n_i \times (C - 1) + 1$. We can perform ordinal regression on $Z$, assuming $J - 1$ thresholds regardless of the observed values of $Z$. This can be a relatively complex model: for example, with 20 items, each with response categories 1–5, we need to estimate 80 $(20 \times 4)$ thresholds.
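As a concrete illustration of this transformation, consider the following sketch (the data and the `to_ordinal` helper are hypothetical, not part of the paper's application), mapping averaged scores from three 1–5 items to their integer ranks:

```python
# Sketch of the score transformation described above (hypothetical data).
# With n_i items on a 1-C scale, a scale score Y (an average) maps to an
# integer Z = n_i * (Y - 1) + 1 with range [1, J], where J = n_i * (C - 1) + 1.
n_items, C = 3, 5
J = n_items * (C - 1) + 1           # 13 distinct ordinal values

def to_ordinal(y, n_items):
    """Map an averaged scale score to its integer rank Z."""
    z = n_items * (y - 1) + 1
    return round(z)                  # scores are multiples of 1/n_items

scores = [1.0, 2.3333333, 5.0]      # averages of three 1-5 items
print([to_ordinal(y, n_items) for y in scores])  # [1, 5, 13]
print(J)                                          # 13
```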

Alternatively, we follow Manuguerra et al. (2020) and approximate the ordinal thresholds with splines. Manuguerra et al. (2020) frame the problem as trying to predict the unknown thresholds for each $j\u2208{1,2,\u2026,J\u22121}$. Splines are a form of piecewise polynomial regression; in the simplest cases – including the current scenario – splines transform a single predictor variable into multiple variables/pieces where each piece is usually a low-degree polynomial known as a *basis function*. As with piecewise regression, a coefficient is estimated for each piece. Each piece primarily influences a small region of the data such that splines excel at capturing relations that are complex in some regions of the data, but simple in others.

There are three major choices to be made when specifying basis functions: the type or method of constructing the basis functions; the number of pieces, termed the degrees of freedom of the spline; and the degree of the basis functions. These decisions can be made prior to data analysis. In the current application, these decisions need not be difficult.

Our goal is to approximate the ordinal thresholds which have an ordinal constraint: the threshold associated with $j+1$ must be at least as large as the threshold associated with $j$. Manuguerra et al. (2020) selected the I-spline basis function (Ramsay, 1988) which is able to produce monotone shapes. When all the coefficients of I-spline basis functions or spline weights are greater than or equal to 0, the resulting shape will be monotonically non-decreasing, thus satisfying the ordinal constraint.

Additionally, the importance of the degree of the spline and the degrees of freedom is minimized as long as the estimation of spline weights is penalized (Ruppert et al., 2003, section 5.5). Penalized estimation places a cost on larger coefficients such that the coefficients are less likely to be large compared to unpenalized (or standard) coefficients. This is a skeptical approach to estimating coefficients that has the effect of reducing overfitting. To penalize the spline weights, we apply a prior that attempts to shrink the spline weights towards 0. Additionally, we assume the splines are quadratic such that the resulting functions are smooth (Ruppert et al., 2003, section 3.6), and set the degrees of freedom to 18 – i.e. there are 18 pieces or basis functions. Instead of estimating $J - 1$ thresholds, we estimate 18 spline weights, thus simplifying the problem.

In summary, the baseline ordinal regression model for scale scores is:

where $\text{inv-logit}(x) = (1 + e^{-x})^{-1}$, $\beta_0$ is the ordinal regression intercept, and $s(j) = \mathbf{s}_j' \boldsymbol{\gamma}$, where $\mathbf{s}_j'$ is the $j$-th row of the basis matrix and $\boldsymbol{\gamma}$ are the spline weights, constrained non-negative. It is implicitly assumed that $\Pr(Z \leq J) = 1$.

The baseline model for each case $i \in \{1, \ldots, N\}$, ignoring differences between groups, is:

where $\boldsymbol{\gamma}$ are assumed half-normal with a scale term, $\sigma_0$. As $\sigma_0$ becomes smaller, the elements of $\boldsymbol{\gamma}$ become smaller, such that $\sigma_0$ shrinks the spline weights towards 0.

We consider two proposals for extending the model to multiple groups. First, one can expand $\beta_0$ to a group-varying coefficient and estimate it hierarchically (e.g. Gelman et al., 2012). Alternatively, one can permit the spline weights to vary by group, while estimating the spline weights hierarchically (e.g. section 7.7.4 in Wood, 2017). We opt for the latter approach as it is better able to characterize each group uniquely. The first approach assumes the groups differ only in their location or center on the logit scale, while the second allows for both varying locations and densities on the logit scale. Practically, there are situations where only the second approach is able to accurately estimate quantiles for each group.

Accordingly, for the hierarchical model comparing $G$ groups, we assume the same I-spline basis matrix $(\mathbf{S})$ across all groups, and an $18 \times G$ weight matrix $(\boldsymbol{\Phi})$ in which each column holds the weights of one group, $\boldsymbol{\Phi} = [\boldsymbol{\varphi}_1, \ldots, \boldsymbol{\varphi}_G]$. The model for each case $i$ belonging to group $g \in \{1, \ldots, G\}$ is:

where $\odot$ is the Hadamard (element-wise) product and $\boldsymbol{\gamma}$ remains the average spline weights. $\boldsymbol{\lambda}_g$ is a vector used to create the weights for group $g$. If $\boldsymbol{\lambda}_g = \mathbf{1}$ for some group $g$, then that group is not different from the overall population. We place a log-normal prior on $\boldsymbol{\lambda}_g$ with location 0 (or median $= \exp(0) = 1$) and scale $\sigma_1$. As $\sigma_1 \to 0$, $\boldsymbol{\lambda}_g \to \mathbf{1}$ and $\boldsymbol{\varphi}_g \to \boldsymbol{\gamma}$; i.e. when $\sigma_1$ is closer to 0, the groups are more likely to share the same spline weights, meaning that the groups are not so different from each other.

In practice, the modeller only needs to fit the hierarchical model in equation A2, as one can reconstruct all relevant estimates for the overall sample from the hierarchical model.

**Prior choices.** Given the structure of the model, we set some default choices for priors based on reasonable conventions. We assume spline weights, $\gamma $, are normal (precisely half-normal due to the non-negativity requirement) in a manner akin to ridge-regression (Park & Casella, 2008) i.e. 0 is assumed the most likely value of the weights and there is a scaling parameter, $\sigma 0$, that attempts to shrink the weights towards 0. The group deviation weights, $\lambda g$, are similarly log-normal in a manner that shrinks them towards 1, i.e. no difference from the average weights. Other prior choices are for the intercept, $\sigma 0$ and $\sigma 1$, and these are weakly informative priors (Gelman et al., 2008; Lemoine, 2019), i.e. these priors attempt to limit the parameters to realistic estimates. In the absence of problem specific information, we believe these choices will be reasonable for most applications. Where the researcher is capable of providing problem-specific information that can enrich the model, we recommend modifying the default choices here.

**Extant software implementations.** We know of two implementations of ordinal regression for continuous variables in R: the orm() function in the rms package (Harrell, 2021) and the ocm() function in the ordinalCont package (Manuguerra et al., 2020). orm() does not support hierarchical models and the ordinal thresholds are based on the observed values in the data only. On the other hand, ocm() models the ordinal thresholds using monotonic splines such that thresholds can be estimated for all potential values on the response scale. And though one can use ocm() to fit the baseline model in equation A1, ocm() cannot permit the splines to vary hierarchically as in equation A2.

##### Model outputs

We now show how to estimate several quantities from the hierarchical model. We are able to obtain uncertainty estimates for each quantity below using the posterior distribution of the relevant parameters. For all equations below, let $t$ be the scale-score equivalent of $j$, i.e. $t = (j - 1)/n_i + 1$.

**Mass and distribution functions.** The model-based cumulative distribution function (CDF), $\Pr(z_{ig} \leq j)$, for each group is $\text{inv-logit}(\mathbf{s}_j' \boldsymbol{\varphi}_g - \beta_0)$. The CDF and complementary CDF (CCDF; $1 - \Pr(z_{ig} \leq j)$) can answer questions such as: what is the probability that cases belonging to a group have a value less than or exceeding some threshold of interest? The probability mass function (PMF) is:

Henceforth, we refer to the PMF as $\Pr(j)$ for brevity. The PMF for the overall sample is the weighted average of the group PMFs, using the group sample sizes, and the CDF for the overall sample is the cumulative sum of the resulting overall-sample PMF.
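These two steps can be sketched as follows (a minimal illustration with hypothetical CDF values and group sizes; `pmf_from_cdf` and the group labels are ours, not from the application):

```python
# Sketch: recovering PMFs from group CDFs and pooling them (hypothetical values).
# Pr_g(j) = CDF_g(j) - CDF_g(j-1); overall PMF = sample-size-weighted average.
cdf_by_group = {
    "A": [0.2, 0.5, 0.8, 1.0],   # Pr(Z <= j) for j = 1..J, here J = 4
    "B": [0.1, 0.3, 0.6, 1.0],
}
n_by_group = {"A": 60, "B": 40}

def pmf_from_cdf(cdf):
    """Differencing the CDF gives the PMF."""
    return [cdf[0]] + [cdf[j] - cdf[j - 1] for j in range(1, len(cdf))]

pmfs = {g: pmf_from_cdf(c) for g, c in cdf_by_group.items()}
N = sum(n_by_group.values())
overall_pmf = [sum(n_by_group[g] * pmfs[g][j] for g in pmfs) / N
               for j in range(4)]
print([round(p, 2) for p in overall_pmf])  # [0.16, 0.26, 0.3, 0.28]
```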

**Mean.** The mean is $\sum_{j=1}^{J} t \times \Pr(j)$; recall $t = (j - 1)/n_i + 1$. The group means are calculated using the group PMFs, after which one can perform pairwise comparisons. The overall mean is calculated using the overall sample PMF.
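Assuming a hypothetical PMF, the mean computation can be sketched as:

```python
# Sketch: the model-based mean on the original scale (hypothetical PMF).
# Each rank j maps back to t = (j - 1) / n_i + 1; mean = sum_j t * Pr(j).
n_items = 3
pmf = [0.1, 0.2, 0.4, 0.2, 0.05, 0.05, 0, 0, 0, 0, 0, 0, 0]  # J = 13
mean = sum(((j - 1) / n_items + 1) * p for j, p in enumerate(pmf, start=1))
print(round(mean, 3))  # 1.683
```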

**Quantiles.** The $(100 \times \tau)$-th percentile is the value at which the CDF is exactly $\tau$. Let $t^-$ and $t^+$ be the values of $t$ that result in a CDF just below $\tau$ $(\tau^-)$ and a CDF just above $\tau$ $(\tau^+)$, respectively. We use linear interpolation to estimate the exact value of $t$ that results in $\tau$:
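The interpolation just described can be sketched as follows (the CDF grid and the `quantile_from_cdf` helper are hypothetical illustrations):

```python
# Sketch: percentile by linear interpolation on the model CDF (hypothetical values).
def quantile_from_cdf(cdf, ts, tau):
    """Find t where the CDF crosses tau, interpolating between grid points."""
    for k in range(len(cdf)):
        if cdf[k] >= tau:
            if k == 0 or cdf[k] == tau:
                return ts[k]
            lo_F, hi_F = cdf[k - 1], cdf[k]     # tau-minus, tau-plus
            lo_t, hi_t = ts[k - 1], ts[k]       # t-minus, t-plus
            return lo_t + (tau - lo_F) / (hi_F - lo_F) * (hi_t - lo_t)
    return ts[-1]

ts  = [1.0, 1.5, 2.0, 2.5, 3.0]
cdf = [0.10, 0.35, 0.60, 0.85, 1.00]
print(quantile_from_cdf(cdf, ts, 0.50))  # 1.8
```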

**Standard deviation.** With the mean as computed above termed $\mu$, the standard deviation is $\sqrt{\sum_{j=1}^{J} \Pr(j) (t - \mu)^2}$. The group SDs are calculated using the group PMFs, after which one can perform pairwise comparisons. For pairwise comparisons, we calculate the (ratio of SDs) – 1, such that 0 means equal SDs. The overall sample SD is calculated using the overall sample PMF.
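A sketch of the SD computation under a toy PMF (all values hypothetical):

```python
# Sketch: model-based SD from the PMF, sd = sqrt(sum_j Pr(j) * (t - mu)^2).
import math

n_items = 3
pmf = [0.25, 0.25, 0.25, 0.25]                       # toy PMF over j = 1..4
ts = [(j - 1) / n_items + 1 for j in range(1, 5)]    # scale-score values
mu = sum(t * p for t, p in zip(ts, pmf))
sd = math.sqrt(sum(p * (t - mu) ** 2 for t, p in zip(ts, pmf)))
print(round(mu, 3), round(sd, 3))  # 1.5 0.373
```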

**Pairwise probability of superiority.** To estimate the PS comparing groups A and B with corresponding outcome data $Y_A$ and $Y_B$:

where the relevant PMF and CCDF are computed using the equations above.^{5}
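One common discrete form of the PS with a tie correction, $\Pr(Y_A > Y_B) + 0.5 \Pr(Y_A = Y_B)$, can be sketched from the group PMFs (hypothetical PMFs; the helper name is ours, and this is an illustrative form rather than the paper's exact equation):

```python
# Sketch of a discrete probability of superiority with a tie correction
# (hypothetical PMFs): PS = Pr(A > B) + 0.5 * Pr(A = B).
def prob_superiority(pmf_a, pmf_b):
    ps = 0.0
    cdf_b = 0.0                      # running Pr(B <= j - 1)
    for pa, pb in zip(pmf_a, pmf_b):
        ps += pa * (cdf_b + 0.5 * pb)
        cdf_b += pb
    return ps

pmf_a = [0.1, 0.2, 0.3, 0.4]         # group A tends toward higher values
pmf_b = [0.4, 0.3, 0.2, 0.1]
print(round(prob_superiority(pmf_a, pmf_b), 3))  # 0.75
```

A sanity check on the tie correction: two identical distributions give PS = 0.5.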

**Custom comparisons.** Finally, we provide the option for custom comparisons of means, quantiles and SDs. For example, one may want to compare one group to the average of three other groups on their medians, or compare two sets of groups. To combine groups, we create a weighted-average PMF using the relevant group sample sizes. Then we compute the required quantity using the composite PMF.
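The pooling step can be sketched as follows (hypothetical groups, sizes, PMFs, and scale values; the `combine` helper is illustrative):

```python
# Sketch: custom comparisons pool groups into a composite PMF via
# sample-size weights, then compute any PMF-based quantity on it.
pmfs  = {"g1": [0.5, 0.3, 0.2], "g2": [0.2, 0.3, 0.5], "g3": [0.1, 0.4, 0.5]}
sizes = {"g1": 50, "g2": 30, "g3": 20}
ts    = [1.0, 1.5, 2.0]                  # scale-score value for each j

def combine(groups):
    """Weighted-average PMF over a set of groups."""
    n = sum(sizes[g] for g in groups)
    return [sum(sizes[g] * pmfs[g][j] for g in groups) / n for j in range(3)]

# e.g. compare g1 against the pooled g2 + g3 on the mean
pooled = combine(["g2", "g3"])
mean_g1     = sum(t * p for t, p in zip(ts, pmfs["g1"]))
mean_pooled = sum(t * p for t, p in zip(ts, pooled))
print(round(mean_g1 - mean_pooled, 3))  # -0.32
```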

##### Multinomial approach

Given that the focus of this paper is group comparisons, the ordinal approach above can be further specialized. For the hierarchical model, we count the number of responses with each value, $j \in \{1, \ldots, J\}$, on the response scale reported by members of group $g$, and assume the resulting count data $[m_{1g}, \ldots, m_{Jg}]$ to be multinomial, where $m_{jg}$ is the count of scores with value $j$ from members of group $g$. The group-specific PMFs computed earlier provide the probabilities responsible for the observed counts. Precisely, the log-likelihood function (excluding the multinomial constant) is:

This is the implementation we use in the application, as there are speed gains when working with the multinomial likelihood as opposed to the original ordinal formulation. However, the multinomial formulation here is not readily scalable, as it is not straightforward to include additional predictors, especially continuous predictors. This is because the data in the multinomial approach are the counts $(m_{jg})$ of members of a group $g$ that selected a particular response value $j$ on the scale scores, and collapsing data this way becomes difficult with continuous predictors. We provide the source code for both approaches, and recommend the ordinal approach for those seeking to extend the methods presented so far.
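Under hypothetical counts and PMFs, the log-likelihood computation (multinomial constant omitted) is a short sketch, summing $m_{jg} \log \Pr_g(j)$ over groups and response values:

```python
# Sketch of the multinomial log-likelihood, constant omitted (hypothetical data):
# ll = sum over groups g and values j of m_jg * log Pr_g(j).
import math

counts = {"A": [5, 10, 15], "B": [12, 9, 3]}     # m_jg per group
pmfs   = {"A": [0.2, 0.3, 0.5], "B": [0.5, 0.3, 0.2]}

ll = sum(m * math.log(p)
         for g in counts
         for m, p in zip(counts[g], pmfs[g]))
print(round(ll, 2))  # -54.47
```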

#### Appendix B. Model comparison

The method as presented permits extensive summaries of group differences under the assumption that such differences exist and should be described with considerations for uncertainty. An alternative is to estimate models that restrict the groups in a variety of ways and compare them. The primary way to estimate such models would be to constrain the spline coefficients accordingly. We present one approach to creating such constraints by slightly modifying equation A2:

We introduce $\alpha$, an indicator of belief in group differences; when $\alpha = 0$, all groups are identical, otherwise group differences are permitted and estimated – the modeler sets $\alpha$. The second change above is to allow the modeler to cluster the groups, such that the group splines vary by cluster membership, $\boldsymbol{\lambda}_c$, and certain group splines are constrained equal. Retaining the example in the paper, one may believe that gender differences within race do not matter and fit such a model.

As we rely on the multinomial likelihood, group-wise log-likelihood (including the multinomial constant) can be computed for any given model and used to compare models via metrics such as the WAIC (Watanabe, 2010) or approximate leave-one-out cross-validation (Vehtari et al., 2017). We implement this approach in Stan (*model_mod_comp.stan*) and demonstrate an example (*example_pd_mult_mod_comp.R*) – scripts in OSF repository.

## Footnotes

1. The general pattern for distinct values is all integers in the interval $[l \times n, u \times n]$ divided by $n$, where $l$ and $u$ are the minimum and maximum values of the response categories and $n$ is the number of items in the scale. In the preceding example, the integers to be divided by $n = 3$ are $[0, 1, \ldots, 12]$.

2. Depending on the number of distinct response values, the typical ordinal regression model may have to estimate a large number of parameters. In this paper, we apply a method proposed by Manuguerra et al. (2020) that reduces the number of parameters, see Appendix A for details.

3. The quantity is the pairwise ratio of group standard deviations – 1, such that 0 means equal standard deviations.

4. For readers familiar with Bayesian statistics, the quantile or equal-tailed interval is one type of credible interval.

5. This series sum is the discrete equivalent of the integral approach to computing the PS with continuous data (e.g. Agresti & Kateri, 2017, p. 216). The discrete form here accounts for ties in the data.

## References

*Biometrics*, *73*(1), 214–219. https://doi.org/10.1111/biom.12565

*American Journal of Public Health*, *109*(S1), S16–S20. https://doi.org/10.2105/ajph.2018.304883

*Publication manual of the American Psychological Association* (7th ed.).

*Proceedings of the National Academy of Sciences*, *115*(11), 2607–2612. https://doi.org/10.1073/pnas.1708285114

*Advances in multilevel modeling for educational research: Addressing practical issues found in real-world applications* (pp. 3–38). Information Age, Charlotte, NC.

*Journal of Applied Psychology*, *99*(2), 332–340. https://doi.org/10.1037/a0034745

*brms*: An R Package for Bayesian Multilevel Models Using Stan. *Journal of Statistical Software*, *80*(1), 1–28. https://doi.org/10.18637/jss.v080.i01

*Stan*: A Probabilistic Programming Language. *Journal of Statistical Software*, *76*(1). https://doi.org/10.18637/jss.v076.i01

*Race and Social Problems*, *11*(1), 15–32. https://doi.org/10.1007/s12552-018-9256-y

*Psychological Bulletin*, *114*(3), 494–509. https://doi.org/10.1037/0033-2909.114.3.494

*Journal of Counseling Psychology*, *64*(2), 141–154. https://doi.org/10.1037/cou0000198

*American Psychologist*, *64*(3), 170–180. https://doi.org/10.1037/a0014564

*Majorities of Americans see at least some discrimination against Black, Hispanic and Asian people in the U.S*. https://www.pewresearch.org/fact-tank/2021/03/18/majorities-of-americans-see-atleast-some-discrimination-against-black-hispanic-and-asian-people-in-the-u-s/

*PLOS ONE*, *13*(11), e0207239. https://doi.org/10.1371/journal.pone.0207239

*Understanding Everyday Racism: An Interdisciplinary Theory*. SAGE Publications, Inc. https://doi.org/10.4135/9781483345239

*Association for Computing Machinery*, 1–12. https://doi.org/10.1145/3173574.3173718

*Advances in Methods and Practices in Psychological Science*, *3*(4), 456–465. https://doi.org/10.1177/2515245920952393

*Journal of Research on Educational Effectiveness*, *5*(2), 189–211. https://doi.org/10.1080/19345747.2011.618213

*The Annals of Applied Statistics*, *2*(4), 1360–1383. https://doi.org/10.1214/08-aoas191

*European Journal of Epidemiology*, *34*(9), 801–808. https://doi.org/10.1007/s10654-019-00552-z

*Psychological Methods*, *6*(4), 430–450. https://doi.org/10.1037/1082-989x.6.4.430

*Journal of Applied Psychology*, *79*(2), 314–316. https://doi.org/10.1037/0021-9010.79.2.314

*BMC Public Health*, *20*(1), 1652. https://doi.org/10.1186/s12889-020-09792-1

*rms: Regression modeling strategies* [Computer software manual]. R package version 6.2-0. https://CRAN.R-project.org/package=rms

*Journal of Experimental Psychology: General*, *142*(2), 573–603. https://doi.org/10.1037/a0029146

*Bayesian power analysis: Part I. Prepare to reject $H_0$ with simulation*. https://solomonkurz.netlify.app/post/bayesian-power-analysis-part-i/

*PLOS ONE*, *14*(1), e0210698. https://doi.org/10.1371/journal.pone.0210698

*Oikos*, *128*(7), 912–928. https://doi.org/10.1111/oik.05985

*BJPsych Bulletin*, *39*(4), 187–190. https://doi.org/10.1192/pb.bp.114.047936

*Statistics in Medicine*, *36*(27), 4316–4335. https://doi.org/10.1002/sim.7433

*Frontiers in Psychology*, *10*. https://doi.org/10.3389/fpsyg.2019.02767

*The International Journal of Biostatistics*, *6*(1). https://doi.org/10.2202/1557-4679.1230

*Journal of Statistical Software*, *96*(8), 1–25. https://doi.org/10.18637/jss.v096.i08

*Statistical rethinking: A Bayesian course with examples in R and Stan*. CRC Press/Taylor & Francis. https://doi.org/10.1201/9780429029608

*Psychological Bulletin*, *111*(2), 361–365. https://doi.org/10.1037/0033-2909.111.2.361

*Behavior Research Methods*, *52*(6), 2287–2305. https://doi.org/10.3758/s13428-020-01398-0

*British Journal of Dermatology*, *161*(2), 397–403. https://doi.org/10.1111/j.1365-2133.2009.09099.x

*Psychological Science*, *25*(6), 1289–1290. https://doi.org/10.1177/0956797614525969

*National Minority Health and Health Disparities Research Framework*. (2021, December). https://www.nimhd.nih.gov/researchFramework

*Journal of the American Statistical Association*, *103*(482), 681–686. https://doi.org/10.1198/016214508000000337

*Statistical Science*, *3*(4), 425–441. https://doi.org/10.1214/ss/1177012761

*Journal of Educational and Behavioral Statistics*, *12*(3), 241–269. https://doi.org/10.3102/10769986012003241

*Psychonomic Bulletin & Review*, *21*(2), 301–308. https://doi.org/10.3758/s13423-014-0595-4

*Semiparametric regression*. Cambridge University Press. https://doi.org/10.1017/cbo9780511755453

*The American Statistician*, *55*(3), 182–186. https://doi.org/10.1198/000313001317097960

*Collabra: Psychology*, *8*(1), 38594. https://doi.org/10.1525/collabra.38594

*Journal of Statistics Education*, *24*(3), 136–156. https://doi.org/10.1080/10691898.2016.1246953

*Child Development*, *85*(3), 842–860. https://doi.org/10.1111/cdev.12169

*Statistics and Computing*, *27*(5), 1413–1432. https://doi.org/10.1007/s11222-016-9696-4

*Journal of Machine Learning Research*, *11*, 3571–3594. http://www.jmlr.org/papers/v11/watanabe10a.html

*Cultural Diversity and Ethnic Minority Psychology*, *43*(3), 368–380. https://doi.org/10.1037/cdp0000477

*Child Development*, *90*(4), 1442–1452. https://doi.org/10.1111/cdev.13141

*Statistics in Medicine*, *12*(24), 2257–2271. https://doi.org/10.1002/sim.4780122404

*Behavior Research Methods*. https://doi.org/10.3758/s13428-022-01849-w

*Journal of Health Psychology*, *2*(3), 335–351. https://doi.org/10.1177/135910539700200305

*Psychology of Women Quarterly*, *43*(3), 368–380. https://doi.org/10.1177/0361684319832511

*Generalized additive models: An introduction with R* (2nd ed.). CRC Press. https://doi.org/10.1201/9781315370279

*Statistics in Medicine*, *29*(24), 2486–2500. https://doi.org/10.1002/sim.4012