Multilevel modeling techniques have gained traction among experimental psychologists for their ability to account for dependencies in nested data structures. Increasingly, these techniques are extended to the analysis of binary data (e.g., correct or incorrect responses). Despite their popularity, the information in logistic multilevel models is often underutilized when researchers focus solely on fixed effects and ignore important heterogeneity that exists between participants. In this tutorial, we review four techniques for estimating and quantifying the relative degree of between-person variability in logistic multilevel models in an accessible manner using real data. First, we introduce logistic multilevel modeling, including the interpretation of fixed and random effects. Second, we review the challenges associated with the estimation and interpretation of within- and between-participant variation in logistic multilevel models, particularly computing the intraclass correlation coefficient (ICC), which is usually a first, simple step in a linear MLM. Third, we demonstrate four existing methods of quantifying the ICC in logistic multilevel models and discuss their relative advantages and disadvantages. Fourth, we present bootstrapping methods to make statistical inference about these ICC estimates. To facilitate reuse, we developed R code to implement the discussed techniques, which is provided throughout the text and as supplemental materials.
1. Introduction
From 2007 to 2017 there has been a threefold increase in published psychology articles that use multilevel modeling techniques (Huang, 2018), which is in part due to the nested structure of psychological data: individual observations clustered within participants or participants clustered within groups. Across multiple areas of psychology, behavioural economics, and other behavioural sciences, nested data often arise from experiments in which participants have completed multiple trials of a task. Analyzing these data with traditional statistical tests (e.g., fixed-effects ANOVA, linear regression) fails to account for this nested structure and violates the assumption of independence of observations. When responses come from the same participant, they are more similar than if they come from independently sampled participants. These response redundancies deflate the effective sample size, cause incorrect standard errors, and often cause higher Type I error rates (Snijders & Bosker, 2011).
In addition to these data being nested, the outcome variable of interest in many psychological datasets is often binary (e.g., accuracy or binary choices), rather than continuous. Such data can be modeled using logistic multilevel models (i.e., logistic mixed-effects regressions), which model the probability of success/correct choice while handling the dependencies that arise in nested datasets. However, these techniques are underexploited in practice. That is, while it is becoming common for researchers to fit multilevel models to nested experimental data, interpretation of the results is primarily, if not exclusively, focused on the ‘fixed effects’ of the model. As a result, estimates based on the variance of the random effects—e.g., the intraclass correlation (ICC), which captures the relative importance of variability between participants’ responses (or between other types of clusters)—are seldom used to support or speak against conclusions.
Why should researchers care about the ICC in their multilevel logistic models? First, it quantifies theoretical phenomena, such as how variable participants’ responses are during a task or how stable an effect is across people. Relatedly, this provides a starting point for explanatory power: if most of the variability in the data comes from differences between persons, then model building with person-level predictors is a useful exercise, whereas if most of the variability in the data comes from within-person variations, then the researcher will want to explore the variability of responding to their task and investigate differences between conditions. At present, it is uncommon for researchers to report or even investigate the random effects variance or the ICC when outcomes are binary, which is inconsistent with best practices for the transparent reporting of any multilevel model (Luo et al., 2021).
To support researchers in expanding their understanding of their data and encourage best practice, this tutorial explains how to estimate and interpret the ICC in logistic multilevel models. A tutorial focusing on this topic is needed for at least three reasons: 1) the unique challenges that computing and interpreting the ICC in logistic multilevel models pose are seldom addressed in an accessible manner, 2) while other works have provided methods to compute the ICC in a logistic multilevel context (e.g., Goldstein et al., 2002; Merlo et al., 2006), there does not exist a didactic survey of methods aimed at increasing uptake by psychologists (see, for instance, examples in social epidemiology; Austin & Merlo, 2017; Merlo et al., 2006), and 3) texts that do address this issue tend to focus on only a single method (e.g., Snijders & Bosker, 2011), ignoring the differences between various methods.
Accordingly, we have four goals. First, for the uninitiated, we provide a brief introduction to linear multilevel regression (in the Supplemental Materials, section S6b, explained below) and logistic multilevel regression, which includes the interpretation of fixed and random effects. Second, we review the challenges associated with the interpretation and estimation of the relative contribution of within- and between-participant variation in logistic multilevel models—namely in the computation of the intraclass correlation (also known as the variance partition coefficient, VPC). Third, we explain four existing methods of quantifying the ICC in logistic multilevel models, highlighting the differences between the methods. Finally, we describe bootstrapping methods which permit statistical inferences about these variance estimates and comment on the application of the reviewed techniques for models with additional predictors and random slopes. Throughout the tutorial we include Code Boxes which illustrate the concepts discussed in the main text using the R programming language, in a step-by-step fashion.^{1}
To accomplish these goals, we use an example dataset from experimental cognitive psychology (see Section 2), where responses are nested within participants. However, the topics discussed extend to cases where binary data may be nested within other types of clusters (e.g., groups, schools, countries, etc.). Namely, while the computations presented below do not change based on the nature of the grouping variable in the data, the interpretation of the ICC does. In contexts where the data are grouped by organizations or “clusters”, such as data collected from different schools or hospitals, the ICC measures the proportion of the total variance that is attributed to the variability between clusters, sometimes referred to as “the intracluster correlation.” This quantifies the impact of the clustering variable (e.g., schools) as well as how subjects within the same cluster are similar. Another interpretation of the ICC in this context, then, is the average correlation in the outcome data between measurements from the same cluster. When participants provide repeated measures (e.g., in a psychological experiment), the ICC measures the proportion of the total variance that is due to the differences between individuals. Here, the ICC can provide insights about how individual differences among subjects exert an influence on the outcome. It is this latter interpretation of the ICC, in the context of a binary outcome variable, that will be the focus of this tutorial.
We developed materials to be reusable and reproducible. We encourage readers to follow along with the supplementary material, which we refer to throughout (e.g., S1, referring to “supplement section 1”) and includes all R code with explanatory narrative that can be used to reproduce the results highlighted in the main text. The main text of this tutorial is aimed at someone who has experience executing multilevel models in R and would like to expand their knowledge to logistic models. For those with less background knowledge, readers can review the relevant sections of the supplemental materials (see Section S6).
2. Illustrative Data Description
Data for this tutorial are taken from a study examining how acute stress engenders avoidance of cognitively effortful tasks (Bogdanov et al., 2021; data available at https://osf.io/26w4u/). Thirty-eight young adults (20 female; PID in the dataframe) completed 300 trials of the well-characterized demand selection task (DST) in blocks of 150 trials, in which participants repeatedly chose between two nondescript cues that represented demand levels of a cognitively demanding task-switching paradigm: a low-demand condition and a high-demand condition (effort_choice, 0 = low demand, 1 = high demand; see Figure 1).
In Bogdanov et al. (2021), participants completed the DST before and after being exposed to acute social stress (condition, control or stress), in a repeated-measures design. Their research question was whether willingness to exert cognitive effort (i.e., to choose the high effort cue) would change after being exposed to acute social stress. For the purposes of this tutorial, we will focus on data from sessions in which participants were not exposed to social stress to predict effort aversion (S1 demonstrates how to load this subset of the data).
The traditional finding using this task is that participants will demonstrate a marked preference for the low-demand option over the high-demand option, reflecting a general bias against performing cognitively demanding activities, and in favour of less cognitively demanding activities (Kool et al., 2010). This recurring finding of demand-avoidant preferences has been interpreted as evidence that humans have a default tendency to avoid cognitively effortful tasks (Kool et al., 2010; Kool & Botvinick, 2018).
In this tutorial, the outcome variable of interest is participants’ effort choices in the DST, which are binary (low demand or high demand) and nested (within participants; each participant makes many choices). We begin with how to model these data using logistic multilevel models, predicting the effect of effort level on cue preference in the DST, with the lme4 package in R (Bates et al., 2015). For those less familiar with linear multilevel regression, review S6a; for those who need a refresher on logistic regression, review Section 3.
3. Logistic Multilevel Regression
Linear multilevel regression models (see S6b) can be generalized to model binary outcomes. In this case, the goal is to model the probability that a given participant chose the high-effort option (effort_choice) on a given trial. If our binary data are coded as 0 (low-effort option chosen) and 1 (high-effort option chosen), this probability will be equal to the mean of the outcome: $p(Y_{ij}=1)=\bar{Y}_{j}$, where $i$ refers to the $i$^{th} trial and $j$ refers to the $j$^{th} participant. However, predicting probabilities as a linear combination of predictors creates problems, because probabilities are bounded between 0 and 1 inclusive, whereas predicted values from a linear model (see S6a for more information) can, in theory, take on any value between negative and positive infinity. To circumvent this issue in logistic regression, a linear combination of the model’s parameters and its predictors is computed on the logit scale—and then, afterwards, the regression predictions are transformed back to probabilities via a mean function. The regression on the logit scale is a linear model for the log of the odds of an event occurring (in this case, the odds of choosing the high effort option), where the odds represent the probability of the event occurring over its complement $\left(\frac{p(Y_{ij}=1)}{1-p(Y_{ij}=1)}\right)$:

$$LO = \log\left(\frac{p(Y_{ij}=1)}{1-p(Y_{ij}=1)}\right) = b_0 + b_1X \quad (\text{Eq. 1})$$
Here, the left-hand side of the equation represents the linear combination of $b_0$, $b_1$, and $X$ on the logit scale (log-odds, $LO$), where $X$ is some predictor variable of interest (presented here for completeness, though for most of this tutorial we will focus on intercept-only models). A feature of the logit scale is that it is simple to transform it into odds, and from there, into probabilities:

$$\text{odds} = \exp(LO) \quad (\text{Eq. 2})$$

$$p(Y_{ij}=1) = \frac{\text{odds}}{1+\text{odds}} \quad (\text{Eq. 3})$$
This gives us three ways to express the same linear combination of predictors in Eq. 1: log odds (i.e., logit scale), odds, and probabilities. In the case of effort choices in the DST, if $b_0=-0.25$ and $b_1=-0.09$, assuming $X=1$, then the log odds would be $-0.34$ ($-0.25 + -0.09$; Eq. 1), which corresponds to an odds of 0.71 ($\exp(-0.34)$; Eq. 2), and a probability of 0.42 ($0.71 / (1 + 0.71)$; Eq. 3), which all correspond to the same observed pattern: a participant is 0.71 times as likely to choose the high-demand option relative to the low-demand option, or, equivalently, a participant will select the high-demand option 42% of the time. This demonstrates an overall trend for participants to avoid the high-demand option.
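These conversions are straightforward to verify in base R; a minimal sketch using the hypothetical coefficient values from the example above (plogis() is R’s built-in inverse-logit function):

```r
# Hypothetical coefficients from the worked example
b0 <- -0.25   # intercept, on the log-odds scale
b1 <- -0.09   # slope, on the log-odds scale
X  <- 1

log_odds <- b0 + b1 * X        # Eq. 1: -0.34
odds     <- exp(log_odds)      # Eq. 2: ~0.71
prob     <- odds / (1 + odds)  # Eq. 3: ~0.42

# plogis() collapses Eq. 2 and Eq. 3 into a single step
all.equal(prob, plogis(log_odds))
```

In practice, plogis() (and its inverse, qlogis()) saves writing out the exponentiation by hand.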
These techniques can be extended to a multilevel framework to account for the nested structure of the data (for more detail, see S6b):

Level 1: $\log\left(\frac{p(Y_{ij}=1)}{1-p(Y_{ij}=1)}\right) = b_{0j} + b_{1j}X_{ij}$

Level 2: $b_{0j} = \gamma_{00} + U_{0j}; \quad b_{1j} = \gamma_{10} + U_{1j}$ (Eq. 4)
Here, $Y_{ij}$ represents the effort choice (0 or 1) on trial $i$ for participant $j$. Similarly, $X_{ij}$ corresponds to the value of $X$ on trial $i$ for participant $j$. Accordingly, $b_{0j}$ is the person-specific log-odds for person $j$ when $X_{ij}$ is equal to 0. $b_{1j}$ is the person-specific effect of $X_{ij}$ on $Y$ (in log-odds) for person $j$.
At the first level, participant $j$’s effort choice on trial $i$ is predicted from a linear combination of regression coefficients ($b_{1j}$) and trial-by-trial predictors (abstractly, $X_{ij}$), shifted by an intercept term ($b_{0j}$). The left-hand side of the Level 1 equation (Eq. 4) corresponds to the log odds of choosing the high-effort option.
As in the case of linear multilevel regression, Level 1 of Eq. 4 can be thought of as applying the standard (not multilevel) logistic regression to the data from each participant. Doing so would yield two coefficients on the log-odds scale unique to each participant—$b_{0j}$ and $b_{1j}$ described above. Conceptually, “averaging” both sets of coefficients ($b_{0j}$ and $b_{1j}$) across participants would yield the fixed effects at Level 2 ($\gamma_{00}, \gamma_{10}$) in Eq. 4. In other words, $\gamma_{00}$ represents the average log-odds of $Y=1$ assuming $X_{ij}$ is zero for the average or typical person, and $\gamma_{10}$ represents the average influence of $X$ on the log odds of $Y=1$ for the average person. These fixed effects are the average of participant-specific effects in log-odds. This is different from alternative techniques that return population-averaged fixed effects when applied to correlated or nested data (e.g., Generalized Estimating Equations; Hardin & Hilbe, 2014).
$\gamma_{00}$ and $\gamma_{10}$ are estimated “averages” of participant-specific values in log-odds, but each participant will have coefficients that deviate from these average estimates. These deviations are captured by $U_{0j}$ and $U_{1j}$ in Eq. 4 and represent how much more or less a given participant’s intercept or slope was from $\gamma_{00}$ and $\gamma_{10}$, respectively. These values play a central role in a multilevel framework, as they capture variation between participants. They are assumed to be normally distributed with mean zero and variance $\tau^2$: $U_{0j}\sim N(0,\tau_0^2), U_{1j}\sim N(0,\tau_1^2), \ldots, U_{kj}\sim N(0,\tau_k^2)$. In this regard, the regression coefficients from Level 1 of Eq. 4 (the $b$s) are outcomes in Level 2. For example, the expected value of a given $b_{0j}$ will be the sum of the average estimate, $\gamma_{00}$, plus the participant-level deviation, $U_{0j}$.
Such a model can be fit in R using the glmer() function from the lme4 package (Bates et al., 2015; S2). First, a random-intercept-only model is fit, which captures participants’ general propensity for selecting the high-effort task. In lme4 syntax, the model formula begins with effort_choice ~ 1, which asks R to fit a model where effort choice is predicted by its mean, the intercept, represented by a “1”. Additionally, we add (1 | PID) to this equation, which reflects the random intercept (“1”) for each (“|”) participant (“PID”) in our dataset. We then specify the data frame that contains our variables, data.Ctl, and finally we indicate that the family = 'binomial', which instructs R to fit a logistic regression model appropriate for our binary outcome variable.
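A minimal, self-contained sketch of this call (assuming lme4 is installed); because the original data.Ctl data frame is not reproduced here, simulated choices with the same column names stand in for it:

```r
library(lme4)

# Simulated stand-in for data.Ctl: 38 participants x 150 trials
set.seed(1)
n_pid    <- 38
n_trials <- 150
PID <- factor(rep(seq_len(n_pid), each = n_trials))
u0  <- rnorm(n_pid, mean = 0, sd = 0.87)        # participant-level deviations
p   <- plogis(-0.33 + u0[as.integer(PID)])      # per-trial choice probability
data.Ctl <- data.frame(
  PID           = PID,
  effort_choice = rbinom(n_pid * n_trials, 1, p)
)

# Random-intercept-only logistic multilevel model, as in Code Box 1
logistic_MLM0 <- glmer(effort_choice ~ 1 + (1 | PID),
                       data = data.Ctl, family = "binomial")
summary(logistic_MLM0)
```

With the real data, only the glmer() call is needed; the fixed intercept is then available via fixef(logistic_MLM0) and the random-intercept variance via VarCorr(logistic_MLM0).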
Looking at the “Fixed effects” section of the output in Code Box 1, the intercept ((Intercept) in the output) is −0.33, which represents the average log-odds of choosing the high-demand option for the typical person. This value can be converted to a probability using Eq. 2 and 3, which yields a 42% chance of choosing the high-demand option. Consistent with past work, this shows that, on average, people avoid high-demand options more often than chance (Kool et al., 2010; Patzelt et al., 2019).
Additionally, under the “Random effects” section of the output, the variance of participants’ intercepts around this grand mean ($\tau_0^2$, Variance in the output) is 0.76. This estimate is a quantification of the variability across individual participants. To gain a better intuition for this value, we can convert it from a variance to a standard deviation by taking the square root: $\tau_0=\sqrt{\tau_0^2}=0.87$. Since we assume normality, roughly 95% of participants will have log-odds of choosing the high-effort option within $\gamma_{00}\pm 2\tau_0=-0.33\pm 2\times 0.87=[-2.07, 1.41]$. Converting these values into probabilities using Eq. 2 and 3, 95% of participants will choose the high-effort option between 11% and 80% of the time. Thus, while the typical person avoids effort, some prefer it. Additionally, a researcher could visualize this variability by plotting the predicted participant-specific choice proportions (S3). To graphically depict this variance, we can extract the intercept for each participant from the model. These values are known as “empirical Bayes estimates” and are a weighted combination of person-specific information and the average across all persons ($\gamma_{00}$). The EB estimates for persons with more informative data (e.g., from completing more trials) will be closer to their person-specific estimates, while EB estimates for persons with less informative data will be closer to the average across all persons ($\gamma_{00}$). The more trials a person completes, the more their EB estimates can be differentiated from the average across all persons. Alternatively stated, EB estimates reflect skepticism about differences between persons (or clusters more generally). Extracting these values gives us a per-participant measure of effort avoidance ($b_{0j}$)—in other words, the fixed effect, −0.33 ($\gamma_{00}$), plus each participant’s deviation from this fixed effect ($U_{0j}$)—which yields each participant’s estimated log-odds of choosing the high-effort option.
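The arithmetic in this paragraph can be reproduced in a few lines of R, using the point estimates reported above (−0.33 and 0.76):

```r
gamma00 <- -0.33   # fixed intercept (log-odds), from Code Box 1
tau2    <- 0.76    # random-intercept variance, from Code Box 1

tau0 <- sqrt(tau2)                          # standard deviation: ~0.87
range_logodds <- gamma00 + c(-2, 2) * tau0  # ~[-2.07, 1.41]
range_prob    <- plogis(range_logodds)      # ~[0.11, 0.80]
round(range_prob, 2)
```

The final line converts the approximate 95% range from log-odds to probabilities, matching the 11%–80% interval described in the text.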
In R, this computation can be automated using coef(logistic_MLM0).
The output of Code Box 2 shows the first 6 participants’ estimated log-odds of choosing the high-effort option. To emphasize how these values correspond to participant behaviour, we have visualized the correspondence between each participant’s estimated log-odds of choosing the high-effort option and their actual proportion of high-effort choices (Figure 2A). As can be seen, participants with higher estimated log-odds of choosing the high-effort option also exhibited higher actual effort-seeking behaviour. Notably, this relationship follows an “S-shaped” curve, which is the result of applying the logistic transformation discussed in Section 3 to the data (Eqs. 1–3), allowing the log-odds to take both positive and negative values, whereas actual proportions are constrained to be between 0 and 1. It is important to note here that while above we have considered the fixed effects to be “averages” of person-level random effects, this is only true on the transformed log-odds scale. When converting log-odds back to probabilities, the nonlinearity of the transformation means that large differences in log-odds can correspond to only small differences in probabilities. For example, if we were to compare two (hypothetical) participants with estimated log-odds of choosing the high-effort option of 3 and 4 in terms of their raw probabilities, these values would correspond to probabilities 0.95 and 0.98—a difference of 3%. Conversely, if we compare two participants with the same difference in log-odds but lower absolute values, e.g., a participant with log-odds 1 and another with log-odds 2, then the corresponding probabilities are 0.73 and 0.88—a difference of 15%.
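This nonlinearity is easy to check numerically with the inverse-logit function:

```r
# The same one-unit difference in log-odds, at different absolute values
plogis(4) - plogis(3)   # ~0.03: compressed near the extremes
plogis(2) - plogis(1)   # ~0.15: larger near the middle of the scale
```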
With this caveat in mind, we can also visualize between-participant variation in log-odds. As can be seen in Figure 2B, there is substantial variation in individual participants’ effort-avoidance behaviour. As demonstrated in our numerical analysis above, a sizable proportion of participants demonstrate demand-seeking behaviour (red area in Figure 2B), preferring the high-demand option to the low-demand option on average (across trials). These individual differences expand theoretical understanding, because past work has often taken the fixed-effect estimates of effort avoidance ($\gamma_{00}$) as evidence for a general and ubiquitous cognitive mechanism that aims to minimize effort exertion and maximize reward (e.g., Kool et al., 2010; Kool & Botvinick, 2018).
While visualizing these variations yields insight into the generality of the estimated fixed effect, it is useful for researchers to quantify the degree to which variability in the data stems from individual differences versus within-participant response noise (for example, Volpert-Esmond et al., 2018). In other words, in a multilevel context, it is important to distinguish between variability that stems from level-2 variance (i.e., differences between participants) and level-1 variance (variability within participants’ responses). The former provides insight into the generality of our conclusions—e.g., are all participants effort avoidant and, if not, which ones are and which ones are not?—while the latter speaks to the degree to which our experimental manipulation yields reliable responses from participants—e.g., how stable is one’s aversion to effort over the course of the task? Generally speaking, high within-person variability would call for more explanatory within-person predictors, whereas high between-person variability would call for a greater number of between-person predictors. With answers to these questions a researcher can move forward with model building.
In the context of linear multilevel models, variability attributable to between versus withinparticipant differences is relatively easy to compute and outputted by default in statistical software (see S6b), but this is not the case with logistic models.
4. Challenges in Quantifying Variance in Logistic Multilevel Regression
In a linear random-intercept model, the total variance in the outcome is the sum of between and within variance components: $var(Y_{ij})=\tau_0^2+\sigma^2$, where $\sigma^2$ is the residual, within-person, variance (see S6b). As such, one can compute the proportion of total variance accounted for by variations between individuals, $\tau_0^2$—a value called the intraclass correlation coefficient (ICC; a.k.a. variance partition coefficient), or simply $\rho$—as the ratio of between-individual variation to the total variance: $\rho=\frac{\tau_0^2}{var(Y_{ij})}$ (see S6b for more information). Conceptually, the ICC corresponds to the degree of nesting in a dataset, or the importance of the cluster variable (the variable in which observations are grouped; here, participants) in explaining variance in the outcome (Goldstein et al., 2002). This is why computing the ICC is usually the best first step for a multilevel analysis (McCoach & Adelson, 2010). In cases where the data are nested within participants, the ICC summarizes the degree to which the outcome variable is sensitive to individual differences. Although many introductory texts on multilevel modeling discuss logistic MLMs as “simple” extensions of linear models, the ICC cannot be computed in the same way—and in fact not at all with the default output provided by lme4 or other commonplace multilevel modeling software.
A key difference between linear and logistic multilevel models is that logistic models do not estimate a residual variance term, $\sigma^2$—that is, a term that represents within-participant variability in the outcome. In a linear model, the residual variance term is the variance of deviations, negative and positive, from a Gaussian variable’s mean or prediction. In logistic models, which use a Bernoulli distribution, the mean represents the probability of an event occurring. In turn, the variance of deviations about that mean/probability is entirely determined by the mean according to the following equation^{2}: $var=P(Y=1)(1-P(Y=1))$; for example, $.42(1-.42)=.24$. As a result, a subtle, but important, difference between linear and logistic multilevel models is that logistic models, which are based on the Bernoulli distribution, lack a residual variance parameter, $\sigma^2$. The level 2 variance, $\tau^2$, is on the logit scale. The variance of the Bernoulli outcome as implied by the mean, however, is on the probability scale. As such, we cannot combine the level 1 and level 2 variance to form the total variance denominator as is done in the standard computation of the ICC.
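This mean-variance coupling can be seen directly; for instance, at the estimated average probability of .42, and at a few other probabilities:

```r
p <- c(0.10, 0.42, 0.50, 0.90)
bernoulli_var <- p * (1 - p)   # 0.09 0.2436 0.25 0.09
# The variance is fully determined by the mean, peaking at p = 0.5
round(bernoulli_var, 4)
```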
These caveats have consequences for the interpretation of variance parameters in a logistic multilevel model. First, there is no straightforward value for the residual variance ($\sigma^2$) that captures within-participant variation in the outcome—and accordingly, lme4 does not provide one (the residual variance parameter, when available, would be printed under the Random effects section in Code Box 1). Thus, unlike a continuous outcome model, there is not a singular, “simple” way to quantify the relative contribution of individual differences to variability in the data, and thus the degree of unexplained variance (Goldstein et al., 2002). Consequently, variance in the outcome cannot be decomposed into within and between components in a straightforward manner, and the standard ICC formula cannot be used to create a ratio of between-participant variance to total variance.
5. Within-Participant Variance and ICC in Logistic Multilevel Models
Reference texts usually discuss one method of quantifying the ICC in a logistic multilevel framework: the latent threshold method (e.g., Snijders & Bosker, 2011). However, other tested methods exist in the literature, and we review three additional approaches: 1) the simulation approach, 2) linearization, and 3) the median odds ratio. While all four approaches attempt to quantify the degree of variability between people, they do not estimate the exact same statistical quantity and will often return different values. As we will see, the first three approaches estimate the ICC in the (binary) outcome variable, but they do so differently. The median odds ratio, however, estimates a different value: the heterogeneity between people in terms of the odds of occurrence of a binary outcome variable—i.e., the median odds ratio that would be observed if multiple random samples were drawn from the population. After illustrating how to implement and interpret each method in R step-by-step, using the random-intercept-only model fit as an example above (logistic_MLM0, see Code Box 1), we summarize the differences between techniques (see Table 1).
Latent Threshold Approach
The simplest and most popular method of calculating the ICC in a logistic MLM is to assume that the within-cluster variation is equal to the variance of a logistic distribution with a scale parameter equal to one, yielding the value $\pi^2/3 \approx 3.29$ (Goldstein et al., 2002; Snijders & Bosker, 2011, p. 440). Using this approach, then, the ICC is a quantification of the proportion of variance in (latent) logit units attributable to between-person differences.
This method arises from a common formulation, or data generation mechanism, for binary regression models. We begin by assuming a continuous-outcome regression model, i.e., a weighted sum of predictors and an error term.^{3} However, the continuous outcome is unavailable, and instead we have a dichotomized version of the variable. Under this formulation, binary regression models attempt to retrieve the parameters, or regression coefficients, of the underlying continuous-variable regression. Since the continuous variable is unavailable, we make certain “identification” assumptions to permit model estimation. For logistic regression, the relevant assumption is that the error term in the continuous-variable regression follows a standard logistic distribution, i.e., it has mean 0 and variance $\pi^2/3$ (Goldstein et al., 2002).
Applying this formulation to the data at hand, we may assume a continuous distribution of values across participants, where participants with higher values have a greater preference for high-effort tasks. The exact preference value of a participant is a weighted combination of their predictors and the logistic error term. When this preference value exceeds some threshold, which varies by participant, the participant chooses the high-effort task. A random-intercept multilevel logistic regression returns the variation across persons, $\tau_0^2$, and we assume the variation within persons to be fixed at $\pi^2/3$. One can then calculate the ICC as the ratio of the variance component of interest to the total variance, $var=\tau_0^2+\pi^2/3$ (Goldstein et al., 2002).
In R, we can compute the ICC this way as follows (S4a):
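A sketch of this computation (mirroring Code Box 3); the random-intercept variance is hard-coded here so the snippet is self-contained, but with a fitted model it would be extracted via VarCorr(logistic_MLM0)$PID[1]:

```r
tau2 <- 0.7554             # random-intercept variance, from Code Box 1
                           # in practice: as.numeric(VarCorr(logistic_MLM0)$PID[1])
sigma2_latent <- pi^2 / 3  # assumed level 1 (within-person) variance, ~3.29

ICC_latent <- tau2 / (tau2 + sigma2_latent)
round(ICC_latent, 2)       # 0.19
```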
In the Code Box above, the VarCorr command extracts the random variance components from a fitted model (here logistic_MLM0; i.e., the value .7554 from Code Box 1). Alone, this yields one variance component per grouping variable and per random variance parameter ($\tau_0^2, \tau_1^2, \tau_2^2, \ldots$) specified in the model. In our case, we have one grouping variable, PID, and so we specify that we are interested in the random-intercept variance for this group using $PID[1]. We then use this value to compute the latent threshold ICC as described above. Executing the code in Code Box 3 yields an ICC = 0.19—that is, nearly 20% of the variance in effort-avoidant behaviour is attributable to differences between individuals or, conversely, 80% to within-participant variation.
In contexts where the modeler is primarily interested in a continuous outcome, but only has access to a dichotomized variable, this formulation of the ICC is ideal. For example, if a researcher were interested in a presumably continuous, but inaccessible, construct of effort aversion, then effort-related choices in the DST might be seen as a dichotomous representation of this latent construct. It is worth noting, however, that the assumption that unobservable continuous variables masquerade as dichotomous ones may be untenable in many experimental contexts. While effort choice could be thought to reflect a continuous underlying distribution of preference, some may think of it as a truly dichotomous outcome—a participant chooses either one option or the other—in the same way that a person is either a bachelor or not. In these cases, it may be desirable to treat responses as truly dichotomous when calculating the ICC.
Simulation Approach
As an alternative to the latent threshold approach, Goldstein et al. (2002) recommended a simulation-based approach as a more general means of estimating the ICC on the (observed) binary scale. Unlike the latent threshold approach, which operates on the logit scale, this is an ICC measure on the probability scale. The key idea is to compute the Bernoulli variance over a large number of simulated random effects as a proxy for the residual variance. The method computes the ICC from two sets of values: first, the person-level estimated probabilities, which are obtained from the fixed and random effects of a fitted model, and second, a measure of within-person variance based on the Bernoulli variance seen above, $P(Y=1)(1-P(Y=1))$, averaged across people. These two components can then be used to compute the ICC. In practice, this approach involves four steps.
1. From an already fitted model (logistic_MLM0), simulate a large number ($M$) of normally distributed participant-level random effects using the fitted model's random intercept variance estimate, $U_{0jm} \sim N(0, \tau_0^2)$, where $m$ refers to a single simulation, $m = 1 \ldots M$, and $j$ refers to a particular (simulated) participant.
2. For each random effect, compute predicted probabilities for each level-1 observation, $\hat{p}_{ijm}$, according to the model's fixed effects ($\gamma_{00}$) and the random effect obtained in step 1 (hence the $m$ subscript). Using the example model, logistic_MLM0, predicted probabilities of selecting the high-effort option on each trial would be computed as follows: $\hat{p}_{ijm} = \frac{\exp(\gamma_{00} + U_{0jm})}{1 + \exp(\gamma_{00} + U_{0jm})}$.
3. For each of these predicted probabilities, compute the level-1 variance according to a Bernoulli distribution: $\hat{V}_{ijm} = \hat{p}_{ijm}(1 - \hat{p}_{ijm})$.
4. The ICC is then estimated as the ratio of participant-level variance in predicted probabilities over total variance, which itself is composed of cluster-level variance and average level-1 variance: $ICC = \frac{\mathrm{Var}(\hat{p}_{ijm})}{\mathrm{Var}(\hat{p}_{ijm}) + \frac{1}{M}\sum_{m=1}^{M} \hat{V}_{ijm}}$.
Implementing this approach in R (S4b):
At the start of Code Box 4, we use the VarCorr function to extract the between-cluster variance, as before (Code Box 3). We then extract the fixed intercept ($\gamma_{00}$ in Eq. 4) using the fixef command and specify the number of simulations, $M$. We then follow the steps described above to compute the ICC using the simulation method.
Executing the code in Code Box 4 yields an ICC of 0.14; that is, 14% of the total variance in effort avoidance owes to differences between individuals. Notably, this estimate is smaller than the one obtained by the latent threshold method. In the latent threshold method, the ICC is computed entirely on the logit scale, and the residual variance parameter is assumed known. In the simulation approach, both the between- and within-participant variances are computed on the probability scale, so the calculations include the fixed effects. Hence, the fixed effects shift the mean of the probability in ways that affect the variance of the probabilities, which in turn increases at a slower rate than the variance on the logit scale. These differences lead to discrepancies in ICC estimates between the methods.
It is worth bearing in mind that because the simulated predicted probabilities incorporate the fixed effects themselves, estimates of the ICC in models with additional covariates (i.e., not a null model) will depend on the covariate patterns. That being said, differences in the ICC associated with different covariate values may themselves be of interest, though this process increases in complexity as more covariates are added.
Linearization
Instead of assuming a nonlinear (logistic or Bernoulli) variance, it is possible to estimate a linear approximation of the logistic multilevel equation and use the variance from this approximation. Linearization thus attempts to approximate the ICC otherwise obtained via the simulation approach, i.e., one in which variance is calculated on the probability scale. Goldstein et al. (2002) proposed the following first-order Taylor approximation for the null model, where $\hat{p}_{ij}$ refers to the estimated probability that $Y_{ij} = 1$: $\hat{p}_{ij} \approx \hat{p} + U_{0j}\,\hat{p}(1 - \hat{p})$.
This approximation holds for the null model specifically (i.e., the model for which we typically compute the ICC), though Goldstein et al.'s (2002) approach may also work in the presence of multiple predictors (a topic we return to later). For the null model, the total variance is a combination of level-1 and level-2 variance: $\hat{V}_1 = \hat{p}(1 - \hat{p})$ and $\hat{V}_2 = \tau_0^2\,[\hat{p}(1 - \hat{p})]^2$, with $ICC = \hat{V}_2 / (\hat{V}_2 + \hat{V}_1)$,
where $\hat{p}$ reflects the model-estimated probability of choosing the high-effort option at the mean of the random intercepts ($\gamma_{00}$), and $\hat{V}_1$ is the Bernoulli-implied variance. $\hat{V}_2$ is the variance of the level-2 outcomes as predicted by the model, including information about each person ($\tau_0^2$; i.e., the variance in the (null) model-estimated probability of choosing the high-effort option for each participant). The total estimated variance is the sum of these values. This variance then appears in the denominator of the standard ICC calculation. Note that the equation above is a more general form of the linearization equation, as the present model (logistic_MLM0) is a null model and thus contains only the intercept ($\gamma_{00}$). In R (S4c):
As before, at the start of Code Box 5, we extract the random effects and fixed effects using the VarCorr and fixef functions in R. We then follow the steps described above to compute the ICC. Executing Code Box 5 yields an ICC of approximately 16%, closer to the simulation method and lower than the latent threshold approach.
While this linearization method does not require simulation, much like the simulation approach, its value is conditional on the covariate structure. That is, the ICC will be conditional on the values of the predictor variables. We discuss ways around this limitation later, but it is at the discretion of the modeler whether the unique ICCs for different patterns of covariates are of interest over a global ICC estimate.^{4} Another consideration is that when $\tau_0^2$ is large, the linearization transformation is applied to values along a broad span of the logistic curve, which is nonlinear and poorly approximated by a linear function. As such, the linearization method may only be appropriate when $\tau_0^2$ is relatively small.
Median Odds Ratio
The methods reviewed so far have aimed to compute the ICC for logistic multilevel models using techniques analogous to those for linear models. An alternative approach is to abandon the framework of the ICC and instead compute a measure of participant-level variation that more readily captures the binary nature of the data, without the need for an estimate of within-participant variance; that is, a method in which the measure of variation is expressed in terms of odds, a more common scale for interpreting binary data models. Accordingly, the median odds ratio (MOR), proposed by Merlo et al. (2006), represents median participant-level variation on the odds ratio scale. Conceptually, the MOR represents the median odds ratio across every pairing of participants in the dataset. Imagine that every participant is paired with each other participant in the sample and their odds of choosing the high-demand option are calculated according to their participant-level deviation from the fixed intercept. The ratio of each pair of odds is then computed, with the higher odds always placed in the numerator. Doing this for all pairs of participants would yield a distribution of odds ratios, the median of which is the MOR. If the MOR is equal to 1, it would suggest no differences between participants (all participants are equally effort-avoidant; that is, they avoid the high-effort stimulus with log-odds of $\gamma_{00}$). If the MOR is considerably larger than 1, it would suggest sizable individual differences. As such, the MOR is a useful technique for those interested solely in between-participant effects, but offers no direct extension to quantify within-participant variability.
Practically, it is not necessary to carry out these computations in full, because the MOR can be readily computed using the following equation (see Merlo et al., 2006): $MOR = \exp\left(\sqrt{2\tau_0^2}\,\Phi^{-1}(0.75)\right)$, where $\Phi^{-1}$ is the quantile function (inverse cumulative distribution function) of the standard normal distribution, and $\Phi^{-1}(0.75)$ is roughly equal to 0.6745 (see Larsen & Merlo, 2005 for a derivation of this equation). In R, the MOR can be computed as follows (S4d):
Using the data at hand, the MOR is 2.29, which means that the median odds of avoiding the high-demand option are 2.29 times higher when comparing two randomly chosen participants (with the higher odds in the numerator). In other words, the odds of effort avoidance in the present sample vary, at the median, by a factor of 2.29. Relative to the original estimate of between-participant variability from the output of fitting logistic_MLM0 ($\tau_0^2$), the MOR summarizes how effort aversion varies when comparing random pairs of participants, whereas $\tau_0^2$ is an estimate of average variability in effort aversion across participants.
While the MOR is unlike the other approaches discussed thus far, it is worth emphasizing its conceptual similarity to the ICC (and thus its inclusion in this tutorial). Much like the ICC, the MOR provides information about how individual variability between participants impacts the odds of the outcome variable being equal to 1 (in this case, choosing the high-effort option). Put simply, both the MOR and the ICC quantify how important the participant is in determining the outcome. Moreover, like other ICC values (and holding total variance constant), larger MOR values imply reduced within-person variability, and vice versa. In this respect, the MOR and the ICC convey similar information. Nevertheless, unlike the ICC, the MOR is not expressed as a proportion. As such, this method offers no quantification of residual, within-participant variance, which could be of interest to experimental psychologists.
Table 1. Summary of the four methods for quantifying between-participant variability (pros and cons summarize the discussion above).

| Method | Estimated ICC for logistic_MLM0 | Pros | Cons |
| --- | --- | --- | --- |
| Latent Threshold | 0.19 | Simple to compute; does not depend on covariate values; natural when the outcome is viewed as a dichotomized continuous construct | Assumes an underlying continuous latent variable; defined on the logit rather than the probability scale |
| Simulation (Goldstein et al., 2002) | 0.14 | Defined on the observed (probability) scale; very general | Requires simulation; estimate depends on the covariate pattern |
| Linearization (Goldstein et al., 2002) | 0.16 | Probability scale without the need for simulation | Depends on the covariate pattern; approximation degrades when $\tau_0^2$ is large |
| Median Odds Ratio (Merlo et al., 2006) | 2.29 (odds ratio scale) | Expressed in odds, a familiar scale for binary models; no within-participant variance estimate needed | Not a proportion; offers no quantification of within-participant variance |
6. Estimating Uncertainty about the ICC With Bootstrapping
In the previous section, we outlined four methods for quantifying variation in logistic multilevel modeling, as well as some relative advantages and disadvantages of each method. However, quantifying uncertainty around any given estimate of the ICC, which in our example varies between 15% and 20% depending on the method, is not routinely discussed in the literature.
To quantify uncertainty in the ICC, Austin and Leckie (2020) proposed using a parametric bootstrapping process with percentile confidence intervals. In its simplest form, this process unfolds in three steps:
1. Simulate many datasets from the original logistic model, randomly sampling random effects from a normal distribution with the mean and variance estimated from the fitted model.
2. Compute the quantity of interest (here, the ICC or a residual variance measure) in each simulated dataset.
3. Summarize the resulting sampling distribution (e.g., by taking the mean, computing a 95% interval, etc.) to make statistical inferences.
In principle, it would be possible to hand-code a bootstrapping procedure to estimate these values using the information provided in the Code Boxes throughout this tutorial (and, for the intrepid reader, we provide code to do so at https://anonymous.4open.science/r/logisticiccF6C8/ and S5b). For ease, however, we provide a function that draws bootstrapped samples of the ICC using the methods specified above. Below, we use this function to obtain 100 bootstrapped estimates of the ICC using the latent threshold method (S5a).
bootstrap_icc() takes as arguments a model (here, logistic_MLM0), a grouping variable (here, PID), a string representing one of the methods for computing the ICC described above (here, icc_thre), and the number of samples to draw, B. Applying this yields a bootstrapped sampling distribution that characterizes uncertainty around the point estimate. For instance, while the point estimate for the latent threshold ICC in the present data was 0.19, the bootstrapped confidence interval suggests the ICC estimates most compatible with the current data range between 0.11 and 0.26 (see Table 2). In other words, anywhere between roughly a tenth and a quarter of the variance in effort-avoidant behaviour is attributable to differences between individuals.
Table 2. Bootstrapped 95% percentile intervals (100 samples).

| Variable | 2.5% quantile | 97.5% quantile |
| --- | --- | --- |
| ICC (Latent Threshold) | 0.11 | 0.26 |
| ICC (Simulation) | 0.11 | 0.23 |
| ICC (Linearization) | 0.10 | 0.22 |
| MOR | 1.83 | 2.77 |
Notably, unlike confidence intervals calculated using standard errors, the lower bound of these bootstrapped confidence intervals will never go below 0 in the case of the ICC, or 1 in the case of an MOR, because these are the lowest possible values these metrics can take. Accordingly, 1 for the MOR and 0 for the ICC might be very conservative thresholds for minimal participantlevel variation. The advantage of the bootstrap approach is that researchers can set their own lower bound for what constitutes minimal heterogeneity in the outcome measure and explore the 95% interval relative to this threshold.
The 95% intervals of the bootstrapped distributions for the MOR, and for the ICC using all of the methods discussed thus far, based on 100 samples, are summarized in Table 2. Regardless of the measure, there are substantial individual differences in effort-avoidant behaviour, suggesting that not all people are equally averse to effortful action, a finding echoed by recent work exploring individual differences in cognitive effort investment, which speaks to the theoretical insights that examining random effects can provide (Otto & Daw, 2019; Sandra & Otto, 2018).
7. Computing the ICC and MOR in Models with Predictors and Random Slopes
Thus far, the tutorial has focused on a null model with no predictors and random intercepts only. We recommend this as a first step in model building because it provides a starting point for understanding variability in the outcome. Researchers are then likely to proceed by adding predictors and estimating random slopes. In this section, we review how to use the methods we have discussed with more complex models.
Models with WithinParticipant Predictors
First, we demonstrate how to estimate a model with a fixed within-participant predictor, trial number, and random intercepts. Code Box 8 shows that the regression equation now includes trial0, which captures the number of trials that have elapsed since the beginning of a block in the DST (where a block is a subset of trials). The predictor has been centered at the median and scaled to lie between 0 and 1. In line with past work (Kool et al., 2010), we consider whether participants' desire to avoid higher-effort tasks increases as their exposure to the experimental task increases. The fixed intercept, $\gamma_{00}$, reflects the average log-odds of selecting the high-effort task when trial0 is equal to zero for the typical participant, and the fixed slope, $\gamma_{10}$, reflects the change in log-odds from the beginning to the end of the task for the typical participant (from when trial0 equals 0 to when it equals 1). As before, we have specified that the intercept varies per participant in a normally distributed fashion, with random variance $\tau_0^2$.
The simulation and linearization approaches for computing the ICC described above rely on an estimate of intercept variance ($\tau_0^2$) whose contribution depends on the value of the intercept in the model, which in turn depends on the covariate structure. For these approaches, if researchers are interested in various points (e.g., when covariate 1 is high and covariate 2 is low, when both covariates are high, when both are low, and so on), they need to calculate multiple ICCs. In Figure 3A, we plot the ICC at each value of trial0 using the linearization method, which demonstrates how the ICC changes as we move along different values of the covariate. Conversely, both the ICC computed using the latent threshold approach and the MOR implicitly assume fixed residual variance and are taken to be appropriate for any covariate structure. In the presence of covariates, the latent threshold ICC and MOR return values that are adjusted for the covariates. For example, the latent threshold ICC functions similarly to the residual ICC in linear multilevel models in that it is the latent variance attributable to differences between individuals after accounting for the variance attributable to covariates.^{5}
An alternative to this proliferation of simulation and linearization ICC values under increasingly complex covariate structures is to calculate the ICC at the average prediction of the observed data on the logit scale. To do so, the average predicted value from the model is used in place of the combination of fixed effects. The benefit of this method is its simplicity; the cost is that it cannot capture the relative contribution of within- versus between-participant variance at different levels of the predictor. Thus, faced with a model like our logistic_MLM1, we must decide whether to compute the ICC at different levels of the predictor (at the beginning vs. end of a block, for instance) or to compute a global ICC at the average (here, the middle of a block). We take the latter approach by centering the predictor at the median, but the former approach could be taken if trial0 were recentered to the beginning or end of the task.
As an example, we can leverage the averaging approach described above for the linearization method described in Section 5 to compute a single ICC for logistic_MLM1 estimated above, which yields an estimate of the ICC that is close to the estimate for the intercept-only model in Section 5 (ICC_{average} = 0.1554; ICC_{null} = 0.1552). As mentioned, however, this estimate of the ICC is only valid for the middle of the block (see Figure 3A). This is implemented in the icc_lin() function provided in the Supplemental Materials (S6d; see Code Box 9).
As an aside, in the case of the simulation method, this averaging approach can be extended via numerical integration to estimate the moments of the logit-normal distribution, which the simulation method can be thought of as approximating. This yields numerically very similar estimates of the ICC without the need for simulation. We describe this numerical integration approach in the Supplemental Materials (S6d) and provide a separate function for it.
Models with BetweenPerson Predictors
The same issues apply when a level-2 predictor (in our example, a between-person predictor) is included. In Code Box 10, we fit a logistic multilevel model in which effort choices are predicted by experimental condition: control (0) or stress (1). The original purpose of Bogdanov and colleagues' (2021) study was to examine the role that acute stress plays in effort avoidance. The model below tests this prediction, examining average effort avoidance rates depending on whether participants were stressed before completing the DST.
Just as with the continuous predictor example above, the intercept, and thus the ICC computed using the simulation and linearization approaches, will depend on the value of Condition. We illustrate this in Figure 3B. In Code Box 11, we compute the averaged ICC using the linearization method. As before, we obtain a value not too far from, though slightly lower than, the linearized ICC computed for logistic_MLM0 (ICC_{null} = 0.16).
As before, these issues do not apply to the latent threshold ICC nor the MOR, which will be identical regardless of the covariate pattern.
Notably, while continuous predictors may yield an unwieldy number of possible ICCs, categorical variables have a restricted set of covariate patterns. As a result, researchers may in fact be interested in examining how person-level variability in the intercept (indexed by the ICC) varies across categorical conditions. Moreover, when a model includes categorical predictors, the average prediction may not reflect any actual case in the data, such that it may be wiser to consider the ICC at specific values of the categorical predictors.
Models with Random Slopes
Finally, a modeler may be interested in estimating how much the influence of a level-1 predictor varies per participant. For example, we might expect that over the course of a block, participants become increasingly effort averse, but that some participants remain steadfast in their initial rate of effort aversion while others rapidly reject all effortful tasks after only a few trials. This type of conditional between-participant variability can be modeled using random slopes. In Code Box 12, we estimate logistic_MLM2 by adding (trial0 | PID) to estimate the slope variance across people. As can be seen in Code Box 12, a new random effect variance is included, which encodes how sensitive different participants' effort preferences are to the trial number of the task.
Though adding random slopes to the model and computing the ICCs in the manner we have demonstrated is relatively straightforward in lme4, there is debate in the literature about whether the ICC should be computed for models that contain random slopes. For instance, Kreft & De Leeuw (1998) state that “The concept of intraclass correlation is based on a model with a random intercept only. No unique intraclass correlation can be calculated when a random slope is present in the model.” (p. 63). Conversely, Goldstein et al. (2002) highlight that it is, in theory, possible to estimate the ICC for a model with random slopes, but that doing so will be conditional on the pattern of covariates and that doing so changes the interpretation of the ICC—and its components: between and withinparticipant variation—to a point that it no longer serves the same purpose of assessing the relative contribution of between versus withinparticipant variation to total variance. We will not settle this debate in this tutorial and to our knowledge, no work exists that tackles this issue in logistic multilevel models. For now, we recommend implementing the ICC using the techniques described above for models with random intercepts, and we caution that the interpretations we’ve provided do not hold for models with random slopes.
8. Conclusion
Calculating the ICC to describe the effect of clustering in the dependent variable is a best practice for the early stages of multilevel model building. However, for logistic models, this is not straightforward to do, with popular programs not outputting the necessary estimates. In this tutorial, we have reviewed four methods for computing estimates of between- and within-participant variability from logistic multilevel models and demonstrated how to calculate the ICC. We have highlighted relative advantages and disadvantages of each technique and provided R code to compute these estimates, both manually (using didactic code from the Code Boxes or the supplement to reproduce all analyses reported herein) and functionally (using the function code provided in the repository associated with this paper and described in S6d). Finally, we reviewed bootstrapping techniques to quantify uncertainty about these variance estimates.
Here, we focused on modeling binary data using a logit link function, but alternative link functions, like the probit, are commonly employed. The techniques for quantifying the ICC described here (excluding the MOR, because it is expressed in odds ratios) could be used for a probit model. That is, in a probit model, the residual variance in the latent threshold ICC would be 1 rather than $\pi^2/3$; in the simulation approach, the inverse-logit mean function would be replaced by the normal cumulative distribution function; and for the linearization approach, the formulas presented in Equations 5 and 6 would also change.
At this point, readers may ask which method is the best overall, or which method they should use for a specific scenario. To help arbitrate between the different methods presented here, we point readers to Table 1 and Figure 4 for practical considerations. Additionally, the decision between methods might also be made on theoretical grounds. If researchers view the observed binary outcome as the result of a measurement process that dichotomized a continuous variable or trait, then the latent threshold approach is appropriate. If they are instead concerned only with the binary outcome itself, the linearization or simulation methods should be considered. Finally, if researchers wish to express between-person heterogeneity in terms of odds, the MOR is a good option.
Finally, another way to identify which measure to report is to consider the type of effect size the researcher plans to report. If researchers primarily focus on log-odds and odds ratios, then the latent threshold approach and the MOR are adequate, because they are already expressed on that scale. If researchers instead intend to transform effects into probabilities or to communicate effects on the probability scale (i.e., on the observed scale), then the simulation and linearization approaches may be more appropriate.
Overall, making these approaches easier to implement will enable researchers to investigate variability in their binary data and expand their theoretical considerations beyond effects in the aggregate.
Funding
This work was supported by a fellowship (awarded to S.D.) from the Natural Sciences and Engineering Research Council of Canada.
Author Contributions
S.D. conceived of the initial research question, reviewed literature, and wrote the initial draft of the tutorial. J.O.U. provided expert commentary throughout the writing and coding process. A.R.O. provided critical feedback throughout the writing process. J.K.F. helped write and edit the manuscript and supervised the project.
Competing Interests
The authors declare no conflict of interest.
Data and Materials Availability
All code and data used for this tutorial can be found at https://github.com/seandamiandevine/logisticicc.
Footnotes
For ease of use, function code for each method is available for download from the following GitHub repository; these functions compute the measure of interest from a fitted logistic multilevel regression model: https://github.com/seandamiandevine/logisticicc. A brief demonstration of how to call these functions is provided in the Supplemental Materials (S6d).
To demonstrate why this equation holds, let $X$ be a binary variable coded as 0 and 1, let $\theta$ be the mean of its distribution ($p(X=1)$), and recall that the variance can be written $\overline{X^2} - (\overline{X})^2$. We begin by solving for $\overline{X^2}$: $\overline{X^2} = p(X=1)\cdot 1^2 + p(X=0)\cdot 0^2 = p(X=1) = \theta$. Following the standard equation for computing variance, $\sigma^2 = \left(\sum X^2 / N\right) - \mu^2$, keeping in mind that $N$ is already captured by $\theta$ by virtue of it being a probability ($\sum X / N$) and that $\mu = p(X=1) = \theta$, we can compute the variance around $\theta$ as: $var(X) = \theta - \theta^2 = \theta(1-\theta)$. In this final form, we can see that the variance is just a restatement of $\theta$.
This formulation is mathematically equivalent to the GLM formulation for binary regression models based on the Bernoulli distribution, with a linear model for probabilities on an unbounded scale.
Nakagawa et al. (2017; 2013) have proposed a similar method for estimating the coefficient of determination ($R^2$) in generalized multilevel models. In the case of binary outcomes, this method estimates the proportion of variance explained in a model. This value is proportional to the ICC when considering null models, as we do here (Snijders & Bosker, 2011), and in fact the latent threshold ICC is exactly equal to the conditional $R^2$ value, assuming a null model. Accordingly, we do not review it in detail here, but point readers to Nakagawa's (2013; 2017) work on the topic. Notably, one can estimate $R^2$ for multilevel regression families other than binomial (e.g., Poisson), which may be of interest to some readers (see for instance the MuMIn package in R for an implementation of this method).
Since the residual variance is fixed across logistic models, estimated model parameters are automatically rescaled across models to reflect changes in residual variance (an issue referred to as noncollapsibility or unobserved heterogeneity, depending on the literature; Greenland et al., 1999; Mood, 2010). For example, if one extends a random-intercept model by including within-participant predictors that do not explain any participant-level variance (e.g., person-mean-centered predictors), the intercept variance ($\tau_0^2$) would increase relative to the random-intercept model, producing a larger latent threshold ICC or MOR, since the residual variance is smaller than in the earlier model.