On the Role of Response Styles in the Study of Intraindividual Variability

In a number of psychological disciplines, experience sampling studies are used to investigate when and why individuals vary in their cognitions, feelings, or personality states, to name just a few examples. However, to validly answer such research questions, researchers have to use reliable and valid measures of such intraindividual or within-person variability. Following Baird, Lucas, and Donnellan (2017), the present study uses data from two experience sampling studies to investigate whether response style variability can be assessed in a reliable way, whether variability measures are related to such response styles, and whether it is important to control for response styles in later analyses. We found that individuals differ in their response style variability and that these differences can be reliably assessed. Furthermore, response styles were moderately related to personality indicators, affect, and self-esteem variability. They were not associated with any of the outcome measures, so that most results remained unchanged when controlling for these differences. Nevertheless, our results indicate that researchers should additionally assess response styles when studying phenomena involving intraindividual variability.

There is growing interest in psychological research in investigating the determinants and consequences of individuals' variations in their cognitions, feelings, and behaviors. For example, personality psychologists examine whether certain personality traits such as narcissism go along with higher self-esteem variability (e.g., Geukes, Nestler, Hutteman, Dufner, et al., 2017) or whether variability in states or in personality traits across social roles is related to well-being or adjustment (e.g., Baird et al., 2006; Magee et al., 2018). In life-span psychology, researchers investigate whether age predicts differences in positive and negative affect variability (Röcke & Brose, 2013). Moreover, emotion researchers are interested in whether people vary in their application of emotion regulation strategies and whether this type of variability is related to trait negative affect (e.g., Blanke et al., 2020).
In these, but also in all other studies of within-person variability (also known as intraindividual variability), reliable and valid measures of variability are needed to properly investigate the research questions at hand. Interestingly, however, only a few studies address the measurement properties of intraindividual variability indicators. One exception is the study by Baird, Lucas, and Donnellan (2017), in which individuals were asked to judge several neutral objects, allowing the authors to measure individuals' preferences for certain response categories.
Thereafter, participants took part in a daily diary study in which they repeatedly rated adjectives describing themselves (e.g., their self-esteem). The results showed that the response style measure was moderately associated with intraindividual variability in the substantive variables. This suggests that failure to control for response styles can lead to erroneous conclusions about the hypotheses being investigated, for example, falsely concluding that intraindividual variability in a substantive variable is associated with a certain determinant or consequence.
The aim of the present research is to replicate and extend the results by Baird et al. (2017) using data from two large experience sampling studies in which participants were repeatedly asked to provide ratings on different personality, affect, and self-esteem states. Specifically, we investigate the extent to which the intraindividual variability indicators of the substantive variables are correlated with one another, whether and how strongly they are correlated with response style variability measures, and finally, whether they are related to several adjustment indicators even when controlling for response styles.
First, one approach is to employ questionnaires, that is, to explicitly ask people to judge how variable they are. Several scales have been constructed for this purpose, including the Self-Pluralism Scale (McReynolds et al., 2000; example item: "I act and feel essentially the same whether at home, at work, or with friends."), the Stability of Self Scale (Rosenberg et al., 1989; example item: "My opinion of myself tends to change a good deal instead of always remaining the same." [reverse scored]), and the Self-Concept Clarity Scale (Campbell et al., 1996; example item: "My beliefs about myself often conflict with one another." [reverse scored]).
Second, another way to measure variability is to examine how strongly a person varies in her responses to items within a scale assessing a construct or how strongly she varies across different scales within a test battery. For example, researchers have computed participants' standard deviation across responses to items in an extraversion scale in order to measure intraindividual variability in extraversion (e.g., Reise & Waller, 1993). Third, a related approach is to ask participants to judge the variable under consideration (e.g., extraversion) across a number of different social roles, relationships, or contexts (e.g., extraversion with friends, with family, etc.; see, e.g., Baird et al., 2006, 2017). Here as well, intraindividual variability is measured by computing the standard deviation of participants' responses to the items across social roles, relationships, or contexts.
Fourth and finally, another approach to measuring intraindividual variability is to assess a person's thoughts, feelings, or behaviors across multiple days, or multiple times per day over a period of several days, in an experience sampling or daily diary study (e.g., Baird et al., 2006; Fleeson, 2001; Geukes, Nestler, Hutteman, Dufner, et al., 2017). The indicator of intraindividual variability for the variable in question is then obtained by computing an individual's standard deviation for the respective item across all measurement occasions. If multiple indicators or items for a construct are assessed, one could average the standard deviations across items. Another approach would be to first average the respective items within each measurement occasion and then compute the standard deviation of this average for an individual over time. Independently of how one proceeds, the experience sampling/daily diary approach is perhaps the most face-valid approach to assessing intraindividual variability in a range of different constructs. Due to the increasing opportunities to collect such data, and thus the increasing availability of experience sampling and daily diary data, this approach is also, in our experience, the one used most frequently across psychological disciplines to examine intraindividual variability phenomena. Therefore, we focus on this final approach in the present paper.
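The two aggregation routes described above can be sketched as follows (an illustrative Python sketch with made-up toy data; the authors' own analyses were conducted in R, and the function names here are our own):

```python
import statistics as st

def variability_avg_of_sds(days):
    """Per-item route: compute each item's SD over occasions, then
    average the item SDs. days: list of occasions, each a list of
    item ratings for one construct."""
    n_items = len(days[0])
    return st.mean(st.stdev([d[i] for d in days]) for i in range(n_items))

def variability_sd_of_avg(days):
    """Composite route: average the items within each occasion first,
    then take the SD of that occasion-level composite over time."""
    return st.stdev([st.mean(d) for d in days])

# Hypothetical toy data: 5 days x 3 items of one construct, 1-6 scale
days = [[3, 4, 3],
        [5, 5, 4],
        [2, 3, 2],
        [4, 4, 5],
        [3, 2, 3]]
v1 = variability_avg_of_sds(days)  # average of the three item SDs
v2 = variability_sd_of_avg(days)   # SD of the daily composite
```

The two scores typically differ somewhat, because averaging items within each occasion cancels out item-specific fluctuations before the over-time spread is computed.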

Limitations of Daily Diary Measures of Intraindividual Variability
Despite their strengths, intraindividual variability measures based on daily diary data are not without problems. A first (mathematical) problem with standard deviations (or variances) is that they depend on a person's mean score on the respective item.1 For example, when a person's average for an extraversion item is 5 and the item was assessed on a 1 to 5-point scale, no variance in extraversion is mathematically possible. However, a person who has a mean of 3 could have a wider range of extraversion scores and hence vary more strongly. Thus, a more extreme average is linked to a smaller standard deviation, while the maximum standard deviation can be achieved by persons whose average corresponds to the scale midpoint. This implies that persons' standard deviations are correlated with their means and that one should account for this correlation when investigating research questions regarding intraindividual variability (see Baird et al., 2006). A second problem concerns the effects of response styles, that is, a person's tendency to prefer specific response scale options or categories over others (see Henninger & Meiser, 2020). For example, individuals may tend to choose the highest or lowest response categories, or they may tend to choose the middle category. Research shows that such individual differences in response styles can have a strong influence on the measurement of traits (e.g., Baumgartner & Steenkamp, 2001), occur consistently across traits (e.g., Weijters et al., 2010), and are stable across time (e.g., Wetzel et al., 2016). As response styles influence aggregate measures, they may also affect intraindividual variability indicators that are computed across repeated measurements.
For instance, when people tend to use the extreme scale categories (either high or low) more often, this can lead to a large and potentially overestimated intraindividual variability score, whereas when people often use the middle scale category, this can lead to a small and potentially underestimated intraindividual variability score.
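This distortion can be made concrete with a small sketch (Python for illustration; the two raters and their scores are entirely hypothetical):

```python
import statistics as st

# Two hypothetical raters whose days fluctuate to a similar degree but
# who express this through different response styles on a 1-6 scale
extreme_style = [1, 6, 1, 6, 2, 6, 1, 5]   # prefers the endpoint categories
midpoint_style = [3, 4, 3, 4, 3, 4, 3, 4]  # prefers the middle categories

sd_extreme = st.stdev(extreme_style)    # large "variability" score (~2.45)
sd_midpoint = st.stdev(midpoint_style)  # small "variability" score (~0.53)
```

The raw standard deviation cannot distinguish genuine intraindividual variability from such differences in scale use, which is exactly the confound at issue here.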
To the best of our knowledge, Baird et al. (2017, Study 2) is the only study to directly examine whether intraindividual variability scores are contaminated by response styles. In Baird et al. (2017), participants took part in a daily diary study in which they were asked to rate different adjectives for the Big Five traits, positive and negative affect, and self-esteem once a day for 14 days. Prior to the daily diary phase of the study, they also completed a response style measure, and after the study, they reported on different measures of well-being and adjustment, for example, trait self-esteem and life satisfaction. The response style measure consisted of a neutral object rating task in which participants were asked to indicate their satisfaction with 25 neutral objects (e.g., the participant's telephone number, the local newspaper, etc.; see Judge & Bretz, 1992). For each person, a response style variability measure was computed by calculating the standard deviation across these judgments. The idea behind this measure is that in the absence of stable interindividual differences in response scale use, participants should have similar variability scores, since the 25 objects are relatively neutral. The results showed, however, that individuals differed in their standard deviations on the neutral object task ratings.

Footnote 1: Another problem is that the construct the item measures might change in a linear or non-linear way over time. A remedy to this problem is to detrend the data for each person prior to data analyses or to include a time trend variable when computing the standard deviations (see Geukes, Nestler, Hutteman, Dufner, et al., 2017). However, as such construct changes occur relatively slowly and therefore infrequently in daily diary studies, we do not discuss this issue further here.
Furthermore, the response style measure was moderately correlated with the intraindividual variability scores on the substantive variables, indicating that these indicators are contaminated by response styles. Finally, the authors showed that intraindividual variability scores on the substantive variables (e.g., state self-esteem) were unrelated to most trait well-being measures but correlated with the response style measures, indicating that, once one controls for response styles, the correlations between the substantive variables and trait well-being measures might disappear. However, the authors did not examine this latter prediction.

The Present Research
The aim of this article is to replicate and extend the results by Baird et al. (2017) using two large experience sampling studies. We deem successful replications important in light of the current replication crisis in psychology generally, but also specifically because researchers routinely examine hypotheses concerning intraindividual variability in many psychological disciplines. It is therefore crucial to know (optimally on the basis of robust, that is, successfully replicated, empirical findings) whether measures of intraindividual variability are influenced by response styles. If this is the case, researchers would have to control for response style differences in their analyses to avoid drawing the wrong conclusions regarding their hypotheses. Of course, this would necessitate simple and reliable measures of response styles, whereby these properties should ideally have been replicated across multiple independent studies. With our research, we aim to provide such a first replication. However, we also want to extend the findings of Baird et al. (2017) by investigating a second response style measure and by examining the retest reliability of the measures. Furthermore, we also examine the associations of the response style measures across different operationalizations of intraindividual variability, and we provide an important extension regarding the associations with well-being measures.
We use data from two daily diary studies. In both studies, participants took part in an experience sampling study in which they completed state measures of different personality traits, affect, and self-esteem. After this experience sampling portion of the study, they also provided ratings on different trait measures concerning well-being. In the first study, called FLIP (fluctuations in personality), participants completed different measures of response tendencies after the experience sampling portion of the study, while in the second study, called FLUX (fluctuations), participants answered these measures at both the start and the end of the study. This allows us to investigate the stability of participants' response style variability scores. Consistent with the findings by Baird et al. (2017), we assume that, first, participants display interindividual differences in response style variability and that those differences can be reliably captured and are relatively stable over time. Second, we assume that interindividual differences in response styles are related to differences in intraindividual variability in the substantive variables. Third and finally, we examine the associations between the variability measures and well-being measures after controlling for response styles. The results of these analyses depend on whether response style variability is related to the well-being outcomes we are studying. If that is the case, we predict that the correlations will disappear.

Sample
The sample consisted of 102 participants. Six participants did not complete the second online survey in which we assessed response styles and outcome measures. These participants were therefore excluded from the analyses, resulting in a final sample of 96 individuals. Most of the participants were women (81%) and their age ranged between 18 and 65 years (M = 22.46 years, SD = 5.59 years). The number of completed daily reports per participant ranged from 33 to 82 (M = 75.98, SD = 8.84, Mdn = 79.5). Participants received monetary compensation (about 50 Euro) for their participation. The FLIP study was approved by the ethics committee of the University of Leipzig (331/17-ek).

Procedure
The data collection phase took place from October to December 2017. On each day during this period (82 consecutive days), participants were asked to retrospectively rate their experiences and behaviors on this day with regard to Big Five personality states, their self-esteem, a number of motivational states, and an affect grid. Before and after this daily diary phase, participants completed online surveys to obtain demographic information as well as measures of personality traits (i.e., Big Five personality traits, agency and communion, narcissism, motives), well-being measures (self-esteem, positive and negative affect, life satisfaction), and measures of response styles. Here, we only use the daily diary measures of personality states and self-esteem and the questionnaire data from the second assessment (after the daily diary portion) concerning self-esteem, positive and negative affect, life satisfaction, and response styles.

Measures
Daily diary variability measures. Personality states were measured with 19 adjectives. We adapted these adjectives from Borkenau and Ostendorf (1998) and the interpersonal adjective list (Jacobs & Scholl, 2005). Three adjectives were indicators of neuroticism (nervous, relaxed, irritable), four were indicators of extraversion (assertive, unsociable, shy, outgoing), three indicated openness (curious, creative, witty), six indicated agreeableness (hostile, compliant, sensitive, friendly, cynical, helpful), and three indicated conscientiousness (diligent, organized, negligent). Furthermore, one item measured daily self-esteem ("I am satisfied with myself"). Participants were instructed to rate how well the respective adjective or item matched their experiences on that particular day ("Today I was…"). All ratings were made on scales ranging from 1 ("strongly disagree") to 6 ("strongly agree"). Note that the personality state items were presented in a randomized order on each day and that the self-esteem item was assessed after the personality states.
We followed the approach by Baird et al. (2017; see also Baird et al., 2006) and calculated one intraindividual variability score across all personality states. To this end, we first calculated the within-person mean and within-person standard deviation for each of the nineteen adjectives. The standard deviations for each adjective were then regressed on each person's corresponding mean and the square of this mean. The residuals of these regressions, reflecting mean-corrected variability scores, were then averaged to obtain a score of daily personality variability. The nineteen standard deviation residuals were also used to calculate Cronbach's alpha for this variability measure. In addition to this overall measure of intraindividual personality variability, we also calculated an intraindividual variability score for each of the Big Five dimensions by aggregating the respective mean-corrected variability scores. For self-esteem, we followed the same approach as for the personality state items (without aggregating, as there was only one item). To calculate the reliability of self-esteem variability, we computed the corrected standard deviation for the first half and second half of each participant's daily reports (e.g., Day 1 to Day 40 and Day 41 to Day 80). The correlation between the two standard deviations was then used to compute the reliability score.
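The mean-correction step (regressing per-person SDs on the mean and the squared mean and keeping the residuals) can be sketched as follows for a single item (an illustrative pure-Python sketch with hypothetical data; the actual analyses were done in R, and the function name is ours):

```python
def quad_fit_residuals(sds, means):
    """Residualize participants' SDs on mean and mean^2 via ordinary
    least squares (3x3 normal equations, Gauss-Jordan elimination).
    sds, means: one entry per participant, for one item."""
    X = [[1.0, m, m * m] for m in means]
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(X, sds)) for i in range(3)]
    A = [row[:] + [b] for row, b in zip(xtx, xty)]  # augmented matrix
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(A[r][c]))  # partial pivot
        A[c], A[p] = A[p], A[c]
        for r in range(3):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][3] / A[i][i] for i in range(3)]
    pred = [beta[0] + beta[1] * m + beta[2] * m * m for m in means]
    return [s - p for s, p in zip(sds, pred)]

# Hypothetical per-person (mean, SD) pairs for one daily item
means = [2.0, 3.0, 3.5, 4.0, 4.5, 5.5]
sds = [0.8, 1.2, 1.3, 1.1, 0.9, 0.4]
resid = quad_fit_residuals(sds, means)
# The residuals sum to ~0; averaging each person's 19 per-item
# residuals then yields the overall personality variability score.
```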
We used the mean-corrected standard deviation in our analyses to stay as close as possible to the approach used in Baird et al. (2017). However, we also implemented two further approaches to examine (and ensure) that identical result patterns occur with different operationalizations of intraindividual variability. The first alternative approach consisted of aggregating the daily diary items per dimension and using the aggregate to compute a (mean-corrected) standard deviation. The second approach consisted of computing the relative variability index (see Mestdagh et al., 2018) instead of the mean-corrected standard deviation for each adjective. For this index, the individual's standard deviation for a variable is divided by the individual's maximum possible standard deviation given his or her mean, resulting in a measure of intraindividual variability that is less influenced by the mean than the mere standard deviation.
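The logic of the relative variability index can be sketched as follows (an illustrative Python sketch; for simplicity we use the population SD and the continuous upper bound sqrt((m - lo)(hi - m)), whereas the published index of Mestdagh et al. (2018) uses an exact bound for n discrete observations):

```python
import math

def relative_variability(scores, lo=1, hi=6):
    """Observed SD divided by the maximum SD attainable given the mean
    on a bounded [lo, hi] scale (simplified sketch, see lead-in)."""
    n = len(scores)
    m = sum(scores) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in scores) / n)
    bound = math.sqrt((m - lo) * (hi - m))
    return sd / bound if bound > 0 else float("nan")

# Same raw spread, different means (hypothetical daily ratings, 1-6 scale)
a = relative_variability([2, 3, 4, 3, 2])  # mean near the scale midpoint
b = relative_variability([4, 5, 6, 5, 4])  # mean near the scale ceiling
# b > a: an identical raw SD counts as more variability when the mean
# leaves less room on the scale to vary
```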
Self-Concept Clarity. As an explicit measure of intraindividual variability, participants were asked to complete the Self-Concept Clarity Scale (Campbell et al., 1996; German version: Stucke, 2002) at the second assessment point. This scale consists of twelve items, each answered on a scale ranging from 1 ("strongly disagree") to 5 ("strongly agree"; α = .90).

Outcome measures. For global self-esteem, participants answered the Rosenberg Self-Esteem Scale (Rosenberg, 1989; German version: von Collani & Herzberg, 2003), which consists of 10 items to be answered on a scale ranging from 1 ("strongly disagree") to 4 ("strongly agree"; α = .92). Furthermore, as a measure of life satisfaction, they were asked to fill out the Satisfaction with Life Scale (Diener et al., 1985; German version: Glaesmer et al., 2011). This scale consists of 5 items to be answered on a scale ranging from 1 ("strongly disagree") to 7 ("strongly agree"; α = .87). Additionally, participants answered a short version of the Positive and Negative Affect Schedule (Watson et al., 1988) containing 5 items measuring positive affect (α = .71; active, optimistic, determined, in a good mood, relaxed) and two items measuring negative affect (α = .57; inhibited, nervous). All seven items were answered on a scale ranging from 1 ("does not apply") to 6 ("applies perfectly").
Response style measures. Participants completed two measures of response styles: the Simpsons characters rating task and a neutral object rating task. Both measures were suggested by Baird et al. (2017), but only one (the neutral object rating task) was assessed in their daily diary study. In the Simpsons task, participants were asked to rate how well ten adjectives describe four characters from the Simpsons (Marge, Homer, Bart, and Lisa). The ten adjectives were energetic, trustful, dependable, vulnerable, philosophical, talkative, cooperative, organized, irritable, and intelligent, each rated on a scale ranging from 1 ("does not describe this person") to 5 ("describes this person very well"). To obtain the Simpsons variability measures, we proceeded exactly like Baird et al. (2017): For each participant, we first computed the mean and the standard deviation for each of the ten items across the four Simpsons characters. Each of the ten cross-character standard deviations was then regressed on the respective mean and square of the mean. Thereafter, the ten standard deviation residuals were averaged to create an overall measure of variability for the ratings of the four Simpsons characters. We also used the ten standard deviation residuals to compute Cronbach's alpha for this measure. The idea behind this measure is that, because participants are judging the same stimuli, there should be a similar amount of cross-character variability in the sample. Hence, a person's personality state or self-esteem variability should be unrelated to a variability measure concerning cartoon characters.
For the neutral object rating task, participants indicated their satisfaction with 25 items describing neutral objects (e.g., "the movies produced today" or "public transportation"). Each item was rated on a scale ranging from 1 ("very unsatisfied") to 5 ("very satisfied"). To obtain an index of intraindividual variability for these ratings, we again calculated the standard deviation and the mean of each person's ratings across the 25 objects. The standard deviations were then regressed on the mean and the squared mean to obtain the corrected variability score. As an index of reliability, we computed a split-half reliability by taking the average split-half reliability from 1000 random splits of the 25 items (see Baird et al., 2017, for a similar approach). The underlying idea behind this measure is the same as for the Simpsons rating task: As most of these objects are relatively neutral, they should be rated similarly by different persons, with the consequence that participants should exhibit a similar amount of cross-object variability in the absence of interindividual differences in response styles.
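The random-splits reliability idea can be sketched as follows (an illustrative Python sketch with simulated data; for brevity the per-half scores are plain SDs, without the mean-correction or a Spearman-Brown step, and all names and parameter values are our own):

```python
import random
import statistics as st

def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def split_half_reliability(ratings, n_splits=1000, seed=1):
    """Average split-half correlation of a variability score over random
    item splits. ratings: one list of item ratings per participant."""
    rng = random.Random(seed)
    n_items = len(ratings[0])
    rs = []
    for _ in range(n_splits):
        items = list(range(n_items))
        rng.shuffle(items)
        half1, half2 = items[:n_items // 2], items[n_items // 2:]
        sd1 = [st.pstdev([p[i] for i in half1]) for p in ratings]
        sd2 = [st.pstdev([p[i] for i in half2]) for p in ratings]
        rs.append(pearson(sd1, sd2))
    return st.mean(rs)

# Simulated neutral-object ratings: 40 participants x 25 objects, where
# each (hypothetical) participant has a characteristic response spread
sim = random.Random(0)
ratings = [[min(5, max(1, round(3 + sim.gauss(0, spread)))) for _ in range(25)]
           for spread in [sim.uniform(0.3, 1.5) for _ in range(40)]]
rel = split_half_reliability(ratings)  # clearly positive if spreads differ
```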

Sample
This sample included 104 participants. Again, we excluded those participants with missing values in the response style and outcome measures in the second survey, resulting in a final sample of 93 participants. Most participants were female (85%); their age ranged between 18 and 37 years (M = 23.27 years, SD = 3.63 years). The number of completed daily diaries ranged from 19 to 56 (M = 50.99, SD = 6.45, Mdn = 53.0). Again, participants received monetary compensation for their participation (about 40 Euro). The FLUX study was approved by the ethics committee of the German Research Foundation (SN 072018).

Procedure
The data collection phase lasted from September to December 2018. During this period, individuals participated in a measurement-burst experience sampling study consisting of four bursts of 14 days each (56 days in total). The first, second, and third bursts were each followed by a break of 14 days in which participants were not required to provide any ratings.
During each burst, participants were asked to rate their behavior on each day with regard to Big Five personality states, self-esteem, mood, and motivational states. Furthermore, prior to the first burst and after the last burst, they completed an online survey containing demographic measures, measures of personality traits (e.g., the Big Five personality traits), well-being (e.g., life satisfaction), and response styles. Here, we only use the daily diary measures for personality states, affect states (not assessed in FLIP), and self-esteem states and the outcome measures from the second assessment (after the daily diary portion) concerning self-esteem, positive and negative affect, life and job satisfaction, and loneliness. Response style measures were completed in both the initial and final surveys, and we use the data from both time points in the analyses.

Measures
Daily diary variability measures. Personality states were assessed with twelve adjectives. One adjective measured neuroticism (emotionally stable), four adjectives measured extraversion (assertive, unsociable, shy, outgoing), one openness (versatilely interested), four agreeableness (offending, hostile, compliant, sensitive), and one was an indicator of conscientiousness (reliable). Four adjectives were used to assess positive affect (active, happy, calm, relaxed) and three adjectives measured negative affect (annoyed, anxious, lonely). Finally, one item measured daily self-esteem ("I am satisfied with myself"). Similar to FLIP, participants were instructed to rate how well each adjective or item matched their behavioral experiences on that particular day ("Today I was…"). All ratings were made on scales ranging from 1 ("strongly disagree") to 6 ("strongly agree"). To obtain an overall measure of personality state variability, variabilities of the individual Big Five factors, positive and negative affect variability, and self-esteem variability, we followed the same statistical approach as described for the FLIP study. Please note that the personality state items and affect state items were presented in random order on each day and that the self-esteem item was assessed after the personality and affect states.
Self-Concept Clarity. Participants also completed the Self-Concept Clarity Scale as an explicit measure of intraindividual variability. In contrast to FLIP, the Self-Concept Clarity Scale was assessed prior to (α = .92) and after the daily diary phase of the study (α = .91).
Outcome measures. We used the same global self-esteem measure (α = .93) and the same life satisfaction measure (α = .90) as in the FLIP study. For positive and negative trait affect, participants answered the Positive and Negative Affect Schedule (Watson et al., 1988; German version: Krohne et al., 1996), which contains ten items measuring positive affect (α = .87) and ten items measuring negative affect (α = .88), each of which is rated on a scale ranging from 1 ("not at all") to 5 ("extremely"). We also assessed participants' job satisfaction with five items (adapted from Brayfield & Rothe, 1951), each rated on a scale ranging from 1 ("does not apply at all") to 5 ("applies quite a lot"; α = .91). Moreover, we assessed loneliness with four items (UCLA Loneliness Scale, Version 3; Russell, 1996; α = .88) to be answered on a scale ranging from 1 ("never") to 4 ("always").
Response style measures. In the FLUX study, we used the same two response style measures (i.e., the Simpsons task and the neutral object rating task) as in the FLIP study. We also used the same analytical approach to compute response tendency variability measures. The FLIP and FLUX studies differed only in the timing of the response style assessment: While the FLIP study measured response styles only after the daily diary phase, the FLUX study assessed response style measures both prior to and after the experience sampling phase of the study. This allowed us to also compute retest reliabilities for the two measures.

Baird et al. (2017) report an average correlation of r = .28 between the daily diary variability measures and the neutral object rating task measure (see their Table 2; the Simpsons task was administered in the daily diary study in Baird et al., 2017). When we assume a population correlation of ρ = .25, α = .05, and a one-tailed test, the power to detect such an effect is 80% when the sample size is N = 96 (FLIP) and 79% when the sample size is N = 93 (FLUX). Thus, the statistical power for investigating our research questions is good.
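Power values of this kind can be approximated with the standard Fisher z method (an illustrative Python sketch; `power_corr` is our own helper, and the paper does not specify which software was used for its power analysis):

```python
import math
from statistics import NormalDist

def power_corr(rho, n, alpha=0.05):
    """Approximate one-tailed power to detect a correlation of size rho
    with n participants, using the Fisher z transformation."""
    nd = NormalDist()
    z_rho = math.atanh(rho)         # Fisher z of the assumed correlation
    se = 1.0 / math.sqrt(n - 3)     # standard error of z for sample size n
    return nd.cdf(z_rho / se - nd.inv_cdf(1.0 - alpha))

power_flip = power_corr(0.25, 96)  # close to the reported 80%
power_flux = power_corr(0.25, 93)  # slightly lower, close to 79%
```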

Results
We provide the data and the R code necessary to reproduce the results in an OSF project (https://osf.io/tajd9/). In each of the following paragraphs, we first report the results concerning FLIP and then the results concerning FLUX. Furthermore, we follow standard conventions to interpret correlations in terms of effect sizes: a correlation of r = .10 is considered a weak or small effect, a correlation of r = .30 a moderate effect, and a correlation of r = .50 or larger a large effect. Finally, across the two studies, we found that the results and conclusions regarding the influence of response styles based on the two alternative approaches to operationalizing intraindividual variability were largely consistent with the conclusions based on the approach used in Baird et al. (2017). We therefore report the results of these two approaches only in Appendix 1 (standard deviation across an aggregate) and Appendix 2 (relative variability index), respectively.

Table 1 shows the correlations between the different variability measures in the FLIP sample. As can be seen in the table's diagonal, all four variability measures were sufficiently reliable. For the personality variability measure, this suggests that the nineteen variability scores measure something in common. Furthermore, and replicating earlier research (e.g., Baird et al., 2006, 2017), correlations between the Self-Concept Clarity Scale, as an explicit measure of variability, and the other variability measures were small. The only exceptions were the small to moderate correlations with the variability measure stemming from the neutral object ratings and the conscientiousness variability measure. The two response style variability measures were positively correlated. However, this small to moderate correlation was not significantly different from zero. Importantly, moderate to large correlations were found between the two response style measures and the variability measures of the substantive variables.
This indicates that response styles may play an important role in intraindividual variability.

Table 2 shows the correlations for FLUX. Again, all measures were sufficiently reliable in terms of internal consistency (i.e., Cronbach's alpha, split-half reliability). Furthermore, the Self-Concept Clarity Scale and the two response style measures were also substantially correlated across time, indicating good retest reliability. The internal consistency scores and the very large intercorrelations between the personality variability measures, affect variability measures, and self-esteem variability measure again show that the substantive variability measures are associated with one another. Furthermore, the correlations between the Self-Concept Clarity Scale and the other variability measures were again rather low, with the exception of the small to moderate correlations with the neutral object variability measure at the first and second assessment. Consistent with the results for FLIP, we observed moderate to large correlations between the two response style variability measures measured at Time 1 and Time 2 and the variability measures for the substantive variables, indicating again that response styles may play an important role in intraindividual variability. Finally, only one of the four correlations between the two response style variability measures was significant and moderate in size.

Reliability and Correlations between Variability Measures
Altogether the results show-see Table 3 for a meta-analytic aggregate of the corresponding correlations from Tables 1 and 2-that interindividual differences in response style variability could be reliably assessed and that these differences were moderately related to interindividual differences in intraindividual variability of different substantive variables. Furthermore, across the two studies, we found that the variability measures of the substantive variables were highly correlated. Finally, the correlation between the response style measures was only small to moderate.
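To make the computations behind these reliability estimates concrete, the following is a minimal sketch in Python (the study's own analyses were run in R, and the person counts and parameters below are simulated, purely illustrative values): it computes each person's within-person standard deviation across occasions as the variability score and estimates the split-half reliability of these scores with a Spearman-Brown correction.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate an experience-sampling data set: 200 persons x 40 occasions.
# Persons differ in their true intraindividual variability (the SD of
# their occasion-level scores around their person mean).
n_persons, n_occasions = 200, 40
true_sd = rng.uniform(0.5, 2.0, size=n_persons)   # person-specific variability
person_mean = rng.normal(0, 1, size=n_persons)
data = (person_mean[:, None]
        + rng.normal(0, 1, size=(n_persons, n_occasions)) * true_sd[:, None])

# Variability score: the within-person SD across occasions.
isd = data.std(axis=1, ddof=1)

# Split-half reliability: compute the SD separately from odd- and
# even-numbered occasions, correlate the two halves across persons,
# and apply the Spearman-Brown correction.
sd_odd = data[:, ::2].std(axis=1, ddof=1)
sd_even = data[:, 1::2].std(axis=1, ddof=1)
r_halves = np.corrcoef(sd_odd, sd_even)[0, 1]
reliability = 2 * r_halves / (1 + r_halves)

print(round(reliability, 2))
```

With simulated persons who genuinely differ in their true variability, the corrected split-half coefficient comes out high, mirroring the conclusion that interindividual differences in variability can be reliably assessed.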

Correlations and Partial Correlations between Variability and Outcome Measures
In the next step, we examined the correlations between the substantive variability measures, the two response style variability measures, and several outcome measures related to well-being. Table 4 shows the results we obtained for FLIP. The correlations between the response style and outcome measures were small to moderate and not significantly different from zero. Self-esteem was moderately negatively correlated with almost all variability measures for the substantive variables, indicating that individuals who vary less (i.e., are more stable) have higher self-esteem values. These correlations were still significantly different from zero when we controlled for the two response style measures (see the partial correlation column in Table 4). Life satisfaction was significantly associated with three measures of variability. After controlling for response styles, however, only one of these correlations was significantly different from zero and moderate in size (i.e., variability in conscientiousness). Finally, no significant associations were found for positive and negative affect.

Table 5 presents the correlations (and partial correlations) we obtained for FLUX. Similar to FLIP, almost all of the correlations between the response style measures and the outcome measures were small to moderate, but not significantly different from zero. In contrast to FLIP, none of the variability measures for the substantive variables were related to self-esteem. The exception was variability in conscientiousness, which was also moderately correlated with life satisfaction and negative affect. In addition, we found that variability in neuroticism, positive affect, and self-esteem was moderately associated with negative affect. All other correlations were small and not significantly different from zero.
When we controlled for the response style measures assessed at the first time point, most of the significant correlations remained significantly different from zero. (We also computed these partial correlations with the response style measures assessed at the second time point; the results were almost identical to those obtained with the first time point measures.)
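The partial correlations used throughout this step can be illustrated with a short sketch (simulated data and effect sizes, not the study's values): the linear influence of the two response style measures is removed from both the variability score and the outcome via regression residuals, and the residuals are then correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 130  # illustrative sample size

# Hypothetical person-level scores: two response style variability
# measures, a substantive variability measure that partly reflects
# response styles, and an outcome (e.g., self-esteem).
rs1 = rng.normal(size=n)
rs2 = rng.normal(size=n)
variability = 0.5 * rs1 + 0.3 * rs2 + rng.normal(size=n)
outcome = -0.4 * variability + rng.normal(size=n)

def partial_corr(x, y, covariates):
    """Correlation of x and y after removing the linear influence of
    the covariates from both variables via OLS residuals."""
    Z = np.column_stack([np.ones(len(x))] + covariates)
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

r_zero_order = np.corrcoef(variability, outcome)[0, 1]
r_partial = partial_corr(variability, outcome, [rs1, rs2])
print(round(r_zero_order, 2), round(r_partial, 2))
```

In this simulation the variability-outcome association survives the control for response styles, which parallels the pattern reported above for self-esteem in FLIP.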
To summarize, across the two studies we found small to moderate correlations between the response style measures and the outcome measures (see Table 6 for a meta-analytic aggregate of the corresponding correlations from Tables 4 and 5; each correlation was weighted by the inverse of its standard error, i.e., a fixed-effects meta-analysis). With regard to the associations between the variability measures for the substantive variables and the outcome measures, a mixed picture emerged. For some outcome variables, the correlations in FLUX were smaller but in the same direction as in FLIP (e.g., self-esteem; this explains why the aggregated correlations are significantly different from zero in Table 6), but for other variables the associations disappeared. However, in both samples the correlations did not substantially change when we controlled for response styles.
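As a rough illustration of the meta-analytic aggregation, the following sketch implements a fixed-effects average in which each correlation is weighted by the inverse of its standard error, as described in the table notes. The standard-error approximation and the input values are illustrative assumptions, not the article's numbers.

```python
import math

def se_r(r, n):
    # Common large-sample approximation of the standard error of a
    # Pearson correlation (an assumption; the article's exact formula
    # is documented in its R code on the OSF).
    return math.sqrt((1 - r**2) / (n - 2))

def fixed_effects_average(r1, n1, r2, n2):
    """Weighted average of two correlations, each weighted by the
    inverse of its standard error (fixed-effects aggregation)."""
    w1, w2 = 1 / se_r(r1, n1), 1 / se_r(r2, n2)
    return (w1 * r1 + w2 * r2) / (w1 + w2)

# Illustrative values: r = .35 in sample 1 (n = 130) and r = .25 in
# sample 2 (n = 180); the aggregate falls between the two, pulled
# toward the more precisely estimated correlation.
print(round(fixed_effects_average(0.35, 130, 0.25, 180), 3))
```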

Discussion
In a number of psychological disciplines, it is of great interest to investigate whether and why individuals fluctuate in their cognition, affect, personality, and behavior. To properly examine research questions concerning such intraindividual variability, it is important to use reliable and valid measures of intraindividual variability. Here, we provided a replication and extension of a study by Baird et al. (2017) and thereby investigated whether variability measures based on self-reports are affected by response styles, and in turn, whether it is important to control for response styles when studying intraindividual variability phenomena.
Replicating results by Baird et al. (2017), we found that there are reliable and stable interindividual differences in response style variability and that these differences were moderately related to interindividual differences in personality, affect, and self-esteem variability. Extending the results of Baird et al. (2017), response styles were not associated with the well-being outcome measures assessed here. As a consequence, most bivariate correlations between the substantive variables remained essentially unchanged when we controlled for these differences. However, a caveat with regard to these correlations between intraindividual variability indicators and outcomes is that most variability measures were either uncorrelated with the adjustment measures or the associations could not be replicated across the two studies. As such, we recommend further thorough replication efforts to substantiate our findings. Nevertheless, our results indicate that one should consider response styles when examining intraindividual variability, as is already the case when investigating aggregated responses.
There are several options for how to consider response styles. One possibility is to measure participants' response style values and control for these values in the analyses. Our results show that both the Simpsons task and the neutral object rating task are suitable measures for this purpose: Both measures can be easily and quickly collected before and/or after the experience sampling phase of a study. Furthermore, both measures are internally consistent and retest-reliable. Another possibility is to use a model-based approach in which response styles are directly considered in the statistical model. Such a model is based on the notion that, in addition to the corrected standard deviation or variance approach used here, the determinants and consequences of intraindividual variability can also be investigated with an extension of the multilevel growth model (with time points nested within individuals) in which the Level-1 residual variance is allowed to vary across individuals (see Geukes, Nestler, Hutteman, Dufner, et al., 2017; Hedeker et al., 2008, 2009; Nestler, 2020, 2021). In such models, it should be possible to consider response styles in the same way as in structural equation models or item-response models (see Henninger & Meiser, 2020). Up until now, however, we know of only one article that has proposed such a model (Deng et al., 2018). We consider the development of such approaches a fruitful and important avenue for future research, because they would allow researchers to investigate whether our and Baird et al.'s (2017) findings generalize beyond the statistical approach used to model response styles. Furthermore, these models could also help to better understand an unexpected result of our study, namely that the two response style variability measures were only moderately correlated, although they are assumed to measure the same thing.
In a model-based approach, one could determine the correlation of the two measures with the latent response style factor to examine, for example, whether the two variables capture different aspects of the latent response tendency variable.
A third possibility would be to use alternative response formats that are less affected by response styles. For example, a binary response format, with true/false response options, eliminates certain types of response style effects (Wetzel et al., 2016). Similarly, forced-choice response formats have been found to be less susceptible to response styles (see, e.g., Maydeu-Olivares & Brown, 2010; but see also Schulte et al., 2021). However, a problem with alternative response formats is that it is at present unclear how to operationalize intraindividual variability when using them: While the use of standard deviations could still be justified for a binary response format, standard deviations can no longer be used for paired comparisons. Nevertheless, we believe that this could be an interesting future research direction. Finally, another way to examine the consequences and determinants of intraindividual variability is to avoid the exclusive use of self-reports and to complement them with intraindividual variability measures based on informant reports. We are aware of only a few experience sampling studies (e.g., CONNECT; Geukes et al., 2019) in which other persons' reports were collected in addition to self-reports. However, we believe that this should be done more often, because it would allow researchers to examine a number of interesting research questions, also with respect to the technical implementation.

Overall, our results raise the question of how reliable previous results on the phenomenon of intraindividual variability are, since most studies did not control for differences in response styles. They also point to the need for more research on the measurement-theoretic properties of self-report measures of intraindividual variability as well as self-reports in experience sampling studies (see Sun et al., 2020; Sun & Vazire, 2019, for interesting approaches). In addition, we believe that an interesting task for future research is to investigate whether one can also capture such constructs and their variation with measures that are not based on self-reports. For instance, it might be possible to reliably and validly capture variability with third-party reports (at least for observable constructs) or relevant mobile sensor data. However researchers proceed, we hope that the study of the measurement-theoretic properties of variability measures will be pursued more vigorously in the future.

Data Accessibility Statement
We embrace the values of openness and transparency in science (Schönbrodt et al., 2015; osf.io/4dvkw/). We therefore follow the 21-word solution (Simmons et al., 2012) and refer to the project documentation in the OSF. We furthermore publish all raw data necessary to reproduce the reported results and provide scripts for all data analyses reported in this manuscript (see https://osf.io/tajd9/).

Competing Interests
All authors certify that they have no conflicts of interest in the subject matter or materials discussed in this manuscript.

Contributions
Steffen Nestler (SN) and Katharina Geukes (KG) made substantial contributions to the conception and design of the reported studies. SN was responsible for data collection. SN, Tobit Zaun (TZ), and Theresa Eckes (TE) analyzed the current data and jointly interpreted the results. SN drafted the article and KG, TZ, and TE revised it critically. SN, KG, TZ, and TE approved the final version of this manuscript to be published.

Appendix 1
In this appendix, we report the results for all analyses when using another approach to obtain the intraindividual variability scores for the substantive variables. This approach consisted of (a) aggregating the daily diary items per substantive dimension (e.g., the extraversion items) and (b) computing a (mean-corrected) standard deviation using this aggregate. Note that this approach can only be used when multiple indicators for a substantive dimension have been assessed; otherwise, it is identical to the approach whose results are reported in the main text.
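The two-step procedure can be sketched as follows (simulated data; the quadratic mean-correction via regression residuals follows the spirit of Baird et al., 2006, and is an assumption about the exact implementation, which is documented in the R code on the OSF):

```python
import numpy as np

rng = np.random.default_rng(7)
n_persons, n_days, n_items = 150, 30, 4

# Simulated daily-diary responses on a 1-5 scale for one substantive
# dimension (e.g., four items assessing the same Big Five trait per day).
raw = np.clip(rng.normal(3, 1, size=(n_persons, n_days, n_items)), 1, 5)

# (a) Aggregate the items per day, then (b) take the within-person SD
# of the daily aggregate.
daily_mean = raw.mean(axis=2)
sd_of_aggregate = daily_mean.std(axis=1, ddof=1)

# Mean-correction: regress the within-person SD on the within-person
# mean (and its square) and keep the residuals, so that the variability
# score is no longer confounded with the person's mean level.
m = daily_mean.mean(axis=1)
X = np.column_stack([np.ones(n_persons), m, m**2])
beta, *_ = np.linalg.lstsq(X, sd_of_aggregate, rcond=None)
corrected = sd_of_aggregate - X @ beta

# The corrected scores are, by construction, uncorrelated with the mean.
print(round(float(np.corrcoef(corrected, m)[0, 1]), 4))
```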
In FLIP, multiple indicators were available for each of the Big Five dimensions. For each dimension, the variability scores based on the average of the mean-corrected standard deviations and the variability scores based on the mean-corrected standard deviation of the average were significantly correlated (neuroticism: r = 0.71, p < .01; extraversion: r = 0.64, p < .01; openness: r = 0.93, p < .01; agreeableness: r = 0.80, p < .01; conscientiousness: r = 0.75, p < .01). Table A1a presents the correlations (and partial correlations) we obtained for the five measures.
In FLUX, multiple indicators were only available for the Big Five dimensions of extraversion and agreeableness and for positive and negative affect. Again, the variability scores based on the average of the mean-corrected standard deviations and the variability scores based on the mean-corrected standard deviation of the average were significantly correlated (extraversion: r = 0.71, p < .01; agreeableness: r = 0.78, p < .01; positive affect: r = 0.93, p < .01; negative affect: r = 0.90, p < .01). Table A1b presents the correlations (and partial correlations) we obtained for these measures.

Appendix 2
In this appendix, we report the results for all analyses when we use the relative variability index of Mestdagh et al. (2018) to obtain the intraindividual variability scores for the substantive variables.
Furthermore, Table A2d presents the correlations (and partial correlations) of the response style measures and the intraindividual variability measures with the outcome measures.