While personality research has repeatedly shown that self-, meta- and other-ratings of personality are correlated to a considerable degree, relatively little is known about how item features determine agreement between different perspectives, especially beyond observability and social desirability and at the level of personality nuances. The present study examines item features as predictors of different forms of agreement using two datasets (Study 1: targets N = 73, informants N = 549; Study 2: targets N = 189, informants N = 1352). Both studies used one-with-many designs in which each target was matched with several informants who differed in their attitudes toward the target. This matching was accomplished based on an initial round-robin assessment in larger groups. Targets then provided self- and meta-ratings, and informants provided other-ratings of personality on broad sets of items as markers of personality nuances. Items had been reliably rated with regard to five features: observability, social desirability, importance, stability and base rate. In both studies, we consistently found strong effects of item features on different types of inter-rater agreement. Inter-rater agreement was much higher for items referring to more observable target characteristics, and lower for more evaluative and (partially) for more important items. The important role of item features in future person perception research is discussed.
The vast majority of personality research uses items phrased in natural language. A basic underlying assumption is that descriptions of target persons by means of such items tend to capture something real or “substantial” about those persons. However, the same substantive target characteristic may be described using different person-descriptive terms, and these terms differ in a variety of important respects (e.g., the extent to which they describe observable behavior). Research has shown that these differences (called “item features” in the following) may be reliably rated by laypersons and that they are associated with the extent to which different perceptions of the same target (i.e., self-, other- and meta-perceptions) agree with one another. Understanding how different item features are associated with different forms of inter-rater agreement is important for two reasons. First, personality researchers often use inter-rater agreement as a proxy for judgmental accuracy. Second, regardless of whether judgments are accurate or not, impressions of a person necessarily have consequences for the person judged: they shape interactions with others, the target’s surroundings and, ultimately, society as a whole.
Most previous research has focused on one or two such features (mostly observability and social desirability) and their effect on self-other agreement and/or other-other agreement (e.g., Funder & Dobroth, 1987; John & Robins, 1993). The present series of studies aims to broaden the scope by examining the (joint and unique) effects not only of observability and social desirability, but also of three additional item features that have been investigated in at least some previous research: importance, stability and base rate (e.g., Edwards, 1953, 1957; John & Robins, 1993; Leising et al., 2014). Also, besides self-other agreement and other-other agreement, we explore possible associations between the five item features and meta-accuracy (i.e., agreement between what targets assume others think about them [meta-perception] and what others actually think about them) and meta-insight (i.e., the extent to which targets are aware that others see them differently from how the targets view themselves) (Carlson et al., 2011; Carlson & Kenny, 2012; Gallrein et al., 2013).
In addition, most previous research used scales in which these person-descriptive terms were aggregated, and inter-rater agreement was examined at the level of facets and traits. For example, the highest agreement was typically found for Extraversion, which is high in observability and low in evaluativeness, while lower agreement was typically found for Agreeableness, which is highly evaluative (e.g., Connelly & Ones, 2010; Funder & Dobroth, 1987; Vazire, 2010). However, when examining facets and traits, a lot of information about differences between items is lost, and it becomes impossible to examine the effects of item features on inter-rater agreement at the level of items – as markers of personality “nuances” (McCrae, 2015; Mõttus et al., 2017). Recent research has shown, however, that these personality nuances matter and that they are sometimes even more powerful predictors of life outcomes than facets or traits (Stewart et al., 2022; Wessels et al., 2021). In the present paper, we therefore investigate item features and their effects at the level of nuances in two distinct studies (the second being a replication of the first), in order to gain a more differentiated understanding of the role of differences between items in inter-rater agreement.
Form of Agreement | Definition | Computation |
Self-Other Agreement | Agreement between a target’s self-perception and perceptions of that target by (several) others | Pearson correlation between self- and other-perceptions |
Other-Other Agreement (or Consensus) | Agreement between perceptions of a target by several perceivers | Intraclass correlation coefficient for single measures (ICC[1,1]) using other-perceptions |
Meta-Accuracy | Agreement between what targets assume others think about them (i.e., meta-perception) and what others actually think about them (i.e., other-perception) | Pearson correlation between meta- and other-perceptions |
Meta-Insight | The extent to which targets are aware that others see them differently from how the targets view themselves | Semipartial correlation based on a regression model predicting other-perceptions from meta-perceptions while controlling for self-perceptions |
Note. All four coefficients of agreement were transformed using Fisher’s r-to-z transformation.
Different Forms of Inter-Rater Agreement
The present series of studies investigates four distinct forms of inter-rater agreement (self-other agreement, other-other agreement, meta-accuracy and meta-insight; see the left half of Table 1 for an overview) based on assessments from three different perspectives: self-perceptions, meta-perceptions, and other-perceptions. Previous research has established that self-other agreement and other-other agreement (consensus) are empirically related, showing similar variations in agreement levels across different trait domains (Funder & Colvin, 1988; Funder & Dobroth, 1987; Funder & West, 1993). While other-other agreement is consistently higher than self-other agreement, the two forms of agreement can be predicted by similar patterns of factors (Kenny & West, 2010).
The present study broadens the scope to meta-perceptions. Meta-accuracy refers to the extent to which individuals understand how others perceive them, effectively measuring the accuracy of meta-perceptions (Carlson & Kenny, 2012). Whereas dyadic meta-accuracy refers to the ability to know how we are differentially seen by others, generalized meta-accuracy (also referred to as individual accuracy) is defined as the ability to understand how we are generally seen by others (Kenny, 1988). In the present paper, we focus on the latter form of meta-accuracy.
Given the strong correlations between self- and meta-perceptions, it can be posited that meta-accuracy largely reflects self-other agreement. This hypothesis is supported by evidence that meta-perceptions are primarily based on self-perceptions rather than external feedback (Kenny & DePaulo, 1993). However, meta-accuracy is generally found to be stronger than self-other agreement (Carlson et al., 2011; Gallrein et al., 2013), suggesting that people tend to have relatively accurate impressions of how they are viewed by others across various traits and contexts (Vazire & Carlson, 2010). In line with this, more recent studies have shown that individuals indeed possess a distinct understanding of their reputation: when self-perception is statistically controlled for, the association between meta-perceptions and how individuals are perceived by others remains significant (Carlson et al., 2011; Carlson & Kenny, 2012; Gallrein et al., 2013). This form of self-knowledge is referred to as meta-insight.
Item Feature | Definition | Lowest-Rated Item* (Study 1) | Lowest-Rated Item* (Study 2) | Highest-Rated Item* (Study 1) | Highest-Rated Item* (Study 2)
Social Desirability | The extent to which using an item to describe a target expresses a more positive or negative attitude toward that target | Spiteful (1.32) | Domineering (1.70) | Cordial (9.32) | Empathetic (9.40) |
Observability | The extent to which the respective target characteristic that an item refers to may be observed from the outside | Feels useless at times (2.84) | Feels useless at times (3.07) | Communicative (8.90) | Communicative (8.73) |
Importance | The extent to which an item describes a trait that perceivers think is important to know about | Obedient (3.87) | Unimaginative (3.80) | Spiteful (8.81) | Helpful (8.27) |
Stability | The extent to which an item describes a person characteristic that is stable over time (trait) or varies across different situations (state) | Ignorant (3.74) | Feels useless at times (3.90) | Responsible (8.65) | Clever (8.73) |
Base Rate | The number of people to whom an item is thought to apply | Spiteful (2.74) | All in all, is inclined to think that they are a failure (3.40) | Communicative (6.94) | Vulnerable (6.73) |
Note. * English translation of German originals. Original German items can be retrieved from our OSF page (https://osf.io/fxk7c/).
Mean item ratings across 31 raters (Study 1) and 30 raters (Study 2) are reported in parentheses. Item features were rated on a 1-10 response scale. For the statistical analyses, mean item ratings were normalized to a scale from -1 to 1.
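For illustration, a minimal R sketch of this normalization step, assuming a simple linear rescaling of the 1-10 scale (one plausible reading of the procedure; the function name is ours, not from the original analysis scripts):

```r
# Linearly rescale mean item-feature ratings from the 1-10 response scale
# to [-1, 1]: the scale midpoint (5.5) maps to 0, and the endpoints
# 1 and 10 map to -1 and +1, respectively.
normalize_feature <- function(x) (x - 5.5) / 4.5

normalize_feature(c(1, 5.5, 10))  # returns -1, 0, 1
```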
Previous Research on Item Features and their Effects on Inter-Rater Agreement
We focus on how these different forms of inter-rater agreement are associated with systematic differences between items (i.e., item features). In the following, we briefly review previous research on the five item features (see the left half of Table 2 for an overview) and present some theoretical speculations as to how each feature may be associated with inter-rater agreement. For each item feature, we systematically explore both possible linear and curvilinear effects. Given that there is little theoretical reason to expect such effects to differ between the four types of inter-rater agreement, we abstain from formulating specific hypotheses in this regard. Hence, we expect our predictors to show a rather consistent pattern of effects across the different forms of inter-rater agreement, even though the effects might differ in magnitude (Kenny & West, 2010).
Observability. Observability – sometimes also called visibility – is one of the most extensively examined item features. It describes differences between items in how well the respective target characteristic that the item refers to may be observed from the outside. Funder and Dobroth (1987) showed that items referring to target characteristics that were judged as more observable were associated with higher self-other and other-other agreement. This makes sense, of course, because more observable cues provide an information base that is more widely shared among different perceivers. Paunonen (1989) later provided evidence that observability is particularly important for inter-rater agreement in the early stages of getting to know a target. This may be because other factors influencing agreement play a smaller role at this early stage of the acquaintance process. For instance, there is likely a more normative set of situations for initial interactions, and fewer opportunities for direct communication with the target, compared to later stages of acquaintanceship. The positive association between observability and self-other and other-other agreement has been replicated numerous times (e.g., Funder & Colvin, 1988; John & Robins, 1993; Kenny & West, 2010), and more recent research has also extended these findings to meta-accuracy and meta-insight (Carlson & Kenny, 2012; Elsaadawy & Carlson, 2022; Gallrein et al., 2013). It is thus reasonable to assume that people are able to adopt an observer’s perspective and thereby realize that another person only has access to more observable trait indicators (Gallrein et al., 2013). In the present series of studies, we expect to replicate the positive effect of observability on all forms of inter-rater agreement. In addition, we explore possible curvilinear effects of observability that, to our knowledge, have not been considered before.
Social Desirability. The item feature “social desirability” (or “favorability”) is typically defined as a term’s positive or negative tone, that is, the extent to which using the term to describe a target expresses a more positive or negative attitude toward that target (Edwards, 1953, 1957). Items’ social desirability can be transformed into item evaluativeness, which describes the degree to which an item is evaluatively extreme or neutral (e.g., John & Robins, 1993). Research has now firmly established that (a) most person-descriptive items are indeed strongly evaluative (e.g., Anderson, 1968; Leising et al., 2012), (b) item desirability very closely reflects the attitude that a perceiver who endorses the item has toward the respective target (e.g., Leising et al., 2015, 2021), and (c) item desirability correlates almost perfectly with items’ loadings on the first, highly evaluative factor in most factor analyses (which is why this factor may well be interpreted as reflecting perceiver attitudes, too) (e.g., Anusic et al., 2009; Biderman et al., 2019; Furr & Funder, 1998; Judge et al., 2002).
While early studies focusing on item desirability had found positive associations with inter-rater agreement (Funder & Colvin, 1988; Funder & Dobroth, 1987), John and Robins (1993) demonstrated that items with more evaluative (i.e., more desirable or undesirable) content were associated with lower inter-rater agreement (which is equivalent to an inverted U-shaped effect of social desirability). Theoretically, this makes sense because, as opposed to more neutral items, more evaluative items should reflect more of the respective perceivers’ attitudes, which should lead to more disagreement the less these attitudes are shared (Leising et al., 2015). While a meta-analysis by Kenny and West (2010) could not corroborate the effect of evaluativeness on self-other or other-other agreement in larger models with multiple predictors, there was at least a bivariate negative association between evaluativeness and self-other agreement (r = -.35) (D. A. Kenny, personal communication with Richard Rau, April 15, 2024), which was also shown more recently by de Vries et al. (2016). Thus, we expect to replicate the inverted U-shaped effect of social desirability on different forms of inter-rater agreement.
Base Rate. Base rate is usually defined as the “number of people to which a term is thought to apply” (Leising et al., 2014, p. 2). This concept is arguably similar to what Wood and Furr (2016) refer to as “normativity,” and we explore and discuss this connection in relation to the findings of the present paper. When applied to items, base rate simply means average endorsement rates. Edwards (1953) showed that average endorsement in self-ratings correlates very highly (r = .87) with ratings of item desirability. This finding has been replicated numerous times (e.g., Wood & Furr, 2016), including for other-ratings. One explanation for this finding is that the average participant has a relatively positive attitude toward their respective target. This would suggest making the same predictions for the association between base rates and inter-rater agreement as for the association between social desirability and inter-rater agreement.
The same prediction can be made even if base rate is interpreted in a slightly different way, namely as a substantive target characteristic. Base rate then means the average target’s level on that characteristic (i.e., the extent to which the average person actually shows the behaviors that are associated with the term). Gosling et al. (1998) found base rate (which they defined as the average frequency of occurrence of individual behavioral acts) to be positively associated with observer-observer agreement and self-observer agreement. They argued that low base rates should be associated with lower correlations between raters, as behaviors with low base rates are less likely to be observed, implying smaller variances across targets, and variance restriction leads to lower correlations. The exact same reasoning would apply, however, to items with very high base rates. Unfortunately, Gosling et al. (1998) did not report the base rates of the individual items in their study, so it remains unclear whether, for instance, they only found positive relationships between base rate and agreement because their item sample did not comprise any items with very high base rates. Here, we assume that very high or very low base rates will be associated with lower between-target variance. Therefore, in the present study, we expect to find a negative quadratic relationship between base rate and different forms of inter-rater agreement.
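To illustrate the variance-restriction argument with a toy example (our own simulation sketch, not part of the original analyses), restricting the range of the underlying characteristic attenuates the correlation between two error-prone ratings of it:

```r
set.seed(1)
n     <- 10000
trait <- rnorm(n)          # targets' true standing on some characteristic
self  <- trait + rnorm(n)  # self-ratings = truth plus error
other <- trait + rnorm(n)  # other-ratings = truth plus error

cor(self, other)  # full-range agreement, approx. .50

# Mimic a very low base-rate item: only the bottom 10% of targets
# show the behavior, so between-target variance is restricted
low <- trait < quantile(trait, .10)
cor(self[low], other[low])  # markedly lower agreement
```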
Importance. “Importance” is typically defined as the degree to which an item describes a trait that perceivers think is important to know about (Wood, 2015). Despite its rather obvious relevance for person judgments, we are not aware of any study that has examined the effect of importance on inter-rater agreement. One possible understanding of this definition is that items rated higher in importance capture substantive target characteristics that perceivers would be most interested to know about. For example, the behaviors referenced by the term “serial killer” might be of more urgent relevance to the average perceiver than the behaviors referenced by the term “musical”. If this is what importance is about, then we might expect importance to have a positive linear effect on inter-rater agreement, because perceivers should be motivated to pay closer attention to target characteristics that matter more to them.
Another possible understanding is that items rated higher in importance may express more of the respective judge’s attitude toward the respective target. Here, the target’s actual substantive characteristics would not matter as much as the fact that the judge obviously has reason (not) to be fond of the target: The judge implicitly marks the target as more of a friend or more of a foe of themselves, and this is what makes the description more relevant to others. This would basically amount to equating ratings of importance with item evaluativeness. In fact, both Leising et al. (2014) and Wood (2015) found strong positive relations between importance and evaluativeness (r = .63 and rs ≥ .43, respectively), using different sets of person-descriptive adjectives from the German and English languages. As evaluativeness was shown to have a negative effect on inter-rater agreement (see above), we might find similar effects for importance. However, if we partial out social desirability, we might find the positive effect that was assumed above.
Stability. Ever since the inception of modern-day personality psychology, it has been acknowledged that items differ in how stable the substantive target characteristics that they refer to are over time. For example, a person may be appropriately described as “excited” one minute but not the next, whereas we would expect far less variation over time in descriptions of how “gifted” the person is. Whereas the pioneers of psycholexical research expressly focused on highly stable target characteristics, more recent research has begun considering traitness (vs. stateness) as a continuous variable again (e.g., Leising et al., 2014). Hence, “stability”¹ may be defined as the extent to which an item describes a person characteristic that is stable over time (trait) or varies across different situations (state). To our knowledge, despite its obvious relevance for person perception, no study so far has analyzed the influence of this item feature on inter-rater agreement. It seems plausible to assume that descriptions using items that are rated higher in stability should produce higher inter-rater agreement, because different perceivers would base their judgments of such items on more similar observations (Leising & Schilling, 2021).
Distributions of and Correlations Among the Five Item Features
Previous research usually examined possible associations between individual item features and inter-rater agreement separately. However, research has also shown that not all of these item features are statistically independent of one another (Leising et al., 2014; Table 2): When holding all other item features constant, Leising et al. (2014) found significant partial associations of social desirability with base rate (r = .47) and stability (r = .41). Thus, more positive terms are applied to more targets and refer to more stable target characteristics. Furthermore, importance had weak partial positive associations with stability (r = .11) and with base rate (r = .10), and a weak partial negative association with observability (r = -.10). In order to understand their unique contributions to inter-rater agreement when controlling for the respective other variables, it is therefore important to also examine the item features in joint models. This is what we do in the present series of studies.
Overview of the Current Study
The present study addresses several limitations of previous research. First, for some of the above-named item features (i.e., stability and importance), we are not aware of any empirical studies investigating their effects on agreement between judgments. In the present study, we include all of these item features as predictors of inter-rater agreement. Second, previous studies tended to focus on a single item feature or a small number of item features, thus ignoring the possible overlap and redundancy among features. In the present study, we investigate the effects of each feature separately, but we also compare their relative influences directly with one another in multivariate analyses. Third, most previous studies used samples of informants (acquaintances) that were recruited by the targets themselves. Research has shown that such informant samples tend to be fairly homogeneous in terms of how well the informants know the targets (i.e., very well) and in terms of how much the informants like the targets (i.e., very much) (Leising et al., 2010). Such variance restriction at the dyadic level may make it less likely to find any systematic influences of item features. In the present study, we use samples of informants that were recruited with the explicit goal of more broadly representing the full spectrum of dyadic variance (see Methods section for details). Fourth, previous studies did not systematically investigate linear and curvilinear effects of the item features they focused on. In the present study, we do so for all five item features in order to gain a more complete picture of the associations of the different item features and different forms of agreement. Fifth, most previous studies focused on only one or two forms of agreement between judgments, usually self-other agreement and/or other-other agreement. In the present study, we investigate these as well, but we also investigate meta-accuracy and meta-insight. Since the hypothesized explanations for the associations between the item features and the different forms of agreement are not specific to any particular perspective (self-, other-, or meta-perception), we do not expect to find any remarkable differences between the associations of the different forms of agreement with the respective item features.
The present study should be considered largely exploratory in nature, as we did not pre-register any specific hypotheses. We investigate the (joint and unique) effects of five item features on the four types of agreement between perspectives. In doing so, we attempt to replicate the major effects that were reported in previous studies, but we also test whether some of the more intuitively plausible expectations outlined above could be corroborated.
Methods
The first study aimed to explore the associations between the item features and the different forms of agreement, and to establish an optimal statistical modeling approach. Study 2 aimed to replicate the findings from Study 1. In Study 2, we applied the exact same analyses from Study 1 to a second, substantially larger sample with a very similar design. Self-, meta- and other-perception data from Study 1 were previously used in Gallrein et al. (2016), and self- and other-perception data from the same study were published in Wessels, Zimmermann, Biesanz, and Leising (2020). Self- and other-perception data from Study 2 were also published in the latter, as well as in Wessels, Zimmermann, and Leising (2021). Meta-perception data from Study 2 and item feature data from both studies (except for social desirability) have not been used before. In the following, we only report those measures from the larger dataset that were relevant to the scope of the present analyses. We report all analyses that were conducted with regard to the research questions in these two studies.²
Recruitment
In both studies, participants were recruited from university seminars and groups of student representatives at German universities. Based on a first sociometric questionnaire conducted in a round-robin format, targets and so-called “group informants” were selected from these groups with an algorithm aimed at maximizing liking (and knowing) variances among informants. Targets were asked to recruit additional informants from their own social networks (i.e., target-nominated informants). In a second questionnaire, which was designed in a one-with-many fashion (Kenny et al., 2006), targets provided self- and meta-perceptions, and several informants per target provided other-perceptions of their target’s personality. More detailed information on this multi-step procedure can be found in the Methods section of Wessels, Zimmermann, Biesanz, and Leising (2020). In the present study, “other-perceptions” refer to both group informants’ and target-nominated informants’ ratings of their targets. Thus, we do not differentiate between the two types of informants.
Samples
Self-, meta- and other-perception data in Study 1 are based on 73 targets (female = 38, 4 failed to report sex; age: M = 23.19, SD = 3.45), 403 group informants (GI; female = 272, 4 failed to report sex; age: M = 23.54, SD = 4.27), and 146 target-nominated informants (TNI; female = 78, 9 failed to report sex; age: M = 30.14, SD = 12.73). On average, there were 5.52 group informants per target (range: 3-6) and 2.61 target-nominated informants per target who had named at least one TNI (range: 1-3).
Personality perceptions in Study 2 are based on 189 targets (female = 124; age: M = 22.49, SD = 3.48), 943 group informants (GI; female = 643; age: M = 22.62, SD = 3.16), and 409 target-nominated informants (TNI; female = 264; age: M = 30.01, SD = 13.17). On average, there were 4.99 group informants per target (range: 2-6) and 2.59 target-nominated informants per target who had nominated at least one informant (range: 1-3). More information on data inclusion and exclusion criteria can be obtained from the Methods section of Wessels, Zimmermann, Biesanz, and Leising (2020).
Measures
Personality Ratings
In both studies, we determined the different forms of agreements based on self-, meta- and other-perceptions of personality assessed with broad sets of 107 items (Study 1) and 111 items (Study 2), respectively (see https://osf.io/fxk7c/). The response scale for all items ranged from 1 (does not apply at all) to 5 (applies exactly).
Item sets in both studies comprised a set of 30 German adjectives (e.g., “clever”) by Borkenau and Ostendorf (1998) to measure the Big Five personality traits, a 16-item short version (two per scale, e.g., “shy”) of the Interpersonal Adjective List (IAL; Jacobs & Scholl, 2005), a 10-item Rosenberg Self-Esteem Scale (RSE; e.g., “I feel that I have a number of good qualities”; Rosenberg, 1979), 25 self-developed items on maladaptive personality traits (e.g., “Is constantly worried about almost everything”; slightly revised in Study 2), and a broader item to measure a more self-critical stance (i.e., “Is self-critical”).
In addition, Study 1 items included a relatively broad item capturing a more positive stance towards themselves (i.e., “Likes him- or herself very much”), the German 10-item short version (e.g., “Tends to criticize others”; Rammstedt & John, 2007) of the Big-Five Inventory (BFI; John et al., 1991), and fourteen adjectives covering additional characteristics (e.g., “handsome”) from a larger sample of terms collected by Leising et al. (2012).
In addition to the above-mentioned measures, Study 2 items comprised another 25 self-developed items on maladaptive personality traits (e.g., “Prefers to be alone rather than being with others”), the Single Item Self-Esteem Scale (SISE, Robins et al., 2001), the Single Item Narcissism Scale (SINS, Konrath et al., 2014), and the two-item Patient Health Questionnaire (e.g., “Little interest or pleasure in doing things”) as a measure of depression (PHQ-2; Löwe et al., 2005).
We randomized the presentation order of these measures as well as the order of the items within each measure.
Ratings of Item Features
In order to be able to determine the effect of item features on the different forms of agreement between perspectives, the items had to be independently rated with regard to these features. All 107 items from Study 1 were rated on all five features by the same thirty-one raters (female = 16, 2 failed to report sex) between the ages of 20 and 58 (M = 29.17, SD = 9.66), who were recruited from the personal social network of the first author. All 111 items from Study 2 were independently rated regarding the five item features by another group of thirty raters (female = 18, 3 failed to report sex) between the ages of 20 and 37 (M = 25.90, SD = 4.14), who were recruited via public advertisement and received 15 euros for their participation.
The items were rated on a 10-point scale regarding the five dimensions: Observability (i.e., how well can the respective characteristic be observed from the outside?), ranging from 1 = “not observable at all” to 10 = “very well observable”; Social Desirability (i.e., to what extent does a description of a person with the respective item imply a positive or negative evaluation?), ranging from 1 = “very negative” to 10 = “very positive”; Importance (i.e., how important is it for a perceiver to know whether another person possesses the respective characteristic?), ranging from 1 = “completely unimportant” to 10 = “completely important”; Stability (i.e., to what extent does the respective item describe a changeable, situational or rather a more stable characteristic?), ranging from 1 = “state/very changeable/situational” to 10 = “trait/very stable/situation-independent”; and Base Rate (i.e., how many people may be characterized in terms of this item?), ranging from 1 = “applies to very few” to 10 = “applies to very many”. The order in which the five dimensions were presented to the raters was balanced. We also determined evaluativeness scores for each item (i.e., the extent to which an item implies an evaluatively extreme – that is, positive or negative – description) by computing the absolute difference between the item’s average desirability rating (across raters) and the midpoint for the rating scale (5.50) (John & Robins, 1993).
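As a concrete illustration of the evaluativeness computation, consider the following R sketch (the first and third desirability means are taken from Table 2, Study 1; the middle value is hypothetical):

```r
# Mean desirability ratings (across raters) on the 1-10 scale
mean_desirability <- c(spiteful = 1.32, communicative = 7.00, cordial = 9.32)

# Evaluativeness = absolute deviation from the scale midpoint (5.50)
evaluativeness <- abs(mean_desirability - 5.50)
evaluativeness  # spiteful 4.18, communicative 1.50, cordial 3.82
```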
Statistical Analyses
As we aimed to replicate the findings from Study 1 in Study 2, we used the exact same statistical analyses in both studies. Participant data and analysis scripts needed to reproduce the analyses can be found on this paper’s project page on the Open Science Framework (https://osf.io/fxk7c/). In particular, we first computed inter-rater agreement for each item and then predicted item differences in inter-rater agreement by five different item features. For each form of inter-rater agreement, we performed separate analyses (see the right half of Table 1 for an overview). Self-other agreement was computed as the Pearson correlation between self- and other-perceptions, meta-accuracy was defined as the correlation between meta- and other-perceptions, and meta-insight (i.e., the extent to which targets are aware that others see them differently from how the targets view themselves) was computed as a semipartial correlation based on a regression model predicting other-perceptions from meta-perceptions while controlling for self-perceptions (Carlson et al., 2011; Carlson & Kenny, 2012; Gallrein et al., 2013). Other-other agreement was defined as the intraclass correlation coefficient for single measures (ICC[1,1]) using other-perceptions. While the first three coefficients of agreement were computed based on pairwise complete data, the latter was estimated based on a multilevel approach ensuring the inclusion of all available data points. All analyses were conducted at the level of perceivers/dyads (i.e., the responses of the targets were included multiple times in the analyses). All four coefficients of agreement were transformed using Fisher’s r-to-z transformation.
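The following R sketch illustrates how the four coefficients might be computed for a single item, under assumptions about the data layout (one row per target-informant dyad, with columns target_id, self, meta, and other); it simplifies to complete cases and is not the authors’ actual script:

```r
library(lme4)  # random-intercept model for the multilevel ICC

agreement_for_item <- function(d) {
  d <- na.omit(d)  # sketch only; the paper uses pairwise/multilevel handling

  soa <- cor(d$self, d$other)  # self-other agreement
  ma  <- cor(d$meta, d$other)  # meta-accuracy

  # Meta-insight: semipartial correlation, i.e., the part of the
  # meta-perceptions that is independent of self-perceptions,
  # correlated with other-perceptions
  mi <- cor(resid(lm(meta ~ self, data = d)), d$other)

  # Other-other agreement: ICC(1,1), estimated as between-target
  # variance over total variance of other-perceptions
  m   <- lmer(other ~ 1 + (1 | target_id), data = d)
  v   <- as.data.frame(VarCorr(m))$vcov
  ooa <- v[1] / sum(v)

  atanh(c(soa = soa, ooa = ooa, ma = ma, mi = mi))  # Fisher r-to-z
}
```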
To test the effects of items’ social desirability, importance, observability, base rate and stability on agreement, we conducted multiple regression analyses at the level of items, predicting agreement by each of the five item features separately, and in a joint model by all five item features at once. This resulted in a total of 6 (five item features plus the joint model) × 4 (forms of agreement) = 24 multiple regression analyses. Item features were normalized on a scale from -1 to 1. We also included quadratic terms for each item feature to account for (possible) curvilinear relationships.
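A minimal sketch of one such item-level regression, here the joint model predicting Fisher-z self-other agreement (column names are assumptions; the quadratic terms capture the curvilinear effects):

```r
# item_data: one row per item, with the Fisher-z agreement coefficient
# (z_soa) and the five normalized item features as columns
joint_model <- lm(
  z_soa ~ des + I(des^2) + obs + I(obs^2) + imp + I(imp^2) +
    st + I(st^2) + br + I(br^2),
  data = item_data
)
summary(joint_model)$adj.r.squared  # adjusted R², as reported in Tables 4-5
```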
Note that conventional standard errors are not suitable in this approach, because (a) perceivers are nested within targets (which may lead to biased standard errors for coefficients of agreement), and (b) coefficients of agreement (which represent the outcome variable in the second step of the analyses) are associated with sampling error (which is ignored in standard regression analyses). To address these issues, we used the bootstrap method. Specifically, we drew 5,000 resamples (with replacement) from the sample of targets, repeated the entire data analysis procedure within each resample, and used the resulting parameter distributions to derive 95% percentile confidence intervals for (a) itemwise coefficients of agreement and (b) unstandardized regression coefficients and coefficients of determination (adjusted R²) of the multiple regression analyses predicting coefficients of agreement. This approach is equivalent to a nonparametric complete level-2 unit bootstrap, which is recommended for nested data structures such as those in our study (Goldstein, 2011).
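Schematically, the level-2 unit bootstrap might look as follows in R (a sketch; compute_item_regression() is a hypothetical wrapper for the two-step pipeline of computing itemwise agreement and regressing it on the item features):

```r
set.seed(42)
target_ids <- unique(dyad_data$target_id)

boot_coefs <- replicate(5000, {
  # Resample whole targets (level-2 units) with replacement,
  # keeping each resampled target's complete set of informants
  ids <- sample(target_ids, length(target_ids), replace = TRUE)
  resample <- do.call(rbind, lapply(seq_along(ids), function(i) {
    d <- dyad_data[dyad_data$target_id == ids[i], ]
    d$target_id <- i  # treat duplicated targets as distinct units
    d
  }))
  compute_item_regression(resample)  # hypothetical: returns coefficients
})

# 95% percentile confidence intervals, one per regression coefficient
apply(boot_coefs, 1, quantile, probs = c(.025, .975))
```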
An important preliminary question is to what extent the sample sizes in the two studies (i.e., the number of items, targets, and perceivers) are sufficient to detect actual effects of item features on inter-rater agreement. A simulation study would be ideal to clarify this question, but is unfortunately difficult to implement due to the computationally intensive bootstrap method. In order to at least approximate the available power, we took advantage of the fact that our analysis has similarities to the test of a moderator in the context of a meta-analysis. That is, the coefficients of agreement can be thought of as effect sizes, the items as studies, the item features as moderators, and the number of targets as the average sample size. Under these assumptions (and setting the alpha error to .05), the R package metapower (Griffin, 2021) suggests that an effect of a moderator in terms of the difference between r = .10 and r = .15 could be detected with 59% probability in Study 1 and with 95% probability in Study 2. It should be noted, however, that these estimates do not account for the statistical dependence of the agreement coefficients, the number of perceivers per target, or the continuous distribution of item features. Nevertheless, the sample sizes in classical studies (e.g., Funder & Dobroth, 1987; John & Robins, 1993) were comparable to or even smaller than those in our studies.
Results
Item Features
We present the means, standard deviations, minima, maxima and inter-rater reliabilities for the ratings of the item features in both studies in Table 3. Inter-rater reliability (ICC[2,k]) was highly satisfactory for all ratings. In line with previous findings by Leising et al. (2014), all dimensions showed a unimodal and symmetric distribution, except for social desirability, which had a bimodal distribution.
Study 1 | | | | | | OBS | | IMP | | ST | | BR |
Rating Dimension | Reliability | Min | Max | Mean | SD | r | Partial r | r | Partial r | r | Partial r | r | Partial r
DES | .99 | 1.26 | 9.32 | 5.23 | 2.47 | .28* | .17* | -.04 | -.19 | .41* | .28* | .61* | .55*
EV | --/-- | 0.02 | 4.24 | 2.28 | 0.96 | .11 | .09 | .76* | .75* | .19* | .04 | -.004 | -.10
OBS | .96 | 2.39 | 9.23 | 6.19 | 1.48 | | | .08 | .05 | .28* | .17 | .18 | -.02
IMP | .93 | 3.39 | 9.13 | 6.29 | 1.29 | | | | | .23* | .25* | .07 | .10
ST | .89 | 3.74 | 8.65 | 6.84 | 1.00 | | | | | | | .32* | .07
BR | .93 | 1.81 | 6.94 | 5.02 | 1.10 | | | | | | | |
Study 2 | | | | | | OBS | | IMP | | ST | | BR |
Rating Dimension | Reliability | Min | Max | Mean | SD | r | Partial r | r | Partial r | r | Partial r | r | Partial r
DES | .99 | 1.70 | 9.40 | 4.62 | 2.29 | .18* | .08 | .06 | -.13 | .31* | .28* | .48* | .44*
EV | --/-- | 0.00 | 3.90 | 2.24 | 0.99 | .16 | .00 | .62* | .60* | .20* | .01 | .05 | -.08
OBS | .94 | 2.23 | 8.77 | 6.07 | 1.48 | | | .28* | .23* | .15 | .01 | .24* | .14
IMP | .86 | 3.80 | 8.60 | 6.19 | 1.09 | | | | | .32* | .30* | .17 | .12
ST | .81 | 3.90 | 8.73 | 6.78 | 0.88 | | | | | | | .17 | -.02
BR | .89 | 2.90 | 6.73 | 4.92 | 0.95 | | | | | | | |
Note. Reliability = inter-rater agreement (ICC[2,k]) among 31 raters in Study 1 and 30 raters in Study 2, respectively, in judging the terms on the respective dimension. DES = Social Desirability, EV = Evaluativeness, OBS = Observability, IMP = Importance, ST = Stability, BR = Base Rate. Item features were rated on a 1-10 response scale. Evaluativeness ratings reflect the absolute difference between the item’s average desirability rating (across raters) and the midpoint of the rating scale (5.50) (John & Robins, 1993). As evaluativeness is computed on the basis of social desirability, no separate inter-rater reliability is reported and no inter-correlations were computed between the two dimensions.
* 95% confidence interval based on the bootstrap method with 5,000 resamples of items did not contain zero.
Examples for items with the lowest and highest ratings on each dimension are displayed in the right half of Table 2. Whereas the items largely covered the whole range (on a 10-point scale) of social desirability and observability, in both studies there was a lack of items with very low ratings on stability and importance (see also Table 3). Similarly, in neither study did we have items with very high ratings of base rate. This suggests that our measures of personality covered neither very state-like person characteristics, nor very unimportant characteristics nor characteristics applying to most people.
We examined possible associations between item features by inspecting scatter plots and by computing Pearson correlations, displayed in the right half of Table 3. Findings were very similar across the two studies and largely replicated previous findings by Leising et al. (2014), who analyzed a large set of adjectives instead of questionnaire items. First, scatter plots showed a strong curvilinear (U-shaped) relationship between social desirability and importance, which was also reflected in a very strong positive bivariate correlation between evaluativeness and importance (see Table 3). That is, highly evaluative items, including both highly undesirable and desirable ones, were also rated as referring to more important person features.
All other significant associations between item features were linear. As in previous research, some significant associations between item features remained when holding all other item features constant (see partial correlations in Table 3). The pattern of partial correlations was relatively stable across Study 1, Study 2, and Leising et al. (2014), with rank-order correlations between the three sets ranging from r = .62 to r = .75. Across both studies, items judged as describing more stable characteristics were also rated as more positive and more important. Moreover, social desirability had a strong positive association with base rate. That is, more positive items were applied to a greater number of persons, once more corroborating that the average person is described positively overall (Borkenau & Zaltauskas, 2009; Edwards, 1953; Leising et al., 2010).
In order to explore whether estimated base rates and normativity (Wood & Furr, 2016) may be two measurement approaches to the same construct, we also examined the estimated base rates’ associations with the empirical item means (i.e., normativity; Wood & Furr, 2016): Indeed, we found strong associations between the two in both studies (r = .70, p < .001 in Study 1, and r = .63, p < .001 in Study 2). However, correlations between the empirical item means (i.e., normativity) and estimated social desirability were considerably stronger (r = .93, p < .001 in Study 1, and r = .91, p < .001 in Study 2), while correlations between estimated base rates and social desirability were less pronounced (r = .61, p < .001 in Study 1, and r = .48, p < .001 in Study 2, see Table 3). Taken together, these findings support the notion that the two approaches assess two related but different constructs.
Average Levels of Agreement
The average levels of agreement across all items were all significantly different from zero, comparable between the two studies and consistent with levels of agreement found in previous research (e.g., Gallrein et al., 2013). For ease of interpretation, we back-transformed Fisher z-values into correlations in the following. Self-other agreement was r = .17 with a 95% confidence interval of .15 to .19 in Study 1 and r = .16 [.14, .17] in Study 2. The item with the lowest self-other agreement was “never really enjoys anything” with r = -.06, while the highest self-other agreement was reached for “is sporty” with r = .48 in Study 1. In Study 2, self-other agreement ranged from r = -.001 (“little interest or pleasure in doing things”) to r = .44 (“is reserved”). Other-other agreement was r = .15 [.13, .17] in Study 1 and r = .14 [.12, .15] in Study 2, ranging from r = .003 (“is interesting”) to r = .40 (“is sporty”) in Study 1 and from r = .03 (“often behaves as others expect, even if this is not good for him/her”) to r = .42 (“is quiet”) in Study 2. Meta-accuracy was r = .18 [.16, .20] in Study 1 and r = .17 [.16, .19] in Study 2. It ranged from r = -.03 (“is able to do things as well as most other people”) to r = .49 (“is sporty”) in Study 1 and from r = -.01 (“feels they do not have much to be proud of”) to r = .49 (“is reserved”) in Study 2. Meta-insight was r = .10 [.08, .11] in Study 1 and r = .10 [.09, .11] in Study 2, with agreement scores ranging from r = -.05 (“is able to do things as well as most other people”) to r = .25 (“likes themselves very much”) in Study 1 and from r = -.05 (“feels they do not have much to be proud of”) to r = .25 (“is reserved”) in Study 2.
Associations of Item Features and Agreement
Table 4 and Table 5 summarize the results of the multiple regression analyses in Study 1 and Study 2, respectively, predicting the different forms of agreement from item features. Note that we did not compute separate models for evaluativeness, as the effect of this item feature is conceptually captured by the quadratic effect of social desirability. We will present the results for each item feature separately, and then turn to the combined model, which includes all five item features as predictors at once. Results are presented across items; a table with itemwise results can be retrieved from our OSF page (https://osf.io/fxk7c/).
Study 1 | | Self-Other Agreement | | | Other-Other Agreement | | | Meta-Accuracy | | | Meta-Insight | |
Predictor | Parameter | Estimate | 95% CI LL | 95% CI UL | Estimate | 95% CI LL | 95% CI UL | Estimate | 95% CI LL | 95% CI UL | Estimate | 95% CI LL | 95% CI UL
DES | Adj R² | 0.170 | 0.060 | 0.213 | 0.166 | 0.070 | 0.208 | 0.175 | 0.056 | 0.228 | 0.085 | -0.009 | 0.136
| Intercept | 0.236 | 0.193 | 0.278 | 0.204 | 0.161 | 0.240 | 0.248 | 0.203 | 0.295 | 0.122 | 0.087 | 0.153
| Linear | 0.043 | 0.020 | 0.064 | 0.011 | -0.007 | 0.029 | 0.045 | 0.018 | 0.073 | 0.025 | 0.000 | 0.049
| Squared | -0.203 | -0.281 | -0.122 | -0.182 | -0.237 | -0.123 | -0.207 | -0.285 | -0.129 | -0.080 | -0.134 | -0.022
OBS | Adj R² | 0.171 | 0.047 | 0.235 | 0.221 | 0.093 | 0.274 | 0.244 | 0.079 | 0.294 | 0.182 | 0.001 | 0.229
| Intercept | 0.137 | 0.108 | 0.164 | 0.117 | 0.089 | 0.138 | 0.144 | 0.115 | 0.172 | 0.078 | 0.054 | 0.101
| Linear | 0.103 | 0.052 | 0.150 | 0.103 | 0.065 | 0.137 | 0.136 | 0.077 | 0.193 | 0.078 | 0.026 | 0.124
| Squared | 0.147 | 0.024 | 0.275 | 0.117 | 0.018 | 0.220 | 0.138 | 0.015 | 0.272 | 0.044 | -0.051 | 0.139
IMP | Adj R² | 0.254 | 0.101 | 0.307 | 0.260 | 0.138 | 0.302 | 0.199 | 0.069 | 0.265 | 0.036 | -0.016 | 0.089
| Intercept | 0.209 | 0.174 | 0.244 | 0.176 | 0.138 | 0.209 | 0.216 | 0.178 | 0.256 | 0.108 | 0.078 | 0.137
| Linear | -0.174 | -0.240 | -0.102 | -0.167 | -0.213 | -0.115 | -0.164 | -0.233 | -0.096 | -0.020 | -0.073 | 0.033
| Squared | -0.066 | -0.203 | 0.070 | 0.006 | -0.118 | 0.136 | -0.043 | -0.205 | 0.121 | -0.079 | -0.206 | 0.049
ST | Adj R² | -0.004 | -0.018 | 0.023 | 0.022 | -0.005 | 0.049 | 0.004 | -0.016 | 0.033 | -0.003 | -0.019 | 0.037
| Intercept | 0.159 | 0.126 | 0.191 | 0.145 | 0.110 | 0.173 | 0.165 | 0.131 | 0.201 | 0.085 | 0.054 | 0.112
| Linear | 0.091 | 0.012 | 0.165 | 0.117 | 0.064 | 0.168 | 0.108 | 0.033 | 0.179 | 0.038 | -0.030 | 0.107
| Squared | -0.102 | -0.257 | 0.055 | -0.229 | -0.310 | -0.142 | -0.107 | -0.212 | 0.009 | -0.004 | -0.109 | 0.103
BR | Adj R² | 0.035 | -0.005 | 0.069 | 0.023 | -0.007 | 0.054 | 0.056 | 0.007 | 0.094 | 0.068 | -0.005 | 0.109
| Intercept | 0.188 | 0.156 | 0.219 | 0.157 | 0.124 | 0.183 | 0.201 | 0.166 | 0.237 | 0.108 | 0.079 | 0.134
| Linear | -0.071 | -0.134 | -0.006 | -0.090 | -0.135 | -0.043 | -0.098 | -0.174 | -0.028 | -0.061 | -0.127 | 0.006
| Squared | -0.337 | -0.498 | -0.167 | -0.259 | -0.376 | -0.126 | -0.413 | -0.581 | -0.249 | -0.262 | -0.418 | -0.103
Joint Model | Adj R² | 0.431 | 0.222 | 0.455 | 0.520 | 0.344 | 0.537 | 0.466 | 0.230 | 0.495 | 0.265 | 0.016 | 0.298
| Intercept | 0.155 | 0.116 | 0.192 | 0.138 | 0.104 | 0.166 | 0.176 | 0.137 | 0.214 | 0.103 | 0.064 | 0.138
DES | Linear | -0.018 | -0.051 | 0.015 | -0.033 | -0.055 | -0.010 | -0.005 | -0.040 | 0.031 | 0.010 | -0.020 | 0.040
DES | Squared | 0.055 | -0.047 | 0.156 | -0.010 | -0.118 | 0.092 | -0.060 | -0.177 | 0.064 | -0.111 | -0.201 | -0.013
OBS | Linear | 0.111 | 0.055 | 0.167 | 0.127 | 0.086 | 0.165 | 0.147 | 0.083 | 0.210 | 0.077 | 0.021 | 0.125
OBS | Squared | 0.090 | -0.042 | 0.229 | 0.064 | -0.036 | 0.164 | 0.085 | -0.062 | 0.237 | 0.045 | -0.062 | 0.153
IMP | Linear | -0.201 | -0.281 | -0.118 | -0.169 | -0.232 | -0.102 | -0.140 | -0.241 | -0.046 | 0.034 | -0.047 | 0.112
IMP | Squared | -0.123 | -0.276 | 0.028 | -0.023 | -0.161 | 0.124 | -0.045 | -0.225 | 0.137 | -0.040 | -0.181 | 0.103
ST | Linear | 0.088 | 0.011 | 0.161 | 0.084 | 0.038 | 0.126 | 0.061 | 0.002 | 0.118 | -0.010 | -0.070 | 0.053
ST | Squared | -0.010 | -0.158 | 0.151 | -0.061 | -0.165 | 0.045 | 0.043 | -0.081 | 0.168 | 0.095 | -0.034 | 0.229
BR | Linear | -0.014 | -0.085 | 0.055 | -0.005 | -0.063 | 0.052 | -0.061 | -0.148 | 0.028 | -0.075 | -0.157 | 0.014
BR | Squared | -0.108 | -0.273 | 0.070 | -0.025 | -0.155 | 0.112 | -0.147 | -0.322 | 0.014 | -0.143 | -0.309 | 0.015
Note. DES = Social Desirability, OBS = Observability, IMP = Importance, ST = Stability, BR = Base Rate, Adj R² = Adjusted R². CI = confidence interval. CIs are based on the bootstrap method using 5,000 resamples of targets. Boldface indicates CI does not contain zero.
Study 2 | | Self-Other Agreement | | | Other-Other Agreement | | | Meta-Accuracy | | | Meta-Insight | |
Predictor | Parameter | Estimate | 95% CI LL | 95% CI UL | Estimate | 95% CI LL | 95% CI UL | Estimate | 95% CI LL | 95% CI UL | Estimate | 95% CI LL | 95% CI UL
DES | Adj R² | 0.054 | 0.003 | 0.109 | 0.065 | 0.022 | 0.104 | 0.056 | 0.011 | 0.105 | 0.020 | -0.015 | 0.068
| Intercept | 0.192 | 0.166 | 0.217 | 0.174 | 0.153 | 0.192 | 0.214 | 0.187 | 0.240 | 0.112 | 0.094 | 0.129
| Linear | 0.028 | 0.009 | 0.048 | 0.026 | 0.014 | 0.039 | 0.028 | 0.011 | 0.046 | 0.014 | -0.002 | 0.029
| Squared | -0.102 | -0.159 | -0.045 | -0.111 | -0.151 | -0.069 | -0.118 | -0.173 | -0.064 | -0.045 | -0.083 | -0.007
OBS | Adj R² | 0.386 | 0.245 | 0.425 | 0.445 | 0.334 | 0.494 | 0.429 | 0.285 | 0.477 | 0.291 | 0.107 | 0.325
| Intercept | 0.117 | 0.099 | 0.135 | 0.096 | 0.083 | 0.108 | 0.128 | 0.111 | 0.146 | 0.076 | 0.061 | 0.091
| Linear | 0.128 | 0.095 | 0.161 | 0.140 | 0.112 | 0.166 | 0.148 | 0.111 | 0.185 | 0.075 | 0.047 | 0.104
| Squared | 0.195 | 0.120 | 0.271 | 0.180 | 0.120 | 0.242 | 0.219 | 0.141 | 0.295 | 0.086 | 0.030 | 0.142
IMP | Adj R² | 0.006 | -0.016 | 0.043 | 0.010 | -0.012 | 0.039 | 0.011 | -0.014 | 0.047 | -0.006 | -0.018 | 0.032
| Intercept | 0.167 | 0.145 | 0.189 | 0.146 | 0.129 | 0.160 | 0.186 | 0.163 | 0.208 | 0.101 | 0.084 | 0.116
| Linear | -0.036 | -0.092 | 0.021 | -0.059 | -0.095 | -0.020 | -0.059 | -0.111 | -0.005 | -0.021 | -0.064 | 0.022
| Squared | -0.064 | -0.176 | 0.051 | -0.009 | -0.092 | 0.077 | -0.035 | -0.159 | 0.085 | -0.015 | -0.132 | 0.099
ST | Adj R² | 0.003 | -0.013 | 0.025 | 0.014 | -0.002 | 0.031 | -0.005 | -0.016 | 0.014 | -0.013 | -0.018 | 0.016
| Intercept | 0.152 | 0.129 | 0.175 | 0.131 | 0.116 | 0.144 | 0.176 | 0.153 | 0.199 | 0.101 | 0.083 | 0.118
| Linear | 0.117 | 0.049 | 0.183 | 0.138 | 0.091 | 0.180 | 0.073 | 0.003 | 0.144 | 0.001 | -0.063 | 0.072
| Squared | -0.240 | -0.359 | -0.118 | -0.289 | -0.375 | -0.194 | -0.195 | -0.330 | -0.061 | -0.041 | -0.171 | 0.082
BR | Adj R² | -0.012 | -0.018 | 0.014 | 0.007 | -0.009 | 0.033 | -0.009 | -0.018 | 0.015 | 0.004 | -0.016 | 0.039
| Intercept | 0.162 | 0.141 | 0.182 | 0.147 | 0.131 | 0.162 | 0.180 | 0.159 | 0.201 | 0.099 | 0.084 | 0.114
| Linear | 0.016 | -0.045 | 0.078 | 0.031 | -0.011 | 0.076 | 0.038 | -0.013 | 0.089 | 0.057 | 0.006 | 0.107
| Squared | -0.059 | -0.216 | 0.097 | -0.112 | -0.237 | 0.017 | -0.030 | -0.182 | 0.115 | 0.073 | -0.064 | 0.208
Joint Model | Adj R² | 0.499 | 0.342 | 0.526 | 0.590 | 0.472 | 0.623 | 0.567 | 0.416 | 0.594 | 0.377 | 0.154 | 0.410
| Intercept | 0.155 | 0.128 | 0.180 | 0.140 | 0.121 | 0.155 | 0.179 | 0.153 | 0.204 | 0.103 | 0.081 | 0.125
DES | Linear | 0.006 | -0.015 | 0.028 | -0.004 | -0.019 | 0.011 | 0.000 | -0.020 | 0.021 | -0.002 | -0.021 | 0.015
DES | Squared | -0.123 | -0.195 | -0.052 | -0.135 | -0.179 | -0.089 | -0.143 | -0.207 | -0.080 | -0.058 | -0.110 | -0.005
OBS | Linear | 0.167 | 0.126 | 0.208 | 0.174 | 0.141 | 0.206 | 0.205 | 0.160 | 0.250 | 0.109 | 0.075 | 0.141
OBS | Squared | 0.177 | 0.107 | 0.247 | 0.165 | 0.109 | 0.221 | 0.180 | 0.105 | 0.254 | 0.054 | -0.005 | 0.112
IMP | Linear | -0.007 | -0.065 | 0.052 | -0.032 | -0.069 | 0.006 | -0.023 | -0.079 | 0.034 | -0.009 | -0.064 | 0.045
IMP | Squared | -0.132 | -0.258 | -0.006 | -0.081 | -0.164 | 0.005 | -0.113 | -0.239 | 0.007 | -0.054 | -0.175 | 0.067
ST | Linear | 0.013 | -0.055 | 0.083 | 0.032 | -0.011 | 0.071 | -0.058 | -0.123 | 0.008 | -0.080 | -0.145 | -0.009
ST | Squared | -0.003 | -0.125 | 0.120 | -0.031 | -0.107 | 0.051 | 0.106 | -0.013 | 0.224 | 0.121 | -0.004 | 0.244
BR | Linear | -0.002 | -0.067 | 0.064 | 0.030 | -0.016 | 0.076 | 0.033 | -0.022 | 0.090 | 0.065 | 0.009 | 0.118
BR | Squared | 0.084 | -0.082 | 0.245 | 0.054 | -0.073 | 0.179 | 0.172 | 0.003 | 0.331 | 0.202 | 0.049 | 0.352
Note. DES = Social Desirability, OBS = Observability, IMP = Importance, ST = Stability, BR = Base Rate, Adj R² = Adjusted R². CI = confidence interval. CIs are based on the bootstrap method using 5,000 resamples of targets. Boldface indicates CI does not contain zero.
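For readers who wish to retrace the logic behind the confidence intervals in Tables 4 and 5, the following is a minimal Python sketch, not the analysis script used for the paper (which is available on the OSF page): it resamples targets with replacement, recomputes the per-item agreement coefficients and the adjusted R² of a quadratic item-feature model in each resample, and takes percentile CIs. All data objects are simulated stand-ins, and the per-item agreement index (a correlation across targets between self-ratings and aggregated informant ratings) is an assumed operationalization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-ins: self-ratings and aggregated informant ratings of
# n_targets persons on n_items items, plus standardized feature scores
n_targets, n_items = 80, 300
feat = rng.normal(size=n_items)                      # e.g., rated observability
self_r = rng.normal(size=(n_targets, n_items))
other_r = (0.2 + 0.3 * (feat > 0)) * self_r + rng.normal(size=(n_targets, n_items))

def adj_r2(y, X):
    """Adjusted R² of an OLS regression of y on the columns of X."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r2 = 1 - np.var(y - X1 @ beta) / np.var(y)
    n, k = X1.shape                                  # k includes the intercept
    return 1 - (1 - r2) * (n - 1) / (n - k)

def per_item_corr(a, b):
    """Correlation across targets, computed separately for each item."""
    az = (a - a.mean(0)) / a.std(0)
    bz = (b - b.mean(0)) / b.std(0)
    return (az * bz).mean(0)

def statistic(idx):
    # Step 1: per-item agreement among the resampled targets;
    # step 2: adjusted R² of a quadratic item-feature model
    agree = per_item_corr(self_r[idx], other_r[idx])
    return adj_r2(agree, np.column_stack([feat, feat ** 2]))

# 5,000 bootstrap resamples of targets, as in the table notes
boot = [statistic(rng.integers(0, n_targets, n_targets)) for _ in range(5000)]
print(np.percentile(boot, [2.5, 97.5]))              # percentile 95% CI
```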
Observability
Study 1. In line with previous research, there was a significant linear effect of observability on all types of agreement (Figure 1, Panel A): The more observable the person characteristics an item refers to, the better the different perspectives agreed, regardless of whether agreement was computed with self-, meta- and/or other-perceptions. Interestingly, we consistently found an additional significant positive quadratic effect of observability on all forms of agreement but meta-insight. The adjusted R² was significant for all models and ranged from .17 to .24.
Study 2. Replicating findings from Study 1, we found both a consistent linear effect as well as a significant positive quadratic effect of observability on all forms of agreement (Figure 1, Panel B). In Study 2, the quadratic effect was also significant for meta-insight. Again, the adjusted R² was significant for all models: observability explained between 11% (meta-insight) and 42% (other-other agreement) of the variance in the different forms of agreement.
Social Desirability
Study 1. For social desirability, there was a significant positive linear effect on self-other agreement and meta-accuracy, combined with a significant negative quadratic effect for all forms of agreement (Figure 2, Panel A). Scatterplots visualize this inverted U-shaped relationship between desirability and the different forms of agreement (Figure 2, Panel A), which was also reported by John and Robins (1993). The adjusted R² ranged between .17 for self-other and other-other agreement and .18 for meta-accuracy. The model was not significant for meta-insight (adjusted R² = .09).
Study 2. As in Study 1, we found rather consistent evidence for a negative quadratic relationship between social desirability and the different forms of agreement (Figure 2, Panel B). Except for the linear effect on meta-insight, all models revealed significant linear and quadratic effects of social desirability on inter-rater agreement. The adjusted R² was descriptively lower than in Study 1 and ranged between .05 for self-other agreement and .06 for other-other agreement and meta-accuracy. Again, the model was not significant for meta-insight (adjusted R² = .02).
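To see what such coefficients imply, one can plug values into the fitted quadratic. The following worked example uses the single-feature Study 2 desirability model from Table 5 (intercept 0.192, linear 0.028, quadratic −0.102) and assumes that the coefficients apply to standardized feature scores, with 0 marking a nearly neutral item:

```python
# Predicted agreement from the Study 2 desirability model (Table 5);
# x = the item's rated social desirability (assumed standardized)
predict = lambda x: 0.192 + 0.028 * x - 0.102 * x ** 2
for x in (-1.0, 0.0, 1.0):
    print(f"x = {x:+.1f}: {predict(x):.3f}")   # 0.062, 0.192, 0.118
# Vertex at x = 0.028 / (2 * 0.102) ≈ 0.14: predicted agreement peaks for
# nearly neutral items and declines toward both evaluative extremes.
```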
Importance
Study 1. Importance showed consistent negative linear effects on the different forms of agreement (Figure 3, Panel A). Adjusted R² values were significant for all forms of agreement but meta-insight (adjusted R² = .04) and amounted to .25 for self-other agreement, .26 for other-other agreement and .20 for meta-accuracy.
Study 2. Similar to Study 1, analyses revealed negative linear effects of importance on inter-rater agreement (Figure 3, Panel B); however, these were significant for only two of the four forms of agreement (other-other agreement and meta-accuracy). The confidence intervals for the adjusted R² all contained zero.
Stability
Study 1. For stability, analyses showed rather consistent positive linear effects, which were significant for all forms of agreement except meta-insight. A slight negative quadratic effect, which is also visible in the scatterplots (Figure 4, Panel A), was significant only for other-other agreement. However, the adjusted R² was not significant for any of these models.
Study 2. Effects of stability were more pronounced in Study 2 than in Study 1: There were significant positive linear and negative quadratic effects on self-other agreement, other-other agreement and meta-accuracy (see Figure 4, Panel B). Only when predicting meta-insight were the effects of stability not significant. However, the confidence intervals for the adjusted R² contained zero in all of these models.
Base Rate
Study 1. For models that predicted the different forms of agreement from base rate only, scatterplots suggested a negative quadratic relationship (Figure 5, Panel A). Indeed, models revealed a significant negative linear effect as well as a significant negative quadratic effect of base rate on self-other agreement, other-other agreement and meta-accuracy. Estimates predicting meta-insight were also negative, but only the confidence interval of the quadratic effect did not contain zero. The adjusted R² was significant only for the model predicting meta-accuracy (adjusted R² = .06).
Study 2. In contrast to Study 1, base rate had barely any effect on the different forms of agreement in Study 2 (Figure 5, Panel B): There was only a significant positive linear effect on meta-insight. Adjusted R² for the different models was not significant.
Joint Models
Study 1. When entering all five item features at once, all models became significant, with adjusted R² between .27 for meta-insight and .52 for other-other agreement (see Table 4). Observability and stability had unique positive linear effects, and importance had unique negative linear effects, on all forms of agreement but meta-insight. In addition, observability had a unique positive linear effect on meta-insight. Social desirability had a unique negative linear effect on other-other agreement, and a unique negative quadratic effect on meta-insight. There were no significant effects of base rate.
Study 2. All models predicting inter-rater agreement from all five item features at once were significant and explained between 38% (meta-insight) and 59% (other-other agreement) of the variance. The pattern of results differed somewhat from that in Study 1 (see Table 5): Observability consistently showed a unique positive linear effect on all forms of inter-rater agreement, along with a unique positive quadratic effect on all forms of agreement except meta-insight. For social desirability, we found consistent unique negative quadratic effects across all forms of agreement, but no unique linear effects. Importance had a unique negative quadratic effect on self-other agreement only, and no unique linear effects. Stability showed a unique negative linear effect on meta-insight only. Lastly, base rate had a unique positive linear effect on meta-insight, along with unique positive quadratic effects on both meta-insight and meta-accuracy.
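As a sketch of how such a joint model can be specified (again, the scripts actually used are on the OSF page), the following fits linear and quadratic terms for all five item features at once with statsmodels; the item-level data frame is a simulated stand-in:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
# Simulated stand-in: one row per item, with a per-item agreement
# coefficient and standardized ratings on the five item features
df = pd.DataFrame(
    rng.normal(size=(300, 5)), columns=["des", "obs", "imp", "st", "br"]
)
df["agree"] = (0.15 + 0.13 * df.obs + 0.18 * df.obs ** 2
               - 0.10 * df.des ** 2 + rng.normal(scale=0.1, size=300))

# Joint model: linear and quadratic terms for all five features at once
model = smf.ols(
    "agree ~ des + I(des**2) + obs + I(obs**2) + imp + I(imp**2)"
    " + st + I(st**2) + br + I(br**2)",
    data=df,
).fit()
print(model.rsquared_adj)   # the kind of adjusted R² reported in Tables 4 and 5
```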
Discussion
The present paper examined the joint and unique associations between five item features – observability, social desirability, importance, stability and base rate – and four different forms of inter-rater agreement: self-other agreement, other-other agreement, meta-accuracy and meta-insight. We found a number of replicable and theoretically reasonable associations between these two sets of variables. Specifically, inter-rater agreement was much higher for items referring to more observable target characteristics, and lower for more evaluative items. These effects were substantial in size.
Compared to previous research, the present study had a number of important strengths: (a) We included five item features, some of which had not been examined before with regard to their effects on inter-rater agreement. (b) We explored the effects of each item feature individually, as well as their unique effects when all features were used jointly as predictors. Doing both is important because of the various associations among the item features themselves. (c) We studied four different types of inter-rater agreement, not just self-other and other-other agreement. (d) We explored possible curvilinear associations between item features and inter-rater agreement, not only linear ones. (e) We used two independent samples to be able to replicate our findings. (f) These samples had been collected with the explicit goal of capturing a broad spectrum of dyadic relationships between perceivers and targets, in terms of liking and knowing, as is typical of everyday life. Despite the use of two independent samples, the studies were not pre-registered and should thus be considered exploratory in nature. Nevertheless, they yielded a number of consistent and important findings. We will now discuss some of the key findings from Studies 1 and 2 in more detail.
Item Features
First, as in previous studies (e.g., Leising et al., 2014), we found that all five item features could be rated with very good reliability by a relatively small number of laypersons. At the same time, the correlations between the different item features were lower than their reliabilities, showing that the features are clearly distinguishable from one another. This means that people, on average, seem to be aware of these differences between person-descriptors and agree relatively well with one another in ordering them along the respective continua.
Second, on all of the item features that we explored in the present study, the two item samples covered a large percentage of the available range. There were some limitations to this, however: Items with low stability, low importance, and high base rate were not part of either item sample. The first two of these restrictions, in particular, make perfect sense, given that most of the items came from psychometric personality scales: it would be difficult to conceive of such scales containing many items referring to volatile states or to person characteristics that are deemed largely irrelevant. Research is needed that investigates broader and more representative item sets, collected with fewer such contextual restrictions, such as the one recently presented by Leistner et al. (2024).
The fact that none of the items used in the present two samples had a particularly high base rate deserves some consideration of its own. The key question is how the base rates of person-descriptive terms are distributed in the natural language. In terms of mere informativeness, one might suspect that most terms have intermediate base rates (i.e., around 0 on our normalized base rate continuum), because such items would permit making the largest number of distinctions among people. One might also suspect that the natural language contains some items with higher base rates and some with lower base rates, in order to also allow for distinctions among people in those regions of the measured content dimension. All of this would be consistent with broad general principles of item selection in constructing psychometric tests. However, our two studies suggest that there is some asymmetry at work with regard to item base rates, in that the average base rate may be relatively low. Although the item samples we used in the present series of studies were not collected in a (stimulus-)representative fashion, the average base rate was also low in a much larger and more representative item sample collected by Leising et al. (2014). Thus, it seems possible that most of the terms that we have available for describing ourselves and others apply to relatively few people – they distinguish between the many who almost never do X and the few who do X at least sometimes. Future studies should look into this possibility in a more systematic fashion.
Third, our findings enhance our understanding of the base rate measure’s construct validity. To explore whether “base rate” in our studies reflects the construct that Wood and Furr (2016) referred to as “normativity,” albeit with a different measurement approach, we correlated our estimated base rates – assessed by a separate group of raters – with the empirical item means (i.e., normativity). We found a strong correlation between the two in both studies; however, the correlation between normativity and the items’ social desirability was even stronger. That is, in line with previous research, our findings support the notion that the average target is described overly positively (e.g., Borkenau & Zaltauskas, 2009; Rogers & Biesanz, 2015) and that normativity is also about how much the average perceiver likes the average target (e.g., Leising et al., 2010). Moreover, this suggests that our base rate estimates and the items’ empirical means (i.e., normativity) may be related, though not identical, constructs.
In support of this notion, the correlations between base rate estimates as directly rated by laypersons and social desirability ratings were less pronounced. This suggests that base rate estimates may have been less influenced by perceiver attitudes compared to the empirical item means and that estimated base rates more accurately reflect the actual base rates. It is conceivable that raters incorporate real behavior more into their base rate estimates than they do when making ratings of specific targets, which are more susceptible to dyadic attitudes. However, given the positive association with social desirability, raters may still struggle to fully detach from the evaluative connotations associated with the terms used. Future research will need to further examine the construct validity of the base rate measure.
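In code, the construct-validity check described above boils down to comparing three item-level correlations. A minimal sketch with hypothetical column names and simulated data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
latent = rng.normal(size=200)
# Hypothetical item-level variables: lay-estimated base rates, empirical
# item means across targets (normativity), and social desirability
items = pd.DataFrame({
    "est_base_rate": latent + rng.normal(scale=0.5, size=200),
    "item_mean": latent + rng.normal(scale=0.5, size=200),
    "desirability": rng.normal(size=200),
})
# Compare r(est_base_rate, item_mean) with r(item_mean, desirability)
# and r(est_base_rate, desirability)
print(items.corr())
```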
Fourth, across the two studies we found a fairly consistent pattern of correlations between the various item features. Among these, the associations between base rate and social desirability and the associations between importance and evaluativeness were particularly pronounced. They were also highly consistent with those reported in the previous study by Leising et al. (2014). This suggests that the item samples in the current two studies were at least not totally “off” in regard to the overall universe of person-descriptors. The inter-correlations between the item features once more highlight the importance of considering them jointly in order to determine their unique effects on inter-rater agreement while controlling for the respective other variables.
Inter-Rater Agreement
Overall, we found that average levels of agreement across all items were significantly different from zero and comparable between the two studies. It is noteworthy, though, that the levels of agreement we found were descriptively lower than in some previous studies (e.g., Carlson & Kenny, 2012; Kenny & West, 2010). We see at least two possible explanations for this: First, most previous studies used relatively homogeneous groups of informants who tended to like and know their targets very well. In contrast, we involved informants who differed considerably in terms of liking and knowing their targets, resulting in comparatively lower mean scores on these dyadic variables. Thus, one explanation for the lower levels of agreement in our studies could be the lower average level of acquaintance between targets and informants (e.g., Leising et al., 2010). Assuming that the greater variance in knowing and liking among our participants is more typical of many everyday situations, our mean levels of agreement may also be more representative of the agreement between different perspectives in everyday life.
Second, our measure of meta-perceptions was very broad, asking what “most others” think of the target. Many of the previous studies assessed meta-perceptions in a more specific way (e.g., by asking about meta-perceptions regarding specific groups of raters, or by calculating the mean of dyadic meta-perceptions; e.g., Carlson & Kenny, 2012). Thus, it is possible that our assessment of meta-perceptions led to lower estimates of (generalized) meta-accuracy and meta-insight, because they were not tailored to the specific informants who provided their impressions of the targets. However, in line with other ways of measuring meta-perceptions, as well as with previous research using our approach (Gallrein et al., 2013), we found that generalized meta-accuracy was higher than self-other agreement, suggesting that people are also able to assess the impressions they make on others more generally. Therefore, we think that assessing meta-perceptions in the way we did in the present paper may be an economical alternative.
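For concreteness, the item-level agreement indices discussed throughout can be illustrated as follows. This is a sketch of one plausible operationalization, assuming informant ratings have been aggregated within targets; the matrices are simulated:

```python
import numpy as np

rng = np.random.default_rng(4)
n_targets, n_items = 150, 200
# Simulated (targets x items) matrices: self-ratings, meta-ratings, and
# informant ratings aggregated within targets
self_r = rng.normal(size=(n_targets, n_items))
other_r = 0.5 * self_r + rng.normal(size=(n_targets, n_items))
meta_r = 0.6 * other_r + 0.3 * self_r + rng.normal(size=(n_targets, n_items))

def per_item_r(a, b):
    """Correlation across targets, computed separately for each item."""
    az = (a - a.mean(0)) / a.std(0)
    bz = (b - b.mean(0)) / b.std(0)
    return (az * bz).mean(0)

soa = per_item_r(self_r, other_r)   # self-other agreement, one value per item
ma = per_item_r(meta_r, other_r)    # meta-accuracy, one value per item
# (Meta-insight would additionally control for self-ratings, e.g., via
# partial correlations; that step is omitted here.)
print(soa.mean(), ma.mean())        # average agreement across items
```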
Effects of Item Features on Different Forms of Agreement
Both studies consistently found very strong joint effects of item features on the different types of inter-rater agreement. This is evident from the adjusted R² values for the joint models, which explained a substantial amount of variance in inter-rater agreement – even more so in Study 2 (38% – 59% across all models) than in Study 1 (27% – 52%). Therefore, it seems obvious to us that item features deserve much more attention in future person perception studies and theoretical models. Currently, item features do not play a particularly important role in many of the influential models of inter-rater agreement in the field (Funder, 1995; Kenny, 1994; Kenny et al., 2015).
Based on scatterplots and models with a single item feature as the predictor, we observed a similar pattern of effects across all four forms of inter-rater agreement and both studies, though the effects were least pronounced for meta-insight. In line with our expectations, these findings suggest that, despite some important perceptual differences between the different perspectives (e.g., Vazire, 2010), the effects of item features do not differ much between the different forms of agreement (Kenny & West, 2010).
Overall, results showed that observability, social desirability and – to a smaller degree – importance were significant predictors of inter-rater agreement, while base rate and stability were less influential. While scatterplots as well as models with only one item feature as predictor showed a rather consistent pattern of results across the two studies, the results of the models with joint effects of all five item features were less clear. When interpreting results from the two studies, the Study 2 results may be considered more reliable, as the sample size was considerably larger than in Study 1. Two findings were the most consistent across the two studies: First, inter-rater agreement was higher for items judged as referring to more observable target characteristics. This may be the least surprising finding of the present study, as the same effect has been replicated numerous times since Funder and Dobroth’s (1987) original study. Interestingly, however, in larger models predicting inter-rater agreement from all five item features at once, we were able to demonstrate that this linear effect of observability was unique, meaning that it persisted even when controlling for the other four item features. Additionally, both of our studies yielded strong and consistent evidence for a curvilinear component to this effect, which had not been reported before: inter-rater agreement does not only increase with items’ observability, it does so at an accelerating rate as one approaches the higher end of the observability scale. This implies that, for highly observable items, it becomes disproportionately easier for different perceivers to converge on a target’s characteristics, underscoring the crucial role that observability plays in person judgments.
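Plugging the Study 2 observability coefficients from Table 5 (intercept 0.117, linear 0.128, quadratic 0.195) into the fitted quadratic makes this accelerating pattern concrete; as before, we assume the coefficients apply to standardized feature scores:

```python
# Predicted agreement from the Study 2 observability model (Table 5);
# x = the item's rated observability (assumed standardized)
predict = lambda x: 0.117 + 0.128 * x + 0.195 * x ** 2
for x in (-1.0, 0.0, 1.0):
    print(f"x = {x:+.1f}: {predict(x):.3f}")   # 0.184, 0.117, 0.440
# The gain from x = 0 to x = +1 (about .32) dwarfs any change at the low
# end: predicted agreement rises ever more steeply toward highly
# observable items.
```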
Second, in line with previous research (D. A. Kenny, personal communication with R. Rau, April 15, 2024; John & Robins, 1993; De Vries et al., 2016), in both studies we found a negative quadratic effect of social desirability on inter-rater agreement. That is, items with more evaluative (i.e., more desirable or undesirable) content were associated with lower inter-rater agreement. Notably, we also found the negative quadratic effect of social desirability to be somewhat unique, as it was consistently present across all four joint models predicting different forms of agreement in Study 2 (but only in one of the four joint models in Study 1). Building on the theoretical idea that the rated social desirability of an item closely mirrors the extent to which its usage reflects the perceiver’s attitude toward the target (Leising et al., 2015), it follows that attitude differences between perceivers should have their greatest influence – and thus reduce inter-rater agreement the most – for highly evaluative items. This is precisely what we found, as both of our studies specifically recruited informants with varying attitudes toward the respective targets, thereby maximizing dyadic variance in liking. Thus, we assume that the strength of this effect in a given study depends on both the variance in the social desirability of items and the variance in perceiver attitudes toward the same target. Future research should routinely account for and report both of these factors. Ideally, this should be done in a way that allows for direct comparisons between studies, such as using a standardized liking scale across studies.
Across the two studies, we rather consistently found negative associations between importance and inter-rater agreement: the negative linear effect was significant in three of four models in Study 1, and in two of four models in Study 2. It was also unique in three of the four joint models in Study 1, but did not persist in the joint models of Study 2. This negative relationship might seem somewhat paradoxical at first, because it implies that people agree the least with one another in judging exactly those characteristics of a person that are deemed most important to know about. Would we not expect more important characteristics to be rated with better agreement, because (e.g.) all perceivers pay more attention to them? The finding cannot be explained by importance’s strong association with evaluativeness (see above), as controlling for (squared) social desirability in the joint models did not reduce the strength of the negative associations, let alone turn the associations of importance with inter-rater agreement positive. These results present a challenge for interpretation and highlight the need for stronger theoretical reasoning around the concept of importance in future research. One preliminary hypothesis is that perceivers may agree on which traits are important to assess but disagree on the level at which those traits are desirable. In other words, individuals may hold different values that shape their judgments of important characteristics in the same person, leading to greater disagreement on items they both consider important. For example, Peter and Paula might both think it is crucial to assess someone’s diligence, but for different reasons: Peter is frustrated by people who are not diligent enough, while Paula is annoyed by those who are overly diligent. That is, between-perceiver variance in values may operate as noise to the extent that the dimension those values refer to is considered important. Another avenue for future research may be the role of social context in shaping the importance of specific traits. For instance, diligence might be crucial when evaluating a colleague but less so when assessing a spouse. In studies like ours, where perceivers have different types of relationships with the target, these contextual differences could contribute to lower agreement on traits considered important.
To deepen the theoretical understanding of the concept of importance, future studies could thus a) replicate our findings with items that cover the full range of importance, b) explore how differences in values affect the perception of both important and unimportant traits, c) examine potential interactions between importance and social desirability, and d) investigate how the importance of traits may shift depending on the social context of person perception, in order to better clarify its role in inter-rater agreement.
Limitations and Outlook
Our item sample was restricted in that it did not contain very state-like characteristics or characteristics that most people share (see above). Like most previous research, our study used measures whose items had been selected somewhat arbitrarily by researchers, which limits the generalizability of the results (Leising et al., 2012; Leistner et al., 2024). We therefore believe that future studies should use larger and more representatively sampled sets of items, including reliable ratings of all of the most prominent item features (including, for instance, level of abstraction, which was not considered in the present study), and that these should be made publicly available (see e.g., Leistner et al., 2024). An increasing use of such item sets in subsequent studies could (a) significantly improve stimulus representativeness for person perception in daily life and (b) enable replication studies with independent sets of items that are nevertheless similarly structured in terms of item features, which (c) also helps to improve generalizability (e.g., by way of actual cross-validation). The range of an item feature captured by a given study is important in terms of interpretation, especially with curvilinear effects: Depending on which part of the range a study covers, it may find a positive, a negative, no, or the (true) curvilinear effect (see the simulation sketch below). Using a representative item sample may help shed light on the important role of item features in person perception, which we suggest deserves more recognition in person perception research.
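A small simulation illustrates the range problem: if the true feature-agreement relation were an inverted U, a study covering only part of the feature’s range could recover a positive, a negative, or essentially no linear effect.

```python
import numpy as np

rng = np.random.default_rng(5)
# Assume the true relation between a feature x and agreement y is an inverted U
x = rng.uniform(-2, 2, size=5000)
y = -x ** 2 + rng.normal(scale=0.5, size=x.size)

for lo, hi in [(-2, 0), (0, 2), (-2, 2)]:       # three hypothetical item samples
    m = (x >= lo) & (x <= hi)
    slope = np.polyfit(x[m], y[m], deg=1)[0]    # purely linear fit in that range
    print(f"range [{lo}, {hi}]: slope = {slope:+.2f}")
# Lower half: positive slope; upper half: negative slope; full range: near zero
```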
In addition to improving the representativeness of items, the representativeness of dyads should also be considered. In this study, we used an elaborate recruitment strategy for targets and perceivers which, we argue (but cannot know for sure), better reflects dyadic differences in real life (e.g., in terms of liking). In other scenarios, for example when perceivers are in relative agreement about who the “bad” and “good” targets are, the association of item features with inter-rater agreement may theoretically be reversed. In addition, some of our results may be specific to the one-with-many design. In this design, perceivers judge only one target at a time, whereas in a half-block design, for example, contrast effects can be much more pronounced because each perceiver judges many targets and compares them with regard to the items. Overall, it seems important that the design of a study corresponds as closely as possible to the structure of person perception in daily life.
Finally, we would like to point out the potential of combining the approach taken here – examining the effects of item features on inter-rater agreement – with established approaches such as multi-level profile analysis (MPA) of person descriptions (Biesanz, 2010; Furr, 2009). These approaches differ in a number of ways: for example, our analysis did not take into account differences in means and variances between items (due to the use of standardized coefficients of agreement), and we used a two-step procedure to estimate effects (rather than estimating all parameters in a single linear mixed model, as in MPA). The most important difference, however, is probably that our analysis focused on item features as predictors of agreement, whereas MPA focuses on features of targets, perceivers, or dyads. When item features have been considered in previous MPA-based studies (e.g., Wessels et al., 2020), they have only been considered as predictors of other-perceptions, not as predictors of the agreement between self- and other-perceptions. To extend MPA in this respect, it is necessary to include (random) interaction effects between (distinctive) self-perceptions and item features in the model. It would be particularly promising to include higher-order interactions with dyadic variables (such as the extent to which perceivers know and like their targets), thus opening the way to examining the extent to which such variables moderate the influence of item features on inter-rater agreement. Of course, the proper investigation of such interaction effects requires precise theoretical considerations as well as sufficiently large item samples that reflect the possible ranges of item features within the existing universe of person-descriptors well.
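As a sketch of the kind of extension envisioned here (not an established specification; all variable names and the simulated data are hypothetical), a profile-analytic mixed model could include a cross-level interaction between distinctive self-ratings and an item feature, with random distinctive-agreement slopes across dyads:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n_dyads, n_items = 120, 60
# Simulated long-format data: one row per (dyad, item)
dyad = np.repeat(np.arange(n_dyads), n_items)
feature = np.tile(rng.normal(size=n_items), n_dyads)   # item feature score
self_d = rng.normal(size=n_dyads * n_items)            # distinctive self-rating
slope = (0.3 + 0.2 * rng.normal(size=n_dyads))[dyad]   # dyad-specific agreement
other = slope * self_d + 0.1 * self_d * feature + rng.normal(size=self_d.size)
df = pd.DataFrame({"other": other, "self_d": self_d,
                   "feature": feature, "dyad": dyad})

# Fixed part: the self_d x feature interaction tests whether the item
# feature moderates distinctive self-other agreement; random part:
# intercepts and distinctive-agreement slopes that vary across dyads
model = smf.mixedlm("other ~ self_d * feature", df,
                    groups="dyad", re_formula="~self_d").fit()
print(model.summary())
```

Fully random feature-by-self interactions, as envisioned in the paragraph above, would require a richer random-effects structure than this sketch specifies.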
Contributions
Contributed to conception and design: DL, JZ, NMW
Contributed to acquisition of data: NMW
Contributed to analysis and interpretation of data: JZ, NMW, DL
Drafted and/or revised the article: NMW, JZ, DL
Approved the submitted version for publication: NMW, JZ, DL
Funding Information
Preparation of this manuscript was supported by the German Research Foundation’s grants LE2151/5-1 to Daniel Leising and ZI1533/1-1 to Johannes Zimmermann.
Competing Interests
We declare that no competing interests exist.
Data Accessibility Statement
All the items, participant data, analysis scripts and additional results can be found on this paper’s project page on the Open Science Framework: https://osf.io/fxk7c/.
Figure titles and legends
Figure 1. Study 1 (Panel A) and Study 2 (Panel B) results of multiple regression models with observability predicting the four different forms of agreement.
Figure 2. Study 1 (Panel A) and Study 2 (Panel B) results of multiple regression models with social desirability predicting the four different forms of agreement.
Figure 3. Study 1 (Panel A) and Study 2 (Panel B) results of multiple regression models with importance predicting the four different forms of agreement.
Figure 4. Study 1 (Panel A) and Study 2 (Panel B) results of multiple regression models with stability predicting the four different forms of agreement.
Figure 5. Study 1 (Panel A) and Study 2 (Panel B) results of multiple regression models with base rate predicting the four different forms of agreement.
Footnotes
While previous research (e.g., Leising et al., 2014) used the term “traitness” to describe this dimension, in line with more recent research (e.g., Leistner et al., 2024) we use the term “stability” in the present paper. This way, we also avoid possible confusion with the term “traitedness” (i.e., the notion that individuals differ in the extent to which traits are relevant to them; Allport, 1967; Bem & Allen, 1974).
We attempted to replicate these findings in a third study, with a broader item set drawn from the natural language, but restricted to other-other agreement and public targets (https://osf.io/tb4rp). However, we later realized that we had made a fundamental mistake in that study’s design (increasing target variance, rather than dyadic variance in liking). The third study is thus not included here and will have to be repeated after correcting that mistake.