Serial judgments of different target persons in a given situation can depend on the target's position in the series: Perceivers may initially withhold extreme judgments to avoid inconsistencies in their judgments in case of more extreme observations later on. With subsequent observations, perceivers may better calibrate their judgment scale. We extend this theoretical reasoning on calibration effects to personality inferences: Between-target rating variability, consensus, and accuracy may be diminished in initial judgments if perceivers prefer moderate judgments regardless of the target. We tested and cross-validated these preregistered expectations using a sample of 3,963 perceivers who judged 200 targets regarding their personalities in 20 different situations. Whereas rating variability and consensus increased across the judgment series, accuracy did not. However, initial judgments were not always more moderate (e.g., they were higher for Agreeableness), suggesting that perceivers reference trait-specific default values. This may be beneficial or detrimental for targets, depending on how their actual characteristics compare.
Serial judgments refer to sequences of different target persons being evaluated in a given situation. Such setups are rather ubiquitous and can have far-reaching consequences. This is most obvious in contexts of explicit evaluation: Doctors assess patient after patient, judges rule case after case, teachers grade student after student. It is similarly true for contexts focused more on evaluating a person's traits rather than one specific outcome (e.g., performance), which includes typical tasks faced by applied psychologists in various fields (e.g., therapists, human resource experts, or counselors). Moreover, people regularly face series of others in everyday life without making explicit judgments: They serve series of customers, talk to clients, go on dates, or pass by strangers on the street. People constantly form impressions of others and their behavior in everyday life, trying to pinpoint interindividual differences or predict future behavior. The order in which different target persons are “presented” may be determined by their surname, temporal availability, or even chance – it is usually unrelated to the characteristic to be judged and should therefore not affect the judgments of a fair perceiver. An excellent performance, for example, should be judged as such, regardless of whether it is the first or the 100th in a series.
Unsurprisingly, however, research indicates that the human perceiver is indeed affected by the order of things: The position in a sequence at which someone is judged can be quite influential and accompanied by beneficial or detrimental consequences for the target. Performing later in sports or music competitions, for example, was often associated with better outcomes than presenting early on (e.g., Bruine de Bruin, 2005; Flôres & Ginsburgh, 1996; Glejser & Heyndels, 2001; Rotthoff, 2015; Wilson, 1977). Similar effects appear in other domains: Plonsky et al. (2020) showed that asylum request cases presented in court later in a sequence were more likely to be granted, and Orazbayev (2017) suggested that scientific manuscripts submitted later on a given day were less likely to be desk-rejected. Notably, all available evidence stems from contexts that concern the explicit evaluation of very specific variables such as performance. The larger field of person perception, in turn, has thus far ignored between-target order effects. The present study aims to close this gap, focusing on personality inferences from behavior observation. In the following, we describe how serial position effects may come about and how they may shape general person judgments.
The Calibration Hypothesis
Unkelbach and Memmert (2014) suggest a general explanation for serial position effects based on an integration of existing findings with their own research from various fields (e.g., soccer refereeing, university oral exams): They argue that when people judge a series of different targets in a given situation (e.g., exam performances), judges tend to withhold “extreme” judgments in the beginning and are instead drawn to “moderate” judgments. The main prediction from this explanation is that a good performance is evaluated worse when it is judged in the beginning compared to the end of a sequence, whereas a poor performance in the beginning of a sequence is evaluated better than the same performance presented later in the sequence.
To arrive at this prediction, Unkelbach and colleagues (e.g., Fasold et al., 2015; Unkelbach et al., 2012; Unkelbach & Memmert, 2008) combined the range-frequency model by Parducci (1965) with the consistency model by Haubensak (1992). Based on these two models, they propose that the described order effects may be explained in terms of a calibration of the perceivers: At the beginning of a judgment sequence, perceivers do not know the range of the behaviors (e.g., performances) they are to observe and judge – the minimum and maximum levels are unknown to them. More technically, perceivers have not yet formed an internal function with which they can translate perceived stimuli into a judgment; they need to calibrate an internal scale first (i.e., establish the scale range). When this lack of calibration is combined with people’s desire to provide consistent judgments (in the sense that their judgments should reflect the perceived differences between targets), assigning a more “moderate” judgment (e.g., closer to the midpoint of a response scale in a questionnaire) at first is sensible. It leaves the possibility to provide higher or lower judgments in case the next target displays even higher or lower levels of the behavior and therefore lowers the chances of consistency violations.
To illustrate, let us consider the case of only two serial judgments: If the perceiver uses the response scale mean (i.e., the technically moderate option) on the observed behavior of the first target, the scale’s consistency cannot be violated. The second target may act more extreme in either direction or the same – all possibilities are still available to the perceiver. However, if the perceiver uses the extreme points of the scale for the first target, a violation is possible: If the following target acts more extreme, this cannot be expressed within the provided scale range. After perceivers have observed multiple targets, they can use their prior information for a comparison with the new target. Additionally, the likelihood that they have observed the extremes increases, and thus, the possibility of scale violations decreases. At the end of the sequence, judges may use the scale endpoints with zero possibilities of scale consistency violations. This fact is utilized in some competitive judgment contexts: In competitive gymnastics, for example, performers are ordered by coaches such that the best comes last (Plessner, 1999). To summarize: Judgments across a range of targets may become more variable due to increased judgment extremity, depending on how much “comparison information” was presented to the perceivers beforehand (i.e., the length of the series).
Notably, the reasoning behind calibration effects applies to lay and expert perceivers alike (Unkelbach et al., 2012): An inexperienced perceiver has little or no previous experience with the behavior they must judge. Therefore, they do not know the extremes and the expectable range – which leads to initial avoidance of the extremes. An experienced perceiver, on the other hand, brings prior knowledge into a new situation. They have seen a large range of the behavior in question, likely including extremes. At the same time, they still do not know what range of behavior they will observe in this new situation or new judgment sequence. Thus, even when seeing an excellent performance at first, experts may avoid extreme scores because they know from experience that even more extreme performances are theoretically possible, but they do not know whether they will encounter something more extreme during the current judgment series.
Calibration Effects in Person Perception
So far, our examples have referenced explicit evaluation contexts such as classrooms or sport arenas where the outcome of interest often is performance. This is because calibration effects have only been investigated in such contexts. We think it is important to consider them for person judgments in general. Specifically, calibration may affect the inferences we make about what other people are like (i.e., their personality) based on observing their behavior.
The basic process of judgment formation in both performance-based and personality judgments is similar, which is why calibration effects may be relevant in both: People observe others’ behaviors and assign meaning in relation to a judgment dimension. The parameters of this process, however, may also distinguish the two types of judgments from one another. For example, there are typically no set rules for personality judgments as to which behaviors are relevant for a given trait dimension, but there often are such rules for performance evaluations. Many performance judgments are also limited to one type of situation (e.g., a sports competition) in which the range of relevant behaviors that targets are supposed (not) to show is fairly restricted. In contrast, most personality judgments are formed across a host of situations that differ in the traits they pull for, norms they imply, and behaviors different targets may opt for.
Order effects have long been a topic in the person perception literature, but the focus has exclusively been on within-target effects (e.g., primacy effects when getting to know someone; e.g., Asch, 1946; Wiedenroth et al., 2020). However, the sequential nature of experiencing different persons in a situation is also an important aspect in many interpersonal contexts, including those where both the serial and the evaluative nature are explicit (e.g., personnel selection), but also mundane situations where they may not be as obvious (e.g., talking to several new people at a party). Even beyond real-life contexts, serial judgments are a tool used in (personality) research making it essential to understand if and how they are affected by presentation order.
This study aims to establish whether serial position effects are relevant to person judgments in general. To this end, we apply the presented reasoning on calibration effects to person judgments on multiple dimensions of personality in multiple situations. Specifically, we focus on a setup where perceivers sequentially observe and judge a series of different targets who were individually filmed in standardized situations. For example, imagine watching a video of a target talking about their hobbies, after which you are asked to rate how sociable they are. Subsequently, you watch a second target talking about their hobbies and rate their sociability, and so on. Other perceivers will observe the same targets but in a different order. Assuming the reality of calibration effects, we would expect that perceivers avoid assigning their first target an extreme rating to avoid consistency violations in their ratings of subsequent targets. For example, if perceivers rate how sociable the target was on a scale of 1 to 5, they may avoid assigning ones or fives, instead veering toward more moderate response options (i.e., in this case, closer to the response scale mean of 3).
According to this reasoning, three empirical outcomes may be expected, given that several perceivers each judge the same series of targets, but the order in which the targets are presented is varied across perceivers. The fundamental assumption is that perceivers prefer moderate ratings in early judgments, to avoid consistency violations later on. That means that early judgments of different targets will “clump” more tightly around the moderate rating option regardless of the targets’ actual behavior. Thereby, some of the actual differences between targets will be diminished in early judgments, which means a lower relative proportion of systematic target variance and a higher relative proportion of unsystematic error variance. Thus, in early judgments compared to later judgments, we would expect (1) an overall lower interindividual (i.e., between-target) variability in ratings across the different targets as perceivers steer away from extreme responses. Moreover, the lower target variance in early judgments could in turn affect (2) inter-rater agreement (i.e., consensus) and (3) rating accuracy. Consensus describes how well different perceivers align in their evaluation of targets’ (relative) standings on a trait. Thus, lower systematic target variance in early judgments would be expected to lead to lower consensus. With later judgments, we would expect consensus to increase: Perceivers would calibrate their internal judgment scales, and a clearer rank-order of the targets would emerge as perceivers gain the opportunity to compare later targets with previous ones. The same reasoning applies to rating accuracy: As target variance is diminished, early ratings should show lower correspondence to external criteria (e.g., targets’ self-ratings), whereas later judgments would become more accurate as perceivers calibrate.
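To make the link between compressed target variance and lower consensus explicit, a minimal variance-decomposition sketch may help (our illustration in the spirit of classical test theory; the notation is not taken from the original analyses):

$$ x_{pt} = \mu + \tau_t + \varepsilon_{pt}, \qquad \rho_{\text{consensus}} = \frac{\sigma^2_{\tau}}{\sigma^2_{\tau} + \sigma^2_{\varepsilon}} $$

Here, \(x_{pt}\) is perceiver \(p\)'s rating of target \(t\), \(\tau_t\) is the systematic target effect, and \(\varepsilon_{pt}\) is unsystematic error. If early judgments shrink the target effects toward a default value while error variance stays roughly constant, \(\sigma^2_{\tau}\) decreases, and with it the expected correlation between any two perceivers' ratings as well as, analogously, the correlation with external criteria.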
The Present Study
In the present study, we test these three predictions: Between-target rating variability, consensus, and rating accuracy should increase as a function of the targets’ serial position within a sequence, because raters are more likely to use extreme judgments as the series progresses.
We employ a large data set containing 20 subsamples, within each of which about 200 perceivers each judged a series of 10 out of 200 targets with regard to their personalities based on observing their videotaped behavior in a given situation. Our hypotheses and analysis plan were preregistered (see Method section). The present research goes substantially beyond previous research, where only one type of situation (e.g., oral exams) and a single rating dimension (e.g., exam performance) were investigated (e.g., Fasold et al., 2013). We examine calibration effects in personality inferences from observed behavior across a range of situations and multiple traits.
Method
We used data from a larger interpersonal perception project aimed at investigating a range of effects (funded by the German Research Foundation, grant LE 2151/6-1). As some of these investigations have already been published (Heynicke et al., 2022; Wiedenroth et al., 2020, 2024; Wiedenroth & Leising, 2020), the project description in the following draws on prior accounts for the sake of consistency: 200 target persons and acquainted informants first assessed the targets’ personalities. Next, the targets were filmed performing a variety of tasks. Perceivers who were unacquainted with the targets then watched videos of ten different targets performing the same task and rated those targets’ personalities. We describe all data from the project that are relevant to examining calibration effects as defined above. We report how sample size was determined, all data exclusions, all manipulations, and all measures relevant to this study. The larger project includes additional measures and data, such as a condition wherein a different set of perceivers each saw ten videos of the same target in different situations. These are irrelevant to this study and are not described in further detail, but the complete data set and codebook with all variables are available on the Open Science Framework (OSF; https://osf.io/q6xdn/). This repository also includes materials used for data collection. Materials specifically relevant to this study (e.g., analysis code, preregistrations, Supplements S1−S4 referenced in the following) are available here (https://osf.io/fzgep/).
Targets
We recruited target persons locally (e.g., via online and newspaper ads) from the general population in a medium-sized German city. Targets assessed their own personality online. Additionally, targets recruited up to three acquaintances from their personal networks to assess the respective target’s personality as knowledgeable informants. The target sample consisted of 200 people (102 female/98 male) between 17 and 80 years (Mage = 33.29, SDage = 14.48). The informant sample comprised a total of 508 people (300 female/208 male) between 18 and 78 years (Mage = 35.54, SDage = 14.53). Four targets did not have any informant-reports. The target sample size had been specified beforehand in the grant proposal of the project. It was informed by expected effect sizes of several different effects that the project addresses (Wiedenroth et al., 2020, 2024; Wiedenroth & Leising, 2020). Given that in our setup perceivers repeatedly rated the 200 targets (see below), the following assumption underestimates power, but in general, a sample size of 200 implies more than 80% power in capturing effects of |ρ| = .20 (two-tailed, α = .05; G*Power Version 3.1.9.2, Faul et al., 2014).
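As a rough plausibility check of this power figure, the corresponding computation can be sketched in R with the pwr package (our sketch; the authors report using G*Power, and exact numbers may differ slightly between programs):

```r
# Sample size needed to detect |rho| = .20 with 80% power, two-tailed alpha = .05
library(pwr)
pwr.r.test(r = 0.20, sig.level = 0.05, power = 0.80, alternative = "two.sided")
# n comes out at roughly 190-195, i.e., just below the 200 available targets
```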
Measures
Targets and informants respectively used first- and third-person versions of two measures to assess the targets’ personalities. The first was a list of 30 person-descriptive German adjectives that assess the Big Five personality factors compiled by Borkenau and Ostendorf (1998): Agreeableness, Conscientiousness, Extraversion, Neuroticism, and Openness to Experiences (called Intellect in the original article). This measure will be named BO-30 in the following. The second measure was a German translation (Treiber et al., 2013) of the NEO-IPIP-120 (Johnson, 2014). Ratings for both measures were given on a 5-point scale ranging from 1 = does not apply at all to 5 = applies exactly. Target self- and informant-ratings served as accuracy criteria (see below). For this, we only used the BO-30 ratings to (1) avoid effects due to the different measures as perceivers only used the BO-30 to rate the targets (see below) and (2) limit the number of tests.
Videotaping
Next, targets came to the lab and were filmed while performing a variety of tasks in 20 different, standardized situations (e.g., telling a joke or doing mental arithmetic; see Supplement S1 on the OSF for the full list). The situations lasted 1−3 minutes, and targets completed them in varied order determined by Latin Squares (Williams, 1949). Targets were compensated with 40€ for their participation.
Perceivers
Finally, to watch and judge these videos, we recruited participants from all over Germany. These perceivers received individual single-use links to an online platform where they could watch the videos and provide their ratings. The rating scheme described in the following is also depicted in Table 1: Each perceiver watched ten different targets in the same situation. The construction was such that for a given videotaped situation there were always ten perceivers (a “perceiver block”) that saw the same ten targets (a “target block”) in the same situation. Thus, with the total of 200 targets, there were 20 target and perceiver blocks per situation. The 200 perceivers that judged the targets in a given situation will be called a “perceiver sample” in the following. As there were a total of 20 situations, the overall sample size of perceivers we aimed for was N = 4,000. Participants were randomly assigned to perceiver/target blocks and situations.
Situationa | Perceiver sample | Perceiver block | Perceivers | Target block | Targets |
S1 | PS1 | PB1 | P1-P10 | TB1 | T1-T10 |
PB2 | P11-P20 | TB2 | T11-T20 | ||
[…] | […] | […] | […] | ||
PB20 | P191-P200 | TB20 | T191-T200 | ||
S2 | PS2 | PB21 | P201-P210 | TB1 | T1-T10 |
PB22 | P211-P220 | TB2 | T11-T20 | ||
[…] | […] | […] | […] | ||
PB40 | P391-P400 | TB20 | T191-T200 | ||
[…] | […] | […] | […] | TB1 | T1-T10 |
[…] | […] | TB2 | T11-T20 | ||
[…] | […] | […] | […] | ||
[…] | […] | TB20 | T191-T200 | ||
S20 | PS20 | PB381 | P3801-P3810 | TB1 | T1-T10 |
PB382 | P3811-P3820 | TB2 | T11-T20 | ||
[…] | […] | […] | […] | ||
PB400 | P3991-P4000 | TB20 | T191-T200 |
Note. S = situation, P = perceiver, T = target, PS = perceiver sample, PB = perceiver block, TB = target block.
aThe 20 videotaped situations (S1 to S20) were randomly assigned to numbers 1 to 20 to randomize their order for analysis (see Supplement S1).
For each perceiver/target block in each situation, the presentation order of the ten videos (i.e., ten targets) was varied across the ten perceivers using Latin Squares (Williams, 1949). This ensured not only that each target was presented once at each of the ten possible positions in the video sequence that perceivers watched, but also that each target appeared equally often before and after each other target. Figure 1 displays an example of such a Latin Square arrangement for one perceiver/target block. Note that, whereas the perceivers and the targets within each of these Latin Squares remain the same, the positions at which the targets are presented to the perceivers vary systematically. This design is thus particularly well suited to studying effects of presentation order.
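For illustration, a counterbalanced order of this kind can be generated with the classic Williams construction for an even number of conditions. This is a sketch of the general recipe, not the authors' actual code or the exact orders used in the study:

```r
# Williams-type Latin square for an even number n of targets:
# every target appears once at every position, and every target immediately
# precedes/follows every other target exactly once (first-order carryover balance).
williams_square <- function(n) {
  stopifnot(n %% 2 == 0)
  first <- numeric(n)
  first[seq(1, n, by = 2)] <- c(0, (n - 1):(n / 2 + 1))  # 0, n-1, n-2, ...
  first[seq(2, n, by = 2)] <- 1:(n / 2)                  # 1, 2, 3, ...
  # classic first row: 0, 1, n-1, 2, n-2, 3, ...; remaining rows are cyclic shifts
  t(sapply(0:(n - 1), function(k) (first + k) %% n)) + 1
}
williams_square(10)  # rows = perceivers within a block, columns = positions, entries = targets
```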
Directly after watching each video, perceivers rated the personality of the target based on their behavior in that video using the BO-30. Criteria for participant exclusion had been defined prior to any data analysis: Apart from a minimum age of 18 years, perceivers were excluded if they did not complete the assessment, indicated they knew a target, reported problems with the video or audio quality, or fulfilled criteria of careless responding. Careless responding was defined by a perceiver’s ratings showing (1) zero variance for any video or (2) maximum inconsistency for ratings of positive and negative items1. With these criteria, the final perceiver sample included 3,963 perceivers (2,429 female/1,534 male) who were between 18 and 82 years old (Mage = 28.56, SDage = 10.71). Perceivers received 10€ for participating.
Data Analytical Approach
We preregistered our expectations and primary analysis plan (https://osf.io/dv2y4/). As our hypotheses concern judgments of a sequence of different targets in the same situation, the data structure provided us with 20 independently usable perceiver samples each comprising 20 blocks of about 10 perceivers: In each of the 20 videotaped situations, the targets were judged by different perceivers. We used these 20 perceiver samples to progress from exploratory to confirmatory analyses, ultimately cross-validating our findings. In each sample, each perceiver watched and judged a series of 10 of the 200 different targets in the same situation. The position at which targets were presented in the perceivers’ sequence of ten targets (see Figure 1) is our primary predictor of interest for all three hypotheses: A later position should be accompanied by (1) increased rating variability across targets, (2) increased consensus, and (3) increased judgment accuracy.
For all analyses, we used domain-level ratings. That is, we aggregated a given perceiver’s ratings of a given target in a given situation on the individual BO-30 items into the five scale scores reflecting the Big Five domains. Although our core hypotheses are the same for all Big Five domains, they were analyzed separately to account for possible domain differences (e.g., due to different visibility of the traits; Vazire, 2010).
Dependent Variables
We calculated the three dependent variables (DV; i.e., variability, consensus, and accuracy) within a respective perceiver sample first per perceiver block and then averaged them across the 20 perceiver blocks (see Table 1 for data structure).
Variability. We operationalized rating variability in line with the theoretical reasoning for calibration effects: We predicted between-target rating variability to increase across the judgment series because perceivers initially prefer moderate judgments and subsequently grow more likely to use extremer options. Thus, rating variability should initially be more limited around the moderate response option, which in our case is the middle response 3 of the 5-point Likert scale. Accordingly, we calculated variability as the average squared deviation of all judgments from the midpoint of the response scale2. Variability was calculated position-wise, that is, first for the ratings of the 10 targets perceivers saw at position 1, then for the ratings at position 2, and so on. We then investigated if and how the variability estimates could be predicted from the position of the respective judgments using linear regression. We predicted an increase.
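A minimal sketch of this computation in R (assuming a long-format data frame named `ratings` with columns `perceiver_block`, `target`, `position`, and a Big Five domain score `score`; these names are illustrative and not the original analysis code):

```r
library(dplyr)

# Position-wise variability: mean squared deviation of domain scores from the scale
# midpoint (3), first per perceiver block, then averaged across blocks
variability <- ratings %>%
  group_by(perceiver_block, position) %>%
  summarise(var_mid = mean((score - 3)^2), .groups = "drop") %>%
  group_by(position) %>%
  summarise(var_mid = mean(var_mid), .groups = "drop")

# Predict variability from position; the "logarithmic" model uses ln(position) as predictor
fit_linear <- lm(var_mid ~ position, data = variability)
fit_log    <- lm(var_mid ~ log(position), data = variability)
```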
Consensus. We calculated consensus as the correlation of every pairwise combination of ratings of the same target from two different positions (e.g., correlation of ratings of the 10 targets made at position 1 with those from position 2). This results in 45 correlations which were Fisher z-transformed and predicted using linear regression from the “average position” of the two ratings that went into calculating a given agreement correlation (e.g., for the correlation of judgments from position 1 and 2, the average position is 1.5). We expected consensus to increase with (average) position, due to better calibration.
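Consensus can be sketched analogously for a single perceiver block, in which each of the ten targets is rated once at every position (again using the hypothetical `ratings` data frame; in the actual analyses, these estimates were averaged across blocks):

```r
library(dplyr)
library(tidyr)

# One block: rows = targets, columns = positions
block1 <- ratings %>%
  filter(perceiver_block == 1) %>%
  select(target, position, score) %>%
  pivot_wider(names_from = position, values_from = score, names_prefix = "pos")

# Fisher-z-transformed correlation for each of the 45 position pairs,
# regressed on the average position of the pair
pos_pairs <- t(combn(1:10, 2))
consensus <- data.frame(
  avg_position = rowMeans(pos_pairs),
  z = atanh(apply(pos_pairs, 1, function(p)
    cor(block1[[paste0("pos", p[1])]], block1[[paste0("pos", p[2])]])))
)
fit_consensus <- lm(z ~ avg_position, data = consensus)
```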
Accuracy. For rating accuracy, we averaged the target’s self-ratings and the respective informant-ratings on the BO-30 and used that average as our accuracy criterion3. The position-wise perceiver judgments were correlated with our accuracy criterion. The resulting 10 accuracy estimates were Fisher z-transformed and predicted using linear regression from the position of the respective judgments. We expected accuracy to increase with position, due to better calibration.
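The accuracy computation follows the same pattern, continuing the block sketch above and assuming a vector `criterion` that holds the averaged self- and informant-ratings for the ten targets in the block, ordered to match its rows (hypothetical names, not the original code):

```r
# Position-wise accuracy: correlation of judgments with the self/informant criterion,
# Fisher z-transformed and regressed on position
accuracy <- data.frame(
  position = 1:10,
  z = atanh(sapply(1:10, function(p) cor(block1[[paste0("pos", p)]], criterion)))
)
fit_accuracy <- lm(z ~ position, data = accuracy)
```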
Model Fit
To index the fit of our regression models, we used the determination coefficient R2. For R2 to be an indicator of goodness of fit and comparable across different approaches of fitting models (e.g., freely estimating coefficients vs. testing a model with specified coefficients, such as a fixed slope), it must be defined as a coefficient of determination that compares the amount of variance explained by the predicted values to the amount of total variance. This value equals the squared Pearson correlation coefficient only in some cases. We use the following formula, which is frequently used and recommended as being widely applicable (Kvålseth, 1985; formula 1 for R2):

$$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $$

With: y = observed values of the dependent variable; ŷ = predicted values of y based on the estimated model; ȳ = observed mean value of y.
Note that this indicator can become negative if the model that is tested fits the data worse than a horizontal line through the mean of the data. This can be the case if parameters of the tested model are fixed, as we do for cross-validation (e.g., if the slope of the tested model is pre-specified and has the opposite directionality of the apparent slope in the actual data). We used bootstrapping to provide 99% confidence intervals (bias-corrected and accelerated; BCa) to determine the significance of R2, with 10,000 resamples drawn from the target blocks in a perceiver sample.
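The following sketch illustrates this fit index for a cross-validation model with a pre-specified slope, together with a bootstrapped BCa interval via the boot package. It continues the `variability` sketch above and is illustrative only; in the reported analyses, resamples were drawn from the target blocks, and the fixed slope came from the preceding perceiver samples:

```r
library(boot)

# Coefficient of determination as in Kvålseth (1985, formula 1); can turn negative
# when a fixed-slope model fits worse than the mean of the data
r2_kvalseth <- function(y, y_hat) 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)

fixed_slope <- 0.012  # hypothetical slope taken from earlier samples

# Only the intercept is estimated; the slope enters as a fixed offset
fit_fixed <- lm(var_mid ~ 1 + offset(fixed_slope * position), data = variability)
r2_kvalseth(variability$var_mid, fitted(fit_fixed))

# 99% bias-corrected and accelerated (BCa) bootstrap CI for R2
r2_stat <- function(d, idx) {
  d <- d[idx, ]
  f <- lm(var_mid ~ 1 + offset(fixed_slope * position), data = d)
  r2_kvalseth(d$var_mid, fitted(f))
}
boot_out <- boot(variability, r2_stat, R = 10000)
boot.ci(boot_out, conf = 0.99, type = "bca")
```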
Sequential Approach
We used the 20 perceiver samples, based on the 20 different situations the targets were filmed in, to progress stepwise from exploration to confirmation. That is, the perceiver samples were brought into random order (see Supplement S1). The first sample was used for exploration and based on the findings, we preregistered and performed newly specified analyses with the next sample. We repeated this process of specifying, testing, and revising, using new samples several times, resulting in five consecutive preregistrations (https://osf.io/dv2y4/, https://osf.io/xq6nr/, https://osf.io/t2hb3/, https://osf.io/qnv3u/, https://osf.io/xkphn/). Conceptually, these five analysis rounds correspond to five studies, for which one would preregister predictions based on the data from the preceding study. Analyses were done with R (version 3.5.1; R Core Team, 2018).
Results
The present analyses focused on predicting changes in the between-target rating variability, consensus, and rating accuracy of observation-based personality judgments across series of different targets. We used independent subsamples to progress from exploration to confirmation in five consecutive rounds of analyses. The following section first briefly summarizes the approach and results from the more exploratory (but increasingly confirmatory) Rounds 1−4. Then it focuses on the most robust and confirmatory analyses in Round 5. A full detailed report of Rounds 1−4 can be found in Supplements S2 and S3.
Exploration to Confirmation: Analysis Rounds 1-4
We used the first three rounds of analyses (see Supplement S2) for exploration with stepwise preregistrations, to find out if the hypothesized effects existed and how they could be modeled by experimenting with different regression models. Rounds 1−3 used the first five randomly selected perceiver samples. In Round 4 (see Supplement S3), we used the next five perceiver samples for a first investigation of how well previously found models could be confirmed, using two different approaches: First, abstract regression models in which intercepts and slopes were freely estimated (see Table 2) were fitted to the data. Second, we started cross-validating by fitting models with fixed slopes (regression weight B1 in Table 2) obtained in the previous analyses. In comparison to free models, where the slope is fitted optimally to the data at hand, this avoids an over-estimation of expected effects, or model fit, due to over-fitting.
Dependent variable | Regression model | Equation
Variability | Linear | B0 + B1 × Position
 | Logarithmic | B0 + B1 × ln(Position)
Consensus | Linear | B0 + B1 × Average Position
Accuracy | Linear | B0 + B1 × Position
Note. Linear = linear regression models using an untransformed predictor; Logarithmic = linear regression models using the natural logarithm (ln) of the independent variable as a predictor. Consensus and accuracy were Fisher z-transformed.
Results showed a positive effect of presentation order on variability for Conscientiousness and Openness in most perceiver samples. The increase in variability mostly happened across the first few positions such that logarithmic models showed better fit than simple linear models. Note that we use the terms linear and logarithmic to differentiate between two types of linear regression models: one using position as predictor, and the other using the natural logarithm of position as predictor (see Table 2). The logarithmic model includes a negatively accelerated increase of the dependent variable and also sets the influence of position at position 1 to zero. For Neuroticism, on average, there was no or a minimal positive association. The effect of presentation order on variability for Agreeableness and Extraversion was consistently negative. For Agreeableness, this seemed to be mainly driven by an elevated variability at position 1, such that logarithmic models fitted well. The variability decrease for Extraversion was better captured by linear models.
Regarding consensus, presentation order had the expected positive effect overall, but there were some exceptions when investigating individual perceiver samples. The picture was similar for accuracy. For example, in the averaged sample (across five perceiver samples) in Round 4, accuracy increased with presentation order, but in the individual subsamples this was not always the case. For both consensus and accuracy, existing increases were fitted well by linear models (consistent increase across serial positions).
Final Cross-Validation: Analysis Round 5
After the first four analysis rounds, results seemed sufficiently consistent for our final confirmatory analysis and cross-validation with the ten remaining perceiver samples.
Preparation
To obtain the most reliable results possible, we investigated all remaining ten perceiver samples at once. To that end, dependent variables were first separately calculated for each target block within each perceiver sample, and then averaged across perceiver samples. Thus, the results reported below are based on 20 such averages.
Our goal was a cross-validation of the models found previously. In preparation for this, the regression models shown in Table 3 were freely fitted to the average of all ten samples analyzed thus far, in order to derive the most reliable and information-saturated slope estimates possible for cross-validation (see Supplement S4). Consistent with previous analyses, logarithmic curves modeled variability changes better than linear models did for all domains except Extraversion. For consensus and accuracy, there was no indication that other than linear types of models would be superior.
Variability | ||||||
Scale | Parameter | Linear model B0 + B1 × Position | Logarithmic model B0 + B1 × ln(Position) | |
Specified slope | Free | Specified slope | Free | |||
A | R2 | .40 [.01, .66] | .47 [.28, .69] | .42 [-.20, .71] | .59 [.40, .79] | |
B0 | 1.005 [0.977, 1.040] | 0.979 [0.951, 1.013] | 1.042 [1.014, 1.077] | 0.997 [0.964, 1.038] | ||
B1 | -0.017 | -0.012 [-0.018, -0.008] | -0.086 | -0.056 [-0.080, -0.037] | ||
C | R2 | .22 [.10, .43] | .35 [.07, .74] | .40 [.27, .67] | .60 [.28, .90] | |
B0 | 0.648 [0.609, 0.697] | 0.624 [0.573, 0.693] | 0.639 [0.600, 0.688] | 0.605 [0.557, 0.682] | ||
B1 | 0.003 | 0.007 [0.002, 0.012] | 0.016 | 0.038 [0.015, 0.056] | ||
E | R2 | .55 [.16, .82] | .56 [.14, .93] | .32 [-.04, .57] | .34 [.01, .80] | |
B0 | 0.706 [0.682, 0.737] | 0.711 [0.686, 0.741] | 0.698 [0.674, 0.729] | 0.706 [0.676, 0.742] | ||
B1 | -0.007 | -0.008 [-0.012, -0.003] | -0.019 | -0.024 [-0.047, -0.001] | ||
N | R2 | .25 [.08, .69] | .25 [.01, .86] | .20 [-1.09, .68] | .31 [.01, .82] | |
B0 | 0.406 [0.386, 0.417] | 0.405 [0.371, 0.427] | 0.396 [0.376, 0.407] | 0.403 [0.365, 0.429] | ||
B1 | 0.001 | 0.002 [-0.002, 0.005] | 0.011 | 0.007 [-0.007, 0.023] | ||
O | R2 | .31 [.10, .52] | .31 [.15, .53] | .61 [.43, .82] | .61 [.43, .80] | |
B0 | 0.649 [0.621, 0.677] | 0.657 [0.622, 0.687] | 0.623 [0.595, 0.651] | 0.622 [0.584, 0.655] | ||
B1 | 0.012 | 0.011 [0.007, 0.016] | 0.062 | 0.063 [0.046, 0.083] | ||
Consensus | Accuracy | |||||
Linear model B0 + B1 × Average Position | Linear model B0 + B1 × Position | |||||
Specified slope | Free | Specified slope | Free | |||
A | R2 | -.24 [-.44, -.14] | .00 [.00, .00] | -.38 [-2.25, .01] | .02 [.00, .60] | |
B0 | 0.141 [0.103, 0.166] | 0.173 [0.138, 0.213] | 0.037 [-0.030, 0.116] | 0.078 [0.001, 0.188] | ||
B1 | 0.006 | 0.000 [-0.006, 0.006] | 0.006 | -0.002 [-0.011, 0.006] | ||
C | R2 | .16 [-.31, .51] | .25 [.05, .56] | -.24 [-2.61, .14] | .00 [.00, .00] | |
B0 | 0.187 [0.142, 0.222] | 0.212 [0.138, 0.266] | 0.232 [0.143, 0.322] | 0.251 [0.169, 0.336] | ||
B1 | 0.012 | 0.008 [-0.000, 0.018] | 0.004 | 0.000 [-0.007, 0.007] | ||
E | R2 | .43 [.39, .52] | .45 [.36, .63] | .27 [-.35, .79] | .31 [.01, .86] | |
B0 | 0.287 [0.241, 0.333] | 0.269 [0.203, 0.335] | 0.177 [0.065, 0.280] | 0.184 [0.086, 0.279] | ||
B1 | 0.013 | 0.016 [0.010, 0.023] | 0.005 | 0.004 [-0.003, 0.014] | ||
N | R2 | .19 [.08, .34] | .19 [.06, .44] | .32 [-.06, .78] | .32 [.01, .90] | |
B0 | 0.166 [0.137, 0.194] | 0.166 [0.134, 0.203] | 0.164 [0.085,0.228] | 0.164 [0.061, 0.236] | ||
B1 | 0.007 | 0.007 [0.001, 0.013] | 0.005 | 0.005 [-0.003, 0.012] | ||
O | R2 | .32 [.30, .40] | .42 [.30, .63] | .13 [-.01, .44] | .14 [.00, .67] | |
B0 | 0.263 [0.222, 0.302] | 0.231 [0.194, 0.277] | 0.170 [0.120, 0.241] | 0.165 [0.112, 0.221] | ||
B1 | 0.006 | 0.012 [0.007, 0.019] | 0.002 | 0.003 [-0.003, 0.011] |
Note. Brackets show 99% CIs from bootstrapping. Consensus and accuracy were Fisher z-transformed. Specified slope: model with fixed slope obtained from previous analyses. Free: free estimation of all coefficients. B0 = intercept, B1 = slope, A = Agreeableness, C = Conscientiousness, E = Extraversion, N = Neuroticism, O = Openness.
We specified one model per personality domain and hypothesis, using the slope from the average of the first ten perceiver samples (see Supplement S4). Intercepts remained unspecified. For rating variability, we chose to cross-validate logarithmic models for all domains except Extraversion, as they had proved superior. The respective other models (i.e., logarithmic for Extraversion and linear for all other domains) were fitted additionally, but solely for comparison purposes. The models derived this way, based on the full information from all previous perceiver samples, were then fitted to the average of the ten remaining perceiver samples. All analyses were preregistered and carried out accordingly. R2 with 99% bootstrap CIs, for which resamples were drawn from the 20 averaged target blocks, indicated whether model fit was significant. Additionally, we applied our initial approach of fitting freely estimated models. These models provide more liberal tests of our core hypotheses, and comparing their results with those of the cross-validation procedure enabled us to better gauge the relevance of overfitting.
Main Results
We report results for the freely estimated and the cross-validation models alongside each other (see Figure 2 and Table 3). Table 3 displays model fit and CIs for both the freely fitted (free) and the cross-validation (specified slope) models. Additionally, it shows the models’ intercepts (B0) and slopes (B1). Intercepts were freely estimated for both types of models. The slope (i.e., regression weight of the predictor “position”) for specified models was fixed and is the empirical estimate from the average of the first ten perceiver samples.
Variability. As expected, rating variability (i.e., the average squared deviation of ratings from the response scale midpoint) increased with position for Conscientiousness, Neuroticism, and Openness, and this increase was described better by logarithmic than by linear models. Thus, changes in variability were mostly evident at the beginning of the judgment sequence, across the first few positions. For Agreeableness and Extraversion, variability decreased across the judgment sequence. Notably, for Agreeableness, there consistently appeared a pronounced drop in variability at the beginning of the judgment sequence due to high variability at position 1. For the freely estimated models, Agreeableness was thus better described by a logarithmic curve, whereas the decrease for Extraversion was linear. The cross-validation models with specified slopes fitted the new data significantly, except for the logarithmic models of Agreeableness, Extraversion, and Neuroticism. The overall model fit was substantial: Considering only the preregistered cross-validation model for each domain (logarithmic for Agreeableness, Conscientiousness, Neuroticism, and Openness, and linear for Extraversion), R2 ranged from .20 to .61. Overall, three of our five predictions were confirmed by cross-validation (exceptions: Agreeableness and Neuroticism), and all five were supported by free model estimation in terms of the predicted direction and fit. Note, however, that the changes in variability (as shown by the B1 coefficients in Table 3 and visualized in Figure 2) were overall rather small. In the case of Neuroticism, the change was so small that it is hard to discern even visually, and the CI of the effect estimate contained zero.
Consensus. Inter-rater agreement increased with the average position in the judgment sequence for Extraversion, Neuroticism, and Openness. For Conscientiousness, the effect was also positive, but the CI included zero. For Agreeableness, no association was evident at all. The increases were described well by linear models and with considerable fit: R2 for the freely estimated models was between .19 and .45 (not considering Agreeableness). Cross-validation models fitted significantly for Extraversion, Neuroticism, and Openness (R2 = .19−.43), showing that the positive slopes found previously were able to describe consensus increases in the independent cross-validation sample as well.
Accuracy. Considering the freely estimated models, regression weights suggested increasing accuracy with position only for Extraversion, Neuroticism, and Openness, but CIs all included zero. None of the cross-validation models fitted the new data; the models previously established were not suitable to explain accuracy variance in the cross-validation sample. Even for freely estimated models, CIs of model fit barely excluded zero. Thus, in this final analysis, the accuracy of judgments appeared to be mostly unaffected by presentation order4.
Free vs. Specified Slope Model Estimation. Comparing the freely estimated models with the models with fixed slopes in Table 3 amounts to comparing a replication with a cross-validation. If an effect of the same directionality as specified by the fixed slopes in the cross-validation models was present in the new data, freely estimated models typically showed somewhat better fit than cross-validation models because their slopes were optimally fitted to the new data. Fixing the slope to the previously calculated empirical value provided a more realistic estimate of how well the models that we developed work with new data.
Exploration of the Variability Effect
Contrary to our predictions, variability (defined as the average squared deviation of ratings from the midpoint of the response scale) decreased for Agreeableness and Extraversion. To gain more insight into this phenomenon, we exploratorily considered the means and standard deviations of the ratings. Specifically, we used the average of the last ten perceiver samples, calculated means and standard deviations per target block and position, and then aggregated across target blocks. Means and standard deviations, along with variability as deviation from the response scale mean, are displayed in Table 4; a brief computational sketch of both operationalizations follows the table. Interestingly, perceivers assigned relatively high Agreeableness and low Extraversion ratings to the targets they saw in the beginning. This is what drives the higher deviations from the theoretical mean at the beginning and the subsequent decrease. Thus, perceivers were not drawn to more moderate judgments for these two domains at the beginning. However, the standard deviations, which use the empirical mean, did show an increase with position for all domains. That is, while Agreeableness and Extraversion ratings were farther from the response scale mean in the beginning, the between-target variability around the empirical mean was low at the same time. Our initial hypothesis did predict lower between-target variability in the beginning – which would translate into simple standard deviations as well – but we expected this low variability to be caused by raters being drawn to moderate judgments. It seems reasonable to conclude that our theoretical expectations were only partly correct: There was consistent evidence for an increase of between-target differentiation with presentation order. However, we had not predicted systematic differences between the five trait domains in terms of initial level (at position 1). Thus, modifying the core hypothesis and operationalizing it in terms of initial mean level (intercept) and increase of between-target differentiation (standard deviation) over time (slope) seems to be the most straightforward solution.
Position | A | C | E | N | O | |||||||||||||||
M | SD | Var | M | SD | Var | M | SD | Var | M | SD | Var | M | SD | Var | ||||||
1 | 3.80 | 0.60 | 1.04 | 3.46 | 0.58 | 0.58 | 2.57 | 0.68 | 0.68 | 3.06 | 0.60 | 0.40 | 3.24 | 0.68 | 0.57 | |||||
2 | 3.63 | 0.68 | 0.88 | 3.47 | 0.62 | 0.63 | 3.03 | 0.83 | 0.70 | 2.91 | 0.61 | 0.40 | 3.29 | 0.75 | 0.67 | |||||
3 | 3.64 | 0.71 | 0.94 | 3.49 | 0.65 | 0.69 | 3.02 | 0.83 | 0.70 | 2.94 | 0.63 | 0.42 | 3.34 | 0.77 | 0.73 | |||||
4 | 3.61 | 0.71 | 0.90 | 3.47 | 0.65 | 0.67 | 3.05 | 0.83 | 0.69 | 2.94 | 0.62 | 0.42 | 3.32 | 0.79 | 0.75 | |||||
5 | 3.62 | 0.72 | 0.94 | 3.47 | 0.66 | 0.69 | 3.08 | 0.81 | 0.67 | 2.96 | 0.62 | 0.41 | 3.35 | 0.79 | 0.77 | |||||
6 | 3.57 | 0.74 | 0.91 | 3.46 | 0.65 | 0.66 | 3.06 | 0.82 | 0.69 | 2.97 | 0.61 | 0.40 | 3.31 | 0.79 | 0.74 | |||||
7 | 3.59 | 0.70 | 0.88 | 3.45 | 0.66 | 0.67 | 3.06 | 0.82 | 0.69 | 2.97 | 0.63 | 0.41 | 3.31 | 0.79 | 0.74 | |||||
8 | 3.61 | 0.71 | 0.90 | 3.48 | 0.65 | 0.69 | 3.07 | 0.79 | 0.63 | 2.94 | 0.63 | 0.41 | 3.33 | 0.78 | 0.73 | |||||
9 | 3.56 | 0.71 | 0.85 | 3.45 | 0.66 | 0.68 | 3.07 | 0.80 | 0.66 | 2.97 | 0.63 | 0.42 | 3.33 | 0.78 | 0.74 | |||||
10 | 3.58 | 0.71 | 0.88 | 3.46 | 0.66 | 0.68 | 3.09 | 0.77 | 0.61 | 2.93 | 0.63 | 0.42 | 3.31 | 0.77 | 0.72 |
Note. A = Agreeableness, C = Conscientiousness, E = Extraversion, N = Neuroticism, O = Openness to Experiences, M = mean, SD = standard deviation, Var = averaged squared deviation from the midpoint of the response scale (3).
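For concreteness, the two variability operationalizations contrasted above can be sketched as follows (same hypothetical `ratings` data frame as in the Method sketches; in the actual analyses, values were computed per target block and then aggregated):

```r
library(dplyr)

# Per position: mean rating, between-target SD around the empirical mean,
# and mean squared deviation from the scale midpoint (3)
ratings %>%
  group_by(position) %>%
  summarise(M   = mean(score),
            SD  = sd(score),
            Var = mean((score - 3)^2),
            .groups = "drop")
```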
Discussion
The present study tested effects of presentation order in serial judgments as found in explicit evaluation contexts (e.g., competitions) for the field of person perception, specifically, people’s judgments of targets’ personalities based on their behavior. Such effects of presentation order, or serial position, have high practical relevance as people are frequently faced with this kind of judgment task in real life. Presentation order is a clear extraneous influence that, ideally, should not influence judgments. In addition, serial position effects are of theoretical interest as they provide insights into and allow predictions about the underlying processes of interpersonal judgment formation.
We regard the data set and analyses as strong for a variety of reasons: First, the numbers of targets and especially of perceivers were much higher than in previous studies. Second, targets were observed in the lab while engaging with 20 different situations, and their behavior in each situation was judged by a different group of 200 perceivers. Thus, the generalizability of our findings is likely to be higher. Third, our hypotheses were theory-based (Unkelbach & Memmert, 2014) and tested in a strictly preregistered fashion, including cross-validation, for the domain of person perception. We now discuss the core findings in more detail.
The question of whether presentation order affects the between-target variability of judgments was front and center in this study. Possible effects on consensus and accuracy (to be addressed later) were assumed to originate from such an effect. Based on the theoretical reasoning by Unkelbach and Memmert (2014), variability seemed best assessed in terms of the average squared difference between judgments and the midpoint of the response scale. This operationalization was closest to the underlying theoretical idea that perceivers tend to assign “moderate” ratings to targets, due to their lack of calibration and a possible motive to preserve their ability to respond consistently. Using this operationalization, our hypothesis was supported for judgments of three of the Big Five domains (Conscientiousness, Neuroticism, and Openness), but contradicted for judgments of the other two domains (Agreeableness and Extraversion). For these latter two domains, the directionality of the effect was reversed: Variability around the scale mean decreased with presentation order. In both cases, perceivers attributed relatively extreme mean levels (high on Agreeableness but low on Extraversion) to the targets they watched early on; these deviations from the scale midpoint subsequently decreased, leading to a decrease in overall variability. These effects were remarkably robust, although we originally predicted the opposite pattern. Taken together, the results suggest an influence of trait-specific default expectations that perceivers have regarding the behavior of targets they have never encountered before.
Based on these findings, we explored using the standard deviation of ratings as a measure of variability in addition to our theory-based measure of variability. The outcome was very clear: When using the standard deviation, all effects of presentation order did in fact become positive. This suggests that between-target variation does increase the more targets a perceiver has already watched and judged. This effect is consistent with the assumption that actual differences between targets should be reflected more in judgments the more the perceivers had the opportunity to observe those differences, and that, in the absence of such information, perceivers will reference some kind of typical default value for their judgments. This leaves two options: (1) Perceivers simply resort to using a default (i.e., expected) value for early targets, or (2) perceivers compare their observation of early targets with that default value.
If people tend to assign default values to early targets, then contrary to our expectations, this default value is not always the midpoint of the available response scale. In our study, perceivers expected the targets to act in a fashion that may be described as high on Agreeableness, low on Extraversion, and moderate on the three remaining domains. In terms of methodology, this would imply the additional consideration of domain-specific effects on intercepts in studying effects of presentation order. The findings for Agreeableness and Extraversion, however, do not necessarily contradict the hypothesis that perceivers are initially drawn to more moderate judgments. We assumed the numerical construct of the response scale to reflect the psychological construct of the personality domains (e.g., Agreeableness) such that the moderate response scale option equals a moderate standing on the respective domain. However, perceivers’ default expectations (e.g., somewhat higher Agreeableness) could also be the actually moderate judgment whereas the scale mean would reflect a more “extreme” judgment. Thus, in our data, with ongoing position, judgments may have become psychologically more extreme while becoming less extreme with regard to the provided response scale.
The other possibility is that early targets are compared to perceivers’ default expectations. It may well be possible that the lab situations in which the targets were filmed somewhat dampened less agreeable and more extraverted behaviors. Agreeableness and Extraversion as dimensions are especially interpersonal in their definition and behavioral referents (e.g., Goldberg, 1990), and possibly more affected in their average manifestation by norms signaled in a given interpersonal setting (e.g., study setting requires that participants are willing to comply with tasks). This could have resulted in perceivers experiencing their first target as more agreeable and less extraverted compared to their default expectation for the average person. With later judgments, the frame of reference then may have shifted: Perceivers calibrated their scale to the current judgment situation, thus comparing new targets to recent prior observations.
In any case, perceivers’ experiences prior to the judgment situation at hand are important. People generally have extensive experience with person perception and likely know a range of trait-related behaviors that is much wider than the one that may have been observed in the current study situation. This aligns with findings showing the relevance of order effects for expert judges (Unkelbach et al., 2012). However, the mechanism underlying calibration may be somewhat different for expert vs. novice judges: Whereas experts can reference their default expectations early on, novices have no prior knowledge and may thus be more likely to choose moderate responses. Notably, for the domain of person perception, novice judges do not exist – at least not in the sense of having no prior experience with observing people and their trait-relevant behaviors. Note that for personality judgments expertise with the judgment subject (i.e., traits) is not necessarily linked to expertise with the judgment mode (i.e., rating items on scales) as trait inferences in everyday life may just be expressed in natural language or remain implicit. It is, however, familiarity with the subject that would lead to default expectations and drive the calibration phenomenon.
Regarding consensus and accuracy, the results were more straightforward. For three of the five personality domains (Extraversion, Neuroticism, and Openness), cross-validation results suggested that consensus increased as expected. Thus, a calibration effect on consensus does seem to exist for judgments on these domains, and it may be explained in terms of increasing between-target differentiation. The greatest exception was the Agreeableness domain, for which we found no effects in the predicted direction. We speculate that the high evaluativeness of this dimension (e.g., John & Robins, 1993; Leising et al., 2014) may have played a role here, possibly masking a weaker calibration effect. For example, perceiver effects (e.g., in global positivity) have been shown to affect judgments in the Agreeableness domain more than in other domains (e.g., Albright et al., 1988; Heynicke et al., 2022; Rau et al., 2021). The lack of consensus increases for Agreeableness, in combination with the high initial default ratings and the lowest overall rating accuracy, also presents a pattern aligned with an asymmetry found for Agreeableness: It is the dimension most commonly but least accurately inferred in first impressions (Ames & Bianchi, 2008).
Rating accuracy seemed mostly unaffected by presentation order. This may be explained in terms of the relatively modest overall level of accuracy in this study (which was comparable to previous studies, e.g., Borkenau et al., 2004). If effects of presentation order on accuracy do exist, they may be relatively subtle, as well as trait- and criterion-dependent. It may also be the case that in calibrating their judgment scale to the new situation, perceivers grow better at reflecting the target differences within the specific judgment series but not globally with regard to external criteria.
Even if global judgment accuracy remains mostly unaffected, the effects of presentation order that we found matter in terms of the consequences that judgments can have: For example, the present study suggests that targets who are encountered early on have a higher likelihood of being assigned or compared to the perceivers’ expected or default values in the respective setting, given the relative scarcity of information for comparison. Targets encountered later on are more likely to be compared with the other targets the perceivers previously observed in the given situation. Different positions in the judgment series can thus be beneficial or detrimental to targets, depending on how their characteristics compare to those of the other targets in the series and to the perceivers’ default expectations.
It is unclear how much default values are specific to settings. For example, in a try-out situation that actually is difficult, judges may correctly assume that most candidates will show rather weak performances. Candidates who are observed early in the sequence would then be at a disadvantage, because their evaluations would more strongly reflect this assumption. The present study clearly suggests that, to ensure fairness of comparisons, perceivers should first be exposed to a small number (2–3) of “test” candidates who are not part of the actual sample, and only afterwards start judging the actual candidates in earnest. The success of such an approach has, for example, been shown by Fasold et al. (2015) for sports performance judgments.
As for limitations, while the data set allowed us to systematically investigate pure calibration effects in behavior-based person judgments, the following constraints on our conclusions should be considered:
First, the magnitude of the effects that we found was rather modest, and most of the relevant increases in between-target differentiation occurred within the first few positions in the presentation order. There also may be a discrepancy between the findings of previous studies and the present one: Whereas many previous studies (e.g., Bruine de Bruin, 2005, 2006; Flôres & Ginsburgh, 1996; Glejser & Heyndels, 2001; Orazbayev, 2017; Plonsky et al., 2020; Rotthoff, 2015; Wilson, 1977) suggested a “late leniency” effect in perceivers (i.e., ratings becoming more favorable over time), the present study suggested the opposite, if one accepts judgments of Agreeableness as a proxy measure of overall judgment positivity. The present analyses do not enable us to resolve this discrepancy. We speculate, however, that the moderating factor in this regard may be the extent to which the perceivers’ judgments will have actual consequences for the targets: In real-life competitions, judges may go to great lengths to preserve their own ability to assign even “better” scores later on, and thus judge the earliest targets a bit more harshly, in line with Unkelbach and Memmert’s (2014) original hypothesis. In our study, it was clear that there would be no such consequences, so the perceivers may have experienced fewer restraints on the positivity of the impressions that the first targets made on them. In fact, other studies also suggest that the initial default value for judgment positivity is high under such circumstances (Wessels et al., 2021).
Second, our results do not indicate how long a perceiver’s calibration will last. Most judgments that take place in real life may be based on at least some prior information about between-target differences. For example, judges at figure skating competitions are likely to have seen hundreds or thousands of performances. Likewise, as previously discussed, the perceivers in the present study may be assumed to have considerable prior knowledge as to how people differ from one another in their everyday behaviors. Nevertheless, the literature shows that calibration effects clearly do occur under these circumstances. Findings from Unkelbach et al. (2012) suggest that expert judges are more aware of the consequences of consistency violations, and even though they may have seen many performances and many examples of extremes before, they still do not know the range they will be seeing in a new setting (e.g., a new competition). This suggests that the time that passes between the judgments of the individual targets may play an important role: Specifically, targets that are presently observed are most likely to be compared with targets that were observed recently, provided that the circumstances under which the targets are observed are comparable. Thus, it is possible that some kind of “reset” takes place in the perceiver’s mind once a new rapid succession of target observations begins.
Third, the study setup may constrain the generalizability of our conclusions in some regards: For one, perceivers were outside observers watching the targets on video, whereas in many mundane situations perceivers may be involved and interact with the targets. Importantly, a setup like ours was necessary to investigate the manifestation of calibration effects based on their theoretical conceptualization: Multiple perceivers needed access to the exact same information about targets but in varied order, which cannot be achieved in an interactive setup. We do expect the effects that we found to translate to interactive settings but note that additional processes may come into play. For example, perceivers may (consciously or not) influence the targets’ behaviors: The behavior of the first target and the perceiver’s perception of it may affect the perceiver’s behavior toward the next target, who in turn may react differently to the perceiver (e.g., Campbell et al., 1964; Dufner et al., 2016; Rau et al., 2019). If fairness is to be ensured, these additional influences related to presentation order will have to be managed as well (e.g., by having perceivers remain in observer roles or follow strict [e.g., interview] protocols).
As previously referenced, our setup may also have lacked some ecological validity, with judgments being inconsequential for perceivers and targets. Outcome dependence could be manipulated in future studies (e.g., by telling perceivers they will ultimately have to select a target to collaborate with on a future task; e.g., Flink & Park, 1991). This could lead perceivers to pay even closer attention to the targets’ behaviors and thereby enhance consensus overall (Kenny, 2019; Neuberg & Fiske, 1987), but we would still expect calibration effects to emerge.
Ratings were made on rather broad trait adjectives and analyzed at the level of the even broader Big Five personality domains. Our item pool is thus not representative of personality items overall, and broadening the scope in terms of the traits considered, as well as varying their level of abstractness, could reveal a more comprehensive picture of trait-specificity (e.g., in the default expectations).
Relatedly, we used a trait-wise analytical approach to (1) be able to consider trait differences but also (2) because the reasoning on calibration effects is based on studies that aim to compare targets on a given outcome variable. With personality, combinations or profiles of traits may also be of interest, making it relevant to consider whether and how profile-wise analyses would be affected by calibration. For example, the decreased between-target variability and influence of default expectations could result in profiles made early on being more similar across targets. As a byproduct of perceivers switching their frames of reference from using default expectations to prior target observations, the within-target rank-order of traits could also vary across the first positions but should stabilize in later judgments. Akin to our results for trait-wise consensus, we would expect profiles of a target made later on to align more with each other than with earlier profiles. Generally, researchers will have to specify what aspect of profiles they are interested in to determine if and how calibration is relevant.
One further issue tied to the data set concerns the response scale used for ratings, in our case a five-point Likert scale with a technically moderate midpoint. Generally, calibration effects should emerge independent of the specific response scale, as shown by the various response formats in studies with performance-based judgments (e.g., Unkelbach & Memmert, 2014). Thus, we would expect similar overall patterns for personality item scales with higher (e.g., nine) or even (e.g., eight) numbers of response options. Still, some deviations are conceivable: For example, with a higher number of available response options, initial between-target variability could be higher compared to our study, simply because perceivers would have more room for differentiating between targets without prematurely limiting their ability to use more extreme response options later on. Theoretically, more options could also enable more nuanced comparisons of targets (and thereby stronger increases in consensus) if perceivers use the full range of options. Even numbers of response options (i.e., scales without a neutral midpoint) could make it more difficult to interpret variability as, for example, perceivers might indiscriminately veer toward either of the two moderate options in cases of perceived neutrality or uncertainty (e.g., Tzeng, 1983). These are, however, mere speculations and warrant further research.
Last, we may consider the effort required of the perceivers: Perceivers were asked to consecutively watch and judge ten targets, providing ratings on 30 items each time. If fatigue set in over the course of the study, this could have affected the pattern of order effects we were interested in: For instance, as their motivation started to wane, perceivers could have increasingly relied on shared stereotypes instead of the observed behaviors, which could have boosted consensus but dampened accuracy. However, the combination of results from our three DVs and other characteristics of the data shows that fatigue is unlikely to have played a relevant role. This is most easily demonstrated by the following: In the present analyses, we averaged DVs across situations to gain more reliable estimates. However, when investigated separately, the situations varied in their overall diagnosticity for a given trait (e.g., Conscientiousness can be inferred more accurately in some situations than in others; see also Wiedenroth & Leising, 2020). Importantly, when calculating position-wise accuracy for individual situations, the data indicate that (independent of changes across position) the overall accuracy of early and later judgments consistently differs across situations. For this to occur, perceivers must have made use of the specific behavioral information from the respective situations, as the categorical cues possibly relevant to stereotypes (e.g., clothing) were constant across all situations.
One final point to be taken from this study is of a methodological nature: Although the importance of cross-validating research results is emphasized in many textbooks, few studies actually do so, at least in the field of social psychology. The findings presented in Table 3 highlight the importance of distinguishing between replication in general and cross-validation. In the latter case, model fit is determined for a model with pre-specified parameters (the slope, in our case). We found that, in several instances, this stricter approach did result in more conservative estimates of model fit. Therefore, we recommend making more use of cross-validation, not despite but precisely because of the fact that this approach yields more realistic estimates of how accurate our models are.
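To illustrate the difference between freely re-estimating a model and cross-validating it with pre-specified parameters, the following sketch fits a slope for presentation position in one subsample and then evaluates fit in a holdout sample while keeping that slope fixed. The simulated data, variable names, and simple linear model are our own illustrative assumptions, not the study’s actual analysis code or modeling framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: judgment position (1-10) and some DV (e.g., consensus)
# for a calibration and a validation subsample of perceivers.
def simulate(n):
    position = rng.integers(1, 11, size=n).astype(float)
    dv = 0.05 * position + rng.normal(0, 0.5, size=n)
    return position, dv

pos_cal, dv_cal = simulate(500)
pos_val, dv_val = simulate(500)

# Step 1: estimate intercept and slope freely in the calibration sample.
slope, intercept = np.polyfit(pos_cal, dv_cal, deg=1)

# Step 2: cross-validate in the holdout sample with the slope held fixed;
# only the intercept is re-estimated, mirroring a "specified slope" model.
intercept_val = np.mean(dv_val - slope * pos_val)
pred = intercept_val + slope * pos_val

ss_res = np.sum((dv_val - pred) ** 2)
ss_tot = np.sum((dv_val - np.mean(dv_val)) ** 2)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # fit of the pre-specified model in new data
```

Because the slope is not re-optimized for the validation data, the resulting fit estimate is typically more conservative, and more realistic, than one obtained by refitting the model from scratch.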
Conclusion
The present study investigated effects of presentation order in serial judgments of personality based on behavior observation. Such person judgments are prevalent in everyday life, but research so far has focused on specific contexts (e.g., competitions, exams). We showed that serial-position effects in judgments of different targets are relevant for more mundane contexts and general personality judgments as well, but the mechanism behind calibration effects may differ from what theory proposed. Between-target judgment variability and consensus increased as perceivers observed more and more targets. Initial judgments were, however, shaped by perceivers’ default expectations rather than by an avoidance of extreme judgments. The accuracy of judgments seemed mostly unaffected by presentation order, but targets may still experience beneficial or detrimental consequences of their position in a judgment series, depending on how their actual standing on the judged trait compares to the perceivers’ expectations. Considering calibration effects is thus relevant not only for better understanding judgment formation processes in general but also for ensuring fair treatment of targets when judgments are consequential.
Contributions
Contributed to conception and design: AW, DL
Contributed to acquisition of data: NMW
Contributed to analysis and interpretation of data: AW, DL
Drafted and/or revised the article: AW, CU, NMW, DL
Approved the submitted version for publication: AW, CU, NMW, DL
Acknowledgements
The conducted research was preregistered with an analysis plan on the Open Science Framework (https://osf.io/dv2y4/, https://osf.io/xq6nr/, https://osf.io/t2hb3/, https://osf.io/qnv3u/, https://osf.io/xkphn/).
Funding Information
Preparation of this manuscript was supported by the German Research Foundation (grant LE 2151/6-1 to Daniel Leising).
Competing Interests
We have no conflict of interest to disclose.
Data Accessibility Statement
The complete dataset from which this study used a subsample and a codebook with all variables are available on the Open Science Framework (OSF; https://osf.io/q6xdn/). This repository also includes materials used for data collection. All materials specifically relevant to this study (e.g., code, preregistrations, supplements) are available here (https://osf.io/fzgep/).
Footnotes
Inconsistency was defined as follows: the sum of ratings on the negatively keyed items for a given Big Five domain was subtracted from the sum of ratings on the positively keyed items of that same domain (for each video). Perceivers were fully excluded if the mean of the five domain-wise sums was zero for any video.
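As a minimal sketch of this exclusion rule, the following code mirrors the described computation; the function names, data layout, and example values are hypothetical and not taken from the study’s actual code.

```python
import numpy as np

def domain_difference(pos_item_ratings, neg_item_ratings):
    """Difference score for one Big Five domain and one video: sum of
    positively keyed item ratings minus sum of negatively keyed item ratings."""
    return float(np.sum(pos_item_ratings) - np.sum(neg_item_ratings))

def exclude_perceiver(domain_sums_per_video):
    """Flag a perceiver for exclusion if, for any video, the mean of the
    five domain-wise difference scores equals zero."""
    return any(np.isclose(np.mean(sums), 0.0) for sums in domain_sums_per_video)

# Hypothetical perceiver with two videos; each entry holds five domain-wise
# difference scores. The second video averages to zero, so this perceiver
# would be excluded.
print(exclude_perceiver([[2.0, 1.0, 3.0, 1.0, 2.0],
                         [1.0, -1.0, 2.0, -2.0, 0.0]]))  # True
```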
Another possible indicator of variability is the standard deviation, but this uses the sample mean instead of the theoretical response scale mean as the anchor and therefore does not capture a possible change in variability in line with the theory to be tested (i.e., with moderate responses being preferred early on). We revisit this point later in the exploration.
For the four targets without informants, the accuracy criterion consisted only of the targets’ self-ratings.
Following a reviewer question, we repeated the accuracy analyses post hoc in this sample using the targets’ self-ratings vs. informant-ratings as separate accuracy criteria: In freely estimated models, we found positive effects for Extraversion and Neuroticism for both criteria, and for Conscientiousness and Openness only in relation to self-ratings, but the regression weights’ CIs all included zero, as in the main analysis. For models with specified slopes, we found two models that fit (CIs of R² excluded zero): accuracy regarding informant-ratings increased for Extraversion, and accuracy regarding target self-ratings increased for Openness.