We aimed to estimate the size of age discrimination effects in recruitment based on evidence from correspondence studies and scenario experiments conducted between 2010 and 2019. To differentiate our results, we separated outcomes (i.e., callback rates and hiring/invitation-to-interview likelihood) by age group (40-49, 50-59, 60-65, 66+) and assessed age discrimination by comparing older applicants to a control group (29-35-year-olds). We conducted searches in PsycInfo, Web of Science, ERIC, BASE, and Google Scholar, along with backward reference searching. Study bias was assessed with a tool developed for this review, and publication bias by calculating the R-index, a p-curve, and funnel plots. We calculated odds ratios for callback rates, pooled the results using random-effects meta-analysis, and calculated 95% confidence intervals. We included 13 studies from 11 articles in our review and conducted meta-analyses on the eight studies from which we were able to extract data. The majority of studies were correspondence studies (k=10) and came largely from European countries (k=9), with the remainder from the U.S. (k=4). Seven studies had a between-participants design, and the remaining six had a within-participants design. We conducted six random-effects meta-analyses, one for each available combination of age category and study design, and found an average effect of age discrimination against all age groups in both study designs, with varying effect sizes (ranging from OR = 0.38, CI [0.25, 0.59] to OR = 0.89, CI [0.82, 0.97]). There was moderate to high risk of bias on certain factors, e.g., the salience of the age manipulation and the heterogeneity of applications. Overall, there is an effect of age discrimination, and it tends to increase with age. This has important implications for the future of the world’s workforce, given the growth of the older workforce and later retirement.
Introduction
Rationale
Age discrimination in hiring can have severe negative consequences for the individual worker (e.g., unemployment, forced early retirement) as well as for organizations and society at large (e.g., failing to meet demands on the available workforce), and it is also illegal in many countries. Older workers now make up the fastest-growing segment of the workforce in most developed countries, and the numerous challenges they face have prompted some to refer to this group as “the new unemployables” (Wanberg et al., 2016). Hence, recognizing and combating discrimination against older applicants is becoming an increasingly pressing issue. An important step in this endeavor is to grasp the magnitude and the onset of age discrimination in hiring (World Health Organization, 2021), which is our aim with this systematic review.
An Aging Workforce
The workforce is getting older, for several reasons. Globally, the second half of the twentieth century was marked by a sharp rise in life expectancy, from 46.5 years in the early 1950s to 65.4 years by the end of the 1990s (UNDP, 1999). The increasing number of retirees and the decreasing size of the active workforce contributing to social security systems have led to cutbacks in the provision of pensions internationally (Galasso & Profeta, 2004). Aging populations and the associated financial demands on social security systems call for new political and economic directives. In an attempt to prevent the collapse of these systems, many countries aim to incentivize the extension of working life past the retirement age (Neumark et al., 2017). Many people also prefer to keep working until an older age (Wöhrmann et al., 2016).
The Difficulties Facing Older Workers
Research suggests that older workers face difficulties in the labor market, and that these obstacles begin at an earlier age than might be expected. For starters, earnings tend to increase during the early career years until workers are middle-aged, at which point they peak and then begin to decline distinctly (Carlsson & Eriksson, 2019), a decline explained by a decrease in the number of hours worked (Rupert & Zanella, 2015). Indeed, unemployment duration tends to increase with age in most Western countries (OECD, 2018; Rupert & Zanella, 2015). Seniors often transition to part-time retirement or bridge jobs at the end of their careers, or return to work after a period of retirement (Johnson et al., 2014; Maestas, 2010), leaving them at risk of becoming unemployed. It seems that, anti-discrimination legislation notwithstanding, older individuals face lower chances of obtaining and holding employment than younger, equally competent ones, and these hardships are often observed for workers from around the age of 50 onwards (Wanberg et al., 2016).
There could be several reasons for older workers’ lower chances in the labor market. Part of the explanation might be that a larger share of the older workforce has outdated job skills (Fossum et al., 1986), is less familiar with modern job search methods (Gibson et al., 1993), or is less likely to move to a new location (Theodossiou & Zangelidis, 2009). However, there is also reason to believe that recruiters often hold stereotypes about older workers. In a survey of a representative sample of Swedish employers, Carlsson and Eriksson (2019) found widespread beliefs that workers’ flexibility/adaptability, ambition, and ability to learn new tasks start to decline as early as around the age of 40. Studies conducted in other cultures are consistent with this picture (Henkens, 2005; Posthuma & Campion, 2009; Taylor & Walker, 1998), and further suggest that managers are sometimes concerned that older workers are less productive or in poorer physical shape. Age stereotypes seem to be especially prevalent in some industries, such as finance, retailing, and information technology (Posthuma & Campion, 2009).
Defining and Measuring Age Discrimination
Ageism occurs when people are categorized according to their age in ways that lead to injustice, harm, or disadvantage. The present review deals with direct discrimination in the form of disparate treatment. This refers to cases where employers apply different standards to individuals based on their group membership (Doyle, 2007; Gatewood & Field, 2001). In the case of ageism against older people, it translates into treating older applicants unfairly. In contrast, disparate impact (or indirect discrimination) refers to systems that indirectly result in unequal outcomes for older and younger workers, and falls outside the scope of the present research.
Direct discrimination has been estimated or measured in a number of ways. However, in studies based on administrative or survey data, age effects may be confounded with effects of other worker characteristics that employers perceive but the researcher does not. If such characteristics correlate with age, omitted variable bias threatens the study’s validity (Bertrand & Duflo, 2017). Additionally, surveys asking employers whether they treat applicants differently depending on age suffer from social desirability issues and a possible lack of insight into discriminatory behaviors. Only experimental designs, in which age is randomly assigned to job applicants, allow for a high degree of control and inferences of causality. In this systematic review, we therefore focus only on studies with experimental designs, including both field and laboratory experiments. A further requirement, however, is that participants are real recruiters rather than, for example, college students who lack personnel selection experience.
We divide experimental discrimination studies into two types. First, there is the scenario-based experiment, also called a vignette study or factorial survey. Compared to surveys, vignette studies can incorporate a broader range of situational or contextual factors (Hyman & Steiner, 1996) and allow for causal inferences. Second, there are field experiments. These can be further divided into correspondence studies and audit studies. In correspondence studies, researchers construct applications and randomly assign older and younger ages to the fictitious applicants. They send the applications to a large number of real job openings, and the outcome variable is the response from employers in the form of callbacks (invitations to job interviews or further consideration in the selection process). Correspondence studies and vignette studies thus have in common that they have experimental designs, use fictitious applications, and focus on the first stage of selection, in which the participants (in this case, recruiters) screen applicants based on their resumes. The key difference between the two is that in correspondence studies, the participants (i.e., employers) are unaware that their behavior is being monitored and are thus unable to conceal discriminatory behaviors. Furthermore, the behavior is observed in a real-life, high-stakes setting rather than the artificial setting of vignette studies.
Audit studies are another type of field study that is similar to correspondence tests, but instead of callbacks, they measure actual job offers by using actors who pose as real job applicants attending employment interviews. The goal is to use applicants who differ on the category of interest (in this case, age) but are maximally equal to one another in all other aspects (Gaddis, 2018). However, audit studies often fail to make applicants appear identical with respect to all other aspects (Neumark, 2012). Additionally, they often suffer from low statistical power due to the high cost and effort involved in collecting data, as well as potential demand effects when auditors are not naive to the purpose of the study. Because of these limitations, we do not include audit studies in this review.
Previous Reviews
While there are original studies on labor market discrimination based on various discrimination grounds, systematic reviews have mostly aggregated data on ethnic discrimination (Lippens et al., 2023; Zschirnt & Ruedin, 2016). Some reviews have provided overviews of correspondence experiments covering the most common discrimination grounds (e.g., ethnicity, gender, religion, sexual orientation, age; Bertrand & Duflo, 2017), the most comprehensive being Baert (2018) and its recently updated meta-analysis (Lippens et al., 2023), which encompassed nearly all correspondence experiments on hiring discrimination across all discrimination grounds. Lippens et al. conducted a meta-analysis of 19 correspondence studies examining age discrimination in hiring, of which 17 examined discrimination against older applicants, although the studies did not draw a consistent line between the ages assigned to the control group and those assigned to the discriminated group. In contrast, our study is the first preregistered systematic review and meta-analysis that specifically targets hiring discrimination based on applicants’ age, with predefined criteria for what constitutes older age and control age, which allows a clearer interpretation of the effects we find. By incorporating preregistration, transparent methods for the extraction and calculation of effect sizes, and a focused scope, our research offers a unique and valuable contribution to the existing literature.
Objectives
The overall objective of this systematic review is to investigate the size of age discrimination in hiring. Our review question is: How large is the effect of recruiters’ age discrimination against older (compared to younger) applicants in selection, according to correspondence testing and scenario experiments conducted between 2010 and 2019? In our preregistered protocol, we used the term recruitment instead of selection, but we subsequently realized that our preregistered inclusion criteria were in fact narrower and that selection is the term that more accurately aligns with them. Additional modifications to the review question were the further clarification of the population (recruiters) and of the comparator (younger applicants) to align with the preregistered inclusion criteria. As a secondary review question, we preregistered that we would aim to explore the moderating effects of age group, job type, and culture.
Methods
This systematic review was conducted according to the Methodological Expectations of Cochrane Intervention Reviews (MECIR) (Higgins et al., 2022) and reported in accordance with the official PRISMA guidelines for reporting systematic reviews (Page et al., 2021). The criteria were defined in the protocol, which was preregistered on OSF using the PROSPERO protocol template (https://osf.io/cyft2). As far as possible, we restate the criteria verbatim from the protocol; however, in order to meet the reporting format, we had to make some changes to both wording (e.g., future to past tense) and structure (moving paragraphs). Below, we explain all deviations from the preregistered protocol.
Eligibility Criteria
Population
The participants had to be active recruiters who received applications sent out by researchers in correspondence studies (field experiments), or active recruiters who rated fictitious job applicants (scenario experiments), between 2010 and December 31, 2019. Scenario experiments with only undergraduate students or non-recruiting staff as participants were excluded.
Intervention
Age had to be systematically and randomly assigned to the applications (e.g., 30 vs. 50). We coded different levels of the age manipulation based on the applicants’ age group: 40-49, 50-59, 60-65 (retirement age), or above 65. Whenever a study’s age groups spanned several ages, we averaged those ages and linked the average to the corresponding age group (e.g., applicants aged 60-61 average to 60.5 and were coded into the 60-65 group).
Control
Studies had to include the application of a person between 30 and 35 years of age who did not signal any other type of minority group membership protected by anti-discrimination law (e.g., sexual orientation minority). We planned to exclude applicants younger than 30 because of the possibility of discrimination against young people, whereas the cut-off at 35 was intended to leave some distance to the 40-49 category, in which studies have found age discrimination to increase significantly. However, it turned out that several studies in the literature had cut-offs very close to ours, and we found it unreasonable to exclude those studies for that reason. We hence deviated from our preregistration by including studies whose control groups were aged 29 (Neumark et al., 2016, 2019; Challe et al., 2016 (Study 1)), as well as one study whose comparators were applicants aged 35 and 36 (Challe et al., 2016 (Study 3)), where we included both ages. Considering that we changed the inclusion criteria for the comparators at the content coding stage, we went back to the screening phase to check whether any studies with 29-year-old comparators had been excluded during that stage. We found no additional studies with this comparator.
Outcome
Age discrimination was operationalized as the callback rate in correspondence testing. For scenario experiments (or similar), age discrimination was operationalized as recruiters’ assessments (i.e., ratings, decisions, and judgments) of applicants’ employability, job suitability, desirability, hiring priority, recommendation for selection, or likelihood of being invited to an interview. We included studies that manipulated age, explicitly stated the job types to which the applications were sent, and contained more than one age manipulation group. We excluded studies relying on eye-tracking measurements or any other kind of non-verbal participant behavior (i.e., observations, reading time, biopsychological parameters, etc.); interviews; and studies that did not focus on the selection stage (more specifically, the stage in which recruiters choose the potential interview and job-offer candidates among the applicants based on their resumes). We further excluded studies that used ratings or statements general in nature (e.g., “I would prefer not to hire older applicants”) and thus measure attitudes instead of discrimination. Highly artificial tasks (e.g., speeded comparison tests, implicit tests, priming studies) that do not resemble real-life screening procedures were also excluded.
Study Types
We included correspondence studies and scenario experiments with fictitious applications focusing on the first stage of selection, in which the participants decide which applicants get through to the next stage based on the applicants’ resumes (i.e., CVs and/or personal letters). Correspondence studies were included if they involved sending fictitious resumes, with age randomly assigned to each resume, to job posts for different work positions, and investigated age discrimination in terms of callback rates for different age groups. Scenario experiments were included if they involved presenting participants (active recruiters) with fictitious resumes with age randomly assigned to each resume.
We excluded studies reported in books or book chapters, or published in any language other than English, although we had not stated this in our protocol. We considered it unlikely that this type of research would be published as a book chapter without a stand-alone journal article also existing, and we did not have the resources to work through multiple languages.
Information Source
We conducted searches in the following databases: Web of Science, PsycInfo, ERIC, and BASE. As parallel instruments, we used Google Scholar and backward reference searching. We also manually searched the reference lists of relevant reviews. The language of studies was restricted to English. The search included studies conducted between 2010 and December 31, 2019. The 2010 cut-off allows us to focus on relatively recent studies that reflect today’s working culture and ensures that the majority of studies were conducted after anti-discrimination laws had been established. As the implementation of age discrimination laws varies across countries (Lahey, 2010), we chose 2010 as the cut-off year to ascertain that at least most European countries and the US had established such laws. The 2019 cut-off excludes any study conducted during the pandemic, which might have considerably changed the labor market and would be a better focus for another paper.
Search Strategy
Concept | Search strings |
HIRING | Hiring OR recruitment OR Employment OR Personnel selection OR interview callback OR call back OR applicant OR select |
AGE DISCRIMINATION | Age discrimination OR ageism OR Age bias |
STUDY DESIGN | Resume OR Correspondence study OR Field experiment OR Field study OR Correspondence method OR Scenario experiment OR Factorial survey OR Vignette measure OR Vignette experiment OR Factorial measure OR Laboratory experiment |
JOB TYPE | Work sector OR Job vacancy OR sector affiliation OR sector OR work sector OR work field OR job sector OR job field OR job type |
Searches from all databases were downloaded and imported into the Zotero reference management software, where they were saved and filtered. This preserved the search history exactly as it was at the time of each search. For further details on our search strategy, such as our search terms and search settings, see OSF (https://osf.io/cvn48/). The keywords reported in the protocol were used and combined to create the searches.
In addition to the planned search strategies, we also conducted a manual search of the reference lists of studies included at full text, and extracted eligible articles from the reference lists of two reviews: Baert (2018) and Bertrand and Duflo (2017). The searches were last updated on April 6, 2021 in ERIC, April 5, 2021 in PsycInfo, April 11, 2021 in BASE, April 8, 2021 in Web of Science Core Collection, and April 9, 2021 in Google Scholar.
We deviated from the planned search strings when conducting the database searches, particularly in ERIC and PsycInfo. We included thesaurus terms for the key concepts (i.e., “Age discrimination”, “Personnel selection”, and “Recruitment” for ERIC, and “Recruitment”, “Employment Discrimination”, “Age Discrimination”, “Ageism”, “Aged (Attitudes Toward)”, and “Aging (Attitudes Toward)” for PsycInfo), along with the keywords “labour market” and “age bias”, which were found to improve search precision.
Following the MECIR manual, the search was updated on April 24, 2023, as more than six months had passed since the original searches were performed. We refined our search strategies to generate more sensitive search strings by removing the search strings for the job type category, and we incorporated an additional database, Business Source Ultimate, as recommended by a peer reviewer. The updated searches and search strategies are available in our OSF project.
Selection Process
As a first step of the selection process, searches were uploaded into the reference management software Zotero (Zotero v. 5.0.96.2, Roy Rosenzweig Center for History and New Media, 2021). Subsequently, two authors (LB & MH) merged duplicates based on the availability of data in each version of the duplicate studies. After excluding books and book chapters, the remaining articles were moved into the article screening software Rayyan (Ouzzani et al., 2016), which was used for title/abstract screening. The two authors then independently screened all titles and abstracts using the software’s blinding mode, which allowed screening without seeing the other reviewer’s decisions. After all articles had been screened by both authors, blind mode was turned off and conflicts were collectively resolved. At the title/abstract stage, articles were screened against the PICOS criteria, publication date, the dates the studies were conducted, and the language of publication to decide whether they should move on to the full-text screening phase or be excluded from further analyses. We only excluded records where we could confirm that the PICOS criteria were not met; if we were unsure, the records were retained.
After finalizing the set of studies to be included in the full-text reading stage, reports were retrieved, and two authors conducted independent full-text screenings of the retrieved articles. They also extracted the PICOS data from the included articles and transferred them into a Google Sheets spreadsheet. This procedure was not formally blinded, but the authors extracted the PICOS data independently, without checking each other’s work, and then collectively resolved any discrepancies.
Using the PICOS framework, studies were either excluded or included in the content coding/data extraction stage, which the two aforementioned researchers conducted collectively and simultaneously; the extracted data were recorded in the Content Coding Google Sheets spreadsheet. Apart from the duplicate resolution in Zotero, every part of the screening, including the coding of the data, was conducted manually by the researchers.
Lastly, after going through the references of studies included in the systematic review of correspondence studies by Lippens et al. (2023), we included one more eligible study (Capéau et al., 2012) that had previously been excluded in the abstract screening phase.
Data Collection Process
For all studies that passed full-text screening, we coded the following: authors, country (where the study was conducted), design, discrimination age (the age of applicants in the intervention group), control age (applicants’ age in the control group), job type, discrimination illegal (whether discrimination was illegal in the country where and when the study was conducted), study type, peer review status, and comments (any comments about the study). Two authors collectively, simultaneously, and manually extracted data from each report and recorded them in the Content Coding spreadsheet. The coded data included the variables mentioned in the planned protocol section, along with the number of participants (N), participants in each group (n), outcomes (measurement type), and extracted focal tests.
Data Items
We documented the outcome extractions using RMarkdown (v.2.11). All of the extracted data are reported in “Materials” on OSF (https://osf.io/cvn48/).
Outcomes Looked for in Data Extraction
Correspondence testing studies have a dependent variable that can take the value 0 (not invited) or 1 (invited). We planned to extract a two-by-two frequency table (an example can be found in the preregistration) for each independent comparison. As expected, results were not always reported in this format, but instead as linear probability models (regression coefficients) or as proportions. When possible, we reconstructed the table from the reported analyses, tables, or figures (e.g., by multiplying a proportion by the number of applications). We closely followed our preregistered protocol in what we extracted, with the exception that we had only defined the layout of the between-participants table and not the within-participants table (Yes and No frequencies for the old group and the comparison group), which, of course, also requires extracting the “both invited” and “neither invited” frequencies.
For the scenario-based studies, we extracted any callback data in the same manner. A more common format for scenario studies is ratings or judgments reported as means and standard deviations. For those, we extracted the means and standard deviations for the control versus the older applicant (e.g., from a table). For rankings, rejections vs. invitations were converted into 0 vs. 1 or a mean value, depending on the base design. For both study types, when an exact conversion was not possible, we contacted the authors and requested the data. Any data we failed to obtain were treated as missing.
Studies can have multiple discrimination outcomes, and we planned to always extract all that matched our criteria and then average them for our main outcome. However, we deviated from our protocol in how we handled multiple outcomes. During the full-text analysis, we noticed that some studies distinguished between types of callbacks (e.g., a callback as an invitation to an interview versus a callback as merely a request for more information from the applicant). In these instances, we decided to consider only callbacks reported as invitations to interviews or job offers for the quantitative analysis. Where scenario experiments reported multiple eligible outcomes, we prioritized, when extracting data for synthesis, outcomes referring to the likelihood of being hired/selection decisions/job offers, followed by the likelihood of being invited to an interview, and then levels of employability and job suitability, respectively.
Other Variables Sought after in Data Extraction
We coded the country in which each study was conducted, the status of its age discrimination laws, and whether the article was peer-reviewed. For the pre-planned subgroup analysis, we began extracting, where available, the occupations to which applications were sent in correspondence studies, or the types of jobs fictional applicants were hypothetically applying for in scenario experiments; however, we stopped once we realized we would not have enough studies per subgroup to conduct these analyses.
Study Risk of Bias Assessment
We conducted our risk of bias assessment based on the preregistered set of criteria. Because the original assessment tool from Cochrane was ill-suited for these types of studies, we developed an ad-hoc tool for this review.
For the correspondence studies, we assessed the risk of bias of individual studies using the following categories: quality of the age manipulation, quality of the randomization procedure, quality of the callback procedure, and quality of the applications. For the scenario experiments, two additional categories were assessed: quality of the scenario and quality of the design.
The risk of bias assessment was limited to studies with available outcomes. We systematically differentiated risk of bias indicators for the factors within a study and the potential bias of the entire study. Table 3 shows the bias assessment of individual studies, and a summary of it is available in the Appendix on OSF. The systematic approach to calculate our factor indices can be found in Materials on OSF (“Bias Assessment - Age.csv”; https://osf.io/cvn48/).
Within the first category (quality of age manipulation) we assessed whether age was salient on the résumés (i.e., it was stated as a number or date of birth; item 1.1.) and whether the age difference between control and experimental group was large enough according to our preregistered criteria (e.g., 30 – 35 for the comparator; 40 – 49 for the experimental group, etc.; item 1.2). If both responses were negative, we assessed age manipulation as high risk of bias.
For the quality of the studies’ randomization procedure, we assessed whether age was randomly assigned (item 2.1) and whether the résumés were sent out in randomized order (item 2.2). To be assessed as low risk of bias, groups either had to be equal in size or the study had to report that applicants’ age had been randomized, and applications had to be randomized, counterbalanced, or sent out simultaneously. If item 2.1 received a negative response, the whole category was considered a high-risk-of-bias factor.
The quality of the callback procedure was evaluated by assessing the adequacy of the applications in relation to the job posts applied to (item 3.1) and the callback reception (item 3.2). Applications were assessed as low risk of bias if the skills on the résumés matched the job requirements and if the callbacks were collected through both phone and email.
We evaluated the quality of the applications by assessing the completeness of the sent applications (item 4.1; i.e., whether résumés included, besides age, a name, education, and/or work experience) and the application formatting (item 4.2; i.e., whether researchers used different résumé formats to avoid recruiters receiving multiple CVs with identical patterns). We assumed that if a recruiter received more than one of the total number of applications sent in a study, they could be influenced by recognizing similar or identical formatting patterns. The latter item only affected bias in within-subjects designs, and the quality of the applications was considered high risk of bias if this second item was rated as high risk.
The fifth category evaluated the quality of the scenario experiment (item 5.1): we assessed whether the presented scenario was realistic enough for the participants. The sixth factor assessed the quality of the design. In the preregistered protocol, this factor consisted of two sub-questions, one regarding the possible influence of experimenters on the participants, and the other regarding whether the studies were blinded. This was changed post hoc to a single-item question (6.1) of whether the studies were blinded so that the recruiters (participants) were unaware of the purpose of the design, as the two original sub-questions coincided with the fifth factor.
Given the assessed relevance of the different factors, we considered a study to be automatically at high risk whenever it scored high on the second factor (quality of the randomization). Otherwise, a study was only assessed as high risk if at least two of the other factors were rated as high risk of bias.
Effect Measures
For correspondence testing studies, we calculated odds ratios and their variances based on the extracted frequencies, separately for the between-participants and within-participants designs. Effects were calculated for the studies included in the quantitative analysis, all of which were correspondence studies with within- or between-subjects designs. Calculations were done in R (4.1.2) using the metafor package (Viechtbauer, 2010) and reported in RMarkdown (v.2.11).
Synthesis Methods
Whenever studies had used the same type of intervention and comparison, with the same outcome measure type, we synthesized the results using a random-effects meta-analysis in the R package metafor (Viechtbauer, 2010). Because we only found data for one study using a scenario-based experiment, we could not conduct a meta-analysis for this type of study.
Regardless of the assumed heterogeneity of the studies, we planned to perform subgroup analyses based on age group, job type, and cultural cluster, provided that we had at least five studies per group. We planned to fit a mixed-effects meta-regression model in metafor (Viechtbauer, 2010) to examine the differences between these subgroups; the model would include all three factors (if their groups were large enough), as sketched below. We planned to use a cut-off of p < .05 for the omnibus test of moderator effects, as well as for the individual factors. However, because we were not able to obtain at least five studies per group for any of the meta-analyses, we did not conduct the planned subgroup analyses. Missing data were reported as missing, and we contacted the authors of the original studies in order to obtain them.
Reporting Bias Assessment
We planned to assess heterogeneity by estimating tau and I2 and to assess evidence of publication or statistical reporting bias using funnel plots. If we had > 20 studies in a pooled set, we also planned to use PET-PEESE (Stanley & Doucouliagos, 2014) and 3-PSM (Iyengar & Greenhouse, 1988) to estimate the effect after adjusting for publication bias. We further planned to examine publication bias and questionable research practices (e.g., p-hacking) through an analysis of the focal tests, using the p-checker (http://shinyapps.org/apps/p-checker/) to calculate an R-index (Schimmack, 2014) and a p-curve. For pooled sets of > 20 studies, we would further calculate a z-curve (Brunner & Schimmack, 2020). We planned to conduct a sensitivity analysis if there was a reasonable assumption of studies having a moderate to high risk of bias and if studies were disproportionately weighted. To this end, we implemented a leave-one-out analysis, which entails conducting the meta-analysis k times, each iteration excluding a different study. This enables us to discern which studies (if any) strongly influence the average effect size estimate and whether such influence skews the outcomes of the meta-analysis.
We preregistered that we would extract focal tests, i.e., the reported significance tests typically used for the main inference in a study, in order to conduct a p-curve analysis, R-index, and z-curve (given enough studies). However, we did not preregister which selection rule we would use (i.e., how to decide which test is the focal one). Because most studies did not report focal tests for our age groups, but only an overall effect of age, we extracted either the overall omnibus test (of age discrimination) or, when multiple tests were reported, a test for a candidate of intermediate older age (e.g., a 50-year-old). Hence, these focal tests do not correspond directly to any of our meta-analyses, but simply serve to assess the risk of publication bias in the literature.
We ended up relying only on funnel plots, tau and I2, the R-index, and the p-curve, as we did not have enough studies to use PET-PEESE, 3-PSM, or the z-curve.
Results
Study Selection
Our original searches yielded a total of n = 2,599 reports, of which n = 1,728 came from our primary databases (PsycInfo, Web of Science, ERIC, and BASE) and n = 871 from secondary sources (Google Scholar and manually extracted references from reviews). After merging duplicates and eliminating all books and certain book chapters based on title/abstract in Zotero (Zotero v. 5.0.96.2, Roy Rosenzweig Center for History and New Media, 2021), we transferred n = 1,733 studies into the screening process in Rayyan (Ouzzani et al., 2016), n = 1,253 of them primary and n = 480 secondary search results. Rayyan identified a further n = 104 duplicates, leaving n = 1,629 unique articles; however, we decided to screen all of the imported articles without immediately excluding the duplicates suggested by Rayyan, to prevent potential errors made by the software. After screening the articles independently and discussing our decisions to accept or reject each paper for further consideration, we excluded n = 1,580 studies in the screening process and ended up with n = 51 reports to be considered in the full-text readings.
After conducting the screening, we found one more relevant study in the references of another review of correspondence studies of hiring discrimination and included it in the full-text reading phase as well. However, we only managed to retrieve 49 of the reports for the full-text reading phase: two reports we were unable to find online, and we received no response from their authors. These 49 reports comprised 53 studies, which were checked against the PICOS criteria.
The updated search yielded 3,995 reports from five databases (PsycInfo, Web of Science, ERIC, BASE, and Business Source Ultimate). After deduplication, 2,406 abstracts remained for screening. We considered 16 reports for full-text assessment; however, no new studies were included in the review. In the data extraction stage, we found that two articles (Neumark et al., 2016, and Neumark et al., 2019) were based on one dataset, so we only included Neumark et al. (2016) in the analysis stage. Based on the criteria, we included 13 studies, reported in 11 articles, in the analysis stage. However, the final meta-analyses were conducted using the eight studies from which we were able to obtain the data needed to calculate effect sizes. Studies excluded at the full-text reading stage are listed in Tables 3 and 4 in the Appendix, along with the reasons for exclusion (https://osf.io/cvn48/).
Study Characteristics
Studies from 11 articles were eligible for inclusion in the analyses based on the eligibility criteria. However, it proved unfeasible to extract data from some studies for our data synthesis; the correspondence studies included in the synthesis were thus the following seven: Ahmed et al. (2012), Carlsson and Eriksson (2019), Farber (2019), Jansons and Zukovs (2012), Capéau et al. (2012), and Neumark et al. (2016, 2019). In addition, we were able to extract data for one scenario-based study (Oesch, 2020). Below, we provide a narrative summary of the 13 initially included studies. Ahmed et al. (2012) conducted a within-subjects correspondence study including 466 job offers in Sweden. The applications were sent to job posts seeking either restaurant workers or sales assistants, with callback rates as the dichotomous outcome variable. The intervention group comprised fictitious applicants aged 46, whereas the comparator was a fictitious applicant aged 31.
Capéau et al. (2012) conducted a mixed design correspondence study to examine hiring discrimination in Belgium based on age, sex, national origin or a certain physical state (e.g., pregnancy), by measuring call-back rates for fictitious applicants. They sent out 1708 fake resumes, and included ages 35 as comparator, and 23, 47 or 53 as the discriminated counterpart.
Carlsson and Eriksson (2019) conducted a between-subjects correspondence study to test age discrimination in hiring in Sweden by assessing callback rates for fictitious applicants. The authors included age as a continuous variable in the interval of 35 to 70 years, and gender, which they signaled via applicants’ names. They assigned names and ages to fictitious applications randomly and sent out triplets of resumes to over 2,000 employers, generating a sample of 6,066 applications, which were sent to job openings in seven occupations: administrative assistants, truck drivers, chefs, food serving and waitresses, retail salespersons and cashiers, sales representatives, and cleaners.
Challe et al. (2016) conducted four studies, of which the first three, within-subjects correspondence studies conducted in France, were included in this review. The first study included fictitious applicants aged 29 as comparators, with the intervention group consisting of fictitious 56-year-old applicants, who were further divided into two groups based on their expected retirement age. Triplets of resumes (a 29-year-old, a 56-year-old closer to retirement, and a 56-year-old further from retirement) were sent to job posts for two occupations: call center agent (n = 300) and sales assistant (n = 301). The second study investigated age discrimination in hiring in relation to technological skill obsolescence and entailed sending three fictitious applications, with applicants aged 32, 42, and 52, to the following occupations: IT project managers and IT developers (n = 302), and management accountants and accountants (n = 308). In the third study, the authors examined age discrimination in the context of gender stereotypes around certain occupations. They included fictitious applicants in their 50s (51 and 51) and applicants in their 30s (35 and 36), who were either male or female and applied to personal service occupations (specifically: home help, cleaning persons, and caretakers). The outcome variable for age discrimination was callback rates.
Farber (2019) conducted a between-subjects correspondence study in the United States to study age discrimination in hiring. The intervention consisted of manipulating the age of fictitious applicants (chosen from the age groups 22-23, 27-28, 33-34, 42-43, 51-52, or 60-61) and the length of the unemployment spell (4, 12, 24, or 52 weeks). They applied to either low-skill jobs (e.g., receptionist, office assistant) or high-skill jobs (e.g., executive assistant, office manager) (n = 2,122), with callback rates as the outcome variable.
Jansons and Zukovs (2012) conducted a within-subjects correspondence study in Latvia where they created fictitious resumes, with the intervention group being 55-year-olds and the comparator 35-year-olds. They applied to salesperson jobs (n = 529) and measured age discrimination by difference in call-back rates between the younger and the older candidate.
Montizaan and Fouarge (2016) conducted a within-subjects scenario (vignette) experiment in the Netherlands to examine age discrimination in hiring, relating to applicants’ and employers’ characteristics. The fictitious applicants were either 35, 45, 55, or 60 years old. Employers (n = 1100) were presented with two fictitious applications and were asked to choose which of the two presented applicants they would hire. The outcome variable for age discrimination was likelihood to be hired.
Neumark et al. (2016) conducted a between-subjects correspondence study in the United States. The fictitious applications differed in age (29-31, 64-66) and skill level (high or low) for each occupation they were sent to (sales, security and janitor). The authors sent out 7161 applications and their outcome variable was the call-backs.
Neumark et al. (2019) conducted a between-subject correspondence study in the United States. Fictitious applications differed in age (29-31, 64-66) and gender. The authors sent out 14,428 applications to 3,607 jobs and their outcome variable was the call-backs.
Oesch (2020) conducted a between-subject scenario experiment in Switzerland. Participants (recruiters, n = 501) had to indicate the likelihood of inviting fictitious candidates to an interview (on a scale from 0 to 10) based on the given resume, and propose an adequate wage for each candidate. The ages of fictitious applicants were 35, 40, 45, 50, or 55 years old. The occupations fictitious candidates were applying to were expert accountant, human resources assistant and building caretaker.
Richardson et al. (2013) conducted a between-subjects scenario experiment in the United States. They recruited 154 participants (students and organization-based participants; n = 102 and 54, respectively). Participants had to assess the work-related competencies of fictitious applicants and indicate the likelihood of their being hired on a 9-point Likert scale. The ages of the applicants were taken from a range of 33 to 66 years.
Risk of Bias in Individual Studies
In general, the lowest risk of bias was found in the quality of randomizing applicants’ age across the applications. There is some concern regarding the age manipulation, i.e., how age was presented or made salient in the applications. Certain studies opted to make age implicit by stating school graduation dates (e.g., Neumark et al.), which might keep recruiters from recognizing how old the applicant is intended to be. Application quality, in the sense of making the CVs realistic and detailed, presented a high potential for bias; however, this might be due to underreporting of the details necessary to draw conclusions about quality. Altogether, the majority of studies present a moderate to high risk of bias, and their results should be interpreted with caution.
Results of Individual Studies
We present the results of the syntheses below. Meta-analyses that included more than one study are accompanied by forest plots in the text. Other figures, as well as the code used to extract data and focal tests and to conduct the presented analyses, are available in the RMarkdown files in the appended materials. All meta-analyses were conducted using the random-effects model with the Restricted Maximum Likelihood (REML) tau estimator, as this is the default in the metafor package. As a robustness check, we also conducted the meta-analyses with the commonly used DerSimonian and Laird (1986) tau estimator, available in the analysis files on OSF (https://osf.io/zqxga/). In general, we found no discrepancies, apart from the meta-analysis of studies on ages above sixty-five, where the REML method estimated a higher tau2 value. All summary statistics were calculated using the default methods in the metafor package.
Results of Syntheses
As mentioned previously, we intended to conduct four different types of meta-analyses, one for each combination of study type (correspondence or vignette) and design (between- or within-subjects). However, due to the availability of our empirical material, we had to omit the meta-analyses of scenario-based experiments. The results of the remaining meta-analyses are described below. We present odds ratios with confidence intervals (CI) and prediction intervals (PI), along with z and p values.
Age Category - Forty
Our first meta-analysis encompassed all within-subjects correspondence studies measuring the hiring disparities between our comparator group and 40- to 49-year-old applicants. A random-effects meta-analysis (k = 2) revealed an effect of applicant age on the participating recruiters’ hiring decisions, to the disadvantage of older applicants, with an odds ratio of 0.38 (95% CI [0.25, 0.59], 95% PI [0.21, 0.70], z = -4.36, p < .001). On average, older applicants’ odds of receiving a callback were thus 0.38 times those of the comparators; while the upper bound of the confidence interval still indicates lower odds of receiving a callback, the width of the interval implies uncertainty about the point estimate. Total variance was low (tau2 = 0.05; the SD of the true effects across studies was tau = 0.22), largely owing to the low number of included studies (k = 2), and heterogeneity was I2 = 47.11%.
Our second meta-analysis included all between-subjects correspondence studies measuring the hiring disparities between our comparator group and 40- to 49-year-old applicants. A random-effects meta-analysis of the three included studies revealed an odds ratio of 0.89 (95% CI [0.82, 0.97], 95% PI [0.82, 0.97], z = -2.6, p = .009), meaning older applicants had 0.89 times the odds of receiving a callback; together with a narrow confidence interval that borders on no difference, this indicates a rather small discrimination effect for this age group. Tau2 and I2 were zero; however, heterogeneity is difficult to assess with such a small number of included studies (k = 3), so this does not imply homogeneity.
Age Category - Fifty
Our third meta-analysis included all within-subjects correspondence studies measuring the hiring disparities between our comparator group and 50- to 59-year-old applicants. A random-effects meta-analysis revealed an effect of applicant age on hiring decisions: the overall effect was OR = 0.41 (95% CI [0.29, 0.58], 95% PI [0.29, 0.58], z = -5.11, p < .001), meaning older applicants had less than half the odds of receiving a callback; similarly to the within-subjects analysis for the 40-49 age group, the confidence interval lowers our certainty in the estimate. Once again, the tau2 and I2 values (0 and 0%, respectively) for such a small sample of studies (k = 3) do not provide a meaningful assessment of heterogeneity.
Our fourth meta-analysis included all between-subjects correspondence studies measuring the hiring disparities between our comparator group and 50- to 59-year-old applicants. Via a random-effects meta-analysis, we found that, on average, older applicants had 0.75 times the odds of receiving a callback, which appears to be a fairly precise estimate given the narrow confidence interval (OR = 0.75, 95% CI [0.69, 0.80], 95% PI [0.69, 0.80], z = -7.88, p < .001). As in the former analysis, no heterogeneity was found (tau2 = 0; I2 = 0%).
Age Category - Sixty
Our fifth meta-analysis included all between-subjects correspondence studies measuring the hiring disparities between our comparator group and 60- to 65-year-old applicants. Via a random-effects meta-analysis, we found that, on average, older applicants had 0.62 times the odds of receiving a callback. The confidence interval is relatively narrow, implying greater confidence in the precision of the point estimate (OR = 0.62, 95% CI [0.53, 0.72], 95% PI [0.46, 0.82], z = -5.99, p < .001). Total variance was low (tau2 = 0.01; the SD of the true effects across studies was tau = 0.12), with 75% of it attributable to heterogeneity of true effects (I2 = 75.26%).
Age Category - Above Sixty-Five
The final meta-analysis included all between-subjects correspondence studies measuring the hiring disparities between our comparator group and applicants over 65 years of age. Via a random-effects meta-analysis, we found that, on average, older applicants had half the odds of receiving a callback (OR = 0.50, 95% CI [0.29, 0.85], 95% PI [0.18, 1.34], z = -2.55, p = .012). The confidence interval is very wide, so it is uncertain where the true effect size for this age group lies; even so, the upper bound still suggests the existence of discrimination. Total variance was higher here than in the other analyses (tau2 = 0.18; the SD of the true effects across studies was tau = 0.43), and the difference in width between the confidence and prediction intervals indicates variability in the study effects. Heterogeneity of true effects accounted for most of that variance (I2 = 96.74%), which might explain the width of the confidence interval.
Finally, for the one scenario experiment we obtained data from (Oesch, 2020), we calculated Hedges’ g effect sizes of g = 0.02, 95% CI [-0.05, 0.09] for 40-year-olds, and g = 0.15, 95% CI [0.08, 0.22] for 50-year-olds.
Sensitivity Analysis
We conducted a leave-one-out sensitivity analysis and observed that excluding the Carlsson and Eriksson (2019) study from the meta-analysis led to a substantial decrease in the average effect size of age discrimination for the above-65 age group (see Appendix Table 2). This finding suggests that the Carlsson and Eriksson (2019) study increases the overall average effect size in the original meta-analysis for this age group. Conversely, excluding either Neumark study (2016, 2019) resulted in an increase in the average effect size, indicating that their inclusion in the meta-analysis leads to a weaker effect of age discrimination for the above-65 age group. However, the impact on the effect size is more pronounced when excluding the Carlsson and Eriksson study, which can be explained by the fact that their study design differed from the other studies in terms of the age range included. Despite the decreased effect size when the study was removed, the result still implies the existence of age discrimination.
Sensitivity analyses of the other age groups did not show any substantial deviations from the overall effect size estimates, implying robust effect size estimates.
Reporting Biases
Here we present the p-curve and p-checker results. We found an R-index of 0.91 for the correspondence studies, which implies high power. For the scenario experiments, we found an R-index of 1, with a 100% success rate. We included studies from which we could extract focal tests and whose tests were reported as significant in the papers. Funnel plots were generated to visually assess potential asymmetry in the published effects and are available in the RMarkdown files (https://osf.io/zqxga/); however, as we had a small number of studies per analysis, it is hard to assess asymmetry or potential publication bias with so few data points. The p-curve shows a strong right skew, with more studies having lower p-values, which is consistent with highly powered studies and little publication bias. Taken together, there is little indication of publication bias in this literature.
Certainty of Evidence
We base our certainty of evidence on the GRADE guidelines (Higgins et al., 2021). The first domain concerns study risk of bias, and our assessment shows that the majority of studies carry a moderate risk of bias. Regarding imprecision, we generally find narrow confidence and prediction intervals, and see little difference in inference depending on whether the true effect lies at the upper or lower bound (Figure 2; Figure 3), meaning our average true effect and the deviation of the found effects show good precision of evidence. Furthermore, we conducted leave-one-out sensitivity analyses (available in the supplementary materials) of the effect size aggregates for between-subjects studies examining age discrimination. These analyses show generally robust findings, with the only exception being the removal of Carlsson and Eriksson (2019), which somewhat increases the effect size and tau. Presumably, the Carlsson and Eriksson (2019) study introduces higher deviation in effects because of its continuous age variable. Furthermore, the high heterogeneity of true effects in certain meta-analyses, and the inability to estimate variance correctly due to the low number of studies, imply lower certainty of evidence due to inconsistency. The low diversity of countries in which the studies were conducted (i.e., the majority in the USA and Sweden) also suggests a problem of generalizability. The funnel plots available online (Results.Rmd at https://osf.io/zqxga/) and the p-curve (Figure 4) show a low risk of publication bias, and as we also tried to retrieve gray literature, we are confident that reporting bias does not threaten the certainty of evidence. Overall, we judge our certainty of evidence to be moderate, based on possible problems with imprecision and the study risk of bias assessment.
Discussion
General Report of Results
The results of the present meta-analysis suggest that there is a sizable effect of age discrimination against older applicants in the selection process. The results further suggest that this effect is most likely present already when the applicant’s age is between 40 and 49, and that it rises gradually with increasing age. Discrimination against older applicants occurs regardless of study design, but the discrimination effects in studies with within-participants designs are noticeably larger, which could be due to the study design or some other unidentified difference between the samples (e.g., nationality, timing of the studies). With regard to the effect sizes, we argue that the difference in odds between younger and older applicants, even though it varies by age category, is practically large in terms of real-life discrimination. Specifically, we found that older applicants have 11 to 50 percent lower odds of being considered than younger applicants. The lack of available data from the scenario experiments prevents any generalized conclusions, and the discussion will thus focus only on correspondence testing.
The findings from this review indicate considerable levels of ageism in the hiring process. In the context of ongoing political efforts to extend working lives past the current retirement age, this discrimination is not only likely to have negative economic and health-related consequences for the individual worker, but may also prove largely unfavorable for society as a whole, as many qualified candidates are hindered from (re-)entering the labor market. Comparable meta-analyses on hiring discrimination against other minority groups typically yield results similar to those of our review. Zschirnt and Ruedin (2016) considered the log odds ratios of 34 correspondence studies on ethnic discrimination in OECD countries between 1990 and 2015, finding that, on average, ethnic minority applicants receive lower callback rates than majority candidates, with odds ratios ranging from OR = 0.27 [0.17, 0.43] to OR = 0.94 [0.73, 1.12]. The only outlier in their analysis is Bendick et al. (1991), with an odds ratio of 2.45 [1.86, 3.22]; however, this is explained by the fact that the Latino applicants in that study had been given stronger qualifications in their applications. Similar results are provided by Flage (2019), who examined the magnitude of hiring discrimination between heterosexual and homosexual applicants in OECD countries. The author finds an overall odds ratio of 0.64, suggesting that the odds of homosexual applicants receiving callbacks are 36 per cent lower than those of heterosexual applicants. Our meta-analyses mirror these findings, with odds ratios ranging from 0.89 [0.82, 0.97] to 0.50 [0.29, 0.85], implying an observable difference in treatment between younger and older applicants.
Lippens et al. (2023) found a large average effect of hiring discrimination against older applicants (RR = 0.5804, 95% CI [0.4993, 0.6748], p = .018). However, we cannot directly compare their results to ours, as our selection criteria differed and they combined different age groups. Furthermore, we consistently compared older applicants against a comparator group aged 29 to 35, while their comparator ages varied.
Limitation of Evidence
Following the aforementioned risk of bias assessment, certain issues must be considered when evaluating the overall explanatory power of our analysis. In some studies, applications were sent simultaneously rather than at randomized times (e.g., Ahmed et al., 2012; Neumark et al., 2016). On the one hand, it could be argued that time randomization creates a more realistic replication of the in vivo hiring process and therefore decreases the chances of oversampling. On the other hand, one might argue that applications, whatever their time of receipt, are necessarily processed by the recruiter at different points in time and therefore become effectively randomized in practice. In any case, as this concerns only two studies, we assume it does not substantially influence our results. The same reasoning applies to age randomization and callback platforms. Although we consider age randomization more important, it was omitted (i.e., not further described) in only one study (Challe et al., 2016, p. 3), which we interpreted as a minor issue, especially considering that group sizes were equal in their experiment. However, we believe that shortcomings in the quality of the age manipulation and of the applications play a more central role in our overall assessment, and we note that the assumptions of the respective risk of bias items had to be rejected across several studies for these two factors.
Another item that proved highly critical with regard to risk of bias was age salience. In five studies, age was either not sufficiently salient in the applications or further information was lacking in the paper. Indirect signals of age (e.g., high school graduation years; see Neumark et al., 2019) may introduce bias if recruiters draw false inferences about an applicant's age, for example by disregarding the age at which the applicant entered college. This might explain the lower levels of age discrimination observed in some studies.
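To make the inference problem concrete with hypothetical numbers of our own (none of these figures come from the cited studies): a recruiter assuming a standard educational timeline in a study fielded in 2019 would infer

$$\text{inferred age} \approx (2019 - \text{graduation year}) + 18,$$

so a high school graduation year of 1977 signals an applicant of about 60. If the applicant in fact graduated at age 20, the true age is 62, and the intended age manipulation is weakened by the discrepancy.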
We also documented the existence of anti-discrimination laws in the countries where the studies were conducted, as such laws might influence the observed age discrimination effects (see Neumark et al., 2019). Although all countries represented in our sample have laws prohibiting hiring discrimination based on age, nuances exist (e.g., in Switzerland), and it would be optimal to distinguish which sectors enjoy greater freedom in hiring decisions in order to properly appraise age discrimination effects. Given the narrow body of evidence, we strongly recommend extending our findings once more data become available. This includes moving beyond the language restrictions of this review to incorporate sources we could not use.
Moreover, the scope of this review is restricted to age discrimination against older people, whereas previous findings (Bratt et al., 2018; Finkelstein et al., 2013) suggest that age discrimination also frequently targets very young applicants, as recruiters often appear biased against this group, particularly across generational divides. Furthermore, the subgroup analyses were not conducted as planned because of the low number of studies. Future reviews should be extended in these respects to contribute to a more differentiated understanding of how age discrimination is shaped by contextual factors, particularly the cultural context and the possible moderating role of job tasks and required experience on the direction of age discrimination effects. It would also be beneficial to conduct a study at the EU level to complement the studies conducted in the US and Sweden, which together made up the majority of our sample. Including non-Western countries would provide even better insight into the state of age discrimination in the workforce and improve the generalizability of results. Finally, limited data availability prevented us from providing evidence of age discrimination from scenario-based studies. In general, when comparing the two study types, we would argue that correspondence studies, although often more demanding to conduct, provide a much more realistic insight into discrimination prevalence and evidence with a lower risk of bias. Scenario experiments, on the other hand, could prove more useful in revealing information about the recruiters who make the decisions.
Limitations of the Review Process
We also acknowledge certain limitations of our review process. In the earlier stages of this review, we exhaustively screened the available databases and resources for studies; however, we decided to integrate only the first 200 search results for each individual Google Scholar search, and might therefore have missed a small number of studies (e.g., gray literature) that could have extended our body of evidence. Moreover, although we sought through thorough prior research to capture all essential key terms in our search strategy, we may have overlooked studies that fell outside our search terms. Furthermore, inclusion was restricted to articles written in English; especially since we intended to screen for the moderating effect of culture, including literature written in other languages would have been optimal. We were also limited by access to available data, especially for less recent studies, which made it impossible to include some studies that would otherwise have passed the full-text screening phase. With regard to our meta-analyses, some of the individual analyses are clearly underpowered, given the small number of studies, and this partially restricts the validity of our findings. Indeed, some of the corresponding odds ratios were unexpectedly high in comparison with our other meta-analyses, and these estimates may be unreliable given their low statistical power, especially as certain confidence intervals are wide and border on no effect (Carlsson & Eriksson, 2019). Finally, because of time constraints, we conducted the risk of bias assessment for each study individually, meaning that each study was assessed by only one rater. Although the ratings were validated by either the second or third author of this review, we acknowledge that this could have introduced an additional source of error.
Implications
The evidence from this review suggests that there is an effect of age discrimination in recruitment processes, and that it tends to increase with age. Given that our results are based on correspondence testing, which captures discrimination at the initial stage of hiring, they imply that older applicants are disadvantaged before they even have a chance to present themselves. Although the effect may appear modest for some age groups, it is crucial for human resources professionals and employers to consider how this bias will affect the labor market. As the retirement age continues to rise, it becomes essential to capitalize on the skills and experience of older employees, as doing so might not only promote inclusivity in the workplace but also increase overall productivity.
Furthermore, based on our risk of bias assessment, improvements can be made in implementing and reporting age salience and randomization in job applications. Proper age randomization, as emphasized in Heckman's (1998) critique, is essential in new correspondence experiments to prevent confounds that undermine a study's internal validity. Although a meta-analysis of various correspondence studies may mitigate the issues raised by Heckman by combining studies with diverse characteristics and samples, ensuring the high quality of the original studies remains crucial for an accurate estimate of age discrimination effects. Consequently, future correspondence studies should focus on appropriately randomizing age and ensuring that age salience is not systematically biased by other applicant characteristics. Moreover, we did not find many scenario experiments in this area, particularly ones with well-executed randomization of applications and applicant characteristics. While correspondence tests offer greater external validity, laboratory experiments involving recruiters, which allow the examination of additional aspects of recruiter reasoning (e.g., motivation, attitudes), could yield valuable insights into the factors contributing to age discrimination in recruitment.
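As a minimal sketch of what such randomization could look like in practice (the setup, ages, and variable names are ours, not drawn from any reviewed study), the key point is that the older age should be assigned independently of CV template and sending order:

```r
# Sketch of age randomization for paired applications in a correspondence
# study; all names and values are illustrative.
set.seed(2024)

n_vacancies <- 200
comparator_age <- 31          # within the 29-35 comparator range
older_ages <- c(45, 55, 63)   # one age per older age group

# Draw the older applicant's age per vacancy, and randomize which of two
# otherwise-equivalent CV templates carries the older age, so age is not
# confounded with template quality or order of sending.
older_age   <- sample(older_ages, n_vacancies, replace = TRUE)
older_first <- sample(c(TRUE, FALSE), n_vacancies, replace = TRUE)

design <- data.frame(
  vacancy     = rep(seq_len(n_vacancies), each = 2),
  cv_template = rep(c("A", "B"), times = n_vacancies),
  age         = as.vector(rbind(
    ifelse(older_first, older_age, comparator_age),
    ifelse(older_first, comparator_age, older_age)
  ))
)
head(design)
```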
In conclusion, this review aimed to examine the evidence for age discrimination in hiring from the two most commonly used experimental designs, correspondence studies and scenario (vignette) experiments. Based on our findings, we suggest that correspondence studies provide better insight into discrimination prevalence, while scenario experiments might be less ecologically valid, but could be better suited to shed light on other factors relating to discrimination practices, such as recruiter characteristics.
Author Contributions
Conceptualization – L.B., M.H., R.C.
Data curation – L.B., R.C.
Formal analysis – L.B., R.C.
Investigation – L.B., M.H.
Methodology – L.B., M.H., R.C.
Project administration – L.B., M.H.
Software – L.B., R.C.
Supervision – R.C.
Validation – R.C., S.S.
Visualization – L.B., M.H.
Writing – original draft – L.B., M.H.
Writing – review & editing – L.B., S.S., R.C.
Conflicts of Interest
We have no conflict of interest to declare.
Acknowledgments
We thank Ida Henriksson for invaluable help with improving our search strategy.