Web-based data collection allows researchers to recruit large and diverse samples with fewer resources than lab-based studies require. Recent innovations have expanded the set of methodologies that are possible online, but ongoing work is needed to test the suitability of web-based tools for various research paradigms. Here, we focus on webcam-based eye-tracking; we tested whether the results of five different eye-tracking experiments in the cognitive psychology literature would replicate in a webcam-based format. Specifically, we carried out five experiments by integrating two JavaScript-based tools: jsPsych and a modified version of WebGazer.js. In order to represent a wide range of applications of eye-tracking to cognitive psychology, we chose two psycholinguistic experiments, two memory experiments, and a decision-making experiment. These studies also varied in the type of eye-tracking display, including screens split into halves (Exps. 3 and 5) or quadrants (Exps. 2 and 4), or composed scenes with regions of interest that varied in size (Exp. 1). Outcomes were mixed. The least successful replication attempt was Exp. 1; we did not obtain a condition effect in our remote sample (1a), nor in an in-lab follow-up (1b). However, the other four experiments were more successful, replicating a blank-screen effect (Exp. 2), a novelty preference (Exp. 3), a verb bias effect (Exp. 4), and a gaze-bias effect in decision-making (Exp. 5). These results suggest that webcam-based eye-tracking can be used to detect a variety of cognitive phenomena, including those that are time-sensitive, although paradigms that require high spatial resolution (like Exp. 1) should be adapted to coarser quadrant or split-half displays.
The use of eye-tracking to study cognition took off when Alfred Yarbus used suction cups to affix a mirror system to the sclera of the eye in order to monitor eye position during the perception of images (Yarbus, 1967). In one study, participants viewed a painting depicting multiple people in a complex interaction inside of a 19th century Russian home. Yarbus showed, among other things, that the scan paths and locations of fixations were largely dependent on the instructions given to participants (e.g., View the picture freely vs. Remember the position of the people and objects in the room). In other words, the cognitive processing that the individual is engaged in drives the visuo-motor system. Since these findings, eye-tracking has become a central method in cognitive science research (for reviews see Hayhoe & Ballard, 2005; Rayner, 1998; Richardson & Spivey, 2004). For example, gaze location during natural scene perception is used to test theories of visual attention (e.g., Henderson & Hayes, 2017), and eye-movements during auditory language comprehension, using the “visual world paradigm,” demonstrated the context-dependent and incremental nature of language processing (e.g., Tanenhaus et al., 1995).
An important limitation of the eye-tracking methodology is that it has typically required costly equipment (eye-trackers can range in price from a few thousand dollars to tens of thousands of dollars), particular laboratory conditions (a quiet room with consistent indoor lighting conditions), and a substantial time investment (e.g., bringing participants into a laboratory one at a time). This limits who can conduct eye-tracking research – not all researchers have the necessary resources – and who can participate in eye-tracking research. Most eye-tracking study participants are from western, educated, industrialized, rich, and democratic (Henrich et al., 2010) convenience samples (but see Ryskin et al., 2023), which diminishes the generalizability of the findings and the scope of conclusions that can be drawn about human cognition. Likewise, the sample sizes for in-lab experiments are usually orders of magnitude smaller than what statisticians recommend (Nosek et al., 2022).
A robust solution to all these problems is online experiments, particularly with volunteer citizen scientists as participants (Gosling et al., 2010; Hartshorne, de Leeuw, et al., 2019; Li et al., 2024; Reinecke & Gajos, 2015). Historically, this option has not been available for eye-tracking, since few potential subjects have expensive eye-trackers at home. In principle, this could be circumvented by using so-called "poor-man's" eye-tracking, in which researchers videotape the subject's face and code eye position frame-by-frame (Snedeker & Trueswell, 2003). Most computers now have high-quality built-in cameras, so recording video is straightforward. Unfortunately, coding images frame-by-frame is extremely time-intensive, making large-sample studies unrealistic. However, in recent years, image analysis has improved to the point where this work can be automated with reasonable accuracy (Burton et al., 2014; Papoutsaki et al., 2016; Skovsgaard et al., 2011; Zheng & Usagawa, 2018). Nonetheless, webcam-based eye-tracking only started to be used regularly in research with the advent of WebGazer.js (Papoutsaki et al., 2016), a webcam-based JavaScript plug-in that works in the browser and which can be integrated with any JavaScript web interface, including jsPsych (de Leeuw, 2015), Gorilla (A. L. Anwyl-Irvine et al., 2020), lab.js (Henninger et al., 2021), or PsychoJS (Peirce et al., 2019).
Given the potential game-changing nature of webcam-based eye-tracking, a number of research groups have investigated how well it works. Nearly all this work has used Webgazer.js — most in combination with jsPsych, but some with Gorilla or hand-built integrations. There are two potential limitations to webcam-based eye-tracking. First, the spatial and temporal resolution is lower than what is achievable with an infrared system. This is less a limitation of webcams than a limitation of the information about eye gaze available in an image of a face. Anecdotally, researchers who use poor-man's eye-tracking find it difficult to localize gaze any more precisely than to quadrants. Second, testing subjects over the Internet involves less control: subjects may be unable or unwilling to calibrate equipment, adjust lighting, etc., to the same level of precision typical of in-lab studies.
Results to date have been encouraging. Semmelmann and Weigelt (2018) found data quality was reasonable for fixation location and saccades in fixation, smooth pursuit, and free-viewing tasks, though data collected online through a crowdsourcing platform was slightly more variable and timing was somewhat delayed compared to data collected in the lab. Several researchers successfully replicated well-known findings from the sentence-processing literature involving predictive looks (Degen et al., 2021; Prystauka et al., 2023; Van der Cruyssen et al., 2024; Vos et al., 2022). Yang and Krajbich (2021) successfully replicated a well-established link between value-based decision-making and eye gaze (see also Van der Cruyssen et al., 2024).
While promising, there are some salient limitations. First, many of the studies report effects that are smaller or later than what had been previously observed in the lab (Bogdan et al., 2024; Degen et al., 2021; Kandel & Snedeker, 2024; Slim et al., 2024; Slim & Hartsuiker, 2022; Van der Cruyssen et al., 2024). Some of this could be due to programming decisions rather than inherent limitations in the technology; accurate timing on a web browser is not trivial (A. Anwyl-Irvine et al., 2021; Bridges et al., 2020; de Leeuw & Motz, 2016; Passell et al., 2021), and subtle programming choices can significantly affect the accuracy of WebGazer.js timing (Yang & Krajbich, 2021).1 Since most of the prior work did not address these timing issues, it is not clear how many of the reported lab/web differences would resolve.
Second, prior work has focused on studies with relatively coarse-grained regions of interest (Bogdan et al., 2024; but see Semmelmann & Weigelt, 2018), dividing the screen either in half or into quadrants. This is particularly salient in Prystauka et al. (2023), who simplified Altmann and Kamide's (1999) design so that the regions of interest were quadrants rather than the finer-grained ROIs used in the original. Certainly, webcam eye-tracking will not be as spatially fine-grained as an infrared eye-tracker, but we do not yet have a good sense of the limits.
Finally, the prior work has focused on a relatively limited range of methods. Different paradigms have different technical requirements and analyze the results differently. As a result, the breadth of utility of webcam eye-tracking is unclear.
Present work
In order to broaden the validation of online eye-tracking methodology, we set out to reproduce five previously published studies representing a variety of questions, topics, and paradigms. The goal was to examine the strengths and weaknesses of webcam eye-tracking for common paradigms in cognitive science, across a broad range of research areas. Importantly, we used the only implementation known to have addressed a critical bug in the timing precision of WebGazer.js, namely the one in jsPsych, a package that is known in general for having particularly good timing precision (Bridges et al., 2020; de Leeuw & Motz, 2016).
Selection of Studies
Table 1 provides an overview of the five studies that we selected. We chose five high-impact eye-tracking studies involving adult subjects (for a comparison of remote WebGazer and in-lab anticipatory looking effects with 18-27 month-old participants, see Steffan et al., 2024). Our goal was to include experiments from a range of topic areas (e.g., memory, decision making, psycholinguistics) and paradigms (two halves of the screen, visual world paradigm with four quadrants, visual world paradigm with “naturalistic” scenes). We had a preference for well-established findings that are known to replicate; otherwise, it can be difficult to distinguish a failure of the method from a failure of the original study to replicate. Each of the five target studies has at least one successful in-lab replication that has been published (see Table 1); for two of the target studies (Altmann & Kamide, 1999; Snedeker & Trueswell, 2004), we had access to the data from published replications (James et al., 2023; Ryskin et al., 2017, respectively) and we present these as comparison data in the present article.
| Citation | Topic Area | Paradigm | Citations | Example Replication |
|---|---|---|---|---|
| Altmann & Kamide (1999), Exp. 1 | Psycholinguistics | Natural Scenes | 2,263 | James et al. (2018) |
| Johansson & Johansson (2014) | Memory | Four Quadrants | 293 | Johansson & Johansson (2020) |
| Manns, Stark, & Squire (2000) | Memory | Two Halves | 144 | Zola et al. (2013) |
| Snedeker & Trueswell (2004), Exp. 1 | Psycholinguistics | Four Quadrants | 526 | Ryskin et al. (2017) |
| Shimojo et al. (2003), Exp. 1 | Decision Making | Two Halves | 1,246 | Bird et al. (2012) |
General Methods
The stimuli and experimental code for each study are available on OSF; links are provided in the Method sections of each of the following experiments. All data and analysis code underlying this paper can be found in the Github repository: https://github.com/jodeleeuw/219-2021-eyetracking-analysis. The current paper is fully reproducible by rendering manuscript.Rmd.
Participants
Participants completed the experiment remotely and were recruited through the Prolific platform. In order to have access to the experiment, participants had to meet the following criteria: 18 years of age or older, fluency in English, and access to a webcam. All participants provided informed consent. The online studies were approved by the Vassar College Institutional Review Board. Across all five online experiments, 42% of participants returned the study without completing it. We were unable to record specific reasons for attrition, but we suspect that it was a combination of unwillingness to use their webcam for eye-tracking in an online experiment, technical problems with the webcam, and trying to complete the study on unsupported devices (mobile phones and tablets).
In addition, an in-lab replication was conducted for Experiment 1. Information about the sample is given in the Experiment 1 Method sections. This study was approved by the Institutional Review Board at Boston College.
In order to have adequate statistical power and precision, we aimed for 2.5x the sample size of the original experiment, following the heuristic of Simonsohn (2015). In Experiment 5, the original sample size was so small that we opted to collect 5x the number of participants to increase precision. Because of budget and time constraints, we were unable to replace the data for subjects who were excluded or whose data were missing due to technical failures.
Equipment
We used a fork of the webgazer.js library for webcam eye-tracking (Papoutsaki et al., 2016), implemented in jsPsych, a JavaScript library for running behavioral experiments in a web browser (de Leeuw, 2015).2 Our fork included changes to webgazer.js in order to improve data quality for experiments in which the precise timing of stimulus onsets is relevant. Specifically, we implemented a polling mode so that gaze predictions could be requested at a regular interval, which improved the sampling rate considerably in informal testing. We aimed for a 30Hz sampling rate in all experiments, which was achievable for about 36.90% of the participants (Figure 1). This modification is similar to the one that Yang and Krajbich (2021) reported improved the sampling rate in their study of webgazer. We also adjusted the mechanism for recording time stamps of each gaze prediction, so that the time stamp reported by webgazer is based on when the video frame is received rather than when the computation of the gaze point is finished.
We attempted to sample at 30Hz, which is a standard webcam refresh rate. The plurality of participants met this rate, but there was a long tail of slower rates.
Eye-tracking Calibration and Validation
When participants began the experiment, they were notified the webcam would be used for eye-tracking but no video would be saved. They were asked to remove glasses if possible, close any other tabs or apps, turn off notifications, and make sure their face was lit from the front. The webcam’s view of the participant popped up on the screen, and participants were asked to center their face in the box and keep their head still. The experiment window then expanded to full screen, and participants began the eye-tracking calibration.
During the calibration, dots appeared on the screen one at a time in different locations, and the participants had to fixate them and click on each one. Once they clicked on a dot, it would disappear and a new one would appear in a different location on the screen. The locations of calibration dots were specific to each experiment (details below) and appeared in the areas of the screen where the visual stimuli would appear during the main task in order to ensure that eye movements were accurately recorded in the relevant regions of interest. After the calibration was completed, the validation began. Participants were asked to go through the same steps as the calibration, except that they only fixated the dots as they appeared in different locations on the screen. If accuracy on the validation was too low (fewer than 50% of looks landed within a 200 px radius of the validation points), participants were given an opportunity to re-start the calibration and validation steps. Across all of the experiments reported, 56.99% of participants required recalibration. However, even participants with low calibration accuracy were allowed to complete the experiment, because we specifically wanted to study the consequences of low calibration accuracy.
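For concreteness, the sketch below illustrates the validation accuracy criterion in R (the actual check ran in the browser during the experiment). The data frame `validation` and its column names are hypothetical stand-ins for the recorded gaze samples and validation point locations.

```r
# Minimal sketch of the validation criterion described above, assuming a
# hypothetical data frame `validation` with one row per gaze sample: gaze
# coordinates (x, y) and the coordinates of the validation point shown at
# that moment (point_x, point_y).
library(dplyr)

validation_accuracy <- validation %>%
  mutate(dist = sqrt((x - point_x)^2 + (y - point_y)^2)) %>%
  summarise(prop_within_200px = mean(dist <= 200))

# Participants below the 50% threshold were offered a chance to recalibrate.
needs_recalibration <- validation_accuracy$prop_within_200px < 0.5
```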
Upon finishing calibration, participants were reminded to keep their head still for the remainder of the study.
Pre-registration
These data were collected within the context of an undergraduate research methods course. Groups of students (co-authors) designed and programmed experiments in jsPsych, pre-registered their planned analyses, and collected data through Prolific under the supervision of the last author. The OSF repositories associated with these experiments are linked in the methods sections of each individual study. Note that in the current paper we expand on those pre-registered analyses (e.g., including analyses of the calibration quality).
Data Pre-processing
We used R (Version 4.2.1; R Core Team, 2021) and the R-packages afex (Version 1.3.0; Singmann et al., 2021), broom.mixed (Version 0.2.9.4; Bolker & Robinson, 2020), dplyr (Version 1.1.4; Wickham et al., 2021), forcats (Version 1.0.0; Wickham, 2021a), ggplot2 (Version 3.5.1; Wickham, 2016), jsonlite (Version 1.8.8; Ooms, 2014), lme4 (Version 1.1.35.1; Bates et al., 2015), lmerTest (Version 3.1.3; Kuznetsova et al., 2017), Matrix (Version 1.6.5; Bates & Maechler, 2021), papaja (Version 0.1.2; Aust & Barth, 2020), readr (Version 2.1.5; Wickham & Hester, 2020), shiny (Version 1.8.0; Chang et al., 2021), stringr (Version 1.5.1; Wickham, 2019), tidyr (Version 1.3.1; Wickham, 2021b), and tinylabels (Version 0.2.4; Barth, 2022) for our analyses.
Experiment 1a
The first study was a replication attempt of Altmann and Kamide (1999). Altmann and Kamide used the visual world eye-tracking paradigm (Tanenhaus et al., 1995) to show that meanings of verbs rapidly constrain the set of potential subsequent referents in sentence processing. For example, when looking at the display in Figure 2 and listening to a sentence like “The boy will eat the…,” participants are more likely to look at the cake than when they hear “The boy will move the…,” in which case they tend to look at the train, presumably because cakes are edible and trains are not. Semantic information available at the verb is used to anticipate upcoming linguistic input.
Method
All stimuli, experiment scripts, and a pre-registration are available on the Open Science Framework at https://osf.io/s82kz.
Participants
Sixty participants (2.5 times the original sample of 24) were paid $2.60 for their participation. For unknown reasons, data from 2 of the subjects were not recorded, so we analyzed data from the remaining 58 participants.
Materials and Design
The visual stimuli were created through Canva and depicted an agent accompanied by four to five objects in the scene (see Figure 2). On critical trials, participants heard one of two sentences associated with the scene. In the restrictive condition, the sentence (e.g., “The boy will eat the cake”) contained a verb (e.g., “eat”) which restricts the set of possible subsequent referents (e.g., to edible things). Only the target object (e.g., the cake) was semantically consistent with the verb’s meaning. In the non-restrictive condition, the sentence (e.g., “The boy will move the cake”) contained a verb (e.g., “move”) which does not restrict the set of possible subsequent referents. The target object (e.g., the cake) as well as the distractor objects (e.g., the train, the ball, etc.) were semantically consistent with the verb’s meaning. Both sentences were compatible with the scene, such that the correct keyboard response for the critical trials was “yes.” Filler trials consisted of scenes that also contained an agent surrounded by objects as in the critical trials, but corresponding sentences named an object that was not present in the scene. The correct keyboard response for the filler trials was “no.”
Participants would hear a sentence (e.g., “The boy will eat the cake”) and respond according to whether the sentence matched the picture.
Each participant was presented with 16 critical trials (eight in the restrictive condition, eight in the non-restrictive condition) and 16 fillers for a total of 32 trials. The order of trials and the assignment of critical scene to condition was random on a subject-by-subject basis.
Procedure
The task began with a 9-point eye-tracker calibration and validation (Figure 3). During the experiment, the participants were simultaneously presented with a visual image and a corresponding audio recording of a spoken sentence. Participants had to input a keyboard response indicating “yes” or “no” as to whether the sentence they heard was feasible given the visual image. There were two practice trials to ensure that participants understood the instructions before they undertook the main portion of the experiment. Participants’ reaction times, keyboard responses, and looks to objects in the scene were recorded for each trial.
Black points were used for calibration. Red crosses were used for checking the accuracy of the calibration. The inner validation crosses on the left side are shifted to the left of the corresponding calibration points by 10% due to an error in the code.
Results
Looks to the objects in the scene were time-locked to the onset of the verb, the offset of the verb, onset of the post-verbal determiner, and onset of the target noun. ROIs were defined by creating boxes around each object in the scene. The size of each box was determined by taking the height and width of the given object and adding 20 pixels of padding. Each scene contained an agent region, a target region, and three or four distractor regions.
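As an illustration of this ROI scheme, the following R sketch flags gaze samples that fall within an object's padded bounding box; the data frame `samples` and its column names are hypothetical, not the analysis code from the repository.

```r
# Sketch of the ROI assignment described above: each object's bounding box
# (left, top, width, height) is padded by 20 px, and a gaze sample (x, y)
# counts as a look to that object if it falls inside the padded box.
in_roi <- function(x, y, left, top, width, height, pad = 20) {
  x >= (left - pad) & x <= (left + width + pad) &
  y >= (top  - pad) & y <= (top  + height + pad)
}

# Example: flag gaze samples that fall within the target object's padded box.
samples$in_target <- in_roi(samples$x, samples$y,
                            samples$target_left, samples$target_top,
                            samples$target_width, samples$target_height)
```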
Cumulative Fixation Probabilities
For each sentence, the target time window began at the onset of the verb and ended 2000 milliseconds later. This window was then divided into 50-ms bins; for each participant and each trial, we recorded whether each object was fixated during the 50-ms bin. Collapsing over trials and participants, and averaging across distractors, we calculated the cumulative probability of fixation, shown in Figure 7, Panel (b).
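The following R sketch shows one way to compute such cumulative fixation probabilities, assuming a hypothetical data frame `fixations` with one row per participant, trial, object, and 50-ms bin; the actual computation is in the repository's analysis code.

```r
# Sketch of the cumulative fixation probability computation described above.
# `fixations` is hypothetical: one row per participant, trial, object, and
# 50-ms bin, with a logical column `fixated` and a bin onset `time_bin`
# (ms relative to verb onset).
library(dplyr)

cumulative_prob <- fixations %>%
  filter(time_bin >= 0, time_bin < 2000) %>%        # 0-2000 ms after verb onset
  arrange(participant, trial, object, time_bin) %>%
  group_by(participant, trial, object) %>%
  mutate(fixated_by_now = cummax(fixated)) %>%       # 1 once the object has been fixated
  group_by(object, time_bin) %>%
  summarise(cum_prob = mean(fixated_by_now), .groups = "drop")
```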
Pre-noun fixations
In our first two analyses, we attempted to replicate the original finding that participants looked more to the target than to the distractor during the predictive time window when the verb was restricting. The first model tested whether there were more fixations to the target object than to the distractor in the time window before the onset of the target noun. We constrained the critical time window to begin at verb onset and end just before the onset of the noun; we divided this window into 50-ms bins, coded whether the target was fixated in each bin, and tracked the cumulative probability of a target fixation across the time window. The dependent measure in these analyses was the cumulative probability reached in the final 50-ms bin. We ran a regression model predicting the cumulative probability from the verb condition (restricting = 1 vs. non-restricting = 0), object type (target = 1 vs. distractor = 0), and their interaction, along with random effects for participants and images (with no covariance between random effects, because the model did not converge with a full covariance matrix).3 This approach is analogous to the F1/F2 analyses used in the original but is more commonly used now. Contrary to the original study, there were no significant effects, although the critical interaction was in the expected direction (b = 0.05, SE = 0.03, p=0.15).
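For readers less familiar with this model structure, the R sketch below shows a model of the kind described; the data frame and column names are hypothetical, and the exact random-effects specification used in the paper is given in the analysis code.

```r
# Sketch of a model with the structure described above. The double-bar syntax
# suppresses covariances between random effects, matching the "no covariance"
# structure described in the text; `prenoun_window` and its columns are
# hypothetical stand-ins for the analysis data.
library(lme4)
library(lmerTest)

m_prenoun <- lmer(
  cum_prob ~ verb_condition * object_type +
    (1 + verb_condition * object_type || participant) +
    (1 + verb_condition * object_type || image),
  data = prenoun_window
)
summary(m_prenoun)
```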
Pre-verb-offset fixations
In a follow-up analysis, Altmann and Kamide aligned the predictive time window with the offset of the verb rather than the onset of the noun as above, finding the same result: participants looked significantly more to the target than to the distractor. When we repeated this analysis,4 we again saw that the critical interaction is not significant but numerically in the expected direction (b = 0.05, SE = 0.03, p=0.20).
First target fixations after verb
Finally, we addressed whether participants looked to the target faster in the restrictive vs. the non-restrictive condition, starting after the onset of the verb. On average, participants looked to the target 335 ms after the noun onset in the restrictive condition (compared to -85 ms in Altmann and Kamide (1999) and 135 ms in James et al. (2023)) and 435 ms after the noun onset in the non-restrictive condition (compared to 127 ms in Altmann and Kamide (1999) and 285 ms in James et al. (2023)). Thus, first fixations were not only delayed relative to those in the previous studies compared here, but also showed a smaller difference between conditions.
Critically, we failed to replicate the difference between conditions: participants looked at the target sooner in the restrictive condition, accounting for verb duration and its interaction with condition, but this effect was not statistically significant (b = -121.60, SE = 85.43, p=0.17).5
This was operationalized as the latency of the first target fixation in the non-restricting verb condition minus that in the restricting verb condition.
Calibration
Participants’ calibration quality was measured as the mean percentage of fixations that landed within 200 pixels of the calibration point. Calibration quality varied widely, ranging from 3.16% to 98.87%.
We tested whether a participant's calibration quality was correlated with their effect size. There were three effects of interest: the verb-by-object interaction in predicting fixation probabilities, both in the (1) pre-noun-onset and (2) pre-verb-offset windows (calculated as the difference in target-over-distractor preference between verb conditions), and (3) the effect of verb on the timing of the first target fixation (calculated as the difference in target latency between verb conditions). Calibration quality was not significantly correlated with any of the three effects (Effect 1: Pearson's r = 0.03, p = 0.83; Effect 2: Pearson's r = -0.05, p = 0.73; Effect 3: Pearson's r = 0.10, p = 0.49). Figure 4 plots the relation between calibration scores and Effect 3.
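These are simple per-participant Pearson correlation tests; a minimal R sketch, with hypothetical vector names, is shown below.

```r
# `calibration` holds each participant's calibration score; `effect_3` holds
# each participant's first-fixation latency difference (non-restricting minus
# restricting). Both names are hypothetical.
cor.test(calibration, effect_3)   # Pearson correlation, as reported in the text
```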
Re-analysis After Exclusions
The figure makes clear that there is a large proportion of participants with calibration scores under 50%. Thus, we re-analyzed the data with the 22 participants who had calibration scores of at least 50%. The first two models, comparing target and distractor fixations in the predictive window, produced results very similar to those reported above; the critical interaction was not statistically significant (Pre-noun-onset window: b = 0.07, SE = 0.06, p=0.23; Pre-verb-offset window: b = 0.05, SE = 0.05, p=0.28). However, the final model, which tested the effect of verb condition on the timing of fixations to the target, yielded the expected statistically significant result (b = -193.35, SE = 96.33, p=0.05).
Results from an in-lab study with an eye-tracker (Panel A; James et al., 2023) are contrasted with the current results (Panel B). The vertical line marks the mean noun onset time across trials and conditions.
Discussion
Across three different tests of the hypothesis that listeners will use verb semantics to anticipate the upcoming referent, we found results that were numerically consistent with the hypothesis but not statistically significant. This is unlikely to be because the original effect is unreliable. Indeed, one of us had successfully replicated it previously using traditional methods (James et al., 2023).
An eyeball comparison to the published data of James et al. (2023) suggests that looks to the objects in the scene (relative to background or off-screen looks) were depressed across conditions and objects. After eliminating participants with validation accuracy under 50% and/or 10% or fewer fixations to any ROIs, we were left with only 22 of the original 60 participants. Either of these facts could have resulted in low power to detect effects. Given that nearly two-thirds of participants had poor data quality, we ran a follow-up webcam study in a lab setting in order to test whether the failure to replicate was due to using a remote sample or to the technology itself.
Experiment 1b
Experiment 1b tested whether the failure to replicate significant condition effects in Experiment 1a was due to features of conducting the study remotely (i.e., varied experimental settings and apparatuses, lower compliance) rather than webcam-based eye-tracking or webgazer per se. Thus, Experiment 1b took place in a lab setting with undergraduate participants but otherwise used the same Method as Experiment 1a.
Method
Participants
Forty-nine participants completed the study in a lab setting. They were recruited via the Boston College subject pool. Participants needed to be 18 years of age or older and native speakers of English.
Materials and Design
Materials were identical to those in Experiment 1a.
Procedure
After being greeted by the experimenter and completing the informed consent form, participants followed on-screen prompts to complete the study, including calibration, as described in the Experiment 1a Procedure.
Apparatus
Subjects were tested on a Macintosh laptop using the built-in webcam. The study was run in Chrome using the same code as Exp. 1a.
Results
Cumulative Fixation Probabilities
For each sentence, the target time window began at the onset of the verb and ended 2000 milliseconds later. This window was then divided into 50-ms bins; for each participant and each trial, we recorded whether each object was fixated during the 50-ms bin. Collapsing over trials and participants, and averaging across distractors, we calculated the cumulative probability of fixation, shown in Figure 7, Panel (b). The results from Experiment 1a are copied here for ease of comparison (Panel a).
Pre-noun fixations
We ran the analysis exactly as before. Unlike with the remote sample, this time there was a significant main effect of object type such that participants were more likely to be looking at the target than the distractor object during this time window (b = 0.10, SE = 0.03, p=0.01). However, the critical interaction was again not statistically significant and was indeed in the wrong direction (b = -0.05, SE = 0.05, p=0.25).
Pre-verb-offset fixations
When we repeated this analysis, the critical interaction was again not significant, nor was it even in the expected direction (b = -0.06, SE = 0.04, p=0.17).
First target fixations after verb
On average, participants looked to the target 405 ms after the noun onset in the restrictive condition and 399 ms after the noun onset in the non-restrictive condition. As in the remote sample, the latencies are overall slower than in the results published by Altmann and Kamide (1999) and James et al. (2023). Unlike in the remote sample, however, the difference is in the unexpected direction, such that participants looked to the target faster in the non-restrictive condition. In any case, no effects were significant: the paradoxical advantage in the non-restrictive condition was not statistically significant (b = 2.88, SE = 117.68, p=0.98), nor were the effects of verb duration and its interaction with condition (duration: b = -0.58, SE = 0.54, p=0.30; interaction: b = -0.60, SE = 0.96, p=0.53).
This was operationalized as the latency of the first target fixation in the non-restricting verb condition minus that in the restricting verb condition.
Calibration
As before, participants' calibration quality was measured as the mean percentage of fixations that landed within 200 pixels of the calibration point. Calibration quality ranged from 5.13% to 97.89%. As in Exp. 1a, calibration quality was not significantly correlated with any of the three effects (pre-noun: Pearson's r = -0.24, p = 0.11; pre-verb-offset: Pearson's r = -0.19, p = 0.20; first fixation: Pearson's r = -0.16, p = 0.29). Figure 4 plots the relation between calibration scores and Effect 3.
Re-analysis After Exclusions
Again, we re-analyzed the data focusing on participants with validation accuracy of at least 50% (N = 26). Across all three models, results were in line with analyses using the minimal exclusion criteria; none of the critical effects were statistically significant, nor were they in the expected direction (Pre-noun-onset window, verb x object interaction: b = -0.08, SE = 0.06, p=0.15; Pre-verb-offset window, verb x object interaction: b = -0.07, SE = 0.05, p=0.16; Verb effect on first target fixation: b = 6.40, SE = 97.92, p=0.95).
Results from our web-based study (Panel A; Experiment 1a) are contrasted with the current results (Panel B). The vertical line marks the mean noun onset time across trials and conditions.
Discussion
Overall, the results of Experiment 1 paint a sobering picture of web-based eye-tracking. In Experiment 1a, results from the remote sample were in the expected direction but effects were smaller and delayed relative to previous work and failed to reach statistical significance. This was not due to vagaries of remote samples or working with a more diverse subject population: results from the in-lab study were, if anything, worse. Nor does our failure to replicate seem to be driven by subjects with poor calibration; effect sizes were not correlated with calibration success, and restricting analyses to subjects with good calibration scores only helped in one of the experiments and in only one of the three critical analyses.
Taken together, the instability of these effects might suggest that this paradigm is not well-suited for webcam-based eye-tracking. Notably, the ROIs were tightly drawn around the five to six objects in each scene (drawing larger ROIs in these scenes would have led to overlapping objects) and thus, analyses were unforgiving of inaccurate calibration. Further evidence comes from a recent webgazer study that successfully replicated the original results using modified stimuli consisting of only four objects, each in separate quadrants, allowing for larger, more distinct ROIs (Prystauka et al., 2023).
The next four experiments test paradigms with more generous ROIs.
Experiment 2
The second study was a replication attempt of Johansson and Johansson (2014), which examined how visuospatial information is integrated into memory for objects. They found that, during memory retrieval, learners spontaneously look to blank screen locations where pictures were located during encoding (see Spivey & Geng, 2001) and that this spatial reinstatement facilitates retrieval of the picture.
Method
All stimuli, experiment scripts, and a pre-registration are available on the Open Science Framework at https://osf.io/xezfu/.
Participants
Sixty participants were paid for their participation (once again, 2.5x the original sample size of 24). Data from one participant were not properly recorded due to unknown technical issues, so data from 59 participants were included in all analyses to follow.
Materials and Design
The experiment consisted of two blocks each composed of an encoding phase and a recall phase. The two blocks differed by whether the recall phase was in the free-viewing or fixed-viewing condition, as described in the Procedure. Participants were randomly assigned to see the fixed-viewing or free-viewing block first. During each encoding phase, participants saw a grid indicating the four quadrants of the screen. Each quadrant contained six cartoon images of items belonging to the same category. The four categories were humanoids, household objects, animals, and methods of transportation (see Figure 8). Different images were used in each block; there were 48 unique images total across the experiment.
Each recall phase presented participants with a blank screen with a central fixation cross as they listened to true/false statements testing their recall of the previous grid. Each statement fell into either the interobject or the intraobject condition. Interobject statements were those that compared two different items in the grid (e.g., "The skeleton is to the left of the robot"), while intraobject statements were those that asked about the orientation of a single item (e.g., "The bus is facing right"). There were 48 total statements in each block, such that each object was the subject of both an interobject and an intraobject statement; there were 96 unique statements total across the experiment.
Procedure
The task began with a 5-point eye-tracker calibration and validation (Figure 9).
Black points were used for calibration. Red crosses were used for checking the accuracy of the calibration. In this experiment, all the same locations were used for both calibration and validation.
Participants received instructions that included an example grid and an explicit request that they not use any tools to help them during encoding. Participants then began their first encoding phase. Each of the four quadrants was presented one at a time. First, a list of the items in the quadrant was shown, then the pictures of the items were displayed in the quadrant. For each item, participants used their arrow keys to indicate whether the object was facing left or right. After the participant identified the direction of each item, they had an additional 30 seconds to encode the name and orientation of each item in the quadrant. Finally, after all four quadrants were presented, participants were shown the full grid of 24 items and had 60 seconds to further encode the name and orientation of each item.
Participants then entered the first recall phase, in which they listened to the 48 statements and responded by pressing the ‘F’ key for false statements and ‘T’ for true ones. While listening to these statements, in the free-viewing block, participants saw a blank screen and were allowed to freely gaze around the screen. During the fixed-viewing block, participants were asked to fixate a small cross in the center of the screen throughout the recall phase. In both cases, the mouse was obscured from the screen.
Participants then proceeded to the second encoding and recall phases as described above. After completing both encoding-recall blocks, participants were asked to answer a few survey questions (such as whether they wore glasses or encountered any distractions).
The primary methodological difference between this replication and Johansson and Johansson's study was that the original study included two additional viewing conditions that were omitted from this replication due to time constraints. In those two conditions, participants were prompted to look to a specific quadrant (rather than free viewing or central fixation) which either matched or mismatched the original location of the to-be-remembered item.
Results
Eye gaze
Looks during the retrieval period were categorized as belonging to one of four quadrants based on the x,y coordinates. The critical quadrant was the one in which the to-be-retrieved object had been previously located during encoding. The other three quadrants were labeled “first”, “second,” and “third” depending upon the location of the critical quadrant (e.g., when the critical quadrant was in the top left, the “first” quadrant was the top right quadrant, but when the critical quadrant was in the top right, “first” corresponded to bottom right, etc.). In both the fixed- and free-viewing condition, participants directed a larger proportion of looks to the critical quadrant (see Figure 10). This bias appeared larger in the free-viewing condition, suggesting that the manipulation was somewhat effective.
Error bars indicate standard errors over participant means.
The proportions of looks across quadrants in the free-viewing condition were analyzed using a linear mixed-effects model with quadrant as the predictor (critical as the reference level). The model included random intercepts and slopes for participants.6 Proportions of looks were significantly higher for the critical quadrant compared to the other three (first: b = -0.06, SE = 0.01, p<0.001, second: b = -0.08, SE = 0.01, p<0.001, third: b = -0.05, SE = 0.01, p<0.001).
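A minimal R sketch of a model with this structure is shown below; the data frame and column names are hypothetical, and the precise specification used in the paper is in the analysis code.

```r
# Sketch of the quadrant-looks model described above. `free_viewing` is a
# hypothetical data frame with one row per participant and quadrant, holding
# the proportion of looks to that quadrant; `quadrant` is a factor with the
# critical quadrant as the reference level, so each coefficient compares
# another quadrant against it.
library(lme4)
library(lmerTest)

free_viewing$quadrant <- relevel(factor(free_viewing$quadrant), ref = "critical")

m_gaze <- lmer(
  prop_looks ~ quadrant + (1 + quadrant | participant),
  data = free_viewing
)
summary(m_gaze)
```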
Error bars indicate standard errors over participant means.
Response Time and Accuracy
Participants' response times and accuracies on memory questions are summarized in Figure 11. Both dependent variables were analyzed with linear mixed-effects models with relation type (interobject = -0.5, intraobject = 0.5), viewing condition (fixed = -0.5, free = 0.5), and their interaction as the predictors and random intercepts for participants.7
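A minimal R sketch of these models, using the contrast coding given above, is shown below; the data frame `recall` and its column names are hypothetical.

```r
# Sketch of the behavioral models described above. `recall` is hypothetical,
# with one row per participant and statement.
library(lme4)
library(lmerTest)

recall$relation <- ifelse(recall$relation_type == "intraobject", 0.5, -0.5)
recall$viewing  <- ifelse(recall$viewing_condition == "free", 0.5, -0.5)

m_rt  <- lmer(rt       ~ relation * viewing + (1 | participant), data = recall)
m_acc <- lmer(accuracy ~ relation * viewing + (1 | participant), data = recall)
```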
The original study reported RTs for interobject statements were longer in the fixed condition relative to the free-viewing condition, while there was no difference for intraobject statements, resulting in a significant interaction. In contrast, we found overall slower RTs for the free viewing condition (b = 260.98, SE = 105.24, p<0.001) and for interobject questions (b = -555.60, SE = 105.24, p<0.001) but no interaction. We also failed to replicate a significant main effect of question type on accuracy (b = -0.05, SE = 0.03, p=0.05), while we found an unexpected effect of viewing condition on accuracy, with higher accuracy for the fixed condition (b = -0.06, SE = 0.03, p=0.03).
One possibility is that the in-lab participants in Johansson and Johansson (2014) were much more compliant with the instruction to keep their gaze on central fixation (though these data are not reported in the original paper). However, when we restricted analyses to the subset of participants (N = 25) who were most compliant during the fixed-viewing block (at least 25% of their looks fell within 20% of the center of the display), the viewing condition effects and the interactions remained non-significant. Given the smaller sample size, we do not interpret these results further.
Calibration
This was operationalized as the proportion of looks to the critical quadrant minus the average proportion of looks to the other three quadrants.
Participants' calibration quality was measured as the mean percentage of fixations that landed within 200 pixels of the calibration point, averaging the initial calibration (or re-calibration for participants who repeated calibration) and the calibration at the halfway point. Calibration scores varied substantially (between 17.34% and 100%). The quality of a participant's calibration was not significantly correlated with the participant's effect size (Pearson's r = 0.21, p = 0.11), as measured by the proportion of looks to the critical quadrant minus the average proportion of looks to the other three quadrants (see Figure 12).
Re-analysis After Exclusions
There were 12 participants whose calibration score was under 50%. When those participants were removed from the analyses, critical results were the same: there remained a clear gaze bias in the free-viewing condition (first: b = -0.06, SE = 0.01, p<0.001, second: b = -0.08, SE = 0.02, p<0.001, third: b = -0.06, SE = 0.01, p<0.001); viewing condition still did not interact with question type to predict either behavioral outcome.
Survey results
Participants were asked a number of debriefing questions. 15% reported experiencing distracting notifications during the experiment. 14% reported that the experiment exited full screen mode during the experiment. 49% reported that it was difficult to keep their head in a fixed position during the memory recall portion of the experiment. 46% reported that they were unable to hear at least one question. 19% reported wearing glasses during the experiment. Average calibration scores were numerically lower for participants who reported wearing glasses (60.54%) compared to those who didn’t (72.99%, t=-1.90, p=0.08).
Discussion
We replicated the key eye-tracking result: as in Johansson and Johansson (2014) and Spivey and Geng (2001), during memory retrieval, learners spontaneously looked to the blank screen locations where pictures had been located during encoding, suggesting that visuospatial information is integrated into the memory for objects. Interestingly, we did not replicate the behavioral result: a memory benefit, in terms of speed, of spatial reinstatement via gaze position during retrieval of the picture. For researchers interested in that effect, this is an issue to be followed up on. However, it is not strictly relevant for our evaluation of gaze tracking online.
Experiment 3
The third study was a partial replication attempt of the "looking group" condition of Manns, Stark, and Squire (2000) (we did not replicate their priming condition). This experiment used the visual paired-comparison task, which involves presenting a previously-viewed image and a novel image together and measuring the proportion of time spent looking at each image. The expected pattern of results is that participants will look more at novel objects. Manns et al. (2000) hypothesized that this pattern of behavior could be used to measure the strength of memories. If a viewer has a weak memory of the old image, then they may look at the old and new images for roughly the same amount of time. They tested this in two ways. First, they showed participants a set of images, waited five minutes, and then paired those images with novel images. They found that participants spent more time (58.8% of total time) looking at the novel images. They then measured memory performance one day later and found that participants were more likely to recall images that they had spent less time looking at during the visual paired-comparison task the previous day.
Method
The stimuli and experimental code can be found on the Open Science Framework at https://osf.io/k63b9/. The pre-registration for the study can be found at https://osf.io/48jsv . We inadvertently did not create a formal pre-registration using the OSF registries tool, but this document contains the same information and is time stamped prior to the start of data collection.
Participants
Our pre-registered target was 50 participants. 51 participants completed the first day of the experiment and 48 completed the second day. Following Manns et al., we excluded 3 participants due to perfect performance on the recognition memory test because this prevents comparison of gaze data for recalled vs. non-recalled images. Our final sample size was 45 participants.
Materials and Design
Stimuli consisted of color photographs of common objects (e.g. an apple, a key, etc.). We selected 96 unique images from the stimulus set provided by Konkle, Brady, Alvarez, and Oliva (2010), which contains multiple objects from hundreds of unique categories. We selected images in 48 pairs from the same object category such that each critical object had a corresponding foil object during the recognition test (e.g. one red apple and one green apple). All images are of a single object on a white background.
Stimulus sets were created to present participants with 24 presentation trials and 24 test trials on Day 1, and 48 recognition trials on Day 2. Each presentation trial screen presented two identical images (e.g., red apple and red apple). The test trial screens presented each of the previously-seen ("old") images paired with a new image (e.g., red apple and bicycle key). The recognition trial screens presented a single image that was either from the original set of 24 (seen twice across the first two phases) or was a corresponding foil for the original set (e.g., red apple or green apple). Thus, each participant was exposed to 72 unique images over the course of the experiment (the 24 foils for the "new" images in the test phase are never seen). Two full stimulus lists were created to counterbalance images across participants; the images that composed the "old" set for one list made up the "new" set for the other list. The experimental design is visually depicted in Figure 14.
Procedure
The task began with a 7-point eye-tracker calibration (each point was presented 3 times in a random order) and validation with 3 points (each presented once). The point locations were designed to focus calibration on the center of the screen and the middle of the left and right halves of the screen (Figure 13).
Black points were used for calibration. Red crosses were used for checking the accuracy of the calibration.
The experiment was administered over the course of two consecutive days. The presentation phase and test phase occurred on the first day, while the recognition test occurred on the second day.
During the presentation phase, participants viewed 24 pairs of identical images. Each pair was presented for 5 seconds and an interval of 5 seconds elapsed before the next pair was shown. The order of the photographs was randomized and different for each participant. After completion of the presentation phase, participants were given a 5-minute break during which they could look away from the screen.
After the break, they were prompted to complete the eye-tracking calibration again before beginning the test phase. During this phase, participants again viewed 24 pairs of photographs (old and new) with an interstimulus duration of 5 seconds.
Approximately 24 hours after completing the first session, with a leeway interval of 12 hours to accommodate busy schedules, participants were given the recognition test. Each image was shown on the screen for 1 second, followed by a 1-second interstimulus interval. Each photograph remained on the screen until the participants indicated whether or not they had seen it before by pressing 'y' for yes and 'n' for no. After they pressed one of the two keys, a prompt on the screen asked them to rate their confidence in their answer from 1 ("pure guess") to 5 ("very sure") by clicking on the corresponding number on the screen. No feedback on their responses was given during the test.
One aspect of the original study that we could not reproduce was the use of two separate screens; due to the constraints of the online experiment format, we presented all stimuli on a single screen. We had no a priori reason to expect this to matter.
Results
Day 1
We calculated the proportion of gaze samples in the ROI of the unfamiliar image out of all the gaze samples that were in either ROI. Of the 1248 total trials in the experiment across all subjects, 78 had no fixations in either ROI and so were excluded from this analysis.
The mean proportion of looks to the novel object was 0.55 (SD = 0.10). This was significantly greater than 0.5 (t(49) = 3.29, p = 0.002), replicating the finding that participants show a preference for looking at the novel objects.
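A minimal R sketch of this test, assuming a hypothetical per-trial data frame `test_trials` with a `prop_novel` column, is shown below.

```r
# Sketch of the novelty-preference test described above: average each
# participant's per-trial proportion of in-ROI gaze samples on the novel image,
# then test the by-participant means against chance (0.5).
library(dplyr)

by_subject <- test_trials %>%
  group_by(participant) %>%
  summarise(prop_novel = mean(prop_novel), .groups = "drop")

t.test(by_subject$prop_novel, mu = 0.5)
```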
Day 2
In Day 2 analyses, we excluded the 16 (out of 2304) trials where the response time for the recognition judgment was greater than 10 seconds.
Participants correctly identified whether the image was familiar or unfamiliar 87.09% (SD = 10.49) of the time. After excluding the 3 participants who responded correctly to all images, the average confidence rating for correct responses (M = 3.51; SD = 0.41) was significantly higher than their average confidence ratings for incorrect responses (M = 2.55; SD = 0.75), t(44) = 9.36, p < 0.001). Among the same subset of participants, response times for correct responses (M = 1,443.49, SD = 413.94) were also significantly faster than for incorrect responses (M = 2,212.65, SD = 1,733.76), t(44) = -3.43 , p = 0.001). These findings replicate the original.
To see whether preferentially looking at the unfamiliar object on Day 1 was correlated with confidence and response time for correct responses on Day 2, we computed the correlation coefficient between Day 1 proportion of looks to novel and Day 2 confidence/RT for each participant. Following the original analysis, we transformed these values using the Fisher r-to-z transformation. Using one-sample t-tests, we failed to replicate a significant difference from 0 for the correlation between proportion of looks to the novel object and confidence ratings, t(38) = 0.46, p = 0.65 (excluding the subjects who gave the same confidence judgment for all images), nor for the correlation with RT, t(46) = 0.49, p = 0.63.
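A minimal R sketch of this analysis, with hypothetical object and column names, is shown below.

```r
# Sketch of the correlation analysis described above. `day_data` is
# hypothetical, with one row per participant and image, containing that
# image's Day 1 looks-to-novel proportion and Day 2 confidence rating.
library(dplyr)

per_subject_r <- day_data %>%
  group_by(participant) %>%
  summarise(r = cor(prop_novel_day1, confidence_day2), .groups = "drop")

z <- atanh(per_subject_r$r)   # Fisher r-to-z transformation
t.test(z, mu = 0)             # one-sample test of the transformed correlations against zero
```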
Effects of ROIs
Above we excluded looks that fell outside the ROIs. In contrast, the original experiment simply coded looks as being to the left or right. This may be more appropriate to the limited resolution of camera-based eye gaze detection. We re-ran analyses based on this split-half coding criterion. The correlation between proportion looks to novel using the ROI method and the halves method is 0.76 (see Figure 15).
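A minimal R sketch of the halves coding, with all names hypothetical, is shown below.

```r
# Sketch of the split-half coding described above: a gaze sample counts toward
# whichever half of the screen it falls on. `samples`, `screen_width`, and
# `novel_side` ("left"/"right") are hypothetical.
library(dplyr)

halves_novel <- samples %>%
  mutate(side = ifelse(x < screen_width / 2, "left", "right")) %>%
  group_by(participant, trial) %>%
  summarise(prop_novel = mean(side == novel_side), .groups = "drop")
```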
We again found that participants looked more at the novel object, though at a slightly lower rate than before (0.54; SD = 0.08). As before, this was significantly greater than 0.5 (t(50) = 3.51, p < 0.001).
Performance on Day 2 remained uncorrelated with Day 1 looks-to-novel after switching the coding of gaze. We found no significant difference from 0 for the correlation between looks-to-novel and confidence ratings, t(39) = 0.74, p = 0.47 (excluding the subjects who gave the same confidence judgment for all images), nor for the correlation between looks-to-novel and RT, t(47) = 0.28, p = 0.78.
Calibration
To see if calibration success was correlated with the eye-tracking effects, we calculated a calibration score for each participant. The calibration score was the average proportion of samples within 200 pixels of the validation points during the final validation phase before the eye-tracking was performed. Calibration scores were not correlated with the proportion of looks to novel, regardless of whether scores were computed using ROIs (see Figure 16) or split-halves (see Figure 17).
This was operationalized as the proportion of gaze samples to the new image out of all gaze samples.
Similarly, calibration scores were not correlated with the relationship between Day 2 memory performance and Day 1 looking, for either measure of looking or either behavioral measure (see Figure 18).
Looks-to-novel were either coded using the halves method (left panels) or ROI method (right panels); memory performance was measured using confidence ratings (top panels) or reaction time for correct recognition judgments (bottom panels).
Re-analysis After Exclusions
As is clear from the preceding figures, there was a large number of participants (N = 26) who had calibration scores under 50%. When we re-analyzed the subset that remained after those participants were excluded (N = 19), key results were aligned with the main analyses. The looks-to-novel result was upheld such that participants looked more at the new image on Day 1 (mean: 0.54, SD = 0.08, t(50) = 3.51, p < 0.001), but looks-to-novel remained unrelated to Day 2 memory outcomes (confidence: t(20) = -0.58, p = 0.57; RT: t(24) = 0.41, p = 0.68).
Discussion
As in Manns et al. (2000), participants looked more at novel images than previously seen images. This effect was consistent for ROIs based on the images and for the coarser ROIs based on two halves of the display. A day later, participants were also able to discriminate the images they had seen from foil images they had not seen during the previous session. However, there was no evidence that memory performance on day 2 was related to looking time on day 1.
Thus, most of the key findings replicated. What should we make of the finding that failed to replicate? Notably, the proportion of looks to the novel object on Day 1 (54% or 55%, depending on analysis) was numerically smaller than in the original (59%), potentially decreasing the ability to detect correlations with eye gaze. If so, the question is whether that is due to lower-quality eye gaze tracking. This seems unlikely, because calibration quality did not appear to substantially affect this result (or any of the others). Alternative explanations include the well-documented decline effect, whereby replications tend to find smaller effect sizes than the original study (Hartshorne & Schachner, 2012; Open Science Collaboration, 2015; Schooler, 2011), and the common finding that participant accuracy is lower when the sample does not consist of undergraduates at elite universities (Germine et al., 2012; Hartshorne, Skorb, et al., 2019).
Experiment 4
The fourth study was a replication attempt of Experiment 1 in Ryskin et al. (2017), which was closely modeled on Snedeker and Trueswell (2004). These studies used the visual world paradigm to show that listeners use knowledge of the co-occurrence statistics of verbs and syntactic structures to resolve ambiguity. For example, in a sentence like “Feel the frog with the feather,” the phrase “with the feather” could be describing the frog, or it could be describing the instrument that should be used to do the “feeling.” When both options (a frog holding a feather and a feather by itself) are available in the visual display, listeners rely on the verb’s “bias” (statistical co-occurrence either in norming or corpora) to rapidly choose an action while the sentence is unfolding. Use of the verb’s bias should be reflected in a greater proportion of eye gaze fixations to the animal (e.g., frog) for verbs that are biased toward modifier structures relative to instrument structures and, conversely, a greater proportion of eye gaze fixations to the instrument (e.g., feather) for verbs that are biased toward instrument structures relative to modifier structures.
Method
The stimuli and experimental code can be found on the Open Science Framework at https://osf.io/x3c49/. The pre-registration for the study can be found at https://osf.io/3v4pg.
Participants
Fifty-seven participants were paid $2.50 for their participation — three fewer than the intended sample of 60, which was not reached due to time constraints, but double the sample in Ryskin et al. (2017) Experiment 1.
Materials and Design
The images and audio recordings presented to the participants were the same stimuli used in the original study. The critical trials were divided into modifier-biased, instrument-biased, and equi-biased conditions, and the filler trials did not contain ambiguous instructions. Two lists of critical trials were made with different verb and instrument combinations (e.g., “rub” could be paired with “panda” and “crayon” in one list and “panda” and “violin” in the second list). Within each list, the same verb was presented twice, but each time with a different target instrument and animal. Lists were randomly assigned to participants to ensure that any effects were not driven by properties of the particular animal or instrument images used. There were 54 critical trials per list (3 verb conditions x 9 verbs per condition x 2 presentations) and 24 filler trials. The list of the 27 verbs used, along with their verb bias norms, can be found in Appendix A of the original study.
Procedure
After the eye-tracking calibration and validation (Figure 19), participants went through an audio test so they could adjust the audio on their computer to a comfortable level. Before beginning the experiment, they were given instructions that four objects would appear, an audio prompt would play, and they should do their best to use their mouse to act out the instructions. They then went through three practice trials which were followed by 54 critical trials and 24 filler trials presented in a random order.
Black points were used for calibration. Red crosses were used for checking the accuracy of the calibration. In this experiment, all the same locations were used for both calibration and validation.
During a trial, four pictures were displayed (target animal, target instrument, distractor animal, distractor instrument), one in each corner of the screen, and participants heard an audio prompt that contained instructions about the action they needed to act out (e.g., “Rub the butterfly with the crayon”; see Figure 20)8. Using their cursor, participants could act out the instructions by clicking on objects and moving them or by motioning over the objects9. After completing the action, participants pressed the space bar, which led to a screen that said “Click Here” in the middle, in order to remove bias in the eye and mouse movements from the previous trial. The experiment only allowed participants to move on to the next trial once the audio had finished playing and the mouse had been moved over at least one object.
The butterfly is the target animal, the panda is the distractor animal, the crayon is the target instrument, and the violin is the distractor instrument.
Results
As shown in Figure 21, the qualitative results match those of the original. The quantitative patterns of clicks were similar to those observed in the original dataset, though for instrument-biased verbs, clicks were more evenly split between the animal and the instrument than in the in-lab study, where they were clearly biased toward the instrument. A mixed-effects logistic regression model was used to predict whether the first movement was on the target instrument, with verb bias condition as an orthogonally contrast-coded (instrument vs. equi & modifier: inst = -2/3, equi = 1/3, mod = 1/3; equi vs. modifier: inst = 0, equi = -1/2, mod = 1/2) fixed effect.
Error bars indicate standard errors over participant means.
Participants and items were entered as varying intercepts with by-participant varying slopes for verb bias condition.10 Participants were more likely to first move their mouse over target instruments in the instrument-biased condition relative to the equi-biased and modifier-biased conditions (b = -1.50, SE = 0.25, p < 0.01). Further, participants were more likely to first move their mouse over target instruments in the equi-biased condition relative to the modifier-biased condition (b = -1.10, SE = 0.29, p < 0.01).
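For readers reproducing this model, the contrast coding and model structure can be set up roughly as follows (an R/lme4 sketch that mirrors the call given in the footnotes; the data frame and column names are assumptions):
library(lme4)
d$verb_bias <- factor(d$verb_bias, levels = c("inst", "equi", "mod"))
contrasts(d$verb_bias) <- cbind(
  inst_vs_rest = c(-2/3, 1/3, 1/3),  # instrument vs. equi & modifier
  equi_vs_mod  = c(0, -1/2, 1/2)     # equi vs. modifier
)
m <- glmer(is.mouse.over.instrument ~ verb_bias +
             (1 + verb_bias | participant) + (1 | item),
           family = "binomial", data = d)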
Gaze fixations were time-locked to the auditory stimulus on a trial-by-trial basis and categorized as being directed toward one of the four items in the display if the x, y coordinates fell within a 400 by 400 pixel rectangular region of interest around each image (a 200 x 200 pixel rectangle containing the image with an additional 100 pixels of padding on all sides). In order to assess how verb bias impacted sentence disambiguation as the sentence unfolded, the proportion of fixations was computed in three time windows: the verb-to-animal window (from verb onset + 200 ms to animal onset + 200 ms), the animal-to-instrument window (from animal onset + 200 ms to instrument onset + 200 ms), and the post-instrument window (from instrument onset + 200 ms to instrument onset + 1500 ms + 200 ms). Results were qualitatively similar to those in the original, as shown in Figure 22, though proportions of fixations to the target animal were much lower in the web version of the study. This may reflect the fact that participants in the web study were less attentive and/or that the quality of the WebGazer eye-tracking data is lower relative to the EyeLink 1000 used in the original study. Eye gaze results are shown in more detail in Figure 23.
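To make the ROI coding concrete, a gaze sample can be assigned to an image ROI as follows (an R sketch under the assumptions above; image centers and variable names are hypothetical):
# TRUE if the sample falls within the 400 x 400 px box centered on an image
in_roi <- function(x, y, center_x, center_y, half_width = 200) {
  abs(x - center_x) <= half_width & abs(y - center_y) <= half_width
}
# e.g., post-instrument window: samples with
#   t >= instrument_onset + 200 and t < instrument_onset + 1700 (ms)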
Error bars reflect standard errors over subject means.
Vertical lines indicate average onsets of animal and instrument offset by 200ms. Shaded ribbons reflect standard errors over subject means.
Mixed-effects linear regression models were used to predict the proportions of fixations to the target animal vs. instrument within each time window with the verb bias condition as an orthogonally contrast-coded (instrument vs. equi & modifier: inst = -2/3, equi = 1/3, mod = 1/3; equi vs. modifier: inst = 0, equi = -1/2, mod = 1/2) fixed effect. Participants and items were entered as varying intercepts.11
In the verb-to-animal window, participants did not look more at the target animal in any of the verb bias conditions (Instrument vs. Equi and Modifier: b = -0.01, SE = 0.02, p = 0.59; Equi vs. Modifier: b = 0.00, SE = 0.02, p = 1.00).
In the animal-to-instrument window, participants looked more at the target animal in the modifier-biased and equi-biased conditions relative to the instrument-biased condition (b = 0.03, SE = 0.01, p < 0.01) and in the modifier-biased condition relative to the equi-biased condition (b = 0.02, SE = 0.01, p < 0.05).
In the post-instrument window, participants looked more at the target animal in the modifier-biased and equi-biased conditions relative to the instrument-biased condition (b = 0.08, SE = 0.02, p < 0.01), but not significantly more in the modifier-biased condition relative to the equi-biased condition (b = 0.03, SE = 0.02, p = 0.15).
Calibration
Participants’ calibration quality, measured as the mean percentage of fixations that landed within 200 pixels of the validation points, varied substantially (from 2.22% to 97.36%). The quality of a participant’s calibration significantly correlated with that participant’s effect size (Pearson’s r = 0.29, p < 0.05): the difference in target animal fixation proportions between the modifier and instrument conditions was larger for participants with better calibration (see Figure 24).
Error bars reflect standard errors over subject means.
Re-analysis After Exclusions
A subset of 35 participants had calibration quality >50%. Figure 25 shows proportions of fixations to the target animal for this subset alongside the full dataset. Though the overall proportions of fixations to the target animal in this subset were higher than in the full dataset, they were still much lower than in the original study. Repeating the linear mixed-effects analysis (in the post-instrument time window only) on this subset of participants suggests that the effect of verb bias condition was larger in this subset than in the full dataset. Participants’ preference for the target animal relative to the target instrument in the modifier-biased and equi-biased conditions was greater than in the instrument-biased condition (b = 0.10, SE = 0.02, p < 0.001), but the difference between the modifier-biased and equi-biased conditions was not significant (b = 0.02, SE = 0.02, p = 0.29).
Effects of ROIs
Eye-tracking on the web differs critically from in-lab eye-tracking in that the size of the display differs across participants. Thus the size of the ROIs also differs across participants. The current version of the web experiment used a bounding box around each image to determine the ROI. This approach is flexible and accommodates variability in image size, but may exclude looks that are directed at the image yet fall just outside its bounding box (due to participant or eye-tracker noise), as shown in Figure 26a. Alternatively, the display can be split into 4 quadrants which jointly cover the entire screen (see Figure 26b).
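Such a quadrant-based coding requires only the participant's screen dimensions; a minimal R sketch (variable names are hypothetical):
quadrant <- function(x, y, screen_w, screen_h) {
  horiz <- ifelse(x < screen_w / 2, "left", "right")
  vert  <- ifelse(y < screen_h / 2, "top", "bottom")
  paste(vert, horiz, sep = "_")  # e.g., "top_left"
}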
Magenta points indicate looks that were not categorized into an ROI.
Gaze was categorized based on which quadrant of the screen the coordinates fall in (as opposed to a bounding box around the image). Vertical lines indicate average onsets of animal and instrument offset by 200ms.
Categorizing gaze location based on which of the four quadrants of the screen the coordinates fell in (as opposed to a bounding box around the image) increased the overall proportions of fixations (see Figure 27). In the post-instrument window, participants’ proportions of looks to the target animal in the modifier-biased and equi-biased conditions were significantly greater than in the instrument-biased condition (b = 0.08, SE = 0.02, p < 0.01), and marginally greater in the modifier-biased condition relative to the equi-biased condition (b = 0.04, SE = 0.02, p = 0.05). Effect size estimates appeared somewhat larger and noise was somewhat reduced when using the quadrant categorization relative to the bounding box-based ROIs.
Discussion
As in Ryskin et al. (2017) and Snedeker and Trueswell (2004), listeners’ gaze patterns during sentences with globally ambiguous syntactic interpretations differed depending on the bias of the verb (i.e., modifier-, instrument-, or equi-). For modifier-biased verbs, participants looked more quickly at the target animal and less at the potential instrument than for instrument-biased verbs (and equi-biased verbs elicited a gaze pattern between these extremes). This pattern was stronger for those who achieved higher calibration accuracy and when quadrant-based ROIs were used compared to image-based ROIs.
Experiment 5
The fifth study was a replication attempt of Shimojo, Simion, Shimojo, and Scheier (2003), which found that human gaze is actively involved in preference formation. Separate sets of participants were shown pairs of human faces and asked either to choose which one they found more attractive or which they felt was rounder. Prior to making their explicit selection, participants were increasingly likely to be fixating the face they ultimately chose, though this effect was significantly weaker for roundness discrimination.
Note that Shimojo and colleagues compare five conditions, of which we replicate only the two that figure most prominently in their conclusions: the “face-attractiveness-difficult task” and the “face-roundness task”.
Method
All stimuli and experiment scripts are available on the Open Science Framework at https://osf.io/eubsc/. The study pre-registration is available at https://osf.io/tv57s.
Participants
Fifty participants for the main task were recruited on Prolific and were paid $10/hour. Eye gaze data failed to record for one participant (in the roundness task group), leaving 25 participants in the attractiveness condition and 24 in the roundness condition. The original sample size in Shimojo et al. (2003) was 10 participants total.
Materials and Design
The faces in our replication were selected from a set of 1,000 faces within the Flickr-Faces-HQ Dataset (the face images used in Shimojo et al. were from the Ekman face database and the AR face database). These images were chosen because the person in each image was looking at the camera with a fairly neutral facial expression and appeared to be over the age of 18. Twenty-seven participants were recruited on Prolific to participate in stimulus norming (for attractiveness). They each viewed all 172 faces and were asked to rate them on a scale from 1 (less attractive) to 7 (more attractive) using a slider. Faces were presented one at a time and in a random order for each participant. Data from three participants were excluded because their modal response made up more than 50% of their total responses, for a total of 24 participants in the norming.
Following Shimojo et al., 19 face pairs were selected by identifying two faces that 1) had a difference in mean attractiveness ratings that was 0.25 points or lower and 2) matched in gender, race, and age group (young adult, adult, or older adult).
Participants were presented with each of the 19 face pairs; task condition was manipulated between subjects, such that half of participants made judgments about facial attractiveness and the other half made judgments about facial shape.
Procedure
At the beginning of the experimental task, participants completed a 9-point eye-tracker calibration (each point appeared 3 times in random order) and 3-point validation. The validation point appeared once at center, middle left, and middle right locations in random order (see Figure 28).
Black points were used for calibration. Red crosses were used for checking the accuracy of the calibration.
During each trial of the main task, two faces were displayed on the two halves of the screen, one on the left and one on the right (as in Figure 29). In the attractiveness task, participants were asked to choose the more attractive face in the pair, and in the shape judgment task participants were asked to pick the face that appeared rounder. They pressed the “a” key on their keyboard to select the face on the left and the “d” key to select the face on the right. A fixation cross appeared in the center of the screen between each set of faces. Participants were asked to look at this fixation cross in order to reset their gaze in between trials. The order of the 19 face pairs was random for each participant.
Text did not appear on each screen.
Results
In the original study, a video-based eye-tracker was used: participants’ eye movements were recorded with a digital camera, downsampled to 33.3 Hz, and eye position was determined automatically with MediaAnalyzer software. In our study, participants supplied their own cameras, so hardware sampling rates varied; however, gaze estimates were recorded at 20 Hz.
Due to large variation in response latencies, Shimojo and colleagues analyzed eye gaze for the 1.67 seconds prior to the response. This duration was one standard deviation of the mean response time, ensuring that all analyzed timepoints had data from at least 67% of trials. In our dataset, one standard deviation amounted to 1.95 seconds. We binned eye gaze data into 50 ms bins rather than the 30 ms bins used by Shimojo and colleagues, reflecting the different sampling rates.
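Concretely, the response-locked binning amounts to the following (an R sketch; the data frame g and its column names are hypothetical):
library(dplyr)
binned <- g %>%
  mutate(t_rel = t - rt) %>%                 # sample time relative to the response (ms)
  filter(t_rel >= -1950, t_rel < 0) %>%      # last 1.95 s before the response
  mutate(bin = floor(t_rel / 50) * 50) %>%   # 50 ms bins
  group_by(condition, bin) %>%
  summarise(prop_chosen = mean(on_chosen), .groups = "drop")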
The proportion of fixations to the target in the two seconds prior to the judgment are shown for the attractiveness (red) and roundness (blue) conditions along with their best-fitting sigmoids.
Following Shimojo and colleagues, data for each condition were fit using a four-parameter sigmoid (Fig. 30). These fit less well than in the original paper for both the attractiveness judgment (R2 = 0.80 vs. 0.91) and the roundness judgment (R2 = 0.36 vs. 0.91).
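The four-parameter sigmoid can be fit with nonlinear least squares, for example using the binned data sketched above (an R sketch; starting values are illustrative only and would need tuning):
fit <- nls(prop_chosen ~ lower + (upper - lower) / (1 + exp(-(bin - midpoint) / scale)),
           data = subset(binned, condition == "attractiveness"),
           start = list(lower = 0.5, upper = 0.8, midpoint = -800, scale = 300))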
From these curves, Shimojo and colleagues focus on two qualitative findings. First, they note a higher asymptote for the attractiveness discrimination task relative to roundness discrimination. Qualitatively, this appears to replicate. However, their statistical analysis – a Kolmogorov-Smirnov test for the distance between two distributions – was not significant in our data (D = 0.17, p = 0.55), though it should be noted that this is a very indirect statistical test of the hypothesis and probably not very sensitive.
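For completeness, and assuming (as one plausible reading of the original analysis) that the test is applied to the bin-wise fixation proportions for the two conditions, the comparison reduces to the following (an R sketch; vector names are hypothetical):
ks.test(prop_attractiveness, prop_roundness)  # two-sample Kolmogorov-Smirnov test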
The second qualitative finding they note is that the curve for the roundness judgment “saturates” (asymptotes) earlier than the curve for the attractiveness judgment. They do not present any statistical analyses, but it is clear qualitatively that the result does not replicate.
Calibration
As in the previous experiments, the calibration score was defined as the average proportion of samples within 200 pixels of the validation points during the final validation phase before the eye-tracking task began. Where participants required more than one calibration (N = 14), only the final calibration was considered.
To determine whether calibration accuracy influenced our key effects, we calculated the proportion of samples during the task in which the participant was fixating the face they ultimately chose. Calibration accuracy significantly correlated with fixations in the attractiveness condition (r = 0.50 [0.12, 0.75], p = 0.01) but not the roundness condition (r = 0.25 [-0.18, 0.60], p = 0.25). Inspection of Fig. 31 reveals that this correlation is due to a handful of participants with calibration values below 50%.
Re-analysis After Exclusions
As in the previous experiments, we re-analyzed the data after removing participants whose calibration accuracy was not greater than 50%. This improved the fit of the roundness sigmoid (R2 = 0.59 vs. 0.36) and left the attractiveness fit essentially unchanged (R2 = 0.78 vs. 0.80). However, the difference between the sigmoids remained non-significant using the Kolmogorov-Smirnov test (D = 0.20, p = 0.33). Descriptively, the results do not look substantially different (Fig. 32).
The proportion of fixations to the target in the two seconds prior to the judgment are shown for the attractiveness (red) and roundness (blue) conditions along with their best-fitting sigmoids.
Effects of ROIs
In the original experiment, gaze samples that did not directly fixate one or the other of the faces were excluded. In this section we explore an alternative coding of the eye movement data, in which gaze is simply coded as falling on the left or right half of the screen. This coarser coding may be more appropriate for webcam-based eye-tracking.
Only a small percentage of samples (7.00%) involved looks to anything other than one of the two faces. Thus, not surprisingly, the correlation between the percentage of time spent fixating the to-be-chosen face using the ROI method and the halves method was near ceiling (r = 0.97 [0.97, 0.98], p < 0.001). Since the choice of method had almost no effect on whether participants were coded as fixating one face or the other, we did not further investigate the effect of method choice on the analytic results.
Discussion
Qualitatively, the results are similar to those of Shimojo et al., such that participants look more at the option that they ultimately choose. This gaze bias appears to be stronger for decisions about face attractiveness than shape, though this is not supported by the statistical analysis approach used in the original paper. The gaze patterns remained consistent for participants with better calibration accuracy.
General Discussion
We conducted five attempted replication studies using different experimental paradigms from across the cognitive sciences. All were successfully implemented in jsPsych using the webgazer plugin, though replication was not uniform.
Experiment 1 had the smallest ROIs due to the use of an integrated visual scene with five to six ROIs of varying size per scene, as opposed to ROIs corresponding to display halves or quadrants. Both attempts to replicate Altmann and Kamide (1999) were unsuccessful, despite the success of previous in-lab replications using infrared eye-tracking (e.g. James et al., 2023). A previous conceptual replication of this paradigm using webcam-based eye-tracking (Prystauka et al., 2023) was successful but used a four-quadrant visual world paradigm, rather than the “naturalistic” scenes used in the original study and in the current replication attempts. It is worth noting that removing variability related to participant environments (by conducting the webcam-tracking study in the lab) did not appear to improve the sensitivity of the paradigm. The primary limitation is likely to be the size of the ROIs.
Experiment 2 used the four quadrants of the participant’s screen as ROIs. As in Johansson and Johansson (2014) and Spivey and Geng (2001), participants spontaneously looked to blank ROIs which previously contained to-be-remembered pictures. These results appeared to be robust to calibration quality. An additional manipulation, instructing participants to keep gaze fixed on a central point, was not successful. One possibility is that participants are less motivated to follow such instructions when an experimenter is not present in the same room with them. It may be possible to improve performance by emphasizing that this is an important aspect of the experiment or by providing additional training/practice in keeping the eyes still on one particular point.
Experiment 3 used two large ROIs (halves of the display in one analysis) and successfully replicated the novelty preference in terms of gaze duration shown in Manns et al. (2000). However, the subtler relationship between gaze duration and recognition memory on Day 2 was not replicated, despite the fact that participants were able to discriminate pictures they had seen from those they hadn’t seen during that delayed test. Calibration quality did not appear to impact this relationship. More work is needed to understand whether delay manipulations can be practically combined with webcam eye-tracking.
Experiment 4 used the four quadrants of the participant’s screen as ROIs. As in Ryskin et al. (2017), listeners used knowledge of the co-occurrence statistics of verbs and syntactic structures to resolve ambiguous linguistic input (“Rub the frog with the mitten”). Across multiple time windows, participants looked more at potential instruments (mitten) when the verb (rub) was one that was more likely to be followed by a prepositional phrase describing an instrument with which to perform the action, as opposed to describing the recipient of the action (frog). Despite the qualitative replication of past findings, the overall rates of looks to various objects were much lower than in an in-lab study using infrared eye-tracking. This reduction may be related to measurement quality: effect sizes were greater for participants with higher calibration accuracy. Using the full quadrants as ROIs, rather than bounding boxes around the four images, also appeared to improve the measurement of the effect. Crucially, there was no evidence of a delay in the onset of effects relative to in-lab work, indicating that the modifications to webgazer that are made within the jsPsych plug-in successfully address the issues noted by Dijkgraaf, Hartsuiker, and Duyck (2017) and suggesting that this methodology can be fruitfully used to investigate research questions related to the timecourse of processing.
Experiment 5, similar to Experiment 3, used two large ROIs (or halves of the display). As in Shimojo et al. (2003) and in the recent webcam-based replication by Yang and Krajbich (2021), we saw that participants looked more at the face or shape that they ultimately chose during a judgment task. This gaze bias appears to be stronger for decisions about face attractiveness than shape, though this effect was not statistically significant.
The relatively large number of statistical comparisons means that we would not expect every test to come out significant even in the best-case scenario. Nonetheless, there appears to be some systematicity in the pattern of significant replications. Most of the failures to replicate seem to be attributable to small ROIs (e.g., Exp. 1), to effects that were small to begin with (e.g., Exp. 3), or to insensitive statistical tests (e.g., Exp. 5).
In sum, the webgazer plug-in for jsPsych can be fruitfully used to conduct a variety of cognitive science experiments on the web, provided the limitations of the methodology are carefully considered.
This plug-in (and potentially other webcam-based eye-tracking options) opens the door to including a broader, more diverse sample of participants in cognitive science research. In contrast to the typical in-lab eye-tracking sample, which consists primarily of young university students from WEIRD countries (Henrich et al., 2010; de Oliveira & Baggs, 2023; Linxen et al., 2021; Rad et al., 2018), webcam eye-tracking can reach a much more diverse population, as the only prerequisite is access to a computer with a webcam (a standard feature on most modern laptops and tablets). While webcam eye-tracking cannot by itself eliminate all demographic biases, some of which are in any case baked in by history (Henrich, 2020), it can help to reduce them, at least for studies amenable to the paradigm.
Lessons and Questions
The findings above suggest a few lessons for researchers wishing to conduct eye-tracking studies with a webcam, as well as highlighting a few important methodological questions that need to be addressed going forward.
First, it appears that current technology is sufficient for studies with relatively large ROIs of half or a quarter of the screen. While this encompasses a large number of common paradigms, it does not cover all of them, even within the Visual World Paradigm. It is possible that even with current technology, reasonable sensitivity could be achieved with slightly smaller ROIs, but it will require more systematic investigation to determine what exactly is feasible. Likewise, improving the technology may result in better precision.
Note also that while ROIs the size of quadrants are feasible, good calibration is important. Based on the results above, researchers using such paradigms may want to exclude data from participants with less than 50% validation accuracy. While stricter thresholds may decrease noise, they come at a cost. In preliminary analysis, we found that a threshold of 75% somewhat increased effect size in Exp. 4, but at the cost of reducing the sample size from 35 to 19. However, it could be helpful to more systematically investigate the optimal threshold and whether this varies by size of ROI.
There are a number of other interventions that might increase precision and should be investigated. For instance, it is possible to monitor head distance in real time (for an example, see Jiang et al., 2021); doing so and providing real-time feedback to participants should diminish noise. It may also be possible to automatically detect challenges to calibration accuracy (such as poor lighting) and provide targeted feedback. Similarly, it may be profitable to systematically study the parameters of eye gaze validation, for instance the number of validation points and the maximum distance allowed between gaze and target (set to 200px in the present study). These and other options should be investigated in order to determine how much they might improve precision.
Second, researchers need to take care with timing accuracy. Running studies on modern computers requires synchronizing multiple clocks, something that does not happen automatically. Failure to synchronize can result in substantial inaccuracies (Slim & Hartsuiker, 2022). As indicated by the present work, the current jsPsych implementation of WebGazer.js has timing synchronization sufficient for many studies. We urge researchers to avoid using the current release of WebGazer.js as is, and instead to use jsPsych, our patch to WebGazer.js (https://github.com/brownhci/WebGazer/commit/35301e56c192ef265bb270873a6526205f183711), or other software that has been specifically documented to address common timing issues in web-based eye-tracking.
Third, it is true for any eye-tracker that precise data depends on good calibration, and good calibration depends in part on lighting and background (see, for instance, Yüksel, 2023). We suspect that it is possible to give subjects automated feedback on how to improve their own calibration, which could result in higher-quality data and fewer subject exclusions based on poor calibration.
Contributions
Joshua R. de Leeuw
Conceptualization
Data curation
Formal analysis
Investigation
Methodology
Project administration
Software
Supervision
Validation
Visualization
Writing - original draft
Writing - review & editing
Ariel N. James, Rachel Ryskin, and Joshua K. Hartshorne
Conceptualization
Formal analysis
Visualization
Writing - original draft
Writing - review & editing
Undergraduate student authors, listed alphabetically: Haylee Backs, Nandeeta Bala, Laila Barcenas-Meade, Samata Bhattarai, Tessa Charles, Gerasimos Copoulos, Claire Coss, Alexander Eisert, Elena Furuhashi, Keara Ginell, Anna Guttman-McCabe, Emma (Chaz) Harrison, Laura Hoban, William A. Hwang, Claire Iannetta, Kristen M. Koenig, Chauncey Lo, Victoria Palone, Gina Pepitone, Margaret Ritzau, Yi Hua Sung, and Lauren Thompson
Investigation
Methodology
Software
Funding information
Funding for data collection was provided by the Vassar College Department of Cognitive Science (JdL). Partial funding for JKH’s contributions was provided by NSF2436092.
Ethics statement
The remote experiments (1a, 2-5) were conducted with approval by the Vassar College Institutional Review Board. The laboratory experiment (1b) was conducted with approval by the Boston College Institutional Review Board.
Competing interests
The authors have no competing interests to declare.
Footnotes
See also discussion at https://github.com/jspsych/jsPsych/discussions/1892
The WebGazer.js patch can be found at https://github.com/brownhci/WebGazer/commit/35301e56c192ef265bb270873a6526205f183711, and the fork can be found at https://github.com/jspsych/WebGazer.
lme4 syntax: lmer_alt(probability ~ object_type*verb_condition + (object_type*verb_condition | subject) + (object_type*verb_condition | scene))
lme4 syntax: lmer_alt(probability ~ object_type*verb_condition + (object_type*verb_condition | subject) + (object_type*verb_condition | scene))
lme4 syntax: lmer_alt(time ~ verb_condition*verb_duration + (verb_condition | subject) + (verb_condition | scene))
lme4 syntax: lmer(proportion ~ quadrant + (1 + quadrant | subject_id)). Among other limitations, this approach violates the independence assumptions of the linear model because looks to the four locations are not independent. This analysis was chosen because it is analogous to the ANOVA analysis conducted in the original paper.
lme4 syntax: lmer(DV ~ relation_type*viewing_condition + (1 | subject_id))
In the original study, the pictures appeared one by one on the screen and their names were played as they appeared. We removed this introductory portion of the trial to save time.
Unlike the original study, we recorded mouse movements instead of clicks, since not all of the audio prompts required clicking. For example, the sentence “locate the camel with the straw” may not involve any clicking but rather only mousing over the camel.
lme4 syntax: glmer(is.mouse.over.instrument ~ verb_bias + (1 + verb_bias | participant) + (1 | item), family="binomial", data=d)
lme4 syntax: lmer(prop.fix.target.animal ~ verb_bias + (1 | participant) + (1 | item), data=d). A model with by-participant varying slopes for verb bias condition was first attempted but did not converge.