Estimating Time: Comparing the Accuracy of Estimation Methods for Interval Timing



Introduction
In the growing research field of interval perception, the number of ways to measure subjective time is seemingly growing, too. As a researcher, one has to decide whether a task is retro- or prospective (e.g., Block et al., 2018), in which modality intervals are presented (e.g., auditorily or visually; Wearden et al., 2006), how exactly intervals are presented (e.g., filled or empty; Grondin, 1993), the paradigm used (e.g., temporal reproduction, production, bisection, or comparison; for a review, see Grondin, 2010; Wearden, 2016), and how responses are collected (e.g., verbal or motor responses; e.g., Block et al., 2018; Mioni, 2018). While subjective (distortions of) time perception may be captured no matter which choice is made regarding the listed options, what is often neglected in this choice are the potential differences in cognitive strategy, or in what representation of time underlies a given task.

One prominent idea is that time is represented in spatial terms (for a review, see Bender & Beller, 2014). Indeed, visuospatial representations of time are reflected in how we think and communicate about time, and also in how we process and act on time (Bonato et al., 2012; Núñez & Cooperrider, 2013). For example, time-related notions in language are often spatialized: the future lies ahead of us, we are looking back at earlier times, or the vacation was too short. The latter notion, how we process and act on time, is reflected in the commonly found Spatial-Temporal Association of Response Codes (STEARC) effect (Conson et al., 2008; Fabbri et al., 2012, 2013; Ishihara et al., 2008; Vallesi et al., 2008; Vicario et al., 2008; Weger & Pratt, 2008). The STEARC effect describes a space-related representation of time and temporal magnitudes, such that before/shorter responses have a processing or response advantage when associated with the left side of space, and, vice versa, after/longer responses show the same advantages when associated with the right side of space. This spatialization of time can also be observed in children as young as five years (Coull et al., 2018). Mental timeline theories in particular suggest that time is represented as a spatial linear axis that allows absolute (i.e., how long a stimulus lasted) and relative timing (i.e., temporal order; Bonato et al., 2012; Magnani & Musetti, 2017). The orientation of the timeline is heavily influenced by culture and experience, such as reading direction (e.g., English speakers, who read from left to right, map events on a timeline directed rightward, while Arabic speakers, who read from right to left, showed the reverse pattern; Boroditsky, 2001; Fuhrman & Boroditsky, 2010) or commonly used spatial metaphors to talk about time (e.g., Mandarin speakers use horizontal and vertical terms to temporally order events, while English speakers commonly use only horizontal terms; Boroditsky et al., 2011).

A number of neurobiological and cognitive models even suggest that space and time share their neural representation (e.g., A Theory Of Magnitude (ATOM): Walsh, 2003, 2015; hippocampal time and space cells: Buzsáki & Llinás, 2017), emphasizing the intertwinedness of these two dimensions. ATOM, for example, is based on 1) behavioral findings showing a tight link between spatial (size, length) and temporal magnitudes, in that spatial magnitude influences the perception of temporal magnitudes in a "more is more" fashion (i.e., more spatial magnitude is more temporal magnitude; Cai et al., 2018; Cai & Connell, 2016; Casasanto & Boroditsky, 2008; Xuan et al., 2007); and 2) neuroimaging studies revealing shared neural representations in the parietal cortex during the processing of spatial, numerical and temporal magnitudes (e.g., Bueti & Walsh, 2009; Dormal et al., 2012; Hayashi et al., 2013; Riemer et al., 2016). Adding to this theory, Coull & Droit-Volet (2018) highlight that explicit representations of time are not solely rooted in space but also in motor interactions with the world, which have a temporal and a spatial component. The authors offer a developmental account of how we construct a representation of time by performing actions in space during childhood (see also Loeffler et al., 2018).

Damsma, A., Schlichting, N., van Rijn, H., & Roseboom, W. (2021). Estimating Time: Comparing the Accuracy of Estimation Methods for Interval Timing. Collabra: Psychology, 7(1). https://doi.org/10.1525/collabra.21422 (Shared first authorship; a.damsma@rug.nl)
Assuming that time is indeed represented spatially or in an ATOM-like common magnitude system, an additional method to estimate intervals is the use of a timeline or visual analogue scale. While visuospatial estimation formats are commonly used in intentional binding studies (e.g., Haggard et al., 2002), to our knowledge, only a few interval timing studies have made use of them (e.g., Damsma et al., 2018; Roseboom et al., 2019). Apart from the more conceptual question of how exactly time may be represented in the brain, there are also practical issues regarding the implications of different response modes: so far, it has not been tested whether an explicit translation from time to space affects the precision and/or accuracy of temporal estimates compared to other commonly used estimation methods. In two separate experiments we aimed to test the advantages and disadvantages of using different estimates of time, namely reproductions in the time dimension, estimates in the spatial dimension, or estimates in a symbolic form.
In Experiment 1, participants estimated intervals by either pressing a button (motor reproduction), clicking on a timeline (timeline estimation), or giving a numerical estimate (verbal estimation). The to-be-estimated interval was a white square appearing and disappearing on a black screen. The results of Damsma et al. (2018) suggested that participants exhibit a response bias when using timeline estimations, seen in the avoidance of clicking close to the end of the line or screen. To test whether this bias can be prevented, participants performed one of two versions of this experiment: one in which the range of the timeline corresponded to the tested intervals, and one in which the timeline corresponded to intervals longer than the tested intervals. In other words, participants were either calibrated to the test durations or to slightly longer durations. When estimating an interval using a timeline or verbal estimates, participants can be more deliberate in their estimates (i.e., go back and forth in time) compared to motor reproductions, in which participants have only one chance to make an estimate. While intervals had a clear on- and offset and required no further processing steps in Experiment 1, we used a more complex temporal estimation task in Experiment 2.
Participants saw a stream of digits and one target letter and were subsequently asked first to estimate the onset of the target letter within the stream, and second to reproduce the duration of the complete stream, by either motor reproductions or timeline estimations. Again, half of the participants were calibrated to the test durations, while the other half were calibrated to longer durations. In this more complex setup, participants not only had to attend to and memorize one duration, but also had to attend to the content of the stream and memorize two durations. Timeline estimates may allow for relative timing (e.g., when did the target occur relative to the estimated offset), while motor reproductions require a strictly sequential order of interval reproductions. In both experiments we will compare accuracy (i.e., the estimations) and precision (i.e., the absolute error and the coefficient of variation (CV)) of temporal estimates. A common finding in temporal estimation tasks, especially in reproduction tasks, is that previously encountered intervals influence the perception of the current interval (also known as sequential context effects; for a review, see van Rijn, 2016). We will compare the magnitude of these context effects across temporal estimation methods and calibration conditions. If there is a cost to a potential spatial transformation, we expect that the timeline estimates show lower accuracy and/or precision than motor reproductions and verbal estimates. In addition, we expect that calibration with longer intervals may increase the accuracy of the estimates, especially for longer intervals, in the timeline estimation condition.

Methods
Participants. Sixty healthy adults (20 male, mean age 22.65 years) participated in exchange for course credits or a financial compensation of €8. All participants had normal or corrected-to-normal vision. Informed consent as approved by the Ethical Committee Psychology of the University of Groningen (identification number 17408-S-NE) was obtained before testing. Sample size was based on past research (e.g., Damsma et al., 2018; Schlichting et al., 2018); no statistical a priori power analysis was conducted.
Experimental Design and Procedure. Participants were asked to perform a temporal estimation task using three different estimation methods: motor reproduction, timeline estimation and verbal estimation. At the beginning of the experiment, participants were explicitly instructed not to keep track of time by, for example, counting or foot tapping. Stimuli were displayed on a 1920 × 1080 LED-based monitor screen (Iiyama ProLite G2773HS) with a refresh rate of 100 Hz.
The to-be-reproduced interval (equally spaced between 1 and 4 s in steps of 0.5 s) was presented at the beginning of the trial. The appearance of a white square (50 by 50 pixels) at the center of the screen marked the onset of the interval, and the disappearance of the square its offset. After a fixation period of 1 s, participants were asked to estimate the previously perceived interval in one of three ways: a) timeline estimation: participants were asked to click on a timeline at the point where the interval ended (1 pixel on screen corresponded to 0.01 s); apart from a tick marking the onset of the interval, no further spatial/temporal indications (ticks) were given; b) motor reproduction: the white square re-appeared, marking the onset of the reproduction, and participants were asked to press the spacebar to end the interval; c) verbal estimation: participants were asked to enter a numerical estimate in seconds with one decimal place. After the estimation, participants received immediate feedback on each trial (practice and experimental trials) in the form of a timeline (1 pixel on screen corresponded to 0.01 s). The feedback format was the same for all conditions to make the tasks as equal as possible. Two grey bars on top of the timeline depicted the on- and offset of the veridical interval, and two white bars below the timeline depicted the participant's estimate (i.e., both onset bars were always aligned). See Figure 1 for a schematic depiction of the experimental design. The experiment was run in Matlab R2014b (The MathWorks) using the Psychophysics Toolbox version 3.0.12 (Brainard, 1997) in Windows 10.
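To make the pixel-to-time mapping of the timeline concrete, the following Python sketch converts a click position into a duration estimate. Only the scale of 0.01 s per pixel comes from the text; the function names and the example pixel coordinates are hypothetical and are not taken from the original Matlab experiment code.

```python
# Minimal sketch of the timeline mapping, assuming a horizontal timeline
# with an onset tick at pixel column `onset_x` (a hypothetical name).
SECONDS_PER_PIXEL = 0.01  # from the text: 1 pixel corresponded to 0.01 s


def click_to_seconds(click_x: int, onset_x: int) -> float:
    """Convert a click at pixel column click_x into a duration estimate in seconds."""
    return (click_x - onset_x) * SECONDS_PER_PIXEL


def seconds_to_pixels(duration_s: float) -> int:
    """Length in pixels that a duration occupies on the timeline."""
    return round(duration_s / SECONDS_PER_PIXEL)


# Illustrative values only: a click 250 pixels to the right of the onset tick
# corresponds to an estimate of about 2.5 s.
example_estimate = click_to_seconds(click_x=710, onset_x=460)
```

Under this mapping, the longest test interval of 4 s spans 400 pixels, so the resolution of a single click (0.01 s) is finer than that of the one-decimal verbal estimates.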
The experiment was divided into three blocks (i.e., one block for each estimation method) of 42 experimental trials (i.e., four trials per duration) each. The order of blocks, and thus of estimation methods, was counterbalanced between participants. Before the start of each block, participants received instructions about the estimation method to be used in the upcoming block, and they performed 12 practice trials in order to get accustomed to the timeline and the estimation method before the start of the experimental trials. The order of trials was the same in each block, but varied between participants.
As a between-subjects manipulation, half of the participants were assigned to perform a calibrated version of the estimation tasks. In the calibrated version, the training trials also included intervals longer than those in the test trials (1.0, 2.5, 5.0, and 6.0 s), while in the uncalibrated version training trials were chosen from the range of intervals of the test trials (1.0, 2.0, 3.0, and 4.0 s). Importantly, this changed the length of the timeline in the feedback screen and also in the timeline estimation condition: in the calibrated version the timeline was longer, so that during test trials participants did not have to click as close to the end of the line to estimate the longest duration as they had to in the uncalibrated version. In both the calibrated and uncalibrated condition, the timeline was presented centrally on the screen. The experiment files can be found at https://osf.io/w38qg/.
Analysis. All estimates shorter than 0.2 s or longer than 10 s (0.34% of the data) and all trials in which no estimates were provided (0.34% of the data) were excluded from analysis. The estimates were analyzed using Linear Mixed Models (LMMs) from the lme4 package (Bates et al., 2015) in R (R Core Team, 2016). To compare the accuracy of the different conditions, we tested a model predicting estimates. In addition, we compared the precision of the conditions by testing models predicting the absolute error and the CV. Finally, we tested a model predicting reaction time. In each model, duration (i.e., the veridical duration of the interval), estimation method (motor, timeline or verbal), calibration condition (uncalibrated or calibrated) and their interactions were sequentially added as fixed factors. Only fixed factors that significantly improved the model according to a likelihood ratio test were included in the final model. To ensure the interpretability of significant interaction terms, the relevant main effects were also included in the model. In addition, the fixed factor duration was centered at 2.5 s and calibration condition was recoded using effect coding (-0.5 and 0.5 for uncalibrated and calibrated, respectively), to make the main effects of duration and estimation method easier to interpret. Participant was always included as a random intercept term. After establishing the final model, we sequentially added random slope terms, starting with the random slope that decreased the AIC value most. We tested whether the inclusion of each random slope term was warranted using likelihood ratio tests. Given the final model, we compared the three estimation methods with post-hoc contrasts using the glht function in the multcomp package in R (Hothorn et al., 2017).
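The preprocessing steps described above (exclusion, centering, and effect coding) can be sketched as follows. This is an illustrative Python sketch rather than the authors' R code, which is available at the OSF link; the trial records, field names and helper functions are hypothetical, and only the thresholds, the centering value and the coding values come from the text.

```python
# Sketch of the Experiment 1 preprocessing, under the stated assumptions.
CENTER_S = 2.5  # duration is centered at the middle interval (from the text)
CALIBRATION_CODE = {"uncalibrated": -0.5, "calibrated": 0.5}  # effect coding (from the text)


def keep_estimate(estimate_s):
    """Exclusion rule: drop missing estimates and estimates < 0.2 s or > 10 s."""
    return estimate_s is not None and 0.2 <= estimate_s <= 10.0


def preprocess(trials):
    """Return the retained trials with centered duration and effect-coded calibration."""
    cleaned = []
    for trial in trials:
        if not keep_estimate(trial["estimate"]):
            continue
        cleaned.append({
            "estimate": trial["estimate"],
            "duration_c": trial["duration"] - CENTER_S,
            "calibration_c": CALIBRATION_CODE[trial["calibration"]],
        })
    return cleaned


# Hypothetical example trials (not data from the study).
example_trials = [
    {"duration": 1.0, "estimate": 1.2, "calibration": "uncalibrated"},
    {"duration": 4.0, "estimate": 0.1, "calibration": "calibrated"},   # too short: excluded
    {"duration": 2.5, "estimate": None, "calibration": "calibrated"},  # missing: excluded
    {"duration": 3.5, "estimate": 3.1, "calibration": "calibrated"},
]
example_cleaned = preprocess(example_trials)
```

With duration centered at 2.5 s, the model intercept describes the estimate at the middle interval, and with calibration coded as -0.5 and 0.5, the remaining main effects describe the average over both calibration groups rather than the effect within one reference group.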
Using LMMs enables us to test both intercept effects (e.g., if estimates in one condition are systematically overestimated compared to another condition) and slope effects (e.g., if there is a stronger pull towards the mean in one condition, the slope will be lower than in other conditions).
Here, we will report the most important findings, but the complete analysis scripts and final model results can be found at https://osf.io/w38qg/.

Results
Estimates. Figure 2A shows the average estimates for each duration, estimation method and calibration condition. Model comparison showed that adding duration as a continuous fixed factor improved the basic model that included estimate as the dependent variable and subject as a random factor (χ²(1) = 7647.90, p < .001), indicating that, overall, estimates increased with the presented duration. In addition, estimation method and its interaction with duration improved the model fit (χ²(2) = 63.33, p < .001 and χ²(2) = 33.89, p < .001, respectively), showing that the intercept and slope of the estimates differed between estimation methods. Post-hoc contrasts showed that estimates (at the middle interval of 2.5 s) were longer for verbal estimates than for line estimates (β = 0.13, p = .002) and motor reproductions (β = 0.10, p = .005). In addition, the slopes of line estimates and motor reproductions were smaller than the slope of verbal estimates (β = -0.08, p < .001 and β = -0.09, p < .001), suggesting a larger central tendency effect for line estimates and motor reproductions. There was no evidence for intercept or slope differences between the line estimate and motor reproduction conditions (ps > .666). Adding calibration condition as a fixed factor did not improve the model fit (χ²(1) = 0.85, p = .361). However, we found a significant three-way interaction between duration, estimation method and calibration condition (χ²(2) = 6.13, p = .047). Post-hoc contrasts showed that the slope difference between the calibration conditions was larger for motor reproductions than for line estimates (β = 0.08, p = .031).
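The model comparisons reported above rest on likelihood ratio tests, in which twice the difference in log-likelihood between two nested models is compared against a χ² distribution with degrees of freedom equal to the number of added parameters. The Python sketch below illustrates this step for 1 and 2 degrees of freedom, where the upper-tail probability has a simple closed form. The function names are hypothetical; in the original analysis this comparison would have been carried out in R.

```python
import math


def lrt_statistic(loglik_reduced: float, loglik_full: float) -> float:
    """Likelihood ratio statistic for two nested models fitted by maximum likelihood."""
    return 2.0 * (loglik_full - loglik_reduced)


def lrt_p_value(chi2: float, df: int) -> float:
    """Upper-tail chi-square probability, using closed forms for df = 1 and df = 2."""
    if chi2 < 0:
        raise ValueError("chi2 must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2.0))
    if df == 2:
        return math.exp(-chi2 / 2.0)
    raise NotImplementedError("this sketch only covers df = 1 and df = 2")


# The three-way interaction reported above: chi2(2) = 6.13.
p_three_way = lrt_p_value(6.13, df=2)  # rounds to .047, matching the reported p-value
```

For other degrees of freedom, a general χ² survival function from a statistics library would be needed; the closed forms are used here only to keep the sketch dependency-free.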
Interestingly, Figure 2B shows that the error in the verbal estimations systematically diverged from a linear pattern: visual inspection suggests that it was lower for integer durations (1, 2, 3 and 4 s) than for the durations in between (1.5, 2.5, and 3.5 s). Post-hoc, we tested this notion by adding a dichotomous fixed factor indicating whether a duration was an integer to a model predicting the absolute error in the verbal estimation condition. Duration, calibration version and their interaction were also included as fixed factors. We found that this dichotomous fixed factor improved the model significantly (χ²(1) = 71.93, p < .001), indicating that the error was indeed lower for integer durations. This was not the case for the line estimations (χ²(1) = 3.46, p = .063) or the motor reproductions (χ²(1) = 0.83, p = .363).
Coefficient of Variation (CV). We calculated the CV per participant and presented duration as the standard deviation divided by the average estimate. Figure 2C shows the average CV for every presented duration for the different estimation and calibration conditions. Presented duration improved the model significantly (χ²(1) = 128.28, p < .001), showing that, overall, the CV was smaller for longer durations. We found no evidence that this negative slope differed between estimation conditions (χ²(2) = 4.26, p = .119). However, we found that the intercept (at 2.5 s) did differ between estimation conditions (χ²(2) = 50.75, p < .001): in line with the absolute error, the CV was larger for line estimates and motor reproductions compared to the verbal estimates (β = 0.06, p < .001 and β = 0.03, p = .007, respectively) and larger for line estimates compared to motor reproductions (β = 0.03, p = .004). We found no evidence for a difference between the calibration conditions (χ²(1) = 1.27, p = .260).
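The two precision measures used here, the absolute error and the CV, can be illustrated with a short Python sketch for one participant and one presented duration. The function names and the example estimates are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not specify which variant was used.

```python
from statistics import mean, stdev


def coefficient_of_variation(estimates):
    """CV: standard deviation of the estimates divided by their mean.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    return stdev(estimates) / mean(estimates)


def mean_absolute_error(estimates, veridical_s):
    """Average absolute deviation of the estimates from the veridical duration."""
    return mean(abs(e - veridical_s) for e in estimates)


# Hypothetical estimates (in seconds) for a presented duration of 2.5 s.
example_estimates = [2.0, 2.5, 3.0]
example_cv = coefficient_of_variation(example_estimates)
example_error = mean_absolute_error(example_estimates, veridical_s=2.5)
```

The two measures capture different things: the absolute error grows with both bias and variability, whereas the CV expresses variability relative to the mean estimate and is unaffected by a constant multiplicative over- or underestimation.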
End of the line effects. We expected that the calibration conditions would mostly affect the participants' tendency to not respond close to the end of the line. In this case, we would expect that calibration most strongly influences estimates of longer intervals, and that this effect would be most pronounced in line estimates. To test this hypothesis, we investigated the influence of calibration on the accuracy and precision of estimates of the longest interval (i.e., 4 s). An LMM predicting these estimates showed that they differed between estimation methods (χ²(2) = 17.89, p < .001). However, we found no evidence that calibration improved the overall estimates (χ²(1) = 3.06, p = .080), or that the calibration effect differed between estimation methods (χ²(2) = 0.27, p = .874). Looking at the precision, we also found that the absolute error at the longest interval differed between estimation methods (χ²(2) = 26.24, p < .001), and that the error was higher in the calibrated condition (χ²(1) = 4.61, p = .032). Although visual inspection of Figure 2B suggests that the effect of calibration condition was larger for the timeline estimations compared to the other methods, we found no evidence that this effect differed between estimation methods (χ²(2) = 4.29, p = .117). The CV showed a similar pattern: it differed between estimation methods (χ²(2) = 6.12, p = .047) and was higher in the calibrated condition (χ²(1) = 8.17, p = .004), but there was no evidence for a difference in the effect of calibration between estimation methods (χ²(2) = 4.26, p = .119). Overall, these results indicate that calibrating participants with longer durations did not improve the accuracy of the line, motor or verbal estimates, but did decrease their precision.
Sequential context effects. To test whether there were differences in sequential context effects between the estimation methods, we tested the impact of previously presented durations. We started with the LMM predicting estimated duration, including estimation condition, presented duration and their interaction as fixed factors. We gradually added previously presented durations (N-1, N-2, etc.) to the model as continuous fixed factors and tested whether they improved the model fit. We found that only the most recent previous trial (i.e., N-1) improved the model (χ²(1) = 75.28, p < .001), and that this factor differed between the estimation conditions (χ²(2) = 7.07, p = .029). Post-hoc contrasts showed that the effect of N-1 was larger for motor reproductions compared to verbal estimates (β = 0.04, p = .017). There were no other differences (ps > .239).
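The N-1 and N-2 predictors used in this analysis are simply the durations presented one and two trials earlier. The Python sketch below shows one way to construct such lagged predictors from a trial sequence. The function name, the column names and the handling of the first trials (which have no predecessor and receive a missing value) are assumptions for illustration, not a description of the authors' scripts.

```python
def add_lagged_durations(durations, max_lag=2):
    """Build one record per trial containing the current and previously presented durations.

    Trials without a predecessor at a given lag get None for that lag
    (an assumption; such trials would be dropped when the lag enters a model).
    """
    rows = []
    for i, duration in enumerate(durations):
        row = {"duration": duration}
        for lag in range(1, max_lag + 1):
            row[f"n_minus_{lag}"] = durations[i - lag] if i - lag >= 0 else None
        rows.append(row)
    return rows


# Hypothetical sequence of presented durations (s) within one block.
example_rows = add_lagged_durations([1.0, 3.5, 2.0, 4.0])
# example_rows[2] is {"duration": 2.0, "n_minus_1": 3.5, "n_minus_2": 1.0}
```

A positive coefficient for the N-1 predictor then means that the current estimate is pulled towards the duration of the preceding trial.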
Reaction time. Figure 2D shows the average reaction time (RT) for every estimation method and calibration condition. The model showed that the overall RT, and the change of RT with duration, differed between estimation methods (χ²(2) = 731.77, p < .001 and χ²(2) = 848.88, p < .001, respectively). Post-hoc contrasts showed that the RT at the intercept (i.e., the 2.5 s interval) was higher for the verbal compared to the line estimation (β = 0.49, p < .001) and higher for the motor reproductions compared to the verbal (β = 0.40, p < .001) and line estimation (β = 0.89, p < .001). Because the RT of motor reproductions increased with the presented duration, whereas the RT for verbal and line estimations was independent of the presented duration, the slope was larger for the motor reproduction method (β = 0.77, p < .001 and β = 0.79, p < .001, respectively). There was no difference in slope between the verbal and line estimation (p = .837). Adding the interaction between estimation method and calibration condition improved the model significantly (χ²(2) = 14.40, p < .001), but there were no significant differences in the final model (ps > .328).

Discussion
In Experiment 1, we compared three estimation methods in a simple interval estimation task. The results showed that the verbal estimates were overall more veridical than the motor and line estimates. We found no evidence for a difference in the accuracy of the motor and line estimates. Looking at the precision of estimation, we found that the CV decreased with the presented duration. This is a violation of Weber's law, or the scalar property of time perception, which states that the CV should be constant over different durations (although violations are frequent in the timing literature: see Grondin, 2014). Comparing the absolute error and the CV between estimation methods, we found that verbal estimates were most precise. Notably, however, this precision depended on the specific presented duration: it was higher for integer durations than for durations with a fractional part. In addition, motor reproductions were generally more precise than line estimates. Overall, these results suggest that there is no cost in accuracy to the potential spatial transformation required for line estimates, but there might be a small cost in precision, that is, in the variability of the estimates. Note, however, that any difference between motor reproductions and line estimates in particular may arise from differences in the amount of motor noise rather than from their underlying representation and translation into another dimension.
We expected that calibrating participants with a larger interval range and a longer corresponding timeline at the start of the experiment would diminish the underestimation of longer durations. However, we found no evidence that this calibration increased the overall accuracy, or the accuracy of the longest duration. Instead, we found a small cost in the precision of the longest interval estimates. These results suggest that calibration did not improve the timeline estimates by diminishing a potential end of the line bias. Alternatively, the end of the line effects here (and in Damsma et al., 2018) could reflect a general pull towards the mean in which the estimate is biased towards previously presented durations. Indeed, sequential context effects were observed in all three conditions. Verbal estimates were less affected by the duration of the previous trial, which can be explained by their generally higher accuracy and precision. According to the Bayesian view of perception, it is optimal to rely more on prior experience in making an estimate when the current observation is less precise (Acerbi et al., 2012; Jazayeri & Shadlen, 2010).
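The Bayesian argument can be made concrete with a deliberately simplified Python sketch. It assumes a Gaussian prior over durations and Gaussian measurement noise, in which case the optimal estimate is a precision-weighted average of the current measurement and the prior mean. This is a textbook simplification and not the specific model of the cited papers; all names and numerical values are hypothetical.

```python
def posterior_mean(measurement_s, sd_measurement, prior_mean_s, sd_prior):
    """Posterior mean for a Gaussian prior and Gaussian measurement noise.

    The weight on the measurement shrinks as the measurement gets noisier,
    so noisier observations are pulled more strongly towards the prior mean.
    """
    weight = sd_prior ** 2 / (sd_prior ** 2 + sd_measurement ** 2)
    return weight * measurement_s + (1.0 - weight) * prior_mean_s


# Hypothetical values: a 4 s interval with a prior centered on 2.5 s (SD 1 s).
precise_estimate = posterior_mean(4.0, sd_measurement=0.5, prior_mean_s=2.5, sd_prior=1.0)
noisy_estimate = posterior_mean(4.0, sd_measurement=1.0, prior_mean_s=2.5, sd_prior=1.0)
# The noisier measurement yields an estimate closer to the prior mean.
```

In this sketch, a more precise estimation method puts more weight on the current interval and less on previous experience, which is in line with the smaller sequential context effect observed for verbal estimates.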
When reproducing longer time intervals using a motor response, the duration of a trial scales with the interval to be reproduced. Line and verbal estimates, on the other hand, have the advantage of a stable response time of, in our experiment, around 1.5 and 2 s, respectively. Based on these results, we suggest that researchers can increase the number of trials in their experiment when testing longer intervals by using line or verbal estimates, and thereby increase statistical power.
Another potential advantage of timeline estimates over motor reproductions is that there is more time for a deliberate decision, compared to the 'one shot' approach of motor reproductions: in the latter case, participants are by definition unable to decrease their estimate at any point in time. This 'time asymmetry' might induce biases specific to motor reproduction, such as systematic under-reproduction (Riemer et al., 2012). In contrast, in the line estimation condition, participants can move the cursor freely to the left or right to decrease or increase their estimate. This could make timeline estimates more accurate in situations that are more complex than the reproduction of a single interval to which participants can fully direct their attention, such as when estimating an interval concurrently with other tasks (e.g., Brown, 1997, 2006; Zakay, 1993) or estimating multiple intervals (e.g., Brown & West, 1990; van Rijn & Taatgen, 2008). To test this notion, Experiment 2 consisted of a stream of stimuli that contained one target. Participants had to estimate both the target onset and the duration of the stream.

Experiment 2 Methods
Participants. Thirty-nine healthy adults (9 male, mean age 20.64 years) participated in exchange for course credits. None of the participants in Experiment 1 took part in Experiment 2. Informed consent as approved by the Ethical Committee Psychology of the University of Groningen (identification number 17054-S-NE) was obtained before testing. Sample size was based on past research (e.g., Experiment 1; Damsma et al., 2018; Schlichting et al., 2018); no statistical a priori power analysis was conducted.
Experimental Design. Participants were asked to perform a temporal estimation task using two different methods: motor reproductions and timeline estimations (Figure 3). At the beginning of the experiment, participants were explicitly instructed not to keep track of time by, for example, counting or foot tapping. Stimuli were displayed on a 1280 × 1024 CRT-based monitor screen (Iiyama Vision Master Pro 513) with a refresh rate of 100 Hz.
The interval was presented as a stream of numeric characters (1 to 9 characters in total) and one alphabetic character, the target (A, B, C, D, E, F, H, J, K, P, R, T, U, or V). The alphanumeric characters were presented in Arial with a font size of 16 pt. Within the stream, alphanumeric characters were chosen randomly, with the constraint that no two consecutive characters were the same. Participants were asked to estimate both the interval from stream onset to target onset and the duration of the total stream. There were six different total stream durations (4.75, 5.25, 5.75, 6.25, 6.75, and 7.25 s) and 11 positions where the target could occur from stream onset (1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, and 6 s). Target onset was chosen completely at random, that is, for some participants not all target positions may have occurred. Each alphanumeric character was presented for 0.25 s with 0.25 s between two successive characters, so that each stream consisted of 10 to 15 alphanumeric characters in total. The estimation methods were similar to Experiment 1, with the only difference that two responses were required. In the motor reproduction task, a first spacebar press corresponded to the time point of target occurrence, and a second spacebar press corresponded to the end of the stream. Similarly, a first mouse click on the timeline corresponded to target occurrence, and a second mouse click to the end of the stream in the timeline estimation task.

The experiment was divided into two blocks (i.e., one block for each estimation method) of 60 experimental trials (i.e., ten trials per duration) each. The order of blocks, and thus of estimation methods, was counterbalanced between participants. Before the start of each block, participants received instructions about the estimation method to be used in the upcoming block, and they performed 12 practice trials in order to get accustomed to the timeline and the estimation method before the start of the experimental trials. The order of trials was the same in each block but varied between participants.
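The relation between the number of characters and the stream durations described above can be checked with a few lines of Python. The presentation and gap durations of 0.25 s come from the text; the function names are hypothetical, and the assumption that the stream ends at the offset of the last character (with no trailing gap) is inferred from the reported durations.

```python
CHAR_ON_S = 0.25  # presentation time per character (from the text)
GAP_S = 0.25      # blank interval between successive characters (from the text)


def stream_duration(n_characters: int) -> float:
    """Total duration of a stream: n presentations separated by n - 1 gaps."""
    return n_characters * CHAR_ON_S + (n_characters - 1) * GAP_S


def character_onset(position: int) -> float:
    """Onset time of the character at a 1-based position, measured from stream onset."""
    return (position - 1) * (CHAR_ON_S + GAP_S)


# Streams of 10 to 15 characters reproduce the six reported stream durations.
stream_durations = [stream_duration(n) for n in range(10, 16)]
```

Under the same assumptions, the possible target onsets of 1 to 6 s in steps of 0.5 s correspond to the third to thirteenth character position in the stream.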
As in Experiment 1, half of the participants performed a calibrated version of the estimation tasks. In the calibrated version, the training trials also included overall intervals longer than those in the test trials (3.25, 6.25, and 9.75 s), while in the uncalibrated version training trials were chosen up to the longest duration of the test trials (2.25, 4.75, and 7.25 s). The experiment files can be found at https://osf.io/w38qg/.
Analysis. All estimates shorter than 0.2 s or longer than 11 s (1.95% of the data) were excluded from analysis. The longest durations of the target estimates (5, 5.5 and 6 s) were also excluded from analysis, because there were on average fewer than 4 trials per condition per participant, leading to unreliable calculations of the error and CV measures. The analysis procedure was similar to Experiment 1. Target and stream estimates were analyzed separately. The durations were centered at 2.75 s and 6 s for target and stream estimations, respectively. All categorical fixed factors were recoded using effect coding (-0.5 and 0.5), to facilitate the interpretation of main effects when interactions are included in the model. In the current experiment, there were two estimation methods (motor reproduction and timeline estimation) instead of the three methods in Experiment 1. Therefore, instead of post-hoc contrast results, we will report the β-coefficient and t-value of factors in the final LMM, as they are a direct representation of the difference between the estimation methods. The analysis scripts and results can be found at https://osf.io/w38qg/.
Total stream duration. The stream estimates also increased with the presented duration (χ²(1) = 792.99, p < .001; β = 0.57, t = 16.87), but here the slope was steeper for motor reproductions than for line estimates (χ²(1) = 7.13, p = .008; β = 0.09, t = 2.55). Model comparison showed a stronger effect of calibration for the line estimates compared to the motor reproductions (χ²(1) = 24.85, p < .001), although this effect did not reach significance after including random slopes in the final model (β = -0.35, t = -1.68, p = .101). In addition, the effect of calibration on the slope was larger for the line estimates compared to the motor reproductions (χ²(1) = 4.44, p = .035; β = -0.17, t = -2.31). Overall, these results suggest that stream estimates were more veridical for the motor compared to the line condition, but that calibration with a longer timeline decreased this difference.
Absolute error. Interval to target onset. Figure 4B shows the average absolute error for the estimations of different durations for each condition. The LMM showed that the absolute error of target estimates increased with duration (χ²(1) = 5.37, p = .020; β = 0.03, t = 2.57). Model comparison suggested that there was a difference between line estimates and motor reproductions (χ²(1) = 4.60, p = .032), but this effect was not significant after including random slopes (β = -0.06, t = -1.12, p = .271). Overall, calibration condition did not affect the absolute error.
Total stream duration. In line with the absolute error of the target estimates, the error of the stream estimates increased with duration (χ²(1) = 45.02, p < .001; β = 0.15, t = 9.60). In addition, model comparison showed an interaction effect of estimation method and calibration condition, but this effect did not remain significant in the final model (χ²(1) = 5.69, p = .017; β = 0.17, t = 0.96). However, the slope difference between the calibration conditions was larger for motor compared to line estimates (χ²(1) = 4.64, p = .031; β = 0.18, t = 3.00). The final model also revealed a steeper slope for the motor compared to the line condition (β = 0.07, t = 2.34).
Coefficient of Variation (CV). Interval to target onset. Figure 4C shows the CV for the different durations and conditions. We found that the CV decreased with duration (χ²(1) = 75.60, p < .001; β = -0.04, t = -8.98). We found no differences between the estimation methods or the calibration conditions.
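The two dependent measures reported here can be illustrated with a short sketch. It assumes the usual definitions: the CV is the standard deviation of the estimates for one duration divided by their mean, and the absolute error is the mean absolute deviation of the estimates from the presented duration. The use of the sample standard deviation (n - 1) and the function names are assumptions and are not taken from the authors' scripts.

```python
from statistics import mean, stdev


def coefficient_of_variation(estimates):
    """SD of the estimates divided by their mean (sample SD, n - 1)."""
    return stdev(estimates) / mean(estimates)


def mean_absolute_error(estimates, duration):
    """Average absolute deviation of the estimates from the presented duration."""
    return mean(abs(e - duration) for e in estimates)


# Hypothetical estimates (in seconds) of a single participant for a 2 s interval.
example_estimates = [1.0, 2.0, 3.0]
example_cv = coefficient_of_variation(example_estimates)
example_error = mean_absolute_error(example_estimates, 2.0)
```

Because both measures are computed per duration, condition, and participant, cells with very few trials yield unstable values.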
Sequential context effects. Interval to target onset. We started with the LMM established to predict the target estimates (including duration and estimation method as fixed factors). We then sequentially added previous target and stream durations. We found that target estimates were significantly influenced by target estimates in the previous trial (i.e., N-1; χ²(1) = 23.16, p < .001; β = 0.05, t = 4.82). This effect did not differ between the motor and line estimates. There was no significant effect of N-2 (χ²(1) = 2.65, p = .104). We also tested whether the stream estimates in the current trial influenced the target estimates, but there was no evidence that this was the case (χ²(1) = 0.89, p = .345).
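Adding N-1 and N-2 predictors requires lagged copies of the trial sequence. The sketch below shows one way to construct them for a single participant and block. The function and field names are hypothetical, and the handling of the first trials (no predecessor, marked as `None`) is an assumption.

```python
def add_lags(values, max_lag=2):
    """Pair each trial's value with the values of the preceding trials.

    Trials without a predecessor at a given lag get None, so that they
    can be dropped before fitting a model that includes that lag.
    """
    rows = []
    for i, current in enumerate(values):
        row = {"current": current}
        for lag in range(1, max_lag + 1):
            row[f"prev_{lag}"] = values[i - lag] if i >= lag else None
        rows.append(row)
    return rows


# Hypothetical sequence of target durations (in seconds) within one block.
lagged = add_lags([1.0, 2.0, 3.0])
```

The lagged columns can then be entered one at a time as fixed factors, mirroring the sequential model building described above.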

Discussion
In Experiment 2, participants were asked to reproduce the interval between the onset of an alphanumeric stream and a target letter in the stream as well as the end of the stream. The results suggest that motor reproductions had a slightly more veridical slope than line estimates, but only for the stream estimates. This can be explained by a tendency of participants to avoid the end of the line, leading to a relative compression of stream estimates, which need to be placed towards the end of the scale. This notion is also reflected in the effect of calibration: while calibration had no effect on motor reproductions, it improved the average accuracy of the stream estimates in the line condition. In line with Experiment 1, the CV decreased with duration, violating the scalar property. The overall precision of the motor reproductions and line estimates was similar; however, the variability increased more with the presented duration for motor reproductions. As in Experiment 1, we found that reaction times were stable over the different test durations and lower overall in the line condition.
We again found that previously perceived target or stream durations influenced target and stream estimates in the current trial, and there was no difference between estimation methods. There was also an effect of target onset on stream estimates, in that the later the target appeared, the longer the stream was estimated. One explanation is that participants use a sort of relative timing: if the target occurred relatively late, the duration of the stream was probably longer (see also van Rijn & Taatgen, 2008). In the line estimation condition, another explanation of this finding is that participants tend to keep their distance from the target estimates when making the second estimates on the stream duration, an effect that might be similar to the bias of avoiding the end of the scale. Thus, if the target occurred relatively late, the stream duration estimate will also be shifted later (see also Damsma et al., 2018, who show that estimates of the timing of targets in an attentional blink paradigm are not independent of each other). Interestingly, we found a difference between estimation methods in the effect of the previous target onset on stream estimates. This effect may have been more prevalent in the line estimation condition because of the strong visual representation in the line compared to the motor condition. Not only were participants able to see their target estimate in the line condition, but it was also potentially easier to incorporate the feedback of the previous trial because it was visualized in the exact same way as participants gave their estimates. Because of the increased task complexity in Experiment 2, the visual representation might have been taken more into account as compared to Experiment 1.

General discussion
In the current study, we compared the accuracy and precision of interval estimations using a visual analogue scale (or, a timeline) to non-spatial estimation methods (motor reproductions in Experiments 1 and 2 and verbal estimations in Experiment 1). If, regardless of estimation method, temporal estimates undergo the same or similar transformations, we expected to find no differences between the different estimation methods. If, on the other hand, a mental transformation from time to space is required, we would expect costs in accuracy and precision in the timeline estimates. In Experiment 1, we found similar accuracy for line estimates and motor reproductions, whereas the precision was higher for motor estimates. Verbal estimates seemed to lead to the most accurate and precise estimates. However, the pattern we found in absolute errors suggests that this estimation method comes with its own unique problems that we discuss further below. In the more complex paradigm of Experiment 2, we found that estimates were slightly more accurate for motor reproductions compared to timeline estimates, while the precision was similar.
Taken together, these results suggest that both motor reproduction and timeline estimation can be reliably used to measure subjective timing. This could potentially indicate that space and time have a similar neural representation (e.g., Walsh, 2003, 2015). For example, in the episodic memories of everyday events, hippocampal neurons might encode both space and time information (Buzsáki, 2019; Buzsáki & Llinás, 2017). Whether a common representation of space and time is inherent in low-level accumulation or only occurs at later stages in the representation or decision process (e.g., Anobile et al., 2018) remains an open question. Alternatively, it is possible that the transformation of duration into space has only a relatively minor cost, roughly equivalent to the effect of noise introduced by manual reproduction. Time may be represented in a sufficiently abstract way to make transformations to any other representational form effortless and equally accurate; or time is omnipresent in any neural process and readily available as input for other cognitive processes (Hass & Durstewitz, 2014).
In either case, however, it is important to note that both motor reproductions and line estimates might come with their own respective sources of noise. Models of the interval reproduction task often assume that variance is introduced in the motor response, independently of the variance in interval perception (Acerbi et al., 2012; Wearden, 2003). This additional motor noise would presumably decrease the precision of the estimates, relative to methods that do not require a motor response or methods that use alternative, less noisy motor responses such as eye movements. An additional factor that could lower precision of motor reproductions is the previously discussed 'one-shot' approach. Finally, given that the motor response might play an important role in the effect of previous durations on the current estimate (Roach et al., 2017), motor reproductions could show a larger pull towards the mean, decreasing accuracy. However, we found no difference in the effect of the previous trial on the current estimate between motor and line estimates. This is in line with recent evidence from our lab, suggesting that previous durations already affect perception itself (Damsma et al., 2020). Timeline estimation also has its own unique sources of variance. Participants first have to learn how exactly time translates into space when using a specific timeline. This means that even if time were represented spatially, a source of noise in line estimates could be that participants have to scale their spatial representation of time before giving an estimate. Because it is difficult to disentangle these sources of noise from noise in the representation of time, the differences in accuracy and precision between the estimation methods do not allow us to argue for or against the idea of a spatial time or general magnitude representation.
In Experiment 1, we found that verbal estimates were more accurate and precise than timeline estimates and motor reproductions, implying that verbal estimates are superior to other estimation methods. However, participants were encouraged to express their subjective estimate in familiar terms (in this study: seconds). This familiarity might come at a cost: We found that the verbal estimates displayed an inconsistent pattern of precision, in which rounded integer intervals were estimated with higher precision than non-integer intervals. This pattern can be explained in three ways: 1) the emphasis on 'seconds' might lead participants to think about time in terms of these pre-learned units, 2) the method might have encouraged participants to count (although they were explicitly instructed not to count), and 3) the method of report might have encouraged some participants to round their estimate to the nearest integer, without using the fractional part. Regardless of the origin of the precision pattern, the results indicate that verbal estimates might encourage participants to think about time in a less 'linear' and a more 'categorical' way. Indeed, this is in line with the idea that verbal estimates are "contaminated by linguistic and semantic tags associated with traditional units of time perception" (Hancock & Block, 2012). These hypotheses imply that verbal estimates might be less accurate or precise when, for example, a range of sub-second intervals is reproduced. Future studies might test this idea by comparing estimation methods in different interval ranges.
In both experiments, we found that the coefficient of variation decreased with duration. This is a violation of the scalar property, which states that the CV should be constant over estimated durations. While the scalar property holds true for certain time scales and paradigms, it has to be noted that violations of the scalar property have also commonly been observed in timing tasks (for a review, see Grondin, 2014). Indeed, several studies have shown a decrease in CV similar to the effect reported here (e.g., Laje et al., 2011; Lewis & Miall, 2009). A potential explanation for the decreasing CV is that the variance observed in the task has both time-dependent (variability associated with the timing mechanism) and time-independent components (such as motor variability), as noted above. Laje et al. (2011) showed that the decrease in CV could be captured by a generalized form of Weber's law, in which these components are explicitly modeled. Time-independent variance naturally depends on the nature of the estimation paradigm (for example, the extent to which the estimation method depends on motricity), explaining the differences in CV between estimation conditions in Experiment 1.
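One common way to write such a generalized Weber's law is sketched below. It is given as an illustration and not necessarily in the exact form used by Laje et al. (2011). The symbols are our notation: T is the timed interval, k a Weber-like constant for the time-dependent variability, and σ_indep the time-independent standard deviation. The sketch further assumes that the mean estimate is approximately equal to T.

```latex
\sigma^2(T) = k^2 T^2 + \sigma_{\mathrm{indep}}^2,
\qquad
\mathrm{CV}(T) = \frac{\sigma(T)}{T}
             = \sqrt{k^2 + \frac{\sigma_{\mathrm{indep}}^2}{T^2}}
```

Under these assumptions, the second term under the square root shrinks as T grows, so the CV decreases towards k for longer intervals, and a larger time-independent component produces a higher CV at short durations.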
The differences in the results of Experiments 1 and 2 might be due to the different paradigms. First, participants had to reproduce two intervals in Experiment 2 (i.e., target onset and stream offset), and only a single interval in Experiment 1. This makes the task more difficult, which results in a higher absolute error in Experiment 2 (see also Brown et al., 1992; Brown & West, 1990). In addition, the estimates of the first and second interval might not be completely independent: the results show that there is an intercept difference, accompanied by a 'local' pull towards the mean, dependent on whether the first or the second interval is reproduced (see also Damsma et al., 2018). These dependencies might be stronger when they are visually represented on a timeline compared to motor reproductions. Second, longer intervals were presented in Experiment 2, which would also decrease the precision, in line with the scalar property.
One explanation for the lower accuracy in the timeline estimates in Experiment 2 is an increased response bias (i.e., reluctance to use the end of the scale), because the timeline offers a more explicit physical range compared to motor reproductions. If this is the case, we would expect estimates to be more accurate when the interval range is artificially increased in pre-experiment calibration trials. Indeed, the results of Experiment 2 showed that calibrating participants with a larger range increased the accuracy for longer intervals (i.e., the stream duration estimates), with similar precision. In Experiment 1, the calibration neither improved the overall accuracy, nor the accuracy of the longer intervals. Overall, these results suggest that the range of the timeline should be taken into account, as a range that is larger than the actual test durations might reduce the response bias for longer intervals. In the current study, the resolution of the timeline was identical in the calibrated and non-calibrated condition (1 pixel on screen corresponded to 0.01 s). Future studies could test whether this property of the timeline affects accuracy and precision of estimates, especially if the range of test durations is much larger than in the current study. Additionally, participants in our experiments received feedback about their accuracy on a line in every estimation condition, to keep the conditions as similar as possible. This way of presenting feedback could potentially bias participants towards a spatial representation of time, or benefit performance for motor or line estimates differentially. Future studies might test this notion by removing the feedback or varying the feedback modality.
Overall, the results show that each estimation method comes with its own unique advantages and drawbacks. Line estimations offer the advantage of a stable response time, which can allow the researcher to increase the number of trials in supra-second interval estimation experiments (i.e., using intervals longer than ~1.5 s). However, compared to motor reproductions, there might be a small cost in accuracy, potentially because of a required spatial transformation. This difference might be overcome by calibrating participants with a suitable interval range. Motor reproductions offer an intuitive estimation method, but the response times scale linearly with the presented intervals. In addition, it is difficult to disentangle the precision of the actual temporal estimate from motor inaccuracies (Droit-Volet, 2010; Hallez et al., 2019). In Experiment 1, we showed that verbal estimates are more accurate and precise than line estimates and motor reproductions. However, the precision of verbal estimates depends on whether the interval is a whole integer, indicating a bias towards familiar whole second units. Although future research should further investigate the reliability of estimation methods in different timing experiments, the current study can point timing researchers to a more optimal estimation method given their specific paradigm.

Figure 1. Trial procedure of Experiment 1. Participants performed a simple temporal estimation task, in which they had to estimate the duration of a square in three ways: A) clicking on a timeline (line estimation), B) pressing a key to indicate the estimated offset of the interval (motor reproduction), and C) typing a verbal estimate in seconds (e.g., "1.4"; verbal estimation). Feedback was presented at the end of each trial.

Figure 2. A) Average estimates for the timeline, motor and verbal conditions and calibration conditions in Experiment 1. The grey dashed line represents veridical performance. B) Average absolute error of the timeline, motor and verbal estimates and calibration conditions. C) Average CV of the timeline, motor and verbal estimates and calibration conditions. D) Average reaction times (RTs) of the timeline, motor and verbal estimates and calibration conditions. While the RTs are stable over durations for the timeline and verbal estimates, motor reproductions of course scale with the presented duration. In all figures, error bars represent the standard error of the mean.

Figure 3. Trial procedure of Experiment 2. Participants were presented with a stream of numbers with one target letter. Their task was to estimate the interval from the beginning of the stream until the target onset, and also the total duration of the stream, by either A) clicking on a timeline (line estimates) or B) pressing a key at the estimated moments (motor reproduction). Feedback was presented at the end of each trial.
the timeline estimation task. Participants received immediate feedback similar to the feedback in Experiment 1, with two additional bars corresponding to the veridical and estimated target occurrence (or interval to target onset).

Figure 4. A) Average interval-to-target and total stream duration estimates for the timeline and motor conditions and calibration conditions in Experiment 2. The grey dashed line represents veridical performance.B) Average absolute error of the target and stream estimates for the timeline and motor conditions and calibration conditions.C) Average CV of the target and stream estimates for the timeline and motor conditions and calibration conditions.D) Average response time (RT) of the stream estimation for the timeline and motor conditions and calibration conditions.In all figures, error bars represent the standard error of the mean.