Response Time Concealed Information Test on smartphones

The Response Time-Based Concealed Information Test (RT-CIT) can reveal when a person recognizes a relevant (probe) item among other, irrelevant items, based on comparatively slower responding to the probe item. Thereby, if a person is concealing the knowledge about the relevance of this item (e.g., recognizing it as a murder weapon), this deception can be revealed. So far, the RT-CIT has been used only on desktop computers. In Experiment 1 (n = 72; within-subject), we compare the probe-irrelevant differences when using the conventional desktop-based CIT to using a smartphone-based CIT, demonstrating practical equivalence. In Experiment 2 (n = 116; within-subject), we demonstrate that using thumbs for responses (while holding the smartphone) leads to equally efficient CIT results as using conventional index finger responses. At the same time, this second experiment also demonstrates how smartphone-based studies may be efficiently run in large groups, using the participants’ own smartphones. Finally, as an interesting addition, here for the first time we also measured keypress durations (i.e., the time durations of holding down the response keys) in the RT-CIT, which we found to be significantly shorter for probe than for irrelevant items.

Undetected deception may lead to extreme costs in certain scenarios such as counterterrorism, pre-employment screening for intelligence agencies, or high-stakes criminal proceedings. However, meta-analyses have repeatedly shown that without special aid, based on their own best judgment only, people (including police officers, detectives, and professional judges) distinguish lies from truths on a level hardly better than mere chance (Bond & DePaulo, 2006; Hartwig & Bond, 2011; Kraut, 1980). Therefore, researchers have advocated special techniques that facilitate lie detection, among which the most prominent ones are information-elicitation interviewing techniques (e.g., Vrij & Granhag, 2012) and the use of technology (e.g., computerized tasks as in the present study).
One of the potential technological aids is the Concealed Information Test (CIT; Lykken, 1959; Meijer, Selle, Elber, & Ben-Shakhar, 2014). The CIT aims to disclose whether examinees recognize certain relevant items, such as a weapon used in a recent homicide, among a set of other objects, when they actually try to conceal any knowledge about the criminal case. In the response time (RT)-based CIT, participants classify the presented stimuli as the target or as one of several non-targets by pressing one of two keys (Seymour, Seifert, Shafto, & Mosmann, 2000; Suchotzki, Verschuere, Van Bockstaele, Ben-Shakhar, & Crombez, 2017; Varga, Visu-Petra, Miclea, & Buş, 2014). Typically, five non-targets are presented, among which one is the probe, which is an item that only a guilty person would recognize, and the rest are irrelevants, which are similar to the probe and, thus, indistinguishable from it for an innocent person. For example, in a murder case where the true murder weapon was a knife, the probe could be the word "knife," while irrelevants could be "gun," "rope," etc. Assuming that the innocent examinees are not informed about how the murder was committed, they would not know which of the items is the probe. The items are repeatedly shown in a random sequence, and all of them have to be responded to with the same response key, except for one arbitrary target: a randomly selected, originally also irrelevant item that has to be responded to with the other response key. Since guilty examinees recognize the probe as the relevant item in respect of the deception detection scenario, it will become unique among the irrelevants and in this respect more similar to the rarely occurring target (Lukács & Ansorge, 2019a). Due

Participants
The tests were conducted at a behavioral experiment laboratory of the University of Vienna, where 77 psychology students completed our experiment (to receive experiment participation credits for curriculum requirements). The experiment was run in a within-subject design: Each participant completed the Smartphone version once and the Desktop version once. The test was taken in groups of two: While one participant was first tested in the Smartphone condition, the other was first tested in the Desktop condition, after which they did the reverse. All participants were tested with their own personal first and last names as probes in the CIT task, simulating a guilty suspect trying to conceal the recognition of these two names (see, e.g., Lukács, Gula, Szegedi-Hallgató, & Csifcsák, 2017; Verschuere & Kleinberg, 2015).
The data from the intended first two participants were excluded immediately after the completion of the test, due to small technical issues. The preregistered number of 75 participants was collected subsequently. Out of these, two participants were excluded because, despite our warning, they entered a double first name (i.e., including a middle name) to be used as a probe. Our exclusion criteria were an accuracy rate not over 50% for targets or not over 75% for main items (probe or irrelevant items). There was only one related exclusion (due to too low target accuracy). This left 72 valid participants (age M ± SD = 21.14 ± 1.65; 12 male; 36 started with Smartphone).

Procedure
Before the beginning of the experiment, each participant read and signed an informed consent form, which also included the information that the following task simulates a lie detection scenario, during which participants should try to hide their identities.
For the CIT that came next, both the smartphone and the desktop applications were written in HTML5/JavaScript (for the use of this framework in RT tasks in general, see Reimers & Stewart, 2015; or in RT-CIT in specific, see Kleinberg & Verschuere, 2015). The Desktop version was run in Google Chrome (Version 70.0.3538), while for the Smartphone version the same code was adapted into the Ionic Framework to create a hybrid mobile application (built for Android; see e.g., Khandeparkar, Gupta, & Sindhya, 2015). The latter allowed the implementation of several useful native smartphone functionalities; in particular true full screen (no interfering notification or navigation bars) and automatic local storage of the resulting data.
The subsequent tests on the Smartphone and Desktop were identical, except for the following points. In case of the keyboard (Logitech K120 920-003626) of the desktop computer, a simple keypress was required: key "F" as response on the left, and key "K" as response on the right. (For corresponding response categories, see below.) In case of the touchscreen of the smartphone (Moto G5 XT1676), a tap was required as response: the touching of the screen (finger down) and releasing it (finger lifted up; within 300 ms of the touch start). The layout of the two response fields was designed to have approximately the same size (and form) as the surface of the keyboard keys. The distance between the left and right response fields was the same as the distance between the left and right keyboard keys (ca. 6.5 cm; surface size per field or key: ca. 2.4 × 2.2 cm). The keyboard and the smartphone were swapped in the same place after the first test, so that, in the second test, the position of the responses (keys or fields) remained the same. In case of the keyboard, the key letters ("F," "K") were mentioned only once in the beginning of the task; afterwards, same as in case of Smartphone, the responses were referred to only as left-side or right-side response. In either case, participants used their left and right index fingers to respond. In case of Desktop, the monitor was placed next to the keyboard, and the items in the CIT task appeared in the middle of its screen (37.5 × 30.0 cm). In case of the Smartphone screen (11.0 × 6.2 cm; always in horizontal mode), the items appeared above the response fields, in the top half of the screen (see Figure 1). Both screens were the same distance from the eyes of the participant (ca. 55 cm). Consequently, participants looked at the smartphone screen at a larger angle from horizontal (ca. 50°) than at the desktop screen (ca. 24°). Each detail of the rest of the description, as follows below, applies to both versions equally.
Participants entered their first names (along with gender) and last names, which then served as the two probe items in the task. 2 For each probe, five items were randomly chosen from a list of frequent German names (with corresponding gender for first names; 117 female and 138 male first names; 100 last names), out of which one was randomly chosen to serve as target, while the remaining four served as irrelevants. The random choice was restricted in that these five items had the closest possible character length to the given probe, and no two of the six items started with the same letter. Thus, for each participant, there were altogether 12 unique items: two probes, two targets, and eight irrelevants (all 12 identical in the Desktop and Smartphone conditions). We refer to the probes and irrelevants jointly as non-targets.
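The restricted random choice described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the study: the function name `pick_items`, the greedy closest-length strategy, and the example name list are all hypothetical.

```python
import random

def pick_items(probe, candidates, n_items=5, rng=random):
    """Pick n_items names close in length to the probe, all initials distinct.

    Assumption: "closest possible character length" is approximated greedily,
    with ties between equally close names broken at random.
    """
    pool = [c for c in candidates if c.lower() != probe.lower()]
    rng.shuffle(pool)  # random order among names of equal length distance
    # list.sort is stable, so the shuffled order is preserved within ties
    pool.sort(key=lambda name: abs(len(name) - len(probe)))
    chosen, used_initials = [], {probe[0].lower()}
    for name in pool:
        initial = name[0].lower()
        if initial in used_initials:
            continue  # no two of the six items may share a first letter
        chosen.append(name)
        used_initials.add(initial)
        if len(chosen) == n_items:
            break
    if len(chosen) < n_items:
        raise ValueError("not enough candidates with distinct initials")
    target = rng.choice(chosen)  # one of the five becomes the target
    irrelevants = [name for name in chosen if name != target]
    return target, irrelevants
```

For a probe such as "Lisa", the function returns one target and four irrelevants, none of which starts with "L" or shares an initial with another selected item.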
Next, participants were presented the two target names, and were asked to memorize these items in order to recognize them as requiring a different response during the following task. On the next page, participants were asked to recall the memorized items, and could proceed only if they entered these items correctly. If any of the entered items was incorrect, the participant received a warning and was redirected to the previous page in order to have another look at the same items.
During the task, the items were presented one by one on the screen (in 0.65 cm tall uppercase letters; in white font color on black background) and participants had to categorize them with one of the two response alternatives. Participants were told that the right-side response means "Yes," they recognize the item, while the left-side response means "No," they do not recognize the item; they were correspondingly instructed to say "Yes" to the targets, and "No" to all other, non-target words (i.e., both the irrelevants and the probes).
We implemented the simplest version of the RT-based CIT (single-probe protocol; see Verschuere, Kleinberg, & Theocharidou, 2015). We chose this version in order to focus exclusively on the most general aspect of the CIT, namely the recognition of a single relevant detail among irrelevant details, without any additional complexities that are involved in more efficient protocols (e.g., Lukács & Ansorge, 2019b; Lukács, Kleinberg, & Verschuere, 2017; Verschuere & Kleinberg, 2015).
More precisely, there are two obvious superior alternatives: the multiple-probe protocol (Verschuere et al., 2015) and the single-probe protocol enhanced with familiarity-related filler items (Lukács, Kleinberg, et al., 2017; or the even more complex related alternatives; Lukács & Ansorge, 2019b). The former mixes all item categories within the CIT (e.g., first and last names) randomly together in one block, while the latter includes additional familiarity- and unfamiliarity-referring items (e.g., "familiar" and "unknown"). Thus, in both cases the trial sequence randomization can result in an unequal distribution of preceding items for any given item (e.g., several targets preceding a probe), whose semantic priming effect may influence the response speed to the upcoming word (e.g., Foss, 1982; Meyer & Schvaneveldt, 1971), hence introducing theoretically uninteresting statistical noise in the data. Furthermore, the single-probe protocol has several practical advantages as compared to the multiple-probe protocol (as detailed by Lukács, Kleinberg, et al., 2017): applicability even in case of a limited number of probe items (Podlesny, 2003), compatibility with common test procedures and scoring algorithms (Krapohl, 2011), and the option of sequential testing to narrow down possibilities (Lukács, Kleinberg, et al., 2017); consequently, practitioners currently also consider the single-probe protocol to be the only viable option (Ogawa, Matsuda, Tsuneoka, & Verschuere, 2015). The addition of familiarity-related fillers, on the other hand, is also a recent development, and we preferred to use a well-established CIT protocol.
In sum, we found it best to use the simplest possible protocol, which is the single-probe protocol. We presume that the difference between the use of smartphone versus desktop involves very basic cognitive and behavioral processes (mainly: quite simply the pressing of a key vs. touching a touchscreen) that could hardly be affected by specific CIT versions; therefore, the outcome of the comparison can be subsequently extrapolated to any of the more complex designs. (Relatedly, to ensure that we still obtain large enough probe versus irrelevant effects to be compared between devices, we chose highly salient probes, namely, the participants' personal names.) During the comprehension check and the first practice task (see below), reminder captions were displayed: "Recognized?" (Erkannt?) at the top of the screen, and, in the lower part of the screen, "No" (Nein) on the left and "Yes" (Ja) on the right (Figure 1). Starting from the second practice task (and throughout the main blocks), these captions were not displayed anymore.
The inter-trial interval (i.e., between the end of one trial and the beginning of the next) always randomly varied between 300 and 800 ms. In case of a correct response, the next trial followed. In case of an incorrect response or no response within the given time limit, the caption "Wrong!" (Falsch!) or "Too slow" (Zu langsam!) appeared, respectively, below the stimulus in red color for 400 ms, followed by the next trial.
The main task was preceded by a comprehension check and two practice tasks. The check served to ensure that the participant had fully understood the task. All 12 items were displayed in a random order, and participants had plenty of time (10.5 s) to choose a response; however, each trial required a correct response. In case of an incorrect response, the participant immediately received corresponding feedback, was reminded of the instructions, and had to repeat this check. This check guaranteed that the eventual differences (if any) between the responses to the probe and the responses to the irrelevants were not due to misunderstanding of the instructions or any uncertainty about the required responses in the eventual task.
In the following first practice task, the response window was longer than in the main task (2 s instead of 800 ms), while the second practice task had the same design as the main task. Both practice tasks consisted of 12 trials (first the six items from one name category, then the six from the other; in the order of the main blocks; see below). In either practice task, in case of too few valid responses, the participants received corresponding feedback, were reminded of the instructions, and had to repeat the practice task. The requirement was a minimum of 60% valid responses (correct response between 150 and 800 ms) for targets and for main items (probes and irrelevants together).
The main task contained two blocks: one with first names only, and one with last names only (order counterbalanced across participants, but, for each participant, same order in each condition, Desktop and Smartphone). Each probe, irrelevant, and target was repeated 18 times in each block (hence, altogether 36 probe, 144 irrelevant, and 36 target trials). Within each block, the order of the items was randomized in groups: first, all six items (one probe, four irrelevants, and one target) in the given category were presented in a random order, then the same six items were presented in another random order (but with the restriction that the first item in the next group was never the same as the last item in the previous group).
After this test was completed in both conditions (Desktop and Smartphone), participants gave their demographic details and completed a very brief questionnaire regarding their alertness during the task (see Appendix A). Finally, participants were given more detailed information about the experiment and contact details for potential further inquiries. The experiment took about 30 min per session.
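The grouped randomization within a block could be implemented along the following lines. This is a minimal sketch, assuming a simple reshuffle-until-valid strategy; the function name `block_sequence` and the item labels are hypothetical and do not come from the original application.

```python
import random

def block_sequence(items, repetitions=18, rng=random):
    """Concatenate shuffled groups of all items for one block.

    Each group contains every item exactly once. A group is reshuffled
    until its first item differs from the last item of the previous group.
    Assumes at least two distinct items (otherwise the loop cannot end).
    """
    sequence = []
    for _ in range(repetitions):
        group = list(items)
        rng.shuffle(group)
        while sequence and group[0] == sequence[-1]:
            rng.shuffle(group)
        sequence.extend(group)
    return sequence

block = block_sequence(["probe", "irr1", "irr2", "irr3", "irr4", "target"])
print(len(block))  # 6 items x 18 repetitions = 108 trials per block
```

With six items and 18 repetitions, one block has 108 trials, and no item ever appears twice in a row.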

Data Analysis
We conducted preregistered analyses, except where explicitly noted otherwise.
For the main questions, the dependent variable was the probe-to-irrelevant RT mean difference (i.e., probe RT mean minus irrelevant RT mean, per each participant), which was compared between the Desktop and Smartphone conditions with three statistical tests: (a) a simple t-test to test the potential difference, (b) Bayesian likelihood ratio to test the null hypothesis, and (c) a two one-sided t-test (TOST) procedure, as a frequentist approach for testing the equivalence, with equivalence bounds of d = -0.4 and d = 0.4 (see below). Following the suggestion of a reviewer of a previous version of this manuscript, for probe-irrelevant RT means we report Spearman-Brown split-half reliability coefficients (Brown, 1910; Eisinga, Grotenhuis, & Pelzer, 2013; Spearman, 1910; for CIT, Kleinberg & Verschuere, 2015).
While in the RT-CIT usually only RT means are used as predictors (for guilty-innocent classifications), probe-to-irrelevant differences are often observed to some extent in accuracy rates as well, and therefore may be of interest (in particular, see Lukács & Ansorge, 2019b; Lukács, Gula, et al., 2017). Consequently, the three tests above were repeated with probe-to-irrelevant accuracy rate differences (i.e., probe accuracy rate minus irrelevant accuracy rate, per each participant), in place of RT means, as dependent variables.
In the preregistration, we only mentioned comparing keypress- and touch-durations (from here on, we designate these collectively as hold-durations), and the potential effects of self-reported alertness, as potential exploratory analyses. Here, we specify that, for hold-durations, we chose an analysis of variance (ANOVA) with the two factors Trial Type (probe vs. irrelevant) and Device (Desktop vs. Smartphone). 3 Regarding the alertness questionnaire, we tested the correlations of the aggregated ratings, in case of desktop and smartphone separately, with (a) probe-to-irrelevant RT mean differences, and (b) probe-to-irrelevant accuracy rate differences; these analyses are reported in Appendix A.
Finally, as a preliminary assessment of the potential incremental benefit of hold-durations, we report exploratory binary logistic regression analysis combining RT means and hold-durations, and present illustrative simulated areas under the curves (AUCs; see below) based on the fitted values. For each simulation, to represent the hypothetical "innocent" (or "naive") suspect's data, we generated a sample of 1,000 values in perfect normal distribution with a mean of zero (see the R script uploaded to the OSF repository). In case of each predictor, we used the SD of the same predictor from the real participants' data. 4 For example, the probe-irrelevant RT mean SD in the Desktop condition was 27.1 ms; hence, the simulated data consisted of 1,000 normally distributed values with SD = 27.1 (and a mean of zero).
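As a brief illustration of the split-half reliability mentioned above, the Spearman-Brown correction takes the correlation r between two half-test scores and returns 2r / (1 + r). The sketch below assumes that each participant's probe-to-irrelevant RT difference has already been computed separately for two halves of the trials; how the trials were split (e.g., odd vs. even trials) is not specified here, and the function names are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_brown(r):
    """Step the half-test correlation r up to full test length."""
    return 2 * r / (1 + r)

def split_half_reliability(half_a, half_b):
    """half_a, half_b: per-participant probe-irrelevant differences per half."""
    return spearman_brown(pearson(half_a, half_b))
```

For example, a half-test correlation of .50 corresponds to a corrected reliability of about .67.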
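One way to generate such a "perfectly normal" sample is to take evenly spaced quantiles of a normal distribution and rescale them to the desired SD. The sketch below is written in Python for illustration only; it is an assumption about how such data can be produced and not a transcription of the R script in the OSF repository. The function name `simulate_innocents` is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def simulate_innocents(sd, n=1000):
    """Return n values following a normal shape with mean 0 and the given SD.

    Evenly spaced quantiles of the standard normal distribution are used
    instead of random draws, then rescaled so the sample SD equals sd.
    """
    standard = NormalDist()
    quantiles = [standard.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = sd / stdev(quantiles)
    return [q * scale for q in quantiles]

innocents = simulate_innocents(27.1)  # SD taken from the Desktop example above
print(round(mean(innocents), 6), round(stdev(innocents), 6))
```

Because the values are deterministic quantiles rather than random draws, repeated runs give identical simulated data.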

Bayesian analysis
We report Bayes factors using the default r-scale of 0.707 (Morey & Rouder, 2018). The Bayes factor is a ratio between the likelihood of the data fitting under the null hypothesis and the likelihood of fitting under the alternative hypothesis (Jarosz & Wiley, 2014; Wagenmakers, 2007). For example, a Bayes factor (BF) of 3 means that the obtained data is three times as likely to be observed if the alternative hypothesis is true, while a BF of 0.5 means that the obtained data is twice as likely to be observed if the null hypothesis is true. Here, for more readily interpretable numbers, we denote Bayes factors as BF10 for supporting the alternative hypothesis, and as BF01 for supporting the null hypothesis. Thus, for example, BF01 = 2 again means that the obtained data is twice as likely under the null hypothesis as under the alternative hypothesis. Typically, BF = 3 is interpreted as the minimum likelihood ratio for "substantial" evidence for either the null or the alternative hypothesis (Jeffreys, 1961).

TOST
In the TOST procedure, the null hypothesis, analogous to a simple t-test, is the presence of a true difference in either direction, with the effect sizes specified as the equivalence bounds, in our case d = 0.4 in either direction. If the p value for the one-sided t-tests examining either direction (or both) is below the alpha level (.05), we can assume that, in the given direction, there is no difference larger than the specified effect size (Lakens, 2017; Schuirmann, 1987). 5 As described in our preregistration, the conventional medium effect size of d = 0.5 has been shown, in previous studies, to be a reasonable practical indication of substantially increased CIT efficiency (e.g., Lukács & Ansorge, 2019b; Lukács, Kleinberg, et al., 2017; Verschuere et al., 2015). Therefore, for an insubstantial difference, we chose a somewhat lower effect size. To note, we aim to reveal whether there is an equivalence within these bounds of d = -0.4 and d = 0.4, but this is not to say that differences smaller than that are always fully negligible in all respects; however, this is a reasonable estimation for the potential usefulness of the smartphone-based alternative of the RT-CIT.
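The logic of the two one-sided tests can be sketched for a within-subject comparison as follows. This is a simplified illustration under several assumptions: the effect size is standardized by the SD of the per-participant difference scores, and the p values use a normal approximation instead of the t distribution (to keep the example free of external libraries), so the results will differ slightly from those of a proper t-based TOST. The function name `tost_paired` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def tost_paired(diffs, bound_d=0.4, alpha=0.05):
    """Two one-sided tests on per-participant condition differences.

    Assumptions: bounds are d * SD of the differences; p values come from
    a normal approximation rather than the t distribution.
    """
    n = len(diffs)
    sd = stdev(diffs)
    se = sd / sqrt(n)
    bound_raw = bound_d * sd  # equivalence bound in raw units
    t_upper = (mean(diffs) - bound_raw) / se  # H0: difference >= +bound
    t_lower = (mean(diffs) + bound_raw) / se  # H0: difference <= -bound
    z = NormalDist()
    p_upper = z.cdf(t_upper)       # small if clearly below the upper bound
    p_lower = 1 - z.cdf(t_lower)   # small if clearly above the lower bound
    return {
        "t_upper": t_upper, "p_upper": p_upper,
        "t_lower": t_lower, "p_lower": p_lower,
        "equivalent": max(p_upper, p_lower) < alpha,
    }
```

Equivalence is concluded only when both one-sided tests are significant, which corresponds to the 90% CI of the effect lying entirely within the bounds.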

AUCs
To illustrate the potential efficiency of discriminating between guilty and innocent suspects, we calculated AUCs (a diagnostic efficiency measure, for binary classification, that takes into account the distribution of all predictor values; Rice & Harris, 2005; Zou, O'Malley, & Mauri, 2007) for receiver operating characteristics (ROCs). The AUC can range from 0 to 1, where .5 means chance level classification, and 1 means flawless classification (i.e., all guilty and informed innocent classifications can be correctly made based on the given predictor variable, at a given cutoff point).
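For illustration, the AUC can be computed directly as the proportion of guilty-innocent pairs in which the guilty value is larger (ties counted as half), which is equivalent to the area under the ROC curve. The sketch below is a minimal pure-Python version; the function name `auc` and the example values are hypothetical.

```python
def auc(guilty, innocent):
    """Probability that a random guilty value exceeds a random innocent one."""
    wins = 0.0
    for g in guilty:
        for i in innocent:
            if g > i:
                wins += 1.0
            elif g == i:
                wins += 0.5  # ties count as half
    return wins / (len(guilty) * len(innocent))

# Hypothetical probe-irrelevant RT differences (ms)
print(auc([35.0, 50.0, 12.0], [-5.0, 3.0, 20.0]))
```

A value of .5 is obtained when the two distributions overlap completely, and 1 when every guilty value is larger than every innocent value.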

Effect sizes
To demonstrate the magnitude of the observed effects, for F-tests we report generalized eta squared (ηG²) and partial eta squared (ηp²) with 90% CIs (Lakens, 2013). We report Welch-corrected t-tests (Delacre, Lakens, & Leys, 2017), with corresponding Cohen's d values as standardized mean differences and their 95% CIs (Lakens, 2013). In case of TOST, we also report 90% CIs to show the effect size bounds at alpha level. We used the conventional alpha level of .05 for all statistical significance tests.
For all analyses, RTs below 150 ms were excluded. For RT analyses, only correct responses were used. Accuracy was calculated as the number of correct responses divided by the number of all trials (after the exclusion of those with an RT below 150 ms).
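The exclusion and aggregation rules above can be sketched as follows for the trials of one item type. The data layout (a list of `(rt_ms, correct)` pairs, with `rt_ms` set to `None` when no response was given in time) and the function name `summarize_trials` are assumptions made for this illustration; in particular, treating no-response trials as retained incorrect trials is an assumption.

```python
from statistics import mean

def summarize_trials(trials, min_rt=150):
    """Return (mean RT of correct responses, accuracy rate) for one item type.

    trials: list of (rt_ms, correct) pairs; rt_ms is None for no response.
    """
    # Drop anticipatory responses; keep no-response trials as incorrect ones.
    kept = [(rt, ok) for rt, ok in trials if rt is None or rt >= min_rt]
    correct_rts = [rt for rt, ok in kept if ok and rt is not None]
    accuracy = len(correct_rts) / len(kept)
    return mean(correct_rts), accuracy

rt_mean, accuracy = summarize_trials(
    [(120, True), (400, True), (500, True), (450, False), (None, False)]
)
print(rt_mean, accuracy)
```

The probe-to-irrelevant difference for a participant is then the probe value minus the irrelevant value, computed separately for RT means and accuracy rates.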

Results
Aggregated RT mean, accuracy rate, and hold-duration, for the different stimulus types in each condition (Desktop and Smartphone), are given in Table 1, along with related effect sizes.

Exploratory analysis: Logistic model-based predictors
Using probe-irrelevant RT mean differences and probe-irrelevant hold-duration differences as two potential predictors in a logistic regression model, we fitted values in order to assess the potential incremental value of hold-durations in predicting the (simulated) conditions of guilt and innocence. The assessment of goodness-of-fit revealed a significant improvement relative to a constant-only model for both conditions, Desktop: χ 2 ( 2

Discussion
Using a single-probe protocol RT-CIT with the participants' first and last names as probes, Experiment 1 has shown that, as hypothesized, the smartphone-based version can be as efficient as the desktop-based version. There is, however, an additional aspect of using a smartphone as compared to a desktop computer: namely, the touchscreen of the smartphone can also easily be operated with thumbs instead of index fingers, which allows more mobility for the user, and which is in fact the more common and natural way for smartphone usage in general (e.g., Azenkot & Zhai, 2012; Bröhl, Mertens, & Ziefle, 2017). Several studies have shown that using index finger responses instead of thumbs can lead to different results: In particular, it has been consistently found that more accurate general input (mainly: typing) can be given using index fingers (Buschek, De Luca, & Alt, 2016; Lehmann & Kipp, 2018; Wang & Ren, 2009; Wobbrock, Myers, & Aung, 2008). However, results have been mixed regarding speed differences, which seem to depend on the particular input type and study design (Azenkot & Zhai, 2012; Goel, Jansen, Mandel, Patel, & Wobbrock, 2013; Lehmann & Kipp, 2018; Wobbrock et al., 2008). In any case, to our knowledge, no studies have explored the potential effect of this difference in a regular experimental RT task yet, let alone in the RT-CIT. Therefore, in Experiment 2, we compared the conditions of using index fingers (Index finger condition) and of using thumbs (Thumb condition), to see whether the latter is at least as efficient as the former (i.e., yielding at least as large probe-to-irrelevant differences).
In Experiment 1, we also had a novel, exploratory finding: shorter hold-durations for probe compared to irrelevant items. This finding was significant in both conditions, though with a rather small effect, especially in the Smartphone condition. Therefore, an additional reason for Experiment 2 was to replicate this novel finding; with a larger sample size, and now, based on Experiment 1, with a data-based prediction before the experiment.
Finally, in Experiment 2, we also demonstrate how smartphone-based experiments may be efficiently run in larger groups, using the participants' own smartphones.
(2) Bayesian hypothesis tests, and (3) a TOST procedure with equivalence bounds of d = -0.4 and d = 0.4. Again, these three tests were repeated with probe-to-irrelevant accuracy rates as dependent variables.
We preregistered the testing of hold-duration between probe and irrelevant items using a one-sided t-test, expecting shorter durations for probes (based on Experiment 1), along with a complementary Bayesian analysis. Here, we further specify that we do this separately for Index finger and Thumb conditions (and designate it as exploratory analysis). To provide more justification for this, we first perform an ANOVA, similarly as in Experiment 1, with the two factors Trial Type (probe vs. irrelevant) and Hand-position (Index finger vs. Thumb), to show whether there is an interaction.
Regarding correlation tests of the physical screen size with probe-to-irrelevant RT mean differences and with probe-to-irrelevant accuracy rate differences, see Appendix B.
Finally, as in Experiment 1, we report exploratory logistic regression analysis combining RT means and hold-durations, and present illustrative simulated AUCs.

Results
Aggregated RT mean, accuracy rate, and hold-duration, for the different stimulus types in each condition (Index finger and Thumb), are given in Table 2, along with related effect sizes.

RT means
The t-test between the probe-to-irrelevant RT means of Index finger and Thumb conditions indicated no significant difference, with Bayesian hypothesis testing supporting the null hypothesis; t(115) = 0.19, p = .848, d within = 0.02, 95% CI [-0.16, 0.20], 90% CI [-0.13, 0.17], BF01 = 9.53. The TOST showed that the 90% CI of the effect is well within the equivalence bounds (d = -0.4 and d = 0.4): The one-sided t-test against the upper bound (null hypothesis of larger probe-to-irrelevant RT means for Index finger than for Thumb) was significant, t(115) = -3.57, p < .001, as well as the one against the lower bound (null hypothesis of larger values for Thumb), t(115) = 3.22, p < .001. The reliability coefficients were ρ = .739 for Index finger, and ρ = .608 for Thumb.

Accuracy rates
Unlike in case of RT means, the t-test between the probe-to-irrelevant accuracy rates of Index finger and Thumb indicated a small but statistically significant difference, though with an inconclusive BF; t(115) = 2.02, p = .046, d within = 0.19, 95% CI [0.00, 0.37], 90% CI [0.03, 0.34], BF01 = 1.37. The TOST again showed that the 90% CI of the effect is within the equivalence bounds (d = -0.4 and d = 0.4): The one-sided t-test against the upper bound (null hypothesis of larger values for Index finger) was significant, t(115) = -2.29, p = .012, as well as the one against the lower bound (null hypothesis of larger values for Thumb), t(115) = 6.33, p < .001. This altogether means that the accuracy rate difference between probe and irrelevant was statistically shown to be significantly larger in case of the Thumb condition, but at the same time, based on our predefined equivalence bounds, this difference is not of notable practical relevance.

Discussion
In this second experiment we have shown that the hand position (using index fingers vs. thumbs for responses) plays no role in the results of the RT-CIT, at least regarding RT means. Regarding accuracy rates, we have shown that there is a small difference, in that slightly higher probe-to-irrelevant accuracy rate differences are found when using thumbs. This may be because using thumbs, as opposed to index fingers, is more sensitive to tasks requiring accuracy, and more error-prone in general (Buschek et al., 2016; Lehmann & Kipp, 2018; Wang & Ren, 2009; Wobbrock et al., 2008). This aspect could be explored in the future.
However, probe-to-irrelevant accuracy rate differences are in any case generally low in the RT-CIT, and have only rarely been used as predictors of guilt, but even in those cases only as secondary predictors. Nonetheless, if this aspect may be of any interest in the future, the method can still very well be always used with thumbs, as there is no general opposing reason or practical limitation. Note also that we have shown the equivalence, for the same accuracy rate differences, between the desktop and smartphone using index fingers. Consequently, the larger differences when using thumbs can only be an improvement (in respect of guilty-innocent predictions) as compared to the regular desktop version.
Our previous finding of shorter hold-durations for probes than for irrelevant items was successfully replicated in this second experiment when index fingers were used (p < .001). At the same time, this difference was absent in case of using thumbs. We have strong statistical support for this finding through both the ANOVA interaction (larger probe-to-irrelevant differences for Index finger; p = .004) and the Bayesian likelihood supporting the null finding (probe-to-irrelevant differences in case of Thumb; BF01 = 9.31). We also see a reasonable explanation for this. People are much more used to tapping with thumbs, as required by smartphone applications (for which usually thumbs are used): Touchscreens typically have a specific required hold-duration, only at the end of which is the given function executed (e.g., opening a folder). Hence, participants may be more strongly adapted to thumb taps, which are thereby more resistant to minor influences such as the probe-to-irrelevant differences in the RT-CIT. Nonetheless, this finding was not explicitly expected prior to the study and, therefore, would deserve further research.

General Discussion
In the present study, we have shown that the Response Time-Based Concealed Information Test (RT-CIT) can be used just as well on a smartphone as on a desktop computer.Before real life use, replication studies would be advisable, in particular field settings, and also including more efficient CIT protocols (Lukács, Kleinberg, et al., 2017;Verschuere et al., 2015).However, it already appears to be a valid method for various potential applications, facilitating the use of CIT in any situation where desktop computers are not available, limited, or impractical: such as border control (e.g., mass screening for the detection of country of origin 10 ), pre-employment screening via remote interviews (where the smartphone application could automatically verify the device ID or phone number), or an immediately available test for appropriate investigating authorities, such as those in the police force, or in the military, at battlefronts (cf. the "handheld polygraph" of the U. S. Army; Dedman, 2008;Gordon, 2017;National Research Council, 2010; United States Office of the Secretary of Defense, 2018).
While not directly related to the main question of our study, we included an exploratory analysis in our first experiment on keypress- and touch-durations as topics relevant to other smartphone-based studies as well (e.g., Buschek, De Luca, & Alt, 2015; Goel et al., 2013). We found shorter durations for probes (i.e., when participants saw their own names), and replicated this finding in the second experiment (though only when using index fingers for touchscreen taps, and not when using thumbs). Compared with the use of RT mean alone, the combination of RT mean with hold-duration as a model-based predictor led to noteworthy increases in classification efficiencies (AUCs) in two out of the four cases.
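To make the AUC measure mentioned above concrete, the following is a minimal sketch of how an AUC can be computed from any single predictor (e.g., a probe-minus-irrelevant RT difference). It uses the rank-based definition: the probability that a randomly chosen "guilty" score exceeds a randomly chosen "innocent" score. The function name `auc` and the example values are hypothetical illustrations, not the data or the exact simulation procedure of the present study.

```python
from itertools import product


def auc(guilty_scores, innocent_scores):
    """Area under the ROC curve, computed as the proportion of
    (guilty, innocent) pairs in which the guilty score is larger.
    Ties count as half a win."""
    pairs = list(product(guilty_scores, innocent_scores))
    wins = sum(1.0 if g > i else 0.5 if g == i else 0.0 for g, i in pairs)
    return wins / len(pairs)


# Hypothetical probe-minus-irrelevant RT differences in ms (not study data).
guilty = [45.0, 30.0, 12.0, 60.0]
innocent = [5.0, -8.0, 12.0, 0.0]

print(auc(guilty, innocent))  # 0.96875
```

An AUC of .5 corresponds to chance-level classification and 1.0 to perfect separation. A combined predictor (such as a model-based combination of RT mean and hold-duration) could be passed to the same function in place of the raw differences.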
One reason for the duration differences could be that the lifting of the fingers corresponded to a second response and that some of the delay in the probe conditions was used to plan this second response in a sequence of responses consisting of key press and release (see Verwey, 1995). As a post-hoc test for this hypothesis, we calculated the correlations of response times and hold-durations per individual: These correlations were on average very weak (all correlation means between -.08 and .02) for both probe and irrelevant trial types in both conditions in both experiments, making the proposed hypothesis unlikely. Another potential explanation, however, is that participants perhaps felt their delay in the probe trials in general (see Corallo, Sackur, Dehaene, & Sigman, 2008) and made an effort to compensate for the delay by a swifter key release. It could be interesting to explore whether this phenomenon appears in other RT tasks that contain target items, or any sort of response conflict or interference (e.g., Stroop tasks, cuing tasks, etc.).
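The per-individual correlation check described above can be sketched as follows. This is only an illustration under stated assumptions: the data layout (a dictionary mapping each participant to a list of `(rt, hold_duration)` trial pairs), the function names, and the example values are all hypothetical, and the sketch uses plain Pearson correlations averaged across participants. In practice, the function would be applied separately to each trial type and condition.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def mean_individual_r(trials_by_participant):
    """Correlate RT with hold-duration within each participant,
    then average the resulting coefficients across participants."""
    rs = []
    for trials in trials_by_participant.values():
        rts = [rt for rt, _ in trials]
        holds = [hold for _, hold in trials]
        rs.append(pearson_r(rts, holds))
    return mean(rs)


# Hypothetical trial data in ms (not study data): one participant with a
# perfectly positive relation, one with a perfectly negative relation.
example = {
    "p1": [(400, 100), (450, 110), (500, 120)],
    "p2": [(400, 120), (450, 110), (500, 100)],
}

print(mean_individual_r(example))
```

A mean coefficient near zero, as in this toy example, would indicate that slower responses are not systematically followed by longer or shorter holds within individuals.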
Finally, while the feasibility of RT tasks on smartphones has been suggested before (Burke et al., 2017; Kay et al., 2013; Schatz et al., 2015), we provide strong evidence that such results can be practically equivalent to the ones obtained on regular computers. As demonstrated in our second experiment, using the participants' own smartphones, data can be easily collected in groups of 10-20, requiring nothing but an empty classroom. In the future, entire studies, with, say, over a hundred participants gathered in an auditorium, could be conducted this way within half an hour, with no equipment needed by the researchers. This would be a great advantage especially for less wealthy, less well-equipped universities and research institutes anywhere in the world.

Limitations
Our probe versus irrelevant effect sizes and simulated classification rates probably do not accurately reflect those that would be obtained in real-life cases. While the personal relevance of the presented self-related autobiographical details arguably also resembles the relevance of real-life incriminating items, the extent of applicability is yet to be explored. In a specific situation very similar to the one simulated in the present study, authorities may test the true identity of the person, in which case the results may be assumed comparable to those in our study (regarding higher stakes at hand, see Kleinberg & Verschuere, 2016). This is, however, likely not a frequent case. The relevance of the more probable crime-related items (such as a murder weapon), to which the various emotions related to the actually committed crime (guilt, suspense, etc.) may contribute, would be very difficult to simulate in a controlled experiment, and may require field studies in the future. In general, a proper assessment of classification efficiency would need more realistic settings than highly controlled laboratory studies such as the present one and, indeed, most RT-CIT studies.
Importantly, the primary aim of this study was to assess whether the smartphone-based CIT could be as efficient as the desktop-based one, and this comparison does not depend on precise or realistic demonstration of classification efficiencies. There is, however, one finding that could be substantially influenced by these biases: Namely, the incremental contribution of the novel hold-duration measure. This measure is, as we explicitly stated, an exploratory finding whose efficiency, usefulness, and mechanism should be assessed in future studies.

Conclusions
In the present study, using a single-probe protocol RT-CIT with the participants' first and last names as probes, we have (a) demonstrated that the smartphone-based version can be used just as well as the desktop-based version, using either index fingers or thumbs for responses (thus, simply holding the device in the hand), (b) shown that responses to probes compared to irrelevants in the RT-CIT have shorter keypress- and touch-durations, a difference that may be used as an additional predictor of concealed knowledge, and which may be explored in other psychological tests as well, and (c) demonstrated a large-group experimental procedure using participants' smartphones, which may be adopted for any computerized tasks for fast and costless data collection in future studies.

U. A. and B. K, proofread by M. K. All authors approved the submitted version for publication.

Figure 1: Example screenshot from the first practice phase of the Smartphone-based CIT, with "MICHAEL" displayed as the item awaiting a response. (In the following practice and main tasks, the reminder captions [Erkannt? - "Recognized?"; Nein - "No"; Ja - "Yes"] were no longer displayed.)

Table 1: RT Means, Accuracy Rates, and Hold-Durations, in Experiment 1. Means and SDs (in the format of M ± SD) for individual RT means, accuracy rates, and hold-durations; for Probe (participants' own names), Irrelevant (other names), Target (the designated irrelevant details that require a different response), P - I (individual probe minus irrelevant values); separately for the Smartphone and Desktop computer conditions. Cohen's d effect sizes (as d PvsI) and simulated AUCs for the probe-to-irrelevant differences are given under each respective column.

Table 2: RT Means, Accuracy Rates, and Hold-Durations, in Experiment 2. Means and SDs (in the format of M ± SD) for individual RT means, accuracy rates, and hold-durations; for Probe (participants' own names), Irrelevant (other names), Target (the designated irrelevant details that require a different response), P - I (individual probe minus irrelevant values); separately for the Thumb (using thumbs) and Index (using index fingers) conditions. Cohen's d effect sizes (as d PvsI) and simulated AUCs for the probe-to-irrelevant differences are given under each respective column.