Repeated Retrieval Practice to Foster Students’ Critical Thinking Skills

There is a need for effective methods to teach critical thinking. Many studies on other skills have demonstrated beneficial effects of practice that repeatedly induces retrieval processes (repeated retrieval practice). The present experiment investigated whether repeated retrieval practice is effective for fostering critical thinking skills, focusing on avoiding biased reasoning. Seventy-five students first took a pretest. Subsequently, they were instructed on critical thinking and avoiding belief-bias in syllogistic reasoning and engaged in retrieval practice with syllogisms. Afterwards, depending on the assigned condition, they (1) did not engage in extra retrieval practice; (2) engaged in retrieval practice a second time (one week later); or (3) engaged in retrieval practice a second (one week later) and a third time (two weeks later). Two or three days after the last practice session, all participants took a posttest consisting of practiced tasks (to measure learning relative to the pretest) and non-practiced (transfer) tasks. Results revealed no significant difference between pretest and posttest learning performance as judged by the mean total performance (MC-answers + justification), although participants were, on average, faster on the posttest than on the pretest. Exploring performance on MC-answers only suggested that participants did benefit from instruction/practice but may have been unable to justify their answers. Unfortunately, we were unable to test effects on transfer due to a floor effect, which highlights the difficulty of establishing transfer of critical thinking skills. To the best of our knowledge, this is the first study that addresses repeated retrieval practice effects in the critical thinking domain. Further research should focus on determining the preconditions of repeated retrieval practice effects for this type of task.


Introduction
One of the most valued and sought-after skills that higher education students are expected to learn is critical thinking (CT). CT is key to effective thinking about difficult issues, weighing evidence, determining credibility, and acting rationally, which is essential for succeeding in future careers and for being efficacious citizens (Billings & Roberts, 2014; Davies, 2013; Halpern, 2014; Van Gelder, 2005). The concept of CT can be expressed in a variety of definitions, but at its core, CT is "good thinking that is well reasoned and well supported with evidence" (H. A. Butler & Halpern, 2020, p. 152). One key aspect of CT is the ability to avoid biases in reasoning and decision-making (e.g., West et al., 2008), referred to as unbiased reasoning. Bias is said to occur when people rely on heuristics (i.e., mental shortcuts) during reasoning prior to choosing actions and estimating probabilities, resulting in systematic deviations from ideal normative standards (i.e., derived from logic and probability theory; Stanovich et al., 2016; Tversky & Kahneman, 1974). As biased reasoning can have serious consequences in both daily life and complex professional environments, it is essential to teach CT in higher education (e.g., Koehler et al., 2002).
Nonetheless, while some effective instructional approaches for learning CT have been identified, it is still unclear which methods are most effective in supporting the ability to transfer what has been learned (Halpern & Butler, 2019; Heijltjes et al., 2015; Heijltjes, Van Gog, Leppink, et al., 2014; Ritchhart & Perkins, 2005; Tiruneh et al., 2014, 2016; Van Peppen et al., 2018; Van Peppen, Verkoeijen, Kolenbrander, et al., 2021). Transfer is the process of applying one's prior knowledge or skills to related materials or some new context (e.g., Barnett & Ceci, 2002; Cormier & Hagman, 2014; Haskell, 2001; Perkins & Salomon, 1992; Salomon & Perkins, 1989). There are some insights into fostering transfer of CT-skills to isomorphic tasks (in this study referred to as learning; e.g., Heijltjes, Van Gog, Leppink, et al., 2014), but not into transfer to novel tasks that share underlying principles but have not been previously encountered (e.g., Heijltjes et al., 2015; Heijltjes, Van Gog, Leppink, et al., 2014; Van Peppen et al., 2018; Van Peppen, Verkoeijen, Kolenbrander, et al., 2021). As it is crucial that students can successfully apply the CT-skills acquired at a later time and to novel contexts/problems, and it would be unfeasible to train students on each and every type of reasoning bias they will ever encounter, more insight is needed into the conditions that yield not only learning of CT-skills but also transfer.
Previous research has demonstrated that to establish learning and transfer, learners have to actively construct meaningful knowledge from to-be-learned information, by mentally organizing it in coherent knowledge structures and integrating these with their prior knowledge (Bassok & Holyoak, 1989; Fiorella & Mayer, 2016; Gick & Holyoak, 1983; Holland et al., 1989; Wittrock, 2010). This, in turn, can aid future problem solving (Kalyuga, 2011; Renkl, 2014; Van Gog et al., 2019): if a situation presents similar requirements and the learner recognizes them, they may select and apply the same or a somewhat adapted learned procedure to solve the problem. One of the strongest learning techniques known to promote the construction of meaningful knowledge structures is having students retrieve to-be-learned material from memory, known as practice testing or retrieval practice (e.g., Dunlosky et al., 2013; Fiorella & Mayer, 2015; Roediger & Butler, 2011). The effect of retrieval practice seems to be extremely robust (for reviews, see Carpenter, 2012; Delaney et al., 2010; Moreira et al., 2019; Pan & Rickard, 2018; Rickard & Pan, 2017; Roediger & Butler, 2011; Roediger & Karpicke, 2006b; Rowland, 2014), emerging on measures of both learning and transfer, and with different kinds of materials and test formats (e.g., A. C. Butler, 2010; Carpenter & Kelly, 2012; McDaniel et al., 2012, 2013; Rohrer et al., 2010).

Repeated Retrieval Practice
The effect of retrieval practice seems to be positively related to the number of successful retrieval attempts during practice (e.g., Rawson & Dunlosky, 2011; Roediger & Karpicke, 2006a), albeit with diminishing returns. For example, in Experiment 2 from the study by Roediger and Karpicke (2006a), participants either studied a prose passage multiple times (SSSS condition), studied a prose passage multiple times and took one free recall retrieval practice test (SSST condition), or studied a prose passage once and took a free recall retrieval practice test thrice (STTT condition). Subsequently, a delayed final free recall test on the prose passage was administered in all conditions. The results on this final free recall test showed that taking a single retrieval practice test increased the free recall performance relative to the control condition from a mean score of 40% correct to a mean score of 56% correct. Furthermore, repeated retrieval practice (i.e., the STTT condition) increased the free recall performance to a mean of 61% correct, hence showing diminishing returns for extra retrieval practice. That is, where a single retrieval practice test in the SSST condition lifted final test performance by 16 percentage points, the two additional retrieval practice tests increased the final test performance by only 5 percentage points. These diminishing returns of repeated retrieval practice might be due to the fact that the practice testing effect depends not only on the number of successful retrieval attempts but also on the effort that is required to successfully retrieve information from memory. According to the retrieval effort hypothesis (e.g., Pyc & Rawson, 2009), the effect of retrieval practice becomes larger when successful retrieval attempts require more effort. When information is repeatedly retrieved from memory, the effort associated with successful retrieval is likely to decrease, which will lead to diminishing returns of repeated retrieval practice.
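The diminishing returns in this example can be made explicit with a few lines of arithmetic. The sketch below uses only the three mean scores quoted above; the variable and function names are illustrative and not taken from the original study.

```python
# Mean final free-recall scores (percent correct) as quoted in the text above.
# Keys count the retrieval practice tests taken: 0 = SSSS, 1 = SSST, 3 = STTT.
mean_recall = {0: 40.0, 1: 56.0, 3: 61.0}


def marginal_gains(scores):
    """Return (from_tests, to_tests, gain in percentage points) for successive conditions."""
    tests = sorted(scores)
    return [
        (prev, nxt, scores[nxt] - scores[prev])
        for prev, nxt in zip(tests, tests[1:])
    ]


for prev, nxt, gain in marginal_gains(mean_recall):
    # Divide by the number of added tests to get the gain per extra test.
    per_test = gain / (nxt - prev)
    print(f"{prev} -> {nxt} tests: +{gain:.0f} points ({per_test:.1f} per extra test)")
```

The first test adds 16 percentage points, whereas the two further tests add 5 points together, i.e., 2.5 points per extra test.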
Despite the potential of repeated retrieval for learning, its impact has not been investigated in research on CT. Therefore, the present study sought to determine whether repeated retrieval practice is beneficial to foster learning of CT-skills as well, and whether it can additionally facilitate transfer. For educational practice, it is relevant to identify the most efficient schedule from among those that achieve a desired level of durability. While the majority of studies were conducted in laboratory settings, the current study was conducted as part of an existing CT-course using educationally relevant practice sessions (multiple practice tasks within a session) and retention intervals (days/weeks). To the best of our knowledge, this is the first study that investigated the effects of repeated retrieval practice in the CT domain.

The Present Study
Participants first completed a pretest including syllogistic reasoning tasks (for an overview of the study design, see Figure 1), which examined their tendency to be influenced by the believability of a conclusion when evaluating the logical validity of arguments. Thereafter, they received instructions on CT in general and on syllogisms in particular. Subsequently, they engaged in retrieval practice with these tasks on domain-specific problems. Depending on condition, participants (1) did not engage in extra retrieval practice with these tasks (practice once); (2) engaged in retrieval practice a second time (one week later; practice twice); or (3) engaged in retrieval practice a second (one week later) and third time (two weeks after the second time; practice thrice). Subsequently, all participants completed a posttest including practiced tasks (i.e., syllogistic reasoning tasks; measure of learning) and non-practiced tasks (i.e., Wason selection tasks; measure of transfer) two or three days after their last practice session. Participants had to indicate after each test and practice item how much effort they invested on that item, and time-on-task was logged during all phases. Furthermore, they were asked after each practice session to assess how well they thought they understood the practice problems (i.e., global judgment of learning; JOL) to gain insight into the added value of extra practice according to the students themselves. Previous research has demonstrated that students' JOLs are related to their learning strategies and study time (i.e., monitoring learning processes; e.g., Koriat, 1997; Nelson et al., 1994; Zimmerman, 2000) and, thus, may indirectly contribute to performance enhancement.
We hypothesized that explicit CT-instructions combined with retrieval practice would be effective for learning: thus, we expected an overall mean pretest to posttest performance gain on learning items in all conditions (Hypothesis 1). Furthermore, and more importantly, we expected that practicing retrieval twice would lead to a higher pretest to posttest performance gain on learning items (Hypothesis 2a) and a higher posttest performance on transfer items¹ (Hypothesis 3a) than practicing retrieval once. We expected that practicing retrieval thrice would lead to a higher pretest to posttest performance gain on learning items (Hypothesis 2b) and a higher posttest performance on transfer items (Hypothesis 3b) than practicing retrieval twice. However, as outlined before, prior research suggests that additional retrieval practice will have diminishing returns on the final test, so we expected these differences to be smaller than the differences between practicing retrieval once and twice.
To gain more insight into the effectiveness (i.e., higher performance) and efficiency (i.e., performance relative to the investment of mental effort or time; Van Gog & Paas, 2008) of repeated retrieval practice on learning and transfer, we additionally compared the practice conditions, in an exploratory fashion, on invested mental effort on test items, time-on-test, and JOLs.

Method
The hypotheses and complete method section were preregistered on the Open Science Framework (OSF). All data, script files, and materials (in Dutch) are available on the project page that we created for this study (https://osf.io/pfmyg/).

Participants and Design
Participants were all first-year 'Safety and Security Management' students attending a Dutch University of Applied Sciences (N = 103). Eleven students did not complete the posttest and two students completed the posttest a week late; these students were therefore excluded from the analyses (as this may have influenced the results). Seventeen participants were excluded because of non-compliance, i.e., when more than half of the practice tasks during one of the essential practice sessions were not read seriously.² Due to a technical problem, one class of students (i.e., 24 students) did not receive the demographic questionnaire and the pretest. Together, this resulted in a final sample of 75 students for the posttest-only analyses (i.e., completed all essential sessions, excluding the demographic questions and pretest) and a subsample of 51 students (68%) for the pretest to posttest analyses.
We calculated power functions of our analyses using the G*Power software (Faul et al., 2009). The power of our one-way ANOVAs, under a fixed alpha level of .05 and with a sample size of 75, is estimated at .11, .47, and .87 for picking up a small (ηp² = .01), medium (ηp² = .06), and large (ηp² = .14) effect, respectively. Regarding the crucial interaction between number of practice sessions and test moment (again calculated under a fixed alpha level of .05, but with a sample size of 51 and a correlation between measures of .64), the power is estimated at .27, .95, and >.99 for picking up a small, medium, and large interaction effect, respectively. Thus, our sample size under the above assumptions should be sufficient to pick up medium-large effects, and previous studies on repeated (retrieval) practice mainly demonstrated medium-large effects (e.g., Roediger & Karpicke, 2006b).
The educational committee of the university approved conducting this study within the curriculum. In week 1, all participants first completed the CT-skills pretest, followed by the CT-instructions and practice session one (see Figure 1 for an overview). Participants were randomly assigned to one of three conditions. They either (1) did not practice extra with the tasks (practice once condition, posttest only: n = 26; both tests: n = 16), (2) practiced a second time in week 2 (practice twice condition, n = 25; n = 16), or (3) practiced a second time in week 2 and a third time in week 4 (practice thrice condition, n = 24; n = 19). Participants completed the CT-skills posttest two or three days after their last practice session.
¹ Because transfer items were not included in the pretest, we are not able to detect transfer gains.
² Fast readers (i.e., maximum reading speed of 0.17 seconds per word; e.g., Trauzettel-Klosinski & Dietz, 2012) were taken as a limit.

CT-skills tests.
The content of the surface features of all items was adapted to participants' study domain. The pretest consisted of 16 syllogistic reasoning items across two categories (i.e., conditional and categorical syllogisms, see Appendix S1 for an example with explanation of each category), which were used to measure learning, as these were instructed and practiced during the training phase. All of the items included a belief bias (i.e., when the conclusion aligns with your prior beliefs or real-world knowledge but is invalid or vice versa; Evans et al., 1983; Markovits & Nantel, 1989; Newstead et al., 1992) and examined the tendency to be influenced by the believability of a conclusion when evaluating the logical validity of arguments (Evans, 1977, 2003). These types of tasks are frequently used to measure people's ability to avoid biases (e.g., Stanovich et al., 2016).
Our tests consisted of 3 × affirming the consequent of a conditional statement (if p then q, q therefore p; invalid); 3 × denying the consequent of a conditional statement (if p then q, not q therefore not p; valid); 2 × affirming the antecedent of a conditional statement (if p then q, p therefore q; valid); 2 × denying the antecedent of a conditional statement (if p then q, not p therefore not q; invalid); 3 × categorical syllogism 'no A is B, some C are B, therefore some C are not A' (valid); and 3 × categorical syllogism 'no A is B, some C are B, therefore some A are not C' (invalid). Participants had to indicate for each item whether the conclusion was valid or invalid and to explain their multiple-choice (MC) answer to check their understanding (on the MC-answers they might be guessing). They could earn 1 point for the correct MC-answer and 1 point for a correct and 0.5 point for a partially correct explanation (see subsection 2.4). The MC and explanation scores were sum-scored and, thus, the maximum total score on the learning items was 32 points.
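The validity labels given above for the four conditional forms can be checked by enumerating truth values. The sketch below assumes the classical (material) reading of "if p then q" as the normative standard; the function names are illustrative, and the two categorical syllogisms are not covered because they require quantifiers.

```python
from itertools import product


def implies(a, b):
    """Material conditional: 'if a then b' is false only when a is true and b is false."""
    return (not a) or b


def is_valid(premises, conclusion):
    """A form is valid if no assignment makes all premises true and the conclusion false."""
    for p, q in product([True, False], repeat=2):
        if all(premise(p, q) for premise in premises) and not conclusion(p, q):
            return False
    return True


def rule(p, q):
    """Shared first premise of all four forms: if p then q."""
    return implies(p, q)


# Each entry: (list of premises, conclusion), all as functions of (p, q).
forms = {
    "affirming the antecedent": ([rule, lambda p, q: p], lambda p, q: q),
    "denying the consequent": ([rule, lambda p, q: not q], lambda p, q: not p),
    "affirming the consequent": ([rule, lambda p, q: q], lambda p, q: p),
    "denying the antecedent": ([rule, lambda p, q: not p], lambda p, q: not q),
}

for name, (premises, conclusion) in forms.items():
    verdict = "valid" if is_valid(premises, conclusion) else "invalid"
    print(f"{name}: {verdict}")
```

Both invalid forms fail on the same assignment (p false, q true), in which the premises hold but the conclusion does not.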
The posttest was identical to the pretest but, additionally, six Wason selection items were added that measured the tendency to confirm a hypothesis rather than to falsify it (see the Appendix for two examples with explanations; e.g., Dawson et al., 2002; Evans, 2002; Stanovich, 2011). These items measured transfer as they were not instructed/practiced but shared similar features with the four types of conditional syllogisms. Our test consisted of 3 abstract versions and 3 versions including study-related context. An MC-format with four answer options was used in which only a specific combination of two selected answers was the correct answer. One point was assigned for each correct answer (see subsection 2.4), resulting in a maximum total score of six points on the transfer items.
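To illustrate why only one combination of two options is correct in a selection task, the sketch below assumes the classic four-card layout for a rule of the form "if p then q" (cards showing p, not-p, q, and not-q). This layout and all names are assumptions for illustration; the actual items used in the study are in the Appendix and may differ in surface features.

```python
def implies(a, b):
    """Material conditional: 'if a then b' is false only when a is true and b is false."""
    return (not a) or b


# Each card shows the value of one variable; its hidden side holds the other variable.
visible_faces = {
    "p": ("p", True),
    "not-p": ("p", False),
    "q": ("q", True),
    "not-q": ("q", False),
}


def can_falsify(variable, value):
    """True if some hidden value on the other side would violate 'if p then q'."""
    for hidden in (True, False):
        p, q = (value, hidden) if variable == "p" else (hidden, value)
        if not implies(p, q):
            return True
    return False


# Only cards that could reveal a violation are worth turning over.
must_turn = [
    label for label, (variable, value) in visible_faces.items()
    if can_falsify(variable, value)
]
print(must_turn)
```

Under these assumptions only the p and not-q cards can reveal a violation, which corresponds to falsifying rather than confirming the rule.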

CT-instructions.
The video-based CT-instructions (15 min.) consisted of a general CT-instruction (i.e., features of CT and attitudes/skills needed to think critically) and explicit instructions on belief-bias in syllogisms that consisted of a worked example of each of the six types in the pretest. The worked examples showed the correct line of reasoning and included possible problem-solving strategies, which allowed participants to mentally correct initially erroneous responses. At the end, participants received a hint stating that the principles used in these examples can be applied with several other reasoning tasks.

CT-practice.
Participants could practice retrieval on the six types of syllogisms on topics that they might encounter in their working life. Participants were instructed to read the problems thoroughly and to choose the correct MC-answer option, provided directly below the problems. They had to deliberately recall the relevant information from memory to solve the problems. After each practice task, they received correct-answer feedback and were given a worked example in which the line of reasoning was explained in steps and clarified with a visual representation. The second and third practice sessions were parallel versions of the first one (i.e., structurally equivalent problems but with different surface features).

Mental effort.
After each test item and after each CT-practice problem, participants were asked to indicate how much effort they invested on completing that task, on a 9-point scale ranging from (1) very, very low effort to (9) very, very high effort (Paas, 1992).

Global judgments of learning (JOL).
At the end of each practice session, participants made a JOL on how well they thought they understood the CT-practice problems on a 7-point scale ranging from (1) very poorly to (7) very well (Koriat et al., 2002; Thiede et al., 2003).

Procedure
The study was run during the first four weeks of a CT-course in the Integral Safety and Security Management study program of an institute of higher professional education. The CT-skills pretest and first practice session were conducted during the first lesson in a computer classroom at the participants' university with an entire class of students and their teacher present. The extra practice sessions and the posttest were completed entirely online (cf. Heijltjes, …). Participants came from four different classes and within each class, students were randomly assigned to one of the conditions. All materials were delivered in a computer-based environment (Qualtrics platform). Participants could work at their own pace, were allowed to use scrap paper while solving the tasks, and time-on-task was logged during all phases.
Repeated Retrieval Practice to Foster Students' Critical Thinking Skills Collabra: Psychology
In advance of the first lesson, the students were informed by their teacher about the experiment (i.e., procedure and time window). When entering the classroom in week 1, participants were instructed to sit down at one of the desks and read the A4-paper containing some general instructions and a link to the Qualtrics environment where they first had to sign an informed consent form. Thereafter, they had to fill in a demographic questionnaire and complete the pretest. After each test item, they had to indicate how much mental effort they invested. Subsequently, participants entered the practice phase in which they first viewed the video-based CT-instructions (15 min), followed by the practice tasks. At the end of the practice phase, participants had to indicate their JOL. Participants had to wait (in silence) until the last participant had finished before they were allowed to leave the classroom.
One day before each online session (i.e., practice session 2 and 3 and posttest), participants received an e-mail with a reminder and the request to reserve time for this mandatory part of their CT-course. One hour before participants could start, they received the link to the Qualtrics environment. They were given a specific time window (8 am to 10 pm that day) to complete these sessions. Two or three days after session 1, participants of the practice once condition had to complete the posttest. In the beginning of week 2, all participants had to complete the second practice session. Since the content of our materials was part of the final exam of this course and the ethical guidelines of the institute of higher professional education state that all students should have been offered the same exam materials, participants of the practice once condition practiced with the extra practice materials but they were no longer included in the experiment. Two or three days after session 2, participants of the practice twice condition had to complete the posttest. Due to practical reasons (i.e., one week school holiday), the procedure of week 2 was repeated in week 4; all participants had to complete the third practice session but students in the practice once and twice conditions were no longer partaking in the experiment and those in the practice thrice condition had to complete the posttest after three days. Participants who did not complete either the posttest or one of the extra practice sessions received an e-mail the day after the specific time-window with the message that they could complete it that day as a last opportunity.

Data Analysis
Items were scored for accuracy; 1 point for each correct MC-alternative and a maximum of 1 point (increasing in steps of 0.5) for the correct explanation on the learning items (the coding scheme can be found on our OSF-page). Unfortunately, one transfer item had to be removed from the test due to incorrectly offered MC-answer options. As a result, participants could attain a maximum total score of 32 points on the learning items and five points on the transfer items. For comparability, learning and transfer outcomes were computed as percentage correct scores instead of total scores. Two raters independently scored 25% of the explanations on the learning items of the posttest. The intraclass correlation coefficient (two-way mixed, consistency, single measures; McGraw & Wong, 1996) was 0.996, indicating excellent interrater reliability (Koo & Li, 2016). The remainder of the tests was scored by one rater. Cronbach's alpha was .74 on the learning items on the pretest, .71 on the learning items on the posttest, and .79 on the transfer items.
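The scoring rule can be summarized in a short sketch. The function names and the example response pattern are hypothetical; only the point values (1 point per correct MC-answer, 0, 0.5, or 1 point per explanation, 16 learning items, five retained transfer items) come from the description above.

```python
ALLOWED_EXPLANATION_SCORES = {0.0, 0.5, 1.0}


def learning_percentage(mc_correct, explanation_scores):
    """Percentage correct on learning items: MC points plus explanation points."""
    if len(mc_correct) != len(explanation_scores):
        raise ValueError("one MC answer and one explanation score per item expected")
    if any(score not in ALLOWED_EXPLANATION_SCORES for score in explanation_scores):
        raise ValueError("explanation scores must be 0, 0.5, or 1")
    total = sum(1.0 for correct in mc_correct if correct) + sum(explanation_scores)
    maximum = 2.0 * len(mc_correct)  # 16 items give a maximum of 32 points
    return 100.0 * total / maximum


def transfer_percentage(n_correct, n_items=5):
    """Percentage correct on transfer items (five items after one was removed)."""
    return 100.0 * n_correct / n_items


# Hypothetical participant: 10 of 16 MC answers correct,
# 4 full and 4 partial explanations, 8 explanations scored 0.
mc = [True] * 10 + [False] * 6
explanations = [1.0] * 4 + [0.5] * 4 + [0.0] * 8
print(learning_percentage(mc, explanations))  # 16 of 32 points
print(transfer_percentage(2))
```

Expressing both outcomes as percentages puts the 32-point learning scale and the 5-point transfer scale on a common footing.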
Boxplots were created to identify outliers (i.e., values that fall more than 1.5 times the interquartile range above the third quartile or below the first quartile) in the data. If outliers were present, we first conducted the analyses on the data of all participants and then reran the analyses on the data without outliers. If outliers influenced the results, we reported the results of both analyses. If not, we only reported the results on the full data set. In case of severe violations of the assumption of normality for our analyses, we conducted appropriate non-parametric tests.
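The boxplot criterion can be written out as follows. This is a minimal sketch with made-up data; the quartile method (Python's "inclusive" interpolation) is an assumption, and statistical packages may compute quartiles slightly differently and therefore flag slightly different cases.

```python
from statistics import quantiles


def boxplot_outliers(values):
    """Return values more than 1.5 x IQR below the first or above the third quartile."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical time-on-test values in minutes (not data from the study).
minutes_on_test = [49, 52, 55, 58, 60, 61, 63, 120]
print(boxplot_outliers(minutes_on_test))
```

With these hypothetical values the quartiles are 54.25 and 61.5, so the fences lie at 43.375 and 72.375 and only the value of 120 is flagged.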

Results
For all analyses in this paper, a p-value of .05 was used as a threshold for statistical significance. Partial eta-squared (ηp²) is reported as an effect size for all ANOVAs with ηp² = .01, ηp² = .06, and ηp² = .14 denoting small, medium, and large effects, respectively (Cohen, 1988). Cramer's V is reported as an effect size for chi-square tests with (having 2 degrees of freedom) V = .07, V = .21, and V = .35 denoting small, medium, and large effects, respectively.
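Both effect sizes can be recovered from the reported test statistics with standard identities, which is a convenient way to check reported values. The sketch below is not the authors' analysis script; the function names are illustrative, and the sample size (N = 51) and the 2 × 3 table shape used for Cramer's V are assumptions.

```python
import math


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta-squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramer's V from a chi-square statistic, sample size, and table dimensions."""
    return math.sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))


# F(1, 42) = 39.34 as reported for Test Moment on time-on-test.
print(round(partial_eta_squared(39.34, 1, 42), 2))

# F(1, 47) = 20.26 as reported for Test Moment on MC-only performance.
print(round(partial_eta_squared(20.26, 1, 47), 2))

# Chi-square(2) = 6.23 for gender by condition; N = 51 and a 2 x 3 table are assumed.
print(round(cramers_v(6.23, 51, 2, 3), 2))
```

Under these assumptions the sketch reproduces the reported values of .48, .30, and .35.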

Check on Condition Equivalence
Before running any of the main analyses, we checked our conditions on equivalence. Preliminary analyses confirmed that there were no a-priori differences between the conditions in age, F(2, 50) = 0.46, p = .634, ηp² = .02; educational background, χ²(8) […] .701, ηp² = .01. We found a gender difference between the conditions, χ²(2) = 6.23, p = .043, V = .35. However, gender did not correlate significantly with any of our performance measures (minimum p = .669) and was therefore not a confounding variable.

Planned Analyses
We conducted pretest to posttest analyses on the data of participants who completed all essential experimental sessions (n = 51) and posttest-only analyses on the data of all participants, including those who missed the demographic questions and pretest (n = 75). Because of a floor effect on transfer performance, analysis of the transfer data would unfortunately not be very meaningful, and we therefore report only descriptive statistics on those data. Together with the descriptive statistics of the other dependent variables, these can be found in Table 1.

Time-on-test.
Because the data were not normally distributed, we conducted a Kruskal-Wallis H test with Condition (practice once, practice twice, practice thrice) as between-subjects factor on pretest-posttest differences in time spent on learning items. The results showed that there was no significant difference between conditions in pretest-posttest time spent on learning items, χ²(2) = 1.54, p = .464, ηp² = .01. A Kruskal-Wallis H test on the posttest-only data with Condition (practice once, practice twice, practice thrice) as between-subjects factor showed that there was no significant difference in time spent on posttest learning items between conditions, χ²(2) = 4.54, p = .103, ηp² = .04. In addition to the results of the analysis on the full data, a 2×3 mixed ANOVA on the data without five outliers with Test Moment (pretest, posttest) as within-subjects factor and Condition (practice once, practice twice, practice thrice) as between-subjects factor did reveal a significant effect of Test Moment, F(1, 42) = 39.34, p < .001, ηp² = .48; more time was spent on the pretest (M = 73.84, SD = 17.55) than on the posttest (M = 49.26, SD = 21.14).

Global judgments of learning.
Finally, we examined differences in global JOLs using a one-way ANOVA. The results revealed no main effect of Condition, F(2, 74) = 1.82, p = .170, ηp² = .05.

Exploratory Analyses
To gain more insight into the effects of repeated retrieval practice, we explored participants' level of performance during practice session one, two, and three.³ Descriptive statistics showed that on average, performance increased with increasing practice opportunities: mean percentage correct during practice session one was 58.67% (SD = 21.29; n = 75), during session two 65.31% (SD = 19.20; n = 49), and during practice three 69.44% (SD = 16.79; n = 24).⁴ Since the transfer items of the tests shared similar features with the four types of conditional syllogisms, we additionally explored participants' level of performance during learning on these types only. Again, descriptive statistics showed that performance increased: mean percentage correct during practice session one was 55.33% (SD = 24.42; n = 75), during practice session two 63.78% (SD = 25.55; n = 49), and during practice session three 69.79% (SD = 19.48; n = 24). Additionally, we explored whether performance on MC-questions only on the syllogism (learning) items improved after instruction and practice, using a 2×3 mixed ANOVA with Test Moment (pretest, posttest) as within-subjects factor and Condition (practice once, practice twice, practice thrice) as between-subjects factor. The results indeed revealed a main effect of Test Moment, F(1, 47) = 20.26, p < .001, ηp² = .30; performance was better on the posttest (M = 68.66, SE = 2.30) than on the pretest (M = 57.42, SE = 2.60). There was, however, no significant main effect of Condition, F(2, 47) = 0.50, p = .613, ηp² = .02, nor an interaction between Test Moment and Condition, F(2, 47) = 0.01, p = .990, ηp² < .01. Finally, we explored how much time participants spent on the worked-example feedback after correct and incorrect retrievals. Both test and descriptive statistics (see Table 2) showed that participants spent, on almost all practice tasks, more time on the worked-example feedback after incorrect retrievals than after correct retrievals. Although participants generally spent less time on the worked-example feedback as they practiced more often (i.e., during a later practice session), this pattern was found during each of the three practice sessions.

Addressing Potential Power Issues
Due to a technical problem, our final sample was considerably smaller than predetermined and might have been insufficient to detect a small-medium interaction effect. Since adding participants to an already completed experiment will increase the Type I error rate (alpha) and conducting a second identical experiment (i.e., in the context of an actual course) would be resource-demanding, we decided to explore whether or not that would be worthwhile, using a sequential stopping rule (SSR; see, for example, Arghami & Billard, 1982, 1991; Botella et al., 2006; Doll, 1982; Fitts, 2010; Pocock, 1992; Ximénez & Revuelta, 2007). SSRs make it possible to stop early when statistical significance is unlikely to be achieved with the planned number of participants.
One SSR that is simple, efficient, and appropriate to this experiment is the COAST (composite open adaptive stopping rule; Frick, 1998). The COAST allows one to stop testing participants and reject the null hypothesis if the p-value is less than a lower criterion of .01; to stop testing participants and retain the null hypothesis if the p-value is greater than an upper criterion of .36; and to test more participants if the p-value is between these two values. In the present study, the p-values of our main analyses (i.e., on performance measures) were clearly larger than the upper criterion of .36. Hence, there was no hint of an existing effect of repeated retrieval practice in the present study and, thus, we decided not to add additional participants.
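The decision logic of the COAST as described above amounts to comparing a p-value with two criteria. The sketch below is an illustration only; the function name, the return strings, and the example p-values are hypothetical, while the criteria of .01 and .36 come from the text.

```python
def coast_decision(p_value, lower=0.01, upper=0.36):
    """Apply the COAST criteria to a p-value and return the resulting decision."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < lower:
        return "stop: reject the null hypothesis"
    if p_value > upper:
        return "stop: retain the null hypothesis"
    return "continue: test more participants"


# Illustrative p-values, one for each possible outcome.
for p in (0.005, 0.20, 0.61):
    print(f"p = {p}: {coast_decision(p)}")
```

A p-value above the upper criterion, as was the case for the main analyses in the present study, leads to the decision to stop and retain the null hypothesis.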

Discussion
The current study investigated whether repeated retrieval practice is beneficial to foster learning of CT-skills and whether it can additionally facilitate transfer. Contrary to our expectations, we did not find pretest to posttest performance gains on learning items. Thus, we did not replicate the finding that participants' performance improves after explicit instructions combined with retrieval practice on domain-specific problems (Hypothesis 1; e.g., Heijltjes et al., 2015; Van Peppen et al., 2018; Van Peppen, Verkoeijen, Kolenbrander, et al., 2021). It should be noted, however, that this comparable level of posttest performance was attained in less time than pretest performance (i.e., prior to instruction/practice). Moreover, our exploratory findings on performance on MC-questions only suggest that students did benefit from instructions and retrieval practice. This difference in outcomes when looking at MC-answers and total scores (i.e., MC + justification) could mean that participants did learn what the right answer was, but may have been unable to justify their answers sufficiently. In that case, however, our intervention only resulted in simple memorization (i.e., rote learning; Mayer, 2002) instead of a deeper understanding of the subject matter. This might perhaps also explain the occurrence of a floor effect on performance on transfer items, as transfer of knowledge or skills depends on how well-developed the knowledge structures are that are formed during initial learning (e.g., Perkins & Salomon, 1992).
This concerns all participants who engaged in the relevant practice sessions (i.e., all conditions in practice session one, practice twice and thrice in session two, and practice thrice in session three).
We additionally tested within the practice thrice condition (n = 24) whether there was a significant difference in performance during practice sessions one, two, and three. Performance increased on average with increasing practice opportunities (M1 = 60.42%, M2 = 65.97%, M3 = 69.44%), but these differences were not significant (possibly due to the small sample size), F(2, 46) = 1.94, p = .155, ηp² = .08.
In line with previous repeated retrieval findings (e.g., Roediger & Butler, 2011), average performance scores during practice seemed to increase with more repetitions. However, repeated retrieval practice did not have a significant effect, compared to practicing once, on performance on the final test (i.e., on learning items; Hypotheses 2a/2b). Unfortunately, we were unable to test whether repeated retrieval practice would enhance transfer (Hypotheses 3a/3b) due to a floor effect. Because the power of our study was only sufficient to pick up medium-to-large effects of repeated retrieval, it could be that additional retrieval practice had a small effect that we could not detect. In the current study, each practice session consisted of multiple practice tasks (instead of one as in most studies) and it could, therefore, be argued that practicing once in this study can already be seen as repeated practice, which possibly explains the absence of substantial effects of repeated retrieval.
Another potential explanation for the lack of effect of additional retrieval practice might lie in the feedback that was provided after each retrieval attempt. While many studies only show a retrieval practice effect when feedback is provided (for an overview, see Van Gog & Sweller, 2015) and others show that elaborative feedback can enhance effects of retrieval practice (e.g., Pan et al., 2016; Pan & Rickard, 2018), findings from recent research suggest that the feedback after each retrieval attempt may have eliminated the repeated retrieval effect (Kliegl et al., 2019; Pastötter & Bäuml, 2016; Storm et al., 2014). According to the bifurcation model (Halamish & Bjork, 2011; Kornell et al., 2011), feedback only strengthens knowledge that is not successfully retrieved, whereas knowledge that is successfully retrieved is hardly affected by subsequent feedback. As such, it may be that participants in the condition that merely practiced once (i.e., lowest performance during practice) processed the feedback better and, therefore, performed equally well on the final test as participants in the other conditions. Moreover, it may be that participants' motivation to learn the correct answer was higher when they were unable to provide the correct answer during retrieval practice than when they were able to do so (e.g., Kang et al., 2009; Potts & Shanks, 2019). Our findings regarding time spent on worked-example feedback after correct/incorrect retrievals support this idea (i.e., more time spent after incorrect than correct retrievals). The possible elimination of a lag effect on learning problem-solving skills by providing feedback after each retrieval attempt is an interesting issue for future research.
Although participants achieved a relatively high level of performance during retrieval practice (approx. 60-70 percent correct), which was comparable to previous studies that did demonstrate beneficial effects of repeated retrieval practice (e.g., A. C. Butler, 2010; Roediger & Karpicke, 2006b), a floor effect arose on transfer items. Since the practice tasks consisted of MC-questions only, this finding again supports the idea that students do benefit from instructions and retrieval practice but may have been unable to justify their answers on the tests sufficiently. Another likely cause for this floor effect may be that participants lacked in-depth understanding of the structural overlap between syllogisms and Wason selection tasks (i.e., measure of transfer). During practice, participants could earn one point for each correctly solved syllogism. Each transfer item, however, required recall and application of all four conditional syllogism principles to solve it correctly and, thus, to earn one point. Future studies on to-be-transferred problem-solving procedures, as in the current study, should guarantee sufficient understanding of structural features of tasks and complete recall of the procedure during retrieval practice. It may be helpful to provide longer or more extensive practice, including more guidance in identifying how tasks are related. Potentially, practicing retrieval until all retrievals are successful and complete might be a solution for complete recall of procedures (i.e., successive relearning: e.g., Bahrick, 1979; Rawson et al., 2013).
Given that transfer of CT-skills from trained to untrained tasks remains elusive (as our current results also underline), there is an urgent need to determine the exact obstacles to the transfer of CT-skills, which could lie in a failure to recognize that the acquired knowledge is relevant to the new task, inadequate recall of the acquired knowledge, and/or difficulties in actually applying that knowledge to the new task (i.e., three-step process of transfer; Barnett & Ceci, 2002).
To the best of our knowledge, this is the first study that investigated the effects of repeated retrieval practice in the CT-domain. Moreover, while the majority of research on repeated retrieval practice has been conducted in laboratory settings, the current study was conducted as part of an existing CT-course, using educationally relevant practice sessions and retention intervals. As such, it adds to the small body of literature on what instructional designs are (or are not) efficient and effective for CT-courses aiming at learning and transfer of CT-skills, which is relevant for both educational science and educational practice.