Re-analysing the Data From Moffatt et al. (2020): What Can We Learn From an Under-powered Absence of Difference?

Ladislas Nalborczyk (1, 2, 3)
1 GIPSA, University Grenoble Alpes, CNRS, Grenoble, France
2 Aix Marseille University, CNRS, LPC, Marseille, France
3 Aix Marseille University, CNRS, LNC, Marseille, France


Introduction
The activity of silently talking to oneself or "inner speech" is a foundational ability, allowing oneself to remember, plan, self-motivate, or self-regulate (for reviews, see Alderson-Day & Fernyhough, 2015; Loevenbruck et al., 2018; Perrone-Bertolotti et al., 2014). However, whereas the use of inner speech is associated with many adaptive functions in everyday life, inner speech dysfunctions can be identified in multiple psychological disorders. For instance, rumination, broadly defined as unconstructive repetitive thinking about past events and current mood states (Martin & Tesser, 1996), is involved in the onset and maintenance of serious mental disorders such as depression, anxiety, eating disorders, or substance abuse (for a review, see Nolen-Hoeksema et al., 2008).
Given the predominantly verbal nature of rumination (e.g., Ehring & Watkins, 2008; Goldwin et al., 2013; Goldwin & Behar, 2012; McLaughlin et al., 2007), we previously proposed to consider rumination as a form of inner speech and to study it using the methods that have historically been used to study other forms of inner speech, namely, surface electromyography (EMG) and motor interference protocols (e.g., Nalborczyk et al., 2017; Nalborczyk, 2019; Nalborczyk, Banjac, et al., 2021; Nalborczyk et al., 2022). We first showed that induced rumination was accompanied by increased facial muscular activity (over both a forehead and a perioral site) in comparison to a rest period (Nalborczyk et al., 2017). However, because rumination was only compared to a rest period, it remained uncertain whether this perioral activity was specifically related to (inner) speech processes. Therefore, we ran a follow-up study comparing verbal to non-verbal rumination, which suggested that the facial EMG correlates we had previously identified were not specifically related to the verbal content of the ruminative thoughts (Nalborczyk, Banjac, et al., 2021). We discussed these findings at length and proposed several theoretical interpretations that can account for them in the discussion of Nalborczyk, Banjac, et al. (2021) and more extensively in Nalborczyk (2019).
Moffatt et al. (2020) designed an experiment with the aim of refining our understanding of the involvement of the speech motor system in different varieties of inner speech and clarifying the relation between the peripheral correlates of inner speech and (self-reported) subjective experience. Their main conclusion is that inner experience between induced rumination and distraction differs "without a change in electromyographic correlates of inner speech" (p.1).
In other words, they suggest that the subjective experience of inner speech is unrelated (or loosely related) to the electromyographic correlates of inner speech, which are thought to be represented by the EMG amplitude recorded over the orbicularis oris inferior and orbicularis oris superior muscles. However, for this in-sample observation to be of interest in an out-of-sample context (i.e., to be informative for other non-observed individuals, or said otherwise, to bring information about the population), this absence of difference should be substantiated by adequately-powered statistical tests (given the target effect size) as well as reliable measures. This is unlikely to be the case here, for reasons that we will present and discuss in the present article.

Exploring the data
As is typical in studies manipulating induced rumination, Moffatt et al. (2020) designed a two-step protocol. First, they aimed to induce a negative mood by asking participants to solve unsolvable or excessively difficult anagram and subtraction tasks. Second, they prompted participants to either ruminate on these (purportedly induced) negative feelings (by asking them to "think about the causes, consequences, and meaning of their current feelings") or to distract themselves (by asking them to "think about a village, city or town that you are particularly familiar with"). Rumination and distraction were manipulated within-subject, with all subjects alternating between rumination and distraction, in a counter-balanced order.
Their final sample of participants, after data exclusion, included 26 participants (data available at https://osf.io/hj7tz/). The EMG data is depicted in Figure 1 by condition (where BAS, DIS, and RUM refer to the baseline, distraction, and rumination conditions, respectively) and by muscle (frontalis, FRO; orbicularis oris inferior, OOI; and orbicularis oris superior, OOS). This figure shows that the average natural logarithm of the EMG peak amplitude recorded over the FRO was at similar levels in the baseline and distraction conditions, but was much higher in the rumination condition. However, the average natural logarithm of the EMG peak amplitude recorded over the OOI and OOS muscles was higher than baseline in both the rumination and distraction conditions, with a slight increase from distraction to rumination (both on the mean and median). Having described the data collected by Moffatt et al. (2020), we now turn to a discussion of some problems related to conclusions that can be made from under-powered non-significant results.

Conclusions from under-powered null-hypothesis significance tests
There is an infamous tradition of conducting and interpreting uninformative null-hypothesis significance tests in Psychology (e.g., Meehl, 1967, 1978, 1990a, 1990b, 1997). By "uninformative", we mean that some null-hypothesis significance tests are simply not diagnostic with regards to the substantive effect of interest (e.g., whether there is a difference between conditions A and B).
As highlighted by several authors (e.g., J. Cohen, 1994; Pollard & Richardson, 1987; Rouder et al., 2016), concluding that an effect is probably absent solely based on a nonsignificant p-value is the continuous (i.e., probabilistic) extension of the modus tollens and is not a valid argument (i.e., the conclusion does not follow from the premises). This fallacious argument is also known as the fallacy of acceptance, the absence of evidence fallacy or the argument from ignorance, and proceeds as follows: "If the null hypothesis is true, then this observation should rarely occur. This observation occurred. Therefore, the null hypothesis is false (or has low probability)". In short, this argument is fallacious because it fails to consider the (probability of the data under the) alternative hypothesis.
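To make this last point concrete, the following minimal sketch applies Bayes' rule to a non-significant result. It is an illustration under stated assumptions, not an analysis from the present article: the function name, the prior probability of the null (0.5), and the two power values are hypothetical choices.

```python
def prob_null_given_nonsignificant(power, alpha=0.05, prior_null=0.5):
    """Probability of H0 given a non-significant test, from Bayes' rule.

    Uses P(non-significant | H0) = 1 - alpha and
    P(non-significant | H1) = 1 - power.
    All default values are illustrative assumptions.
    """
    if not (0.0 <= power <= 1.0 and 0.0 < alpha < 1.0 and 0.0 < prior_null <= 1.0):
        raise ValueError("power in [0, 1], alpha in (0, 1), prior_null in (0, 1]")
    nonsig_and_null = (1.0 - alpha) * prior_null
    nonsig_and_alt = (1.0 - power) * (1.0 - prior_null)
    return nonsig_and_null / (nonsig_and_null + nonsig_and_alt)


if __name__ == "__main__":
    # Hypothetical power values: a weakly powered and a well powered test.
    for power in (0.30, 0.95):
        print(power, round(prob_null_given_nonsignificant(power), 3))
```

With equal prior odds, a non-significant result from a test with 30% power leaves the probability of the null at about 0.58, barely above its starting value of 0.5, whereas the same result from a test with 95% power raises it to 0.95. The p-value alone cannot distinguish these two situations, because it ignores how probable the data are under the alternative.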
This problem is tackled in modern usages of null-hypothesis significance tests by ensuring that the claim under scrutiny is submitted to severe tests (e.g., Mayo, 2018; Mayo & Spanos, 2006). In general terms, the strong severity principle states that we have evidence for a claim to the extent that it survives stringent scrutiny, that is, to the extent that it survives severe tests. More precisely, a claim is said to be severely tested if it had a high chance of being corroborated had it been true, and of being falsified had it been false. When a statistical test is under-powered (for detecting a given effect size), the claim under scrutiny is not severely tested, hence it is not possible to obtain strong or reliable evidence for the claim (bad test, no evidence).

Optimistic a priori power analysis
Anticipating the legitimate critiques on the power of their study, Moffatt et al. (2020) report the results of a power analysis using the effect size reported in Nalborczyk et al. (2017). This represents a highly optimistic estimate of the substantive effect of interest (i.e., the difference in the natural logarithm of the EMG peak amplitude between the rumination and distraction conditions), as this effect size represents the standardised mean difference in EMG amplitude between a rest period and a rumination period as estimated in Nalborczyk et al. (2017).
We suggest that the (a priori) power of the study run by Moffatt et al. (2020) was much lower than suggested by the authors. Indeed, we speculate that the standardised mean difference in EMG peak amplitude between the rumination and distraction conditions may be much weaker than the standardised mean difference in EMG amplitude between the rumination and rest conditions. If we assume that the former is half the size of the latter, then the a priori power of the main statistical test from Moffatt et al. (2020) was below 0.5, meaning that they had less than 1 chance out of 2 of finding a significant effect (given that the population effect size was actually half the size reported in Nalborczyk et al., 2017). Notice that whereas taking half the effect size of Nalborczyk et al. (2017) may seem arbitrary, Figure 2 shows that a one-sample t-test with a sample size of N = 26 is under-powered for a vast range of effect sizes.
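The shape of such a power curve can be approximated with a short Monte Carlo sketch. This is not the code behind Figure 2: the grid of standardised effect sizes, the number of simulations, the seed, and the function name are illustrative assumptions, and the critical value is hard-coded for a two-sided 5% test with 25 degrees of freedom (the sample size of 26 being the one reported above).

```python
import math
import random

N = 26            # final sample size reported for Moffatt et al. (2020)
T_CRIT = 2.0595   # two-sided 5% critical value of Student's t, N - 1 = 25 df


def simulated_power(d, n_sims=4000, seed=1):
    """Share of simulated one-sample t-tests (n = N) with |t| > T_CRIT.

    Differences are drawn from a normal distribution with mean d and unit
    standard deviation, so d is the standardised effect size.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        x = [rng.gauss(d, 1.0) for _ in range(N)]
        mean = sum(x) / N
        sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (N - 1))
        if abs(mean / (sd / math.sqrt(N))) > T_CRIT:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    # Hypothetical grid of effect sizes, not values taken from either study.
    for d in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
        print(f"d = {d:.1f}: estimated power = {simulated_power(d):.2f}")
```

A simulation is used here instead of the noncentral t distribution only to keep the sketch free of external dependencies; an exact computation would give a smoother curve.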

Frequentist properties of Bayes factors
Once again, anticipating the legitimate critique that the absence of a significant difference is not necessarily "significant" evidence for the absence of an effect, Moffatt et al. (2020) reported the following Bayes factor (BF) analysis (p.12): "[…] therefore it is possible that the sample size of the present study lacked sufficient power to detect the effect of rumination on muscle activity. In order to test this, a Bayesian paired samples t-test was conducted for the peak log values of muscle activity between the rumination and distraction conditions. This revealed strong evidence in favour of the alternative hypothesis for the FRO muscle and moderate evidence in favour of the null hypothesis for the OOS and OOI muscles, according to current guidelines for interpreting Bayes factors […]".
First, contrary to what the authors suggest, whereas computing a BF indeed allows assessing the relative evidence for the null, computing a BF (i.e., comparing two models) does not solve the problem of low power. More precisely, the sensitivity (i.e., the ability to attain a certain goal) of an experimental design to detect a given effect is an issue for both frequentist and Bayesian statistical tests. To illustrate this point, we simulated 10,000 datasets under the assumption of either no effect (i.e., the null hypothesis), the supposed target effect size in Moffatt et al. (2020), or the effect size reported in Nalborczyk et al. (2017).
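The logic of this kind of simulation can be sketched as follows. This is a simplified stand-in and not the simulation reported here: it swaps in the BIC approximation to the Bayes factor for a one-sample t-test in place of a default Bayes factor, and the evidence thresholds (3 and 1/3), the effect sizes, the number of simulations, and all function names are assumptions made for illustration.

```python
import math
import random


def bf10_bic(t, n):
    """BIC approximation to BF10 for a one-sample t-test (an assumption here,
    used only to avoid numerical integration)."""
    return (1.0 + t * t / (n - 1)) ** (n / 2.0) / math.sqrt(n)


def classify_bfs(d, n=26, n_sims=5000, threshold=3.0, seed=1):
    """Simulate n_sims datasets with standardised effect d and return the
    proportion of Bayes factors in each evidence category."""
    rng = random.Random(seed)
    counts = {"supports_null": 0, "inconclusive": 0, "supports_alternative": 0}
    for _ in range(n_sims):
        x = [rng.gauss(d, 1.0) for _ in range(n)]
        mean = sum(x) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
        bf = bf10_bic(mean / (sd / math.sqrt(n)), n)
        if bf > threshold:
            counts["supports_alternative"] += 1
        elif bf < 1.0 / threshold:
            counts["supports_null"] += 1
        else:
            counts["inconclusive"] += 1
    return {key: value / n_sims for key, value in counts.items()}


if __name__ == "__main__":
    # Hypothetical effect sizes: no effect, a small effect, a larger effect.
    for d in (0.0, 0.25, 0.5):
        print(d, classify_bfs(d))
```

Because the approximation and thresholds differ from those of a default Bayes factor analysis, the proportions printed by this sketch should not be expected to match the percentages reported for Figure 3; only the qualitative pattern (frequent inconclusive or misleading BFs at small sample and effect sizes) is of interest.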
As shown in Figure 3, the distribution of log-BFs computed under each hypothesis reveals important inter-simulation variability. For instance, 29.60% of the log-BFs computed under the null hypothesis are "inconclusive" and 1.88% support the alternative hypothesis (although the population effect size is null). For the first non-null population effect size, 49.97% of the computed log-BFs are "inconclusive" and 37.50% support the null hypothesis (although the population effect size is actually non-null). For the second non-null population effect size, 41.65% of the computed log-BFs are "inconclusive" and 6.41% support the null hypothesis. In brief, this simulation shows that for small sample and effect sizes, BFs have non-negligible error rates (see also Schönbrodt et al., 2017).¹
The problems discussed above about the interpretation of under-powered non-significant results also apply to the test Moffatt et al. (2020) performed regarding the effect of the conditions' order. In Nalborczyk, Banjac, et al. (2021), we manipulated the modality of rumination (whether it is verbal or non-verbal) in a between-subject manner to avoid order effects and to avoid dissipating the effects of the negative mood induction. More precisely, we assumed that inducing rumination after a distraction condition in a within-subject manner would dissipate the effects of the mood induction and therefore reduce the impact of the rumination induction. In contrast to this approach, Moffatt et al. (2020) asked participants to ruminate and then distract themselves (or the reverse), after an induced stressor (an induced failure). Anticipating again that the order of the within-subject conditions may be an issue, Moffatt et al. (2020) say: "Unless otherwise reported, the inclusion of order in which the conditions were completed as a between-subjects variable as part of a mixed-design ANOVA produced no significant main effects or interactions involving order." (p.7). Unfortunately, obtaining a non-significant effect of the conditions' order is very weak evidence that order did not play a role in the results, given the low power of the tests that were performed (the sample sizes of the two groups were N = 12 and N = 14).

Robustness of Bayes factors to prior specifications
Formulated in Bayesian terms, the problem of specifying credible effect sizes in a priori power analyses may be described as a problem of prior specification. However, defining sound prior distributions for the alternative hypothesis is notoriously difficult (for some guidance, see for instance Dienes, 2019, 2021). In Figure 4, we report the results of prior sensitivity analyses, depicting the value of the BF in favour of the alternative hypothesis (relative to the null hypothesis) for the difference between the distraction and rumination conditions, under various prior specifications, for each muscle.
This figure strikingly reveals large variability in the resulting BF with various prior specifications. More precisely, when the scale (width) of the prior put on the standardised effect size is changed (along the x-axis), the BF changes accordingly. For instance, varying the prior scale from 0.1 to 1.0 for the OOI results in BFs from 0.78 to 0.21, respectively.
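A prior sensitivity analysis of this kind can be sketched in a few lines. The sketch below assumes a Cauchy prior with scale r on the standardised effect size (the so-called JZS Bayes factor for a one-sample or paired t-test) and evaluates the required integral with a simple trapezoidal rule on a log scale. The t-value, the grid of prior scales, the integration bounds, and the function name are hypothetical and are not taken from the data of Moffatt et al. (2020).

```python
import math


def jzs_bf10(t, n, r, lo=-8.0, hi=20.0, steps=4000):
    """BF10 for a one-sample t-test with a Cauchy(0, r) prior on the
    standardised effect size, via numerical integration over g = exp(u).

    The Cauchy prior is written as a normal prior with variance r^2 * g,
    where g follows an inverse-gamma(1/2, 1/2) distribution.
    """
    nu = n - 1

    def integrand(u):
        g = math.exp(u)
        a = 1.0 + n * g * r * r
        likelihood = a ** -0.5 * (1.0 + t * t / (a * nu)) ** (-(nu + 1) / 2.0)
        prior = (2.0 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1.0 / (2.0 * g))
        return likelihood * prior * g  # trailing g: Jacobian of g = exp(u)

    h = (hi - lo) / steps
    total = 0.5 * (integrand(lo) + integrand(hi))
    total += sum(integrand(lo + i * h) for i in range(1, steps))
    marginal_alt = total * h
    marginal_null = (1.0 + t * t / nu) ** (-(nu + 1) / 2.0)
    return marginal_alt / marginal_null


if __name__ == "__main__":
    t_obs = 0.5  # hypothetical t-value, chosen only for illustration
    for r in (0.1, 0.3, 0.5, 0.707, 1.0):
        print(f"prior scale = {r:.3f}: BF10 = {jzs_bf10(t_obs, 26, r):.3f}")
```

For a small observed t-value, widening the prior spreads its mass over effect sizes that the data make implausible, so the BF moves further towards the null as the scale increases; a very narrow prior makes the two hypotheses nearly indistinguishable and pulls the BF towards 1.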

Discussion and Conclusions
With this short paper, we aimed to nuance the strong conclusion made by Moffatt et al. (2020), who asserted that the inner experience of rumination was not related to its peripheral muscular correlates. First, we discussed the statistical and epistemological reasons that cast doubt upon the main conclusion of Moffatt et al. (2020). Because the statistical tests conducted by Moffatt et al. (2020) were heavily under-powered, they provide only weak evidence for an absence of difference between conditions. Second, we highlighted that the frequentist properties of Bayesian tools (e.g., Bayes factors) provide an important piece of information that may help design more informative studies. Third, sensitivity analyses further suggested that various prior specifications may lead to widely different Bayes factors.
In addition to these methodological limitations, we now wish to discuss the theoretical interpretations and implications of these results. As discussed in the introduction section, we previously conducted several studies aiming to assess the role of the speech motor system in rumination. Following our initial study (Nalborczyk et al., 2017), we ran an extension in which we compared verbal to non-verbal rumination. The results suggested that the facial EMG correlates of verbal and non-verbal rumination were similar (Nalborczyk, Banjac, et al., 2021). Given the ample evidence on the EMG correlates of inner speech production (for an overview, see Chapter 1 in Nalborczyk, 2019), we needed to explain why this particular form of inner speech (induced rumination) was not associated with speech-specific peripheral muscular activity.
¹ It should be noted that, as stressed by Rouder (2014), Bayes factors indicate the relative evidence for a hypothesis, conditional on some observed data. In other words, Bayesian updating is not conditional on some hypothetical truth. With this in mind, the present simulation aims at illustrating how the frequentist properties of BFs may be used to design more informative studies (see also Schönbrodt & Wagenmakers, 2018), while acknowledging that proper long-term error rates control is not the realm of the Bayesian framework.
In Nalborczyk, Banjac, et al. (2021), we suggested that this observation was coherent with the mental-habit view of depressive rumination (Watkins & Nolen-Hoeksema, 2014), which defines rumination as a habitual behaviour, automatically triggered by contextual cues such as negative mood. We know habitual behaviours are more automatic (i.e., they are not intentionally initiated) than non-habitual behaviours. Interestingly, it has been observed that the automaticity with which a verbal thought is evoked may influence the degree to which it is enacted, that is, the degree to which it recruits the speech motor system (e.g., B. H. Cohen, 1986; Sokolov, 1972). According to B. H. Cohen (1986), the presence of peripheral motor activity during inner speech production may be interpreted in terms of attention sharing. For instance, in novel (hence non-automatic) or difficult situations, the vividness of inner speech may be strengthened by increasing the speech motor activity, resulting in more salient auditory percepts. Relating this idea to the motor control framework we previously proposed (e.g., Grandchamp et al., 2019; Loevenbruck et al., 2018), it may be said that the characteristics of the task or situation (e.g., novelty, difficulty) may influence the amount of inhibition that is applied to motor commands during inner speech production (Nalborczyk, Debarnot, et al., 2021), hence resulting in more or less visible peripheral muscular activity (for a discussion of these ideas in the broader context of motor imagery, see Guillot et al., 2012).
Another possible interpretation is that automatic forms of inner speech may rely more heavily on higher-level (e.g., memory-based) cognitive processes whereas less automatic (i.e., more intentional or deliberate) forms of inner speech may rely more on simulation mechanisms via the use of internal models of the speech motor system (Nalborczyk, 2019; Nalborczyk, Debarnot, et al., 2021). In other words, the production of automatic versus non-automatic inner speech would be underpinned by different processes that would involve the speech motor system to a different extent. This distinction is similar to the distinction between the two routes of prediction-by-association and prediction-by-simulation in speech perception and comprehension (Pickering & Garrod, 2013). The prediction-by-association mechanism would rely more on perceptual sensory experiences and domain-general cognitive abilities whereas the prediction-by-simulation mechanism would rely more on the simulation of the motor action leading to the speech auditory percept. In the former case, no peripheral muscular activity is expected, whereas in the latter case, the speech motor system would be involved in simulating or emulating the corresponding overt action (cf. also the distinction between motor simulation and direct simulation in Tian & Poeppel, 2012). Whether the physiological correlates of automatic versus non-automatic (deliberate) forms of inner speech differ because of inhibitory constraints or because they rely on different processes (e.g., prediction-by-association or prediction-by-simulation) remains an open empirical question. We previously discussed these issues at greater length and suggested ways forward from an experimental perspective in the discussion of Nalborczyk (2019).
To conclude, we wish to bring some nuance to the conclusion of Moffatt et al. (2020), who stated that "In conclusion, induced rumination appeared to involve similar levels of inner speech-related muscle activity to a period of distraction" (p.14). In consideration of the limitations discussed in the present article, this conclusion seems hasty. Indeed, we provided theoretical (epistemological) and empirical (via simulation and sensitivity analyses) reasons to doubt the strength of the evidence in favour of the null hypothesis in this study. This commentary stresses the importance of planning adequately-powered studies of induced rumination, and the need for more thoughtful statistical analyses and data interpretation, as recommended by Wasserstein et al. (2019).