Halsey and colleagues (2015) raise an important issue regarding a certain letter with which we all are familiar:
They describe sample-to-sample variability in the P value as a major, and largely overlooked, cause of irreproducibility. They explain why P is fickle in order to discourage the ill-informed practice of interpreting analyses predominantly on the basis of this statistic.
In their estimation, the omission of this variability reflects a general lack of awareness.
The statistical power of a test dramatically affects how reliably we can interpret a P value, and consequently the result of the test.
I’ve been thinking more about power, with specific regard to molecular ecology and accurate sampling of organisms with complex life cycles (also see my interview with Sean Hoban and some of his work, also highlighted here).
The authors provide some background on the misunderstandings about P:
If statistical power is limited, regardless of whether the P value returned from a statistical test is low or high, a repeat of the same experiment will likely result in a substantially different P value and thus suggest a very different level of evidence against the null hypothesis.
To demonstrate this, they draw samples from two normally distributed populations known to differ. Taking repeated subsamples and running replicate experiments (though in practice we would likely perform only one), they find that the P values vary considerably (see Figure 2, Figure 4)!
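A simulation in this spirit is easy to run yourself. The sketch below is my own reconstruction, not the authors' code: it repeatedly draws small samples from two normal populations with a known true difference (a "medium" effect of d = 0.5, giving roughly 50% power at n = 30 per group), tests each pair with a two-sample z-test (a normal approximation that is reasonable at this sample size), and summarizes how wildly the P values swing across replicates.

```python
# Reconstruction (not the authors' code) of a Halsey et al. (2015)-style
# simulation: replicate "experiments" on two populations that truly differ,
# and observe the spread of P values.
import math
import random

random.seed(1)

def two_sample_p(a, b):
    """Two-sided P value from a two-sample z-test (normal approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def sample(n, mu):
    """Draw n values from a normal population with mean mu, sd 1."""
    return [random.gauss(mu, 1.0) for _ in range(n)]

# 1000 replicate experiments, n = 30 per group, true effect d = 0.5
pvals = sorted(two_sample_p(sample(30, 0.0), sample(30, 0.5))
               for _ in range(1000))
print(f"min P = {pvals[0]:.4f}, median P = {pvals[500]:.4f}, "
      f"max P = {pvals[-1]:.4f}")
print(f"fraction of replicates with P < 0.05: "
      f"{sum(p < 0.05 for p in pvals) / 1000:.2f}")
```

Even though every replicate samples from the same two genuinely different populations, only about half reach P < 0.05, and the P values range from vanishingly small to far above any conventional threshold.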
Only when the statistical power is at least 90% is a repeat experiment likely to return a similar P value, such that interpretation of P for a single experiment is reliable.
We usually want to know the direction of an effect, its size, and its precision. Halsey et al. (2015) advocate for the increased use of effect size and its 95% CIs.
Discovering that P is flawed will leave many scientists uneasy. As we have demonstrated, however, unless statistical power is very high (and much higher than in most experiments), the P value should be interpreted tentatively at best. Data analysis and interpretation must incorporate the uncertainty embedded in a P value.