In an editorial this week, David Trafimow and Michael Marks, editors of the (formerly) scientific journal Basic and Applied Social Psychology (BASP), announce that "the null hypothesis significance testing procedure (NHSTP)" is "invalid", and, consequently, that "from now on, BASP is banning the NHSTP". Likewise, they ban the use of confidence intervals:
- Analogous to how the NHSTP fails to provide the probability of the null hypothesis, which is needed to provide a strong case for rejecting it, confidence intervals do not provide a strong case for concluding that the population parameter of interest is likely to be within the stated interval. Therefore, confidence intervals also are banned from BASP.
The NHSTP procedure is always carried out relative to some null hypothesis, which typically says that some effect size or some parameter is zero. Given the data, the p-value can, somewhat loosely speaking, be defined as the probability of obtaining data at least as extreme as those actually obtained, assuming that the null hypothesis is true.1 If the p-value ends up below a given threshold - called the significance level - then statistical significance is declared. Mainly for historical reasons, the significance level is usually taken to be 0.05.
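To make the definition concrete, here is a minimal sketch of a p-value computation. The scenario and numbers (a coin flipped 100 times, showing 60 heads, tested against the null hypothesis that the coin is fair) are my own illustration, not from the editorial, and "extreme" is taken to mean "at least as improbable under the null":

```python
# Two-sided binomial test: is a coin fair, given 60 heads in 100 flips?
# The p-value is the probability, under the null hypothesis (p = 0.5), of an
# outcome at least as improbable as the one actually observed.
from math import comb

n, observed = 100, 60
p_null = 0.5

def pmf(k):
    """Probability of exactly k heads in n flips under the null hypothesis."""
    return comb(n, k) * p_null ** n

p_obs = pmf(observed)
# Sum the probabilities of all outcomes no more likely than the observed one.
p_value = sum(pmf(k) for k in range(n + 1) if pmf(k) <= p_obs)

print(round(p_value, 4))  # 0.0569
```

With the traditional 0.05 threshold, this particular outcome just misses statistical significance.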
A statistically significant outcome is usually taken to support a suspicion that the null hypothesis is false, and the logic is as follows. If statistical significance is obtained, then we are in a position to conclude that either the null hypothesis is false, or an event of low probability (namely the event of obtaining statistical significance despite the null hypothesis being true) happened (with the traditional significance level, the probability is at most 0.05). Now, low-probability events do happen sometimes, but we expect them typically not to happen, and if no such event happened this time, then the only remaining option is that the null hypothesis is false. The lower the significance threshold, the more strongly a statistically significant result is considered to count against the null hypothesis.
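The "low probability" in this argument can be checked by simulation. The following sketch (the setup - standard normal data, a two-sided z-test, samples of size 30 - is my own choice for illustration) repeatedly runs an experiment in which the null hypothesis is true, and confirms that significance at level 0.05 occurs in roughly 5% of the runs:

```python
# Simulation: when the null hypothesis is true, a test at level 0.05 declares
# statistical significance in about 5% of experiments.
import math
import random

random.seed(42)

trials, n = 10_000, 30
rejections = 0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]  # null hypothesis is true
    z = sum(sample) / math.sqrt(n)   # test statistic; N(0,1) under the null
    if abs(z) > 1.96:                # two-sided test at level 0.05
        rejections += 1

print(rejections / trials)           # close to 0.05
```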
This is the logical justification of NHSTP (the justification of confidence intervals is similar), and the way we statisticians have been teaching it for longer than I have lived. It supplies the scientist with an important tool for judging when her data support the suspicion that an effect (a deviation from the null hypothesis) really exists, as opposed to the case where the data show no signs of being anything other than unremarkable statistical fluctuations around what is expected under the null hypothesis.
Unfortunately, there are misconceptions about NHSTP that are widespread in the scientific community. The one that is most relevant for the present discussion is the tempting but misguided idea of thinking of statistical significance at level 0.05 as a demonstration that the probability of the null hypothesis is at most 0.05,2 and more generally that a p-value can be interpreted as the probability of the null hypothesis. But that is to succumb to the fallacy of the transposed conditional, which is to confuse the probability of the data given the null hypothesis with the probability of the null hypothesis given the data.3 Frequentist methods like NHSTP never provide probabilities for the null hypothesis. For such probabilities, Bayesian methods are needed, but they involve the controversial step (mentioned by Trafimow and Marks) of specifying a prior distribution representing our beliefs prior to seeing the data. No probabilities in - no probabilities out.
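A small calculation makes the fallacy vivid. The numbers here (a prior probability of 0.9 for the null hypothesis, significance level 0.05, power 0.8) are purely illustrative assumptions of mine, chosen to show that once a prior is supplied, the probability of the null hypothesis given a significant result need not be anywhere near 0.05:

```python
# The transposed-conditional fallacy in numbers: P(significant | H0) = 0.05
# does NOT imply P(H0 | significant) = 0.05. With a prior, Bayes' rule gives
# the latter, and it can come out very differently.
prior_h0 = 0.9    # assumed prior probability that the null hypothesis is true
alpha = 0.05      # P(significant | H0): the significance level
power = 0.8       # P(significant | H1): the test's power

p_sig = prior_h0 * alpha + (1 - prior_h0) * power   # total probability
p_h0_given_sig = prior_h0 * alpha / p_sig           # Bayes' rule

print(round(p_h0_given_sig, 2))  # 0.36 - not 0.05
```

With these (made-up) inputs, a significant result still leaves the null hypothesis with probability 0.36 - seven times the naive reading of the significance level.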
My impression from what professors Trafimow and Marks write is that they used to be under the spell of the fallacy of the transposed conditional, but have later woken up from that spell and realized their mistake. But now they suffer from another misconception, namely that the (purported) justification for NHSTP consists in that fallacy, and hence that NHSTP is "invalid". I find it tragic that a scientific journal decides to ban indispensable statistical methods based on such a silly misunderstanding.
What statistical methods, then, do they suggest as an alternative? They are lukewarm about Bayesian inference. Instead they declare that they...
- encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem.
The incompetence of professors Trafimow and Marks, and their catastrophic misstep of banning NHSTP from their journal, is a splendid illustration of the main thesis of my paper Why the empirical sciences need statistics so desperately.
1) It is the fuzzy notion of "extreme" that contributes most to the looseness of the description as given here, but trust me: it can be made precise. In simple situations the appropriate definition of "extreme" is usually straightforward, while in other situations the Neyman-Pearson lemma offers important guidance.
2) There is also the even more extreme misconception that statistical significance disproves the null hypothesis in the same sense that the observation of a black swan disproves the all-swans-are-white hypothesis. No, I'm not joking. There are Harvard professors who think that way.
3) For two events A and B, we do not in general have that the probability P(A|B) of A given B is the same as the probability P(B|A) of B given A. An example: Consider the experiment of picking a world citizen at random (uniformly), let A be the event that the person picked is Norwegian, and let B be the event that he or she is world chess champion. Then P(B|A) is about 1/5000000=0.0000002, because only one out of the about 5000000 Norwegians is world chess champion, whereas P(A|B)=1 because the world chess champion is Norwegian. (Note that these two numbers differ by almost as much as two probabilities ever can.)
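The footnote's two conditional probabilities are linked by Bayes' rule, P(A|B) = P(B|A)P(A)/P(B), and it is instructive to see the enormous gap arise from the arithmetic. This sketch uses rough illustrative population figures of my own (7 billion world citizens, 5 million Norwegians, exactly one world chess champion):

```python
# Footnote 3 in numbers: P(B|A) and P(A|B) can differ about as much as two
# probabilities ever can. A = "picked person is Norwegian",
# B = "picked person is world chess champion".
from fractions import Fraction

world = 7_000_000_000                    # rough world population, illustrative
p_a = Fraction(5_000_000, world)         # P(A): person is Norwegian
p_b = Fraction(1, world)                 # P(B): person is world champion
p_b_given_a = Fraction(1, 5_000_000)     # one champion among 5M Norwegians

p_a_given_b = p_b_given_a * p_a / p_b    # Bayes' rule
print(float(p_b_given_a))                # 2e-07
print(p_a_given_b)                       # 1
```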