In an editorial this week, David Trafimow and Michael Marks, editors of the (formerly) scientific journal Basic and Applied Social Psychology (BASP), announce that "the null hypothesis significance testing procedure (NHSTP)" is "invalid" and, consequently, that "from now on, BASP is banning the NHSTP". Likewise, they ban the use of confidence intervals:
Analogous to how the NHSTP fails to provide the probability of the null hypothesis, which is needed to provide a strong case for rejecting it, confidence intervals do not provide a strong case for concluding that the population parameter of interest is likely to be within the stated interval. Therefore, confidence intervals also are banned from BASP.
With all due respect, professors Trafimow and Marks, this is moronic. The procedure they call NHSTP is not "invalid", and neither is the (closely related) use of confidence intervals. The only things about NHSTP and confidence intervals that are "invalid" are certain naive and inflated ideas about their interpretation, held by many statistically illiterate scientists. It is these misconceptions that should be fought, not NHSTP and confidence intervals themselves, which have been indispensable tools for the scientific analysis of empirical data during most of the 20th century, and remain so today.
NHSTP is always carried out relative to some null hypothesis, which typically says that some effect size or some parameter is zero. Given the data, the p-value can, somewhat loosely speaking, be defined as the probability of obtaining data at least as extreme as those actually obtained, given that the null hypothesis is true.1 If the p-value ends up below a given threshold - called the significance level - then statistical significance is declared. Mainly for historical reasons, the significance level is usually taken to be 0.05.
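To make the recipe concrete, here is a minimal sketch (my own illustration with made-up numbers, not anything from the BASP editorial), assuming a one-sample t-test of the null hypothesis that a population mean is zero:

```python
# Hedged sketch of the NHSTP recipe for a one-sample t-test.
# The data are simulated and purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0.3, scale=1.0, size=40)   # hypothetical sample

res = stats.ttest_1samp(data, popmean=0.0)       # two-sided test of "mean = 0"
alpha = 0.05                                     # conventional significance level

print(f"t = {res.statistic:.2f}, p = {res.pvalue:.4f}")
if res.pvalue < alpha:
    print("Statistically significant: data this extreme would be unusual under the null.")
else:
    print("Not significant: the data are consistent with the null hypothesis.")
```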
A statistically significant outcome is usually taken to support a suspicion that the null hypothesis is false, and the logic is as follows. If statistical significance is obtained, then we are in a position to conclude that either the null hypothesis is false, or an event of low probability happened, namely the event of obtaining statistical significance despite the null hypothesis being true (with the traditional significance level, that probability is at most 0.05). Now, low-probability events do happen sometimes, but we expect them typically not to happen, and to the extent that no such fluke happened this time, the only remaining option is that the null hypothesis is false. The lower the threshold for significance, the more strongly a statistically significant result is considered to count against the null hypothesis.
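The low-probability part of that disjunction is easy to check by simulation. The following hedged sketch (again mine, with arbitrary settings) repeats an experiment many times with the null hypothesis deliberately made true, and statistical significance at level 0.05 indeed shows up as a roughly-5%-probability event:

```python
# Simulate many experiments in which the null hypothesis is true (mean really is 0)
# and count how often "statistical significance" at level 0.05 occurs anyway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_experiments, n = 0.05, 10_000, 30

false_alarms = 0
for _ in range(n_experiments):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)   # null hypothesis holds
    res = stats.ttest_1samp(sample, popmean=0.0)
    false_alarms += res.pvalue < alpha

print(f"Fraction significant under a true null: {false_alarms / n_experiments:.3f}")
# Expect a number close to 0.05.
```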
This is the logical justification of NHSTP (the justification of confidence intervals is similar), and the way we statisticians have been teaching it for longer than I have lived. It supplies the scientist with an important tool for judging when her data supports the suspicion that an effect (a deviation from the null hypothesis) really exists, as opposed to the case where the data shows no signs of being anything other than unremarkable statistical fluctuations around what is expected under the null hypothesis.
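The parallel property that justifies confidence intervals is coverage: the interval-producing procedure captures the true parameter in (say) 95% of repeated samples. A small simulation, again my own hedged sketch with invented numbers, illustrates this:

```python
# Check the coverage of the textbook 95% t-interval for a mean by repeated sampling.
# True mean, sample size and spread are arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_mean, n, reps = 1.0, 25, 10_000

covered = 0
for _ in range(reps):
    sample = rng.normal(loc=true_mean, scale=2.0, size=n)
    half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - half_width, sample.mean() + half_width
    covered += lo <= true_mean <= hi

print(f"Empirical coverage of the 95% interval: {covered / reps:.3f}")
# Expect a number close to 0.95.
```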
Unfortunately, there are misconceptions about NHSTP that are widespread in the scientific community. The one that is most relevant for the present discussion is the tempting but misguided idea of thinking of statistical significance at level 0.05 as a demonstration that the probability of the null hypothesis is at most 0.05,2 and more generally that a p-value can be interpreted as the probability of the null hypothesis. But that is to succumb to the fallacy of the transposed conditional, which is to confuse the probability of the data given the null hypothesis with the probability of the null hypothesis given the data.3 Frequentist methods like NHSTP never provide probabilities for the null hypothesis. For such probabilities, Bayesian methods are needed, but they involve the controversial step (mentioned by Trafimow and Marks) of specifying a prior distribution representing our beliefs prior to seeing the data. No probabilities in - no probabilities out.
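A toy calculation (mine, not Trafimow and Marks') makes the point vivid: feed the very same data into Bayes' theorem with two different priors on the null hypothesis, and you get two quite different posterior probabilities for it. The hypotheses, effect size and sample size below are invented for illustration:

```python
# "No probabilities in - no probabilities out": the posterior probability of the
# null hypothesis depends on the prior, for exactly the same observed data.
from scipy import stats

x_bar, n, sigma = 0.4, 30, 1.0           # hypothetical observed mean, sample size, known sd
se = sigma / n ** 0.5

# Likelihood of the observed mean under H0 (mean 0) and under one specific alternative (mean 0.5).
lik_h0 = stats.norm.pdf(x_bar, loc=0.0, scale=se)
lik_h1 = stats.norm.pdf(x_bar, loc=0.5, scale=se)

for prior_h0 in (0.5, 0.9):
    posterior_h0 = prior_h0 * lik_h0 / (prior_h0 * lik_h0 + (1 - prior_h0) * lik_h1)
    print(f"P(H0) = {prior_h0:.1f}  ->  P(H0 | data) = {posterior_h0:.3f}")
```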
My impression from what professors Trafimow and Marks write is that they used to be under the spell of the fallacy of the transposed conditional, but have since woken up and realized their mistake. Now, however, they suffer from another misconception, namely that the (purported) justification for NHSTP consists in that fallacy, and hence that NHSTP is "invalid". I find it tragic that a scientific journal decides to ban indispensable statistical methods based on such a silly misunderstanding.
What statistical methods, then, do they suggest as an alternative? They are lukewarm about Bayesian inference. Instead they declare that they...
encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem.
The irony here is that in order to quantify what sample size is needed to make descriptive statistics sufficiently "stable", and to make sampling error "less of a problem", the NHSTP conceptual apparatus is needed; this is exactly what a power calculation does, as sketched below. For more on this, see my discussion two years ago with Stephen Ziliak and Deirdre McCloskey.
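To spell out that irony, here is a hedged back-of-the-envelope power calculation (my own, with invented numbers): deciding how many observations make the sampling error "small enough" requires precisely the ingredients the editors have banned, namely a significance level, a power, and a smallest effect size worth detecting.

```python
# Sample-size calculation via the usual normal approximation for a two-sided
# one-sample test. All numbers are illustrative assumptions.
import math
from scipy import stats

alpha, power = 0.05, 0.80        # significance level and desired power
effect, sigma = 0.3, 1.0         # smallest effect worth detecting, assumed population sd

z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)
n = ((z_alpha + z_beta) * sigma / effect) ** 2

print(f"Required sample size: about {math.ceil(n)} observations")   # roughly 88 here
```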
Footnotes
1) It is the fuzzy notion of "extreme" that contributes most to the looseness of the description as given here, but trust me: it can be made precise. In simple situations the appropriate definition of "extreme" is usually straightforward, while in other situations the Neyman-Pearson lemma offers important guidance.
2) There is also the even more extreme misconception that statistical significance disproves the null hypothesis in the same sense that the observation of a black swan disproves the all-swans-are-white hypothesis. I am not joking: there are Harvard professors who think that way.
3) For two events A and B, we do not in general have that the probability P(A|B) of A given B is the same as the probability P(B|A) of B given A. An example: consider the experiment of picking a world citizen at random (uniformly), let A be the event that the person picked is Norwegian, and let B be the event that he or she is world chess champion. Then P(B|A) is about 1/5000000 = 0.0000002, because only one out of the about 5000000 Norwegians is world chess champion, whereas P(A|B) = 1 because the world chess champion is Norwegian. (Note that these two numbers differ by almost as much as two probabilities ever can.)