In 2009, I read and enjoyed Stephen Ziliak and Deirdre McCloskey's important book The Cult of Statistical Significance, and in 2010 my review of it appeared in the Notices of the American Mathematical Society. In the hope of engaging Ziliak and McCloskey in some further discussion, I write the present blog post in English; readers looking for a gentle introduction in Swedish to statistical hypothesis testing and statistical significance may instead consult an earlier blog post of mine.
In Ziliak and McCloskey's recent contribution to Econ Journal Watch, we find the following passage:
In several dozen journal reviews and in comments we have received—from, for example, four Nobel laureates, the statistician Dennis Lindley (2012), the mathematician Olle Häggström (2010), the sociologist Steve Fuller (2008), and the historian Theodore Porter (2008)—no one [...] has tried to defend null hypothesis significance testing.
This surprised me, not so much because I had never expected to be cited in the same sentence as the fully fledged relativist provocateur Steve Fuller, but mainly because Häggström (2010)—which contains the passage
The Cult of Statistical Significance is written in an entertaining and polemical style. Sometimes the authors push their position a bit far, such as when they ask themselves: "If null-hypothesis significance testing is as idiotic as we and its other critics have so long believed, how on earth has it survived?" (p. 240). Granted, the single-minded focus on statistical significance that they label sizeless science is bad practice. Still, to throw out the use of significance tests would be a mistake, considering how often it is a crucial tool for concluding with confidence that what we see really is a pattern, as opposed to just noise. For a data set to provide reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance
—is such an extraordinarily odd reference to put forth in support of the statement that "no one has tried to defend null hypothesis significance testing". Ziliak and McCloskey are of course free to be unimpressed by this passage and not consider it to qualify as a defense of statistical significance testing, but note that they write "no one has tried to defend" rather than just "no one has defended". Hence, they do not even grant my passage the status of a valid attempt at defending significance testing. This strikes me as overly harsh.
Let me take this opportunity to expand a bit, by means of a simple example, on my claim that in order to establish "reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance". Assume that the producer of the soft drink Percy-Cola has carried out a study in which subjects have been blindly exposed to one mug of Percy-Cola and one mug of Crazy-Cola (in randomized order), and asked to indicate which of them tastes better. Assume furthermore that 75% of subjects prefer the mug containing Percy-Cola, while only 25% prefer the one with Crazy-Cola. How impressed should we be by this?
This depends on how large the study is. Compare the two cases:
(a) out of a total of 4 subjects, 3 preferred Percy-Cola,
(b) out of a total of 1000 subjects, 750 preferred Percy-Cola.
If we follow Ziliak and McCloskey's advice to ignore statistical significance and focus instead purely on subject-matter (i.e., in this case, gastronomical) significance, then the two cases are indistinguishable, because in both cases the data indicates that 75% of subjects prefer Percy-Cola, which in subject-matter terms is quite a substantial deviation from the 50% we would expect in case neither of the liquids tasted any better than the other. Still, there is good reason to be more convinced of the superiority of Percy-Cola in case (b) than in case (a). The core reason for this is that under the null hypothesis that both drinks taste equally good (or bad), the probability of getting an outcome at least as favorable to Percy-Cola as the one we actually got turns out to be 5/16 ≈ 0.31 in case (a), while in case (b) the probability turns out to be about 6.7⋅10⁻⁵⁹. These numbers (0.31 and 6.7⋅10⁻⁵⁹, respectively) are precisely what is known in the theory of significance testing as the p-values for rejecting the null hypothesis. 0.31 is a really lousy p-value, meaning that in view of the data in (a) it is still fully plausible to suppose that the drinks are equally good (or even that Crazy-Cola is a bit better). On the other hand, 6.7⋅10⁻⁵⁹ is an extremely good p-value, so in case (b) we may safely conclude that Percy-Cola really does taste better (in the sense of being preferred by a majority of the population from which subjects have been sampled). In other words, case (b) exhibits much better statistical significance than case (a).
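For readers who want to check where these numbers come from, here is a minimal sketch in Python (the function name and the implementation choices are mine, not anything from the original discussion) that computes the one-sided binomial tail probabilities exactly:

    # One-sided p-values for the Percy-Cola example: under the null
    # hypothesis that neither drink tastes better, the number of subjects
    # preferring Percy-Cola is Binomial(n, 1/2), and the p-value is the
    # probability of an outcome at least as favorable to Percy-Cola as
    # the one actually observed.
    from math import comb

    def one_sided_p_value(successes: int, n: int) -> float:
        """P(X >= successes) for X ~ Binomial(n, 1/2)."""
        favorable = sum(comb(n, k) for k in range(successes, n + 1))
        return favorable / 2**n

    print(one_sided_p_value(3, 4))       # case (a): 5/16 = 0.3125
    print(one_sided_p_value(750, 1000))  # case (b): about 6.7e-59

Because Python integers have arbitrary precision, the tail sum is computed exactly, and only the final division produces a floating-point number, so even the astronomically small p-value in case (b) comes out accurately.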
Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation. Ziliak and McCloskey argue at length in their book that statistical significance has often been misused in many fields, and in this they are right. But they are wrong when they suggest that the concept is worthless and should be discarded.
Edit, March 4, 2015: Somewhat belatedly, and thanks to the kind remark by Mark Dehaven in the comments section below, I have realized that my sentence "Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation" in the last paragraph does not accurately reflect my view - neither my view now, nor the one I had two years ago. It is hard for me to understand now how I could have written such a thing, but my best guess is that it must have been written in haste. Statistical significance and p-values do not quantify "how convinced we should be", because there may be so much else, beyond the data set presently at hand, that ought to influence how convinced or not we should be. Instead of the unfortunate sentence, I'd prefer to say that "Statistical significance and p-values provide, as a first approximation, an indication of how strongly the data set in itself constitutes evidence against the null hypothesis (provided the various implicit and explicit model assumptions correctly represent reality)".