Tuesday, February 5, 2013

Statistical significance is not a worthless concept

In 2009, I read and enjoyed Stephen Ziliak's and Deirdre McCloskey's important book The Cult of Statistical Significance, and in 2010 my review of it appeared in the Notices of the American Mathematical Society. In the hope of engaging Ziliak and McCloskey in some further discussion, I write the present blog post in English; readers looking for a gentle introduction in Swedish to statistical hypothesis testing and statistical significance may instead consult an earlier blog post of mine.

In Ziliak's and McCloskey's recent contribution to Econ Journal Watch, we find the following passage:
    In several dozen journal reviews and in comments we have received—from, for example, four Nobel laureates, the statistician Dennis Lindley (2012), the mathematician Olle Häggström (2010), the sociologist Steve Fuller (2008), and the historian Theodore Porter (2008)—no one [...] has tried to defend null hypothesis significance testing.
This surprised me, not so much because I had never expected to be cited in the same sentence as the fully fledged relativist provocateur Steve Fuller, but mainly because Häggström (2010)—which contains the passage
    The Cult of Statistical Significance is written in an entertaining and polemical style. Sometimes the authors push their position a bit far, such as when they ask themselves: "If null-hypothesis significance testing is as idiotic as we and its other critics have so long believed, how on earth has it survived?" (p. 240). Granted, the single-minded focus on statistical significance that they label sizeless science is bad practice. Still, to throw out the use of significance tests would be a mistake, considering how often it is a crucial tool for concluding with confidence that what we see really is a pattern, as opposed to just noise. For a data set to provide reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance
—is such an extraordinarily odd reference to put forth in support of the statement that "no one has tried to defend null hypothesis significance testing". Ziliak and McCloskey are of course free to be unimpressed by this passage and not consider it to qualify as a defense of statistical significance testing, but note that they write "no one has tried to defend" rather than just "no one has defended". Hence, they do not even grant my passage the status of a valid attempt at defending significance testing. This strikes me as overly harsh.

Let me take this opportunity to expand a bit, by means of a simple example, on my claim that in order to establish "reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance". Assume that the producer of the soft drink Percy-Cola has carried out a study in which subjects have been blindly exposed to one mug of Percy-Cola and one mug of Crazy-Cola (in randomized order), and asked to indicate which of them tastes better. Assume furthermore that 75% of subjects prefer the mug containing Percy-Cola, while only 25% prefer the one with Crazy-Cola. How impressed should we be by this?

This depends on how large the study is. Compare the two cases
    (a) out of a total of 4 subjects, 3 preferred Percy-Cola,

    (b) out of a total of 1000 subjects, 750 preferred Percy-Cola.

If we follow Ziliak's and McCloskey's advice to ignore statistical significance and focus instead purely on subject-matter (i.e., in this case, gastronomical) significance, then the two cases are indistinguishable, because in both cases the data indicates that 75% of subjects prefer Percy-Cola, which in subject-matter terms is quite a substantial deviation from the 50% we would expect in case neither of the liquids tasted any better than the other. Still, there is good reason to be more convinced of the superiority of Percy-Cola in case (b) than in case (a). The core reason for this is that under the null hypothesis that both drinks taste equally good (or bad), the probability of getting an outcome at least as favorable to Percy-Cola as the one we actually got turns out to be 5/16 ≈ 0.31 in case (a), while in case (b) the probability turns out to be about 6.7⋅10⁻⁵⁹. These numbers (0.31 and 6.7⋅10⁻⁵⁹, respectively) are precisely what is known in the theory of significance testing as the p-values for rejecting the null hypothesis. 0.31 is a really lousy p-value, meaning that in view of the data in (a) it is still fully plausible to suppose that the drinks are equally good (or even that Crazy-Cola is a bit better). On the other hand, 6.7⋅10⁻⁵⁹ is an extremely good p-value, so in case (b) we may safely conclude that Percy-Cola really does taste better (in the sense of being preferred by a majority of the population from which subjects have been sampled). In other words, case (b) exhibits much better statistical significance than case (a).
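The two p-values above are just one-sided tail probabilities of a binomial distribution under the fair-coin null, and can be checked exactly with a short computation (a minimal sketch; the function name `one_sided_p` is my own illustrative choice, not standard terminology):

```python
from math import comb

def one_sided_p(n, k):
    """P(X >= k) for X ~ Binomial(n, 1/2): the probability, under the
    null hypothesis that both drinks taste equally good, of an outcome
    at least as favorable to Percy-Cola as the one observed."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

print(one_sided_p(4, 3))       # case (a): 5/16 = 0.3125
print(one_sided_p(1000, 750))  # case (b): about 6.7e-59
```

Working with exact integer binomial coefficients avoids the severe inaccuracy a normal approximation would suffer this far out in the tail.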

Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation. Ziliak and McCloskey argue at length in their book that statistical significance has often been misused in many fields, and in this they are right. But they are wrong when they suggest that the concept is worthless and should be discarded.

Edit, March 4, 2015: Somewhat belatedly, and thanks to the kind remark by Mark Dehaven in the comments section below, I have realized that my sentence "Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation" in the last paragraph does not accurately reflect my view - neither my view now, nor the one I had two years ago. It is hard for me to understand now how I could have written such a thing, but my best guess is that it must have been written in haste. Statistical significance and p-values do not quantify "how convinced we should be", because there may be so much else, beyond the data set presently at hand, that ought to influence how convinced or not we should be. Instead of the unfortunate sentence, I'd prefer to say that "Statistical significance and p-values provide, as a first approximation, an indication of how strongly the data set in itself constitutes evidence against the null hypothesis (provided the various implicit and explicit model assumptions correctly represent reality)".

13 comments:

  1. Another hypothesis I find plausible is that people have really bad taste.

  2. It's news to me that I'm a relativist, and I don't see how what you link to has anything to do with relativism.

    1. The link I gave was merely intended to illustrate your role as a provocateur, and had nothing to do with relativism except insofar as a relativist position makes it harder for a commentator to distinguish serious science from pseudoscientific crap.

    2. Thanks for admitting that you didn't know what you were talking about -- let's hope that this doesn't generalise!

  3. Dear Olle,

    We may have misread your piece. Too bad, because the position we defend is the correct one! You say statistical significance is a "useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation." That is indeed how it is taught. But our book, and the scores of other statements by high theorists and also (like us) practitioners who have thought it through, from Gosset to the present, show again and again that it is false. If you amended your statement to include the words "type II error" and "a substantive loss function" and "in cases in which the actual problem is a sampling one, and not an entire population," we would agree, of course, because in that case the proper decision-theoretic problem in the style of Neyman and Pearson is being posed. But that's not what you say now. You want people to go on using null hypothesis significance testing as they usually do, alone, unmodified, without considerations of power. You are defending the misuse. It would be like saying that the left wing of an airplane is a "useful tool," and then advising people to go on using the left wing without the right wing. The airplane, and the science, will crash!

    Best regards,

    Deirdre McCloskey

    1. Thanks for your reply, Deirdre. So tell me then how you would treat cases (a) and (b) of the Percy-Cola scenario without invoking anything "as idiotic as" significance testing.

    2. And while you're thinking about that question, Deirdre, let me offer another one:

      If one party A of a discussion were to say...

      "For a data set to provide reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance"

      ...and the other party B were to respond by accusing A of wanting...

      "people to go on using null hypothesis significance testing as they usually do, alone"

      ...and by equating A's position to...

      "saying that the left wing of an airplane is a 'useful tool', and then advising people to go on using the left wing without the right wing"

      ...how would you characterize B's mode of rhetoric?

  4. Dear Olle,

    I see what you mean: in your defense of null hypothesis significance testing, you did not assert that substantive significance should be neglected. But that is what happens in the journals, from economics to medicine, in 8 or 9 of every 10 articles published. Our positions are much closer than the recent posts seem to indicate. We like very much your review and also your other writings on significance and science.

    But a p value of .31 might be enough to suspect a real and not merely random effect. In gambler's terms it means (assuming there are no problems of bias or confounding in your subjects and units) that the odds are .69/.31, or 2.2-to-1 or greater, that a real effect has been detected over and above the assumed-to-be-true point null of no difference. If the substantive stakes are important enough - let's say the study is not about Cola preference but rather about the likelihood of heart attack from taking one of two pain pills - a p value of .31 should not be dismissed on the faulty logic that if the p doesn't fit you can acquit.

    Merck's Vioxx pill hurt a lot of people for that reason - for treating a p value of about .20 as "meaningless". (It wasn't.) If I can learn this from n=4 instead of n=1,000, that might do the trick -- especially if my budget is constrained.

    Small samples can prove more than people think. In the United States, for example, we now have "road rage" laws which fine angry drivers for bad behavior on the highway. A major stimulus for the law was n=1 angry man in California, who threw a woman's dog into oncoming traffic on grounds that she cut him off while driving on the highway. This "event" did not have to occur 1,000 times before reasonable people decided to reject road rage.

    On the other hand, it is mainly in repeated experiments, wherein the sampling error is based on actual and not merely imaginary repetitions of the experiment, that statistical significance can importantly contribute. Almost no one in economics and other social sciences repeats an experiment. Gosset's (aka Student's) work at Guinness did precisely that: once Gosset decided upon his expected oomph, he repeated experiments under different conditions and at multiple sites until he could say with, for example, 10 to 1 odds that a certain oomph, plus or minus experimental error, was likely to obtain.

    Steve Ziliak

    1. Thanks, Steve, for your comments! Yes, I do suspect that our positions are closer than they might seem. While I insist that the concept of statistical significance is a highly useful one that should most certainly not be thrown on the methodological garbage heap, I fully agree with you that mindless use of it is a bad and dangerous thing, and that such mindless use is disturbingly common. So when I deem it "useful", I do not mean to say that it should be mindlessly applied, on its own and in any situation. I would never advocate any straightforward statistical recipe (be it statistical significance or anything else) to be universally applied without consideration of context, etc. In particular, I have no sympathy for the primitive idea that a p-value of 0.049 automatically falsifies the null hypothesis, or that a p-value larger than 0.05 never yields any insight whatsoever.

      In my Percy-Cola example, I tried to choose the numbers to be extreme enough to avoid most of the difficult side considerations that tend to arise. Now you say that a p-value of 0.20 or 0.31 "might be enough to suspect a real and not merely random effect", and this I cannot deny. I can certainly cook up situations, e.g. in a gambling context, where evidence corresponding to such p-values would be enough for me to actually act upon. But I'd say typically it would not. It all depends. To comment on the Vioxx case I'd first have to go back to your book and your sources, so I'll pass on that. But I can say that if a Percy-Cola representative were to ask me to endorse, based on scenario (a) above, the statement that statistical evidence indicates that their soft drink tastes better than their competitor's, my reply would be "in your dreams!".

      I like your road rage example. The point you make is close to one I have sometimes made, namely that statistical evidence is in some cases entirely unnecessary, and that acting purely on anecdotal evidence may then be warranted. Science is a highly prestigious thing, which is good in many ways, but a downside is that scientific evidence is often mindlessly requested in such cases. For instance, my observation that "many Swedish school children are understimulated by a much too easy math curriculum" (obviously true) was once dismissed, by a professor of pedagogics, as "merely an opinion" in the absence of a scientific study to back it.

  5. Olle Hägglund states that: "Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation."

    I think this statement is interesting for two reasons.

    The first is that I believe the statement is false.
    Statistical significance is a way of quantifying how much your data deviates from the null hypothesis. Whether this has anything to do with reality cannot be assessed through the data - it is for the researcher to judge by combining the statistically significant finding with other information. Thus, statistical significance (or, for that matter, the p-value) is in itself not a useful way of quantifying evidence.

    The other interesting thing about the statement is the claim that statistical significance is useful for the quantification of convictions. Convictions are by nature subjective, or qualitative. If the main results of frequentist statistical procedures amount to a variety of subjective convictions about the state of the world, then what, in essence, makes this method different from and superior to Bayesian inference?

    1. All right, I admit that...

      "Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation"

      ...was overly imprecise. Of course there will, in practice, typically be many other things beyond statistical significance and the p-value that ought to contribute to how convinced or not convinced we are. But I maintain that statistical significance often plays an important role in this regard.

      And by the way, who the heck is Olle Hägglund?

    2. See the new paragraph added at the end of the blog post.