Häggström hävdar: Statistical significance is not a worthless concept

Tuesday, 5 February 2013

Statistical significance is not a worthless concept

In 2009, I read and enjoyed Stephen Ziliak's and Deirdre McCloskey's important book The Cult of Statistical Significance, and in 2010 my review of it appeared in the Notices of the American Mathematical Society. In the hope of engaging Ziliak and McCloskey in some further discussion, I write the present blog post in English; readers looking for a gentle introduction in Swedish to statistical hypothesis testing and statistical significance may instead consult an earlier blog post of mine.

In Ziliak's and McCloskey's recent contribution to Econ Journal Watch, we find the following passage:

defend

This surprised me, not so much because I had never expected to be cited in the same sentence as the fully fledged relativist provocateur Steve Fuller, but mainly because Häggström (2010)—which contains the passage

The Cult of Statistical Significance

both

and

—is such an extraordinarily odd reference to put forth in support of the statement that "no one has tried to defend null hypothesis significance testing". Ziliak and McCloskey are of course free to be unimpressed by this passage and not consider it to qualify as a defense of statistical significance testing, but note that they write "no one has tried to defend" rather than just "no one has defended". Hence, they do not even grant my passage the status of a valid attempt at defending significance testing. This strikes me as overly harsh.

Let me take this opportunity to expand a bit, by means of a simple example, on my claim that in order to establish "reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance". Assume that the producer of the soft drink Percy-Cola has carried out a study in which subjects have been blindly exposed to one mug of Percy-Cola and one mug of Crazy-Cola (in randomized order), and asked to indicate which of them tastes better. Assume furthermore that 75% of subjects prefer the mug containing Percy-Cola, while only 25% prefer the one with Crazy-Cola. How impressed should we be by this?

This depends on how large the study is. Compare the two cases

(a) out of a total of 4 subjects, 3 preferred Percy-Cola,

(b) out of a total of 1000 subjects, 750 preferred Percy-Cola.

If we follow Ziliak's and McCloskey's advice to ignore statistical significance and focus instead purely on subject-matter (i.e., in this case, gastronomical) significance, then the two cases are indistinguishable, because in both cases the data indicates that 75% of subjects prefer Percy-Cola, which in subject-matter terms is quite a substantial deviation from the 50% we would expect in case neither of the liquids tasted any better than the other. Still, there is good reason to be more convinced of the superiority of Percy-Cola in case (b) than in case (a). The core reason for this is that under the null hypothesis that both drinks taste equally good (or bad), the probability of getting an outcome at least as favorable to Percy-Cola as the one we actually got turns out to be 5/16 ≈ 0.31 in case (a), while in case (b) the probability turns out to be about 6.7⋅10^-59. These numbers (0.31 and 6.7⋅10^-59, respectively) are precisely what is known in the theory of significance testing as the p-values for rejecting the null hypothesis. 0.31 is a really lousy p-value, meaning that in view of the data in (a) it is still fully plausible to suppose that the drinks are equally good (or even that Crazy-Cola is a bit better). On the other hand, 6.7⋅10^-59 is an extremely good p-value, so in case (b) we may safely conclude that Percy-Cola really does taste better (in the sense of being preferred by a majority of the population from which subjects have been sampled). In other words, case (b) exhibits much better statistical significance than case (a).
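For readers who want to check the two p-values themselves, here is a minimal Python sketch. It assumes that case (a) is 3 out of 4 subjects (which is what the figure 5/16 implies) and that the null hypothesis makes each subject's preference a fair coin flip; the function name `one_sided_p_value` is my own choice for this illustration and not taken from any library.

```python
from fractions import Fraction
from math import comb


def one_sided_p_value(n: int, k: int) -> Fraction:
    """Exact P(X >= k) for X ~ Binomial(n, 1/2).

    Under the null hypothesis each of the 2**n preference patterns is
    equally likely, so we count the patterns with at least k subjects
    preferring Percy-Cola and divide by 2**n.
    """
    favourable = sum(comb(n, i) for i in range(k, n + 1))
    return Fraction(favourable, 2 ** n)


# Case (a), assumed here to be 3 of 4 subjects preferring Percy-Cola.
p_a = one_sided_p_value(4, 3)
# Case (b): 750 of 1000 subjects preferring Percy-Cola.
p_b = one_sided_p_value(1000, 750)

print(p_a, float(p_a))
print(f"{float(p_b):.1e}")
```

Exact integer arithmetic is used on purpose: the tail probability in case (b) is so small that it is safer to convert to floating point only at the very end.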

Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation. Ziliak and McCloskey argue at length in their book that statistical significance has often been misused in many fields, and in this they are right. But they are wrong when they suggest that the concept is worthless and should be discarded.

Edit, March 4, 2015: Somewhat belatedly, and thanks to the kind remark by Mark Dehaven in the comments section below, I have realized that my sentence "Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation" in the last paragraph does not accurately reflect my view: neither my view now, nor the one I had two years ago. It is hard for me to understand now how I could have written such a thing, but my best guess is that it must have been written in haste. Statistical significance and p-values do not quantify "how convinced we should be", because there may be so much else, beyond the data set presently at hand, that ought to influence how convinced or not we should be. Instead of the unfortunate sentence, I'd prefer to say that "Statistical significance and p-values provide, as a first approximation, an indication of how strongly the data set in itself constitutes evidence against the null hypothesis (provided the various implicit and explicit model assumptions correctly represent reality)".

13 comments:

Anonymous, 5 February 2013 at 20:42
Another hypothesis I find plausible is that people have really bad taste.
Steve Fuller, 6 February 2013 at 01:07
It's news to me that I'm a relativist, and I don't see how what you link to has anything to do with relativism.
Unknown, 6 February 2013 at 04:17
Dear Olle,

We may have misread your piece. Too bad, because the position we defend is the correct one! You say statistical significance is a "useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation." That is indeed how it is taught. But our book, and the scores of other statements by high theorists and also (like us) practitioners who have thought it through, from Gosset to the present, show again and again that it is false. If you amended your statement to include the words "type II error" and "a substantive loss function" and "in cases in which the actual problem is a sampling one, and not an entire population," we would agree, of course, because in that case the proper decision-theoretic problem in the style of Neyman and Pearson is being posed. But that's not what you say now. You want people to go on using null hypothesis significance testing as they usually do, alone, unmodified, without considerations of power. You are defending the misuse. It would be like saying that the left wing of an airplane is a "useful tool," and then advising people to go on using the left wing without the right wing. The airplane, and the science, will crash!

Best regards,

Deirdre McCloskey
Steve Ziliak, 9 February 2013 at 19:13
Dear Olle,

I see what you mean: in your defense of null hypothesis significance testing, you did not assert that substantive significance should be neglected. But that is what happens in the journals, from economics to medicine, in 8 or 9 of every 10 articles published. Our positions are much closer than the recent posts seem to indicate. We like very much your review and also your other writings on significance and science.

But a p value of .31 might be enough to suspect a real and not merely random effect. In gambler's terms it means (assuming there are no problems of bias or confounding in your subjects and units) the odds are .69/.31 or 2.2-to-1 or greater that a real effect has been detected over and above the assumed-to-be-true point null of no difference. If the substantive stakes are important enough - let's say the study is not about Cola preference but rather about the likelihood of heart attack from taking one of two pain pills - a p value of .31 should not be dismissed on the faulty logic that if the p doesn't fit you can acquit.

Merck's Vioxx pill hurt a lot of people for that reason - for treating a p value of about .20 as "meaningless". (It wasn't.) If I can learn this from n=4 instead of n=1,000, that might do the trick -- especially if my budget is constrained.

Small samples can prove more than people think. In the United States, for example, we now have "road rage" laws which fine angry drivers for bad behavior on the highway. A major stimulus for the law was n=1 angry man in California, who threw a woman's dog into oncoming traffic on grounds that she cut him off while driving on the highway. This "event" did not have to occur 1,000 times before reasonable people decided to reject road rage.

On the other hand, it is mainly in repeated experiments, wherein the sampling error is based on actual and not merely imaginary repetitions of the experiment, that statistical significance can importantly contribute. Almost no one in economics and other social sciences repeats an experiment. Gosset's (aka Student's) work at Guinness did precisely that: once Gosset decided upon his expected oomph, he repeated experiments under different conditions and at multiple sites until he could say with, for example, 10 to 1 odds that a certain oomph, plus or minus experimental error, was likely to obtain.

Steve Ziliak
Mark Dehaven, 3 March 2015 at 18:31
Olle Häggström states that: "Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation."

I think this statement is interesting for two reasons.

The first is that I believe the statement is false.
Statistical significance is a way of quantifying how much your data deviates from the null hypothesis. Whether this has anything to do with reality cannot be assessed through the data; it is for the researcher to judge, by combining the statistically significant finding with other information. Thus, statistical significance (or, for that matter, the p-value) is in itself not a useful way of quantifying evidence.

The other interesting thing about the statement is the claim that statistical significance is useful for the quantification of convictions. Convictions are by nature subjective, or qualitative. If the main results of frequentist statistical procedures result in a variety of subjective convictions about the state of the world, then what, in essence, makes this method different from and superior to Bayesian inference?