Comments on Häggström hävdar: Statistical significance is not a worthless concept

See the new paragraph added at the end of the blog...

2015-03-04T17:16:11.486+01:00

See the new paragraph added at the end of the blog post.

All right, I admit that... "Statistical sign...

2015-03-04T15:51:14.190+01:00

All right, I admit that...

"Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation"

...was overly imprecise. Of course there will, in practice, typically be many other things beyond statistical significance and the p-value that ought to contribute to how convinced or not convinced we are. But I maintain that statistical significance often plays an important role in this regard.

And by the way, who the heck is Olle Hägglund?

Olle Hägglund states that: "Statistical signi...

2015-03-03T18:31:30.806+01:00

Olle Hägglund states that: "Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation."

I think this statement is interesting for two reasons.

The first is that I believe the statement is false.
Statistical significance is a way of quantifying how much your data deviates from the null hypothesis. Whether this has anything to do with reality cannot be assessed through the data alone; it is for the researcher to judge, by combining the statistically significant finding with other information. Thus, statistical significance (or, for that matter, the p-value) is in itself not a useful way of quantifying evidence.

The other interesting thing about the statement is the claim that statistical significance is useful for the quantification of convictions. Convictions are by nature subjective, or qualitative. If the main results of frequentist statistical procedures result in a variety of subjective convictions about the state of the world, then what, in essence, makes this method different from and superior to Bayesian inference?

Thanks, Steve, for your comments! Yes, I do suspec...

2013-02-10T11:53:34.879+01:00

Thanks, Steve, for your comments! Yes, I do suspect that our positions are closer than they might seem. While I insist that the concept of statistical significance is a highly useful one that should most certainly not be thrown on the methodological garbage heap, I fully agree with you that mindless use of it is a bad and dangerous thing, and that such mindless use is disturbingly common. So when I deem it "useful", I do not mean to say that it should be mindlessly applied, on its own and in any situation. I would never advocate any straightforward statistical recipe (be it statistical significance or anything else) to be universally applied without consideration of context, etc. In particular, I have no sympathy for the primitive idea that a p-value of 0.049 automatically falsifies the null hypothesis, or that a p-value larger than 0.05 never yields any insight whatsoever.

In my Percy-Cola example, I tried to choose the numbers to be extreme enough to avoid most of the difficult side considerations that tend to arise. Now you say that a p-value of 0.20 or 0.31 "might be enough to suspect a real and not merely random effect", and this I cannot deny. I can certainly cook up situations, e.g. in a gambling context, where evidence corresponding to such p-values would be enough for me to actually act upon. But I'd say typically it would not. It all depends. To comment on the Vioxx case I'd first have to go back to your book and your sources, so I'll pass on that. But I can say that if a Percy-Cola representative would ask me to endorse, based on scenario (a) above, the statement that statistical evidence indicates their soft drink tasting better than their competitor's, my reply would be "in your dreams!".

I like your road rage example. The point you make is close to one I have sometimes made, namely that statistical evidence is in some cases entirely unnecessary, and that acting purely on anecdotal evidence may then be warranted. Science is a highly prestigious thing, which is good in many ways, but a downside is that scientific evidence is often mindlessly requested in such cases. For instance, my observation that "many Swedish school children are understimulated by a much too easy math curriculum" (obviously true) was once dismissed, by a professor of pedagogics, as "merely an opinion" in the absence of a scientific study to back it.

Dear Olle, I see what you mean: in your defense ...

2013-02-09T19:13:44.203+01:00

Dear Olle,

I see what you mean: in your defense of null hypothesis significance testing, you did not assert that substantive significance should be neglected. But that is what happens in the journals, from economics to medicine, in 8 or 9 of every 10 articles published. Our positions are much closer than the recent posts seem to indicate. We like very much your review and also your other writings on significance and science.

But a p value of .31 might be enough to suspect a real and not merely random effect. In gambler's terms it means (assuming there are no problems of bias or confounding in your subjects and units) the odds are .69/.31, or 2.2-to-1 or greater, that a real effect has been detected over and above the assumed-to-be-true point null of no difference. If the substantive stakes are important enough - let's say the study is not about Cola preference but rather about the likelihood of heart attack from taking one of two pain pills - a p value of .31 should not be dismissed on the faulty logic that if the p doesn't fit you can acquit.
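
The odds arithmetic above can be checked with a short sketch. The function name `odds_from_p` is hypothetical, and the calculation simply follows this comment's own framing of setting 1 - p against p as gambler's odds; it is an illustration of the quoted numbers, not a general rule for converting p-values into the probability of a real effect.

```python
def odds_from_p(p_value: float) -> float:
    """Return (1 - p) / p, the gambler's-odds reading used in the comment.

    Assumption: this mirrors the comment's framing only; the name and
    the function are illustrative, not taken from any library.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    return (1.0 - p_value) / p_value


if __name__ == "__main__":
    # p = .31 is the Cola example, p = .20 the figure cited for Vioxx.
    for p in (0.31, 0.20):
        print(f"p = {p:.2f} -> odds of about {odds_from_p(p):.1f}-to-1")
```

With p = .31 this gives roughly 2.2-to-1, matching the figure quoted above.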

Merck's Vioxx pill hurt a lot of people for that reason - for treating a p value of about .20 as "meaningless". (It wasn't.) If I can learn this from n=4 instead of n=1,000, that might do the trick -- especially if my budget is constrained.

Small samples can prove more than people think. In the United States, for example, we now have "road rage" laws which fine angry drivers for bad behavior on the highway. A major stimulus for the law was n=1 angry man in California, who threw a woman's dog into oncoming traffic on grounds that she cut him off while driving on the highway. This "event" did not have to occur 1,000 times before reasonable people decided to reject road rage.

On the other hand, it is mainly in repeated experiments, wherein the sampling error is based on actual and not merely imaginary repetitions of the experiment, that statistical significance can importantly contribute. Almost no one in economics and other social sciences repeats an experiment. Gosset's (aka Student's) work at Guinness did precisely that: once Gosset decided upon his expected oomph, he repeated experiments under different conditions and at multiple sites until he could say with, for example, 10 to 1 odds that a certain oomph, plus or minus experimental error, was likely to obtain.

Steve Ziliak

And while you're thinking about that question,...

2013-02-07T09:48:04.743+01:00

And while you're thinking about that question, Deirdre, let me offer another one:

If one party A of a discussion were to say...

"For a data set to provide reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance"

...and the other party B were to respond by accusing A of wanting...

"people to go on using null hypothesis significance testing as they usually do, alone"

...and by equating A's position to...

"saying that the left wing of an airplane is a 'useful tool', and then advising people to go on using the left wing without the right wing"

...how would you characterize B's mode of rhetoric?

Hehe

2013-02-06T13:06:34.448+01:00

Hehe

Thanks for admitting that you didn't know what...

2013-02-06T12:36:17.620+01:00

Thanks for admitting that you didn't know what you were talking about -- let's hope that this doesn't generalise!

The link I gave was merely intended to illustrate ...

2013-02-06T08:55:05.438+01:00

The link I gave was merely intended to illustrate your role as a provocateur, and had nothing to do with relativism except insofar as a relativist position makes it harder for a commentator to distinguish serious science from pseudoscientific crap.

Thanks for your reply, Deirdre. So tell me then ho...

2013-02-06T08:50:25.660+01:00

Thanks for your reply, Deirdre. So tell me then how you would treat cases (a) and (b) of the Percy-Cola scenario without invoking anything "as idiotic as" significance testing.

Bäste Olle, We may have misread your piece. To...

2013-02-06T04:17:17.221+01:00

Dear Olle,

We may have misread your piece. Too bad, because the position we defend is the correct one! You say statistical significance is a "useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation." That is indeed how it is taught. But our book, and the scores of other statements by high theorists and also (like us) practitioners who have thought it through, from Gosset to the present, show again and again that it is false. If you amended your statement to include the words "type II error" and "a substantive loss function" and "in cases in which the actual problem is a sampling one, and not an entire population," we would agree, of course, because in that case the proper decision-theoretic problem in the style of Neyman and Pearson is being posed. But that's not what you say now. You want people to go on using null hypothesis significance testing as they usually do, alone, unmodified, without considerations of power. You are defending the misuse. It would be like saying that the left wing of an airplane is a "useful tool," and then advising people to go on using the left wing without the right wing. The airplane, and the science, will crash!

Best regards,

Deirdre McCloskey

It's news to me that I'm a relativist, and...

2013-02-06T01:07:42.004+01:00

It's news to me that I'm a relativist, and I don't see how what you link to has anything to do with relativism.

Another hypothesis I find plausible is that people...

2013-02-05T20:42:14.433+01:00

Another hypothesis I find plausible is that people have really bad taste.