
Friday, March 20, 2015

Geoff Cumming's dancing p-values

The following video, in which Geoff Cumming, a psychology professor deeply engaged in statistical methodology, argues against the use of so-called p-values, has almost gone viral following the recent intellectual breakdown at the journal Basic and Applied Social Psychology (BASP). To fully appreciate Cumming's presentation one should preferably already be familiar with p-values and related statistical concepts (an acquaintance that can profitably be made by reading Sections 1 and 2 of my essay Statistisk signifikans och Armageddon), but by all means, it may well be somewhat entertaining anyway.

My feelings about the video are somewhat mixed, so let me offer the following comments.
    1. As regards Cumming's performance from a pedagogical, presentational and rhetorical point of view, I can do nothing but take my hat off - a true masterpiece!

    2. Cumming clearly has a point in that it can often be more informative to report one's scientific results in terms of confidence intervals than to merely state p-values. In a simple, clean-cut example like the one Cumming presents this is an excellent idea, but it is no universal miracle cure. Computing a confidence interval involves the equivalent of computing p-values for all possible parameter values simultaneously (a minimal sketch of this test inversion is given after this list), and in more complicated settings (not least when more than one unknown parameter is present) this often turns out to be mathematically infeasible and/or to lead to far more complicated and hard-to-interpret confidence sets than the well-behaved intervals obtained in the video.

    3. The p-value dance that Cumming demonstrates in the video is the consequence of a combination of moderate effect size and a small sample. If it is true, as Cumming says, that the simulation's effect size and sample size are typical of empirical studies in psychology, then in my opinion the most important lesson of his example is not that there is anything wrong with the p-value concept as such. Rather, the lesson is this: psychologists need to become more rigorous in their so-called power analyses, meaning that they need to make sure that their samples are large enough to detect plausible effect sizes with reasonable reliability; a simulation sketch after this list illustrates the point. (The same can be said about the economist Robert Östling's recent simulation example on Ekonomistas.)

    4. In the vocal crowd of anti-p-value fundamentalists, mostly ignorant of statistical theory, who rejoice over the above-mentioned BASP move to ban p-values, many refer to Cumming's video. Even though it would be wrong to blame Cumming for all the foolishness uttered in this discussion, he has, with his sweeping generalizations (e.g., "For a typical experiment, p tells you virtually nothing", 6:50 into the video), clearly encouraged some of the nonsense. A typical example, which has spread widely, is the following statement by the neurologist Steven Novella:

      Another problem with the p-value is that it is not highly replicable. This is demonstrated nicely by Geoff Cumming as illustrated with a video. He shows, using computer simulation, that if one study achieves a p-value of 0.05, this does not predict that an exact replication will also yield the same p-value.
    To speak of the replicability of p-values is so misplaced that it makes me facepalm. A p-value is not a parameter of the unknown distribution that the researcher aims to estimate, but a measure of the extent to which the obtained data can be taken to speak against the so-called null hypothesis. To criticize the p-value concept for lack of replicability is like refusing to accept that data are uncertain. Face it: a new experiment means new data - and a new p-value. Anyone who accepts the logic of dismissing the p-value on these grounds might just as well dismiss data collection itself - the data turn out differently every time, and are thus not replicable! An absurd conclusion, of course, but such is Novella's bizarre logic.

    5. There is much to criticize about how p-values and statistical significance are used and interpreted in practice in many fields. It is important, however, not to throw out the baby with the bathwater. It is the misinterpretations and the erroneous use of p-values and statistical significance that should be combated, not the concepts themselves, which often provide very important statistical tools. A book (otherwise valuable and interesting) that commits the same baby-with-bathwater ejection as the BASP editors is Stephen Ziliak's and Deirdre McCloskey's The Cult of Statistical Significance from 2008. In my review of that book, I summarized my view of the matter as follows:

      The Cult of Statistical Significance is written in an entertaining and polemical style. Sometimes the authors push their position a bit far, such as when they ask themselves: "If null-hypothesis significance testing is as idiotic as we and its other critics have so long believed, how on earth has it survived?" (p. 240). Granted, the single-minded focus on statistical significance that they label sizeless science is bad practice. Still, to throw out the use of significance tests would be a mistake, considering how often it is a crucial tool for concluding with confidence that what we see really is a pattern, as opposed to just noise. For a data set to provide reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance.
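
To make the test-inversion remark in point 2 concrete, here is a minimal Python sketch, assuming entirely hypothetical data of 62 successes in 100 trials: it recovers a 95% confidence interval for a binomial proportion by collecting every null value that a two-sided level-0.05 test fails to reject.

    from scipy.stats import binomtest
    import numpy as np

    k, n = 62, 100  # hypothetical data: 62 successes in 100 trials

    # A 95% confidence interval via test inversion: keep every candidate
    # parameter value p0 that a two-sided level-0.05 test does not reject.
    grid = np.linspace(0.001, 0.999, 999)
    kept = [p0 for p0 in grid if binomtest(k, n, p0).pvalue > 0.05]
    print(f"Test-inversion interval: [{min(kept):.3f}, {max(kept):.3f}]")

In one dimension the inversion is harmless; in problems with several unknown parameters, the same operation is precisely where the intractable confidence sets mentioned in point 2 come from.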
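And to make the lesson of point 3 concrete, the following rough simulation sketch (with an effect size of d = 0.5 and sample sizes of 32 and 250 per group - my own illustrative choices, not Cumming's exact settings) shows p-values dancing wildly under low power and settling down once the sample is large enough:

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(2015)

    # Run 20 replications of a two-group experiment with true effect d,
    # collecting the two-sample t-test p-value from each replication.
    def p_value_dance(n_per_group, d=0.5, replications=20):
        return [ttest_ind(rng.normal(d, 1, n_per_group),
                          rng.normal(0, 1, n_per_group)).pvalue
                for _ in range(replications)]

    for n in (32, 250):
        ps = p_value_dance(n)
        print(f"n = {n:3d} per group: p-values range from "
              f"{min(ps):.2g} to {max(ps):.2g}")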

Tuesday, February 5, 2013

Statistical significance is not a worthless concept

In 2009, I read and enjoyed Stephen Ziliak's and Deirdre McCloskey's important book The Cult of Statistical Significance, and in 2010 my review of it appeared in the Notices of the American Mathematical Society. In the hope of engaging Ziliak and McCloskey in some further discussion, I write the present blog post in English; readers looking for a gentle introduction in Swedish to statistical hypothesis testing and statistical significance may instead consult an earlier blog post of mine.

In Ziliak's and McCloskey's recent contribution to Econ Journal Watch, we find the following passage:
    In several dozen journal reviews and in comments we have received—from, for example, four Nobel laureates, the statistician Dennis Lindley (2012), the mathematician Olle Häggström (2010), the sociologist Steve Fuller (2008), and the historian Theodore Porter (2008)—no one [...] has tried to defend null hypothesis significance testing.
This surprised me, not so much because I had never expected to be cited in the same sentence as the fully fledged relativist provocateur Steve Fuller, but mainly because Häggström (2010)—which contains the passage
    The Cult of Statistical Significance is written in an entertaining and polemical style. Sometimes the authors push their position a bit far, such as when they ask themselves: "If null-hypothesis significance testing is as idiotic as we and its other critics have so long believed, how on earth has it survived?" (p. 240). Granted, the single-minded focus on statistical significance that they label sizeless science is bad practice. Still, to throw out the use of significance tests would be a mistake, considering how often it is a crucial tool for concluding with confidence that what we see really is a pattern, as opposed to just noise. For a data set to provide reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance
—is such an extraordinarily odd reference to put forth in support of the statement that "no one has tried to defend null hypothesis significance testing". Ziliak and McCloskey are of course free to be unimpressed by this passage and not consider it to qualify as a defense of statistical significance testing, but note that they write "no one has tried to defend" rather than just "no one has defended". Hence, they do not even grant my passage the status of a valid attempt at defending significance testing. This strikes me as overly harsh.

Let me take this opportunity to expand a bit, by means of a simple example, on my claim that in order to establish "reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance". Assume that the producer of the soft drink Percy-Cola has carried out a study in which subjects have been blindly exposed to one mug of Percy-Cola and one mug of Crazy-Cola (in randomized order), and asked to indicate which of them tastes better. Assume furthermore that 75% of subjects prefer the mug containing Percy-Cola, while only 25% prefer the one with Crazy-Cola. How impressed should we be by this?

This depends on how large the study is. Compare the two cases:
    (a) out of a total of 4 subjects, 3 preferred Percy-Cola,

    (b) out of a total of 1000 subjects, 750 preferred Percy-Cola.

If we follow Ziliak's and McCloskey's advice to ignore statistical significance and focus instead purely on subject-matter (i.e., in this case, gastronomical) significance, then the two cases are indistinguishable, because in both cases the data indicates that 75% of subjects prefer Percy-Cola, which in subject-matter terms is quite a substantial deviation from the 50% we would expect in case neither of the liquids tasted any better than the other. Still, there is good reason to be more convinced of the superiority of Percy-Cola in case (b) than in case (a). The core reason for this is that under the null hypothesis that both drinks taste equally good (or bad), the probability of getting an outcome at least as favorable to Percy-Cola as the one we actually got turns out to be 5/16 ≈ 0.31 in case (a), while in case (b) the probability turns out to be about 6.7⋅10⁻⁵⁹. These numbers (0.31 and 6.7⋅10⁻⁵⁹, respectively) are precisely what is known in the theory of significance testing as the p-values for rejecting the null hypothesis. 0.31 is a really lousy p-value, meaning that in view of the data in (a) it is still fully plausible to suppose that the drinks are equally good (or even that Crazy-Cola is a bit better). On the other hand, 6.7⋅10⁻⁵⁹ is an extremely good p-value, so in case (b) we may safely conclude that Percy-Cola really does taste better (in the sense of being preferred by a majority of the population from which subjects have been sampled). In other words, case (b) exhibits much better statistical significance than case (a).
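
For readers who want to check the arithmetic, a few lines of Python (a sketch relying on scipy's binomial distribution) reproduce both tail probabilities:

    from scipy.stats import binom

    # Under the null hypothesis, the number of subjects preferring
    # Percy-Cola is Binomial(n, 1/2); the p-value is the upper tail
    # probability P(X >= k), i.e. the survival function at k - 1.
    for k, n in [(3, 4), (750, 1000)]:
        print(f"{k} of {n}: p = {binom.sf(k - 1, n, 0.5):.3g}")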

Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation. Ziliak and McCloskey argue at length in their book that statistical significance has often been misused in many fields, and in this they are right. But they are wrong when they suggest that the concept is worthless and should be discarded.

Edit, March 4, 2015: Somewhat belatedly, and thanks to the kind remark by Mark Dehaven in the comments section below, I have realized that my sentence "Statistical significance is a useful way of quantifying how convinced we should be that an observed effect is real and not just a statistical fluctuation" in the last paragraph does not accurately reflect my view - neither my view now, nor the one I had two years ago. It is hard for me to understand now how I could have written such a thing, but my best guess is that it must have been written in haste. Statistical significance and p-values do not quantify "how convinced we should be", because there may be so much else, beyond the data set presently at hand, that ought to influence how convinced or not we should be. Instead of the unfortunate sentence, I'd prefer to say that "Statistical significance and p-values provide, as a first approximation, an indication of how strongly the data set in itself constitutes evidence against the null hypothesis (provided the various implicit and explicit model assumptions correctly represent reality)".