Wednesday, July 9, 2014

On the value of replications: Jason Mitchell is wrong

The so-called decline effect - the puzzling and widely discussed fact that many effects reported in the scientific literature seem to diminish with time - has raised the suspicion that many of the originally reported findings are spurious. The recent surge of interest in scientific replications (an activity that has traditionally been, and to some extent still is, viewed as second-rate compared to the search for effects not previously reported in the literature) is therefore a welcome development.

Not everyone agrees. Jason Mitchell, a highly cited neuroscientist at Harvard, has recently posted an essay entitled On the emptiness of failed replications on his webpage - a piece of writing that has already received some attention elsewhere. It is a furious attack upon the practice of replication, in which he claims that replications that do not find the effects reported in the original study "have no meaningful scientific value". While he doesn't explicitly state that such replication studies ought to be hidden away or buried, that nevertheless seems to be his conclusion. Mitchell's essay is poorly argued, and his main points are just wrong. Let me offer my thoughts, from the point of view of mathematical statistics. My account will involve the concept of statistical significance, and I will assume that the reader has at least a rudimentary familiarity with the concept.1

Mitchell consistently uses the term "failed experiment" for a study in which no statistically significant deviation from the null hypothesis is detected. I think this biased language contributes to Mitchell getting his conclusions so badly wrong. If the aim of an experiment is to shed light on whether the null hypothesis is true or not, then the unbiased attitude is to have no particular hopes or wishes, either as to which is the case or as to whether statistical significance is obtained. (This attitude is of course not always attained in practice, but it certainly is the ideal we should strive for. Calling one of the outcomes a failure and the other a success is unlikely to promote such impartiality.) Fueled by his terminology, Mitchell considers only one possible explanation of lack of statistical significance in a replication study, namely that
    (1) the study was carried out incompetently.
No doubt this is sometimes the correct explanation. There are, however, at least two other possible explanations that need to be considered, namely
    (2) there is no effect (or the effect is so small as to be virtually undetectable with the present technology, sample sizes, etc), and

    (3) there is an effect, but the data just happened to come out in such a way that statistical significance was not obtained.

It is important to realize that (3) can very well happen even if the experiment is carried out with impeccable competence. The opposite can also happen - perhaps in the original study - namely that there is no effect, but statistical significance is nevertheless obtained. The latter is in fact expected to happen in about one study in 20 when the null hypothesis of no effect is true, provided the traditional threshold of a p-value below 0.05 is employed in defining statistical significance. We live in a noisy and chaotic world, where noise in measurements and random variation in the sample of patients (or whatever) produce random outcomes, and this randomness prevents absolute certainty in our conclusions. It is perfectly plausible to have one research group report a statistically significant effect, and another group report lack of statistical significance in a replication study of the same phenomenon, without either of the groups having succumbed to scientific incompetence (a simulation sketch below makes this concrete). So there is no reason to say, as Mitchell does, that...
    someone who publishes a replication [without obtaining statistical significance] is, in effect, saying something like, "You found an effect. I did not. One of us is the inferior scientist."
Mitchell repeatedly falls back on this kind of emotional language. He is clearly distressed that there are other scientists out there replicating his findings, rather than simply accepting them uncritically. How dare they question his results? Well, here is a piece of news for Mitchell: All progress in science consists in showing that Nature is actually a bit different (or, in some cases, very different) from what was previously thought. If, as Mitchell seems to be suggesting, scientists should cease questioning each other's results, then scientific progress will grind to a halt.
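
To make concrete the point that (3) requires no incompetence, here is a minimal simulation sketch of my own (the effect size and sample size are assumptions chosen purely for illustration): two equally competent groups run the same two-arm experiment on a modest but real effect, and with the statistical power this design yields, it is entirely ordinary for one group to obtain p < 0.05 and the other not.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=1)

    true_effect = 0.3   # a real but modest effect, in standard-deviation units
    n = 50              # participants per arm

    def run_experiment():
        """One competently run two-arm study of the same true effect."""
        control = rng.normal(0.0, 1.0, n)
        treatment = rng.normal(true_effect, 1.0, n)
        return stats.ttest_ind(treatment, control).pvalue

    # Two independent, equally competent studies of the same phenomenon:
    print([run_experiment() for _ in range(2)])

    # How often does a study of this size reach p < 0.05 at all?
    power = np.mean([run_experiment() < 0.05 for _ in range(10_000)])
    print(f"empirical power: {power:.2f}")   # roughly 0.3

With only about 30 percent power, the most likely outcome is that a perfectly executed replication of a perfectly real effect fails to reach statistical significance.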

Of course (1) can happen, and of course this is a problem. But Mitchell goes way overboard in his conclusion: "Unless direct replications are conducted by flawless experimenters, nothing interesting can be learned from them." There is never any guarantee that the experimenters are flawless, but this does not warrant dismissing their studies out of hand. If original studies and replication studies point in contradictory directions, then we need to work more on the issue, perhaps looking more closely at how the studies were done, or carrying out further replications, in order to get a better grasp of where the truth lies.

Another problem with Mitchell's "nothing interesting can be learned from them" stance is that it cuts both ways. If the fact that not all experimenters are necessarily flawless had the force that his argument suggests, then all scientific studies (original as well as replications, regardless of whether they exhibit statistical significance or not) could be dismissed by the same argument, and we might as well all give up doing science.

Mitchell has in fact anticipated this objection, and counters it with an utterly confused exercise in Popperian falsificationism. Popper's framework involves an asymmetry between a scientific theory (H) and its negation (Hc). In the present setting, (H) corresponds to the null hypothesis of no effect, and (Hc) to the existence of a nonzero effect. Mitchell draws an analogy with the classical example where
    (H) = {all swans are white}
and
    (Hc) = {there exists at least one non-white swan}.
The crucial asymmetry in Popperian falsificationism is that it takes just one observation - a single black swan - to falsify (H), whereas no amount of observations of swans (white or otherwise) will falsify (Hc). But Mitchell's parallel between the statistically significant effect found in the original study and the discovery of a black swan founders on a crucial disanalogy: the discovery of a black swan is a definite falsification of (H), whereas the statistical significance may well be spurious. So when he goes on to ridicule scientists engaging in replication by comparing them to ornithologists who, faced with a black swan, try to resurrect (H) by looking for further examples of white swans, the only one who comes across as ridiculous is Mitchell himself.
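
The disanalogy is easy to quantify with another minimal sketch of my own: under a true null hypothesis, competently run studies still cross the p < 0.05 threshold about 5 percent of the time, so a statistically significant result is nothing like the definite sighting of a black swan.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=2)

    # 1000 competently run two-arm studies of an effect that does NOT exist:
    # both arms are drawn from the very same distribution.
    false_positives = sum(
        stats.ttest_ind(rng.normal(0.0, 1.0, 50),
                        rng.normal(0.0, 1.0, 50)).pvalue < 0.05
        for _ in range(1000)
    )
    print(false_positives)   # close to 50, i.e. about 5% of the studies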

Publication bias is the phenomenon whereby a scientific study that exhibits statistical significance has a greater chance of being published than one that doesn't. This phenomenon causes severe difficulties for anyone trying to get a full picture of the scientific evidence on a particular issue, because if we only have access to those studies that exhibit a statistically significant effect, we will tend to overestimate the effect, or perhaps even see an effect where there is none.2 Outrageous attacks like Mitchell's upon the publication of studies that do not exhibit statistical significance will only make this problem worse.
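
One more small simulation sketch of my own (the true effect of 0.2 standard deviations and the 30 subjects per arm are assumptions chosen purely for illustration) shows how the overestimation arises: conditioning on significance selects precisely those studies in which random noise happened to exaggerate the effect.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=3)
    true_effect, n = 0.2, 30

    all_estimates, published = [], []
    for _ in range(10_000):
        control = rng.normal(0.0, 1.0, n)
        treatment = rng.normal(true_effect, 1.0, n)
        estimate = treatment.mean() - control.mean()
        all_estimates.append(estimate)
        if stats.ttest_ind(treatment, control).pvalue < 0.05:
            published.append(estimate)   # only significant studies get published

    print(f"true effect:            {true_effect}")
    print(f"mean over all studies:  {np.mean(all_estimates):.2f}")   # close to 0.2
    print(f"mean over published:    {np.mean(published):.2f}")       # markedly inflated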

Footnotes

1) A non-technical one-paragraph summary of statistical significance might read as follows.

Statistical significance serves to indicate that the obtained results depart, to a statistically meaningful extent, from what one would expect under the null hypothesis of no effect. It is usually taken to support a suspicion that the null hypothesis is false, i.e., that there is a non-zero effect. Given the data, the p-value is defined as the probability, computed as if the null hypothesis were true, of obtaining data at least as extreme as those we actually got. If the p-value falls below a prespecified threshold, often taken to be 0.05, then statistical significance is declared.
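
As a concrete illustration of this definition, here is a tiny sketch, using a hypothetical coin-flipping example (and assuming Python with scipy at hand):

    from scipy import stats

    # We suspect a coin of bias, flip it 100 times, and observe 60 heads.
    # Under the null hypothesis of a fair coin, how likely are data at
    # least as extreme as these, in either direction?
    result = stats.binomtest(60, n=100, p=0.5, alternative='two-sided')
    print(f"p-value: {result.pvalue:.3f}")   # about 0.057

Since 0.057 does not fall below the 0.05 threshold, statistical significance is (narrowly) not declared, even though the data may look suspicious to the naked eye.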

For those readers who know Swedish, my paper Statistisk signifikans och Armageddon may serve as an introduction to the concept. For others, there's an ocean of introductory textbooks in mathematical statistics that will do just as well.

2) A well-known case where this has resulted in a misleading overall picture is parapsychology. A more serious example is the selective publication of clinical studies of psychiatric drugs.

13 comments:

  1. I mean, just imagine: there are Harvard professors who believe that a p-value below 0.05 does the same to their null hypothesis as a black swan does to the all-swans-are-white hypothesis!

    I can hardly think of a clearer sign of the need to reform the training of future scientists, and in any case it certainly exemplifies my oft-repeated position that "the empirical sciences need statistic[al competence] so desperately".

  2.
    1. Thanks, I've corrected the spelling in the earlier blog post that was the source of my cut-and-paste.

  3. So, do you think psychiatric drugs are ineffective?

    1. I know too little about the field to take a definite stand, but I dare say this much: on one hand, it seems implausible that they would all be entirely ineffective; on the other hand, it seems clear from the writings of, among others, Angell (cited above) and Goldacre that the effects of some of them have been exaggerated to a considerable extent.

  4. I myself am a scientist in training. I must admit that my knowledge of statistics is weak, especially where it is needed most: building and fitting statistical models to predict the behavior of novel materials. Reflecting back on my undergraduate training, I feel ill-equipped in this regard.

    I have made some attempts to correct this, but it is easy to get lost in the vast amount of literature available.

    Can you give any pointers to books etc. that could serve as a first stepping stone in closing my knowledge gap? I would be very grateful for any advice. Perhaps a topic for a future blog post or two?

    1. I appreciate your desire to learn, but without knowing where you stand (in terms of mathematical literacy etc.) or more about what kind of work you hope to be doing, it is very hard to give useful advice. You'd be much better off, I believe, with an IRL chat with a local statistician.

  5. An example of a negative result of "no meaningful scientific value" that may be drawn to the good Harvard prof's attention: the Michelson-Morley experiment on motion through the postulated aether.

  6. The importance of publishing negative results is acknowledged by the launching of this new journal:
    http://www.journals.elsevier.com/new-negatives-in-plant-science/
    These two journals devoted to negative results already exist:
    http://www.jnrbm.com (since 2010)
    http://www.jnr-eeb.org/index.php/jnr/index (since 2004)
    I touched upon the importance of publishing negative data in this commentary:
    http://onlinelibrary.wiley.com/doi/10.1111/j.2042-7166.2012.01165.x/full

    1. Thanks, Dan! Those journals are welcome initiatives, although of course negative results should not in general be confined to such journals.

      Is there any place where readers without a subscription to Wiley Online can access your commentary?

  7. Gosh, what a good blog you have. I will go back and read older posts as well, but first I want to ask something regarding your paper Statistisk signifikans och Armageddon:

    Do you really define the condition for your significance test to concern the end of the world? It feels strange to me how you simply posit P(humanity goes extinct before the total # of humans reaches 1200 billion) = 0.95. It is almost certainly my own understanding that fails to follow the logic, but could you please explain why 7, 8, 9 lead to 10? Explain it to me as if I were about 4 years old.

    1. This blog post, that is (I say this so that other readers can find their way there). If you read the paper to the end, you will find that I distance myself from the doomsday argument.

      In any case, thanks for the encouraging words about my blog!

  8. Hmm... OK, so it follows from your having defined N as all the humans who will ever be born? As in, once N is reached, the world must end...
