The so-called decline effect - the puzzling and widely discussed fact that many effects reported in the scientific literature seem to diminish with time - has raised the suspicion that many of the originally reported findings are spurious. The recent surge of interest in scientific replication (an activity that has traditionally been, and to some extent still is, viewed as second-rate compared to the search for effects not previously reported in the literature) is therefore a welcome development.
Not everyone agrees.
Jason Mitchell, a highly cited neuroscientist at Harvard, has recently posted an essay entitled "On the emptiness of failed replications" on his webpage - a piece of writing that has already received some attention elsewhere. It is a furious attack upon the practice of replication, in which he claims that replications that do not find the effects reported in the original study "have no meaningful scientific value". While he doesn't explicitly state that such replication studies ought to be hidden away or buried, that nevertheless seems to be his conclusion. Mitchell's essay is poorly argued, and his main points are just wrong. Let me offer my thoughts, from the point of view of mathematical statistics. My account will involve the concept of statistical significance, and it will be assumed that the reader has at least a rudimentary familiarity with the concept. [1]
Mitchell consistently uses the term
"failed experiment" for a study in which no statistically significant deviation from the null hypothesis is detected. I think this biased language contributes to Mitchell getting his conclusions so badly wrong. If the aim of an experiment is to shed light on whether the null hypothesis is true or not, then the unbiased attitude would be to have no particular hopes or wishes as to which is the case, and no particular hopes or wishes as to whether statistical significance is obtained or not. (This attitude is of course not always attained in practice, but it certainly is the ideal we should strive for. Calling one of the outcomes a failure and the other a success is unlikely to promote such an impartial attitude.) Fueled by his terminology, Mitchell considers only one possible explanation of lack of statistical significance in a replication study, namely that
(1) the study was carried out incompetently.
No doubt this is sometimes the correct explanation. There are, however, at least two other possible explanations that need to be considered, namely
(2) there is no effect (or the effect is so small as to be virtually undetectable with the present technology, sample sizes, etc), and
(3) there is an effect, but the data just happened to come out in such a way that statistical significance was not obtained.
It is important to realize that (3) can well happen even if the experiment is carried out with impeccable competence. The opposite can also happen - perhaps in the original study - namely that there is no effect, but statistical significance is nevertheless obtained. The latter is in fact expected to happen in about one study in 20 when the null hypothesis of no effect is true, provided the traditional threshold of a p-value below 0.05 is used to define statistical significance. We live in a noisy and chaotic world, where measurement noise and random variation in the sample of patients (or whatever) produce random outcomes, a randomness that prevents absolute certainty in our conclusions. It is perfectly plausible for one research group to report a statistically significant effect, and for another group to report lack of statistical significance in a replication study of the same phenomenon, without either group having succumbed to scientific incompetence.
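To see how this plays out numerically, here is a minimal simulation sketch (my own illustration, in Python; the sample size and effect sizes are made up purely for the purpose of the example). Each simulated study compares a treated group and a control group of 20 subjects with a two-sample t-test at the conventional 0.05 level.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_studies, n_subjects = 10_000, 20

    def fraction_significant(true_effect):
        """Fraction of simulated studies that reach p < 0.05."""
        hits = 0
        for _ in range(n_studies):
            control = rng.normal(0.0, 1.0, n_subjects)
            treated = rng.normal(true_effect, 1.0, n_subjects)
            _, p = stats.ttest_ind(treated, control)
            if p < 0.05:
                hits += 1
        return hits / n_studies

    print("no effect:    ", fraction_significant(0.0))  # about 0.05: spurious significance
    print("modest effect:", fraction_significant(0.5))  # well below 1: many non-significant replications

With no effect at all, roughly one perfectly executed study in 20 nevertheless comes out significant; with a true but modest effect, a large share of equally well-executed studies fail to reach significance. So there is no reason to say, as Mitchell does, that...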
someone who publishes a replication [without obtaining statistical significance] is, in effect, saying something like, "You found an effect. I did not. One of us is the inferior scientist."
Mitchell repeatedly falls back on this kind of emotional language. He is clearly distressed that there are other scientists out there replicating his findings, rather than simply accepting them uncritically. How
dare they question his results? Well, here is a piece of news for Mitchell: All progress in science consists in showing that Nature is actually a bit different (or, in some cases,
very different) from what was previously thought. If, as Mitchell seems to be suggesting, scientists should cease questioning each other's results, then scientific progress will grind to a halt.
Of course (1) can happen, and of course this is a problem. But Mitchell goes way overboard in his conclusion: "Unless direct replications are conducted by flawless experimenters, nothing interesting can be learned from them." There is never any guarantee that the experimenters are flawless, but this does not warrant dismissing replication studies out of hand. If original studies and replication studies point in contradictory directions, then we need to work more on the issue, perhaps by looking more closely at how the studies were done, or by carrying out further replications, in order to get a better grasp of where the truth lies.
Another problem with Mitchell's "nothing interesting can be learned from them" stance is that it cuts both ways. If the fact that not all experimenters are necessarily flawless had the force that his argument suggests, then all scientific studies (original as well as replications, regardless of whether they exhibit statistical significance or not) could be dismissed by the same argument, and we might as well all give up doing science.
Mitchell has in fact anticipated this objection, and counters it with an utterly confused exercise in
Popperian falsificationism. Popper's framework involves an asymmetry between a scientific theory (H) and its negation (Hc). In the present setting, (H) corresponds to the null hypothesis of no effect, and (Hc) to the existence of a nonzero effect. Mitchell draws an analogy with the classical example where
(H) = {all swans are white}
and
(Hc) = {there exists at least one non-white swan}.
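In logical notation (my own rendering; Mitchell and Popper state the example in words), these read

\[
H:\ \forall x\,\bigl(\mathrm{swan}(x) \rightarrow \mathrm{white}(x)\bigr),
\qquad
H^{c}:\ \exists x\,\bigl(\mathrm{swan}(x) \wedge \neg\,\mathrm{white}(x)\bigr),
\]

that is, the asymmetry at stake is the familiar one between a universal statement and an existential statement.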
The crucial asymmetry in Popperian falsificationism is that it just takes one observation - a single black swan - to falsify (H), whereas no amount of observations of swans (white or otherwise) will falsify (Hc). But Mitchell's parallel between the statistically significant effect found in the original study and the discovery of a black swan suffers from a crucial disanalogy: the discovery of a black swan is a definite falsification of (H), whereas the statistical significance may well be spurious. So when he goes on to ridicule scientists engaging in replication by comparing them to ornithologists who, faced with a black swan, try to resurrect (H) by looking for further examples of white swans, the only one who comes across as ridiculous is Mitchell himself.
Publication bias is the phenomenon whereby a scientific study that exhibits statistical significance has a greater chance of being published than one that doesn't. This phenomenon causes severe difficulties for anyone trying to get a full picture of the scientific evidence on a particular issue, because if we only have access to those studies that exhibit a statistically significant effect, we will tend to overestimate the effect, or perhaps even see an effect where there is none. [2] Outrageous attacks like Mitchell's upon the publication of studies that do not exhibit statistical significance will only make this problem worse.
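To illustrate the distortion, here is a small simulation sketch (my own, in Python, with made-up numbers rather than data from any actual literature): a large number of equally competent studies estimate the same modest true effect, but only those reaching p < 0.05 count as published.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    true_effect, n_subjects, n_studies = 0.2, 20, 10_000

    all_estimates, published = [], []
    for _ in range(n_studies):
        control = rng.normal(0.0, 1.0, n_subjects)
        treated = rng.normal(true_effect, 1.0, n_subjects)
        estimate = treated.mean() - control.mean()  # this study's estimated effect
        _, p = stats.ttest_ind(treated, control)
        all_estimates.append(estimate)
        if p < 0.05:                                # only "significant" studies get published
            published.append(estimate)

    print("true effect:              ", true_effect)
    print("mean over all studies:    ", round(np.mean(all_estimates), 2))  # close to 0.2
    print("mean over published only: ", round(np.mean(published), 2))      # much larger than 0.2

Averaging only the "published" studies substantially overestimates the effect, which is precisely the problem described above.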
Footnotes
1) A non-technical one-paragraph summary of statistical significance might read as follows.
Statistical significance serves to indicate that the obtained results depart to a statistically meaningful extent from what one would expect under the null hypothesis of no effect. It is usually taken as supporting a suspicion that the null hypothesis is false, i.e., that there is a non-zero effect. Given the data, the p-value is defined as the probability, calculated under the assumption that the null hypothesis is true, of obtaining data at least as extreme as those actually observed. If the p-value falls below a prespecified threshold, often taken to be 0.05, then statistical significance is declared.
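For readers who prefer to see the definition in action, here is a small worked example (a hypothetical coin-tossing scenario of my own, in Python): a coin is tossed 100 times and comes up heads 60 times; under the null hypothesis of a fair coin, the two-sided p-value is the probability of an outcome at least as extreme, i.e., at most 40 or at least 60 heads.

    from math import comb

    n, heads = 100, 60

    def prob(k):
        """P(exactly k heads in n tosses of a fair coin)."""
        return comb(n, k) * 0.5 ** n

    # Sum the probabilities of all outcomes at least as extreme as the observed one.
    p_value = sum(prob(k) for k in range(n + 1)
                  if abs(k - n / 2) >= abs(heads - n / 2))

    print(p_value)  # about 0.057: above 0.05, so no statistical significance is declared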
For those readers who know Swedish, my paper
Statistisk signifikans och Armageddon may serve as an introduction to the concept. For others, there's an ocean of introductory textbooks in mathematical statistics that will do just as well.