The purpose of the present blog post is to report on my mixed feelings about two publications on probability and statistical inference that have come to my attention. In both cases, I concur with the ultimate message conveyed but object to a probability calculation that is made along the path towards that ultimate message.
Also, in both cases I am a bit late to the party. My somewhat lame excuse for this is that I cannot react to a publication before learning about its existence. In any case, here are my reactions:
The first case is the 2019 children's picture book Bayesian Probability for Babies by Chris Ferrie and Sarah Kaiser, which is available read out loud on YouTube. The reading of the entire book is over in exactly 2:00 minutes:
As you can see, the idea of the book is to explain Bayes' Theorem in a simple and fully worked out example. In the example, the situation at hand is that the hero of the story has taken a bite of a cookie which may or may not contain candies, and the task is to work out the posterior probability that the cookie contains candies given the data that the bite has no candies. This is just lovely...
...were it not for the following defect that shows up half-way (1:00) through the video. Before introducing the prior (the proportion of cookies in the jar having candies), expressions for P(D|C) and P(D|N) are presented, where D is the data from the bite, C is the event that the cookie has candies, and N is the event that it does not: we learn that P(D|C)=1/3 and P(D|N)=1, the authors note that the latter expression is larger, and conclude that "the no-candy bite probably came from a no-candy cookie!".
This conclusion is plain wrong, because it commits a version of the fallacy of the transposed conditional: a comparison between P(D|C) and P(D|N) is confused with one between P(C|D) and P(N|D). In fact, no probability that the cookie at hand has candies can be obtained before the prior has been laid out. The asked-for probability P(C|D) can, depending on the prior, land anywhere in the interval [0,1]. A more generous reader than me might object that the authors immediately afterwards do introduce a prior, which is sufficiently biased towards cookies containing candies to reverse the initial judgement and land on the (perhaps counterintuitive) result that the cookie is more likely than not to contain candies: P(C|D)=3/4. This is to some extent a mitigating circumstance, but I am not impressed, because the preliminary claim that "the no-candy bite probably came from a no-candy cookie" sends the implicit message that as long as no prior has been specified, it is OK to proceed as if the prior is uniform, i.e., puts equal probability on the two possible states C and N. But in the absence of specific arguments (perhaps something based on symmetry), it simply isn't. As I emphasized back in 2007, a uniform distribution is a model assumption, and there is no end to how crazy the conclusions can get if one doesn't realize this.
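To make the dependence on the prior concrete, here is a minimal Python sketch of the two-state Bayes calculation, using only the likelihoods quoted above, P(D|C)=1/3 and P(D|N)=1. The function name posterior_candy and the list of priors are my own illustrative choices; the prior 9/10 is included simply because it is the value that turns these two likelihoods into the posterior 3/4.

```python
from fractions import Fraction


def posterior_candy(prior_c, lik_c=Fraction(1, 3), lik_n=Fraction(1)):
    """Return P(C|D) by Bayes' theorem for the two states C and N.

    prior_c is P(C); lik_c is P(D|C) and lik_n is P(D|N), defaulting to
    the values quoted in the text. The name is illustrative only.
    """
    prior_c = Fraction(prior_c)
    numerator = lik_c * prior_c                      # P(D|C) P(C)
    evidence = numerator + lik_n * (1 - prior_c)     # P(D)
    return numerator / evidence


# The same likelihoods give very different posteriors under different priors.
for prior in [Fraction(0), Fraction(1, 10), Fraction(1, 2), Fraction(9, 10), Fraction(1)]:
    print(f"P(C) = {prior}  ->  P(C|D) = {posterior_candy(prior)}")
```

With the uniform prior 1/2 the posterior is 1/4, with the prior 9/10 it is 3/4, and priors of 0 and 1 give posteriors of 0 and 1, so the likelihoods alone settle nothing.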
The second case is the 2014 British Medical Journal article Trap of trends to statistical significance: likelihood of near significant P value becoming more significant with extra data by John Wood, Nick Freemantle, Michael King and Irwin Nazareth. The purpose of the paper is to warn against overly strong interpretations of moderately small p-values in statistical hypothesis testing. This is a mission I wholeheartedly agree with. In particular, the authors complain about other authors' habit, when, say, a significance level of 0.05 is employed but a disappointingly lame p-value of 0.08 is obtained, of describing the result as "trending towards significance" or some similar expression. This, Wood et al remark, gives the misleading suggestion that if only the sample size used had been bigger, significance would have been obtained - misleading because there is far from any guarantee that that would happen as a consequence of a larger sample size. This is all fine and dandy...
...were it not for the probability calculations that Wood et al provide to support their claim. The setting discussed is the standard case of testing for a difference between two groups, and the article offers a bunch of tables where we can read, e.g., that if a p-value of 0.08 is obtained, and 50% more data is added to the sample, then there's still a 39.7% chance that the outcome remains non-significant (p>0.05). The problem here is that such probabilities cannot be calculated, because they depend on the true but unknown effect size. If the true effect size is large, then the p-value is likely to improve with increased sample size (and in fact it would with probability 1 approach 0 as the sample size goes to infinity), whereas if the effect size is 0, then we should expect the p-value to regress towards the mean (and it would with probability 1 keep fluctuating forever over the entire interval [0,1] as the sample size goes to infinity).
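The dependence on the effect size can be illustrated with a small Python sketch. It rests on simplifying assumptions of my own, not taken from the article: a two-sided z-test with known variance, a standardized effect size delta scaled so that the z-statistic based on n observations is normal with mean delta*sqrt(n) and variance 1, an initial sample of 100 and 50 added observations. The function name prob_still_nonsignificant is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def prob_still_nonsignificant(delta, p1=0.08, n1=100, n2=50, alpha=0.05):
    """P(final p-value > alpha | true effect delta, first-stage p-value p1).

    Assumed model (a simplification): z-statistic on n observations is
    N(delta * sqrt(n), 1), and the first-stage z is taken to be positive.
    """
    z1 = STD.inv_cdf(1 - p1 / 2)         # observed first-stage z
    z_crit = STD.inv_cdf(1 - alpha / 2)  # two-sided critical value
    n = n1 + n2
    # Pooled statistic: Z = (sqrt(n1)*z1 + sqrt(n2)*z2) / sqrt(n),
    # where the new data give z2 ~ N(delta * sqrt(n2), 1).
    mean = (sqrt(n1) * z1 + n2 * delta) / sqrt(n)
    sd = sqrt(n2 / n)
    final = NormalDist(mean, sd)
    return final.cdf(z_crit) - final.cdf(-z_crit)


# The answer swings widely with the (unknown) true effect size.
for delta in [0.0, 0.1, 0.2, 0.3, 0.5]:
    print(f"delta = {delta:.1f}: {prob_still_nonsignificant(delta):.3f}")
```

Under these assumptions the probability of remaining non-significant is about 0.82 when the true effect is zero and below 0.01 when delta is 0.5, so no single number can be attached to "a p-value of 0.08 plus 50% more data" without saying something about the effect size.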
At this point, the alert reader may ask: can't we just assume the effect size to be random, and calculate the desired probabilities as the corresponding weighted average over the possible effect sizes? In fact, that is what Wood et al do, but they barely mention it in the article, and hide away the specification of that prior distribution in an appendix that only a minority of their readers can be expected to ever lay their eyes on.
What is an appropriate prior on the effect size? That is very much context-dependent. If the statistical study is in, say, the field of parapsychology which has tried without success to demonstrate nonzero effects for a century or so, then a reasonable prior would put a point mass of 0.99 or more at effect size zero, and the remaining probability spread out near zero. If on the other hand (and to take another extreme) the study is about test subjects shown pictures of red or blue cars and asked to determine their color, and the purpose of the study is to find out whether the car being red increases the probability of the subject answering "red" compared to if the car is blue, then the reasonable thing to do is obviously to put most of the prior on large effect sizes.
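As a rough illustration of how much a skeptical prior matters, here is a Python sketch with a point mass w0 at effect size zero and the remaining mass on a normal distribution centered at zero with standard deviation tau. Everything here is an assumption of mine for illustration: the z-test model with known variance (z-statistic on n observations normal with mean delta*sqrt(n) and variance 1), the normal shape of the non-zero component, the values of w0 and tau, and the function name prob_nonsig_spike_slab.

```python
from math import sqrt
from statistics import NormalDist


def prob_nonsig_spike_slab(w0, tau, p1=0.08, n1=100, n2=50, alpha=0.05):
    """P(final p-value > alpha | first-stage p-value p1) under a prior with
    mass w0 at delta = 0 and mass 1 - w0 on delta ~ N(0, tau^2)."""
    std = NormalDist()
    z1 = std.inv_cdf(1 - p1 / 2)         # observed first-stage z (taken positive)
    z_crit = std.inv_cdf(1 - alpha / 2)  # two-sided critical value
    n = n1 + n2

    # Posterior probability of delta = 0, from the marginal densities of z1:
    # z1 ~ N(0, 1) under the point mass, z1 ~ N(0, 1 + n1*tau^2) otherwise.
    m0 = std.pdf(z1)
    m1 = NormalDist(0.0, sqrt(1 + n1 * tau**2)).pdf(z1)
    post0 = w0 * m0 / (w0 * m0 + (1 - w0) * m1)

    def nonsig(mean, sd):
        final = NormalDist(mean, sd)
        return final.cdf(z_crit) - final.cdf(-z_crit)

    # Pooled statistic: Z = base + n2*delta/sqrt(n) + sqrt(n2/n)*noise.
    base = sqrt(n1) * z1 / sqrt(n)
    p_spike = nonsig(base, sqrt(n2 / n))  # delta = 0 exactly

    # Non-zero component: delta | z1 ~ N(m, v) by normal-normal updating.
    v = 1 / (n1 + 1 / tau**2)
    m = v * sqrt(n1) * z1
    p_slab = nonsig(base + n2 * m / sqrt(n), sqrt(n2 / n + n2**2 * v / n))

    return post0 * p_spike + (1 - post0) * p_slab


print(f"skeptical prior (w0=0.99, tau=0.05): {prob_nonsig_spike_slab(0.99, 0.05):.3f}")
print(f"nearly flat prior (w0=0, tau=1e6):   {prob_nonsig_spike_slab(0.0, 1e6):.3f}")
```

With w0 = 0.99 and tau = 0.05 the sketch gives a probability of roughly 0.82 of staying non-significant, about twice the 39.7% quoted above, which the same sketch approaches when the point mass is removed and tau is made very large.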
None of this context-dependence is discussed by the authors. This omission serves to create the erroneous impression that if a study has sample size 100 and produces a p-value of 0.08, then the probability that the outcome remains non-significant if the sample size is increased to 150 can unproblematically be calculated to be 0.397.
So what is the prior used by Wood et al in their calculations? When we turn to the appendix, we actually find out. They use a kind of improper prior that treats all possible effect sizes equally, at the cost of the total "probability" mass being infinite rather than 1 (as it should be for a proper probability distribution); mathematically this works neatly because a proper probability distribution is obtained as soon as one conditions on some data, but it creates problems in coherently interpreting the resulting numbers such as the 0.397 above as actual probabilities. This is not among my two main problems with the prior, however. One I have already mentioned: the authors' utter failure to show that this particular choice of prior leads to relevant probabilities in practice, and their sweeping under the carpet of the very fact that there is a choice to be made. The other main problem is that with this prior, the probability of zero effect is exactly zero. In other words, their choice of prior amounts to dogmatically assuming that the effect size is nonzero (and thereby that the p-value will tend to 0 as the sample size increases towards infinity). For a study of what happens in other studies meant to shed light on whether or not a nonzero effect exists, this particular model assumption strikes me as highly unsuitable.
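For what it is worth, a flat prior makes the calculation very short in a simplified setting. The Python sketch below is my own reconstruction under assumptions that are not taken from the appendix: a two-sided z-test with known variance, where the z-statistic based on n observations is normal with mean delta*sqrt(n) and variance 1. Conditioning the flat prior on the first-stage z-value z1 from n1 observations gives delta a normal posterior with mean z1/sqrt(n1) and variance 1/n1, and the pooled z-statistic after adding a fraction extra of new data is then normal with mean z1*sqrt(1+extra) and variance extra. The function name prob_nonsig_flat_prior is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def prob_nonsig_flat_prior(p1=0.08, extra=0.5, alpha=0.05):
    """P(final p-value > alpha) after adding a fraction `extra` of new data,
    given first-stage two-sided p-value p1, under a flat prior on the effect.

    Assumed model (a simplification): known-variance z-test. The result
    depends on the sample sizes only through the ratio `extra`.
    """
    std = NormalDist()
    z1 = std.inv_cdf(1 - p1 / 2)         # observed first-stage z (taken positive)
    z_crit = std.inv_cdf(1 - alpha / 2)  # two-sided critical value
    # Predictive distribution of the pooled z-statistic under the flat prior.
    final = NormalDist(z1 * sqrt(1 + extra), sqrt(extra))
    return final.cdf(z_crit) - final.cdf(-z_crit)


print(f"p = 0.08, 50% more data: {prob_nonsig_flat_prior():.3f}")
```

Under these assumptions the result is about 0.397, in agreement with the figure discussed above, although I cannot claim that this is exactly the computation behind the published tables. Note that the predictive distribution is continuous in the effect size, so it never entertains the possibility that the effect is exactly zero.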