I have been feeling guilty ever since I recently published a p-value. It led me to sit down and think hard about why I could not resist doing so and what I really think it told me, if anything. I suppose that a collateral question is to ask why I didn’t keep it to myself. To be honest, I quite often calculate p-values though I seldom let on.
It occurred to me that there was something in common between p-values and the anecdotes that I have blogged about here and here. Hence more jellybeans.
What is a p-value?
My starting data was the conversion rates of 10 elite soccer penalty takers. Each of their conversion rates was different. Leighton Baines had the best figures having converted 11 out of 11. Peter Beardsley and Emmanuel Adebayor had the superficially weakest, having converted 18 out of 20 and 9 out of 10 respectively. To an analyst that raises a natural question. Was the variation between the performance signal or was it noise?
In his rather discursive book The Signal and the Noise: The Art and Science of Prediction, Nate Silver observes:
The signal is the truth. The noise is what distracts us from the truth.
In the penalties data the signal, the truth, that we are looking for is Who is the best penalty taker and how good are they? The noise is the sampling variation inherent in a short sequence of penalty kicks. Take a coin and toss it 10 times. Count the number of heads. Make another 10 tosses. And a third 10. It is unlikely that you got the same number of heads but that was not because anything changed in the coin. The variation between the three counts is all down to the short sequence of tosses, the sampling variation.
In Understanding Variation: The Key to Managing Chaos, Don Wheeler observes:
All data has noise. Some data has signal.
We first want to know whether the penalty statistics display nothing more than sampling variation or whether there is also a signal that some penalty takers are better than others, some extra variation arising from that cause.
The p-value told me the probability that we could have observed the data we did had the variation been solely down to noise, 0.8%. Unlikely.
p-Values do not answer the exam question
The first problem is that p-values do not give me anything near what I really want. I want to know, given the observed data, what it the probability that penalty conversion rates are just noise. The p-value tells me the probability that, were penalty conversion rates just noise, I would have observed the data I did.
The distinction is between the probability of data given a theory and the probability of a theory give then data. It is usually the latter that is interesting. Now this may seem like a fine distinction without a difference. However, consider the probability that somebody with measles has spots. It is, I think, pretty close to one. Now consider the probability that somebody with spots has measles. Many things other than measles cause spots so that probability is going to be very much less than one. I would need a lot of information to come to an exact assessment.
In general, Bayes’ theorem governs the relationship between the two probabilities. However, practical use requires more information than I have or am likely to get. The p-values consider all the possible data that you might have got if the theory were true. It seems more rational to consider all the different theories that the actual data might support or imply. However, that is not so simple.
A dumb question
In any event, I know the answer to the question of whether some penalty takers are better than others. Of course they are. In that sense p-values fail to answer a question to which I already know the answer. Further, collecting more and more data increases the power of the procedure (the probability that it dodges a false negative). Thus, by doing no more than collecting enough data I can make the p-value as small as I like. A small p-value may have more to do with the number of observations than it has with anything interesting in penalty kicks.
That said, what I was trying to do in the blog was to set a benchmark for elite penalty taking. As such this was an enumerative study. Of course, had I been trying to select a penalty taker for my team, that would have been an analytic study and I would have to have worried additionally about stability.
There is a further question about whether the data variation arose from happenstance such as one or more players having had the advantage of weather or ineffective goalkeepers. This is an observational study not a designed experiment.
And even if I observe a signal, the p-value does not tell me how big it is. And it doesn’t tell me who is the best or worst penalty taker. As R A Fisher observed, just because we know there had been a murder we do not necessarily know who was the murderer.
E pur si muove
It seems then that individuals will have different ways of interpreting p-values. They do reveal something about the data but it is not easy to say what it is. It is suggestive of a signal but no more. There will be very many cases where there are better alternative analytics about which there is less ambiguity, for example Bayes factors.
However, in the limited case of what I might call alternative-free model criticism I think that the p-value does provide me with some insight. Just to ask the question of whether the data is consistent with the simplest of models. However, it is a similar insight to that of an anecdote: of vague weight with little hope of forming a consensus round its interpretation. I will continue to calculate them but I think it better if I keep quiet about it.
R A Fisher often comes in for censure as having done more than anyone to advance the cult of p-values. I think that is unfair. Fisher only saw p-values as part of the evidence that a researcher would have to hand in reaching a decision. He saw the intelligent use of p-values and significance tests as very different from the, as he saw it, mechanistic practices of hypothesis testing and acceptance procedures on the Neyman-Pearson model.
In an acceptance procedure, on the other hand, acceptance is irreversible, whether the evidence for it was strong or weak. It is the result of applying mechanically rules laid down in advance; no thought is given to the particular case, and the tester’s state of mind, or his capacity for learning is inoperative. By contrast, the conclusions drawn by a scientific worker from a test of significance are provisional, and involve an intelligent attempt to understand the experimental situation.
“Statistical methods and scientific induction”
Journal of the Royal Statistical Society Series B 17: 69–78. 1955, at 74-75
Fisher was well known for his robust, sometimes spiteful, views on other people’s work. However, it was Maurice Kendall in his obituary of Fisher who observed that:
… a man’s attitude toward inference, like his attitude towards religion, is determined by his emotional make-up, not by reason or mathematics.