Does noise make you fat?

“A new study has unearthed some eye-opening facts about the effects of noise pollution on obesity,” proclaimed The Huffington Post recently in another piece or poorly uncritical data journalism.

Journalistic standards notwithstanding, in Exposure to traffic noise and markers of obesity (BMJ Occupational and environmental medicine, May 2015) Andrei Pyko and eight (sic) collaborators found “evidence of a link between traffic noise and metabolic outcomes, especially central obesity.” The particular conclusion picked up by the press was that each 5 dB increase in traffic noise could add 2 mm to the waistline.

Not trusting the press I decided I wanted to have a look at this research myself. I was fortunate that the paper was available for free download for a brief period after the press release. It took some finding though. The BMJ insists that you will now have to pay. I do find that objectionable as I see that the research was funded in part by the European Union. Us European citizens have all paid once. Why should we have to pay again?

On reading …

I was though shocked reading Pyko’s paper as the Huffington Post journalists obviously hadn’t. They state “Lack of sleep causes reduced energy levels, which can then lead to a more sedentary lifestyle and make residents less willing to exercise.” Pyko’s paper says no such thing. The researchers had, in particular, conditioned on level of exercise so that effect had been taken out. It cannot stand as an explanation of the results. Pyko’s narrative concerned noise-induced stress and cortisol production, not lack of exercise.

In any event, the paper is densely written and not at all easy to analyse and understand. I have tried to pick out the points that I found most bothering but first a statistics lesson.

Prediction 101

Frame(Almost) the first thing to learn in statistics is the relationship between population, frame and sample. We are concerned about the population. The frame is the enumerable and accessible set of things that approximate the population. The sample is a subset of the frame, selected in an economic, systematic and well characterised manner.

In Some Theory of Sampling (1950), W Edwards Deming drew a distinction between two broad types of statistical studies, enumerative and analytic.

  • Enumerative: Action will be taken on the frame.
  • Analytic: Action will be on the cause-system that produced the frame.

It is explicit in Pyko’s work that the sampling frame was metropolitan Stockholm, Sweden between the years 2002 and 2006. It was a cross-sectional study. I take it from the institutional funding that the study intended to advise policy makers as to future health interventions. Concern was beyond the population of Stockholm, or even Sweden. This was an analytic study. It aspired to draw generalised lessons about the causal mechanisms whereby traffic noise aggravated obesity so as to support future society-wide health improvement.

How representative was the frame of global urban areas stretching over future decades? I have not the knowledge to make a judgment. The issue is mentioned in the paper but, I think, with insufficient weight.

There are further issues as to the sampling from the frame. Data was taken from participants in a pre-existing study into diabetes that had itself specific criteria for recruitment. These are set out in the paper but intensify the questions of whether the sample is representative of the population of interest.

The study

The researchers chose three measures of obesity, waist circumference, waist-hip ratio and BMI. Each has been put forwards, from time to time, as a measure of health risk.

There were 5,075 individual participants in the study, a sample of 5,075 observations. The researchers performed both a linear regression simpliciter and a logistic regression. For want of time and space I am only going to comment on the former. It is the origin of the headline 2 mm per 5 dB claim.

The researchers have quoted p-values but they haven’t committed the worst of sins as they have shown the size of the effects with confidence intervals. It’s not surprising that they found so many soi-disant significant effects given the sample size.

However, there was little assistance in judging how much of the observed variation in obesity was down to traffic noise. I would have liked to see a good old fashioned analysis of variance table. I could then at least have had a go at comparing variation from the measurement process, traffic noise and other effects. I could also have calculated myself an adjusted R2.

Measurement Systems Analysis

Understanding variation from the measurement process is critical to any analysis. I have looked at the World Health Organisation’s definitive 2011 report on the effects of waist circumference on health. Such Measurement Systems Analysis as there is occurs at p7. They report a “technical error” (me neither) of 1.31 cm from intrameasurer error (I’m guessing repeatability) and 1.56 cm from intermeasurer error (I’m guessing reproducibility). They remark that “Even when the same protocol is used, there may be variability within and between measurers when more than one measurement is made.” They recommend further research but I have found none. There is no way of knowing from what is published by Pyko whether the reported effects are real or flow from confounding between traffic noise and intermeasurer variation.

When it comes to waist-hip ratio I presume that there are similar issues in measuring hip circumference. When the two dimensions are divided then the individual measurement uncertainties aggregate. More problems, not addressed.

Noise data

The key predictor of obesity was supposed to be noise. The noise data used were not in situ measurements in the participants’ respective homes. The road traffic noise data were themselves predicted from a mathematical model using “terrain data, ground surface, building height, traffic data, including 24 h yearly average traffic flow, diurnal distribution and speed limits, as well as information on noise barriers”. The model output provided 5 dB contours. The authors then applied some further ad hoc treatments to the data.

The authors recognise that there is likely to be some error in the actual noise levels, not least from the granularity. However, they then seem to assume that this is simply an errors in variables situation. That would do no more than (conservatively) bias any observed effect towards zero. However, it does seem to me that there is potential for much more structured systematic effects to be introduced here and I think this should have been explored further.

Model criticism

The authors state that they carried out a residuals analysis but they give no details and there are no charts, even in the supplementary material. I would like to have had a look myself as the residuals are actually the interesting bit. Residuals analysis is essential in establishing stability.

In fact, in the current study there is so much data that I would have expected the authors to have saved some of the data for cross-validation. That would have provided some powerful material for model criticism and validation.

Given that this is an analytic study these are all very serious failings. With nine researchers on the job I would have expected some effort on these matters and some attention from whoever was the statistical referee.

Results

Separate results are presented for road, rail and air traffic noise. Again, for brevity I am looking at the headline 2 mm / 5 dB quoted for road traffic noise. Now, waist circumference is dependent on gross body size. Men are bigger than women and have larger waists. Similarly, the tall are larger-waisted than the short. Pyko’s regression does not condition on height (as a gross characterisation of body size).

BMI is a factor that attempts to allow for body size. Pyko found no significant influence on BMI from road traffic noise.

Waist-hip ration is another parameter that attempts to allow for body size. It is often now cited as a better predictor of morbidity than BMI. That of course is irrelevant to the question of whether noise makes you fat. As far as I can tell from Pyko’s published results, a 5 dB increase in road traffic noise accounted for a 0.16 increase in waist-hip ratio. Now, let us look at this broadly. Consider a woman with waist circumference 85 cm, hip 100 cm, hence waist-hip ratio, 0.85. All pretty typical for the study. Predictively the study is suggesting that a 5 dB increase in road traffic noise might unremarkably take her waist-hip ratio up over 1.0. That seems barely consistent with the results from waist circumference alone where there would not only be millimetres of growth. It is incredible physically.

I must certainly have misunderstood what the waist-hip result means but I could find no elucidation in Pyko’s paper.

Policy

Research such as this has to be aimed at advising future interventions to control traffic noise in urban environments. Broadly speaking, 5 dB is a level of noise change that is noticeable to human hearing but no more. All the same, achieving such a reduction in an urban environment is something that requires considerable economic resources. Yet, taking the research at its highest, it only delivers 2 mm on the waistline.

I had many criticisms other than those above and I do not, in any event, consider this study adequate for making any prediction about a future intervention. Nothing in it makes me feel the subject deserves further study. Or that I need to avoid noise to stay slim.

Soccer management – signal, noise and contract negotiation

Some poor data journalism here from the BBC on 28 May 2015, concerning turnover in professional soccer managers in England. “Managerial sackings reach highest level for 13 years” says the headline. A classic executive time series. What is the significance of the 13 years? Other than it being the last year with more sackings than the present.

The data was purportedly from the League Managers’ Association (LMA) and their Richard Bevan thought the matter “very concerning”. The BBC provided a chart (fair use claimed).

MgrSackingsto201503

Now, I had a couple of thoughts as soon as I saw this. Firstly, why chart only back to 2005/6? More importantly, this looked to me like a stable system of trouble (for football managers) with the possible exception of this (2014/15) season’s Championship coach turnover. Personally, I detest multiple time series on a common chart unless there is a good reason for doing so. I do not think it the best way of showing variation and/ or association.

Signal and noise

The first task of any analyst looking at data is to seek to separate signal from noise. Nate Silver made this point powerfully in his book The Signal and the Noise: The Art and Science of Prediction. As Don Wheeler put it: all data has noise; some data has signal.

Noise is typically the irregular aggregate of many causes. It is predictable in the same way as a roulette wheel. A signal is a sign of some underlying factor that has had so large an effect that it stands out from the noise. Signals can herald a fundamental unpredictability of future behaviour.

If we find a signal we look for a special cause. If we start assigning special causes to observations that are simply noise then, at best, we spend money and effort to no effect and, at worst, we aggravate the situation.

The Championship data

In any event, I wanted to look at the data for myself. I was most interested in the Championship data as that was where the BBC and LMA had been quick to find a signal. I looked on the LMA’s website and this is the latest data I found. The data only records dismissals up to 31 March of the 2014/15 season. There were 16. The data in the report gives the total number of dismissals for each preceding season back to 2005/6. The report separates out “dismissals” from “resignations” but does not say exactly how the classification was made. It can be ambiguous. A manager may well resign because he feels his club have themselves repudiated his contract, a situation known in England as constructive dismissal.

The BBC’s analysis included dismissals right up to the end of each season including 2014/15. Reading from the chart they had 20. The BBC have added some data for 2014/15 that isn’t in the LMA report and not given the source. I regard that as poor data journalism.

I found one source of further data at website The Sack Race. That told me that since the end of March there had been four terminations.

Manager Club Termination Date
Malky Mackay Wigan Athletic Sacked 6 April
Lee Clark Blackpool Resigned 9 May
Neil Redfearn Leeds United Contract expired 20 May
Steve McClaren Derby County Sacked 25 May

As far as I can tell, “dismissals” include contract non-renewals and terminations by mutual consent. There are then a further three dismissals, not four. However, Clark left Blackpool amid some corporate chaos. That is certainly a termination that is classifiable either way. In any event, I have taken the BBC figure at face value though I am alerted as to some possible data quality issues here.

Signal and noise

Looking at the Championship data, this was the process behaviour chart, plotted as an individuals chart.

MgrSackingsto201503

There is a clear signal for the 2014/15 season with an observation, 20 dismissals,, above the upper natural process limit of 19.18 dismissals. Where there is a signal we should seek a special cause. There is no guarantee that we will find a special cause. Data limitations and bounded rationality are always constraints. In fact, there is no guarantee that there was a special cause. The signal could be a false positive. Such effects cannot be eliminated. However, signals efficiently direct our limited energy for, what Daniel Kahneman calls, System 2 thinking towards the most promising enquiries.

Analysis

The BBC reports one narrative woven round the data.

Bevan said the current tenure of those employed in the second tier was about eight months. And the demand to reach the top flight, where a new record £5.14bn TV deal is set to begin in 2016, had led to clubs hitting the “panic button” too quickly.

It is certainly a plausible view. I compiled a list of the dismissals and non-renewals, not the resignations, with data from Wikipedia and The Sack Race. I only identified 17 which again suggests some data quality issue around classification. I have then charted a scatter plot of date of dismissal against the club’s then league position.

MgrSackings201415

It certainly looks as though risk of relegation is the major driver for dismissal. Aside from that, Watford dismissed Billy McKinlay after only two games when they were third in the league, equal on points with the top two. McKinlay had been an emergency appointment after Oscar Garcia had been compelled to resign through ill health. Watford thought they had quickly found a better manager in Slavisa Jokanovic. Watford ended the season in second place and were promoted to the Premiership.

There were two dismissals after the final game on 2 May by disappointed mid-table teams. Beyond that, the only evidence for impulsive managerial changes in pursuit of promotion is the three mid-season, mid-table dismissals.

Club league position
Manager Club On dismissal At end of season
Nigel Adkins Reading 16 19
Bob Peeters Charlton Athletic 14 12
Stuart Pearce Nottingham Forrest 12 14

A table that speaks for itself. I am not impressed by the argument that there has been the sort of increase in panic sackings that Bevan fears. Both Blackpool and Leeds experienced chaotic executive management which will have resulted in an enhanced force of mortality on their respective coaches. That along with the data quality issues and the technical matter I have described below lead me to feel that there was no great enhanced threat to the typical Championship manager in 2014/15.

Next season I would expect some regression to the mean with a lower number of dismissals. Not much of a prediction really but that’s what the data tells me. If Bevan tries to attribute that to the LMA’s activism them I fear that he will be indulging in Langian statistical analysis. Will he be able to resist?

Techie bit

I have a preference for individuals charts but I did also try plotting the data on an np-chart where I found no signal. It is trite service-course statistics that a Poisson distribution with mean λ has standard deviation √λ so an upper 3-sigma limit for a (homogeneous) Poisson process with mean 11.1 dismissals would be 21.1 dismissals. Kahneman has cogently highlighted how people tend to see patterns in data as signals even where they are typical of mere noise. In this case I am aware that the data is not atypical of a Poisson process so I am unsurprised that I failed to identify a special cause.

A Poisson process with mean 11.1 dismissals is a pretty good model going forwards and that is the basis I would press on any managers in contract negotiations.

Of course, the clubs should remember that when they look for a replacement manager they will then take a random sample from the pool of job seekers. Really!

Anecdotes and p-values

JellyBellyBeans.jpgI have been feeling guilty ever since I recently published a p-value. It led me to sit down and think hard about why I could not resist doing so and what I really think it told me, if anything. I suppose that a collateral question is to ask why I didn’t keep it to myself. To be honest, I quite often calculate p-values though I seldom let on.

It occurred to me that there was something in common between p-values and the anecdotes that I have blogged about here and here. Hence more jellybeans.

What is a p-value?

My starting data was the conversion rates of 10 elite soccer penalty takers. Each of their conversion rates was different. Leighton Baines had the best figures having converted 11 out of 11. Peter Beardsley and Emmanuel Adebayor had the superficially weakest, having converted 18 out of 20 and 9 out of 10 respectively. To an analyst that raises a natural question. Was the variation between the performance signal or was it noise?

In his rather discursive book The Signal and the Noise: The Art and Science of Prediction, Nate Silver observes:

The signal is the truth. The noise is what distracts us from the truth.

In the penalties data the signal, the truth, that we are looking for is Who is the best penalty taker and how good are they? The noise is the sampling variation inherent in a short sequence of penalty kicks. Take a coin and toss it 10 times. Count the number of heads. Make another 10 tosses. And a third 10. It is unlikely that you got the same number of heads but that was not because anything changed in the coin. The variation between the three counts is all down to the short sequence of tosses, the sampling variation.

In Understanding Variation: The Key to Managing ChaosDon Wheeler observes:

All data has noise. Some data has signal.

We first want to know whether the penalty statistics display nothing more than sampling variation or whether there is also a signal that some penalty takers are better than others, some extra variation arising from that cause.

The p-value told me the probability that we could have observed the data we did had the variation been solely down to noise, 0.8%. Unlikely.

p-Values do not answer the exam question

The first problem is that p-values do not give me anything near what I really want. I want to know, given the observed data, what it the probability that penalty conversion rates are just noise. The p-value tells me the probability that, were penalty conversion rates just noise, I would have observed the data I did.

The distinction is between the probability of data given a theory and the probability of a theory give then data. It is usually the latter that is interesting. Now this may seem like a fine distinction without a difference. However, consider the probability that somebody with measles has spots. It is, I think, pretty close to one. Now consider the probability that somebody with spots has measles. Many things other than measles cause spots so that probability is going to be very much less than one. I would need a lot of information to come to an exact assessment.

In general, Bayes’ theorem governs the relationship between the two probabilities. However, practical use requires more information than I have or am likely to get. The p-values consider all the possible data that you might have got if the theory were true. It seems more rational to consider all the different theories that the actual data might support or imply. However, that is not so simple.

A dumb question

In any event, I know the answer to the question of whether some penalty takers are better than others. Of course they are. In that sense p-values fail to answer a question to which I already know the answer. Further, collecting more and more data increases the power of the procedure (the probability that it dodges a false negative). Thus, by doing no more than collecting enough data I can make the p-value as small as I like. A small p-value may have more to do with the number of observations than it has with anything interesting in penalty kicks.

That said, what I was trying to do in the blog was to set a benchmark for elite penalty taking. As such this was an enumerative study. Of course, had I been trying to select a penalty taker for my team, that would have been an analytic study and I would have to have worried additionally about stability.

Problems, problems

There is a further question about whether the data variation arose from happenstance such as one or more players having had the advantage of weather or ineffective goalkeepers. This is an observational study not a designed experiment.

And even if I observe a signal, the p-value does not tell me how big it is. And it doesn’t tell me who is the best or worst penalty taker. As R A Fisher observed, just because we know there had been a murder we do not necessarily know who was the murderer.

E pur si muove

It seems then that individuals will have different ways of interpreting p-values. They do reveal something about the data but it is not easy to say what it is. It is suggestive of a signal but no more. There will be very many cases where there are better alternative analytics about which there is less ambiguity, for example Bayes factors.

However, in the limited case of what I might call alternative-free model criticism I think that the p-value does provide me with some insight. Just to ask the question of whether the data is consistent with the simplest of models. However, it is a similar insight to that of an anecdote: of vague weight with little hope of forming a consensus round its interpretation. I will continue to calculate them but I think it better if I keep quiet about it.

R A Fisher often comes in for censure as having done more than anyone to advance the cult of p-values. I think that is unfair. Fisher only saw p-values as part of the evidence that a researcher would have to hand in reaching a decision. He saw the intelligent use of p-values and significance tests as very different from the, as he saw it, mechanistic practices of hypothesis testing and acceptance procedures on the Neyman-Pearson model.

In an acceptance procedure, on the other hand, acceptance is irreversible, whether the evidence for it was strong or weak. It is the result of applying mechanically rules laid down in advance; no thought is given to the particular case, and the tester’s state of mind, or his capacity for learning is inoperative. By contrast, the conclusions drawn by a scientific worker from a test of significance are provisional, and involve an intelligent attempt to understand the experimental situation.

“Statistical methods and scientific induction”
Journal of the Royal Statistical Society Series B 17: 69–78. 1955, at 74-75

Fisher was well known for his robust, sometimes spiteful, views on other people’s work. However, it was Maurice Kendall in his obituary of Fisher who observed that:

… a man’s attitude toward inference, like his attitude towards religion, is determined by his emotional make-up, not by reason or mathematics.

Deconstructing Deming XI B – Eliminate numerical goals for management

11. Part B. Eliminate numerical goals for management.

W. Edwards Deming.jpgA supposed corollary to the elimination of numerical quotas for the workforce.

This topic seems to form a very large part of what passes for exploration and development of Deming’s ideas in the present day. It gets tied in to criticisms of remuneration practices and annual appraisal, and target-setting in general (management by objectives). It seems to me that interest flows principally from a community who have some passionately held emotional attitudes to these issues. Advocates are enthusiastic to advance the views of theorists like Alfie Kohn who deny, in terms, the effectiveness of traditional incentives. It is sad that those attitudes stifle analytical debate. I fear that the problem started with Deming himself.

Deming’s detailed arguments are set out in Out of the Crisis (at pp75-76). There are two principle reasoned objections.

  1. Managers will seek empty justification from the most convenient executive time series to hand.
  2. Surely, if we can improve now, we would have done so previously, so managers will fall back on (1).

The executive time series

I’ve used the time series below in some other blogs (here in 2013 and here in 2012). It represents the anual number of suicides on UK railways. This is just the data up to 2013.
RailwaySuicides2

The process behaviour chart shows a stable system of trouble. There is variation from year to year but no significant (sic) pattern. There is noise but no signal. There is an average of just over 200 fatalities, varying irregularly between around 175 and 250. Sadly, as I have discussed in earlier blogs, simply selecting a pair of observations enables a polemicist to advance any theory they choose.

In Railway Suicides in the UK: risk factors and prevention strategies, Kamaldeep Bhui and Jason Chalangary of the Wolfson Institute of Preventive Medicine, and Edgar Jones of the Institute of Psychiatry, King’s College, London quoted the Rail Safety and Standards Board (RSSB) in the following two assertions.

  • Suicides rose from 192 in 2001-02 to a peak 233 in 2009-10; and
  • The total fell from 233 to 208 in 2010-11 because of actions taken.

Each of these points is what Don Wheeler calls an executive time series. Selective attention, or inattention, on just two numbers from a sequence of irregular variation can be used to justify any theory. Deming feared such behaviour could be perverted to justify satisfaction of any goal. Of course, the process behaviour chart, nowhere more strongly advocated than by Deming himself in Out of the Crisis, is the robust defence against such deceptions. Diligent criticism of historical data by means of process behaviour charts is exactly what is needed to improve the business and exactly what guards against success-oriented interpretations.

Wishful thinking, and the more subtle cognitive biases studied by Daniel Kahneman and others, will always assist us in finding support for our position somewhere in the data. Process behaviour charts keep us objective.

If not now, when?

If I am not for myself, then who will be for me?
And when I am for myself, then what am “I”?
And if not now, when?

Hillel the Elder

Deming criticises managerial targets on the grounds that, were the means of achieving the target known, it would already have been achieved and, further, that without having the means efforts are futile at best. It’s important to remember that Deming is not here, I think, talking about efforts to stabilise a business process. Deming is talking about working to improve an already stable, but incapable, process.

There are trite reasons why a target might legitimately be mandated where it has not been historically realised. External market conditions change. A manager might unremarkably be instructed to “Make 20% more of product X and 40% less of product Y“. That plays in to the broader picture of targets’ role in co-ordinating the parts of a system, internal to the organisation of more widely. It may be a straightforward matter to change the output of a well-understood, stable system by an adjustment of the inputs.

Deming says:

If you have a stable system, then there is no use to specify a goal. You will get whatever the system will deliver.

But it is the manager’s job to work on a stable system to improve its capability (Out of the Crisis at pp321-322). That requires capital and a plan. It involves a target because the target captures the consensus of the whole system as to what is required, how much to spend, what the new system looks like to its customer. Simply settling for the existing process, being managed through systematic productivity to do its best, is exactly what Deming criticises at his Point 1 (Constancy of purpose for improvement).

Numerical goals are essential

… a manager is an information channel of decidedly limited capacity.

Kenneth Arrow
Essays in the Theory of Risk-Bearing

Deming’s followers have, to some extent, conceded those criticisms. They say that it is only arbitrary targets that are deprecated and not the legitimate Voice of the Customer/ Voice of the Business. But I think they make a distinction without a difference through the weasel words “arbitrary” and “legitimate”. Deming himself was content to allow managerial targets relating to two categories of existential risk.

However, those two examples are not of any qualitatively different type from the “Increase sales by 10%” that he condemns. Certainly back when Deming was writing Out of the Crisis most OELs were based on LD50 studies, a methodology that I am sure Deming would have been the first to criticise.

Properly defined targets are essential to business survival as they are one of the principal means by which the integrated function of the whole system is communicated. If my factory is producing more than I can sell, I will not work on increasing capacity until somebody promises me that there is a plan to improve sales. And I need to know the target of the sales plan to know where to aim with plant capacity. It is no good just to say “Make as much as you can. Sell as much as you can.” That is to guarantee discoordination and inefficiency. It is unsurprising that Deming’s thinking has found so little real world implementation when he seeks to deprive managers of one of the principle tools of managing.

Targets are dangerous

I have previously blogged about what is needed to implement effective targets. An ill judged target can induce perverse incentives. These can be catastrophic for an organisation, particularly one where the rigorous criticism of historical data is absent.

The art of managing footballers

Van Persie (15300483040) (crop).jpg… or is it a science? Robin van Persie’s penalty miss against West Bromwich Albion on 2 May 2015 was certainly welcome news to my ears. It eased the relegation pressures on West Brom and allowed us to advance to 40 points for the season. Relegation fears are only “mathematical” now. However, the miss also resulted in van Persie being relieved of penalty taking duties, by Manchester United manager Louis van Gaal, until further notice.

He is now at the end of the road. It is always [like that]. Wayne [Rooney] has missed also so when you miss you are at the bottom again.

The Daily Mail report linked above goes on to say that van Persie had converted his previous 6 penalties.

Van Gaal was, of course, referring to Rooney’s shot over the crossbar against West Ham in February 2013, when Rooney had himself invited then manager Sir Alex Ferguson to retire him as designated penalty taker. Rooney’s record had apparently been 9 misses from 27 penalties. I have all this from this Daily Telegraph report.

I wonder if statistics can offer any insight into soccer management?

The benchmark

It was very difficult to find, very quickly, any exhaustive statistics on penalty conversion rates on the web. However, I would like to start by establishing what constituted “good” performance for a penalty taker. As a starting point I have looked at Table 2 on this Premier League website. The data is from February 2014 and shows, at that date, data on the players with the best conversion rates in the League’s history. Players who took fewer than 10 penalties were excluded. It shows that of the ten top converting players, who must rank as the very good if not the ten best, in the aggregate they converted 155 of 166 penalties. That is a conversion rate of 93.4%. At first sight that suggests a useful baseline against which to assess any individual penalty taker.

Several questions come to mind. The aggregate statistics do not tell us how individual players have developed over time, whether improving or losing their nerve. That said, it is difficult to perform that sort of analysis on these comparatively low volumes of data when collected in this way. There is however data (Table 4) on the overall conversion rate in the Premier League since its inception.

Penalties

That looks to me like a fairly stable system. That would be expected as players come and go and this is the aggregate of many effects. Perhaps there is latterly reduced season-to-season variation, which would be odd, but I am not really interested in that and have not pursued it. I am aware that during this period there has been a rule change allowing goalkeepers to move before the kick his taken but I have just spent 30 minutes on the web and failed to establish the date when that happened. The total aggregate statistics up to 2014 are 1,438 penalties converted out of 1,888. That is a conversion rate of 76.2%.

I did wonder if there was any evidence that some of the top ten players were better than others or whether the data was consistent with a common elite conversion rate of 93.4%. In that case the table positions would reflect nothing more than sampling variation. Somewhat reluctantly I calculated the chi-squared statistic for the table of successes and failures (I know! But what else to do?). The statistic came out as 2.02 which, with 9 degrees of freedom, has a p-value (I know!) of 0.8%. That is very suggestive of a genuine ranking among the elite penalty takers.

It inevitably follows that the elite are doing better than the overall success rate of 76.2%. Considering all that together I am happy to proceed with 93.4% as the sort of benchmark for a penalty taker that a team like Manchester United would aspire to.

Van Persie

This website, dated 6 Sept 2012, told me that van Persie had converted 18 penalties with a 77% success rate. That does not quite fit either 18/23 or 18/24 but let us take it at face value. If that is accurate then that is, more or less, the data on which Ferguson gave van Persie the job in February 2013. It is a surprising appointment given the Premier League average of 76.2% and the elite benchmark but perhaps it was the best that could be mustered from the squad.

Rooney’s 9 misses out of 27 yields a success rate of 67%. Not so much lower than van Persie’s historical performance but, in all the circumstances, it was not good enough.

The dismissal

What is fascinating is that, no matter what van Persie’s historical record on which he was appointed penalty taker, before his 2 May miss he had scored 6 out of 6. The miss made it 6 out of 7, 85.7%. That was his recent record of performance, even if selected to some extent to show him in a good light.

Selection of that run is a danger. It is often “convenient” to select a subset of data that favours a cherished hypothesis. Though there might be that selectivity, where was the real signal that van Persie had deteriorated or that the club would perform better were he replaced?

The process

Of course, a manager has more information than the straightforward success/ fail ratio. A coach may have observed goalkeepers increasingly guessing a penalty taker’s shot direction. There may have been many near-saves, a hesitancy on the part of the player, trepidation in training. Those are all factors that a manager must take into account. That may lead to the rotation of even the most impressive performer. Perhaps.

But that is not the process that van Gaal advocates. Keep scoring until you miss then go to the bottom of the list. The bottom! Even scorers in the elite-10 miss sometimes. Is it rational to then replace them with an alternative that will most likely be more average (i.e. worse)? And then make them wait until everyone else has missed.

With an average success rate of 76.2% it is more likely than not that van Persie’s replacement will score their first penalty. Van Gaal will be vindicated. That is the phenomenon called regression to the mean. An extreme event (a miss) is most likely followed by something more average (a goal). Economist Daniel Kahneman explores this at length in his book Thinking, Fast and Slow.

It is an odd strategy to adopt. Keep the able until they fail. Then replace them with somebody less able. But different.