Just says, “in mice”; just says, “in boys”

If anybody doubts that twitter has a valuable role in the world they should turn their attention to the twitter sensation that is @justsaysinmice.

The twitter feed exposes bad science journalism where extravagant claims are advanced with a penumbra of implication that something relevant to human life or happiness has been certified by peer reviewed science. It often turns out that, when the original research is interrogated, and in fairness at the very bottom of the journalistic puff piece, it just says, “in mice”. Cauliflower, cabbage, broccoli harbour prostate cancer inhibiting compound, was a recent subeditor’s attention grabbing headline. But the body of the article just says, “in mice”. Most days the author finds at least one item to tweet.

Population – Frame – Sample

The big point here is one of the really big points in understanding statistics.
We start generating data and doing statistics because there is something out there we are interested in. Some things or events. We call the things and events we are bothered about the population. The problem is that, in the real world, it is often difficult to get hold of all those things or events. In an opinion poll, we don’t know who will vote at the next election, or even who will still be alive. We don’t know all the people who follow a particular sports club. We can’t find everyone who’s ever tasted Marmite and expressed an opinion. Sometimes the events or things we are interested in don’t even exist yet and lie wholly in the future. That’s called prediction and forecasting.

In order to do the sort of statistical sampling that text books tell us about, we need to identify some relevant material that is available to us to measure or interrogate. For the opinion poll it would be everyone on the electoral register, perhaps. Or everyone who can be reached by dialing random numbers in the region of interest. Or everyone who signs up to an online database (seriously). Those won’t be the exact people who will be doing the voting at the next election. Some of them likely will be. But we have to make a judgment that they are, somehow, representative.

Similarly, if we want to survey sports club supporters we could use the club’s supporter database. Or the people who buy tickets online. Or who tweet. Not perfect but, hey! And, perhaps, in some way representative.

The collection of things we are going to do the sampling on is called the sampling frame. We don’t need to look at the whole of the frame. We can sample. And statistical theory assures us about how much the sample can tell us about the frame, usually quite a lot if done properly. But as to the differences between population and frame, that is another question.
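To make the frame/population distinction concrete, here is a minimal Python sketch. The frame, the population rate and the sample size are all invented for illustration; nothing here comes from a real poll. It draws a simple random sample from a hypothetical frame and reports the estimate with its standard error.

```python
import math
import random

def sample_proportion(frame, n, rng):
    """Estimate a proportion from a simple random sample of the frame.

    Returns the estimate and its standard error, including the finite
    population correction because we sample without replacement.
    """
    sample = rng.sample(frame, n)
    p_hat = sum(sample) / n
    fpc = math.sqrt((len(frame) - n) / (len(frame) - 1))
    se = math.sqrt(p_hat * (1 - p_hat) / n) * fpc
    return p_hat, se

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical frame of 10,000 people, 60% of whom hold some opinion.
frame = [1] * 6000 + [0] * 4000
frame_rate = sum(frame) / len(frame)

# Hypothetical population of interest, where the true rate is 50%.
# The sample never sees this number.
population_rate = 0.50

p_hat, se = sample_proportion(frame, 1000, rng)
print(f"sample estimate: {p_hat:.3f} (standard error {se:.3f})")
print(f"frame rate:      {frame_rate:.3f}")
print(f"population rate: {population_rate:.3f}")
```

The standard error describes how far the estimate is likely to sit from the frame rate. It says nothing about the gap between frame and population. Closing that gap takes subject matter knowledge, not a bigger sample.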

Enumerative and analytic statistics

These real world situations lie in contrast to the sort of simplified situations found in statistics text books. An inspector randomly samples 5 widgets from a batch of 100 and decides whether to accept or reject the batch (though why anyone would do this still defies rational explanation). Here the frame and population are identical. No need to worry.
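For the text book widget example, the chance of accepting a batch can be worked out exactly. The sketch below assumes a hypergeometric model and a hypothetical decision rule, accept the batch if the sample of 5 contains no defectives. The rule and the defect counts are assumptions made for illustration; the text does not specify them.

```python
from math import comb

def prob_accept(batch_size, defectives, sample_size, accept_number=0):
    """Probability of accepting a batch under a hypergeometric model.

    The batch is accepted when the sample contains at most
    `accept_number` defective items.
    """
    total = comb(batch_size, sample_size)
    favourable = sum(
        comb(defectives, k) * comb(batch_size - defectives, sample_size - k)
        for k in range(accept_number + 1)
    )
    return favourable / total

# Hypothetical numbers of defective widgets in a batch of 100.
for d in (0, 5, 10, 20):
    print(f"{d:>2} defectives in batch: P(accept) = {prob_accept(100, d, 5):.3f}")
```

Under these assumptions a batch with 10 defectives in 100 is still accepted more often than not, which gives some sense of how little a sample of 5 can discriminate.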

W Edwards Deming was a statistician who, among his other achievements, developed the sampling techniques used in the 1940 US census. Deming thought deeply about sampling and continually emphasised the distinction between the sort of problems where population and frame were identical, what he called enumerative statistics, and the sundry real world situations where they were not, analytic statistics.1

The key to Deming’s thinking is that, where we are doing analytic statistics, we are not trying to learn about the frame, that is not what interests us, we are trying to learn something useful about the population of concern. That means that we have to use the frame data to learn about the cause system that is common to frame and population. By cause system, Deming meant the aggregate of competing, interacting and evolving factors, inherent and environmental, that influence the outcomes both in frame and population. As Donald Rumsfeld put it, the known knowns, the known unknowns and the unknown unknowns.

The task of understanding how any particular frame and population depend on a common cause-system requires deep subject matter knowledge. As does knowing the scope for reading across conclusions.

Just says, “in mice”

Experimenting on people is not straightforward. That’s why we do experiments on mice.

But here the frame and population are wildly disjoint.

Mice frame

So why? Well apparently, their genetic, biological and behavior characteristics closely resemble those of humans, and many symptoms of human conditions can be replicated in mice.2 That is, their cause systems have something in common. Not everything but things useful to researchers and subject matter experts.

Mice cause

Now, that means that experimental results in mice can’t just be read across as though we had done the experiment on humans. But they help subject matter experts learn more about those parts of the cause-system that are common. That might then lead to tentative theories about human welfare that can then be tested in the inevitably more ethically stringent regime of human trials.

So, not only is bad, often sensationalist, data journalism exposed, but we learn a little more about how science is done.

Just says, “in boys”

If the importance of this point needed emphasising then Caroline Criado Perez makes the case compellingly in her recent book Invisible Women.3

It turns out that much medical research, much development of treatments and even assessment of motor vehicle safety have historically been performed on frames dominated by men, but with results then read across as though representative of men and women. Perez goes on to show how this has made women’s lives less safe and less healthy than they need have been.

It seems that it is not only journalists who are addicted to bad science.

Anyone doing statistics needs aggressively to scrutinise their sampling frame and how it matches the population of interest. Contrasts in respective cause systems need to be interrogated and distinguished with domain knowledge, background information and contextual data. Involvement in statistics carries responsibilities.


  1. Deming, W E (1975) “On probability as a basis for action”, American Statistician 29, 146
  2. Melina, R (2010) “Why Do Medical Researchers Use Mice?”, Live Science, retrieved 18:32 UTC 2/6/19
  3. Perez, C C (2019) Invisible Women: Exposing Data Bias in a World Designed for Men, Chatto & Windus

Populism and p-values

Time for The Guardian to get the bad data-journalism award for this headline (25 February 2019).

Vaccine scepticism grows in line with rise of populism – study

Surges in measles cases map tightly to countries where populism is on the march

The report was a journalistic account of a paper by Jonathan Kennedy of the Global Health Unit, Centre for Primary Care and Public Health, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, entitled Populist politics and vaccine hesitancy in Western Europe: an analysis of national-level data.1

Studies show a strong correlation between votes for populist parties and doubts that vaccines work, declared the newspaper, relying on support from the following single chart redrawn by The Guardian‘s own journalists.
PV Fig 1
It seemed to me there was more to the chart than the newspaper report. Is it possible this was all based on an uncritical regression line? Like this (hand drawn line but, believe me, it barely matters – you get the idea).
PV Fig 2
Perhaps there was a p-value too but I shall come back to that. However, looking back at the raw chart, I wondered if it wouldn’t bear a different analysis. The 10 countries, Portugal, …, Denmark and the UK all have “vaccine hesitancy” rates around 6%. That does not vary much with populist support varying between 0% and 27% between them. Again, France, Greece and Italy all have “hesitancy” rates of around 17%, such rate not varying much with populist support varying from 25% to 45%. In fact the Guardian journalist seems to have had that thought too. The two groups are colour coded on the chart. So much for the relationship between “populism” and “vaccine hesitancy”. Austria seems not to fit into either group but perhaps that makes it interesting. France has three times the hesitancy of Denmark but is less “populist”.

So what about this picture?
PV Fig 3
Perhaps there are two groups, one with “hesitancy” around 6% and one, around 17%. Austria is an interesting exception. What differences are there between the two groups, aside from populist sentiment? I don’t know because it’s not my study or my domain of knowledge. But, whenever there is a contrast between two groups we ought to explore all the differences we can think of before putting forward an, even tentative, explanation. That’s what Ignaz Semmelweis did when he observed signal differences in post-natal mortality between two wards of the Vienna General Hospital in the 1840s.2 Austria again, coincidentally. He investigated all the differences that he could think of between the wards before advancing and testing his theory of infection. As to the vaccine analysis, we already suspect that there are particular dynamics in Italy around trust in bureaucracy. That’s where the food scare over hormone-treated beef seems to have started so there may be forces at work that make it atypical of the other countries.3, 4, 5
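One way to put a number on the two-group picture is to compare how much variation a straight line accounts for against how much a simple two-level fit accounts for. The sketch below does that in Python. The data are INVENTED to have roughly the shape described above (a flat low group and a flat high group). They are not the paper's figures or the newspaper's, and the group labels are an assumption.

```python
def r_squared(y, fitted):
    """Share of the variation in y accounted for by the fitted values."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1 - ss_res / ss_tot

def line_fit(x, y):
    """Ordinary least squares straight line; returns the fitted values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [mean_y + slope * (xi - mean_x) for xi in x]

def group_fit(y, groups):
    """Fit each observation by the mean of its own group."""
    means = {}
    for g in set(groups):
        members = [yi for yi, gi in zip(y, groups) if gi == g]
        means[g] = sum(members) / len(members)
    return [means[g] for g in groups]

# INVENTED data: populist vote (%) and "hesitancy" (%) for 13 countries.
populism = [0, 3, 5, 8, 10, 13, 15, 20, 24, 27, 25, 35, 45]
hesitancy = [6, 5, 7, 6, 5, 7, 6, 6, 7, 5, 17, 18, 16]
groups = ["low"] * 10 + ["high"] * 3

r2_line = r_squared(hesitancy, line_fit(populism, hesitancy))
r2_groups = r_squared(hesitancy, group_fit(hesitancy, groups))
print(f"straight line R^2: {r2_line:.2f}")
print(f"two-level R^2:     {r2_groups:.2f}")
```

A caution on reading the result: the groups here were chosen after looking at the data, so a high two-level figure is exploratory only. It suggests where to look for other differences between the groups rather than confirming anything.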

Slightly frustrated, I decided that I wanted to look at the original publication. This was available free of charge on the publisher’s website at the time I read The Guardian article. But now it isn’t. You will have to pay EUR 36, GBP 28 or USD 45 for 24 hour access. Perhaps you feel you already paid.

The academic research

The “populism” data comes from votes cast in the 2014 elections to the European Parliament. That means that the sampling frame was voters in that election. Turnout in the election was 42%. That is not the whole of the population of the respective countries and voters at EU elections are, I think I can safely say, not representative of the population at large. The “hesitancy” data came from something called the Vaccine Confidence Project (“the VCP”) for which citations are given. It turns out that 65,819 individuals were sampled across 67 countries in 2015. I have not looked at details of the frame, sampling, handling of non-responses, adjustments and so on, but I start off by noting that the two variables are sampled from, inevitably, different frames and that is not really discussed in the paper. Of course here, We make no mockery of honest ad-hockery.6

The VCP put a number of questions to the respondents. It is not clear from the paper whether there were more questions than analysed here. Each question was answered “strongly agree”, “tend to agree”, “do not know”, “tend to disagree”, and “strongly disagree”. The “hesitancy” variable comes from the aggregate of the latter two categories. I would like to have seen the raw data.

The three questions are set out below, along with the associated R² values from the regression analysis.

Question                                          R²
(1) Vaccines are important for children to have   63%
(2) Overall I think vaccines are effective        52%
(3) Overall I think vaccines are safe             25%

Well the individual questions raise issues about variation in interpreting the respective meanings, notwithstanding translation issues between languages, fidelity and felicity.

As I guessed, there were p-values but, as usual, they add nil to the analysis.

We now see that the plot reproduced in The Guardian is for question (2) alone and has R² = 52%. I would have been interested in seeing R² for my 2-level analysis. The plotted response for question (1) (not reproduced here) actually looks a bit more like a straight line and has better fit. However, in both cases, I am worried by how much leverage the Italy group has. Not discussed in the paper. No regression diagnostics.
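Leverage is one of the simpler regression diagnostics to compute. For a straight-line fit, the leverage of each point depends only on how far its x-value lies from the mean of the x-values. The Python sketch below uses INVENTED populist-vote percentages, not the paper's data, and a common rule of thumb for flagging points (leverage above 2p/n with p = 2 fitted parameters). The rule of thumb is an assumption, not something taken from the paper.

```python
def leverages(x):
    """Hat values for a straight-line regression on a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# INVENTED "populism" percentages, one per country; not the paper's data.
populism = [0, 5, 10, 12, 15, 18, 20, 22, 25, 27, 30, 35, 45]

h = leverages(populism)

# Rule of thumb: flag points with leverage above 2p/n,
# where p = 2 parameters (intercept and slope).
threshold = 2 * 2 / len(populism)
flagged = [i for i, hi in enumerate(h) if hi > threshold]

for xi, hi in zip(populism, h):
    print(f"x = {xi:>2}  leverage = {hi:.3f}")
print("threshold:", round(threshold, 3))
print("flagged indices:", flagged)
```

High leverage is not in itself a fault. It marks the points whose removal would most change the fitted line, which is why they deserve separate scrutiny.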

So how about this picture, from Kennedy’s paper, for the response to question (3)?

PV Fig 5

Now, the variation in perceptions of vaccine safety, between France, Greece and Italy, is greater than between the remaining countries. Moreover, if anything, among that group, there is evidence that “hesitancy” falls as “populism” increases. There is certainly no evidence that it increases. In my opinion, that figure is powerful evidence that there are other important factors at work here. That is confirmed by the lousy R² = 25% for the regression. And this is about perceptions of vaccine safety specifically.

I think that the paper also suffers from a failure to honour John Tukey’s trenchant distinction between exploratory data analysis and confirmatory data analysis. Such a failure always leads to trouble.

Confirmation bias

On the basis of his analysis, Kennedy felt confident to conclude as follows.

Vaccine hesitancy and political populism are driven by similar dynamics: a profound distrust in elites and experts. It is necessary for public health scholars and actors to work to build trust with parents that are reluctant to vaccinate their children, but there are limits to this strategy. The more general popular distrust of elites and experts which informs vaccine hesitancy will be difficult to resolve unless its underlying causes—the political disenfranchisement and economic marginalisation of large parts of the Western European population—are also addressed.

Well, in my opinion that goes a long way from what the data reveal. The data are far from being conclusive as to association between “vaccine hesitancy” and “populism”. Then there is the unsupported assertion of a common causation in “political disenfranchisement and economic marginalisation”. While the focus remains there, the diligent search for other important factors is ignored and devalued.

We all suffer from a confirmation bias in favour of our own cherished narratives.7 We tend to believe and share evidence that we feel supports the narrative and ignore and criticise that which doesn’t. That has been particularly apparent over recent months in the energetic, even enthusiastic, reporting as fact of the scandalous accusations made against Nathan Phillips and dubious allegations made by Jussie Smollett. They fitted a narrative.

I am as bad. I hold to the narrative that people aren’t very good with statistics and constantly look for examples that I can dissect to “prove” that. Please tell me when you think I get it wrong.

Yet, it does seem to me that the author here, and The Guardian journalist, ought to have been much more critical of the data and much more curious as to the factors at work. In my view, The Guardian had a particular duty of candour as the original research is not generally available to the public.

This sort of selective analysis does not build trust in “elites and experts”.


  1. Kennedy, J (2019) Populist politics and vaccine hesitancy in Western Europe: an analysis of national-level data, European Journal of Public Health, ckz004, https://doi.org/10.1093/eurpub/ckz004
  2. Semmelweis, I (1860) The Etiology, Concept, and Prophylaxis of Childbed Fever, trans. K Codell Carter [1983] University of Wisconsin Press: Madison, Wisconsin
  3. Kerr, W A & Hobbs, J E (2005). “9. Consumers, Cows and Carousels: Why the Dispute over Beef Hormones is Far More Important than its Commercial Value”, in Perdikis, N & Read, R, The WTO and the Regulation of International Trade. Edward Elgar Publishing, pp 191–214
  4. Caduff, L (August 2002). “Growth Hormones and Beyond” (PDF). ETH Zentrum. Archived from the original (PDF) on 25 May 2005. Retrieved 11 December 2007.
  5. Gandhi, R & Snedeker, S M (June 2000). “Consumer Concerns About Hormones in Food”. Program on Breast Cancer and Environmental Risk Factors. Cornell University. Archived from the original on 19 July 2011.
  6. I J Good
  7. Kahneman, D (2011) Thinking, Fast and Slow, London: Allen Lane, pp80-81

UK railway suicides – 2018 update

The latest UK rail safety statistics were published on 6 December 2018, again absent much of the press fanfare we had seen in the past. Regular readers of this blog will know that I have followed the suicide data series, and the press response, closely in 2017, 2016, 2015, 2014, 2013 and 2012. Again I have re-plotted the data myself on a Shewhart chart.


Readers should note the following about the chart.

  • Many thanks to Tom Leveson Gower at the Office of Rail and Road who confirmed that the figures are for the year up to the end of March.
  • Some of the numbers for earlier years have been updated by the statistical authority.
  • I have recalculated natural process limits (NPLs) as there are still no more than 20 annual observations, and because the historical data has been updated. The NPLs have therefore changed but, this year, not by much.
  • Again, the pattern of signals, with respect to the NPLs, is similar to last year.

The current chart again shows the same two signals, an observation above the upper NPL in 2015 and a run of 8 below the centre line from 2002 to 2009. As I always remark, the Terry Weight rule says that a signal gives us license to interpret the ups and downs on the chart. So I shall have a go at doing that.
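For readers who want to try this on their own data, the two kinds of signal mentioned above can be sketched in a few lines of Python. This assumes an individuals (XmR) chart with the conventional 2.66 scaling of the mean moving range. The post does not state exactly how its limits were calculated, so treat that as an assumption. The series below is INVENTED and is not the Office of Rail and Road data.

```python
def xmr_limits(values):
    """Lower limit, centre line and upper limit for an individuals chart."""
    centre = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_mr = sum(moving_ranges) / len(moving_ranges)
    # 2.66 is the conventional factor for moving ranges of two points.
    return centre - 2.66 * mean_mr, centre, centre + 2.66 * mean_mr

def outside_limits(values, lower, upper):
    """Indices of observations beyond either natural process limit."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]

def runs_on_one_side(values, centre, length=8):
    """Indices at which a run of `length` points on one side is completed."""
    hits = []
    run_side, run_len = 0, 0
    for i, v in enumerate(values):
        side = (v > centre) - (v < centre)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == run_side:
            run_len += 1
        else:
            run_side, run_len = side, (1 if side != 0 else 0)
        if run_len >= length:
            hits.append(i)
    return hits

# INVENTED annual counts, for illustration only.
counts = [200, 195, 190, 185, 192, 188, 194, 190, 196,
          205, 215, 225, 230, 240, 290, 250, 245, 260]

lower, centre, upper = xmr_limits(counts)
print(f"NPLs: {lower:.1f} to {upper:.1f}, centre line {centre:.1f}")
print("points outside the NPLs:", outside_limits(counts, lower, upper))
print("runs of 8 completed at:", runs_on_one_side(counts, centre))
```

With fewer than about 20 observations the limits are still soft, which is the reason given above for recalculating them each year.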

After two successive annual falls there has been an increase in the number of fatalities.

I haven’t yet seen any real contemporaneous comment on the numbers from the press this year. But what conclusions can we really draw?

In 2015 I was coming to the conclusion that the data increasingly looked like a gradual upward trend. The 2016 and 2017 data offered a challenge to that but my view was still that it was too soon to say that the trend had reversed. There was nothing in the data incompatible with a continuing trend. The decline has not continued but how much can we read into that? There is nothing inherently informative about a relative increase. Remember, the data would certainly have gone up or down. Then again, was there some sort of peak in 2015?

Signal or noise?

Has there been a change to the underlying cause system that drives the suicide numbers? Since the 2016 data, I have fitted a trend line through the data and asked which narrative best fitted what I observed, a continuing increasing trend or a trend that had plateaued or even reversed. You can review my analysis from 2016 here. And from 2017 here.

Here is the data and fitted trend updated with this year’s numbers, along with NPLs around the fitted line, the same as I did in 2016 and 2017.


We always go back to the cause and effect diagram.


As I always emphasise, the difficulty with the suicide data is that there is very little reproducible and verifiable knowledge as to its causes. There is a lot of useful thinking from common human experience and from more general theories in psychology. But the uncertainty is great. It is not possible to come up with a definitive cause and effect diagram on which all will agree, other than from the point of view of identifying candidate factors. In statistical terminology, the problem lacks rigidity.

The earlier evidence of a trend, however, suggests that there might be some causes that are developing over time. It is not difficult to imagine that economic trends and the cumulative awareness of other fatalities might have an impact. We are talking about a number of things that might appear on the cause and effect diagram and some that do not, the “unknown unknowns”. When I identified “time” as a factor, I was taking sundry “lurking” factors and suspected causes from the cause and effect diagram that might have a secular impact. I aggregated them under the proxy factor “time” for want of a more refined analysis.

What I have tried to do is to split the data into two parts:

  • A trend (linear simply for the sake of exploratory data analysis (EDA)); and
  • The residual variation about the trend.

The question I want to ask is whether the residual variation is stable, just plain noise, or whether there is a signal there that might give me a clue that a linear trend does not hold.
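The split into trend and residual can be sketched as follows. This is a minimal Python illustration assuming an ordinary least squares line against time and limits of 2.66 times the mean moving range of the residuals, centred on zero. Those choices are assumptions, and the series is INVENTED rather than the published figures.

```python
def linear_trend(y):
    """Least squares slope and intercept of y against time 0, 1, 2, ..."""
    n = len(y)
    mean_t, mean_y = (n - 1) / 2, sum(y) / n
    stt = sum((t - mean_t) ** 2 for t in range(n))
    sty = sum((t - mean_t) * (yi - mean_y) for t, yi in enumerate(y))
    slope = sty / stt
    return slope, mean_y - slope * mean_t

def detrend(y):
    """Residuals of y about its fitted linear trend."""
    slope, intercept = linear_trend(y)
    return [yi - (intercept + slope * t) for t, yi in enumerate(y)]

def residual_limits(residuals):
    """Limits for the residuals: 0 plus or minus 2.66 x mean moving range."""
    moving_ranges = [abs(b - a) for a, b in zip(residuals, residuals[1:])]
    half_width = 2.66 * sum(moving_ranges) / len(moving_ranges)
    return -half_width, half_width

# INVENTED annual counts, for illustration only.
counts = [200, 195, 190, 185, 192, 188, 194, 190, 196,
          205, 215, 225, 230, 240, 290, 250, 245, 260]

residuals = detrend(counts)
lower, upper = residual_limits(residuals)
signals = [i for i, r in enumerate(residuals) if r < lower or r > upper]

print(f"residual limits: {lower:.1f} to {upper:.1f}")
print("residuals outside the limits:", signals)
```

If no residual falls outside the limits, and there are no long runs on one side of zero, the detrended data offer no evidence against the linear trend. That is a weaker statement than saying the trend is real.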

There is no signal in the detrended data, no signal that the trend has reversed. The tough truth of the data is that it supports either narrative.

  • The upward trend is continuing and is stable. There has been no reversal of trend yet.
  • The raw data is not stable. True there is evidence of an upward trend in the past but there is now evidence that deaths are decreasing, notwithstanding the increase over the last year.

Of course, there is no particular reason, absent the data, to believe in an increasing trend and the initiative to mitigate the situation might well be expected to result in an improvement.

Sometimes, with data, we have to be honest and say that we do not have the conclusive answer. That is the case here. All that can be done is to continue the existing initiatives and look to the future. Nobody ever likes that as a conclusion but it is no good pretending things are unambiguous when that is not the case.

Next steps

Previously I noted proposals to repeat a strategy from Japan of bathing railway platforms with blue light. In the UK, I understand that such lights were installed at Gatwick in summer 2014. There is some recent commentary here from the BBC but I feel the absence of any real systematic follow up on this. I have certainly seen nothing from Gatwick. My wife and I returned through there mid-January this year and the lights are still in place.

A huge amount of sincere endeavour has gone into this issue but further efforts have to be against the background that there is still no conclusive evidence of improvement.

Suggestions for alternative analyses are always welcomed here.

The risks of lead in the environment – social choice and individual values


Almost one in five deaths in the US can be linked to lead pollution, with even low levels of exposure potentially fatal, researchers have said.

That, in any event, was the headline in the Times (London) (£paywall) last week.

Gas pump lead warning

Historical environmental lead

The item turned out to be based on academic research by Professor Bruce Lanphear of Simon Fraser University, and others. You can find their published paper here in The Lancet: Public Health.1 It is publicly available at no charge, a practice very much to be encouraged. You know that I bristle at publicly funded research not being made available to the public.

As it was, no specific thing in either news report or the academic research struck me as wholly wrong. However, it made me wonder about the implied message of the news item and broader issues about communicating risk. I have some criticisms of the academic work, or at least how it is presented, but I will come to those below. I don’t have major doubts about the conclusions.

The pot odds of a jaywalker

Lanphear’s principal result concerned hazard rates so it is worth talking a little about what they are. Suppose I stand still in the middle of the carriageway at Hyde Park Corner (London) or Times Square (New York) or … . Suppose the pedestrian lights are showing “Don’t walk”. The probability that I get hit by a motor car is fairly high. A good 70 to 80% in my judgment, if I stand there long enough.

Now, suppose I sprint across under the same conditions. My chances of emerging unscathed still aren’t great but I think they are better. A big difference is what engineers call the Time at Risk (TAR). In general, the longer I expose myself to a hazardous situation, the greater the probability that I encounter my nemesis.

Now, there might be other differences between the risks in the two situations. A moving target might be harder to hit or less easy to avoid. However, it feels difficult to make a fair comparison of the risk because of the different TARs. Hazard rates provide a common basis for comparing what actuaries call the force of mortality without the confounding effect of exposure time. Hazard rates, effectively, offer a probability per unit time. They are measured in units like “percent per hour”. The math is actually quite complicated but hazard rates translate into probabilities when you multiply them by TAR. Roughly.
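The “roughly” can be made a little more precise. If the hazard rate is assumed constant over the exposure, the probability of at least one event in a time at risk T is 1 − exp(−hazard × T), which is close to hazard × T only when that product is small. The Python sketch below uses an INVENTED hazard rate for the jaywalker. The constant-hazard assumption is a simplification for illustration.

```python
import math

def prob_event(hazard_rate, time_at_risk):
    """Probability of at least one event, assuming a constant hazard rate."""
    return 1 - math.exp(-hazard_rate * time_at_risk)

# Hypothetical hazard: 0.5 per minute spent standing in the carriageway.
hazard = 0.5
for minutes in (0.1, 1, 3):
    exact = prob_event(hazard, minutes)
    rough = hazard * minutes  # the "multiply by TAR" shortcut
    print(f"{minutes:>4} min: exact {exact:.3f}, shortcut {rough:.3f}")
```

For the short sprint the shortcut and the exact figure nearly agree. For the long stand the shortcut exceeds 1 and stops being a probability at all, while the exact figure keeps climbing towards certainty.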

I was recently reading of the British Army’s mission to Helmand Province in Afghanistan.2 In Operation Oqab Tsuka, military planners had to analyse the ground transport of a turbine to a hydroelectric plant. Terrain made the transport painfully slow along a route beset with insurgents and hostile militias. The highway had been seeded with IEDs (“Improvised Explosive Devices”) which slowed progress still more. The analysis predicted in the region of 50 British service deaths to get the turbine to its destination. The extended time to traverse the route escalated the TAR and hence the hazard, literally the force of mortality. That analysis led to a different transport route being explored and adopted.

So hazard rates provide a baseline of risk disregarding exposure time.

Lanphear’s results

Lanphear was working with a well established sampling frame of 18,825 adults in the USA whose lead levels had been measured some time in 1988 to 1994 when they were recruited to the panel. The cohort had been followed up in a longitudinal study so that data was to hand as to their subsequent morbidity and mortality.

What Lanphear actually looked at was a ratio of hazard rates. For the avoidance of doubt, the hazard that he was looking at was death from heart disease. There was already evidence of a link with lead exposure. He looked at, among other things, how much the hazard rate changed between the cohort members with the lowest measured blood-lead levels and with the highest. That is, as measured back in the period 1988 to 1994. He found, this is his headline result, that an increase in historical blood-lead from 1.0 μg/dL (microgram per decilitre) to 6.7 μg/dL was associated with an estimated 37% increase in hazard rate for heart disease.

Moreover, 1.0 and 6.7 μg/dL represented the lower and upper limits of the middle 80% of the sample. These were not wildly atypical levels. So in going from the blood-lead level that marks the 10% least exposed to the level of the 10% most exposed we get a 37% increase in instantaneous risk from heart disease.

Now there are a few things to note. Firstly, it is fairly obvious that historical lead in blood would be associated with other things that influence the onset of heart disease, location in an industrial zone, income, exercise regime etc. Lanphear took those into account, as far as is possible, in his statistical modelling. These are the known unknowns. It is also obvious that some things have an impact on heart disease that we don’t know about yet or which are simply too difficult, or too costly or too unethical, to measure. These are the unknown unknowns. Variation in these factors causes variation in morbidity and mortality. But we can’t assign the variation to an individual cause. Further, that variation causes uncertainty in all the estimates. It’s not exactly 37%. However, bearing all that in mind rather tentatively, this is all we have got.

Despite those other sources of variation, I happen to know my personal baseline risk of suffering cardiovascular disease. As I explored here, it is 5% over 10 years. Well, that was 4 years ago so it’s 3% over the next 6. Now, I was brought up in the industrial West Midlands of the UK, Rowley Regis to be exact, in the 1960s. Our nineteenth-century-built house had water supplied through lead pipes and there was no diligent running-off of drinking water before use. Who knew? Our house was beside a busy highway.3 I would guess that, on any determination of historical exposure to environmental lead, I would rate in the top 10%.

That gives me a personal probability over the next 6 years of 1.37 × 3% = 4%. Or so. Am I bothered?

Well, no. Neither should you be.
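The back-of-envelope figure can be checked in a couple of lines. The Python sketch below reproduces the simple multiplication and compares it with the proportional hazards form, in which the surviving fraction is raised to the power of the hazard ratio. It keeps the same liberties as the text: the 10-year risk is prorated linearly to 6 years, and a hazard ratio estimated for mortality is applied to a risk of morbidity. The function names are made up for this illustration.

```python
def prorate_risk(risk, years_total, years_wanted):
    """Scale a risk to a shorter horizon in simple proportion."""
    return risk * years_wanted / years_total

def apply_hazard_ratio(baseline_risk, hazard_ratio):
    """Risk under proportional hazards: survival raised to the power HR."""
    return 1 - (1 - baseline_risk) ** hazard_ratio

baseline_6yr = prorate_risk(0.05, 10, 6)  # 5% over 10 years, so 3% over 6
simple = 1.37 * baseline_6yr              # the multiplication used in the text
adjusted = apply_hazard_ratio(baseline_6yr, 1.37)

print(f"baseline over 6 years:      {baseline_6yr:.1%}")
print(f"simple multiplication:      {simple:.2%}")
print(f"proportional hazards form:  {adjusted:.2%}")
```

Both routes land at about 4%. At risks this small the shortcut costs nothing, which is why “or so” is a fair summary.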

But …

Social Choice and Individual Values

That was the title of a seminal 1951 book by Nobel laureate economist Kenneth Arrow.4 Arrow applied his mind to the question of how society as a whole should respond when individuals in the society had differing views as to the right and the good, or even the true and the just.

The distinction between individual choice and social policy lies, I think, at the heart of the confusion of tone of the Times piece. The marginal risk to an individual, myself in particular, from historical lead is de minimis. I have taken a liberty in multiplying my hazard rate for morbidity by a hazard ratio for mortality but I think you get my point. There is no reason at all why I, or you, should be bothered in the slightest as to our personal health. Even with an egregious historical exposure. However, those minimal effects, aggregated across a national scale, add up to a real impact on the economy. Loss of productive hours, resources diverted to healthcare, developing professional expertise terminated early by disease. All these things have an impact on national wealth. A little elementary statistics, and a few not unreasonable assumptions, allows an estimate of the excess number of deaths that would not have occurred “but for” the environmental lead exposure. That number turns out to be 441,000 US deaths each year with an estimated annual impact on the economy of over $100 billion. If you are skeptical, perhaps it is one tenth of that.

Now, nobody is suggesting that environmental lead has precipitated some crisis in public health that ought to make us fear for our lives. That is where the Times article was badly framed. Lanphear and his colleagues are at pains to point out just how deaths from heart disease have declined over the past 50 years, how much healthier and long-lived we now are.

The analysis kicks in when policy makers come to consider choices between various taxation schemes, trade deals, international political actions, or infrastructure investment strategies. There, the impact of policy choices on environmental lead can be mapped directly into economic consequences. Here the figures matter a great deal. But to me? Not so much.

What is to be done?

How do we manage economy level policy when an individual might not perceive much of a stake? Arrow found that neither the ballot box nor markets offered a tremendously helpful solution. That leaves us with dependence on the bureaucratic professions, or the liberal elite as we are told we have to call them in these politically correct times. That in turn leads us back to Robert Michels’ Iron Law of Oligarchy. Historically, those elites have proved resistant to popular sentiments and democratic control. The modern solution is democratic governance. However, that is exactly what Michels viewed as doomed to fail. The account of the British Army in Afghanistan that I referred to above is a further anecdote of failure.5

But I am going to remain an optimist that bureaucrats can be controlled. Much of the difficulty arises from governance functions’ statistical naivety and lack of data smarts. Politicians aren’t usually the most data critical people around. The Times piece does not help. One of the things everyone can do is to be clearer that there are individual impacts and economy-wide impacts, and that they are different things. Just because you can discount a personal hazard does not mean there is not something that governments should be working to improve.

It’s not all about me.

Some remarks on the academic work

As I keep on saying, the most (sic) important part of any, at least conventional, regression modelling is residuals analysis and regression diagnostics.6 However, Lanphear and his colleagues were doing something a lot more complicated than the simple linear case. They were using proportional hazards modelling. Now, I know that there are really serious difficulties in residuals analysis for such models and in giving a neat summary figure of how much of the variation in the data is “explained” by the factors being investigated. However, there are diagnostic tools for proportional hazards and I would like to have seen something reported. Perhaps the analysis was done but my trenchant view is that it is vital that it is shared. For all the difficulties in this, progress will only be made by domain experts trying to develop practice collaboratively.

My mind is always haunted by the question: Was the regression worth it? And please remember that p-values in no way answer that question.

References and notes

  1. Lanphear, BP (2018) Low-level lead exposure and mortality in US adults: a population-based cohort study, The Lancet: Public Health. Published online.
  2. Farrell, T (2017) Unwinnable: Britain’s War in Afghanistan 2001-2014, London: The Bodley Head, pp239-244
  3. During the industrial revolution, this had been the important Oldbury to Halesowen turnpike-road. Even in the 1960s it carried a lot of traffic. My Black Country grandfather always referred to it as the ‘oss road, a road so significant that one might find horses on it. Keep out o’ the ‘oss road, m’ mon. He knew about risk.
  4. Arrow, KJ [1951] (2012) Social Choice and Individual Values, Martino Fine Books
  5. Farrell Op. cit.
  6. Draper, NR & Smith, H (1998) Applied Regression Analysis, 3rd ed., New York: Wiley, Chapters 2 and 8

Shewhart chart basics 1 – The environment sufficiently stable to be predictable

Everybody wants to be able to predict the future. Here is the forecaster’s catechism.

  • We can do no more than attach a probability to future events.
  • Where we have data from an environment that is sufficiently stable to be predictable we can project historical patterns into the future.
  • Otherwise, prediction is largely subjective;
  • … but there are tactics that can help.
  • The Shewhart chart is the tool that helps us know whether we are working with an environment that is sufficiently stable to be predictable.

Now let’s get to work.

What does a stable/predictable environment look like?

Every trial lawyer knows the importance of constructing a narrative out of evidence, an internally consistent and compelling arrangement of the facts that asserts itself above competing explanations. Time is central to how a narrative evolves. It is time that suggests causes and effects, motivations, barriers and enablers, states of knowledge, external influences, sensitisers and cofactors. That’s why exploration of data always starts with plotting it in time order. Always.

Let’s start off by looking at something we know to be predictable. Imagine a bucket of thousands of spherical beads. Of the beads, 80% are white and 20%, red. You are given a paddle that will hold 50 beads. Use the paddle to stir the beads then draw out 50 with the paddle. Count the red beads. Now you may, at this stage, object. Surely, this is just random and inherently unpredictable. But I want to persuade you that this is the most predictable data you have ever seen. Let’s look at some data from 20 sequential draws. In time order, of course, in Fig. 1.

Shew Chrt 1

Just to look at the data from another angle, always a good idea, I have added up how many times a particular value, 9, 10, 11, … , turns up and tallied them on the right hand side. For example, here is the tally for 12 beads in Fig. 2.

Shew Chrt 2

We get this in Fig. 3.

Shew Chrt 3
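For readers who would like to try the bead experiment without a bucket, here is a minimal Python sketch of it. The bucket size (4,000 beads), the seed and the function name are my own assumptions for illustration; only the 80/20 split, the paddle of 50 and the 20 draws come from the description above. It will not reproduce the figures exactly, because the draws are random.

```python
import random
from collections import Counter


def draw_red_counts(n_draws=20, paddle_size=50, n_red=800, n_white=3200, seed=1):
    """Simulate paddle draws from a bucket of red and white beads.

    The bucket size and the seed are illustrative assumptions;
    the text only says "thousands" of beads, 20% of them red.
    """
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    bucket = ["red"] * n_red + ["white"] * n_white
    counts = []
    for _ in range(n_draws):
        # Stir and draw a paddle of beads; the beads go back before the next draw.
        paddle = rng.sample(bucket, paddle_size)
        counts.append(paddle.count("red"))
    return counts


draws = draw_red_counts()
tally = Counter(draws)  # how many times each count of red beads turned up

print("Red beads per draw, in time order:", draws)
for value in sorted(tally):
    print(f"{value:2d} | {'x' * tally[value]}")
```

Changing the seed gives a different sequence of draws, but the overall shape of the tally should look much the same each time.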

Here are the important features of the data.

  • We can’t predict what the exact value will be on any particular draw.
  • The numbers vary irregularly from draw to draw, as far as we can see.
  • We can say that draws will vary somewhere between 2 (say) and 19 (say).
  • Most of the draws are fairly near 10.
  • Draws near 2 and 19 are much rarer.

I would be happy to predict that the 21st draw will be between 2 and 19, probably not too far from 10. I have tried to capture that in Fig. 4. There are limits to variation suggested by the experience base. As predictions go, let me promise you, that is as good as it gets.

Even statistical theory would point to an outcome not so very different from that. That theoretical support adds to my confidence.

Shew Chrt 4
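The theoretical support mentioned above can be sketched in a few lines. I am assuming here that each draw behaves, near enough, like a binomial count with 50 trials and a 20% chance of red on each, which is a reasonable approximation when the bucket holds thousands of beads. The use of three standard deviations either side of the mean is a common convention that I have chosen for illustration, not something derived in this post.

```python
import math

n, p = 50, 0.2  # paddle size and proportion of red beads, as in the text

mean = n * p                      # expected number of red beads per draw
sd = math.sqrt(n * p * (1 - p))   # binomial standard deviation
lower, upper = mean - 3 * sd, mean + 3 * sd


def binomial_pmf(k, n, p):
    """Probability of exactly k red beads under the binomial assumption."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Chance that a single draw lands between 2 and 19 red beads inclusive.
prob_2_to_19 = sum(binomial_pmf(k, n, p) for k in range(2, 20))

print(f"expected red beads per draw: {mean:.1f}")
print(f"standard deviation: {sd:.2f}")
print(f"mean +/- 3 sd: {lower:.1f} to {upper:.1f}")
print(f"P(2 <= reds <= 19): {prob_2_to_19:.4f}")
```

On these assumptions the calculation lands close to the limits of 2 and 19 that I read off the experience base.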

But there’s something else. Something profound.

A philosopher, an engineer and a statistician walk into a bar …

… and agree.

I got my last three bullet points above from just looking at the tally on the right hand side. What about the time order I was so insistent on preserving? As Daniel Kahneman put it “A random event does not … lend itself to explanation, but collections of random events do behave in a highly regular fashion.” What is this “regularity” when we can see how irregularly the draws vary? This is where time and narrative make their appearance.

If we take the draw data above, the exact same data, and “shuffle” it into a fresh order, we get this, Fig. 5.

Shew Chrt 5

Now the bullet points still apply to the new arrangement. The story, the narrative, has not changed. We still see the “irregular” variation. That is its “regularity”, that it tells the same story when we shuffle it. The picture and its inferences are the same. We cannot predict an exact value on any future draw yet it is all but sure to be between 2 and 19 and probably quite close to 10.

In 1924, British philosopher W E Johnson and US engineer Walter Shewhart, independently, realised that this was the key to describing a predictable process. It shows the same “regular irregularity”, or shall we say stable irregularity, when you shuffle it. Italian statistician Bruno de Finetti went on to derive the rigorous mathematics a few years later with his famous representation theorem. The most important theorem in the whole of statistics.

This is the exact characterisation of noise. If you shuffle it, it makes no difference to what you see or the conclusions you draw. It makes no difference to the narrative you construct (sic). Paradoxically, it is noise that is predictable.

To understand this, let’s look at some data that isn’t just noise.

Events, dear boy, events.

That was the alleged response of British Prime Minister Harold Macmillan when asked what had been the most difficult aspect of governing Britain.

Suppose our data looks like this in Fig. 6.

Shew Chrt 6

Let’s make it more interesting. Suppose we are looking at the net approval rating of a politician (Fig. 7).

Shew Chrt 7

What this looks like is noise plus a material step change between the 10th and 11th observation. Now, this is a surprise. The regularity, and the predictability, is broken. In fact, my first reaction is to ask What happened? I research political events and find at that same time there was an announcement of universal tax cuts (Fig. 8). This is just fiction of course. That then correlates with the shift in the data I observe. The shift is a signal, a flag from the data telling me that something happened, that the stable irregularity has become an unstable irregularity. I use the time context to identify possible explanations. I come up with the tentative idea about tax cuts as an explanation of the sudden increase in popularity.

The bullet points above no longer apply. The most important feature of the data now is the shift, I say, caused by the Prime Minister’s intervention.

Shew Chrt 8

What happens when I shuffle the data into a random order though (Fig. 9)?

Shew Chrt 9

Now, the signal is distorted, hard to see and impossible to localise in time. I cannot tie it to a context. The message in the data is entirely different. The information in the chart is not preserved. The shuffled data does not bear the same narrative as the time ordered data. It does not tell the same story. It does not look the same. That is how I know there is a signal. The data changes its story when shuffled. The time order is crucial.

Of course, if I repeated the tally exercise that I did on Fig. 4, the tally would look the same, just as it did in the noise case in Fig. 5.
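The contrast between the tally and the time order can be checked with a short Python sketch. The ratings below are made up for illustration, as is the helper `half_gap`, which simply compares the average of the second ten observations with the average of the first ten. It is a crude stand-in for looking at the chart, not Shewhart's method.

```python
import random
from collections import Counter

# Made-up net approval ratings: 10 observations around -10, then 10 around +10.
ratings = [-12, -9, -11, -8, -10, -13, -9, -10, -11, -7,
           9, 12, 10, 8, 11, 13, 9, 10, 12, 11]


def half_gap(series):
    """Mean of the second half of the series minus mean of the first half."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return sum(second) / len(second) - sum(first) / len(first)


rng = random.Random(7)  # arbitrary seed, chosen only for repeatability
shuffled = ratings[:]
rng.shuffle(shuffled)

# The tally ignores order, so shuffling cannot change it.
same_tally = Counter(ratings) == Counter(shuffled)

# The gap between the halves depends entirely on the time order.
gap_in_time_order = half_gap(ratings)
gap_after_shuffle = half_gap(shuffled)

print("tally unchanged by shuffling:", same_tally)
print("gap between halves, time order:", gap_in_time_order)
print("gap between halves, shuffled:", gap_after_shuffle)
```

The tally survives the shuffle untouched, while the step that was so obvious in time order is diluted across the whole series. For pure noise, by contrast, the gap would be small both before and after shuffling.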

Is data with signals predictable?

The Prime Minister will say that they predicted that their tax cuts would be popular and they probably did so. My response to that would be to ask how big an improvement they predicted. While a response in the polls may have been foreseeable, specifying its magnitude is much more difficult and unlikely to be exact.

We might say that the approval data following the announcement has returned to stability. Can we not now predict the future polls? Perhaps tentatively in the short term but we know that “events” will continue to happen. Not all these will be planned by the government. Some government initiatives, triumphs and embarrassments will not register with the public. The public has other things to be interested in. Here is some UK data.


You can follow regular updates here if you are interested.

Shewhart’s ingenious chart

While Johnson and de Finetti were content with theory, Shewhart, working in the manufacture of telegraphy equipment, wanted a practical tool for his colleagues that would help them answer the question of predictability. A tool that would help users decide whether they were working with an environment sufficiently stable to be predictable. Moreover, he wanted a tool that would be easy to use by people who were short of time for analysing data and had minds occupied by the usual distractions of the work place. He didn’t want people to have to run off to a statistician whenever they were perplexed by events.

In Part 2 I shall start to discuss how to construct Shewhart’s chart. In subsequent parts, I shall show you how to use it.