Shewhart chart basics 1 – The environment sufficiently stable to be predictable

Everybody wants to be able to predict the future. Here is the forecaster’s catechism.

  • We can do no more than attach a probability to future events.
  • Where we have data from an environment that is sufficiently stable to be predictable we can project historical patterns into the future.
  • Otherwise, prediction is largely subjective;
  • … but there are tactics that can help.
  • The Shewhart chart is the tool that helps us know whether we are working with an environment that is sufficiently stable to be predictable.

Now let’s get to work.

What does a stable/predictable environment look like?

Every trial lawyer knows the importance of constructing a narrative out of evidence, an internally consistent and compelling arrangement of the facts that asserts itself above competing explanations. Time is central to how a narrative evolves. It is time that suggests causes and effects, motivations, barriers and enablers, states of knowledge, external influences, sensitisers and cofactors. That’s why exploration of data always starts with plotting it in time order. Always.

Let’s start off by looking at something we know to be predictable. Imagine a bucket of thousands of spherical beads. Of the beads, 80% are white and 20%, red. You are given a paddle that will hold 50 beads. Use the paddle to stir the beads then draw out 50 with the paddle. Count the red beads. Now you may, at this stage, object. Surely, this is just random and inherently unpredictable. But I want to persuade you that this is the most predictable data you have ever seen. Let’s look at some data from 20 sequential draws. In time order, of course, in Fig. 1.

[Fig. 1]
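If you want to try this without a bucket of beads, the experiment is easy to simulate. A minimal Python sketch, assuming a simple binomial model of the paddle (each of the 50 beads independently red with probability 0.2, a fair approximation when the bucket holds thousands of beads):

```python
import random

def draw_red_counts(n_draws=20, paddle=50, p_red=0.2, seed=42):
    """Simulate counting red beads in repeated paddle draws.

    Approximates a draw of 50 beads from a large 80/20 bucket as
    50 independent bead picks (binomial model).
    """
    rng = random.Random(seed)
    return [sum(1 for _ in range(paddle) if rng.random() < p_red)
            for _ in range(n_draws)]

counts = draw_red_counts()
print(counts)  # 20 values, mostly not far from 10
```

Run it a few times with different seeds and you will see exactly the stable irregularity described above.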

Just to look at the data from another angle, always a good idea, I have added up how many times a particular value, 9, 10, 11, … , turns up and tallied them on the right hand side. For example, here is the tally for 12 beads in Fig. 2.

[Fig. 2]

We get this in Fig. 3.

[Fig. 3]

Here are the important features of the data.

  • We can’t predict what the exact value will be on any particular draw.
  • The numbers vary irregularly from draw to draw, as far as we can see.
  • We can say that draws will vary somewhere between 2 (say) and 19 (say).
  • Most of the draws are fairly near 10.
  • Draws near 2 and 19 are much rarer.

I would be happy to predict that the 21st draw will be between 2 and 19, probably not too far from 10. I have tried to capture that in Fig. 4. There are limits to variation suggested by the experience base. As predictions go, let me promise you, that is as good as it gets.

Even statistical theory would point to an outcome not so very different from that. That theoretical support adds to my confidence.

[Fig. 4]

But there’s something else. Something profound.

A philosopher, an engineer and a statistician walk into a bar …

… and agree.

I got my last three bullet points above from just looking at the tally on the right hand side. What about the time order I was so insistent on preserving? As Daniel Kahneman put it “A random event does not … lend itself to explanation, but collections of random events do behave in a highly regular fashion.” What is this “regularity” when we can see how irregularly the draws vary? This is where time and narrative make their appearance.

If we take the draw data above, the exact same data, and “shuffle” it into a fresh order, we get this, Fig. 5.

[Fig. 5]

Now the bullet points still apply to the new arrangement. The story, the narrative, has not changed. We still see the “irregular” variation. That is its “regularity”: it tells the same story when we shuffle it. The picture and its inferences are the same. We cannot predict an exact value on any future draw yet it is all but sure to be between 2 and 19 and probably quite close to 10.
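The shuffling property is easy to demonstrate in code. In this sketch the draw values are invented for illustration; the point is that the tally is untouched by the shuffle even though the time order changes:

```python
import random
from collections import Counter

# Invented draw values for illustration.
draws = [9, 11, 10, 8, 12, 10, 13, 7, 10, 11,
         9, 10, 12, 8, 11, 10, 14, 6, 10, 9]

shuffled = list(draws)
random.Random(1).shuffle(shuffled)

# The tally -- how often each value occurs -- survives the shuffle ...
assert Counter(shuffled) == Counter(draws)
# ... even though the time-ordered sequence is now different.
print(shuffled)
```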

In 1924, British philosopher W E Johnson and US engineer Walter Shewhart, independently, realised that this was the key to describing a predictable process. It shows the same “regular irregularity”, or shall we say stable irregularity, when you shuffle it. Italian statistician Bruno de Finetti went on to derive the rigorous mathematics a few years later with his famous representation theorem. The most important theorem in the whole of statistics.

This is the exact characterisation of noise. If you shuffle it, it makes no difference to what you see or the conclusions you draw. It makes no difference to the narrative you construct (sic). Paradoxically, it is noise that is predictable.

To understand this, let’s look at some data that isn’t just noise.

Events, dear boy, events.

That was the alleged response of British Prime Minister Harold Macmillan when asked what had been the most difficult aspect of governing Britain.

Suppose our data looks like this in Fig. 6.

[Fig. 6]

Let’s make it more interesting. Suppose we are looking at the net approval rating of a politician (Fig. 7).

[Fig. 7]

What this looks like is noise plus a material step change between the 10th and 11th observations. Now, this is a surprise. The regularity, and the predictability, is broken. In fact, my first reaction is to ask: what happened? I research political events and find that, at that same time, there was an announcement of universal tax cuts (Fig. 8). This is just fiction of course. That then correlates with the shift in the data I observe. The shift is a signal, a flag from the data telling me that something happened, that the stable irregularity has become an unstable irregularity. I use the time context to identify possible explanations. I come up with the tentative idea about tax cuts as an explanation of the sudden increase in popularity.

The bullet points above no longer apply. The most important feature of the data now is the shift, I say, caused by the Prime Minister’s intervention.

[Fig. 8]
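A toy simulation makes the point. The approval figures below are invented: ten observations of noise around one level, then ten around a level roughly eight points higher:

```python
import random

# Invented net-approval figures: noise around -5, then a step up of
# roughly 8 points between the 10th and 11th observations.
rng = random.Random(7)
before = [-5 + rng.gauss(0, 2) for _ in range(10)]
after = [3 + rng.gauss(0, 2) for _ in range(10)]
series = before + after

# In time order the step is easy to see: compare the two halves.
mean_before = sum(series[:10]) / 10
mean_after = sum(series[10:]) / 10
print(round(mean_after - mean_before, 1))  # a shift of roughly 8 points
```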

What happens when I shuffle the data into a random order though (Fig. 9)?

[Fig. 9]

Now, the signal is distorted, hard to see and impossible to localise in time. I cannot tie it to a context. The message in the data is entirely different. The information in the chart is not preserved. The shuffled data does not bear the same narrative as the time ordered data. It does not tell the same story. It does not look the same. That is how I know there is a signal. The data changes its story when shuffled. The time order is crucial.

Of course, if I repeated the tally exercise that I did on Fig. 4, the tally would look the same, just as it did in the noise case in Fig. 5.

Is data with signals predictable?

The Prime Minister will say that they predicted that their tax cuts would be popular and they probably did so. My response to that would be to ask how big an improvement they predicted. While a response in the polls may have been foreseeable, specifying its magnitude is much more difficult and unlikely to be exact.

We might say that the approval data following the announcement has returned to stability. Can we not now predict the future polls? Perhaps tentatively in the short term but we know that “events” will continue to happen. Not all these will be planned by the government. Some government initiatives, triumphs and embarrassments will not register with the public. The public has other things to be interested in. Here is some UK data.

[Figure: UK opinion poll data]

You can follow regular updates here if you are interested.

Shewhart’s ingenious chart

While Johnson and de Finetti were content with theory, Shewhart, working in the manufacture of telegraphy equipment, wanted a practical tool for his colleagues that would help them answer the question of predictability. A tool that would help users decide whether they were working with an environment sufficiently stable to be predictable. Moreover, he wanted a tool that would be easy to use by people who were short of time for analysing data and had minds occupied by the usual distractions of the workplace. He didn’t want people to have to run off to a statistician whenever they were perplexed by events.

In Part 2 I shall start to discuss how to construct Shewhart’s chart. In subsequent parts, I shall show you how to use it.


UK railway suicides – 2017 update

The latest UK rail safety statistics were published on 23 November 2017, again absent much of the press fanfare we had seen in the past. Regular readers of this blog will know that I have followed the suicide data series, and the press response, closely in 2016, 2015, 2014, 2013 and 2012. Again I have re-plotted the data myself on a Shewhart chart.

[Figure: Shewhart chart of UK railway suicides]

Readers should note the following about the chart.

  • Many thanks to Tom Leveson Gower at the Office of Rail and Road who confirmed that the figures are for the year up to the end of March.
  • Some of the numbers for earlier years have been updated by the statistical authority.
  • I have recalculated natural process limits (NPLs) as there are still no more than 20 annual observations, and because the historical data has been updated. The NPLs have therefore changed but, this year, not by much.
  • Again, the pattern of signals, with respect to the NPLs, is similar to last year.

The current chart again shows two signals, an observation above the upper NPL in 2015 and a run of 8 below the centre line from 2002 to 2009. As I always remark, the Terry Weight rule says that a signal gives us licence to interpret the ups and downs on the chart. So I shall have a go at doing that.
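For anyone who wants to experiment with these two rules, here is a sketch using the common XmR-chart convention for natural process limits, centre line ± 2.66 × mean moving range. The annual counts are invented, not the published figures; with this data the first rule fires (one point beyond the upper NPL) and the run rule does not:

```python
def natural_process_limits(data):
    """XmR-style limits: centre line +/- 2.66 x mean moving range."""
    centre = sum(data) / len(data)
    moving_ranges = [abs(b - a) for a, b in zip(data, data[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return centre - 2.66 * mr_bar, centre, centre + 2.66 * mr_bar

def signals(data):
    """Flag points beyond the NPLs and runs of 8 on one side of centre."""
    lo, centre, hi = natural_process_limits(data)
    beyond = [i for i, x in enumerate(data) if x < lo or x > hi]
    side = [1 if x > centre else -1 for x in data]
    runs, run_len = [], 1
    for i in range(1, len(side)):
        run_len = run_len + 1 if side[i] == side[i - 1] else 1
        if run_len == 8:
            runs.append(i - 7)  # start index of the run of 8
    return beyond, runs

# Invented annual counts: stable variation with one high year.
data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 18, 10, 9, 11, 10, 10]
print(signals(data))  # → ([9], [])
```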

It will not escape anybody’s attention that this is now the second year in which there has been a fall in the number of fatalities.

I haven’t yet seen any real contemporaneous comment on the numbers from the press. This item appeared on the BBC, a weak performer in the field of data journalism but clearly with privileged access to the numbers, on 30 June 2017, confidently attributing the fall to past initiatives.

Sky News clearly also had advance sight of the numbers and made the bold claim that:

… for every death, six more lives were saved through interventions.

That item goes on to highlight a campaign to encourage fellow train users to engage with anybody whose behaviour attracted attention.

But what conclusions can we really draw?

In 2015 I was coming to the conclusion that the data increasingly looked like a gradual upward trend. The 2016 data offered a challenge to that but my view was still that it was too soon to say that the trend had reversed. There was nothing in the data incompatible with a continuing trend. This year, 2017, has seen 2016’s fall repeated. A welcome development but does it really show conclusively that the upward trending pattern is broken? Regular readers of this blog will know that Langian statistics like “lowest for six years” carry no probative weight here.

Signal or noise?

Has there been a change to the underlying cause system that drives the suicide numbers? Last year, I fitted a trend line through the data and asked which narrative best fitted what I observed, a continuing increasing trend or a trend that had plateaued or even reversed. You can review my analysis from last year here.

Here is the data and fitted trend updated with this year’s numbers, along with NPLs around the fitted line, the same as I did last year.

[Figure: railway suicides with fitted trend and NPLs]

Let’s think a little deeper about how to analyse the data. The first step of any statistical investigation ought to be the cause and effect diagram.

[Figure: cause and effect diagram for railway suicides]

The difficulty with the suicide data is that there is very little reproducible and verifiable knowledge as to its causes. I have seen claims, of whose provenance I am uncertain, that railway suicide is virtually unknown in the USA. There is a lot of useful thinking from common human experience and from more general theories in psychology. But the uncertainty is great. It is not possible to come up with a definitive cause and effect diagram on which all will agree, other than from the point of view of identifying candidate factors.

The earlier evidence of a trend, however, suggests that there might be some causes that are developing over time. It is not difficult to imagine that economic trends and the cumulative awareness of other fatalities might have an impact. We are talking about a number of things that might appear on the cause and effect diagram and some that do not, the “unknown unknowns”. When I identified “time” as a factor, I was taking sundry “lurking” factors and suspected causes from the cause and effect diagram that might have a secular impact. I aggregated them under the proxy factor “time” for want of a more exact analysis.

What I have tried to do is to split the data into two parts:

  • A trend (linear, simply for the sake of exploratory data analysis (EDA)); and
  • The residual variation about the trend.

The question I want to ask is whether the residual variation is stable, just plain noise, or whether there is a signal there that might give me a clue that a linear trend does not hold.
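That split is easy to sketch in code. The yearly counts below are invented; an ordinary least squares line stands in for the trend and what is left over is the residual variation:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + c."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, ybar - m * xbar

# Invented yearly counts with an upward drift plus noise.
years = list(range(2002, 2018))
counts = [192, 198, 205, 199, 211, 208, 215, 221, 218, 226,
          231, 228, 237, 244, 239, 241]

m, c = fit_line(years, counts)
residuals = [y - (m * x + c) for x, y in zip(years, counts)]

# The question is whether these residuals are just noise.
print(round(m, 2), [round(r, 1) for r in residuals[:3]])
```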

There is no signal in the detrended data, no signal that the trend has reversed. The tough truth of the data is that it supports either narrative.

  • The upward trend is continuing and is stable. There has been no reversal of trend yet.
  • The data is not stable. True there is evidence of an upward trend in the past but there is now evidence that deaths are decreasing.

Of course, there is no particular reason, absent the data, to believe in an increasing trend and the initiative to mitigate the situation might well be expected to result in an improvement.

Sometimes, with data, we have to be honest and say that we do not have the conclusive answer. That is the case here. All that can be done is to continue the existing initiatives and look to the future. Nobody ever likes that as a conclusion but it is no good pretending things are unambiguous when that is not the case.

Next steps

Previously I noted proposals to repeat a strategy from Japan of bathing railway platforms with blue light. In the UK, I understand that such lights were installed at Gatwick in summer 2014. In fact my wife and I were on the platform at Gatwick just this week and I had the opportunity to observe them. I also noted, on my way back from court the other day, blue strip lights along the platform edge at East Croydon. I think they are recently installed. However, I have not seen any data or heard of any analysis.

A huge amount of sincere endeavour has gone into this issue but further efforts have to be against the background that there is still no conclusive evidence of improvement.

Suggestions for alternative analyses are always welcomed here.

Regression done right: Part 3: Forecasts to believe in

There are three Sources of Uncertainty in a forecast.

  1. Whether the forecast is of “an environment that is sufficiently regular to be predictable”.1
  2. Uncertainty arising from the unexplained (residual) system variation.
  3. Technical statistical sampling error in the regression calculation.

Source of Uncertainty (3) is the one that fascinates statistical theorists. Sources (1) and (2) are the ones that obsess the rest of us. I looked at the first in Part 1 of this blog and the second in Part 2. Now I want to look at the third Source of Uncertainty and try to put everything together.

If you are really most interested in (1) and (2), read “Prediction intervals” then skip forwards to “The fundamental theorem of prediction”.

Prediction intervals

A prediction interval2 captures the range in which a future observation is expected to fall. Bafflingly, not all statistical software generates prediction intervals automatically so it is necessary, I fear, to know how to calculate them from first principles. However, understanding the calculation is, in itself, instructive.

But I emphasise that prediction intervals rely on a presumption that what is being forecast is “an environment that is sufficiently regular to be predictable”, that the (residual) business process data is exchangeable. If that presumption fails then all bets are off and we have to rely on a Cardinal Newman analysis. Of course, when I say that “all bets are off”, they aren’t. You will still be held to your existing contractual commitments even though your confidence in achieving them is now devastated. More on that another time.

Sources of variation in predictions

In the particular case of linear regression we need further to break down the third Source of Uncertainty.

  2. Uncertainty arising from the unexplained (residual) variation.
  3. Technical statistical sampling error in the regression calculation.
    3A. Sampling error of the mean.
    3B. Sampling error of the slope.

Remember that we are, for the time being, assuming Source of Uncertainty (1) above can be disregarded. Let’s look at the other Sources of Uncertainty in turn: (2), (3A) and (3B).

Source of Variation (2) – Residual variation

We start with the Source of Uncertainty arising from the residual variation. This is the uncertainty because of all the things we don’t know. We talked about this a lot in Part 2. We are content, for the moment, that they are sufficiently stable to form a basis for prediction. We call this common cause variation. This variation has variance s², where s is the residual standard deviation that will be output by your regression software.

[Figure: residual variation about the regression line]

Source of Variation (3A) – Sampling error in mean

To understand the next Source of Variation we need to know a little about how the regression is calculated. The calculations start off with the respective means of the X values (X̄) and of the Y values (Ȳ). Uncertainty in estimating the mean of the Ys is the next contribution to the global prediction uncertainty.

An important part of calculating the regression line is to calculate the mean of the Ys. That mean is subject to sampling error. The variance of the sampling error is the familiar result from the statistics service course.

Var(Ȳ) = s²/n

— where n is the number of pairs of X and Y. Obviously, as we collect more and more data this term gets more and more negligible.

[Figure: sampling error in the mean]
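To see just how quickly this term becomes negligible, take an illustrative residual variance of s² = 4:

```python
# The sampling variance of the mean is s^2 / n: it shrinks
# in direct proportion to the amount of data collected.
s2 = 4.0  # illustrative residual variance
for n in (10, 100, 1000):
    print(n, s2 / n)  # → 0.4, then 0.04, then 0.004
```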

Source of Variation (3B) – Sampling error in slope

This is a bit more complicated. Skip forwards if you are already confused. Let me first give you the equation for the variance of predictions referable to sampling error in the slope.

s²(X − X̄)² / SXX

This has now introduced the mysterious sum of squares, SXX. However, before we learn exactly what this is, we immediately notice two things.

  1. As we move away from the centre of the training data the variance gets larger.3
  2. As SXX gets larger the variance gets smaller.

The reason for the increasing sampling error as we move from the mean of X is obvious from thinking about how variation in slope works. The regression line pivots on the mean. Travelling further from the mean amplifies any disturbance in the slope.

[Figure: sampling error in the slope]

Let’s look at where SXX comes from. The sum of squares is calculated from the Xs alone without considering the Ys. It is a characteristic of the sampling frame that we used to train the model. We take the difference of each X value from the mean of X, and then square that distance. To get the sum of squares we then add up all those individual squares. Note that this is a sum of the individual squares, not their average.

[Table: calculation of SXX]
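As a concrete sketch, with made-up X values:

```python
# SXX: the sum (not the average) of squared distances from the mean of X.
xs = [2.0, 4.0, 5.0, 7.0, 12.0]
xbar = sum(xs) / len(xs)
sxx = sum((x - xbar) ** 2 for x in xs)
print(xbar, sxx)  # → 6.0 58.0
```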

Two things then become obvious (if you think about it).

  1. As we get more and more data, SXX gets larger.
  2. As the individual Xs spread out over a greater range of X, SXX gets larger.

What that (3B) term does emphasise is that even sampling error escalates as we exploit the edge of the original training data. As we extrapolate clear of the original sampling frame, the pure sampling error can quickly exceed even the residual variation.

Yet it is only a lower bound on the uncertainty in extrapolation. As we move away from the original range of Xs then, however happy we were previously with Source of Uncertainty (1), that the data was from “an environment that is sufficiently regular to be predictable”, then the question barges back in. We are now remote from our experience base in time and boundary. Nothing outside the original X-range will ever be a candidate for a comfort zone.

The fundamental theorem of prediction

Variances, generally, add up so we can sum the three Sources of Variation (2), (3A) and (3B). That gives the variance of an individual prediction, spred². By an individual prediction I mean that somebody gives me an X and I use the regression formula to give them the (as yet unknown) corresponding Ypred.

spred² = s² + s²/n + s²(X − X̄)²/SXX

It is immediately obvious that s² is common to all three terms. However, the second and third terms, the sampling errors, can be made as small as we like by collecting more and more data. Collecting more and more data will have no impact on the first term. That arises from the residual variation. The stuff we don’t yet understand. It has variance s², where s is the residual standard deviation that will be output by your regression software.

This, I say, is the fundamental theorem of prediction. The unexplained variation provides a hard limit on the precision of forecasts.

It is then a very simple step to convert the variance into a standard deviation, spred. This is the standard error of the prediction.4,5

spred = s √(1 + 1/n + (X − X̄)²/SXX)

Now, in general, where we have a measurement or prediction z whose uncertainty can be characterised by a standard error u, there is an old trick for putting an interval round it. Since u is a measure of the variation in z, we can put an interval around z as a number of standard errors, z ± ku. Here, k is a constant of your choice. A prediction interval for the regression that generates prediction Ypred then becomes:

Ypred ± k spred

Choosing k=3 is very popular, conservative and robust.6,7 Other choices of k are available on the advice of a specialist mathematician.
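Pulling Sources (2), (3A) and (3B) together, here is a first-principles sketch of a k = 3 prediction interval. The training data is invented and `x_new` is simply the X at which a prediction is wanted:

```python
import math

def prediction_interval(xs, ys, x_new, k=3.0):
    """Ypred +/- k * s_pred, where
    s_pred^2 = s^2 * (1 + 1/n + (x_new - xbar)^2 / SXX)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    m = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    c = ybar - m * xbar
    # Residual standard deviation, with n - 2 degrees of freedom.
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(ss_res / (n - 2))
    y_pred = m * x_new + c
    s_pred = s * math.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / sxx)
    return y_pred - k * s_pred, y_pred, y_pred + k * s_pred

# Invented training data, roughly y = 2x plus noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 4.2, 5.9, 8.1, 9.8, 12.2, 14.0, 16.1]
print(prediction_interval(xs, ys, 9))
```

Note that the interval for x_new = 9, outside the training range, is wider than one near the centre of the Xs, exactly as the (3B) term demands.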

It was Shewhart himself who took this all a bit further and defined tolerance intervals which contain a given proportion of future observations with a given probability.8 They are very much for the specialist.

Source of Variation (1) – Special causes

But all that assumes that we are sampling from “an environment that is sufficiently regular to be predictable”, that the residual variation is solely common cause. We checked that out on our original training data but the price of predictability is eternal vigilance. It can never be taken for granted. At any time fresh causes of variation may infiltrate the environment, or become newly salient because of some sensitising event or exotic interaction.

The real trouble with this world of ours is not that it is an unreasonable world, nor even that it is a reasonable one. The commonest kind of trouble is that it is nearly reasonable, but not quite. Life is not an illogicality; yet it is a trap for logicians. It looks just a little more mathematical and regular than it is; its exactitude is obvious, but its inexactitude is hidden; its wildness lies in wait.

G K Chesterton

The remedy for this risk is to continue plotting the residuals, the differences between the observed value and, now, the prediction. This is mandatory.

[Figure: Shewhart chart of regression residuals]

Whenever we observe a signal of a potential special cause it puts us on notice to protect the forecast-user because our ability to predict the future has been exposed as deficient and fallible. But it also presents an opportunity. With timely investigation, a signal of a possible special cause may provide deeper insight into the variation of the cause-system. That in itself may lead to identifying further factors to build into the regression and a consequential reduction in s².

It is reducing s², by progressively accumulating understanding of the cause-system and developing the model, that leads to more precise, and more reliable, predictions.

Notes

  1. Kahneman, D (2011) Thinking, Fast and Slow, Allen Lane, p240
  2. Hahn, G J & Meeker, W Q (1991) Statistical Intervals: A Guide for Practitioners, Wiley, p31
  3. In fact s²/SXX is the sampling variance of the slope. The standard error of the slope is, notoriously, s/√SXX. A useful result sometimes. It is then obvious from the figure how variation in slope is amplified as we travel farther from the centre of the Xs.
  4. Draper, N R & Smith, H (1998) Applied Regression Analysis, 3rd ed., Wiley, pp81-83
  5. Hahn & Meeker (1991) p232
  6. Wheeler, D J (2000) Normality and the Process Behaviour Chart, SPC Press, Chapter 6
  7. Vysochanskij, D F & Petunin, Y I (1980) “Justification of the 3σ rule for unimodal distributions”, Theory of Probability and Mathematical Statistics 21: 25–36
  8. Hahn & Meeker (1991) p231

Regression done right: Part 1: Can I predict the future?

I recently saw an article in the Harvard Business Review called “Refresher on Regression Analysis”. I thought it was horrible so I wanted to set the record straight.

Linear regression from the viewpoint of machine learning

Linear regression is important, not only because it is a useful tool in itself, but because it is (almost) the simplest statistical model. The issues that arise in a relatively straightforward form are issues that beset the whole of statistical modelling and predictive analytics. Anyone who understands linear regression properly is able to ask probing questions about more complicated models. The complex internal algorithms of Kalman filters, ARIMA processes and artificial neural networks are accessible only to the specialist mathematician. However, each has several general features in common with simple linear regression. A thorough understanding of linear regression enables a due diligence of the claims made by the machine learning advocate. Linear regression is the paradigmatic exemplar of machine learning.

There are two principal questions that I want to talk about that are the big takeaways of linear regression. They are always the first two questions to ask in looking at any statistical modelling or machine learning scenario.

  1. What predictions can I make (if any)?
  2. Is it worth the trouble?

I am going to start looking at (1) in this blog and complete it in a future Part 2. I will then look at (2) in a further Part 3.

Variation, variation, variation

Variation is a major problem for business, the tendency of key measures to fluctuate irregularly. Variation leads to uncertainty. Will the next report be high or low? Or in the middle? Because of the uncertainty we have to allow safety margins or swallow some non-conformances. We have good days and bad days, good products and not so good. We have to carry costly working capital because of variation in cash flow. And so on.

We learned in our high school statistics class to characterise variation in a key process measure, call it the Big Y, by an histogram of observations. Perhaps we are bothered by the fluctuating level of monthly sales.

[Figure: histogram of monthly sales]

The variation arises from a whole ecology of competing and interacting effects and factors that we call the cause-system of the outcome. In general, it is very difficult to single out individual factors as having been the cause of a particular observation, so entangled are they. It is still useful to capture them for reference on a cause and effect diagram.

[Figure: cause and effect diagram]

One of the strengths of the cause and effect diagram is that it may prompt the thought that one of the factors is particularly important, call it Big X, perhaps it is “hours of TV advertising” (my age is showing). Motivated by that we can generate a sample of corresponding measurements of both Y and X and plot them on a scatter plot.

[Figure: scatter plot of Y against X]

Well what else is there to say? The scatter plot shows us all the information in the sample. Scatter plots are an important part of what statistician John Tukey called Exploratory Data Analysis (EDA). We have some hunches and ideas, or perhaps hardly any idea at all, and we attack the problem by plotting the data in any way we can think of. So much easier now than when W Edwards Deming wrote:1

[Statistical practice] means tedious work, such as studying the data in various forms, making tables and charts and re-making them, trying to use and preserve the evidence in the results and to be clear enough to the reader: to endure disappointment and discouragement.

Or as Chicago economist Ronald Coase put it.

If you torture the data enough, nature will always confess.

The scatter plot is a fearsome instrument of data torture. It tells me everything. It might even tempt me to think that I have a basis on which to make predictions.

Prediction

In machine learning terms, we can think of the sample used for the scatter plot as a training set of data. It can be used to set up, “train”, a numerical model that we will then fix and use to predict future outcomes. The scatter plot strongly suggests that if we know a future X alone we can have a go at predicting the corresponding future Y. To see that more clearly we can draw a straight line by hand on the scatter plot, just as we did in high school before anybody suggested anything more sophisticated.

[Figure: scatter plot with straight line drawn in]

Given any particular X we can read off the corresponding Y.

[Figure: reading off Y for a given X]

The immediate insight that comes from drawing in the line is that not all the observations lie on the line. There is variation about the line so that there is actually a range of values of Y that seem plausible and consistent for any specified X. More on that in Parts 2 and 3.

In understanding machine learning it makes sense to start by thinking about human learning. Psychologists Gary Klein and Daniel Kahneman investigated how firefighters were able to perform so successfully in assessing a fire scene and making rapid, safety critical decisions. Lives of the public and of other firefighters were at stake. This is the sort of human learning situation that machines, or rather their expert engineers, aspire to emulate. Together, Klein and Kahneman set out to describe how the brain could build up reliable memories that would be activated in the future, even in the agony of the moment. They came to the conclusion that there are two fundamental conditions for a human to acquire a skill.2

  • An environment that is sufficiently regular to be predictable.
  • An opportunity to learn these regularities through prolonged practice.

The first bullet point is pretty much the most important idea in the whole of statistics. Before we can make any prediction from the regression, we have to be confident that the data has been sampled from “an environment that is sufficiently regular to be predictable”. The regression “learns” from those regularities, where they exist. The “learning” turns out to be the rather prosaic mechanics of matrix algebra as set out in all the standard texts.3 But that, after all, is what all machine “learning” is really about.

Statisticians capture the psychologists’ “sufficiently regular” through the mathematical concept of exchangeability. If a process is exchangeable then we can assume that the distribution of events in the future will be like the past. We can project our historic histogram forward. With regression we can do better than that.

Residuals analysis

Formally, the linear regression calculations calculate the characteristics of the model:

Y = mX + c + “stuff”

The “mX+c” bit is the familiar high school mathematics equation for a straight line. The “stuff” is variation about the straight line. What the linear regression mathematics does is (objectively) to calculate the m and c and then also tell us something about the “stuff”. It splits the variation in Y into two components:

  • What can be explained by the variation in X; and
  • The, as yet unexplained, variation in the “stuff”.

The first thing to learn about regression is that it is the “stuff” that is the interesting bit. In 1849 British astronomer Sir John Herschel observed that:

Almost all the greatest discoveries in astronomy have resulted from the consideration of what we have elsewhere termed RESIDUAL PHENOMENA, of a quantitative or numerical kind, that is to say, of such portions of the numerical or quantitative results of observation as remain outstanding and unaccounted for after subducting and allowing for all that would result from the strict application of known principles.

The straight line represents what we guessed about the causes of variation in Y and which the scatter plot confirmed. The “stuff” represents the causes of variation that we failed to identify and that continue to limit our ability to predict and manage. We call the predicted Ys that correspond to the measured Xs, and lie on the fitted straight line, the fits.

fiti = mXi + c

The residual values, or residuals, are obtained by subtracting the fits from the respective observed Y values. The residuals represent the “stuff”. Statistical software does this for us routinely. If yours doesn’t then bin it.

residuali = Yi – fiti

RegressionScatter4
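The mechanics are easy to sketch. Here is a minimal Python illustration on simulated data (the data, seed and noise level are my own invention, not from the text): ordinary least squares estimates m and c, computes the fits and residuals, and confirms that the variation in Y splits cleanly into the part explained by X and the "stuff".

```python
import numpy as np

np.random.seed(0)

# Illustrative data: Y depends linearly on X plus "stuff" (noise).
X = np.arange(1.0, 21.0)
Y = 2.0 * X + 5.0 + np.random.normal(0.0, 1.0, size=X.size)

# Ordinary least squares estimates of m and c.
m, c = np.polyfit(X, Y, 1)

fits = m * X + c          # fit_i = m * X_i + c
residuals = Y - fits      # residual_i = Y_i - fit_i

# The variation in Y splits into what X explains and the "stuff".
explained = np.sum((fits - Y.mean()) ** 2)
unexplained = np.sum(residuals ** 2)
total = np.sum((Y - Y.mean()) ** 2)

print("m, c:", m, c)
print("explained + unexplained:", explained + unexplained)
print("total variation in Y:   ", total)
```

For least squares with an intercept, the explained and unexplained sums of squares add exactly to the total, which is the decomposition described above.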

There are a number of properties that the residuals need to satisfy for the regression to work. Investigating those properties is called residuals analysis.4 As far as use for prediction is concerned, it is sufficient that the “stuff”, the variation about the straight line, be exchangeable.5 That means that the “stuff” so far must appear from the data to be exchangeable and, further, that we have a rational belief that such a cause system will continue unchanged into the future. Shewhart charts are the best heuristics for checking the requirement for exchangeability, certainly as far as the historical data is concerned. Our first and, be under no illusion, mandatory check on the ability of the linear regression, or any statistical model, to make predictions is to plot the residuals against time on a Shewhart chart.

RegressionPBC

If there are any signals of special causes then the model cannot be used for prediction. It just can’t. For prediction we need residuals that are all noise and no signal. However, like all signals of special causes, such will provide an opportunity to explore and understand more about the cause system. The signal that prevents us from using this regression for prediction may be the very thing that enables an investigation leading to a superior model, able to predict more exactly than we ever hoped the failed model could. And even if there is sufficient evidence of exchangeability from the training data, we still need to continue vigilance and scrutiny of all future residuals to look out for any novel signals of special causes. Special causes that arise post-training provide fresh information about the cause system while at the same time compromising the reliability of the predictions.
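As a sketch of that mandatory check, the fragment below puts a residual series on an individuals (XmR) chart, using the usual 2.66 moving-range constant for the natural process limits. The residuals here are simulated noise purely for illustration; in practice they would come from your fitted model, in time order.

```python
import numpy as np

np.random.seed(42)

# Stand-in for the residuals of a fitted regression, in time order.
residuals = np.random.normal(0.0, 1.0, size=24)

# Individuals (XmR) chart: centre line plus/minus 2.66 times the
# mean moving range gives the natural process limits.
moving_ranges = np.abs(np.diff(residuals))
centre = residuals.mean()
ucl = centre + 2.66 * moving_ranges.mean()
lcl = centre - 2.66 * moving_ranges.mean()

# Points beyond the limits signal special causes: no prediction allowed.
signals = [(i, r) for i, r in enumerate(residuals) if r > ucl or r < lcl]
print("limits:", lcl, ucl)
print("signals of special causes:", signals)
```

This checks only the simplest detection rule (a point beyond the limits); run-based rules add further sensitivity but the principle is the same.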

Thorough regression diagnostics will also be able to identify issues such as serial correlation, lack of fit, leverage and heteroscedasticity. Residuals analysis is essential to regression and its omission is intolerable. It is one of Stephen Stigler’s Seven Pillars of Statistical Wisdom.6 As Tukey said:

The greatest value of a picture is when it forces us to notice what we never expected to see.

To come:

Part 2: Is my regression significant? … is a dumb question.
Part 3: Quantifying predictions with statistical intervals.

References

  1. Deming, W E (1975) “On probability as a basis for action”, The American Statistician 29(4) pp146-152
  2. Kahneman, D (2011) Thinking, Fast and Slow, Allen Lane, p240
  3. Draper, N R & Smith, H (1998) Applied Regression Analysis, 3rd ed., Wiley, p44
  4. Draper & Smith (1998) Chs 2, 8
  5. I have to admit that weaker conditions may be adequate in some cases but these are far beyond any other than a specialist mathematician.
  6. Stigler, S M (2016) The Seven Pillars of Statistical Wisdom, Harvard University Press, Chapter 7

Amazon II: The sales story

I recently commented on an item in the New York Times about Amazon’s pursuit of “rigorous data driven management”. Dina Vaccari, one of the employees cited in the original New York Times article, has taken the opportunity to tell her own story in this piece. I found it enlightening as to what goes on at Amazon. Of course, it is only another anecdote from a former employee, a data source of notoriously limited quality. However, as Arthur Koestler once observed:

Without the hard little bits of marble which are called ‘facts’ or ‘data’ one cannot compose a mosaic; what matters, however, are not so much the individual bits, but the successive patterns into which you arrange them, then break them up and rearrange them.

Vaccari’s role was to sell Amazon gift cards. The measure of her success was how many she sold. Vaccari had read Timothy Ferriss’ transgressive little book The 4-Hour Workweek. She decided to employ a subcontractor from Chennai, India, to generate 100 leads for her daily, for $10. The idea worked out well. Another illustration of the law of comparative advantage.

Vaccari then emailed the leads, not with the standard email that she had been instructed to use by Amazon, but with a formula of her own. Vaccari claims a 10 to 50% response rate. She then followed up using her traditional sales skills, exceeding her sales target and besting the rest of the sales team.

That drew attention from her supervisor. Not unnaturally, he wanted to capture good practice. When he saw Vaccari’s non-standard email he was critical. We now know that process discipline is important at Amazon. Nothing wrong with that, though if you really want to exercise your mind on the topic you would do well to watch the Hollywood movie Crimson Tide.

What is more interesting is that, when Vaccari answered the criticism by pointing to her response and sales figures, the supervisor retorted that this was “just luck”.

So there we have it. Somebody made a change and the organisation couldn’t agree whether or not it was an improvement. Vaccari said she saw a signal. Her supervisor said that it was just noise.

The supervisor’s response was particularly odd as he was shadowing Vaccari because of his favourable perception of her performance. It is as though his assessment as to whether Vaccari’s results were signal or noise depended on his approval or disapproval of how she had achieved them. It certainly seems that this is not normative behaviour at Amazon. Vaccari criticises her supervisor for failing to display Amazon Leadership Principles. The exchange illustrates what happens if an organisation generates data but is then unable to turn it into a reliable basis for action because there is no systematic and transparent method for creating a consensus around what is signal and what, noise. Vaccari’s exchange with her supervisor is reassuring in that both recognised that there is an important distinction. Vaccari knew that a signal should be a tocsin for action, in this case to embed a successful innovation through company-wide standardisation. Her supervisor knew that to mistake noise for a signal would lead to degraded process performance. Or at least he hid behind that to project his disapproval. Vaccari’s recall of the incident makes her “cringe”. Numbers aren’t just about numbers.

Trenchant data criticism, motivated by the rigorous segregation of signal and noise, is the catalyst of continual improvement in sales, product quality, economic efficiency and market agility.

The goal is not to be driven by data but to be led by the insights it yields.

Data science sold down the Amazon? Jeff Bezos and the culture of rigour

This blog appeared on the Royal Statistical Society website Statslife on 25 August 2015

This recent item in the New York Times has catalysed discussion among managers. The article tells of Amazon’s founder, Jeff Bezos, and his pursuit of rigorous data driven management. It also tells employees’ own negative stories of how that felt emotionally.

The New York Times says that Amazon is pervaded with abundant data streams that are used to judge individual human performance and which drive reward and advancement. They inform termination decisions too.

The recollections of former employees are not the best source of evidence about how a company conducts its business. Amazon’s share of the retail market is impressive and they must be doing something right. What everybody else wants to know is, what is it? Amazon are very coy about how they operate and there is a danger that the business world at large takes the wrong messages.

Targets

Targets are essential to business. The marketing director predicts that his new advertising campaign will create demand for 12,000 units next year. The operations director looks at her historical production data. She concludes that the process lacks the capability reliably to produce those volumes. She estimates the budget required to upgrade the process and to achieve 12,000 units annually. The executive board considers the business case and signs off the investment. Both marketing and operations directors now have a target.

Targets communicate improvement priorities. They build confidence between interfacing processes. They provide constraints and parameters that prevent the system causing harm. Harm to others or harm to itself. They allow the pace and substance of multiple business processes, and diverse entities, to be matched and aligned.

But everyone who has worked in business sees it as less simple than that. The marketing and operations directors are people.

Signal and noise

Drawing conclusions from data might be an uncontroversial matter were it not for the most common feature of data, fluctuation. Call it variation if you prefer. Business measures do not stand still. Every month, week, day and hour is different. All data features noise. Sometimes it goes up, sometimes down. A whole ecology of occult causes, weakly characterised, unknown and as yet unsuspected, interact to cause irregular variation. They are what cause a coin variously to fall “heads” or “tails”. That variation may often be stable enough, or if you like “exchangeable”, so as to allow statistical predictions to be made, as in the case of the coin toss.

If all data features noise then some data features signals. A signal is a sign, an indicator that some palpable cause has made the data stand out from the background noise. It is that assignable cause which enables inferences to be drawn about what interventions in the business process have had a tangible effect and what future innovations might cement any gains or lead to bigger prospective wins. Signal and noise lead to wholly different business strategies.

The relevance for business is that people, where not exposed to rigorous decision support, are really bad at telling the difference between signal and noise. Nobel laureate economist and psychologist Daniel Kahneman has amassed a lifetime of experimental and anecdotal data capturing noise misinterpreted as signal and judgments in the face of compelling data, distorted by emotional and contextual distractions.

Signal and accountability

It is a familiar trope of business, and government, that extravagant promises are made, impressive business cases set out and targets signed off. Yet the ultimate scrutiny as to whether that envisaged performance was realised often lacks rigour. Noise, with its irregular ups and downs, allows those seeking solace from failure to pick out select data points and cast self-serving narratives on the evidence.

Our hypothetical marketing director may fail to achieve his target but recount how there were two individual months where sales exceeded 1,000, construct elaborate rationales as to why only they are representative of his efforts and point to purported external factors that frustrated the remaining ten reports. Pairs of individual data points can always be selected to support any story: Don Wheeler’s classic executive time series.

This is where the ability to distinguish signal and noise is critical. To establish whether targets have been achieved requires crisp definition of business measures, not only outcomes but also the leading indicators that provide context and advise judgment as to prediction reliability. Distinguishing signal and noise requires transparent reporting that allows diverse streams of data criticism. It requires a rigorous approach to characterising noise and a systematic approach not only to identifying signals but to reacting to them in an agile and sustainable manner.

Data is essential to celebrating a target successfully achieved and to responding constructively to a failure. But where noise is gifted the status of signal to confirm a fanciful business case, or to protect a heavily invested reputation, then the business is misled, costs increased, profits foregone and investors cheated.

Where employees believe that success and reward are being fudged, whether because of wishful thinking, lack of data skills or lack of transparency, then cynicism and demotivation will breed virulently. Employees watch the behaviours of their seniors carefully as models of what will lead to their own advancement. Where it is deceit or innumeracy that succeed, that is what will thrive.

Noise and blame

Here is some data of the number of defects caused by production workers last month.

Worker     Defects
Al              10
Simone           6
Jose            10
Gabriela        16
Stan            10

What is to be done about Gabriela? Move to an easier job? Perhaps retraining? Or should she be let go? And Simone? Promote to supervisor?

Well, the numbers were just random numbers that I generated. I didn’t add anything in to make Gabriela’s score higher and there was nothing in the way that I generated the data to suggest who would come top or bottom. The data are simply noise. They are the sort of thing that you might observe in a manufacturing plant that presented a “stable system of trouble”. Nothing in the data signals any behaviour, attitude, skill or diligence that Gabriela lacked or wrongly exercised. The next month’s data would likely show a different candidate for dismissal.

Mistaking signal for noise is, like mistaking noise for signal, the path to business underperformance and employee disillusionment. It has a particularly corrosive effect where used, as it might be in Gabriela’s case, to justify termination. The remaining staff will be bemused as to what Gabriela was actually doing wrong and start to attach myriad and irrational doubts to all sorts of things in the business. There may be a resort to magical thinking. The survivors will be less open and less willing to share problems with their supervisors. The business itself has the costs of recruitment to replace Gabriela. The saddest aspect of the whole business is the likelihood that Gabriela’s replacement will perform better than did Gabriela, vindicating the dismissal in the mind of her supervisor. This is the familiar statistical artefact of regression to the mean. An extreme event is likely to be followed by one less extreme. Again, Kahneman has collected sundry examples of managers so deceived by singular human performance and disappointed by its modest follow-up.
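Regression to the mean is just as easy to demonstrate. In the sketch below (again my own simulated data), the workers with the worst first-month counts are, on average, much closer to the mean in the second month, although nothing about them, or the process, changed.

```python
import random

random.seed(5)

# Two months of defect counts for many workers, all drawn from the
# same stable process: differences are noise, not skill.
n_workers = 1000
month1 = [sum(random.random() < 0.1 for _ in range(100)) for _ in range(n_workers)]
month2 = [sum(random.random() < 0.1 for _ in range(100)) for _ in range(n_workers)]

# Single out the worst decile of month 1 and see how those same
# workers fare in month 2.
threshold = sorted(month1)[int(0.9 * n_workers)]
worst = [i for i, d in enumerate(month1) if d >= threshold]

avg_worst_m1 = sum(month1[i] for i in worst) / len(worst)
avg_worst_m2 = sum(month2[i] for i in worst) / len(worst)
print("worst decile, month 1 average:", avg_worst_m1)
print("same workers, month 2 average:", avg_worst_m2)
```

The second-month average falls back toward the process mean of 10, which is exactly what would "vindicate" the dismissal of Gabriela in the mind of her supervisor.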

It was W Edwards Deming who observed that every time you recruit a new employee you take a random sample from the pool of job seekers. That’s why you get the regression to the mean. It must be true at Amazon too as their human resources executive Mr Tony Galbato explains their termination statistics by admitting that “We don’t always get it right.” Of course, everybody thinks that their recruitment procedures are better than average. That’s a management claim that could well do with rigorous testing by data.

Further, mistaking noise for signal brings the additional business expense of over adjustment, spending money to add costly variation while degrading customer satisfaction. Nobody in the business feels good about that.

Target quality, data quality

I admitted above that the evidence we have about Amazon’s operations is not of the highest quality. I’m not in a position to judge what goes on at Amazon. But all should fix in their minds that setting targets demands rigorous risk assessment, analysis of perverse incentives and intense customer focus.

It is a sad reality that, if you set incentives perversely enough, some individuals will find ways of misreporting data. BNFL’s embarrassment with Kansai Electric and Steven Eaton’s criminal conviction were not isolated incidents.

One thing that especially bothered me about the Amazon report was the soi-disant Anytime Feedback Tool that allowed unsolicited anonymous peer appraisal. Apparently, this formed part of the “data” that determined individual advancement or termination. The description was unchallenged by Amazon’s spokesman (sic) Mr Craig Berman. I’m afraid, and I say this as a practising lawyer, unsourced and unchallenged “evidence” carries the spoor of the Star Chamber and the party purge. I would have thought that a pretty reliable method for generating unreliable data would be to maximise the personal incentives for distortion while protecting it from scrutiny or governance.

Kahneman observed that:

… we pay more attention to the content of messages than to information about their reliability, and as a result end up with a view of the world around us that is simpler and more coherent than the data justify.

It is the perverse confluence of fluctuations and individual psychology that makes statistical science essential, data analytics interesting and business, law and government difficult.

#executivetimeseries

ExecTS1Oxford

Don Wheeler coined the term executive time series. I was just leaving court in Oxford the other day when I saw this announcement on a hoarding. I immediately thought to myself “#executivetimeseries”.

Wheeler introduced the phrase in his 2000 book Understanding Variation: The Key to Managing Chaos. He meant to criticise the habitual way that statistics are presented in business and government. A comparison is made between performance at two instants in time. Grave significance is attached as to whether performance is better or worse at the second instant. Well, it was always unlikely that it would be the same.

The executive time series has the following characteristics.

  • It is applied to some statistic, metric, Key Performance Indicator (KPI) or other measure that will be perceived as important by its audience.
  • Two time instants are chosen.
  • The statistic is quoted at each of the two instants.
  • If the second is greater than the first then an increase is inferred. A decrease is inferred from the converse.
  • Great significance is attached to the increase or decrease.
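The construction is mechanical enough to automate. The sketch below (simulated data, my own seed) generates a perfectly stable series and shows that the same unchanged process will “support” both a dramatic increase and a dramatic decrease, depending solely on which pair of weeks is selected.

```python
import random

random.seed(3)

# A stable process: 20 weekly red-bead counts from an unchanging
# cause system (50 draws per week, 20% red).
series = [sum(random.random() < 0.2 for _ in range(50)) for _ in range(20)]

def executive_time_series(data, week_a, week_b):
    """Percentage change between two hand-picked instants (1-based weeks)."""
    a, b = data[week_a - 1], data[week_b - 1]
    return 100.0 * (b - a) / a

# Cherry-pick the weeks with the lowest and highest counts.
low = min(range(1, 21), key=lambda w: series[w - 1])
high = max(range(1, 21), key=lambda w: series[w - 1])

# The same unchanged process "proves" both a rise and a fall.
print("claimed increase (%):", executive_time_series(series, low, high))
print("claimed decrease (%):", executive_time_series(series, high, low))
```

Neither claim says anything about the process; both are artefacts of the selection.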

Why is this bad?

At its best it provides incomplete information devoid of context. At its worst it is subject to gross manipulation. The following problems arise.

  • Though a signal is usually suggested there is inadequate information to infer this.
  • There is seldom any explanation of how the two time points were chosen; the choice is open to manipulation.
  • Data is presented absent its context.
  • There is no basis for predicting the future.

The Oxford billboard is even worse than the usual example because it doesn’t even attempt to tell us over what period the carbon reduction is being claimed.

Signal and noise

Let’s first think about noise. As Daniel Kahneman put it “A random event does not … lend itself to explanation, but collections of random events do behave in a highly regular fashion.” Noise is a collection of random events. Some people also call it common cause variation.

Imagine a bucket of thousands of beads. Of the beads, 80% are white and 20%, red. You are given a paddle that will hold 50 beads. Use the paddle to stir the beads then draw out 50 with the paddle. Count the red beads. Repeat this, let us say once a week, until you have 20 counts. The data might look something like this.

RedBeads1

What we observe in Figure 1 is the irregular variation in the number of red beads. However, it is not totally unpredictable. In fact, it may be one of the most predictable things you have ever seen. Though we cannot forecast exactly how many red beads we will see in the coming week, it will most likely be in the rough range of 4 to 14, with rather more counts around 10 than at the extremities. The odd one below 4 or above 14 would not surprise you, I think.

But nothing changed in the characteristics of the underlying process. It didn’t get better or worse. The percentage of reds in the bucket was constant. It is a stable system of trouble. And yet measured variation extended between 4 and 14 red beads. That is why an executive time series is so dangerous. It alleges change while the underlying cause-system is constant.

Figure 2 shows how an executive time series could be constructed in week 3.

RedBeads2

The number of beads has increased from 4 to 10, a 150% increase. Surely a “significant result”. And it will always be possible to find some managerial initiative between weeks 2 and 3 that can be invoked as the cause. “Between weeks 2 and 3 we changed the angle of inserting the paddle and it has increased the number of red beads by 150%.”

But Figure 2 is not the only executive time series that the data will support. In Figure 3 the manager can claim a 57% reduction from 14 to 6. More than the Oxford banner. Again, it will always be possible to find some factor or incident supposed to have caused the reduction. But nothing really changed.

RedBeads3

The executive can be even more ambitious. “Between week 2 and 17 we achieved a 250% increase in red beads.” Now that cannot be dismissed as a mere statistical blip.

RedBeads4

#executivetimeseries

Data has no meaning apart from its context.

Walter Shewhart

Not everyone who cites an executive time series is seeking to deceive. But many are. So anybody who relies on an executive time series, devoid of context, invites suspicion that they are manipulating the message. This is Langian statistics par excellence: the fallacy of What you see is all there is. It is essential to treat all such claims with the utmost caution. What properly communicates the present reality of some measure is a plot against time that exposes its variation, its stability (or otherwise) and sets it in the time context of surrounding events.

We should call out the perpetrators. #executivetimeseries

Techie note

The data here is generated from a sequence of 20 Bernoulli experiments with probability of “red” equal to 0.2 and 50 independent trials in each experiment.
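For anyone who wants to reproduce the data, here is a minimal Python version of that recipe (the seed is arbitrary; any seed gives data with the same character):

```python
import random

random.seed(2)

# 20 Bernoulli experiments, each of 50 independent trials with
# P("red") = 0.2; record the number of reds in each experiment.
counts = [sum(random.random() < 0.2 for _ in range(50)) for _ in range(20)]
print(counts)
```

Each count is a binomial(50, 0.2) variate, so the long-run mean is 10 reds per draw, with almost all counts falling between 4 and 14, exactly as described above.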