Shewhart chart basics 1 – The environment sufficiently stable to be predictable

Everybody wants to be able to predict the future. Here is the forecaster’s catechism.

  • We can do no more than attach a probability to future events.
  • Where we have data from an environment that is sufficiently stable to be predictable we can project historical patterns into the future.
  • Otherwise, prediction is largely subjective;
  • … but there are tactics that can help.
  • The Shewhart chart is the tool that helps us know whether we are working with an environment that is sufficiently stable to be predictable.

Now let’s get to work.

What does a stable/predictable environment look like?

Every trial lawyer knows the importance of constructing a narrative out of evidence, an internally consistent and compelling arrangement of the facts that asserts itself above competing explanations. Time is central to how a narrative evolves. It is time that suggests causes and effects, motivations, barriers and enablers, states of knowledge, external influences, sensitisers and cofactors. That’s why exploration of data always starts with plotting it in time order. Always.

Let’s start off by looking at something we know to be predictable. Imagine a bucket of thousands of spherical beads. Of the beads, 80% are white and 20%, red. You are given a paddle that will hold 50 beads. Use the paddle to stir the beads then draw out 50 with the paddle. Count the red beads. Now you may, at this stage, object. Surely, this is just random and inherently unpredictable. But I want to persuade you that this is the most predictable data you have ever seen. Let’s look at some data from 20 sequential draws. In time order, of course, in Fig. 1.

Shew Chrt 1

Just to look at the data from another angle, always a good idea, I have added up how many times a particular value, 9, 10, 11, … , turns up and tallied them on the right hand side. For example, here is the tally for 12 beads in Fig. 2.

Shew Chrt 2

We get this in Fig. 3.

Shew Chrt 3

Here are the important features of the data.

  • We can’t predict what the exact value will be on any particular draw.
  • The numbers vary irregularly from draw to draw, as far as we can see.
  • We can say that draws will vary somewhere between 2 (say) and 19 (say).
  • Most of the draws are fairly near 10.
  • Draws near 2 and 19 are much rarer.

I would be happy to predict that the 21st draw will be between 2 and 19, probably not too far from 10. I have tried to capture that in Fig. 4. There are limits to variation suggested by the experience base. As predictions go, let me promise you, that is as good as it gets.

Even statistical theory would point to an outcome not so very different from that. That theoretical support adds to my confidence.
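The theoretical support mentioned above can be sketched in a few lines. This is only an illustration, assuming that the bucket is large enough for each of the 50 beads on the paddle to be treated as red independently with probability 0.2 (a binomial approximation). The function names, the seed and the simulated draws are my own inventions and are not the data plotted in the figures.

```python
import math
import random
from collections import Counter

N_PADDLE = 50   # beads held by the paddle (from the text)
P_RED = 0.20    # proportion of red beads in the bucket (from the text)


def draw_red_count(rng: random.Random) -> int:
    """Simulate one paddle draw and count the red beads.

    Assumption: each bead is red independently with probability P_RED,
    which is reasonable when the bucket holds thousands of beads.
    """
    return sum(rng.random() < P_RED for _ in range(N_PADDLE))


def binomial_pmf(k: int, n: int = N_PADDLE, p: float = P_RED) -> float:
    """Probability of exactly k red beads under the binomial model."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


rng = random.Random(1)  # arbitrary seed so that the run is repeatable
draws = [draw_red_count(rng) for _ in range(20)]
tally = Counter(draws)

# Text-mode version of a tally like the one in Fig. 3.
for value in sorted(tally):
    print(f"{value:2d} {'x' * tally[value]}")

mean = N_PADDLE * P_RED                         # expected red beads per draw
sd = math.sqrt(N_PADDLE * P_RED * (1 - P_RED))  # binomial standard deviation
prob_2_to_19 = sum(binomial_pmf(k) for k in range(2, 20))

print(f"theoretical mean = {mean:.1f}, sd = {sd:.2f}")
print(f"P(2 <= reds <= 19) = {prob_2_to_19:.4f}")
```

Under these assumptions the model puts nearly all of its probability between 2 and 19, centred on 10, which is the same picture as the one read off the tally by eye.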

Shew Chrt 4

But there’s something else. Something profound.

A philosopher, an engineer and a statistician walk into a bar …

… and agree.

I got my last three bullet points above from just looking at the tally on the right hand side. What about the time order I was so insistent on preserving? As Daniel Kahneman put it “A random event does not … lend itself to explanation, but collections of random events do behave in a highly regular fashion.” What is this “regularity” when we can see how irregularly the draws vary? This is where time and narrative make their appearance.

If we take the draw data above, the exact same data, and “shuffle” it into a fresh order, we get this, Fig. 5.

Shew Chrt 5

Now the bullet points still apply to the new arrangement. The story, the narrative, has not changed. We still see the “irregular” variation. That is its “regularity”, that it tells the same story when we shuffle it. The picture and its inferences are the same. We cannot predict an exact value on any future draw yet it is all but sure to be between 2 and 19 and probably quite close to 10.

In 1924, British philosopher W E Johnson and US engineer Walter Shewhart, independently, realised that this was the key to describing a predictable process. It shows the same “regular irregularity”, or shall we say stable irregularity, when you shuffle it. Italian statistician Bruno de Finetti went on to derive the rigorous mathematics a few years later with his famous representation theorem. The most important theorem in the whole of statistics.

This is the exact characterisation of noise. If you shuffle it, it makes no difference to what you see or the conclusions you draw. It makes no difference to the narrative you construct (sic). Paradoxically, it is noise that is predictable.

To understand this, let’s look at some data that isn’t just noise.

Events, dear boy, events.

That was the alleged response of British Prime Minister Harold Macmillan when asked what had been the most difficult aspect of governing Britain.

Suppose our data looks like this in Fig. 6.

Shew Chrt 6

Let’s make it more interesting. Suppose we are looking at the net approval rating of a politician (Fig. 7).

Shew Chrt 7

What this looks like is noise plus a material step change between the 10th and 11th observation. Now, this is a surprise. The regularity, and the predictability, is broken. In fact, my first reaction is to ask What happened? I research political events and find at that same time there was an announcement of universal tax cuts (Fig. 8). This is just fiction of course. That then correlates with the shift in the data I observe. The shift is a signal, a flag from the data telling me that something happened, that the stable irregularity has become an unstable irregularity. I use the time context to identify possible explanations. I come up with the tentative idea about tax cuts as an explanation of the sudden increase in popularity.

The bullet points above no longer apply. The most important feature of the data now is the shift, I say, caused by the Prime Minister’s intervention.

Shew Chrt 8

What happens when I shuffle the data into a random order though (Fig. 9)?

Shew Chrt 9

Now, the signal is distorted, hard to see and impossible to localise in time. I cannot tie it to a context. The message in the data is entirely different. The information in the chart is not preserved. The shuffled data does not bear the same narrative as the time ordered data. It does not tell the same story. It does not look the same. That is how I know there is a signal. The data changes its story when shuffled. The time order is crucial.

Of course, if I repeated the tally exercise that I did on Fig. 4, the tally would look the same, just as it did in the noise case in Fig. 5.
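Here is a small sketch of that contrast, using made-up numbers rather than the data in the figures. It assumes one particular order-sensitive summary, the average moving range (the mean absolute difference between successive values), as a stand-in for “what the chart looks like”. The function name and the seed are hypothetical.

```python
import random
from collections import Counter


def mean_moving_range(series):
    """Average absolute difference between successive values.

    Unlike a tally, this depends on the order of the data.
    """
    ranges = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(ranges) / len(ranges)


# Invented ratings: noise around 0, then a step up to around 10
# between the 10th and 11th observations (illustrative numbers only).
ordered = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
           10, 11, 10, 11, 10, 11, 10, 11, 10, 11]

shuffled = ordered[:]
random.Random(7).shuffle(shuffled)  # arbitrary seed for repeatability

# The tally ignores time order, so shuffling cannot change it.
print("same tally:", Counter(ordered) == Counter(shuffled))

# The moving range uses time order, so shuffling generally changes it.
print("mean moving range, time order:", round(mean_moving_range(ordered), 2))
print("mean moving range, shuffled:  ", round(mean_moving_range(shuffled), 2))
```

With data that is only noise, the shuffled and unshuffled moving ranges would typically be of similar size. With a step change in the data they generally are not, even though the tally is identical.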

Is data with signals predictable?

The Prime Minister will say that they predicted that their tax cuts would be popular and they probably did so. My response to that would be to ask how big an improvement they predicted. While a response in the polls may have been foreseeable, specifying its magnitude is much more difficult and unlikely to be exact.

We might say that the approval data following the announcement has returned to stability. Can we not now predict the future polls? Perhaps tentatively in the short term but we know that “events” will continue to happen. Not all these will be planned by the government. Some government initiatives, triumphs and embarrassments will not register with the public. The public has other things to be interested in. Here is some UK data.

poll20180302

You can follow regular updates here if you are interested.

Shewhart’s ingenious chart

While Johnson and de Finetti were content with theory, Shewhart, working in the manufacture of telegraphy equipment, wanted a practical tool for his colleagues that would help them answer the question of predictability. A tool that would help users decide whether they were working with an environment sufficiently stable to be predictable. Moreover, he wanted a tool that would be easy to use by people who were short of time for analysing data and had minds occupied by the usual distractions of the work place. He didn’t want people to have to run off to a statistician whenever they were perplexed by events.

In Part 2 I shall start to discuss how to construct Shewhart’s chart. In subsequent parts, I shall show you how to use it.


Regression done right: Part 1: Can I predict the future?

I recently saw an article in the Harvard Business Review called “Refresher on Regression Analysis”. I thought it was horrible so I wanted to set the record straight.

Linear regression from the viewpoint of machine learning

Linear regression is important, not only because it is a useful tool in itself, but because it is (almost) the simplest statistical model. The issues that arise in a relatively straightforward form are issues that beset the whole of statistical modelling and predictive analytics. Anyone who understands linear regression properly is able to ask probing questions about more complicated models. The complex internal algorithms of Kalman filters, ARIMA processes and artificial neural networks are accessible only to the specialist mathematician. However, each has several general features in common with simple linear regression. A thorough understanding of linear regression enables a due diligence of the claims made by the machine learning advocate. Linear regression is the paradigmatic exemplar of machine learning.

There are two principal questions that I want to talk about that are the big takeaways of linear regression. They are always the first two questions to ask in looking at any statistical modelling or machine learning scenario.

  1. What predictions can I make (if any)?
  2. Is it worth the trouble?

I am going to start looking at (1) in this blog and complete it in a future Part 2. I will then look at (2) in a further Part 3.

Variation, variation, variation

Variation is a major problem for business, the tendency of key measures to fluctuate irregularly. Variation leads to uncertainty. Will the next report be high or low? Or in the middle? Because of the uncertainty we have to allow safety margins or swallow some non-conformancies. We have good days and bad days, good products and not so good. We have to carry costly working capital because of variation in cash flow. And so on.

We learned in our high school statistics class to characterise variation in a key process measure, call it the Big Y, by an histogram of observations. Perhaps we are bothered by the fluctuating level of monthly sales.

RegressionHistogram

The variation arises from a whole ecology of competing and interacting effects and factors that we call the cause-system of the outcome. In general, it is very difficult to single out individual factors as having been the cause of a particular observation, so entangled are they. It is still useful to capture them for reference on a cause and effect diagram.

RegressionIshikawa

One of the strengths of the cause and effect diagram is that it may prompt the thought that one of the factors is particularly important, call it Big X, perhaps it is “hours of TV advertising” (my age is showing). Motivated by that we can generate a sample of corresponding measurements of both the Y and X and plot them on a scatter plot.

RegressionScatter1

Well what else is there to say? The scatter plot shows us all the information in the sample. Scatter plots are an important part of what statistician John Tukey called Exploratory Data Analysis (EDA). We have some hunches and ideas, or perhaps hardly any idea at all, and we attack the problem by plotting the data in any way we can think of. So much easier now than when W Edwards Deming wrote:[1]

[Statistical practice] means tedious work, such as studying the data in various forms, making tables and charts and re-making them, trying to use and preserve the evidence in the results and to be clear enough to the reader: to endure disappointment and discouragement.

Or as Chicago economist Ronald Coase put it.

If you torture the data enough, nature will always confess.

The scatter plot is a fearsome instrument of data torture. It tells me everything. It might even tempt me to think that I have a basis on which to make predictions.

Prediction

In machine learning terms, we can think of the sample used for the scatter plot as a training set of data. It can be used to set up, “train”, a numerical model that we will then fix and use to predict future outcomes. The scatter plot strongly suggests that if we know a future X alone we can have a go at predicting the corresponding future Y. To see that more clearly we can draw a straight line by hand on the scatter plot, just as we did in high school before anybody suggested anything more sophisticated.

RegressionScatter2

Given any particular X we can read off the corresponding Y.

RegressionScatter3

The immediate insight that comes from drawing in the line is that not all the observations lie on the line. There is variation about the line so that there is actually a range of values of Y that seem plausible and consistent for any specified X. More on that in Parts 2 and 3.

In understanding machine learning it makes sense to start by thinking about human learning. Psychologists Gary Klein and Daniel Kahneman investigated how firefighters were able to perform so successfully in assessing a fire scene and making rapid, safety critical decisions. Lives of the public and of other firefighters were at stake. This is the sort of human learning situation that machines, or rather their expert engineers, aspire to emulate. Together, Klein and Kahneman set out to describe how the brain could build up reliable memories that would be activated in the future, even in the agony of the moment. They came to the conclusion that there are two fundamental conditions for a human to acquire a skill.[2]

  • An environment that is sufficiently regular to be predictable.
  • An opportunity to learn these regularities through prolonged practice.

The first bullet point is pretty much the most important idea in the whole of statistics. Before we can make any prediction from the regression, we have to be confident that the data has been sampled from “an environment that is sufficiently regular to be predictable”. The regression “learns” from those regularities, where they exist. The “learning” turns out to be the rather prosaic mechanics of matrix algebra as set out in all the standard texts.[3] But that, after all, is what all machine “learning” is really about.

Statisticians capture the psychologists’ “sufficiently regular” through the mathematical concept of exchangeability. If a process is exchangeable then we can assume that the distribution of events in the future will be like the past. We can project our historic histogram forward. With regression we can do better than that.

Residuals analysis

Formally, the linear regression calculations calculate the characteristics of the model:

Y = mX + c + “stuff”

The “mX+c” bit is the familiar high school mathematics equation for a straight line. The “stuff” is variation about the straight line. What the linear regression mathematics does is (objectively) to calculate the m and c and then also tell us something about the “stuff”. It splits the variation in Y into two components:

  • What can be explained by the variation in X; and
  • The, as yet unexplained, variation in the “stuff”.

The first thing to learn about regression is that it is the “stuff” that is the interesting bit. In 1849 British astronomer Sir John Herschel observed that:

Almost all the greatest discoveries in astronomy have resulted from the consideration of what we have elsewhere termed RESIDUAL PHENOMENA, of a quantitative or numerical kind, that is to say, of such portions of the numerical or quantitative results of observation as remain outstanding and unaccounted for after subducting and allowing for all that would result from the strict application of known principles.

The straight line represents what we guessed about the causes of variation in Y and which the scatter plot confirmed. The “stuff” represents the causes of variation that we failed to identify and that continue to limit our ability to predict and manage. We call the predicted Ys that correspond to the measured Xs, and lie on the fitted straight line, the fits.

fit_i = mX_i + c

The residual values, or residuals, are obtained by subtracting the fits from the respective observed Y values. The residuals represent the “stuff”. Statistical software does this for us routinely. If yours doesn’t then bin it.

residual_i = Y_i – fit_i

RegressionScatter4
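The fits and residuals defined above can be sketched as follows. This is a minimal illustration using ordinary least squares for m and c, which is the usual choice though the text does not spell out the formulas. The function names and the advertising and sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of slope m and intercept c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    c = y_bar - m * x_bar
    return m, c


def fits_and_residuals(xs, ys):
    """Return the fits (m*X_i + c) and the residuals (Y_i - fit_i)."""
    m, c = fit_line(xs, ys)
    fits = [m * x + c for x in xs]
    residuals = [y - f for y, f in zip(ys, fits)]
    return fits, residuals


# Hypothetical data: hours of TV advertising (X) and monthly sales (Y).
hours = [1, 2, 3, 4, 5, 6]
sales = [12, 15, 15, 19, 22, 22]

m, c = fit_line(hours, sales)
fits, residuals = fits_and_residuals(hours, sales)

print(f"m = {m:.3f}, c = {c:.3f}")
for x, y, f, r in zip(hours, sales, fits, residuals):
    print(f"X={x}  Y={y}  fit={f:.2f}  residual={r:+.2f}")
```

The residuals are kept in the same order as the observations. That matters for the next step, because the check on them depends on time order.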

There are a number of properties that the residuals need to satisfy for the regression to work. Investigating those properties is called residuals analysis.[4] As far as use for prediction is concerned, it is sufficient that the “stuff”, the variation about the straight line, be exchangeable.[5] That means that the “stuff” so far must appear from the data to be exchangeable and further that we have a rational belief that such a cause system will continue unchanged into the future. Shewhart charts are the best heuristics for checking the requirement for exchangeability, certainly as far as the historical data is concerned. Our first and, be under no illusion, mandatory check on the ability of the linear regression, or any statistical model, to make predictions is to plot the residuals against time on a Shewhart chart.

RegressionPBC

If there are any signals of special causes then the model cannot be used for prediction. It just can’t. For prediction we need residuals that are all noise and no signal. However, like all signals of special causes, such will provide an opportunity to explore and understand more about the cause system. The signal that prevents us from using this regression for prediction may be the very thing that enables an investigation leading to a superior model, able to predict more exactly than we ever hoped the failed model could. And even if there is sufficient evidence of exchangeability from the training data, we still need to continue vigilance and scrutiny of all future residuals to look out for any novel signals of special causes. Special causes that arise post-training provide fresh information about the cause system while at the same time compromising the reliability of the predictions.
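A rough sketch of that check follows. The text does not say which form of Shewhart chart is used or how its limits are calculated, so this assumes the common individuals (XmR) form, with natural process limits at the mean plus or minus 2.66 times the average moving range, and only the simplest detection rule (a point outside the limits). The residual values and function names are invented for illustration.

```python
def xmr_limits(values):
    """Centre line and natural process limits for an individuals chart.

    Assumption: the conventional XmR scaling,
    mean +/- 2.66 * (average moving range).
    """
    centre = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return centre - 2.66 * mr_bar, centre, centre + 2.66 * mr_bar


def signals(values):
    """Indices, in time order, of points outside the natural process limits."""
    lower, _, upper = xmr_limits(values)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical residuals in time order; the 9th value is unusually large.
residuals = [0.4, -0.6, 0.3, -0.2, 0.5, -0.4, 0.1, -0.3, 6.0, 0.2, -0.5, 0.3]

lower, centre, upper = xmr_limits(residuals)
print(f"limits: {lower:.2f} to {upper:.2f}, centre {centre:.2f}")
print("signals at positions:", signals(residuals))
```

Any position reported is a prompt to ask what happened at that time, not a verdict. A fuller treatment would also use run rules and would look at the moving ranges themselves.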

Thorough regression diagnostics will also be able to identify issues such as serial correlation, lack of fit, leverage and heteroscedasticity. It is essential to regression and its omission is intolerable. Residuals analysis is one of Stephen Stigler’s Seven Pillars of Statistical Wisdom.[6] As Tukey said:

The greatest value of a picture is when it forces us to notice what we never expected to see.

To come:

Part 2: Is my regression significant? … is a dumb question.
Part 3: Quantifying predictions with statistical intervals.

References

  1. Deming, W E (1975) “On probability as a basis for action”, The American Statistician 29(4) pp146-152
  2. Kahneman, D (2011) Thinking, Fast and Slow, Allen Lane, p240
  3. Draper, N R & Smith, H (1998) Applied Regression Analysis, 3rd ed., Wiley, p44
  4. Draper & Smith (1998) Chs 2, 8
  5. I have to admit that weaker conditions may be adequate in some cases but these are far beyond any other than a specialist mathematician.
  6. Stigler, S M (2016) The Seven Pillars of Statistical Wisdom, Harvard University Press, Chapter 7

Imagine …

No, not John Lennon’s dreary nursery rhyme for hippies.

In his memoir of the 2007-2008 banking crisis, The Courage to Act, Ben Bernanke wrote about his surprise when the crisis materialised.

We saw, albeit often imperfectly, most of the pieces of the puzzle. But we failed to understand – “failed to imagine” might be a better phrase – how those pieces would fit together to produce a financial crisis that compared to, and arguably surpassed, the financial crisis that ushered in the Great Depression.

That captures the three essentials of any attempt to foresee a complex future.

  • The pieces
  • The fit
  • Imagination

In any well managed organisation, “the pieces” consist of the established Key Performance Indicators (KPIs) and leading measures. Diligent and rigorous criticism of historical data using process behaviour charts allows departures from stability to be identified timeously. A robust and disciplined system of management and escalation enables an agile response when special causes arise.

Of course, “the fit” demands a broader view of the data, recognising interactions between factors and the possibility of non-simple global responses remote from a locally well behaved response surface. As the old adage goes, “Fit locally. Think globally.” This is where the Cardinal Newman principle kicks in.

“The pieces” and “the fit”, taken at their highest, yield a map of historical events with some limited prediction as to how key measures will behave in the future. Yet it is common experience that novel factors persistently invade. The “bow wave” of such events will not fit a recognised pattern where there will be a ready consensus as to meaning, mechanism and action. These are the situations where managers are surprised by rapidly emerging events, only to protest, “We never imagined …”.

Nassim Taleb’s analysis of the financial crisis hinged on such surprises and took him back to the work of British economist G L S Shackle. Shackle had emphasised the importance of imagination in economics. Put at its most basic, any attempt to assign probabilities to future events depends upon the starting point of listing the alternatives that might occur. Statisticians call it the sample space. If we don’t imagine some specific future we won’t bother thinking about the probability that it might come to be. Imagination is crucial to economics but it turns out to be much more pervasive as an engine of improvement than at first is obvious.

Imagination and creativity

Frank Whittle had to imagine the jet engine before he could bring it into being. Alan Turing had to imagine the computer. They were both fortunate in that they were able to test their imagination by construction. It was all realised in a comparatively short period of time. Whittle’s and Turing’s respective imaginations were empirically verified.

What is now proved was once but imagined.

William Blake

Not everyone has had the privilege of seeing their imagination condense into reality within their lifetime. In 1946, Sir George Paget Thomson and Moses Blackman imagined a plentiful source of inexpensive civilian power from nuclear fusion. As of writing, prospects of a successful demonstration seem remote. Frustratingly, as far as I can see, the evidence still refuses to tip the balance as to whether future success is likely or that failure is inevitable.

Something as elusive as imagination can have a testable factual content. As we know, not all tests are conclusive.

Imagination and analysis

Imagination turns out to be essential to something as prosaic as Root Cause Analysis. And essential in a surprising way. Establishing an operative cause of a past event is an essential task in law and engineering. It entails the search for a counterfactual, not what happened but what might have happened to avoid the regrettable outcome. That is inevitably an exercise in imagination.

In almost any interesting situation there will be multiple imagined pasts. If there is only one then it is time to worry. Sometimes it is straightforward to put our ideas to the test. This is where the Shewhart cycle comes into its own. In other cases we are in the realms of uncomfortable science. Sometimes empirical testing is frustrated because the trail has gone cold.

The issues of counterfactuals, Root Cause Analysis and causation have been explored by psychologists Daniel Kahneman[1] and Ruth Byrne[2] among others. Reading their research is a corrective to the optimistic view that Root Cause Analysis is some sort of inevitably objective process. It is distorted by all sorts of heuristics and biases. Empirical testing is vital, if only through finding some data with borrowing strength.

Imagine a millennium bug

In 1984, Jerome and Marilyn Murray published Computers in Crisis in which they warned of a significant future risk to global infrastructure in telecommunications, energy, transport, finance, health and other domains. It was exactly those areas where engineers had been enthusiastic to exploit software from the earliest days, often against severe constraints of memory and storage. That had led to the frequent use of just two digits to represent a year, “71” for 1971, say. From the 1970s, software became more commonly embedded in devices of all types. As the year 2000 approached, the Murrays envisioned a scenario where the dawn of 1 January 2000 was heralded by multiple system failures where software registers reset to the year 1900, frustrating functions dependent on timing and forcing devices into a fault mode or a graceless degradation. Still worse, systems may simply malfunction abruptly and without warning, the only sensible signal being when human wellbeing was compromised. And the ruinous character of such a threat would be that failure would be inherently simultaneous and global, with safeguarding systems possibly beset with the same defects as the primary devices. It was easy to imagine a calamity.

You might like to assess that risk yourself (ex ante) by locating it on the Risk Assessment Matrix to the left. It would be a brave analyst who would categorise it as “Low”, I think. Governments and corporations were impressed and embarked on a massive review of legacy software and embedded systems, estimated to have cost around $300 billion at year 2000 prices. A comprehensive upgrade programme was undertaken by nearly all substantial organisations, public and private.

Then, on 1 January 2000, there was no catastrophe. And that caused consternation. The promoters of the risk were accused of having caused massive expenditure and diversion of resources against a contingency of negligible impact. Computer professionals were accused, in terms, of self-serving scare mongering. There were a number of incidents which will not have been considered minor by the people involved. For example, in a British hospital, tests for Down’s syndrome were corrupted by the bug resulting in contra-indicated abortions and births. However, there was no global catastrophe.

This is the locus classicus of a counterfactual. Forecasters imagined a catastrophe. They persuaded others of their vision and the necessity of vast expenditure in order to avoid it. The preventive measures were implemented at great cost. The catastrophe did not occur. Ex post, the forecasters were disbelieved. The danger had never been real. Even Cassandra would have sympathised.

Critics argued that there had been a small number of relatively minor incidents that would have been addressed most economically on a “fix on failure” basis. Much of this turns out to be a debate about the much neglected column of the risk assessment headed “Detectability”. Where a failure will inflict immediate pain, it is so much more critical as to management and mitigation than a failure that will present the opportunity for detection and protection in advance of a broader loss. Here, forecasting Detectability was just as important as Probability and Consequences in arriving at an economic strategy for management.
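To make the role of that third column concrete, here is a toy sketch. It assumes an FMEA-style scoring in which Probability, Consequences and Detectability are each rated on a 1 to 5 scale and multiplied together. The scales, the class name and the ratings are all hypothetical and are not taken from any particular risk assessment standard or from the Y2K assessments themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: int    # 1 (rare) to 5 (almost certain); hypothetical scale
    consequence: int    # 1 (minor) to 5 (catastrophic); hypothetical scale
    detectability: int  # 1 (ample warning) to 5 (no warning before harm)

    def priority(self) -> int:
        """FMEA-style product score; a larger score means more urgent."""
        return self.probability * self.consequence * self.detectability


# Invented ratings, chosen only to show the effect of the third column.
with_warning = Risk("fault shows up in routine testing first", 3, 4, 1)
no_warning = Risk("fault first appears as harm in service", 3, 4, 5)

ranked = sorted([with_warning, no_warning], key=Risk.priority, reverse=True)
for risk in ranked:
    print(f"{risk.priority():3d}  {risk.name}")
```

Two risks with the same probability and consequences end up far apart once detectability is scored. That is the sense in which a “fix on failure” strategy can be economic for one and reckless for the other.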

It is the fundamental paradox of risk assessment that, where control measures eliminate a risk, it is not obvious whether the benign outcome was caused by the control or whether the risk assessment was just plain wrong and the risk never existed. Another counterfactual. Again, finding some alternative data with borrowing strength can help though it will ever be difficult to build a narrative appealing to a wide population. There are links to some sources of data on the Wikipedia article about the bug. I will leave it to the reader.

Imagine …

Of course it is possible to find this all too difficult and to adopt the Biblical outlook.

I returned, and saw under the sun, that the race is not to the swift, nor the battle to the strong, neither yet bread to the wise, nor yet riches to men of understanding, nor yet favour to men of skill; but time and chance happeneth to them all.

Ecclesiastes 9:11
King James Bible

That is to adopt the outlook of the lady on the level crossing. Risk professionals look for evidence that their approach works.

The other day, I was reading the annual report of the UK Health and Safety Executive (pdf). It shows a steady improvement in the safety of people at work though oddly the report is too coy to say this in terms. The improvement occurs over the period where risk assessment has become ubiquitous in industry. In an individual work activity it will always be difficult to understand whether interventions are being effective. But using the borrowing strength of the overall statistics there is potent evidence that risk assessment works.

References

  1. Kahneman, D & Tversky, A (1979) “The simulation heuristic”, reprinted in Kahneman et al. (1982) Judgment under Uncertainty: Heuristics and Biases, Cambridge, p201
  2. Byrne, R M J (2007) The Rational Imagination: How People Create Alternatives to Reality, MIT Press

Data science sold down the Amazon? Jeff Bezos and the culture of rigour

This blog appeared on the Royal Statistical Society website Statslife on 25 August 2015

This recent item in the New York Times has catalysed discussion among managers. The article tells of Amazon’s founder, Jeff Bezos, and his pursuit of rigorous data driven management. It also tells employees’ own negative stories of how that felt emotionally.

The New York Times says that Amazon is pervaded with abundant data streams that are used to judge individual human performance and which drive reward and advancement. They inform termination decisions too.

The recollections of former employees are not the best source of evidence about how a company conducts its business. Amazon’s share of the retail market is impressive and they must be doing something right. What everybody else wants to know is, what is it? Amazon are very coy about how they operate and there is a danger that the business world at large takes the wrong messages.

Targets

Targets are essential to business. The marketing director predicts that his new advertising campaign will create demand for 12,000 units next year. The operations director looks at her historical production data. She concludes that the process lacks the capability reliably to produce those volumes. She estimates the budget required to upgrade the process and to achieve 12,000 units annually. The executive board considers the business case and signs off the investment. Both marketing and operations directors now have a target.

Targets communicate improvement priorities. They build confidence between interfacing processes. They provide constraints and parameters that prevent the system causing harm. Harm to others or harm to itself. They allow the pace and substance of multiple business processes, and diverse entities, to be matched and aligned.

But everyone who has worked in business sees it as less simple than that. The marketing and operations directors are people.

Signal and noise

Drawing conclusions from data might be an uncontroversial matter were it not for the most common feature of data, fluctuation. Call it variation if you prefer. Business measures do not stand still. Every month, week, day and hour is different. All data features noise. Sometimes it goes up, sometimes down. A whole ecology of occult causes, weakly characterised, unknown and as yet unsuspected, interact to cause irregular variation. They are what cause a coin variously to fall “heads” or “tails”. That variation may often be stable enough, or if you like “exchangeable”, so as to allow statistical predictions to be made, as in the case of the coin toss.

If all data features noise then some data features signals. A signal is a sign, an indicator that some palpable cause has made the data stand out from the background noise. It is that assignable cause which enables inferences to be drawn about what interventions in the business process have had a tangible effect and what future innovations might cement any gains or lead to bigger prospective wins. Signal and noise lead to wholly different business strategies.

The relevance for business is that people, where not exposed to rigorous decision support, are really bad at telling the difference between signal and noise. Nobel laureate economist and psychologist Daniel Kahneman has amassed a lifetime of experimental and anecdotal data capturing noise misinterpreted as signal and judgments in the face of compelling data, distorted by emotional and contextual distractions.

Signal and accountability

It is a familiar trope of business, and government, that extravagant promises are made, impressive business cases set out and targets signed off. Yet the ultimate scrutiny as to whether that envisaged performance was realised often lacks rigour. Noise, with its irregular ups and downs, allows those seeking solace from failure to pick out select data points and cast self-serving narratives on the evidence.

Our hypothetical marketing director may fail to achieve his target but recount how there were two individual months where sales exceeded 1,000, construct elaborate rationales as to why only they are representative of his efforts and point to purported external factors that frustrated the remaining ten reports. Pairs of individual data points can always be selected to support any story, Don Wheeler’s classic executive time series.

This is where the ability to distinguish signal and noise is critical. To establish whether targets have been achieved requires crisp definition of business measures, not only outcomes but also the leading indicators that provide context and advise judgment as to prediction reliability. Distinguishing signal and noise requires transparent reporting that allows diverse streams of data criticism. It requires a rigorous approach to characterising noise and a systematic approach not only to identifying signals but to reacting to them in an agile and sustainable manner.

Data is essential to celebrating a target successfully achieved and to responding constructively to a failure. But where noise is gifted the status of signal to confirm a fanciful business case, or to protect a heavily invested reputation, then the business is misled, costs increased, profits foregone and investors cheated.

Where employees believe that success and reward are being fudged, whether because of wishful thinking or lack of data skills, or mistakenly through lack of transparency, then cynicism and demotivation will breed virulently. Employees watch the behaviours of their seniors carefully as models of what will lead to their own advancement. Where it is deceit or innumeracy that succeed, that is what will thrive.

Noise and blame

Here is some data of the number of defects caused by production workers last month.

Worker Defects
Al 10
Simone 6
Jose 10
Gabriela 16
Stan 10

What is to be done about Gabriela? Move to an easier job? Perhaps retraining? Or should she be let go? And Simone? Promote to supervisor?

Well, the numbers were just random numbers that I generated. I didn’t add anything in to make Gabriela’s score higher and there was nothing in the way that I generated the data to suggest who would come top or bottom. The data are simply noise. They are the sort of thing that you might observe in a manufacturing plant that presented a “stable system of trouble”. Nothing in the data signals any behaviour, attitude, skill or diligence that Gabriela lacked or wrongly exercised. The next month’s data would likely show a different candidate for dismissal.
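Here is a minimal sketch of how a table like that can arise from noise alone. I have not said above how I generated the numbers, so everything in the model below is an assumption made purely for illustration: each worker makes 100 items a month, every item has the same 10% chance of a defect, and the `simulate_month` helper and the seed are hypothetical.

```python
import random

WORKERS = ["Al", "Simone", "Jose", "Gabriela", "Stan"]

def simulate_month(rng, items=100, defect_rate=0.1):
    """Defect counts for one month. Every worker has the SAME defect rate."""
    return {
        worker: sum(rng.random() < defect_rate for _ in range(items))
        for worker in WORKERS
    }

rng = random.Random(1)  # arbitrary seed, only so that reruns match
months = [simulate_month(rng) for _ in range(12)]

for number, counts in enumerate(months, start=1):
    worst = max(counts, key=counts.get)  # the month's "candidate for dismissal"
    print(f"month {number:2d}: {counts}  worst: {worst}")
```

Because every worker shares the same underlying rate, whoever tops the list in any month does so by chance alone.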

Mistaking noise for signal is, like mistaking signal for noise, the path to business underperformance and employee disillusionment. It has a particularly corrosive effect where used, as it might be in Gabriela’s case, to justify termination. The remaining staff will be bemused as to what Gabriela was actually doing wrong and start to attach myriad and irrational doubts to all sorts of things in the business. There may be a resort to magical thinking. The survivors will be less open and less willing to share problems with their supervisors. The business itself has the costs of recruitment to replace Gabriela. The saddest aspect of the whole business is the likelihood that Gabriela’s replacement will perform better than did Gabriela, vindicating the dismissal in the mind of her supervisor. This is the familiar statistical artefact of regression to the mean. An extreme event is likely to be followed by one less extreme. Again, Kahneman has collected sundry examples of managers so deceived by singular human performance and disappointed by its modest follow-up.
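The regression to the mean described above can be sketched in a few lines. This is an illustration under assumed conditions, not a model of any real plant: every worker, including the replacement, is a draw from the same hypothetical process of 100 items a month with a 10% defect chance, and the function names are invented for the example.

```python
import random

def monthly_defects(rng, items=100, defect_rate=0.1):
    """One worker's defect count for one month (assumed common process)."""
    return sum(rng.random() < defect_rate for _ in range(items))

def regression_demo(trials=2000, workers=5, seed=2):
    """Average score of the month's worst worker against that of a replacement."""
    rng = random.Random(seed)
    worst_scores, replacement_scores = [], []
    for _ in range(trials):
        this_month = [monthly_defects(rng) for _ in range(workers)]
        worst_scores.append(max(this_month))  # the worker who gets dismissed
        # The replacement is simply a fresh draw from the same process.
        replacement_scores.append(monthly_defects(rng))
    average = lambda xs: sum(xs) / len(xs)
    return average(worst_scores), average(replacement_scores)

worst_avg, replacement_avg = regression_demo()
print(f"dismissed worker's average defects: {worst_avg:.1f}")
print(f"replacement's average defects:      {replacement_avg:.1f}")
```

The replacement looks better on average even though nothing about the process, or the people, has changed.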

It was W Edwards Deming who observed that every time you recruit a new employee you take a random sample from the pool of job seekers. That’s why you get the regression to the mean. It must be true at Amazon too as their human resources executive Mr Tony Galbato explains their termination statistics by admitting that “We don’t always get it right.” Of course, everybody thinks that their recruitment procedures are better than average. That’s a management claim that could well do with rigorous testing by data.

Further, mistaking noise for signal brings the additional business expense of over adjustment, spending money to add costly variation while degrading customer satisfaction. Nobody in the business feels good about that.

Target quality, data quality

I admitted above that the evidence we have about Amazon’s operations is not of the highest quality. I’m not in a position to judge what goes on at Amazon. But all should fix in their minds that setting targets demands rigorous risk assessment, analysis of perverse incentives and intense customer focus.

It is a sad reality that, if you set incentives perversely enough, some individuals will find ways of misreporting data. BNFL’s embarrassment with Kansai Electric and Steven Eaton’s criminal conviction were not isolated incidents.

One thing that especially bothered me about the Amazon report was the soi-disant Anytime Feedback Tool that allowed unsolicited anonymous peer appraisal. Apparently, this formed part of the “data” that determined individual advancement or termination. The description was unchallenged by Amazon’s spokesman (sic) Mr Craig Berman. I’m afraid, and I say this as a practising lawyer, unsourced and unchallenged “evidence” carries the spoor of the Star Chamber and the party purge. I would have thought that a pretty reliable method for generating unreliable data would be to maximise the personal incentives for distortion while protecting it from scrutiny or governance.

Kahneman observed that:

… we pay more attention to the content of messages than to information about their reliability, and as a result end up with a view of the world around us that is simpler and more coherent than the data justify.

It is the perverse confluence of fluctuations and individual psychology that makes statistical science essential, data analytics interesting and business, law and government difficult.

#executivetimeseries

ExecTS1Oxford

Don Wheeler coined the term executive time series. I was just leaving court in Oxford the other day when I saw this announcement on a hoarding. I immediately thought to myself “#executivetimeseries”.

Wheeler introduced the phrase in his 2000 book Understanding Variation: The Key to Managing Chaos. He meant to criticise the habitual way that statistics are presented in business and government. A comparison is made between performance at two instants in time. Grave significance is attached as to whether performance is better or worse at the second instant. Well, it was always unlikely that it would be the same.

The executive time series has the following characteristics.

  • It is applied to some statistic, metric, Key Performance Indicator (KPI) or other measure that will be perceived as important by its audience.
  • Two time instants are chosen.
  • The statistic is quoted at each of the two instants.
  • If the later value is greater than the earlier then an increase is inferred. A decrease is inferred from the converse.
  • Great significance is attached to the increase or decrease.

Why is this bad?

At its best it provides incomplete information devoid of context. At its worst it is subject to gross manipulation. The following problems arise.

  • Though a signal is usually suggested there is inadequate information to infer this.
  • There is seldom explanation of how the time points were chosen. It is open to manipulation.
  • Data is presented absent its context.
  • There is no basis for predicting the future.

The Oxford billboard is even worse than the usual example because it doesn’t even attempt to tell us over what period the carbon reduction is being claimed.

Signal and noise

Let’s first think about noise. As Daniel Kahneman put it “A random event does not … lend itself to explanation, but collections of random events do behave in a highly regular fashion.” Noise is a collection of random events. Some people also call it common cause variation.

Imagine a bucket of thousands of beads. Of the beads, 80% are white and 20%, red. You are given a paddle that will hold 50 beads. Use the paddle to stir the beads then draw out 50 with the paddle. Count the red beads. Repeat this, let us say once a week, until you have 20 counts. The data might look something like this.

RedBeads1

What we observe in Figure 1 is the irregular variation in the number of red beads. However, it is not totally unpredictable. In fact, it may be one of the most predictable things you have ever seen. Though we cannot forecast exactly how many red beads we will see in the coming week, it will most likely be in the rough range of 4 to 14 with rather more counts around 10 than at the extremities. The odd one below 4 or above 14 would not surprise you I think.
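To put a number on “most likely”, here is a minimal sketch that works out the chance of a count falling in that rough range of 4 to 14. It assumes each of the 50 beads on the paddle is independently red with probability 0.2, which is the binomial model; the `binom_pmf` helper is my own name for the illustration.

```python
from math import comb

def binom_pmf(k, n=50, p=0.2):
    """Probability of exactly k red beads among n, each red with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_below = sum(binom_pmf(k) for k in range(0, 4))       # 3 or fewer
p_in_range = sum(binom_pmf(k) for k in range(4, 15))   # 4 to 14 inclusive
p_above = sum(binom_pmf(k) for k in range(15, 51))     # 15 or more
most_likely = max(range(51), key=binom_pmf)

print(f"P(count below 4)  = {p_below:.4f}")
print(f"P(4 <= count <= 14) = {p_in_range:.4f}")
print(f"P(count above 14) = {p_above:.4f}")
print(f"most likely count = {most_likely}")
```

The three probabilities add up to one, and the small share outside the range is why the odd count below 4 or above 14 should not surprise anybody.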

But nothing changed in the characteristics of the underlying process. It didn’t get better or worse. The percentage of reds in the bucket was constant. It is a stable system of trouble. And yet measured variation extended between 4 and 14 red beads. That is why an executive time series is so dangerous. It alleges change while the underlying cause-system is constant.

Figure 2 shows how an executive time series could be constructed in week 3.

RedBeads2

The number of red beads has increased from 4 to 10, a 150% increase. Surely a “significant result”. And it will always be possible to find some managerial initiative between week 2 and 3 that can be invoked as the cause. “Between weeks 2 and 3 we changed the angle of inserting the paddle and it has increased the number of red beads by 150%.”

But Figure 2 is not the only executive time series that the data will support. In Figure 3 the manager can claim a 57% reduction from 14 to 6. More than the Oxford banner. Again, it will always be possible to find some factor or incident supposed to have caused the reduction. But nothing really changed.

RedBeads3

The executive can be even more ambitious. “Between week 2 and 17 we achieved a 250% increase in red beads.” Now that cannot be dismissed as a mere statistical blip.

RedBeads4
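The arithmetic behind those three claims is easily checked. The sketch below uses only the pairs of counts quoted in the text; the `pct_change` helper and the labels are my own for the illustration.

```python
from math import comb

def pct_change(first, second):
    """Percentage change from the first quoted value to the second."""
    return 100.0 * (second - first) / first

# Pairs of red bead counts quoted in the text above.
claims = {
    "week 2 to week 3": (4, 10),
    "the reduction in Figure 3": (14, 6),
    "week 2 to week 17": (4, 14),
}

for label, (first, second) in claims.items():
    print(f"{label}: {first} -> {second} is {pct_change(first, second):+.0f}%")

# Twenty weekly counts offer this many distinct pairs to choose from.
n_pairs = comb(20, 2)
print(f"pairs available from 20 counts: {n_pairs}")
```

This prints +150%, -57% and +250%, all from the same unchanging bucket of beads, and shows that 20 counts give 190 possible pairs for anyone shopping for a story.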

#executivetimeseries

Data has no meaning apart from its context.

Walter Shewhart

Not everyone who cites an executive time series is seeking to deceive. But many are. So anybody who relies on an executive time series, devoid of context, invites suspicion that they are manipulating the message. This is Langian statistics par excellence. The fallacy of What you see is all there is. It is essential to treat all such claims with the utmost caution. What properly communicates the present reality of some measure is a plot against time that exposes its variation, its stability (or otherwise) and sets it in the time context of surrounding events.

We should call out the perpetrators. #executivetimeseries

Techie note

The data here is generated from a sequence of 20 binomial experiments, each comprising 50 independent Bernoulli trials with probability of “red” equal to 0.2.
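For anybody who wants to try this at home, here is a minimal sketch of such a simulation using only the Python standard library. The seed and the `draw_red_beads` helper are my own assumptions for the illustration, so the output will not reproduce the counts plotted in the figures.

```python
import random

def draw_red_beads(rng, paddle_size=50, p_red=0.2):
    """One paddle draw: count the reds among independent Bernoulli trials."""
    return sum(rng.random() < p_red for _ in range(paddle_size))

rng = random.Random(2015)  # arbitrary seed, only so that reruns match
counts = [draw_red_beads(rng) for _ in range(20)]

print("weekly counts:", counts)
print("min:", min(counts), "max:", max(counts),
      "mean:", sum(counts) / len(counts))
```

Changing the seed gives a different sequence of 20 counts from the same stable process.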

Productivity and how to improve it: I – The foundational narrative

Again, much talk in the UK media recently about weak productivity statistics. Chancellor of the Exchequer (Finance Minister) George Osborne has launched a 15 point macroeconomic strategy aimed at improving national productivity. Some of the points are aimed at incentivising investment and training. There will be few who argue against that though I shall come back to the investment issue when I come to talk about signal and noise. I have already discussed training here. In any event, the strategy is fine as far as these things go. Which is not very far.

There remains the microeconomic task for all of us of actually improving our own productivity and that of the systems we manage. That is not the job of government.

Neither can I offer any generalised system for improving productivity. It will always be industry and organisation dependent. However, I wanted to write about some of the things that you have to understand if your efforts to improve output are going to be successful and sustainable.

  • Customer value and waste.
  • The difference between signal and noise.
  • How to recognise flow and manage a constraint.

Before going on to those in future weeks I first wanted to go back and look at what has become the foundational narrative of productivity improvement, the Hawthorne experiments. They still offer some surprising insights.

The Hawthorne experiments

In 1923, the US electrical engineering industry was looking to increase the adoption of electric lighting in American factories. Uptake had been disappointing despite the claims being made for increased productivity.

[Tests in nine companies have shown that] raising the average initial illumination from about 2.3 to 11.2 foot-candles resulted in an increase in production of more than 15%, at an additional cost of only 1.9% of the payroll.

Earl A Anderson
General Electric
Electrical World (1923)

E P Hyde, director of research at GE’s National Lamp Works, lobbied government for the establishment of a Committee on Industrial Lighting (“the CIL”) to co-ordinate marketing-oriented research. Western Electric volunteered to host tests at their Hawthorne Works in Cicero, IL.

Western Electric came up with a study design that comprised a team of experienced workers assembling relays, winding their coils and inspecting them. Tests commenced in November 1924 with active support from an elite group of academic and industrial engineers including the young Vannevar Bush, who would himself go on to an eminent career in government and science policy. Thomas Edison became honorary chairman of the CIL.

It’s a tantalising historical fact that Walter Shewhart was employed at the Hawthorne Works at the time but I have never seen anything suggesting his involvement in the experiments, nor that of his mentor George D Edwards, nor protégé Joseph Juran. In later life, Juran was dismissive of the personal impact that Shewhart had had on operations there.

However, initial results showed no influence of light level on productivity at all. Productivity rose throughout the test but was wholly uncorrelated with lighting level. Theories about the impact of human factors such as supervision and motivation started to proliferate.

A further schedule of tests was programmed starting in September 1926. Now, the lighting level was to be reduced to near darkness so that the threshold of effective work could be identified. Here is the summary data (from Richard Gillespie Manufacturing Knowledge: A History of the Hawthorne Experiments, Cambridge, 1991).

Hawthorne data-1

It requires no sophisticated statistical analysis to see that the data is all noise and no signal. Much to the disappointment of the CIL, and the industry, there was no evidence that illumination made any difference at all, even down to conditions of near darkness. It’s striking that the highest lighting levels embraced the full range of variation in productivity from the lowest to the highest. What had seemed so self-evidently a boon to productivity was purely incidental. It is never safe to assume that a change will be an improvement. As W Edwards Deming insisted, “In God we trust. All others bring data.”

But the data still seemed to show a relentless improvement of productivity over time. The participants were all very experienced in the task at the start of the study so there should have been no learning by doing. There seemed no other explanation than that the participants were somehow subliminally motivated by the experimental setting. Or something.

Hawthorne data-2

That subliminally motivated increase in productivity came to be known as the Hawthorne effect. Attempts to explain it led to the development of whole fields of investigation and organisational theory, by Elton Mayo and others. It really was the foundation of the management consulting industry. Gillespie (supra) gives a rich and intriguing account.

A revisionist narrative

Because of the “failure” of the experiments’ purpose there was a falling off of interest and only the above summary results were ever published. The raw data were believed destroyed. Now “you know, at least you ought to know, for I have often told you so” about Shewhart’s two rules for data presentation.

  1. Data should always be presented in such a way as to preserve the evidence in the data for all the predictions that might be made from the data.
  2. Whenever an average, range or histogram is used to summarise observations, the summary must not mislead the user into taking any action that the user would not take if the data were presented in context.

The lack of any systematic investigation of the raw data led to the development of a discipline myth that every single experimental adjustment had led forthwith to an increase in productivity.

In 2009, Steven Levitt, best known to the public as the author of Freakonomics, along with John List and their research team, miraculously discovered a microfiche of the raw study data at a “small library in Milwaukee, WI” and the remainder in Boston, MA. They went on to analyse the data from scratch (Was there Really a Hawthorne Effect at the Hawthorne Plant? An Analysis of the Original Illumination Experiments, National Bureau of Economic Research, Working Paper 15016, 2009).

LevittHawthonePlot

Figure 3 of Levitt and List’s paper (reproduced above) shows the raw productivity measurements for each of the experiments. Levitt and List show how a simple plot such as this reveals important insights into how the experiments developed. It is a plot that yields a lot of information.

Levitt and List note that, in the first phase of experiments, productivity rose then fell when experiments were suspended. They speculate as to whether there was a seasonal effect with lower summer productivity.

The second period of experiments is that between the third and fourth vertical lines in the figure. Only room 1 experienced experimental variation in this period yet Levitt and List contend that productivity increased in all three rooms, falling again at the end of experimentation.

During the final period, data was only collected from room 1 where productivity continued to rise, even beyond the end of the experiment. Looking at the data overall, Levitt and List find some evidence that productivity responded more to changes in artificial light than to natural light. The evidence that increases in productivity were associated with every single experimental adjustment is weak. To this day, there is no compelling explanation of the increases in productivity.

Lessons in productivity improvement

Deming used to talk of “disappointment in great ideas”, the propensity for things that looked so good on paper simply to fail to deliver the anticipated benefits. Nobel laureate psychologist Daniel Kahneman warns against our individual bounded rationality.

To guard against entrapment by the vanity of imagination we need measurement and data to answer the ineluctable question of whether the change we implemented so passionately resulted in improvement. To be able to answer that question demands the separation of signal from noise. That requires trenchant data criticism.

And even then, some factors may yet be beyond our current knowledge. Bounded rationality again. That is why the trick of continual improvement in productivity is to use the rigorous criticism of historical data to build collective knowledge incrementally.

If you torture the data enough, nature will always confess.

Ronald Coase

Eventually.

FIFA and the Iron Law of Oligarchy

In 1911, Robert Michels embarked on one of the earliest investigations into organisational culture. Michels was a pioneering sociologist, a student of Max Weber. In his book Political Parties he aggregated evidence about a range of trade unions and political groups, in particular the German Social Democratic Party.

He concluded that, as organisations become larger and more complex, a bureaucracy inevitably forms to take, co-ordinate and optimise decisions. It is the most straightforward way of creating alignment in decision making and unified direction of purpose and policy. Decision taking power ends up in the hands of a few bureaucrats and they increasingly use such power to further their own interests, isolating themselves from the rest of the organisation to protect their privilege. Michels called this the Iron Law of Oligarchy.

These are very difficult matters to capture quantitatively and Michels’ limited evidential sampling frame has more of the feel of anecdote than data. “Iron Law” surely takes the matter too far. However, when we look at the allegations concerning misconduct within FIFA it is tempting to feel that Michels’ theory is validated, or at least has gathered another anecdote to take the evidence base closer to data.

But beyond that, what Michels surely identifies is a danger that a bureaucracy, a management cadre, can successfully isolate itself from superior and inferior strata in an organisation, limiting the mobility of business data and fostering its own ease. The legitimate objectives of the organisation suffer.

Michels failed to identify a realistic solution, being seduced by the easy, but misguided, certainties of fascism. However, I think that a rigorous approach to the use of data can guard against some abuses without compromising human rights.

Oligarchs love traffic lights

I remember hearing the story of a CEO newly installed in a mature organisation. His direct reports had instituted a “traffic light” system to report status to the weekly management meeting. A green light meant all was well. An amber light meant that some intervention was needed. A red light signalled that threats to the company’s goals had emerged. At his first meeting, the CEO found that nearly all “lights” were green, with a few amber. The new CEO perceived an opportunity to assert his authority and show his analytical skills. He insisted that could not be so. There must be more problems and he demanded that the next meeting be an opportunity for honesty and confronting reality.

At the next meeting there was a kaleidoscope of red, amber and green “lights”. Of course, it turned out that the managers had flagged as red the things that were either actually fine or could be remedied quickly. They could then report green at the following meeting. Real career limiting problems were hidden behind green lights. The direct reports certainly didn’t want those exposed.

Openness and accountability

I’ve quoted Nobel laureate economist Kenneth Arrow before.

… a manager is an information channel of decidedly limited capacity.

Essays in the Theory of Risk-Bearing

Perhaps the fundamental problem of organisational design is how to enable communication of information so that:

  • Individual managers are not overloaded.
  • Confidence in the reliable satisfaction of process and organisational goals is shared.
  • Systemic shortfalls in process capability are transparent to the managers responsible, and their managers.
  • Leading indicators yield early warnings of threats to the system.
  • Agile responses to market opportunities are catalysed.
  • Governance functions can exploit the borrowing strength of diverse data sources to identify misreporting and misconduct.

All that requires using analytics to distinguish between signal and noise. Traffic lights offer a lousy system of intra-organisational analytics. Traffic light systems leave it up to the individual manager to decide what is “signal” and what “noise”. Nobel laureate psychologist Daniel Kahneman has studied how easily managers are confused and misled in subjective attempts to separate signal and noise. It is dangerous to think that What you see is all there is. Traffic lights offer a motley cloak to an oligarch wishing to shield his sphere of responsibility from scrutiny.

The answer is trenchant and candid criticism of historical data. That’s the only data you have. A rigorous system of goal deployment and mature use of process behaviour charts delivers a potent stimulus to reluctant data sharers. Process behaviour charts capture the development of process performance over time, for better or for worse. They challenge the current reality of performance through the Voice of the Customer. They capture a shared heuristic for characterising variation as signal or noise.
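For readers who have not met one, here is a minimal sketch of the arithmetic behind one common form of process behaviour chart, the individuals and moving range (XmR) chart. It applies only the simplest detection rule, a point outside the natural process limits, and takes the usual scaling constant of 2.66 for moving ranges of two successive values. The `history` figures are invented for the illustration and the function names are my own.

```python
def xmr_limits(values):
    """Natural process limits for an individuals (XmR) chart."""
    centre = sum(values) / len(values)
    # Moving range: absolute difference between successive values.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_mr = sum(moving_ranges) / len(moving_ranges)
    return centre - 2.66 * mean_mr, centre, centre + 2.66 * mean_mr

def signals(values):
    """Points falling outside the natural process limits (1-based position)."""
    lower, _, upper = xmr_limits(values)
    return [(i, v) for i, v in enumerate(values, start=1)
            if v < lower or v > upper]

# Hypothetical measurements, invented for illustration only.
history = [10, 12, 9, 11, 8, 13, 10, 9, 12, 11, 10, 25, 9, 11]

lower, centre, upper = xmr_limits(history)
print(f"limits: {lower:.2f} to {upper:.2f}, centre line {centre:.2f}")
print("signals:", signals(history))
```

With these invented figures only the twelfth value, 25, falls outside the limits. The ups and downs among the rest are treated as noise, and that judgment is made by a shared rule rather than by whoever happens to be presenting the data.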

Individual managers may well prefer to interpret the chart with various competing narratives. The message of the data, the Voice of the Process, will not always be unambiguous. But collaborative sharing of data compels an organisation to address its structural and people issues. Shared data generation and investigation encourage an organisation to find practical ways of fostering team work, enabling problem solving and motivating participation. It is the data that can support the organic emergence of a shared organisational narrative that adds further value to the data and how it is used and developed. None of these organisational and people matters have generalised solutions but a proper focus on data drives an organisation to find practical strategies that work within their own context. And to test the effectiveness of those strategies.

Every week the press discloses allegations of hidden or fabricated assets, repudiated valuations, fraud, misfeasance, regulators blindsided, creative reporting, anti-competitive behaviour, abused human rights and freedoms.

Where a proper system of intra-organisational analytics is absent, you constantly have to ask yourself whether you have another FIFA on your hands. The FIFA allegations may be true or false but that they can be made surely betrays an absence of effective governance.

#oligarchslovetrafficlights