Regression done right: Part 3: Forecasts to believe in

There are three Sources of Uncertainty in a forecast.

  1. Whether the forecast is of “an environment that is sufficiently regular to be predictable”.1
  2. Uncertainty arising from the unexplained (residual) system variation.
  3. Technical statistical sampling error in the regression calculation.

Source of Uncertainty (3) is the one that fascinates statistical theorists. Sources (1) and (2) are the ones that obsess the rest of us. I looked at the first in Part 1 of this blog and the second in Part 2. Now I want to look at the third Source of Uncertainty and try to put everything together.

If you are really most interested in (1) and (2), read “Prediction intervals” and then skip forwards to “The fundamental theorem of prediction”.

Prediction intervals

A prediction interval2 captures the range in which a future observation is expected to fall. Bafflingly, not all statistical software generates prediction intervals automatically so it is necessary, I fear, to know how to calculate them from first principles. However, understanding the calculation is, in itself, instructive.

But I emphasise that prediction intervals rely on a presumption that what is being forecast is “an environment that is sufficiently regular to be predictable”, that the (residual) business process data is exchangeable. If that presumption fails then all bets are off and we have to rely on a Cardinal Newman analysis. Of course, when I say that “all bets are off”, they aren’t. You will still be held to your existing contractual commitments even though your confidence in achieving them is now devastated. More on that another time.

Sources of variation in predictions

In the particular case of linear regression we need further to break down the third Source of Uncertainty.

  2. Uncertainty arising from the unexplained (residual) variation.
  3. Technical statistical sampling error in the regression calculation.
    3A. Sampling error of the mean.
    3B. Sampling error of the slope.

Remember that we are, for the time being, assuming Source of Uncertainty (1) above can be disregarded. Let’s look at the other Sources of Uncertainty in turn: (2), (3A) and (3B).

Source of Variation (2) – Residual variation

We start with the Source of Uncertainty arising from the residual variation. This is the uncertainty because of all the things we don’t know. We talked about this a lot in Part 2. We are content, for the moment, that they are sufficiently stable to form a basis for prediction. We call this common cause variation. This variation has variance s², where s is the residual standard deviation that will be output by your regression software.

[Figure: RegressionResExpl2]

Source of Variation (3A) – Sampling error in mean

To understand the next Source of Variation we need to know a little bit about how the regression is calculated. The calculations start off with the respective means of the X values (X̄) and of the Y values (Ȳ). Uncertainty in estimating the mean of the Ys is the next contribution to the global prediction uncertainty.

An important part of calculating the regression line is to calculate the mean of the Ys. That mean is subject to sampling error. The variance of the sampling error is the familiar result from the statistics service course.

Variance from sampling error of the mean = s²/n

— where n is the number of pairs of X and Y. Obviously, as we collect more and more data, this term becomes more and more negligible.

[Figure: RegressionMeanExpl]

Source of Variation (3B) – Sampling error in slope

This is a bit more complicated. Skip forwards if you are already confused. Let me first give you the equation for the variance of predictions referable to sampling error in the slope.

Variance from sampling error of the slope = s²(X − X̄)²/SXX

This has now introduced the mysterious sum of squares SXX. However, before we learn exactly what this is, we immediately notice two things.

  1. As we move away from the centre of the training data the variance gets larger.3
  2. As SXX gets larger the variance gets smaller.

The reason for the increasing sampling error as we move from the mean of X is obvious from thinking about how variation in slope works. The regression line pivots on the mean. Travelling further from the mean amplifies any disturbance in the slope.

[Figure: RegressionSlopeExpl]

Let’s look at where SXX comes from. The sum of squares is calculated from the Xs alone without considering the Ys. It is a characteristic of the sampling frame that we used to train the model. We take the difference of each X value from the mean of X, and then square that distance. To get the sum of squares we then add up all those individual squares. Note that this is a sum of the individual squares, not their average.

[Figure: RegressionSXXTable]
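
As a minimal sketch, with a short made-up list of X values, the SXX calculation is just this:

```python
# A minimal sketch of the SXX calculation, using made-up X values.
X = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]

x_bar = sum(X) / len(X)                     # mean of the Xs
SXX = sum((x - x_bar) ** 2 for x in X)      # sum, not average, of the squared deviations

print(x_bar, SXX)
```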

Two things then become obvious (if you think about it).

  1. As we get more and more data, SXX gets larger.
  2. As the individual Xs spread out over a greater range of X, SXX gets larger.

What that (3B) term does emphasise is that even sampling error escalates as we exploit the edge of the original training data. As we extrapolate clear of the original sampling frame, the pure sampling error can quickly exceed even the residual variation.

Yet it is only a lower bound on the uncertainty in extrapolation. As we move away from the original range of Xs, however happy we were previously with Source of Uncertainty (1), that the data was from “an environment that is sufficiently regular to be predictable”, the question barges back in. We are now remote from our experience base in time and boundary. Nothing outside the original X-range will ever be a candidate for a comfort zone.

The fundamental theorem of prediction

Variances, generally, add up so we can sum the three Sources of Variation (2), (3A) and (3B). That gives the variance of an individual prediction, spred². By an individual prediction I mean that somebody gives me an X and I use the regression formula to give them the (as yet unknown) corresponding Ypred.

spred² = s² + s²/n + s²(X − X̄)²/SXX

It is immediately obvious that s² is common to all three terms. However, the second and third terms, the sampling errors, can be made as small as we like by collecting more and more data. Collecting more and more data will have no impact on the first term. That arises from the residual variation. The stuff we don’t yet understand. It has variance s², where s is the residual standard deviation that will be output by your regression software.

This, I say, is the fundamental theorem of prediction. The unexplained variation provides a hard limit on the precision of forecasts.
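
A quick numerical illustration of that hard limit, a sketch under assumed values (s set to 1, a hypothetical evenly spread training design and a new X of 8): as n grows, the two sampling terms dwindle away but spred² never falls below s².

```python
import numpy as np

# Assumed residual standard deviation and a hypothetical new X at which to predict.
s = 1.0
x_new = 8.0

for n in (10, 100, 1000, 10000):
    X = np.linspace(0, 10, n)             # hypothetical, evenly spread training Xs
    SXX = np.sum((X - X.mean()) ** 2)
    spred_sq = s**2 + s**2 / n + s**2 * (x_new - X.mean()) ** 2 / SXX
    print(n, spred_sq)                    # approaches s**2 = 1 from above, never below
```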

It is then a very simple step to convert the variance into a standard deviation, spred. This is the standard error of the prediction.4,5

spred = s √( 1 + 1/n + (X − X̄)²/SXX )

Now, in general, where we have a measurement or prediction z that has an uncertainty that can be characterised by a standard error u, there is an old trick for putting an interval round it. Remember that u is a measure of the variation in z. We can therefore put an interval around z as a number of standard errors, z ± ku. Here, k is a constant of your choice. A prediction interval for the regression that generates prediction Ypred then becomes:

Ypred ± k spred

Choosing k=3 is very popular, conservative and robust.6,7 Other choices of k are available on the advice of a specialist mathematician.
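
To make the arithmetic concrete, here is a minimal sketch in Python that fits a straight line by least squares and then builds a prediction interval with k = 3 from the formulae above. The data are made up purely for illustration; with real data the residuals would, of course, first be checked for exchangeability.

```python
import numpy as np

# Made-up training data, for illustration only.
X = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
Y = np.array([3.1, 5.2, 5.9, 8.3, 9.8, 13.2])
n = len(X)

# Fit Y = mX + c by least squares.
m, c = np.polyfit(X, Y, 1)

# Residual standard deviation s (n - 2 degrees of freedom for a straight line).
residuals = Y - (m * X + c)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Sum of squares of the Xs about their mean.
SXX = np.sum((X - X.mean()) ** 2)

# Prediction and its standard error at a new X.
x_new = 8.0
y_pred = m * x_new + c
s_pred = s * np.sqrt(1 + 1 / n + (x_new - X.mean()) ** 2 / SXX)

# Prediction interval with k = 3.
k = 3
print(y_pred - k * s_pred, y_pred + k * s_pred)
```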

It was Shewhart himself who took this all a bit further and defined tolerance intervals which contain a given proportion of future observations with a given probability.8 They are very much for the specialist.

Source of Variation (1) – Special causes

But all that assumes that we are sampling from “an environment that is sufficiently regular to be predictable”, that the residual variation is solely common cause. We checked that out on our original training data but the price of predictability is eternal vigilance. It can never be taken for granted. At any time fresh causes of variation may infiltrate the environment, or become newly salient because of some sensitising event or exotic interaction.

The real trouble with this world of ours is not that it is an unreasonable world, nor even that it is a reasonable one. The commonest kind of trouble is that it is nearly reasonable, but not quite. Life is not an illogicality; yet it is a trap for logicians. It looks just a little more mathematical and regular than it is; its exactitude is obvious, but its inexactitude is hidden; its wildness lies in wait.

G K Chesterton

The remedy for this risk is to continue plotting the residuals, the differences between the observed value and, now, the prediction. This is mandatory.

[Figure: RegressionPBC2]
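
As a sketch of what that ongoing monitoring might look like: the natural process limits of an individuals (XmR) chart are conventionally set at the mean of the residuals plus or minus 2.66 times the mean moving range, and any residual beyond them is a signal worth investigating. The residuals below are made up for illustration.

```python
import numpy as np

# Made-up residuals in time order, for illustration only.
residuals = np.array([0.3, -0.5, 0.1, 0.4, -0.2, -0.6, 0.2, 0.5, -0.1, 2.5])

centre = residuals.mean()
mr_bar = np.abs(np.diff(residuals)).mean()   # mean moving range

# Natural process limits for an individuals (XmR) chart.
unpl = centre + 2.66 * mr_bar
lnpl = centre - 2.66 * mr_bar

for i, r in enumerate(residuals):
    if r > unpl or r < lnpl:
        print(f"Observation {i}: residual {r} signals a possible special cause")
```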

Whenever we observe a signal of a potential special cause it puts us on notice to protect the forecast-user because our ability to predict the future has been exposed as deficient and fallible. But it also presents an opportunity. With timely investigation, a signal of a possible special cause may provide deeper insight into the variation of the cause-system. That in itself may lead to identifying further factors to build into the regression and a consequential reduction in s².

It is reducing s², by progressively accumulating understanding of the cause-system and developing the model, that leads to more precise, and more reliable, predictions.

Notes

  1. Kahneman, D (2011) Thinking, Fast and Slow, Allen Lane, p240
  2. Hahn, G J & Meeker, W Q (1991) Statistical Intervals: A Guide for Practitioners, Wiley, p31
  3. In fact s²/SXX is the sampling variance of the slope. The standard error of the slope is, notoriously, s/√SXX. A useful result sometimes. It is then obvious from the figure how variation in slope is amplified as we travel farther from the centre of the Xs.
  4. Draper, N R & Smith, H (1998) Applied Regression Analysis, 3rd ed., Wiley, pp81-83
  5. Hahn & Meeker (1991) p232
  6. Wheeler, D J (2000) Normality and the Process Behaviour Chart, SPC Press, Chapter 6
  7. Vysochanskij, D F & Petunin, Y I (1980) “Justification of the 3σ rule for unimodal distributions”, Theory of Probability and Mathematical Statistics 21: 25–36
  8. Hahn & Meeker (1991) p231

Regression done right: Part 2: Is it worth the trouble?

In Part 1 I looked at linear regression from the point of view of machine learning and asked the question whether the data was from “An environment that is sufficiently regular to be predictable.”1 The next big question is whether it was worth it in the first place.

Variation explained

We previously looked at regression in terms of explaining variation. The original Big Y was beset with variation and uncertainty. We believed that some of that variation could be “explained” by a Big X. The linear regression split the variation in Y into variation that was explained by X and residual variation whose causes are as yet obscure.

I slipped in the word “explained”. Here it really means that we can draw a straight line relationship between X and Y. Of course, it is trite analytics that “association is not causation”. As long ago as 1710, Bishop George Berkeley observed that:2

The Connexion of Ideas does not imply the Relation of Cause and Effect, but only a Mark or Sign of the Thing signified.

Causation turns out to be a rather slippery concept, as all lawyers know, so I am going to leave it alone for the moment. There is a rather good discussion by Stephen Stigler in his recent book The Seven Pillars of Statistical Wisdom.3

That said, in real world practical terms there is not much point bothering with this if the variation explained by the X is small compared to the original variation in the Y with the majority of the variation still unexplained in the residuals.

Measuring variation

A useful measure of the variation in a quantity is its variance, familiar from the statistics service course. Variance is a good straightforward measure of the financial damage that variation does to a business.4 It also has the very nice property that we can add variances from sundry sources that aggregate together. Financial damage adds up. The very useful job that linear regression does is to split the variance of Y, the damage to the business that we captured with the histogram, into two components:

  • The contribution from X; and
  • The contribution of the residuals.

[Figure: RegressionBlock1]

The important thing to remember is that the residual variation is not some sort of technical statistical artifact. It is the aggregate of real world effects that remain unexamined and which will continue to cause loss and damage.

[Figure: RegressionIshikawa2]

Techie bit

Variance is the square of the standard deviation. Your linear regression software will output the residual standard deviation, s, sometimes unhelpfully referred to as the residual standard error. The calculations are routine.5 Square s to get the residual variance, s². The smaller s² is, the better. A small s² means that not much variation remains unexplained and that we have a very good understanding of the cause system. A large s² means that much variation remains unexplained and our understanding is weak.

[Figure: RegressionBlock2]

The coefficient of determination

So how do we decide whether s² is “small”? Dividing the variation explained by X by the total variance of Y, sY², yields the coefficient of determination, written as R².6 That is a bit of a mouthful so we usually just call it “R-squared”. R² sets the variance in Y to 100% and expresses the explained variation as a percentage. Put another way, it is the percentage of variation in Y explained by X.
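
As a minimal sketch, with made-up data, s, s² and R² can be computed directly from the fitted line; the usual computing formula, R² = 1 − (residual sum of squares)/(total sum of squares), is used here.

```python
import numpy as np

# Made-up data, for illustration only.
X = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
Y = np.array([3.1, 5.2, 5.9, 8.3, 9.8, 13.2])
n = len(X)

m, c = np.polyfit(X, Y, 1)          # least-squares slope and intercept
residuals = Y - (m * X + c)

s = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # residual standard deviation
sse = np.sum(residuals ** 2)                    # unexplained (residual) sum of squares
sst = np.sum((Y - Y.mean()) ** 2)               # total sum of squares of Y

r_squared = 1 - sse / sst
print(s, s ** 2, 100 * r_squared)   # s, s-squared and R-squared as a percentage
```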

[Figure: RegressionBlock3]

The important thing to remember is that the residual variation is not a statistical artifact of the analysis. It is part of the real world business system, the cause-system of the Ys.7 It is the part of which you still have little quantitative grasp and which continues to hurt you. Returning to the cause and effect diagram, we picked one factor X to investigate and took its influence out of the data. The residual variation is the variation arising from the aggregate of all the other causes.

As we shall see in more detail in Part 3, the residual variation imposes a fundamental bound on the precision of predictions from the model. It turns out that s is the limiting standard error of future predictions.

Whether your regression was a worthwhile one or not, you will want to probe the residual variation further. A technique like DMAIC works well. Other improvement processes are available.

So how big should R² be? Well, that is a question for your business leaders, not a statistician. How much does the business gain financially from being able to explain just so much variation in the outcome? Anybody with an MBA should be able to answer this so you should have somebody in your organisation who can help.

The correlation coefficient

Some people like to take the square root of R² to obtain what they call a correlation coefficient. I have never been clear as to what this was trying to achieve. It always ends up telling me less than the scatter plot. So why bother? R² tells me something important that I understand and need to know. Leave it alone.

What about statistical significance?

I fear that “significance” is, pace George Miller, “a word worn smooth by many tongues”. It is a word that I try to avoid. Yet it seems a natural practice for some people to calculate a p-value and ask whether the regression is significant.

I have criticised p-values elsewhere. I might calculate them sometimes but only because I know what I am doing. The terrible fact is that if you collect sufficient data then your regression will eventually be significant. Statistical significance only tells me that you collected a lot of data. That’s why so many studies published in the press are misleading. Collect enough data and you will get a “significant” result. It doesn’t mean it matters in the real world.

R² is the real world measure of sensible trouble (relatively) impervious to statistical manipulation. I can make p as small as I like just by collecting more and more data. In fact there is an equation that, for any given R², links p and the number of observations, n, for linear regression.8

p = 1 − F1, n−2[ (n − 2)R² / (1 − R²) ]

Here, Fμ, ν(x) is the cumulative distribution function of the F-distribution with μ and ν degrees of freedom. A little playing about with that equation in Excel will reveal that you can make p as small as you like without R² changing at all, simply by making n larger. Collecting data until p is small is mere p-hacking. All p-values should be avoided by the novice. R² is the real world measure (relatively) impervious to statistical manipulation. That is what I am interested in. And what your boss should be interested in.
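
The same playing about can be done in a few lines of Python instead of Excel. This sketch, which assumes scipy is available, holds R² fixed at a modest 10% and watches p collapse as n grows.

```python
from scipy.stats import f

r_squared = 0.10                        # a modest R-squared, held fixed
for n in (20, 50, 100, 500, 1000, 10000):
    F_obs = (n - 2) * r_squared / (1 - r_squared)
    p = f.sf(F_obs, 1, n - 2)           # survival function: P(F > F_obs)
    print(n, p)
```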

Next time

Once we are confident that our regression model is stable and predictable, and that the regression is worth having, we can move on to the next stage.

Next time I shall look at prediction intervals and how to assess uncertainty in forecasts.

References

  1. Kahneman, D (2011) Thinking, Fast and Slow, Allen Lane, p240
  2. Berkeley, G (1710) A Treatise Concerning the Principles of Human Knowledge, Part 1, Dublin
  3. Stigler, S M (2016) The Seven Pillars of Statistical Wisdom, Harvard University Press, pp141-148
  4. Taguchi, G (1987) The System of Experimental Design: Engineering Methods to Optimize Quality and Minimize Costs, Quality Resources
  5. Draper, N R & Smith, H (1998) Applied Regression Analysis, 3rd ed., Wiley, p30
  6. Draper & Smith (1998) p33
  7. For an appealing discussion of cause-systems from a broader cultural standpoint see: Bostridge, I (2015) Schubert’s Winter Journey: Anatomy of an Obsession, Faber, pp358-365
  8. Draper & Smith (1998) p243

Regression done right: Part 1: Can I predict the future?

I recently saw an article in the Harvard Business Review called “Refresher on Regression Analysis”. I thought it was horrible so I wanted to set the record straight.

Linear regression from the viewpoint of machine learning

Linear regression is important, not only because it is a useful tool in itself, but because it is (almost) the simplest statistical model. The issues that arise in a relatively straightforward form are issues that beset the whole of statistical modelling and predictive analytics. Anyone who understands linear regression properly is able to ask probing questions about more complicated models. The complex internal algorithms of Kalman filters, ARIMA processes and artificial neural networks are accessible only to the specialist mathematician. However, each has several general features in common with simple linear regression. A thorough understanding of linear regression enables a due diligence of the claims made by the machine learning advocate. Linear regression is the paradigmatic exemplar of machine learning.

There are two principal questions that I want to talk about that are the big takeaways of linear regression. They are always the first two questions to ask in looking at any statistical modelling or machine learning scenario.

  1. What predictions can I make (if any)?
  2. Is it worth the trouble?

I am going to start looking at (1) in this blog, look at (2) in Part 2, and then return to complete (1) in Part 3.

Variation, variation, variation

Variation is a major problem for business, the tendency of key measures to fluctuate irregularly. Variation leads to uncertainty. Will the next report be high or low? Or in the middle? Because of the uncertainty we have to allow safety margins or swallow some non-conformances. We have good days and bad days, good products and not so good. We have to carry costly working capital because of variation in cash flow. And so on.

We learned in our high school statistics class to characterise variation in a key process measure, call it the Big Y, by a histogram of observations. Perhaps we are bothered by the fluctuating level of monthly sales.

[Figure: RegressionHistogram]

The variation arises from a whole ecology of competing and interacting effects and factors that we call the cause-system of the outcome. In general, it is very difficult to single out individual factors as having been the cause of a particular observation, so entangled are they. It is still useful to capture them for reference on a cause and effect diagram.

[Figure: RegressionIshikawa]

One of the strengths of the cause and effect diagram is that it may prompt the thought that one of the factors is particularly important. Call it Big X; perhaps it is “hours of TV advertising” (my age is showing). Motivated by that we can generate a sample of corresponding measurements of both the Y and the X and plot them on a scatter plot.

[Figure: RegressionScatter1]

Well what else is there to say? The scatter plot shows us all the information in the sample. Scatter plots are an important part of what statistician John Tukey called Exploratory Data Analysis (EDA). We have some hunches and ideas, or perhaps hardly any idea at all, and we attack the problem by plotting the data in any way we can think of. So much easier now than when W Edwards Deming wrote:1

[Statistical practice] means tedious work, such as studying the data in various forms, making tables and charts and re-making them, trying to use and preserve the evidence in the results and to be clear enough to the reader: to endure disappointment and discouragement.

Or as Chicago economist Ronald Coase put it:

If you torture the data enough, nature will always confess.

The scatter plot is a fearsome instrument of data torture. It tells me everything. It might even tempt me to think that I have a basis on which to make predictions.

Prediction

In machine learning terms, we can think of the sample used for the scatter plot as a training set of data. It can be used to set up, “train”, a numerical model that we will then fix and use to predict future outcomes. The scatter plot strongly suggests that if we know a future X alone we can have a go at predicting the corresponding future Y. To see that more clearly we can draw a straight line by hand on the scatter plot, just as we did in high school before anybody suggested anything more sophisticated.

[Figure: RegressionScatter2]

Given any particular X we can read off the corresponding Y.

[Figure: RegressionScatter3]

The immediate insight that comes from drawing in the line is that not all the observations lie on the line. There is variation about the line so that there is actually a range of values of Y that seem plausible and consistent for any specified X. More on that in Parts 2 and 3.

In understanding machine learning it makes sense to start by thinking about human learning. Psychologists Gary Klein and Daniel Kahneman investigated how firefighters were able to perform so successfully in assessing a fire scene and making rapid, safety critical decisions. Lives of the public and of other firefighters were at stake. This is the sort of human learning situation that machines, or rather their expert engineers, aspire to emulate. Together, Klein and Kahneman set out to describe how the brain could build up reliable memories that would be activated in the future, even in the agony of the moment. They came to the conclusion that there are two fundamental conditions for a human to acquire a skill.2

  • An environment that is sufficiently regular to be predictable.
  • An opportunity to learn these regularities through prolonged practice

The first bullet point is pretty much the most important idea in the whole of statistics. Before we can make any prediction from the regression, we have to be confident that the data has been sampled from “an environment that is sufficiently regular to be predictable”. The regression “learns” from those regularities, where they exist. The “learning” turns out to be the rather prosaic mechanics of matrix algebra as set out in all the standard texts.3 But that, after all, is what all machine “learning” is really about.

Statisticians capture the psychologists’ “sufficiently regular” through the mathematical concept of exchangeability. If a process is exchangeable then we can assume that the distribution of events in the future will be like the past. We can project our historic histogram forward. With regression we can do better than that.

Residuals analysis

Formally, the linear regression calculation estimates the characteristics of the model:

Y = mX + c + “stuff”

The “mX+c” bit is the familiar high school mathematics equation for a straight line. The “stuff” is variation about the straight line. What the linear regression mathematics does is (objectively) to calculate the m and c and then also tell us something about the “stuff”. It splits the variation in Y into two components:

  • What can be explained by the variation in X; and
  • The, as yet unexplained, variation in the “stuff”.

The first thing to learn about regression is that it is the “stuff” that is the interesting bit. In 1849 British astronomer Sir John Herschel observed that:

Almost all the greatest discoveries in astronomy have resulted from the consideration of what we have elsewhere termed RESIDUAL PHENOMENA, of a quantitative or numerical kind, that is to say, of such portions of the numerical or quantitative results of observation as remain outstanding and unaccounted for after subducting and allowing for all that would result from the strict application of known principles.

The straight line represents what we guessed about the causes of variation in Y and which the scatter plot confirmed. The “stuff” represents the causes of variation that we failed to identify and that continue to limit our ability to predict and manage. We call the predicted Ys that correspond to the measured Xs, and lie on the fitted straight line, the fits.

fiti = mXi + c

The residual values, or residuals, are obtained by subtracting the fits from the respective observed Y values. The residuals represent the “stuff”. Statistical software does this for us routinely. If yours doesn’t then bin it.

residuali = Yi – fiti

[Figure: RegressionScatter4]
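
As a minimal sketch with made-up data, the whole calculation of slope, intercept, fits and residuals takes only a few lines; any respectable statistical software will do the same for you.

```python
import numpy as np

# Made-up training data, for illustration only.
X = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 12.0])
Y = np.array([3.1, 5.2, 5.9, 8.3, 9.8, 13.2])

m, c = np.polyfit(X, Y, 1)   # least-squares estimates of the slope m and intercept c

fits = m * X + c             # fit_i = m * X_i + c
residuals = Y - fits         # residual_i = Y_i - fit_i, the "stuff"

print(m, c)
print(residuals)
```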

There are a number of properties that the residuals need to satisfy for the regression to work. Investigating those properties is called residuals analysis.4 As far as use for prediction is concerned, it is sufficient that the “stuff”, the variation about the straight line, be exchangeable.5 That means that the “stuff” so far must appear from the data to be exchangeable and further that we have a rational belief that such a cause system will continue unchanged into the future. Shewhart charts are the best heuristics for checking the requirement for exchangeability, certainly as far as the historical data is concerned. Our first and, be under no illusion, mandatory check on the ability of the linear regression, or any statistical model, to make predictions is to plot the residuals against time on a Shewhart chart.

[Figure: RegressionPBC]

If there are any signals of special causes then the model cannot be used for prediction. It just can’t. For prediction we need residuals that are all noise and no signal. However, like all signals of special causes, such will provide an opportunity to explore and understand more about the cause system. The signal that prevents us from using this regression for prediction may be the very thing that enables an investigation leading to a superior model, able to predict more exactly than we ever hoped the failed model could. And even if there is sufficient evidence of exchangeability from the training data, we still need to continue vigilance and scrutiny of all future residuals to look out for any novel signals of special causes. Special causes that arise post-training provide fresh information about the cause system while at the same time compromising the reliability of the predictions.

Thorough regression diagnostics will also be able to identify issues such as serial correlation, lack of fit, leverage and heteroscedasticity. Residuals analysis is essential to regression and its omission is intolerable. It is one of Stephen Stigler’s Seven Pillars of Statistical Wisdom.6 As Tukey said:

The greatest value of a picture is when it forces us to notice what we never expected to see.

To come:

Part 2: Is my regression significant? … is a dumb question.
Part 3: Quantifying predictions with statistical intervals.

References

  1. Deming, W E (1975) “On probability as a basis for action”, The American Statistician 29(4) pp146-152
  2. Kahneman, D (2011) Thinking, Fast and Slow, Allen Lane, p240
  3. Draper, N R & Smith, H (1998) Applied Regression Analysis, 3rd ed., Wiley, p44
  4. Draper & Smith (1998) Chs 2, 8
  5. I have to admit that weaker conditions may be adequate in some cases but these are far beyond any other than a specialist mathematician.
  6. Stigler, S M (2016) The Seven Pillars of Statistical Wisdom, Harvard University Press, Chapter 7

On leadership and the Chinese contract

[Figure: Hanyu trad simp.svg]

Between 1958 and 1960, 67 of the 120 inhabitants of the Chinese village of Xiaogang starved to death. But Mao Zedong’s cruel and incompetent collectivist policies continued to be imposed into the 1970s. In December 1978, 18 of Xiaogang’s leading villagers met secretly and illegally to find a way out of borderline starvation and grinding poverty. The first person to speak up at the meeting was Yan Jingchang. He suggested that the village’s principal families clandestinely divide the collective farm’s land among themselves. Then each family should own what it grew. Jingchang drew up an agreement on a piece of paper for the others to endorse. Then he hid it in a bamboo tube in the rafters of his house. Had it been discovered Jingchang and the village would have suffered brutal punishment and reprisal as “counter-revolutionaries”.

The village prospered under Jingchang’s structure. During 1979 the village produced more than it had in the previous five years. That attracted the attention of the local Communist Party chief who summoned Jingchang for interrogation. Jingchang must have given a good account of what had been happening. The regional party chief became intrigued at what was going on and prepared a report on how the system could be extended across the whole region.

Mao had died in 1976 and, amid the emerging competitors for power, it was still uncertain as to how China would develop economically and politically. By 1979, Deng Xiaoping was working his way towards the effective leadership of China. The report into the region’s proposals for agricultural reform fell on his desk. His contribution to the reforms was that he did nothing to stop them.

I have often found the idea of leadership a rather dubious one and wondered whether it actually described anything. It was, I think, Goethe who remarked that “When an idea is wanting, a word can always be found to take its place.” I have always been tempted to suspect that that was the case with “leadership”. However, the Jingchang story did make me think.1 If there is such a thing as leadership then this story exemplifies it and it is worth looking at what was involved.

Personal risk

This leader took personal risks. Perhaps to do otherwise is to be a mere manager. A leader has, to use the graphic modern idiom, “skin in the game”. The risk could be financial or reputational, or to liberty and life.

Luck

Luck is the converse of risk. Real risks carry the danger of failure and the consequences thereof. Jingchang must have been aware of that. Napoleon is said to have complained, “I have plenty of clever generals but just give me a lucky one.”2 Had things turned out differently with the development of Chinese history, the personalities of the party officials or Deng’s reaction, we would probably never have heard of Jingchang. I suspect though that the history of China since the 1970s would not have been very different.

The more I practice, the luckier I get.

Gary Player
South African golfer

Catalysing alignment

It was Jingchang who drew up the contract, who crystallised the various ideas, doubts, ambitions and thoughts into a written agreement. In law we say that a valid contract requires a consensus ad idem, a meeting of minds. Jingchang listened to the emerging appetite of the other villagers and captured it in a form in which all could invest. I think that is a critical part of leadership. A leader catalyses alignment and models constancy of purpose.

However, this sort of leadership may not be essential in every system. Management scientists are enduringly fascinated by The Morning Star Company, a California tomato grower that functions without any conventional management. The particular needs and capabilities of the individuals interact to create an emergent order that evolves and responds to external drivers. Austrian economist Friedrich Hayek coined the term catallaxy for a self-organising system of voluntary co-operation and explained how such a thing could arise and sustain itself, and what its benefits to society would be.3

But sometimes the system needs the spark of a leader like Jingchang who puts himself at risk and creates a vivid vision of the future state against which followers can align.

Deng kept out of the way. Jingchang put himself on the line. The most important characteristic of leadership is the sagacity to know when the system can manage itself and when to intervene.

References

  1. I have this story from Matt Ridley (2015) The Evolution of Everything, Fourth Estate
  2. Apocryphal I think.
  3. Hayek, F A (1982) Law, Legislation, and Liberty, vol.2, Routledge, pp108–9

Imagination, data and leadership

I had an intriguing insight into the nature of imagination the other evening when I was watching David Eagleman’s BBC documentary The Brain which you can catch on iPlayer until 27 February 2016 if you have a UK IP address.

Eagleman told the strange story of Henry Molaison. Molaison suffered from debilitating epilepsy following a bicycle accident when he was nine years old. At age 27, Molaison underwent radical brain surgery that removed, all but completely, his hippocampi. The intervention stabilised the epilepsy but left Molaison’s memory severely impaired. Though he could recall his childhood, Molaison had no recall of events in the years leading up to his surgery and was unable to create new long-term memories. The case was important evidence for the theory that the hippocampus is critical to memory function. Molaison, having lost his, was profoundly compromised as to recall.

But Eagleman’s analysis went further and drew attention to a passage in an interview with Molaison later in his life.1 Though his presenting symptoms post-intervention were those of memory loss, Molaison also encountered difficulty in talking about what he would do the following day. Eagleman advances the theory that the hippocampus is critical, not only to memory, but to imagining the future. The systems that create memories are common to those that generate a model by which we can forecast, predict and envision novel outcomes.

I blogged about imagination back in November and how it was pivotal to core business activities from invention and creativity to risk management and root cause analysis. If Eagleman’s theory about the entanglement of memory and imagination is true then it might have profound implications for management. Perhaps our imagination will only function as well as our memory. That was, apparently, the case with Molaison. It could just be that an organisation’s ability to manage the future depends upon the same systems as those by which it critically captures the past.

That chimes with a theory of innovation put forward by W Brian Arthur of the Santa Fe Institute.2 Arthur argues that purportedly novel inventions are no more than combinations of known facts. There are no great leaps of creativity, just the incremental variation of a menagerie of artifacts and established technologies. Ideas similar to Arthur’s have been advanced by Matt Ridley,3,4 and Steven Berlin Johnson.5 Only mastery of the present exposes the opportunities to innovate. They say.

Data

This all should be no surprise to anybody experienced in business improvement. Diligent and rigorous criticism of historical data is the catalyst of change and the foundation of realising a vivid future. This is a good moment to remind ourselves of the power of the process behaviour chart in capturing learning and creating an organisational memory.

[Figure: GenericPBC]

The process behaviour chart provides a cogent record of the history of operation of a business process, its surprises and disappointments, existential risks and epochs of systematic productivity. It records attempted business solutions, successful, failed, temporary and partial work-rounds. It segregates signal from noise. It suggests realistic bounds on prediction. It is the focus of inclusive discussion about what the data means. It is the live report of experimentation and investigation, root cause analysis and problem solving. It matches data with its historical context. It is the organisation’s memory of development of a business process, and the people who developed it. It is the basis for creating the future.

If you are not familiar with how process behaviour charts work in this context then have a look at Don Wheeler’s example of A Japanese Control Chart.6

Leadership

Tim Harford tries to take the matter further.7 On Harford’s account of invention, “trial and error” consistently outperforms “expert leadership” through a Darwinian struggle of competing ideas. The successful innovations, Harford says, propagate by adoption and form an ecology of further random variation, out of which the best ideas emergently repeat the cycle of birth and death. Of course, Leo Tolstoy wrote War and Peace, his “airport novel” avant la lettre (also currently being dramatised by the BBC), to support exactly this theory of history. In Tolstoy’s intimate descriptions of the Battles of Austerlitz and Borodino, combatants lose contact with their superiors, battlefields are hidden from the commanding generals by smoke, individuals act on impulse and in despite of field discipline. How, Tolstoy asked in terms, could anyone claim to be the organising intelligence of victory or the culpable author of defeat?

However, I think that a view of war at odds with Tolstoy’s is found in the career of General George Marshall.8 Marshall rose to the rank of General of the Army of the USA as an expert in military logistics rather than as a commander in the field. Reading a biography of Marshall presents an account of war as a contest of supply chains. The events of the theatre of operations may well be arbitrary and capricious. It was the delivery of superior personnel and materiel to the battlefield that would prove decisive. That does not occur without organisation and systematic leadership. I think.

Harford and the others argue that, even were the individual missing from history, the innovation would still have occurred. But even though it could have been anyone, it still had to be someone. And what that someone had to provide was leadership to bring the idea to market or into operation. We would still have motor cars without Henry Ford and tablet devices without Steve Jobs but there would have been two other names who had put themselves on the line to create something out of nothing.

In my view, the evolutionary model of innovation is interesting but stretches a metaphor too far. Innovation demands leadership. The history of barbed wire is instructive.9 In May 1873, at a county fair in Illinois, Henry B Rose displayed a comical device to prevent cattle beating down primitive fencing, a “wooden strip with metallic points”. The device hung round the cattle’s horns and any attempts to butt the fence drove the spikes into the beast’s head. It didn’t catch on but at the fair that day were Joseph Glidden, Isaac L Ellwood and Jacob Haish. The three went on, within a few months, each to invent barbed wire. The winning memes often come from failed innovation.

Leadership is critical, not only in scrutinising innovation but in organising the logistics that will bring it to market.10 More fundamentally, leadership is pivotal in creating the organisation in which diligent criticism of historical data is routine and where it acts as a catalyst for innovation.11

References

  1. http://www.sciencemuseum.org.uk/visitmuseum_OLD/galleries/who_am_i/~/media/8A897264B5064BC7BE1D5476CFCE50C5.ashx, retrieved 29 January 2016, at p5
  2. Arthur, W B (2009) The Nature of Technology: What it is and How it Evolves, The Free Press/ Penguin Books.
  3. Ridley, M (2010) The Rational Optimist, Fourth Estate
  4. — (2015) The Evolution of Everything, Fourth Estate
  5. Johnson, S B (2010) Where Good Ideas Come From: The Seven Patterns of Innovation, Penguin
  6. Wheeler, D J (1992) Understanding Statistical Process Control, SPC Press
  7. Harford, T (2011) Adapt: Why Success Always Starts with Failure, Abacus
  8. Cray, E (2000) General of the Army: George C. Marshall, Soldier and Statesman, Cooper Square Press
  9. Krell, A (2002) The Devil’s Rope: A Cultural History of Barbed Wire, Reaktion Books
  10. Armytage, W H G (1976) A Social History of Engineering, 4th ed., Faber
  11. Nonaka, I & Takeuchi, H (1995) The Knowledge-Creating Company: How Japanese Companies Create the Dynamics of Innovation, Oxford University Press

Why did the polls get it wrong?

This week has seen much soul-searching by the UK polling industry over their performance leading up to the 2015 UK general election on 7 May. The polls had seemed to predict that the Conservative and Labour Parties were neck and neck on the popular vote. In the actual election, the Conservatives polled 37.8% to Labour’s 31.2%, leading to a working majority in the House of Commons once the votes were divided among the seats contested. I can assure my readers that it was a shock result. Over breakfast on 7 May I told my wife that the probability of a Conservative majority in the House was nil. I hold my hands up.

An enquiry was set up by the industry led by the National Centre for Research Methods (NCRM). They presented their preliminary findings on 19 January 2016. The principal conclusion was that the failure to predict the voting share was because of biases in the way that the data were sampled and inadequate methods for correcting for those biases. I’m not so sure.

Population -> Frame -> Sample

The first thing students learn when studying statistics is the critical importance, and practical means, of specifying a sampling frame. If the sampling frame is not representative of the population of concern then simply collecting more and more data will not yield a prediction of greater accuracy. The errors associated with the specification of the frame are inherent to the sampling method. Creating a representative frame is very hard in opinion polling because of the difficulty in contacting particular individuals efficiently. It turns out that Conservative voters are harder than Labour voters to get hold of, so that they can be questioned. The NCRM study concluded that, within the commercial constraints of an opinion poll, there was a lower probability that a Conservative voter would be contacted. They therefore tended to be under-represented in the data causing a substantial bias towards Labour.

This is a well known problem in polling practice and there are demographic factors that can be used to make a statistical adjustment. Samples can be stratified. NCRM concluded that, in the run-up to the 2015 election, there were important biases tending to understate the Conservative vote and that the existing correction factors were inadequate. Fresh sampling strategies were needed to eradicate the bias and improve prediction. There are understandable fears that this will make polling more costly. More calls will be needed to catch Conservatives at home.

Of course, that all sounds an eminently believable narrative. These sorts of sampling frame biases are familiar but enormously troublesome for pollsters. However, I wanted to look at the data myself.

Plot data in time order

That is the starting point of all statistical analysis. Polls continued after the election, though with lesser frequency. I wanted to look at that data after the election in addition to the pre-election data. Here is a plot of poll results against time for Conservative and Labour. I have used data from 25 January to the end of 2015.1, 2 I have not managed to jitter the points so there is some overprinting of Conservative by Labour pre-election.

[Figure: Polling201501]

Now that is an arresting plot. Yet again plotting against time elucidates the cause system. Something happened on the date of the election. Before the election the polls had the two parties neck and neck. The instant (sic) the election was done there was clear red/blue water between the parties. Applying my (very moderate) level of domain knowledge to the pre-election data, the poll results look stable and predictable. There is a shift after the election to a new datum that remains stable and predictable. The respective arithmetic means are given below.

Party          Mean poll before    Election result    Mean poll after
Conservative       33.3%               37.8%              38.8%
Labour             33.5%               31.2%              30.9%

The mean of the post-election polls is doing fairly well but is markedly different from the pre-election results. Now, it is trite statistics that the variation we observe on a chart is the aggregate of variation from two sources.

  • Variation from the thing of interest; and
  • Variation from the measurement process.

As far as I can gather, the sampling methods used by the polling companies have not so far been modified. They were awaiting the NCRM report. They certainly weren’t modified in the few days following the election. The abrupt change on 7 May cannot be because of corrected sampling methods. The misleading pre-election data and the “impressive” post-election polls were derived from common sampling practices. It seems to me difficult to reconcile NCRM’s narrative to the historical data. The shift in the data certainly needs explanation within that account.

What did change on the election date was that a distant intention turned into the recall of a past action. What everyone wants to know in advance is the result of the election. Unsurprisingly, and as we generally find, it is not possible to sample the future. Pollsters, and their clients, have to be content with individuals’ perceptions of how they will vote. The vast majority of people pay very little attention to politics at all and the general level of interest outside election time is de minimis. Standing in a polling booth with a ballot paper is a very different matter from being asked about intentions some days, weeks or months hence. Most people take voting very seriously. It is not obvious that the same diligence is directed towards answering pollster’s questions.

Perhaps the problems aren’t statistical at all and are more concerned with what psychologists call affective forecasting, predicting how we will feel and behave under future circumstances. Individuals are notoriously susceptible to all sorts of biases and inconsistencies in such forecasts. It must at least be a plausible source of error that intentions are only imperfectly formed in advance and mapping into votes is not straightforward. Is it possible that after the election respondents, once again disengaged from politics, simply recalled how they had voted in May? That would explain the good alignment with actual election results.

Imperfect foresight of voting intention before the election and 20/25 hindsight after is, I think, a narrative that sits well with the data. There is no reason whatever why internal reflections in the Cartesian theatre of future voting should be an unbiased predictor of actual votes. In fact, I think it would be a surprise, and one demanding explanation, if they were so.

The NCRM report does make some limited reference to post-election re-interviews of contacts. However, this is presented in the context of a possible “late swing” rather than affective forecasting. There are no conclusions I can use.

Meta-analysis

The UK polls took a horrible beating when they signally failed to predict the result of the 1992 election in under-estimating the Conservative lead by around 8%.3 Things then felt better. The 1997 election was happier, where Labour led by 13% at the election with final polls in the range of 10 to 18%.4 In 2001 each poll managed to get the Conservative vote within 3% but all over-estimated the Labour vote, some pollsters by as much as 5%.5 In 2005, the final poll had Labour on 38% and Conservative, 33%. The popular vote was Labour 36.2% and Conservative 33.2%.6 In 2010 the final poll had Labour on 29% and Conservative, 36%, with a popular vote of 29.7%/36.9%.7 The debacle of 1992 was all but forgotten until, to pundits’ dismay, 2015 brought its return.

Given the history and given the inherent difficulties of sampling and affective forecasting, I’m not sure why we are so surprised when the polls get it wrong. Unfortunately for the election strategist they are all we have. That is a common theme with real world data. Because of its imperfections it has to be interpreted within the context of other sources of evidence rather than followed slavishly. The objective is not to be driven by data but to be led by the insights it yields.

References

  1. Opinion polling for the 2015 United Kingdom general election. (2016, January 19). In Wikipedia, The Free Encyclopedia. Retrieved 22:57, January 20, 2016, from https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_2015_United_Kingdom_general_election&oldid=700601063
  2. Opinion polling for the next United Kingdom general election. (2016, January 18). In Wikipedia, The Free Encyclopedia. Retrieved 22:55, January 20, 2016, from https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_next_United_Kingdom_general_election&oldid=700453899
  3. Butler, D & Kavanagh, D (1992) The British General Election of 1992, Macmillan, Chapter 7
  4. — (1997) The British General Election of 1997, Macmillan, Chapter 7
  5. — (2002) The British General Election of 2001, Palgrave-Macmillan, Chapter 7
  6. Kavanagh, D & Butler, D (2005) The British General Election of 2005, Palgrave-Macmillan, Chapter 7
  7. Cowley, P & Kavanagh, D (2010) The British General Election of 2010, Palgrave-Macmillan, Chapter 7

UK railway suicides – 2015 update

The latest UK rail safety statistics were published in September 2015 absent the usual press fanfare. Regular readers of this blog will know that I have followed the suicide data series, and the press response, closely in 2014, 2013 and 2012.

This year I am conscious that one of those units is not a mere statistic but a dear colleague, Nigel Clements. It was poet W B Yeats who observed, in his valedictory verse Under Ben Bulben that “Measurement began our might.” He ends the poem by inviting us to “Cast a cold eye/ On life, on death.” Sometimes, with statistics, we cast the cold eye but the personal reminds us that it must never be an academic exercise.

Nigel’s death gives me an additional reason for following this series. I originally latched onto it because I felt that exaggerated claims as to trends were being made. It struck me as a closely bounded problem that should be susceptible to taut measurement. And it was something important. Again I have re-plotted the data myself on a Shewhart chart.

[Figure: RailwaySuicides4]

Readers should note the following about the chart.

  • Some of the numbers for earlier years have been updated by the statistical authority.
  • I have recalculated natural process limits as there are still no more than 20 annual observations.
  • The signal noted last year has persisted (in red) with two consecutive observations above the upper natural process limit. There are also now eight points below the centre line at the beginning of the series.

As my colleague Terry Weight always taught me, a signal gives us licence to interpret the ups and downs on the chart. This increasingly looks like a gradual upward trend.
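
As a sketch of the two kinds of signal mentioned above, a point beyond a natural process limit and a run of eight or more consecutive points on one side of the centre line, applied to hypothetical annual counts (the 2.66 factor is the usual XmR chart constant):

```python
import numpy as np

# Hypothetical annual counts in time order, for illustration only.
counts = np.array([200, 205, 198, 203, 207, 199, 204, 206, 209, 212, 215, 219, 232, 238])

centre = counts.mean()
mr_bar = np.abs(np.diff(counts)).mean()       # mean moving range
unpl = centre + 2.66 * mr_bar                 # upper natural process limit
lnpl = centre - 2.66 * mr_bar                 # lower natural process limit

# Rule 1: any point beyond a natural process limit.
beyond = [i for i, y in enumerate(counts) if y > unpl or y < lnpl]

# Rule 2: eight or more consecutive points on the same side of the centre line.
runs, run = [], [0]
for i in range(1, len(counts)):
    if (counts[i] > centre) == (counts[run[0]] > centre):
        run.append(i)
    else:
        if len(run) >= 8:
            runs.append((run[0], run[-1]))
        run = [i]
if len(run) >= 8:
    runs.append((run[0], run[-1]))

print(beyond, runs)
```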

Though there was this year little coverage in the press, I did find this article in The Guardian newspaper. I had previously wondered whether the railway data simply reflected an increasing trend in UK suicide in general. The Guardian report is eager to emphasise:

The total number [of suicides] in the UK has risen in recent years, with the latest Office for National Statistics figures showing 6,233 suicides registered in the UK in 2013, a 4% increase on the previous year.

Well, #executivetimeseries! I have low expectations of press data journalism so I do not know why I am disappointed. In any event I decided to plot the data. There were a few problems. The railway data is not collected by calendar year so the latest observation is 2014/15. I have not managed to identify which months are included, though while I was hunting I found out that the railway data does not include London Underground. I can find no railway data before 2001/02. The national suicide data is collected by calendar year and the last year published is 2013. I have done my best by (not quite) arbitrarily identifying 2013/14 in the railway data with 2013 nationally. I also tried the obvious shift by one year and it did not change the picture.

[Figure: RailwaySuicides5]

I have added a LOWESS line (with smoothing parameter 0.4) to the national data, the better to pick out the minimum around 2007, just before the start of the financial crisis. That is where the steady decline over the previous quarter century reverses. It is in itself an arresting statistic. But I don’t see the national trend mirrored in the railway data, so the national trend does not explain the railway trend.
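
For anyone wanting to reproduce that sort of smoothing, here is a sketch using the LOWESS implementation in statsmodels on hypothetical annual counts; I am assuming its frac argument corresponds to the smoothing parameter quoted above.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical annual counts, shaped to dip around 2007, for illustration only.
rng = np.random.default_rng(1)
years = np.arange(1988, 2014)
counts = 5800 + 3.0 * (years - 2007) ** 2 + rng.normal(0, 60, len(years))

# LOWESS smooth with frac = 0.4; returns (x, smoothed y) pairs sorted by x.
smoothed = lowess(counts, years, frac=0.4)
print(smoothed[:5])
```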

Previously I noted proposals to repeat a strategy from Japan of bathing railway platforms with blue light. Professor Michiko Ueda of Syracuse University was kind enough to send me details of the research. The conclusions were encouraging but tentative and, unfortunately, the Japanese rail companies have not made any fresh data available for analysis since 2010. In the UK, I understand that such lights were installed at Gatwick in summer 2014 but I have not seen any data.

A huge amount of sincere endeavour has gone into this issue but further efforts have to be against the background that there is an escalating and unexplained problem.

Things and actions are what they are and the consequences of them will be what they will be: why then should we desire to be deceived?

Joseph Butler