UK railway suicides – 2016 update

The latest UK rail safety statistics were published in September 2016, again absent much of the press fanfare we had seen in the past. Apologies for the long delay but the day job has been busy. Regular readers of this blog will know that I have followed the suicide data series, and the press response, closely in 2015, 2014, 2013 and 2012. Again, I “Cast a cold eye/ On life, on death.” Again I have re-plotted the data myself on a Shewhart chart.

railwaysuicides6

Readers should note the following about the chart.

  • Many thanks to Tom Leveson Gower at the Office of Rail and Road who confirmed that the figures are for the year up to the end of March.
  • Some of the numbers for earlier years have been updated by the statistical authority.
  • I have recalculated natural process limits (NPLs) as there are still no more than 20 annual observations, and because the historical data has been updated. The NPLs have therefore changed in that the 2014 total is no longer above the upper NPL.
  • The observation above the upper NPL in 2015 has not persisted. The latest total is within the NPLs. We have to think about how to interpret this.

The current chart shows two signals, an observation above the upper NPL in 2015 and a run of 8 below the centre line from 2002 to 2009. As I always remark, the Terry Weight rule says that a signal gives us licence to interpret the ups and downs on the chart. So I shall have a go at doing that. Last year I was coming to the conclusion that the data increasingly looked like a gradual upward trend. Has the 2016 data changed that?
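
For readers who want to replicate that kind of chart, here is a minimal sketch of one common way to compute natural process limits and check for those two signals, assuming an XmR (individuals) chart with limits at the mean ± 2.66 × the mean moving range. The annual totals in the example are invented, not the ORR figures.

```python
import numpy as np

def xmr_limits(x):
    """Centre line and natural process limits for an individuals (XmR) chart."""
    x = np.asarray(x, dtype=float)
    centre = x.mean()
    npl = 2.66 * np.abs(np.diff(x)).mean()   # 2.66 x mean moving range
    return centre, centre - npl, centre + npl

def signals(x):
    """Points outside the NPLs, plus any run of 8 on one side of the centre line."""
    centre, lower, upper = xmr_limits(x)
    outside = [i for i, v in enumerate(x) if v < lower or v > upper]
    runs, run_len = [], 0
    for i, v in enumerate(x):
        same_side = i > 0 and (v - centre) * (x[i - 1] - centre) > 0
        run_len = run_len + 1 if same_side else 1
        if run_len == 8:
            runs.append(i - 7)               # index where the run of 8 starts
    return outside, runs

# Invented annual totals, for illustration only.
annual_totals = [238, 233, 240, 231, 229, 224, 226, 221, 230,
                 235, 246, 252, 250, 268, 279, 252]
print(xmr_limits(annual_totals))
print(signals(annual_totals))
```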

The Samaritans posted on their website, “Rail suicides fall by 12%,” and went on to say:

Suicide prevention measures put in place as part of the partnership between Samaritans, Network Rail and the wider rail industry are saving more lives on the railways.

In fairness, the Samaritans qualified their headline with the following footnote.

We must be mindful that suicide data is best understood by looking at trends over longer periods of time, and year-on-year fluctuations may not be indicative of longer term trends. It is however very encouraging to see such a decrease which we would hope to see continuing in future years.

The Huffington Post (no, I’m not sure I really think of them as part of the MSM) were less cautious in banking the 12%, stating, “It is the first time the number has dropped in three years.” True, but #executivetimeseries!

Signal or noise?

What shall we make of the decrease, a decrease to  “back within” the NPLs? First, the mere fact that there are fewer suicides is good news. That is a “better” outcome. The question still remains as to whether we are making progress in reducing the frequency of suicides. Has there been a change to the underlying cause system that drives the suicide numbers? We might just be observing noise unrelated to an underlying signal or trend. Remember that extremely high measurements are usually followed by lower ones because of the principle of regression to the mean.1 Such a decrease is no evidence of an underlying improvement but merely a deceptive characteristic of common cause variation.
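
To see why regression to the mean alone can produce such a fall, here is a hedged little simulation. The Poisson rate is entirely invented rather than estimated from the railway data; the point is only to show what pure common cause variation does after a high year.

```python
import numpy as np

# Simulate a stable cause system of annual counts as pure common cause variation
# and look at what follows an unusually high year.
rng = np.random.default_rng(1)
counts = rng.poisson(lam=250, size=100_000)

high_years = counts[:-1] > np.quantile(counts, 0.95)   # unusually high years
followers = counts[1:][high_years]                     # the year immediately after

print(f"mean of all years:            {counts.mean():.1f}")
print(f"mean of unusually high years: {counts[:-1][high_years].mean():.1f}")
print(f"mean of the following years:  {followers.mean():.1f}")
# With independent common cause variation the following years fall back towards
# the overall mean: a decrease after a high year is no evidence of improvement
# in the underlying cause system.
```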

One thing that I can do is to try to fit a trend line through the data and to ask which narrative best fits what I observe, a continuing increasing trend or a trend that has plateaued or even reversed. As you know, I am very critical of the uncritical casting of regression lines on data plots. However, this time I have a definite purpose in mind. Here is the data with a fitted linear regression line.

railwaysuicides8a

What I wanted to do was to split the data into two parts:

  • A trend (linear, simply for the sake of exploratory data analysis (EDA)); and
  • The residual variation about the trend.

The question I want to ask is whether the residual variation is stable, just plain noise, or whether there is a signal there that might give me a clue that a linear trend does not hold. The way that I do that is to plot the residuals on a Shewhart chart.
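
Here is a minimal sketch of that split, fitting the straight line by ordinary least squares and putting the residuals on an XmR chart. The annual totals are made up for illustration, and the 2.66 × mean moving range limits are my assumption about the usual individuals chart calculation.

```python
import numpy as np

# Fit a straight line for EDA, then check whether the residuals look like
# nothing but common cause variation.
years = np.arange(2002, 2017)
totals = np.array([200, 192, 198, 205, 196, 210, 215, 222, 208,
                   220, 228, 246, 250, 268, 237], dtype=float)   # illustrative only

slope, intercept = np.polyfit(years, totals, deg=1)   # the linear trend
fits = intercept + slope * years
residuals = totals - fits

centre = residuals.mean()                              # ~0 by construction
npl = 2.66 * np.abs(np.diff(residuals)).mean()         # natural process limits
print(f"trend: {slope:.1f} per year")
print(f"residual NPLs: {centre - npl:.1f} to {centre + npl:.1f}")
print("signals:", [(y, r) for y, r in zip(years, residuals) if abs(r - centre) > npl])
```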

railwaysuicides7

That shows a stable pattern of residuals. If I try to interpret the chart as a linear trend plus exchangeable noise then nothing in the data contradicts that. The original chart invites an interpretation, because of the signals. I adopt the interpretation of an increasing trend. Nothing in the data contradicts that. I can put the pictures together to show this model.

railwaysuicides8

My opinion is that, when I plot the data that way, I have a compelling picture of a growing trend about which there is some stable common cause variation. Had there been an observation below the lower NPL on the last chart then that could have been evidence that the trend was slowing or even reversing. But not here.

I note that there’s also a report here from Anna Taylor and her colleagues at the University of Bristol. They too find an increasing trend with no signal of amelioration. They have used a different approach from mine and the fact that we have both got to the same broad result should reinforce confidence in our common conclusion.

Measurement Systems Analysis

Of course, we should not draw any conclusions from the data without thinking about the measurement system. In this case there is a legal issue. It concerns the standard of proof that the law requires coroners to apply before finding suicide as the cause of death. Findings of fact in inquests in England and Wales are generally made if they satisfy the civil standard of proof, the balance of probabilities. However, a finding of suicide can only be returned if such a conclusion satisfies the higher standard of beyond reasonable doubt, the typical criminal standard.2 There have long been suggestions that that leads to under-reporting of suicides.3 The Matthew Elvidge Trust is currently campaigning for the general civil standard of balance of probabilities to be adopted.4

Next steps

Previously I noted proposals to repeat a strategy from Japan of bathing railway platforms with blue light. In the UK, I understand that such lights were installed at Gatwick in summer 2014 but I have not seen any data or heard anything more about it.

A huge amount of sincere endeavour has gone into this issue but further efforts have to be against the background that there is an escalating and unexplained problem.

References

  1. Kahneman, D (2011) Thinking, Fast and Slow, Allen Lane, pp175-184
  2. Jervis on Coroners 13th ed. 13-70
  3. Chambers, D R (1989) “The coroner, the inquest and the verdict of suicide”, Medicine, Science and the Law 29, 181
  4. “Trust responds to Coroner’s Consultation”, The Matthew Elvidge Trust, retrieved 4/1/17

Plan B, gut feel and Shewhart charts

Elizabeth Holmes 2014 (cropped).jpg

I honestly had the idea for this blog and started drafting it six months ago when I first saw this, now quite infamous, quote being shared around the internet.

The minute you have a back-up plan, you’ve admitted you’re not going to succeed.

Elizabeth Holmes

Good advice? I think not! Let’s review some science.

Confidence and trustworthiness

As far back as the 1970s, psychologists carried out a series of experiments on individual confidence.1 They took a sample of people and set each of them a series of general knowledge questions. The participants were to work independently of each other. The questions were things like, “What is the capital city of France?” The respondents had not only to do their best to answer the question, but also then to state the probability that they had answered correctly.

As a headline to their results the researchers found that, of all those answers in the aggregate about which people said they were 100% sure that they had answered correctly, more than 20% were answered incorrectly.

Now, we know that people who go around assigning 100% probabilities to things that happen only 80% of the time are setting themselves up for inevitable financial loss.2 Yet, this sort of over confidence in the quality and reliability of our individual, internal cognitive processes has been identified and repeated over multiple experiments and sundry real life situations.

There is even a theory that the only people whose probabilities are reliably calibrated against frequencies are those suffering from clinically diagnosed depression. The theory of depressive realism remains, however, controversial.

Psychologists like Daniel Kahneman have emphasised that human reasoning is limited by a bounded rationality. All our cognitive processes are built on individual experience, knowledge, cultural assumptions, habits for interpreting data (good, bad and indifferent) … everything. All those things are aggregated imperfectly, incompletely and partially. Nobody can take the quality of their own judgments for granted.

Kahneman points out that, in particular, wherever individuals engage sophisticated techniques of analysis and rationalisation, and especially those tools that require long experience, education and training to acquire, there is over confidence in outcomes.3 Kahneman calls this the illusion of validity. The more thoroughly we construct an internally consistent narrative for ourselves, the more we are seduced by it. And it is instinctive for humans to seek such cogent models for experience and aspiration. Kahneman says:4

Confidence is a feeling, which reflects the coherence of the information and the cognitive ease of processing it. It is wise to take admissions of uncertainty seriously, but declarations of high confidence mainly tell you that an individual has constructed a coherent story in their mind, not necessarily that the story is true.

If illusion is the spectre of confidence then having a Plan B seems like a good idea. Of course, Holmes is correct that having a Plan B will tempt you to use it. When disappointments accumulate, in escalating costs, stagnating revenues or emerging political risks, it is very tempting to seek the repose of a lesser ambition or even a managed mitigation of residual losses.

But to proscribe a Plan B in order to motivate success is to display the risk appetite of a Kamikaze pilot. Sometimes reality tells you that your business plan is predicated on a false prospectus. Given the science of over confidence and the narrative of bounded rationality, we know that it will happen a lot of the time.

GenericPBC

Holmes is also correct that disappointment is, in itself, no reason to change plan. What she neglects is that there is a phenomenon that does legitimately invite change: a surprise. It is a surprise that alerts us to an inconsistency between the real world and our design. A surprise ought to make us go back to our working business plan and examine the assumptions against the real world data. A switch to Plan B is not inevitable. There may be other means of mitigation: Act, Adapt or Abandon. The surprise could even be an opportunity to be grasped. The Plan B doesn’t have to be negative.

How then are we to tell a surprise from a disappointment? With a Shewhart chart of course. The chart has the benefits that:

  • Narrative building is shared not personal.
  • Narratives are challenged with data and context.
  • Surprise and disappointment are distinguished.
  • Predictive power is tested.

Analysis versus “gut feel”

I suppose that what lies behind Holmes’ quote is the theory that commitment and belief can, in themselves, overcome opposing forces, and that a commitment borne of emotion and instinctive confidence is all the more potent. Here is an old LinkedIn post that caught my eye a while ago celebrating the virtues of “gut feel”.

The author believed that gut feel came from experience and that individuals with long exposure to a complex world should be able to trump data with their intuition. Intuition forms part of what Kahneman called System 1 thinking which he contrasted with the System 2 thinking that we engage in when we perform careful and lengthy data analysis (we hope).5 System 1 thinking can be valuable. Philip Tetlock, a psychologist who researched the science of forecasting, noted this.6

Whether intuition generates delusion or insight depends on whether you work in a world full of valid cues you can unconsciously register for future use.

In fact, whether the world is full of the sorts of valid cues that support useful predictions is exactly the question that Shewhart charts are designed to answer. Whether we make decisions on data or on gut feel, either can mislead us with the illusion of validity.

Again, what the chart supports is the continual testing of the reliability and utility of intuitions. Gut feel is not forbidden but be sure that the successive predictions and revisions will be recorded and subjected to the scrutiny of the Shewhart chart. Impressive records of forecasting will form the armature of a continually developing shared narrative of organisational excellence. Unimpressive forecasters will have to yield ground.

References

  1. Lichtenstein, S et al. (1982) “Calibration of probabilities: The state of the art to 1980” in Kahneman, D et al. Judgment Under Uncertainty: Heuristics and Biases, Cambridge University Press
  2. De Finetti, B (1974) Theory of Probability: A Critical Introductory Treatment, Vol.1, trans. Machi, A & Smith, A; Wiley, p113
  3. Kahneman, D (2011) Thinking, Fast and Slow, Allen Lane, p217
  4. Kahneman (2011) p212
  5. Kahneman (2011) pp19-24
  6. Tetlock, P (2015) Superforecasting: The Art and Science of Prediction, Crown Publishing, Kindle loc 1031

Productivity and how to improve it: II – Profit = Customer value – Cost

I said I would be posting on this topic way back here. Perhaps that says something about my personal productivity but I have been being productive on other things. I have a day job.

I wanted to start off addressing customer value and waste. Here are a couple of revealing stories from the press.

Blue dollars and green dollars

File:Lemon.jpg

This story, about a pizza restaurant transferring the task of slicing lemons from the waiters to the kitchen staff, appeared on the BBC website. As you know I am rarely impressed by standards of data journalism at the state-owned BBC. This item makes one of the gravest errors of attempted business improvement. It had been the practice that waiters, as their first job in the morning, would chop lemons for the day’s anticipated drinks orders. A pizza chef commented that chopping was one of the chefs’ trade skills and that lemon chopping should be transferred to the chefs. That would, purportedly, save the waiters from having to “take a break from their usual tasks, wash their hands, clear a space and then clean up after themselves.” The item goes on:

“Just by changing who chops the lemons, we were able to make a significant saving in hours which translates into a significant financial saving,” says Richard Hodgson, Pizza Express’ chief executive.

This looks, to the uncritical eye, like a saving. But it is a saving in what we call blue dollars (or pounds or euros). It appears in blue ink on an executive summary or monthly report. Did Pizza Express actually save any cash, what we call green dollars (or …)? Did the initiative put a ding in the profit and loss account?

Perhaps it did but perhaps not. It is, actually, very easy to eliminate, or perhaps hide or redeploy, tasks or purchases and claim a saving in blue dollars. Demonstrating that this then mapped into a saving in green dollars requires committed analytics and the trenchant criticism of historical data. The blue dollars will turn into green dollars if Pizza Express can achieve a time saving that allows:

  • A reduction in payroll; or
  • Redeployment of time into an activity that creates greater value for the customer.

That is assuming that the initiative did result in a time saving. What it certainly lost was a team building opportunity between waiters and chefs and a signal for waiters to wash their hands.

The jury is out as to whether Pizza Express improved productivity. Translation of blue dollars into green dollars is not easy. It is certainly not automatic. Turning blue dollars into green dollars is the really tricky bit in improvement. The bit that requires all the skill and know-how. It turns on the Nolan and Provost question: How will you know when a change is an improvement? More work is needed here to persuade anybody of anything. More work is certainly needed by the BBC in improving their journalism.

Politicians don’t get it

I asked above if the freed time could be translated into an activity that creates greater value for the customer. The value of a thing is what somebody is willing to pay for it. When we say that an activity creates value we mean that it increases the price at which we can sell output. The importance of price is that it captures a revealed preference rather than just a casual attitude for which the subject will never have to give an account. Any activity that does not create value for the customer is waste. The Japanese word muda has become fashionable. It is at the core of achieving operational excellence that unrelenting, gradual and progressive elimination of waste is a daily activity for everybody in the organisation. Waste, everything that does not create value for the customer. Everything that does not make the customer willing to pay more. If the customer will not pay more there is no value for them.

John Redwood was a middle ranking official in John Major’s government of the 1990s though he had frustrated ambitions for higher office. He offered us his personal thoughts on productivity here. I think he illustrates how poorly politicians understand what productivity is. Redwood thinks that we are over simplifying things when we say that productivity is:

ProductiviityEq1

or, a better definition:

ProductiviityEq2

Redwood thinks that, in the service sector, “labour intensity is often seen as better service rather than as worse productivity”. It may be true but only in so far as the customer sees it as such and is willing to pay proportionately for the staffing. Where the customer will not pay then productivity is reduced and insisting that labour intensity is an inherent virtue is a delusion. I think this is the basis of what Redwood is trying to say about purchasing coffee from a store. The test is that the customer is willing to pay for the experience.

However imperfect the statistics, they do seek to capture what the customers have been willing to pay. The spend at the coffee stand should show up on the aggregated statistics for “customer value created” and so the retail coffee phenomenon will not manifest itself as a decrease in productivity. Redwood has completely misunderstood.

Of course there are measurement issues and they are serious ones. There is nothing though that suggests that the concept or its definition is at fault.

What is worrying is that Redwood’s background is in banking though I certainly know bankers who are less out of touch with the real world. Redwood needs to get that the fundamental theorem of business is that:

profit = price – cost

— and that price is set by the market. There are only two things to do to improve.

  • Develop products that enhance customer value.
  • Eliminate costs that do not contribute to customer value.

UK figures

I could not find a long-term productivity time series on the UK Office for National Statistics website (“ONS”). I think that is shameful. You know that I am always suspicious of politicians’ unwillingness to encourage sharing long term statistical series. I managed to find what I was looking for here at www.tradingeconomics.com. Click on the “MAX” tab on the chart.

That chart gave me a suspicion. The ONS website does have the data from 2008. There is a link to this data after Figure 3 of the ONS publication Labour Productivity: Oct to Dec 2015. However, all the charts in that publication are fairly hideous and lacking in graphical excellence. Here is the 2008 to 2015 data replotted.

Productivity1

I am satisfied that, following the steep drop in UK productivity coinciding with the world financial crisis of 2007/08, there has been a (fairly) steady rise in productivity to the region of pre-crash levels. Confirming that with a Shewhart chart is left as an exercise for the reader. Of course, there is common cause variation around the upward trend. And, I suspect, some special causes too. However, I think that inferences of gloom following the Quarter 4 2015 figures, the last observation plotted, are premature. A bad case of #executivetimeseries.
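
For anyone who wants to take up that exercise, here is a hedged sketch. The file name and column name are hypothetical stand-ins for whatever the ONS download actually contains.

```python
import numpy as np
import pandas as pd

# One way to do the "exercise for the reader": an XmR chart on the quarterly series.
df = pd.read_csv("uk_output_per_hour_2008_2015.csv")       # hypothetical file name
x = df["output_per_hour_index"].to_numpy(dtype=float)      # hypothetical column name

centre = x.mean()
npl = 2.66 * np.abs(np.diff(x)).mean()
upper, lower = centre + npl, centre - npl
print(f"centre {centre:.1f}, NPLs {lower:.1f} to {upper:.1f}")
print("points outside NPLs:", np.where((x < lower) | (x > upper))[0])
# A trending series will throw signals on a plain XmR chart; charting the
# residuals about a fitted trend, as in the railway post above, is one way to
# separate the upward trend from the common cause variation around it.
```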

I think that makes me less gloomy about UK productivity than the press and politicians. I have a suspicion that growth since 2008 has been slower than historically but I do not want to take that too far here.

Coming next: Productivity and how to improve it III – Signal and noise

Why would a lawyer blog about statistics?

Brandeis and Taylor

… is a question I often get asked. I blog here about statistics, data, quality, data quality, productivity, management and leadership. And evidence. I do it from my perspective as a practising lawyer and some people find that odd. Yet it turns out that the collaboration between law and quantitative management science is a venerable one.

The grandfather of scientific management is surely Frederick Winslow Taylor (1856-1915). Taylor introduced the idea of scientific study of work tasks, using data and quantitative methods to redesign and control business processes.

Yet one of Taylorism’s most effective champions was a lawyer, Louis Brandeis (1856-1941). In fact, it was Brandeis who coined the term scientific management.

Taylor

Taylor was a production engineer who advocated a four stage strategy for productivity improvement.

  1. Replace rule-of-thumb work methods with methods based on a scientific study of the tasks.
  2. Scientifically select, train, and develop each employee rather than passively leaving them to train themselves.
  3. Provide “Detailed instruction and supervision of each worker in the performance of that worker’s discrete task”.1
  4. Divide work nearly equally between managers and workers, so that the managers apply scientific management principles to planning the work and the workers actually perform the tasks.

Points (3) and (4) tend to jar with millennial attitudes towards engagement and collaborative work. Conservative political scientist Francis Fukuyama criticised Taylor’s approach as “[epitomising] the carrying of the low-trust, rule based factory system to its logical conclusion”.2 I have blogged many times on here about the importance of trust.

However, (1) and (2) provided the catalyst for pretty much all subsequent management science from W Edwards Deming, Elton Mayo, and Taiichi Ohno through to Six Sigma and Lean. Subsequent thinking has centred around creating trust in the workplace as inseparable from (1) and (2). Peter Drucker called Taylor the “Isaac Newton (or perhaps the Archimedes) of the science of work”.

Taylor claimed substantial successes with his redesign of work processes based on the evidence he had gathered, avant la lettre, in the gemba. His most cogent lesson was to exhort managers to direct their attention to where value was created rather than to confine their horizons to monthly accounts and executive summaries.

Of course, Taylor was long dead before modern business analytics began with Walter Shewhart in 1924. There is more than a whiff of the #executivetimeseries about some of Taylor’s work. Once management had Measurement System Analysis and the Shewhart chart there would no longer be any hiding place for groundless claims to non-existent improvements.

Brandeis

Brandeis practised as a lawyer in the US from 1878 until he was appointed a Justice of the Supreme Court in 1916. Brandeis’ principles as a commercial lawyer were, “first, that he would never have to deal with intermediaries, but only with the person in charge…[and] second, that he must be permitted to offer advice on any and all aspects of the firm’s affairs”. Brandeis was trenchant about the benefits of a coherent commitment to business quality. He also believed that these things were achieved, not by chance, but by the application of policy deployment.

Errors are prevented instead of being corrected. The terrible waste of delays and accidents is avoided. Calculation is substituted for guess; demonstration for opinion.

Brandeis clearly had a healthy distaste for muda.3 Moreover, he was making a land grab for the disputed high ground that these days often earns the vague and fluffy label strategy.

The Eastern Rate Case

The worlds of Taylor and Brandeis embraced in the Eastern Rate Case of 1910. The Eastern Railroad Company had applied to the Interstate Commerce Commission (“the ICC”) arguing that their cost base had inflated and that an increase in their carriage rates was necessary to sustain the business. The ICC was the then regulator of those utilities that had a monopoly element. Brandeis by this time had taken on the role of the People’s Lawyer, acting pro bono in whatever he deemed to be the public interest.

Brandeis opposed the rate increase arguing that the escalation in Eastern’s cost base was the result of management failure, not an inevitable consequence of market conditions. The cost of a monopoly’s ineffective governance should, he submitted, not be borne by the public, nor yet by the workers. In court Brandeis was asked what Eastern should do and he advocated scientific management. That is where and when the term was coined.4

Taylor-Brandeis

The insight that profit cannot simply be wished into being by the fiat of cost plus, a fortiori of the hourly rate, is the Milvian bridge to lean.

But everyone wants to occupy the commanding heights of an integrated policy nurturing quality, product development, regulatory compliance, organisational development and the economic exploitation of customer value. What’s so special about lawyers in the mix? I think we ought to remind ourselves that if lawyers know about anything then we know about evidence. And we just might know as much about it as the statisticians, the engineers and the enforcers. Here’s a tale that illustrates our value.

Thereza Imanishi-Kari was a postdoctoral researcher in molecular biology at the Massachusetts Institute of Technology. In 1986 a co-worker raised inconsistencies in Imanishi-Kari’s earlier published work that escalated into allegations that she had fabricated results to validate publicly funded research. Over the following decade, the allegations grew in seriousness, involving the US Congress, the Office of Scientific Integrity and the FBI. Imanishi-Kari was ultimately exonerated by a departmental appeal board constituted of an eminent molecular biologist and two lawyers. The board heard cross-examination of the relevant experts including those in statistics and document examination. It was that cross-examination that exposed the allegations as without foundation.5

Lawyers can make a real contribution to discovering how a business can be run successfully. But we have to live the change we want to be. The first objective is to bring management science to our own business.

The black-letter man may be the man of the present but the man of the future is the man of statistics and the master of economics.

Oliver Wendell Holmes, 1897

References

  1. Montgomery, D (1989) The Fall of the House of Labor: The Workplace, the State, and American Labor Activism, 1865-1925, Cambridge University Press, p250
  2. Fukuyama, F (1995) Trust: The Social Virtues and the Creation of Prosperity, Free Press, p226
  3. Kraines, O (1960) “Brandeis’ philosophy of scientific management” The Western Political Quarterly 13(1), 201
  4. Freedman, L (2013) Strategy: A History, Oxford University Press, pp464-465
  5. Kevles, D J (1998) The Baltimore Case: A Trial of Politics, Science and Character, Norton

Regression done right: Part 3: Forecasts to believe in

There are three Sources of Uncertainty in a forecast.

  1. Whether the forecast is of “an environment that is sufficiently regular to be predictable”.1
  2. Uncertainty arising from the unexplained (residual) system variation.
  3. Technical statistical sampling error in the regression calculation.

Source of Uncertainty (3) is the one that fascinates statistical theorists. Sources (1) and (2) are the ones that obsess the rest of us. I looked at the first in Part 1 of this blog and the second in Part 2. Now I want to look at the third Source of Uncertainty and try to put everything together.

If you are really most interested in (1) and (2), read “Prediction intervals” then skip forwards to “The fundamental theorem of prediction”.

Prediction intervals

A prediction interval2 captures the range in which a future observation is expected to fall. Bafflingly, not all statistical software generates prediction intervals automatically so it is necessary, I fear, to know how to calculate them from first principles. However, understanding the calculation is, in itself, instructive.

But I emphasise that prediction intervals rely on a presumption that what is being forecast is “an environment that is sufficiently regular to be predictable”, that the (residual) business process data is exchangeable. If that presumption fails then all bets are off and we have to rely on a Cardinal Newman analysis. Of course, when I say that “all bets are off”, they aren’t. You will still be held to your existing contractual commitments even though your confidence in achieving them is now devastated. More on that another time.

Sources of variation in predictions

In the particular case of linear regression we need further to break down the third Source of Uncertainty.

  2. Uncertainty arising from the unexplained (residual) variation.
  3. Technical statistical sampling error in the regression calculation:
    3A. Sampling error of the mean.
    3B. Sampling error of the slope.

Remember that we are, for the time being, assuming Source of Uncertainty (1) above can be disregarded. Let’s look at the other Sources of Uncertainty in turn: (2), (3A) and (3B).

Source of Variation (2) – Residual variation

We start with the Source of Uncertainty arising from the residual variation. This is the uncertainty because of all the things we don’t know. We talked about this a lot in Part 2. We are content, for the moment, that they are sufficiently stable to form a basis for prediction. We call this common cause variation. This variation has variance s², where s is the residual standard deviation that will be output by your regression software.

RegressionResExpl2

Source of Variation (3A) – Sampling error in mean

To understand the next Source of Variation we need to know a little bit about how the regression is calculated. The calculations start off with the respective means of the X values ( X̄ ) and of the Y values ( Ȳ ). Uncertainty in estimating the mean of the Ys is the next contribution to the global prediction uncertainty.

An important part of calculating the regression line is to calculate the mean of the Ys. That mean is subject to sampling error. The variance of the sampling error is the familiar result from the statistics service course.

s²/n

— where n is the number of pairs of X and Y. Obviously, as we collect more and more data this term gets more and more negligible.

RegressionMeanExpl

Source of Variation (3B) – Sampling error in slope

This is a bit more complicated. Skip forwards if you are already confused. Let me first give you the equation for the variance of predictions referable to sampling error in the slope.

s²( X − X̄ )²/SXX

This has now introduced the mysterious sum of squares, SXX. However, before we learn exactly what this is, we immediately notice two things.

  1. As we move away from the centre of the training data the variance gets larger.3
  2. As SXX gets larger the variance gets smaller.

The reason for the increasing sampling error as we move from the mean of X is obvious from thinking about how variation in slope works. The regression line pivots on the mean. Travelling further from the mean amplifies any disturbance in the slope.

RegressionSlopeExpl

Let’s look at where SXX comes from. The sum of squares is calculated from the Xs alone without considering the Ys. It is a characteristic of the sampling frame that we used to train the model. We take the difference of each X value from the mean of X, and then square that distance. To get the sum of squares we then add up all those individual squares. Note that this is a sum of the individual squares, not their average.

RegressionSXXTable
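
As a tiny worked illustration, with made-up Xs rather than any real sampling frame:

```python
import numpy as np

# Tiny worked example of SXX with invented Xs.
X = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
Xbar = X.mean()                      # 6.0
SXX = ((X - Xbar) ** 2).sum()        # (-4)^2 + (-2)^2 + 0^2 + 2^2 + 4^2 = 40
print(Xbar, SXX)
```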

Two things then become obvious (if you think about it).

  1. As we get more and more data, SXX gets larger.
  2. As the individual Xs spread out over a greater range of X, SXX gets larger.

What that (3B) term does emphasise is that even sampling error escalates as we exploit the edge of the original training data. As we extrapolate clear of the original sampling frame, the pure sampling error can quickly exceed even the residual variation.

Yet it is only a lower bound on the uncertainty in extrapolation. As we move away from the original range of Xs then, however happy we were previously with Source of Uncertainty (1), that the data was from “an environment that is sufficiently regular to be predictable”, then the question barges back in. We are now remote from our experience base in time and boundary. Nothing outside the original X-range will ever be a candidate for a comfort zone.

The fundamental theorem of prediction

Variances, generally, add up so we can sum the three Sources of Variation (2), (3A) and (3B). That gives the variance of an individual prediction, spred². By an individual prediction I mean that somebody gives me an X and I use the regression formula to give them the (as yet unknown) corresponding Ypred.

spred² = s² + s²/n + s²( X − X̄ )²/SXX

It is immediately obvious that s² is common to all three terms. However, the second and third terms, the sampling errors, can be made as small as we like by collecting more and more data. Collecting more and more data will have no impact on the first term. That arises from the residual variation. The stuff we don’t yet understand. It has variance s², where s is the residual standard deviation that will be output by your regression software.

This, I say, is the fundamental theorem of prediction. The unexplained variation provides a hard limit on the precision of forecasts.

It is then a very simple step to convert the variance into a standard deviation, spred. This is the standard error of the prediction.4,5

spred = s √[ 1 + 1/n + ( X − X̄ )²/SXX ]

Now, in general, where we have a measurement or prediction z that has an uncertainty that can be characterised by a standard error u, there is an old trick for putting an interval round it. Remember that u is a measure of the variation in z. We can therefore put an interval around z as a number of standard errors, z±ku. Here, k is a constant of your choice. A prediction interval for the regression that generates prediction Ypred then becomes:

Ypred ± k spred

Choosing k=3 is very popular, conservative and robust.6,7 Other choices of k are available on the advice of a specialist mathematician.
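
Putting the pieces together, here is a minimal sketch of the whole calculation on made-up training data, using the formulas above and k = 3. It illustrates the arithmetic only; it is not a substitute for checking exchangeability first.

```python
import numpy as np

# Invented training data.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

n = len(X)
m, c = np.polyfit(X, Y, deg=1)                 # slope and intercept
residuals = Y - (m * X + c)
s2 = (residuals ** 2).sum() / (n - 2)          # residual variance, s squared
SXX = ((X - X.mean()) ** 2).sum()

def prediction_interval(x_new, k=3.0):
    """Ypred +/- k * spred for a single future observation at x_new."""
    y_pred = m * x_new + c
    s_pred = np.sqrt(s2 * (1 + 1 / n + (x_new - X.mean()) ** 2 / SXX))
    return y_pred - k * s_pred, y_pred + k * s_pred

print(prediction_interval(4.5))
print(prediction_interval(12.0))   # extrapolation: watch the interval widen
```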

It was Shewhart himself who took this all a bit further and defined tolerance intervals which contain a given proportion of future observations with a given probability.8 They are very much for the specialist.

Source of Variation (1) – Special causes

But all that assumes that we are sampling from “an environment that is sufficiently regular to be predictable”, that the residual variation is solely common cause. We checked that out on our original training data but the price of predictability is eternal vigilance. It can never be taken for granted. At any time fresh causes of variation may infiltrate the environment, or become newly salient because of some sensitising event or exotic interaction.

The real trouble with this world of ours is not that it is an unreasonable world, nor even that it is a reasonable one. The commonest kind of trouble is that it is nearly reasonable, but not quite. Life is not an illogicality; yet it is a trap for logicians. It looks just a little more mathematical and regular than it is; its exactitude is obvious, but its inexactitude is hidden; its wildness lies in wait.

G K Chesterton

The remedy for this risk is to continue plotting the residuals, the differences between the observed value and, now, the prediction. This is mandatory.

RegressionPBC2

Whenever we observe a signal of a potential special cause it puts us on notice to protect the forecast-user because our ability to predict the future has been exposed as deficient and fallible. But it also presents an opportunity. With timely investigation, a signal of a possible special cause may provide deeper insight into the variation of the cause-system. That in itself may lead to identifying further factors to build into the regression and a consequential reduction in s².

It is reducing s², by progressively accumulating understanding of the cause-system and developing the model, that leads to more precise, and more reliable, predictions.

Notes

  1. Kahneman, D (2011) Thinking, Fast and Slow, Allen Lane, p240
  2. Hahn, G J & Meeker, W Q (1991) Statistical Intervals: A Guide for Practitioners, Wiley, p31
  3. In fact s²/SXX is the sampling variance of the slope. The standard error of the slope is, notoriously, s/√SXX. A useful result sometimes. It is then obvious from the figure how variation in slope is amplified as we travel farther from the centre of the Xs.
  4. Draper, N R & Smith, H (1998) Applied Regression Analysis, 3rd ed., Wiley, pp81-83
  5. Hahn & Meeker (1991) p232
  6. Wheeler, D J (2000) Normality and the Process Behaviour Chart, SPC Press, Chapter 6
  7. Vysochanskij, D F & Petunin, Y I (1980) “Justification of the 3σ rule for unimodal distributions”, Theory of Probability and Mathematical Statistics 21: 25–36
  8. Hahn & Meeker (1991) p231

Regression done right: Part 2: Is it worth the trouble?

In Part 1 I looked at linear regression from the point of view of machine learning and asked the question whether the data was from “An environment that is sufficiently regular to be predictable.”1 The next big question is whether it was worth it in the first place.

Variation explained

We previously looked at regression in terms of explaining variation. The original Big Y was beset with variation and uncertainty. We believed that some of that variation could be “explained” by a Big X. The linear regression split the variation in Y into variation that was explained by X and residual variation whose causes are as yet obscure.

I slipped in the word “explained”. Here it really means that we can draw a straight line relationship between X and Y. Of course, it is trite analytics that “association is not causation”. As long ago as 1710, Bishop George Berkeley observed that:2

The Connexion of Ideas does not imply the Relation of Cause and Effect, but only a Mark or Sign of the Thing signified.

Causation turns out to be a rather slippery concept, as all lawyers know, so I am going to leave it alone for the moment. There is a rather good discussion by Stephen Stigler in his recent book The Seven Pillars of Statistical Wisdom.3

That said, in real world practical terms there is not much point bothering with this if the variation explained by the X is small compared to the original variation in the Y with the majority of the variation still unexplained in the residuals.

Measuring variation

A useful measure of the variation in a quantity is its variance, familiar from the statistics service course. Variance is a good straightforward measure of the financial damage that variation does to a business.4 It also has the very nice property that we can add variances from sundry sources that aggregate together. Financial damage adds up. The very useful job that linear regression does is to split the variance of Y, the damage to the business that we captured with the histogram, into two components:

  • The contribution from X; and
  • The contribution of the residuals.

RegressionBlock1
The important thing to remember is that the residual variation is not some sort of technical statistical artifact. It is the aggregate of real world effects that remain unexamined and which will continue to cause loss and damage.

RegressionIshikawa2

Techie bit

Variance is the square of standard deviation. Your linear regression software will output the residual standard deviation, s, sometimes unhelpfully referred to as the residual standard error. The calculations are routine.5 Square s to get the residual variance, s². The smaller is s², the better. A small s² means that not much variation remains unexplained. Small s² means a very good understanding of the cause system. Large s² means that much variation remains unexplained and our understanding is weak.
RegressionBlock2

The coefficient of determination

So how do we decide whether s² is “small”? Dividing the variation explained by X by the total variance of Y, sY², yields the coefficient of determination, written as R².6 That is a bit of a mouthful so we usually just call it “R-squared”. R² sets the variance in Y to 100% and expresses the explained variation as a percentage. Put another way, it is the percentage of variation in Y explained by X.
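
Here is a minimal sketch of those two numbers, s and R², computed directly from made-up data rather than read off software output.

```python
import numpy as np

# Residual standard deviation s and R-squared for a simple linear regression,
# on invented data.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
Y = np.array([3.0, 5.2, 6.9, 9.1, 10.8, 13.2, 15.1, 16.8])

m, c = np.polyfit(X, Y, deg=1)
fits = m * X + c
residuals = Y - fits

ss_total = ((Y - Y.mean()) ** 2).sum()
ss_residual = (residuals ** 2).sum()
R2 = 1 - ss_residual / ss_total                 # proportion of variation explained
s = np.sqrt(ss_residual / (len(X) - 2))         # residual standard deviation

print(f"R-squared = {100 * R2:.1f}%, s = {s:.2f}")
```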

RegressionBlock3

The important thing to remember is that the residual variation is not a statistical artifact of the analysis. It is part of the real world business system, the cause-system of the Ys.7 It is the part on which you still have little quantitative grasp and which continues to hurt you. Returning to the cause and effect diagram, we picked one factor X to investigate and took its influence out of the data. The residual variation is the variation arising from the aggregate of all the other causes.

As we shall see in more detail in Part 3, the residual variation imposes a fundamental bound on the precision of predictions from the model. It turns out that s is the limiting standard error of future predictions.

Whether or not your regression was a worthwhile one, you will want to probe the residual variation further. A technique like DMAIC works well. Other improvement processes are available.

So how big should R² be? Well that is a question for your business leaders not a statistician. How much does the business gain financially from being able to explain just so much variation in the outcome? Anybody with an MBA should be able to answer this so you should have somebody in your organisation who can help.

The correlation coefficient

Some people like to take the square root of R² to obtain what they call a correlation coefficient. I have never been clear as to what this was trying to achieve. It always ends up telling me less than the scatter plot. So why bother? R² tells me something important that I understand and need to know. Leave it alone.

What about statistical significance?

I fear that “significance” is, pace George Miller, “a word worn smooth by many tongues”. It is a word that I try to avoid. Yet it seems a natural practice for some people to calculate a p-value and ask whether the regression is significant.

I have criticised p-values elsewhere. I might calculate them sometimes but only because I know what I am doing. The terrible fact is that if you collect sufficient data then your regression will eventually be significant. Statistical significance only tells me that you collected a lot of data. That’s why so many studies published in the press are misleading. Collect enough data and you will get a “significant” result. It doesn’t mean it matters in the real world.

R² is the real world measure of whether the regression was worth the trouble, (relatively) impervious to statistical manipulation. I can make p as small as I like just by collecting more and more data. In fact there is an equation that, for any given R², links p and the number of observations, n, for linear regression.8

p = 1 − F1, n−2( (n − 2) R² / (1 − R²) )

Here, Fμ, ν(x) is the cumulative distribution function of the F-distribution with μ and ν degrees of freedom. A little playing about with that equation in Excel will reveal that you can make p as small as you like without R² changing at all. Simply by making n larger. Collecting data until p is small is mere p-hacking. All p-values should be avoided by the novice. R² is the real world measure (relatively) impervious to statistical manipulation. That is what I am interested in. And what your boss should be interested in.
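
To see the point numerically, here is a hedged sketch that holds R² fixed and lets n grow, using the standard F test for a simple one-predictor regression (my assumption about the exact form of the equation above).

```python
from scipy.stats import f

# The same R-squared with more and more data: p falls towards zero even though
# the practical value of the regression has not changed at all.
R2 = 0.10                                    # a modest 10% of variation explained
for n in (20, 50, 200, 1000, 10_000):
    F = (n - 2) * R2 / (1 - R2)              # F statistic for one predictor
    p = f.sf(F, 1, n - 2)                    # survival function = 1 - CDF
    print(f"n = {n:6d}   p = {p:.2e}")
```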

Next time

Once we are confident that our regression model is stable and predictable, and that the regression is worth having, we can move on to the next stage.

Next time I shall look at prediction intervals and how to assess uncertainty in forecasts.

References

  1. Kahneman, D (2011) Thinking, Fast and Slow, Allen Lane, p240
  2. Berkeley, G (1710) A Treatise Concerning the Principles of Human Knowledge, Part 1, Dublin
  3. Stigler, S M (2016) The Seven Pillars of Statistical Wisdom, Harvard University Press, pp141-148
  4. Taguchi, G (1987) The System of Experimental Design: Engineering Methods to Optimize Quality and Minimize Costs, Quality Resources
  5. Draper, N R & Smith, H (1998) Applied Regression Analysis, 3rd ed., Wiley, p30
  6. Draper & Smith (1998) p33
  7. For an appealing discussion of cause-systems from a broader cultural standpoint see: Bostridge, I (2015) Schubert’s Winter Journey: Anatomy of an Obsession, Faber, pp358-365
  8. Draper & Smith (1998) p243

Regression done right: Part 1: Can I predict the future?

I recently saw an article in the Harvard Business Review called “Refresher on Regression Analysis”. I thought it was horrible so I wanted to set the record straight.

Linear regression from the viewpoint of machine learning

Linear regression is important, not only because it is a useful tool in itself, but because it is (almost) the simplest statistical model. The issues that arise in a relatively straightforward form are issues that beset the whole of statistical modelling and predictive analytics. Anyone who understands linear regression properly is able to ask probing questions about more complicated models. The complex internal algorithms of Kalman filters, ARIMA processes and artificial neural networks are accessible only to the specialist mathematician. However, each has several general features in common with simple linear regression. A thorough understanding of linear regression enables a due diligence of the claims made by the machine learning advocate. Linear regression is the paradigmatic exemplar of machine learning.

There are two principal questions that I want to talk about that are the big takeaways of linear regression. They are always the first two questions to ask in looking at any statistical modelling or machine learning scenario.

  1. What predictions can I make (if any)?
  2. Is it worth the trouble?

I am going to start looking at (1) in this blog and complete it in a future Part 2. I will then look at (2) in a further Part 3.

Variation, variation, variation

Variation is a major problem for business, the tendency of key measures to fluctuate irregularly. Variation leads to uncertainty. Will the next report be high or low? Or in the middle? Because of the uncertainty we have to allow safety margins or swallow some non-conformances. We have good days and bad days, good products and not so good. We have to carry costly working capital because of variation in cash flow. And so on.

We learned in our high school statistics class to characterise variation in a key process measure, call it the Big Y, by an histogram of observations. Perhaps we are bothered by the fluctuating level of monthly sales.

RegressionHistogram

The variation arises from a whole ecology of competing and interacting effects and factors that we call the cause-system of the outcome. In general, it is very difficult to single out individual factors as having been the cause of a particular observation, so entangled are they. It is still useful to capture them for reference on a cause and effect diagram.

RegressionIshikawa

One of the strengths of the cause and effect diagram is that it may prompt the thought that one of the factors is particularly important, call it Big X, perhaps it is “hours of TV advertising” (my age is showing). Motivated by that we can generate a sample of corresponding measurements of both the Y and X and plot them on a scatter plot.

RegressionScatter1

Well what else is there to say? The scatter plot shows us all the information in the sample. Scatter plots are an important part of what statistician John Tukey called Exploratory Data Analysis (EDA). We have some hunches and ideas, or perhaps hardly any idea at all, and we attack the problem by plotting the data in any way we can think of. So much easier now than when W Edwards Deming wrote:1

[Statistical practice] means tedious work, such as studying the data in various forms, making tables and charts and re-making them, trying to use and preserve the evidence in the results and to be clear enough to the reader: to endure disappointment and discouragement.

Or as Chicago economist Ronald Coase put it:

If you torture the data enough, nature will always confess.

The scatter plot is a fearsome instrument of data torture. It tells me everything. It might even tempt me to think that I have a basis on which to make predictions.

Prediction

In machine learning terms, we can think of the sample used for the scatter plot as a training set of data. It can be used to set up, “train”, a numerical model that we will then fix and use to predict future outcomes. The scatter plot strongly suggests that if we know a future X alone we can have a go at predicting the corresponding future Y. To see that more clearly we can draw a straight line by hand on the scatter plot, just as we did in high school before anybody suggested anything more sophisticated.

RegressionScatter2

Given any particular X we can read off the corresponding Y.

RegressionScatter3

The immediate insight that comes from drawing in the line is that not all the observations lie on the line. There is variation about the line so that there is actually a range of values of Y that seem plausible and consistent for any specified X. More on that in Parts 2 and 3.

In understanding machine learning it makes sense to start by thinking about human learning. Psychologists Gary Klein and Daniel Kahneman investigated how firefighters were able to perform so successfully in assessing a fire scene and making rapid, safety critical decisions. Lives of the public and of other firefighters were at stake. This is the sort of human learning situation that machines, or rather their expert engineers, aspire to emulate. Together, Klein and Kahneman set out to describe how the brain could build up reliable memories that would be activated in the future, even in the agony of the moment. They came to the conclusion that there are two fundamental conditions for a human to acquire a skill.2

  • An environment that is sufficiently regular to be predictable.
  • An opportunity to learn these regularities through prolonged practice.

The first bullet point is pretty much the most important idea in the whole of statistics. Before we can make any prediction from the regression, we have to be confident that the data has been sampled from “an environment that is sufficiently regular to be predictable”. The regression “learns” from those regularities, where they exist. The “learning” turns out to be the rather prosaic mechanics of matrix algebra as set out in all the standard texts.3 But that, after all, is what all machine “learning” is really about.

Statisticians capture the psychologists’ “sufficiently regular” through the mathematical concept of exchangeability. If a process is exchangeable then we can assume that the distribution of events in the future will be like the past. We can project our historic histogram forward. With regression we can do better than that.

Residuals analysis

Formally, the linear regression calculations calculate the characteristics of the model:

Y = mX + c + “stuff”

The “mX+c” bit is the familiar high school mathematics equation for a straight line. The “stuff” is variation about the straight line. What the linear regression mathematics does is (objectively) to calculate the m and c and then also tell us something about the “stuff”. It splits the variation in Y into two components:

  • What can be explained by the variation in X; and
  • The, as yet unexplained, variation in the “stuff”.

The first thing to learn about regression is that it is the “stuff” that is the interesting bit. In 1849 British astronomer Sir John Herschel observed that:

Almost all the greatest discoveries in astronomy have resulted from the consideration of what we have elsewhere termed RESIDUAL PHENOMENA, of a quantitative or numerical kind, that is to say, of such portions of the numerical or quantitative results of observation as remain outstanding and unaccounted for after subducting and allowing for all that would result from the strict application of known principles.

The straight line represents what we guessed about the causes of variation in Y and which the scatter plot confirmed. The “stuff” represents the causes of variation that we failed to identify and that continue to limit our ability to predict and manage. We call the predicted Ys that correspond to the measured Xs, and lie on the fitted straight line, the fits.

fiti = mXi + c

The residual values, or residuals, are obtained by subtracting the fits from the respective observed Y values. The residuals represent the “stuff”. Statistical software does this for us routinely. If yours doesn’t then bin it.

residuali = Yi – fiti

RegressionScatter4

There are a number of properties that the residuals need to satisfy for the regression to work. Investigating those properties is called residuals analysis.4 As far as use for prediction is concerned, it is sufficient that the “stuff”, the variation about the straight line, be exchangeable.5 That means that the “stuff” so far must appear from the data to be exchangeable and further that we have a rational belief that such a cause system will continue unchanged into the future. Shewhart charts are the best heuristics for checking the requirement for exchangeability, certainly as far as the historical data is concerned. Our first and, be under no illusion, mandatory check on the ability of the linear regression, or any statistical model, to make predictions is to plot the residuals against time on a Shewhart chart.

RegressionPBC

If there are any signals of special causes then the model cannot be used for prediction. It just can’t. For prediction we need residuals that are all noise and no signal. However, like all signals of special causes, such will provide an opportunity to explore and understand more about the cause system. The signal that prevents us from using this regression for prediction may be the very thing that enables an investigation leading to a superior model, able to predict more exactly than we ever hoped the failed model could. And even if there is sufficient evidence of exchangeability from the training data, we still need to continue vigilance and scrutiny of all future residuals to look out for any novel signals of special causes. Special causes that arise post-training provide fresh information about the cause system while at the same time compromising the reliability of the predictions.
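
Here is a minimal sketch of that mandatory check on made-up training data. The XmR limits at 2.66 × the mean moving range are the usual individuals chart convention, which I am assuming here.

```python
import numpy as np

# Fit the line on invented training data, then put the residuals, in time
# order, on an XmR chart and look for signals of special causes.
X = np.array([3.0, 5.0, 4.0, 6.0, 7.0, 5.5, 8.0, 6.5, 9.0, 7.5])            # e.g. hours of advertising
Y = np.array([42.0, 50.0, 47.0, 55.0, 60.0, 52.0, 66.0, 57.0, 71.0, 62.0])  # e.g. monthly sales

m, c = np.polyfit(X, Y, deg=1)
fits = m * X + c
residuals = Y - fits                      # the "stuff"

centre = residuals.mean()                 # ~0 by construction
npl = 2.66 * np.abs(np.diff(residuals)).mean()
signals = np.where(np.abs(residuals - centre) > npl)[0]
print("NPLs:", centre - npl, centre + npl)
print("signals of special causes at observations:", signals)
# Any signal here means the regression cannot be used for prediction as it stands.
```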

Thorough regression diagnostics will also be able to identify issues such as serial correlation, lack of fit, leverage and heteroscedasticity. It is essential to regression and its omission is intolerable. Residuals analysis is one of Stephen Stigler’s Seven Pillars of Statistical Wisdom.6 As Tukey said:

The greatest value of a picture is when it forces us to notice what we never expected to see.

To come:

Part 2: Is my regression significant? … is a dumb question.
Part 3: Quantifying predictions with statistical intervals.

References

  1. Deming, W E (1975) “On probability as a basis for action”, The American Statistician 29(4) pp146-152
  2. Kahneman, D (2011) Thinking, Fast and Slow, Allen Lane, p240
  3. Draper, N R & Smith, H (1998) Applied Regression Analysis, 3rd ed., Wiley, p44
  4. Draper & Smith (1998) Chs 2, 8
  5. I have to admit that weaker conditions may be adequate in some cases but these are far beyond any other than a specialist mathematician.
  6. Stigler, S M (2016) The Seven Pillars of Statistical Wisdom, Harvard University Press, Chapter 7