UK railway suicides – 2017 update

The latest UK rail safety statistics were published on 23 November 2017, again absent much of the press fanfare we had seen in the past. Regular readers of this blog will know that I have followed the suicide data series, and the press response, closely in 2016, 20152014, 2013 and 2012. Again I have re-plotted the data myself on a Shewhart chart.

RailwaySuicides20171

Readers should note the following about the chart.

  • Many thanks to Tom Leveson Gower at the Office of Rail and Road who confirmed that the figures are for the year up to the end of March.
  • Some of the numbers for earlier years have been updated by the statistical authority.
  • I have recalculated natural process limits (NPLs) as there are still no more than 20 annual observations, and because the historical data has been updated. The NPLs have therefore changed but, this year, not by much.
  • Again, the pattern of signals, with respect to the NPLs, is similar to last year.

The current chart again shows two signals, an observation above the upper NPL in 2015 and a run of 8 below the centre line from 2002 to 2009. As I always remark, the Terry Weight rule says that a signal gives us license to interpret the ups and downs on the chart. So I shall have a go at doing that.

It will not escape anybody’s attention that this is now the second year in which there has been a fall in the number of fatalities.

I haven’t yet seen any real contemporaneous comment on the numbers from the press. This item appeared on the BBC, a weak performer in the field of data journalism but clearly with privileged access to the numbers, on 30 June 2017, confidently attributing the fall to past initiatives.

Sky News clearly also had advanced sight of the numbers and make the bold claim that:

… for every death, six more lives were saved through interventions.

That item goes on to highlight a campaign to encourage fellow train users to engage with anybody whose behaviour attracted attention.

But what conclusions can we really draw?

In 2015 I was coming to the conclusion that the data increasingly looked like a gradual upward trend. The 2016 data offered a challenge to that but my view was still that it was too soon to say that the trend had reversed. There was nothing in the data incompatible with a continuing trend. This year, 2017, has seen 2016’s fall repeated. A welcome development but does it really show conclusively that the upward trending pattern is broken? Regular readers of this blog will know that Langian statistics like “lowest for six years” carry no probative weight here.

Signal or noise?

Has there been a change to the underlying cause system that drives the suicide numbers? Last year, I fitted a trend line through the data and asked which narrative best fitted what I observed, a continuing increasing trend or a trend that had plateaued or even reversed. You can review my analysis from last year here.

Here is the data and fitted trend updated with this year’s numbers, along with NPLs around the fitted line, the same as I did last year.

RailwaySuicides20172

Let’s think a little deeper about how to analyse the data. The first step of any statistical investigation ought to be the cause and effect diagram.

SuicideCne

The difficulty with the suicide data is that there is very little reproducible and verifiable knowledge as to its causes. I have seen claims, of whose provenance I am uncertain, that railway suicide is virtually unknown in the USA. There is a lot of useful thinking from common human experience and from more general theories in psychology. But the uncertainty is great. It is not possible to come up with a definitive cause and effect diagram on which all will agree, other from the point of view of identifying candidate factors.

The earlier evidence of a trend, however, suggests that there might be some causes that are developing over time. It is not difficult to imagine that economic trends and the cumulative awareness of other fatalities might have an impact. We are talking about a number of things that might appear on the cause and effect diagram and some that do not, the “unknown unknowns”. When I identified “time” as a factor, I was taking sundry “lurking” factors and suspected causes from the cause and effect diagram that might have a secular impact. I aggregated them under the proxy factor “time” for want of a more exact analysis.

What I have tried to do is to split the data into two parts:

  • A trend (linear simply for the sake of exploratory data analysis (EDA); and
  • The residual variation about the trend.

The question I want to ask is whether the residual variation is stable, just plain noise, or whether there is a signal there that might give me a clue that a linear trend does not hold.

There is no signal in the detrended data, no signal that the trend has reversed. The tough truth of the data is that it supports either narrative.

  • The upward trend is continuing and is stable. There has been no reversal of trend yet.
  • The data is not stable. True there is evidence of an upward trend in the past but there is now evidence that deaths are decreasing.

Of course, there is no particular reason, absent the data, to believe in an increasing trend and the initiative to mitigate the situation might well be expected to result in an improvement.

Sometimes, with data, we have to be honest and say that we do not have the conclusive answer. That is the case here. All that can be done is to continue the existing initiatives and look to the future. Nobody ever likes that as a conclusion but it is no good pretending things are unambiguous when that is not the case.

Next steps

Previously I noted proposals to repeat a strategy from Japan of bathing railway platforms with blue light. In the UK, I understand that such lights were installed at Gatwick in summer 2014. In fact my wife and I were on the platform at Gatwick just this week and I had the opportunity to observe them. I also noted, on my way back from court the other day, blue strip lights along the platform edge at East Croydon. I think they are recently installed. However, I have not seen any data or heard of any analysis.

A huge amount of sincere endeavour has gone into this issue but further efforts have to be against the background that there is still no conclusive evidence of improvement.

Suggestions for alternative analyses are always welcomed here.

Advertisements

UK railway suicides – 2015 update

The latest UK rail safety statistics were published in September 2015 absent the usual press fanfare. Regular readers of this blog will know that I have followed the suicide data series, and the press response, closely in 2014, 2013 and 2012.

This year I am conscious that one of those units is not a mere statistic but a dear colleague, Nigel Clements. It was poet W B Yeats who observed, in his valedictory verse Under Ben Bulben that “Measurement began our might.” He ends the poem by inviting us to “Cast a cold eye/ On life, on death.” Sometimes, with statistics, we cast the cold eye but the personal reminds us that it must never be an academic exercise.

Nigel’s death gives me an additional reason for following this series. I originally latched onto it because I felt that exaggerated claims  as to trends were being made. It struck me as a closely bounded problem that should be susceptible to taught measurement. And it was something important.  Again I have re-plotted the data myself on a Shewhart chart.

RailwaySuicides4

Readers should note the following about the chart.

  • Some of the numbers for earlier years have been updated by the statistical authority.
  • I have recalculated natural process limits as there are still no more than 20 annual observations.
  • The signal noted last year has persisted (in red) with two consecutive observations above the upper natural process limit. There are also now eight points below the centre line at the beginning of the series.

As my colleague Terry Weight always taught me, a signal gives us license to interpret the ups and downs on the chart. This increasingly looks like a gradual upward trend.

Though there was this year little coverage in the press, I did find this article in The Guardian newspaper. I had previously wondered whether the railway data simply reflected an increasing trend in UK suicide in general. The Guardian report is eager to emphasise:

The total number [of suicides] in the UK has risen in recent years, with the latest Office for National Statistics figures showing 6,233 suicides registered in the UK in 2013, a 4% increase on the previous year.

Well, #executivetimeseries! I have low expectations of press data journalism so I do not know why I am disappointed. In any event I decided to plot the data. There were a few problems. The railway data is not collected by calendar year so the latest observation is 2014/15. I have not managed to identify which months are included though, while I was hunting I found out that the railway data does not include London Underground. I can find no railway data before 2001/02. The national suicide data is collected by calendar year and the last year published is 2013. I have done my best by (not quite) arbitrarily identifying 2013/14 in the railway data with 2013 nationally. I also tried the obvious shift by one year and it did not change the picture.

RailwaySuicides5

I have added a LOWESS line (with smoothing parameter 0.4) to the national data the better to pick out the minimum around 2007, just before the start of the financial crisis. That is where the steady decline over the previous quarter century reverses. It is in itself an arresting statistic. But I don’t see the national trend mirrored in the railway data, thereby explaining that trend.

Previously I noted proposals to repeat a strategy from Japan of bathing railway platforms with blue light. Professor Michiko Ueda of Syracuse University was kind enough to send me details of the research. The conclusions were encouraging but tentative and, unfortunately, the Japanese rail companies have not made any fresh data available for analysis since 2010. In the UK, I understand that such lights were installed at Gatwick in summer 2014 but I have not seen any data.

A huge amount of sincere endeavour has gone into this issue but further efforts have to be against the background that there is an escalating and unexplained problem.

Things and actions are what they are and the consequences of them will be what they will be: why then should we desire to be deceived?

Joseph Butler

How to predict floods

File:Llanrwst Floods 2015 1.ogvI started my grown-up working life on a project seeking to predict extreme ocean currents off the north west coast of the UK. As a result I follow environmental disasters very closely. I fear that it’s natural that incidents in my own country have particular salience. I don’t want to minimise disasters elsewhere in the world when I talk about recent flooding in the north of England. It’s just that they are close enough to home for me to get a better understanding of the essential features.

The causes of the flooding are multi-factorial and many of the factors are well beyond my expertise. However, The Times (London) reported on 28 December 2015 that “Some scientists say that [the UK Environment Agency] has been repeatedly caught out by the recent heavy rainfall because it sets too much store by predictions based on historical records” (p7). Setting store by predictions based on historical records is very much where my hands-on experience of statistics began.

The starting point of prediction is extreme value theory, developed by Sir Ronald Fisher and L H C Tippett in the 1920s. Extreme value analysis (EVA) aims to put probabilistic bounds on events outside the existing experience base by predicating that such events follow a special form of probability distribution. Historical data can be used to fit such a distribution using the usual statistical estimation methods. Prediction is then based on a double extrapolation: firstly in the exact form of the tail of the extreme value distribution and secondly from the past data to future safety. As the old saying goes, “Interpolation is (almost) always safe. Extrapolation is always dangerous.”

EVA rests on some non-trivial assumptions about the process under scrutiny. No statistical method yields more than was input in the first place. If we are being allowed to extrapolate beyond the experience base then there are inevitably some assumptions. Where the real world process doesn’t follow those assumptions the extrapolation is compromised. To some extent there is no cure for this other than to come to a rational decision about the sensitivity of the analysis to the assumptions and to apply a substantial safety factor to the physical engineering solutions.

One of those assumptions also plays to the dimension of extrapolation from past to future. Statisticians often demand that the data be independent and identically distributed. However, that is a weird thing to demand of data. Real world data is hardly ever independent as every successive observation provides more information about the distribution and alters the probability of future observations. We need a better idea to capture process stability.

Historical data can only be projected into the future if it comes from a process that is “sufficiently regular to be predictable”. That regularity is effectively characterised by the property of exchangeability. Deciding whether data is exchangeable demands, not only statistical evidence of its past regularity, but also domain knowledge of the physical process that it measures. The exchangeability must continue into the predicable future if historical data is to provide any guide. In the matter of flooding, knowledge of hydrology, climatology, planning and engineering, law, in addition to local knowledge about economics and infrastructure changes already in development, is essential. Exchangeability is always a judgment. And a critical one.

Predicting extreme floods is a complex business and I send my good wishes to all involved. It is an example of something that is essentially a team enterprise as it demands the co-operative inputs of diverse sets of experience and skills.

In many ways this is an exemplary model of how to act on data. There is no mechanistic process of inference that stands outside a substantial knowledge of what is being measured. The secret of data analysis, which often hinges on judgments about exchangeability, is to visualize the data in a compelling and transparent way so that it can be subjected to collaborative criticism by a diverse team.

Imagine …

Ben Bernanke official portrait.jpgNo, not John Lennon’s dreary nursery rhyme for hippies.

In his memoir of the 2007-2008 banking crisis, The Courage to ActBen Benanke wrote about his surprise when the crisis materialised.

We saw, albeit often imperfectly, most of the pieces of the puzzle. But we failed to understand – “failed to imagine” might be a better phrase – how those pieces would fit together to produce a financial crisis that compared to, and arguably surpassed, the financial crisis that ushered in the Great Depression.

That captures the three essentials of any attempt to foresee a complex future.

  • The pieces
  • The fit
  • Imagination

In any well managed organisation, “the pieces” consist of the established Key Performance Indicators (KPIs) and leading measures. Diligent and rigorous criticism of historical data using process behaviour charts allows departures from stability to be identified timeously. A robust and disciplined system of management and escalation enables an agile response when special causes arise.

Of course, “the fit” demands a broader view of the data, recognising interactions between factors and the possibility of non-simple global responses remote from a locally well behaved response surface. As the old adage goes, “Fit locally. Think globally.” This is where the Cardinal Newman principle kicks in.

“The pieces” and “the fit”, taken at their highest, yield a map of historical events with some limited prediction as to how key measures will behave in the future. Yet it is common experience that novel factors persistently invade. The “bow wave” of such events will not fit a recognised pattern where there will be a ready consensus as to meaning, mechanism and action. These are the situations where managers are surprised by rapidly emerging events, only to protest, “We never imagined …”.

Nassim Taleb’s analysis of the financial crisis hinged on such surprises and took him back to the work of British economist G L S Shackle. Shackle had emphasised the importance of imagination in economics. Put at its most basic, any attempt to assign probabilities to future events depends upon the starting point of listing the alternatives that might occur. Statisticians call it the sample space. If we don’t imagine some specific future we won’t bother thinking about the probability that it might come to be. Imagination is crucial to economics but it turns out to be much more pervasive as an engine of improvement that at first is obvious.

Imagination and creativity

Frank Whittle had to imagine the jet engine before he could bring it into being. Alan Turing had to imagine the computer. They were both fortunate in that they were able to test their imagination by construction. It was all realised in a comparatively short period of time. Whittle’s and Turing’s respective imaginations were empirically verified.

What is now proved was once but imagined.

William Blake

Not everyone has had the privilege of seeing their imagination condense into reality within their lifetime. In 1946, Sir George Paget Thomson and Moses Blackman imagined a plentiful source of inexpensive civilian power from nuclear fusion. As of writing, prospects of a successful demonstration seem remote. Frustratingly, as far as I can see, the evidence still refuses to tip the balance as to whether future success is likely or that failure is inevitable.

Something as illusive as imagination can have a testable factual content. As we know, not all tests are conclusive.

Imagination and analysis

Imagination turns out to be essential to something as prosaic as Root Cause Analysis. And essential in a surprising way. Establishing an operative cause of a past event is an essential task in law and engineering. It entails the search for a counterfactual, not what happened but what might have happened to avoid the  regrettable outcome. That is inevitably an exercise in imagination.

In almost any interesting situation there will be multiple imagined pasts. If there is only one then it is time to worry. Sometimes it is straightforward to put our ideas to the test. This is where the Shewhart cycle comes into its own. In other cases we are in the realms of uncomfortable science. Sometimes empirical testing is frustrated because the trail has gone cold.

The issues of counterfactuals, Root Cause Analysis and causation have been explored by psychologists Daniel Kahneman1 and Ruth Byrne2 among others. Reading their research is a corrective to the optimistic view that Root Cause analysis is some sort of inevitably objective process. It is distorted by all sorts of heuristics and biases. Empirical testing is vital, if only through finding some data with borrowing strength.

Imagine a millennium bug

In 1984, Jerome and Marilyn Murray published Computers in Crisis in which they warned of a significant future risk to global infrastructure in telecommunications, energy, transport, finance, health and other domains. It was exactly those areas where engineers had been enthusiastic to exploit software from the earliest days, often against severe constraints of memory and storage. That had led to the frequent use of just two digits to represent a year, “71” for 1971, say. From the 1970s, software became more commonly embedded in devices of all types. As the year 2000 approached, the Murrays envisioned a scenario where the dawn of 1 January 2000 was heralded by multiple system failures where software registers reset to the year 1900, frustrating functions dependent on timing and forcing devices into a fault mode or a graceless degradation. Still worse, systems may simply malfunction abruptly and without warning, the only sensible signal being when human wellbeing was compromised. And the ruinous character of such a threat would be that failure would be inherently simultaneous and global, with safeguarding systems possibly beset with the same defects as the primary devices. It was easy to imagine a calamity.

Risk matrixYou might like to assess that risk yourself (ex ante) by locating it on the Risk Assessment Matrix to the left. It would be a brave analyst who would categorise it as “Low”, I think. Governments and corporations were impressed and embarked on a massive review of legacy software and embedded systems, estimated to have cost around $300 billion at year 2000 prices. A comprehensive upgrade programme was undertaken by nearly all substantial organisations, public and private.

Then, on 1 January 2000, there was no catastrophe. And that caused consternation. The promoters of the risk were accused of having caused massive expenditure and diversion of resources against a contingency of negligible impact. Computer professionals were accused, in terms, of self-serving scare mongering. There were a number of incidents which will not have been considered minor by the people involved. For example, in a British hospital, tests for Down’s syndrome were corrupted by the bug resulting in contra-indicated abortions and births. However, there was no global catastrophe.

This is the locus classicus of a counterfactual. Forecasters imagined a catastrophe. They persuaded others of their vision and the necessity of vast expenditure in order to avoid it. The preventive measures were implemented at great costs. The Catastrophe did not occur. Ex post, the forecasters were disbelieved. The danger had never been real. Even Cassandra would have sympathised.

Critics argued that there had been a small number of relatively minor incidents that would have been addressed most economically on a “fix on failure” basis. Much of this turns out to be a debate about the much neglected column of the risk assessment headed “Detectability”. Where a failure will inflict immediate pain, it is so much more critical as to management and mitigation than a failure that will present the opportunity for detection and protection in advance of a broader loss. Here, forecasting Detectability was just as important as Probability and Consequences in arriving at an economic strategy for management.

It is the fundamental paradox of risk assessment that, where control measures eliminate a risk, it is not obvious whether the benign outcome was caused by the control or whether the risk assessment was just plain wrong and the risk never existed. Another counterfactual. Again, finding some alternative data with borrowing strength can help though it will ever be difficult to build a narrative appealing to a wide population. There are links to some sources of data on the Wikipedia article about the bug. I will leave it to the reader.

Imagine …

Of course it is possible to find this all too difficult and to adopt the Biblical outlook.

I returned, and saw under the sun, that the race is not to the swift, nor the battle to the strong, neither yet bread to the wise, nor yet riches to men of understanding, nor yet favour to men of skill; but time and chance happeneth to them all.

Ecclesiastes 9:11
King James Bible

That is to adopt the outlook of the lady on the level crossing. Risk professionals look for evidence that their approach works.

The other day, I was reading the annual report of the UK Health and Safety Executive (pdf). It shows a steady improvement in the safety of people at work though oddly the report is too coy to say this in terms. The improvement occurs over the period where risk assessment has become ubiquitous in industry. In an individual work activity it will always be difficult to understand whether interventions are being effective. But using the borrowing strength of the overall statistics there is potent evidence that risk assessment works.

References

  1. Kahneman, D & Tversky, A (1979) “The simulation heuristic”, reprinted in Kahneman et al. (1982) Judgment under Uncertainty: Heuristics and Biases, Cambridge, p201
  2. Byrne, R M J (2007) The Rational Imagination: How People Create Alternatives to Reality, MIT Press

Superforecasting – the thing that TalkTalk didn’t do

I have just been reading Superforecasting: The Art and Science of Prediction by Philip Tetlock and Dan Gardner. The book has attracted much attention and enthusiasm in the press. It makes a bold claim that some people, superforecasters, though inexpert in the conventional sense, are possessed of the ability to make predictions with a striking degree of accuracy, that those individuals exploit a strategy for forecasting applicable even to the least structured evidence, and that the method can be described and learned. The book summarises results of a study sponsored by US intelligence agencies as part of the Good Judgment Project but, be warned, there is no study data in the book.

I haven’t found any really good distinction between forecasting and prediction so I might swap between the two words arbitrarily here.

What was being predicted?

The forecasts/ predictions in question were all in the field of global politics and economics. For example, a question asked in January 2011 was:

Will Italy restructure or default on its debt by 31 December 2011?

This is a question that invited a yes/ no answer. However, participants were encouraged to answer with a probability, a number between 0% and 100% inclusive. If they were certain of the outcome they could answer 100%, if certain that it would not occur, 0%. The participants were allowed, I think encouraged, to update and re-update their forecasts at any time. So, as far as I can see, a forecaster who predicted 60% for Italian debt restructuring in January 2011 could revise that to 0% in December, even up to the 30th. Each update was counted as a separate forecast.

The study looked for “Goldilocks” problems, not too difficult, not to easy but just right.

Bruno de Finetti was very sniffy about using the word “prediction” in this context and preferred the word “prevision”. It didn’t catch on.

Who was studied?

The study was conducted by means of a tournament among volunteers. I gather that the participants wanted to be identified and thereby personally associated with their scores. Contestants had to be college graduates and, as a preliminary, had to complete a battery of standard cognitive and general knowledge tests designed to characterise their given capabilities. The competitors in general fell in the upper 30 percent of the general population for intelligence and knowledge. When some book reviews comment on how the superforecasters included housewives and unemployed factory workers I think they give the wrong impression. This was a smart, self-selecting, competitive group with an interest in global politics. As far as I can tell, backgrounds in mathematics, science and computing were typical. It is true that most were amateurs in global politics.

With such a sampling frame, of what population is it representative? The authors recognise that problem though don’t come up with an answer.

How measured?

Forecasters were assessed using Brier scores. I fear that Brier scores fail to be intuitive, are hard to understand and raise all sorts of conceptual difficulties. I don’t feel that they are sufficiently well explained, challenged or justified in the book. Suppose that a competitor predicts a probability p for the Italian default of 60%. Rewrite this as a probability in the range 0 to 1 for convenience, 0.6 If the competitor accepts finite additivity then the probability of “no default” is 1- 0.6 = 0.4. Now suppose that outcomes f are coded as 1 when confirmed and 0 when disconfirmed. That means that if a default occurs then f ( default ) = 1 and f (no default ) = 0. If there is no default then f ( default ) = 0 and f (no default ) = 1. It’s not easy to get. We then take the difference between the ps and the fs, calculate the square of the differences and sum them. This is illustrated below for “no default” which yields a Brier score of 0.72.

Event p f ( pf ) 2
Default 0.6 0 0.36
No default 0.4 1 0.36
Sum 1.0 0.72

Suppose we were dealing with a fair coin toss. Nobody would criticise a forecasting probability of 50% for heads and 50% for tails. The long run Brier score would be 0.5 (think about it). Brier scores were averaged for each competitor and used as the basis of ranking them. If a competitor updated a prediction then every fresh update was counted as an individual prediction and each prediction was scored. More on this later. An average of 0.5 would be similar to a chimp throwing darts at a target. That is about how well expert professional forecasters had performed in a previous study. The lower the score the better. Zero would be perfect foresight.

I would have liked to have seen some alternative analyses and I think that a Hosmer-Lemeshow statistic or detailed calibration study would in some ways have been more intuitive and yielded more insight.

What the results?

The results are not given in the book, only some anecdotes. Competitor Doug Lorch, a former IBM programmer it says, answered 104 questions in the first year and achieved a Brier score of 0.22. He was fifth on the drop list. The top 58 competitors, the superforecasters, had an average Brier score of 0.25 compared with 0.37 for the balance. In the second year, Lorch joined a team of other competitors identified as superforecasters and achieved an average Brier score of 0.14. He beat a prediction market of traders dealing in futures in the outcomes, the book says by 40% though it does not explain what that means.

I don’t think that I find any of that, in itself, persuasive. However, there is a limited amount of analysis here on the (old) Good Judgment Project website. Despite the reservations I have set out above there are some impressive results, in particular this chart.

The competitors’ Brier scores were measured over the first 25 questions. The 100 with the lowest scores were identified, the blue line. The chart then shows the performance of that same group of competitors over the subsequent 175 questions. Group membership is not updated. It is the same 100 competitors as performed best at the start who are plotted across the whole 200 questions. The red line shows the performance of the worst 100 competitors from the first 25 questions, again with the same cohort plotted for all 200 questions.

Unfortunately, it is not raw Brier scores that are plotted but standardised scores. The scores have been adjusted so that the mean is zero and standard deviation one. That actually adds nothing to the chart but obscures somewhat how it is interpreted. I think that violates all Shewhart’s rules of data presentation.

That said, over the first 25 questions the blue cohort outperform the red. Then that same superiority of performance is maintained over the subsequent 175 questions. We don’t know how much is the difference in performance because of the standardisation. However, the superiority of performance is obvious. If that is a mere artefact of the data then I am unable to see how. Despite the way that data is presented and my difficulties with Brier scores, I cannot think of any interpretation other than there being a cohort of superforecasters who were, in general, better at prediction than the rest.

Conclusions

Tetlock comes up with some tentative explanations as to the consistent performance of the best. In particular he notes that the superforecasters updated their predictions more frequently than the remainder. Each of those updates was counted as a fresh prediction. I wonder how much of the variation in Brier scores is accounted for by variation in the time of making the forecast? If superforecasters are simply more active than the rest, making lots of forecasts once the outcome is obvious then the result is not very surprising.

That may well not be the case as the book contends that superforecasters predicting 300 days in the future did better than the balance of competitors predicting 100 days. However, I do feel that the variation arising from the time a prediction was made needs to be taken out of the data so that the variation in, shall we call it, foresight can be visualised. The book is short on actual analysis and I would like to have seen more. Even in a popular management book.

The data on the website on purported improvements from training is less persuasive, a bit of an #executivetimeseries.

Some of the recommendations for being a superforecaster are familiar ideas from behavoural psychology. Be a fox not a hedgehog, don’t neglect base rates, be astute to the distinction between signal and noise, read widely and richly, etc..

Takeaways

There was one unanticipated and intriguing result. The superforecasters updated their predictions not only frequently but by fine degrees, perhaps from 61% to 62%. I think that some further analysis is required to show that that is not simply an artefact of the measurement. Because Brier scores have a squared term they would be expected to punish the variation in large adjustments.

However, taking the conclusion at face value, it has some important consequences for risk assessment which often proceeds by broadly granular ranking on a rating scale of 1 to 5, say. The study suggests that the best predictions will be those where careful attention is paid to fine gradations in probability.

Of course, continual updating of predictions is essential to even the most primitive risk management though honoured more often in the breach than the observance. I shall come back to the significance of this for risk management in a future post.

There is also an interesting discussion about making predictions in teams but I shall have to come back to that another time.

The amateurs out-performed the professionals on global politics. I wonder if the same result would have been encountered against experts in structural engineering.

And TalkTalk? They forgot, pace Stanley Baldwin, that the hacker will always get through.

Professor Tetlock invites you to join the programme at http://www.goodjudgment.com.

The Iron Law at Volkswagen

So Michael Horn, VW’s US CEO has made a “sincere apology” for what went on at VW.

And like so many “sincere apologies” he blamed somebody else. “My understanding is that it was a couple of software engineers who put these in.”

As an old automotive hand I have always been very proud of the industry. I have held it up as a model of efficiency, aesthetic aspiration, ambition, enlightenment and probity. My wife will tell you how many times I have responded to tales of workplace chaos with “It couldn’t happen in a car plant”. Fortunately we don’t own a VW but I still feel betrayed by this. Here’s why.

A known risk

Everybody knew from the infancy of emissions testing, which came along at about the same time as the adoption of engine management systems, the risks of a “cheat device”. It was obvious to all that engineers might be tempted to manoeuvre a recalcitrant engine through a challenging emissions test by writing software so as to detect test conditions and thereon modify performance.

In the better sort of motor company, engineers were left in no doubt that this was forbidden and the issue was heavily policed with code reviews and process surveillance.

This was not something that nobody saw coming, not a blind spot of risk identification.

The Iron Law

I wrote before about the Iron Law of Oligarchy. Decision taking executives in an organisation try not to pass information upwards. That will only result in interference and enquiry. Supervisory boards are well aware of this phenomenon because, during their own rise to the board, they themselves were the senior managers who constituted the oligarchy and who kept all the information to themselves. As I guessed last time I wrote, decisions like this don’t get taken at board level. They are taken out of the line of sight of the board.

Governance

So here we have a known risk. A threat that would likely not be detected in the usual run of line management. And it was of such a magnitude as would inflict hideous ruin on Volkswagen’s value, accrued over decades of hard built customer reputation. Volkswagen, an eminent manufacturer with huge resources, material, human and intellectual. What was the governance function to do?

Borrowing strength again

It would have been simple, actually simple, to secret shop the occasional vehicle and run it through an on-road emissions test. Any surprising discrepancy between the results and the regulatory tests would then have been a signal that the company was at risk and triggered further investigation. An important check on any data integrity is to compare it with cognate data collected by an independent route, data that shares borrowing strength.

Volkswagen’s governance function simply didn’t do the simple thing. Never have so many ISO 31000 manuals been printed in vain. Theirs were the pot odds of a jaywalker.

Knowledge

In the English breach of trust case of Baden, Delvaux and Lecuit v Société Générale [1983] BCLC 325, Mr Justice Peter Gibson identified five levels of knowledge that might implicate somebody in wrongdoing.

  • Actual knowledge.
  • Wilfully shutting one’s eyes to the obvious (Nelsonian knowledge).
  • Wilfully and recklessly failing to make such enquiries as an honest and reasonable man would make.
  • Knowledge of circumstances that would indicate the facts to an honest and reasonable man.
  • Knowledge of circumstances that would put an honest and reasonable man on enquiry.

I wonder where VW would place themselves in that.

How do you sound when you feel sorry?

… is the somewhat barbed rejoinder to an ungracious apology. Let me explain how to be sorry. There are three “R”s.

  • Remorse: Different from the “regret” that you got caught. A genuine internal emotional reaction. The public are good at spotting when emotions are genuine but it is best evidenced by the following two “R”s.
  • Reparation: Trying to undo the damage. VW will not have much choice about this as far as the motorists are concerned but the shareholders may be a different matter. I don’t think Horn’s director’s insurance will go very far.
  • Reform: This is the barycentre of repentance. Can VW now change the way it operates to adopt genuine governance and systematic risk management?

Mr Horn tells us that he has little control over what happens in his company. That is probably true. I trust that he will remember that at his next remuneration review. If there is one.

When they said, “Repent!”, I wonder what they meant.

Leonard Cohen
The Future

First thoughts on VW’s emmissions debacle

It is far too soon to tell exactly what went on at VW, in the wider motor industry, within the respective regulators and within governments. However, the way that the news has come out, and the financial and operational impact that it is likely to have, are enough to encourage all enterprises to revisit their risk management, governance and customer reputation management policies. Corporate scandals are not a new phenomenon, from the collapse of the Medici Bank in 1494, Warren Hastings’ alleged despotism in the British East India Company, down to the FIFA corruption allegations that broke earlier this year. Organisational scandals are as old as organisations. The bigger the organisations get, the bigger the scandals are going to be.

Normal Scandals

In 1984, Scott Perrow published his pessimistic analysis of what he saw as the inevitability of Normal Accidents in complex technologies. I am sure that there is a market for a book entitled Normal Scandals: Living with High-Risk Organisational Structures. But I don’t share Perrow’s pessimism. Life is getting safer. Let’s adopt the spirit of continual improvement to make investment safer too. That’s investment for those of us trying to accumulate a modest portfolio for retirement. Those who aspire to join the super rich will still have to take their chances.

I fully understand that organisations sometimes have to take existential risks to stay in business. The development of Rolls-Royce’s RB211 aero-engine well illustrates what happens when a manufacturer finds itself with proven technologies that are inadequately aligned with the Voice of the Customer. The market will not wait while the business catches up. There is time to develop a response but only if that solution works first time. In the case of Rolls-Royce it didn’t and insolvency followed. However, there was no alternative but to try.

What happened at VW? I just wonder whether the Iron Law of Oligarchy was at work. To imagine that a supervisory board sits around discussing the details of engine management software is naïve. In fact it was the RB211 crisis that condemned such signal failures of management to delegate. Do VW’s woes flow from a decision taken by a middle manager, or a blind eye turned, that escaped an inadequate system of governance? Perhaps a short term patch in anticipation of an ultimate solution?

Cardinal Newman’s contribution to governance theory

John Henry Newman learned about risk management the hard way. Newman was an English Anglican divine who converted to the Catholic Church in 1845. In 1850 Newman became involved in the controversy surrounding Giacinto Achilli, a priest expelled from the Catholic Church for rape and sexual assault but who was making a name from himself in England as a champion of the protestant evangelical cause. Conflict between Catholic and protestant was a significant feature of the nineteenth century English political landscape. Newman was minded to ensure that Achilli’s background was widely known. He took legal advice from counsel James Hope-Scott about the risks of a libel action from Achilli. Hope-Scott was reassuring and Newman published. The publication resulted in Newman’s prosecution and conviction for criminal libel.

Speculation about what legal advice VW have received as to their emissions strategy would be inappropriate. However, I trust that, if they imagined they were externalising any risk thereby, they checked the value of their legal advisors’ professional indemnity insurance.

Newman certainly seems to have learned his lesson and subsequently had much to teach the modern world about risk management and governance. After the Achilli trial Newman started work on his philosophical apologia, The Grammar of Assent. One argument in that book has had such an impact on modern thinking about evidence and probability that it was quoted in full by Bruno de Finetti in Volume 1 of his 1974 Theory of Probability.

Supposes a thesis (e.g. the guilt of an accused man) is supported by a great deal of circumstantial evidence of different forms, but in agreement with each other; then even if each piece of evidence is in itself insufficient to produce any strong belief, the thesis is decisively strengthened by their joint effect.

De Finetti set out the detailed mathematics and called this the Cardinal Newman principle. It is fundamental to the modern concept of borrowing strength.

The standard means of defeating governance are all well known to oligarchs, regulator capture, “stake-driving” – taking actions outside the oversight of governance that will not be undone without engaging the regulator in controversy, “whipsawing” – promising A that approval will be forthcoming from B while telling B that A has relied upon her anticipated, and surely “uncontroversial”, approval. There are plenty of others. Robert Caro’s biography The Power Broker: Robert Moses and the Fall of New York sets out the locus classicus.

Governance functions need to exploit the borrowing strength of diverse data sources to identify misreporting and misconduct. And continually improve how they do that. The answer is trenchant and candid criticism of historical data. That’s the only data you have. A rigorous system of goal deployment and mature use of process behaviour charts delivers a potent stimulus to reluctant data sharers.

Things and actions are what they are and the consequences of them will be what they will be: why then should we desire to be deceived?

Bishop Joseph Butler