Superforecasting – the thing that TalkTalk didn’t do

I have just been reading Superforecasting: The Art and Science of Prediction by Philip Tetlock and Dan Gardner. The book has attracted much attention and enthusiasm in the press. It makes a bold claim that some people, superforecasters, though inexpert in the conventional sense, are possessed of the ability to make predictions with a striking degree of accuracy, that those individuals exploit a strategy for forecasting applicable even to the least structured evidence, and that the method can be described and learned. The book summarises results of a study sponsored by US intelligence agencies as part of the Good Judgment Project but, be warned, there is no study data in the book.

I haven’t found any really good distinction between forecasting and prediction so I might swap between the two words arbitrarily here.

What was being predicted?

The forecasts/ predictions in question were all in the field of global politics and economics. For example, a question asked in January 2011 was:

Will Italy restructure or default on its debt by 31 December 2011?

This is a question that invited a yes/ no answer. However, participants were encouraged to answer with a probability, a number between 0% and 100% inclusive. If they were certain of the outcome they could answer 100%, if certain that it would not occur, 0%. The participants were allowed, I think encouraged, to update and re-update their forecasts at any time. So, as far as I can see, a forecaster who predicted 60% for Italian debt restructuring in January 2011 could revise that to 0% in December, even up to the 30th. Each update was counted as a separate forecast.

The study looked for “Goldilocks” problems, not too difficult, not too easy but just right.

Bruno de Finetti was very sniffy about using the word “prediction” in this context and preferred the word “prevision”. It didn’t catch on.

Who was studied?

The study was conducted by means of a tournament among volunteers. I gather that the participants wanted to be identified and thereby personally associated with their scores. Contestants had to be college graduates and, as a preliminary, had to complete a battery of standard cognitive and general knowledge tests designed to characterise their given capabilities. The competitors in general fell in the upper 30 percent of the general population for intelligence and knowledge. When some book reviews comment on how the superforecasters included housewives and unemployed factory workers I think they give the wrong impression. This was a smart, self-selecting, competitive group with an interest in global politics. As far as I can tell, backgrounds in mathematics, science and computing were typical. It is true that most were amateurs in global politics.

With such a sampling frame, of what population is it representative? The authors recognise that problem though don’t come up with an answer.

How measured?

Forecasters were assessed using Brier scores. I fear that Brier scores fail to be intuitive, are hard to understand and raise all sorts of conceptual difficulties. I don’t feel that they are sufficiently well explained, challenged or justified in the book. Suppose that a competitor predicts a probability p for the Italian default of 60%. Rewrite this as a probability in the range 0 to 1 for convenience, 0.6. If the competitor accepts finite additivity then the probability of “no default” is 1 − 0.6 = 0.4. Now suppose that outcomes f are coded as 1 when confirmed and 0 when disconfirmed. That means that if a default occurs then f(default) = 1 and f(no default) = 0. If there is no default then f(default) = 0 and f(no default) = 1. It’s not easy to get. We then take the difference between the ps and the fs, calculate the square of the differences and sum them. This is illustrated below for the case where there is no default, which yields a Brier score of 0.72.

Event        p     f    (p − f)²
Default      0.6   0    0.36
No default   0.4   1    0.36
Sum          1.0        0.72
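The arithmetic in the table can be sketched in a few lines of Python. The function name `brier_score` and the dictionary layout are my own illustrative choices, not anything taken from the book or the study.

```python
def brier_score(forecast, outcome):
    """Sum of squared differences between forecast probabilities and coded outcomes.

    forecast: dict mapping each event to its forecast probability p
    outcome:  the event that actually occurred (coded f = 1; all others f = 0)
    """
    return sum((p - (1.0 if event == outcome else 0.0)) ** 2
               for event, p in forecast.items())


# The forecast from the text: 60% for default, hence 40% for no default.
italy = {"default": 0.6, "no default": 0.4}

print(round(brier_score(italy, "no default"), 2))  # 0.72, as in the table
print(round(brier_score(italy, "default"), 2))     # 0.32 had Italy defaulted
```

The same 60% forecast scores 0.72 or 0.32 depending only on how events turn out, which is part of why a single score says little and only the average over many questions is of interest.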

Suppose we were dealing with a fair coin toss. Nobody would criticise a forecasting probability of 50% for heads and 50% for tails. The long run Brier score would be 0.5 (think about it). Brier scores were averaged for each competitor and used as the basis of ranking them. If a competitor updated a prediction then every fresh update was counted as an individual prediction and each prediction was scored. More on this later. An average of 0.5 would be similar to a chimp throwing darts at a target. That is about how well expert professional forecasters had performed in a previous study. The lower the score the better. Zero would be perfect foresight.
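For anyone who would rather not just think about it, here is a minimal sketch of the long-run calculation for a two-outcome question. The function `expected_brier` is a hypothetical helper of mine and assumes that the event has a fixed true probability, which is reasonable for a coin and rather less so for Italian debt.

```python
def expected_brier(p_forecast, p_true):
    """Long-run average two-outcome Brier score when an event with true
    probability p_true is always forecast at p_forecast."""
    # Score when the event happens: f(event) = 1, f(no event) = 0.
    score_if_event = (p_forecast - 1) ** 2 + ((1 - p_forecast) - 0) ** 2
    # Score when it does not happen: f(event) = 0, f(no event) = 1.
    score_if_not = (p_forecast - 0) ** 2 + ((1 - p_forecast) - 1) ** 2
    return p_true * score_if_event + (1 - p_true) * score_if_not


print(expected_brier(0.5, 0.5))  # 0.5: the honest 50% forecast on a fair coin
print(expected_brier(1.0, 0.5))  # 1.0: always calling heads with certainty
```

On this scoring, 0.5 is the best that anybody can average on a fair coin, and misplaced certainty is punished.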

I would have liked to have seen some alternative analyses and I think that a Hosmer-Lemeshow statistic or detailed calibration study would in some ways have been more intuitive and yielded more insight.

What were the results?

The results are not given in the book, only some anecdotes. Competitor Doug Lorch, a former IBM programmer it says, answered 104 questions in the first year and achieved a Brier score of 0.22. He was fifth on the drop list. The top 58 competitors, the superforecasters, had an average Brier score of 0.25 compared with 0.37 for the balance. In the second year, Lorch joined a team of other competitors identified as superforecasters and achieved an average Brier score of 0.14. He beat a prediction market of traders dealing in futures in the outcomes, the book says by 40% though it does not explain what that means.

I don’t think that I find any of that, in itself, persuasive. However, there is a limited amount of analysis here on the (old) Good Judgment Project website. Despite the reservations I have set out above there are some impressive results, in particular this chart.

The competitors’ Brier scores were measured over the first 25 questions. The 100 with the lowest scores were identified, the blue line. The chart then shows the performance of that same group of competitors over the subsequent 175 questions. Group membership is not updated. It is the same 100 competitors as performed best at the start who are plotted across the whole 200 questions. The red line shows the performance of the worst 100 competitors from the first 25 questions, again with the same cohort plotted for all 200 questions.

Unfortunately, it is not raw Brier scores that are plotted but standardised scores. The scores have been adjusted so that the mean is zero and standard deviation one. That actually adds nothing to the chart but obscures somewhat how it is interpreted. I think that violates all Shewhart’s rules of data presentation.
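To see why standardisation obscures the size of the difference, here is a minimal sketch. The scores are made-up numbers of mine, not study data, and I am assuming the population standard deviation (the sample version would make the same point).

```python
from statistics import mean, pstdev


def standardise(scores):
    """Rescale scores to mean zero and standard deviation one."""
    m = mean(scores)
    s = pstdev(scores)
    return [(x - m) / s for x in scores]


# Two hypothetical tournaments: the second has gaps twice as wide as the first.
narrow = [0.10, 0.20, 0.30, 0.40, 0.50]
wide = [0.20, 0.40, 0.60, 0.80, 1.00]

print([round(z, 3) for z in standardise(narrow)])
print([round(z, 3) for z in standardise(wide)])  # the same list as the line above
```

Once standardised, the two sets are indistinguishable. The ranking survives but the raw size of the gap between best and worst does not.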

That said, over the first 25 questions the blue cohort outperform the red. Then that same superiority of performance is maintained over the subsequent 175 questions. We don’t know how much is the difference in performance because of the standardisation. However, the superiority of performance is obvious. If that is a mere artefact of the data then I am unable to see how. Despite the way that data is presented and my difficulties with Brier scores, I cannot think of any interpretation other than there being a cohort of superforecasters who were, in general, better at prediction than the rest.

Conclusions

Tetlock comes up with some tentative explanations as to the consistent performance of the best. In particular he notes that the superforecasters updated their predictions more frequently than the remainder. Each of those updates was counted as a fresh prediction. I wonder how much of the variation in Brier scores is accounted for by variation in the time of making the forecast? If superforecasters are simply more active than the rest, making lots of forecasts once the outcome is obvious then the result is not very surprising.

That may well not be the case as the book contends that superforecasters predicting 300 days in the future did better than the balance of competitors predicting 100 days. However, I do feel that the variation arising from the time a prediction was made needs to be taken out of the data so that the variation in, shall we call it, foresight can be visualised. The book is short on actual analysis and I would like to have seen more. Even in a popular management book.

The data on the website on purported improvements from training is less persuasive, a bit of an #executivetimeseries.

Some of the recommendations for being a superforecaster are familiar ideas from behavioural psychology. Be a fox not a hedgehog, don’t neglect base rates, be astute to the distinction between signal and noise, read widely and richly, etc.

Takeaways

There was one unanticipated and intriguing result. The superforecasters updated their predictions not only frequently but by fine degrees, perhaps from 61% to 62%. I think that some further analysis is required to show that that is not simply an artefact of the measurement. Because Brier scores have a squared term they would be expected to punish the variation in large adjustments.

However, taking the conclusion at face value, it has some important consequences for risk assessment which often proceeds by broadly granular ranking on a rating scale of 1 to 5, say. The study suggests that the best predictions will be those where careful attention is paid to fine gradations in probability.

Of course, continual updating of predictions is essential to even the most primitive risk management though honoured more often in the breach than the observance. I shall come back to the significance of this for risk management in a future post.

There is also an interesting discussion about making predictions in teams but I shall have to come back to that another time.

The amateurs out-performed the professionals on global politics. I wonder if the same result would have been encountered against experts in structural engineering.

And TalkTalk? They forgot, pace Stanley Baldwin, that the hacker will always get through.

Professor Tetlock invites you to join the programme at http://www.goodjudgment.com.


Data science sold down the Amazon? Jeff Bezos and the culture of rigour

This blog appeared on the Royal Statistical Society website Statslife on 25 August 2015

This recent item in the New York Times has catalysed discussion among managers. The article tells of Amazon’s founder, Jeff Bezos, and his pursuit of rigorous data driven management. It also tells employees’ own negative stories of how that felt emotionally.

The New York Times says that Amazon is pervaded with abundant data streams that are used to judge individual human performance and which drive reward and advancement. They inform termination decisions too.

The recollections of former employees are not the best source of evidence about how a company conducts its business. Amazon’s share of the retail market is impressive and they must be doing something right. What everybody else wants to know is, what is it? Amazon are very coy about how they operate and there is a danger that the business world at large takes the wrong messages.

Targets

Targets are essential to business. The marketing director predicts that his new advertising campaign will create demand for 12,000 units next year. The operations director looks at her historical production data. She concludes that the process lacks the capability reliably to produce those volumes. She estimates the budget required to upgrade the process and to achieve 12,000 units annually. The executive board considers the business case and signs off the investment. Both marketing and operations directors now have a target.

Targets communicate improvement priorities. They build confidence between interfacing processes. They provide constraints and parameters that prevent the system causing harm. Harm to others or harm to itself. They allow the pace and substance of multiple business processes, and diverse entities, to be matched and aligned.

But everyone who has worked in business sees it as less simple than that. The marketing and operations directors are people.

Signal and noise

Drawing conclusions from data might be an uncontroversial matter were it not for the most common feature of data, fluctuation. Call it variation if you prefer. Business measures do not stand still. Every month, week, day and hour is different. All data features noise. Sometimes it goes up, sometimes down. A whole ecology of occult causes, weakly characterised, unknown and as yet unsuspected, interact to cause irregular variation. They are what cause a coin variously to fall “heads” or “tails”. That variation may often be stable enough, or if you like “exchangeable”, so as to allow statistical predictions to be made, as in the case of the coin toss.

If all data features noise then some data features signals. A signal is a sign, an indicator that some palpable cause has made the data stand out from the background noise. It is that assignable cause which enables inferences to be drawn about what interventions in the business process have had a tangible effect and what future innovations might cement any gains or lead to bigger prospective wins. Signal and noise lead to wholly different business strategies.

The relevance for business is that people, where not exposed to rigorous decision support, are really bad at telling the difference between signal and noise. Nobel laureate economist and psychologist Daniel Kahneman has amassed a lifetime of experimental and anecdotal data capturing noise misinterpreted as signal and judgments in the face of compelling data, distorted by emotional and contextual distractions.

Signal and accountability

It is a familiar trope of business, and government, that extravagant promises are made, impressive business cases set out and targets signed off. Yet the ultimate scrutiny as to whether that envisaged performance was realised often lacks rigour. Noise, with its irregular ups and downs, allows those seeking solace from failure to pick out select data points and cast self-serving narratives on the evidence.

Our hypothetical marketing director may fail to achieve his target but recount how there were two individual months where sales exceeded 1,000, construct elaborate rationales as to why only they are representative of his efforts and point to purported external factors that frustrated the remaining ten reports. Pairs of individual data points can always be selected to support any story, Don Wheeler’s classic executive time series.

This is where the ability to distinguish signal and noise is critical. To establish whether targets have been achieved requires crisp definition of business measures, not only outcomes but also the leading indicators that provide context and advise judgment as to prediction reliability. Distinguishing signal and noise requires transparent reporting that allows diverse streams of data criticism. It requires a rigorous approach to characterising noise and a systematic approach not only to identifying signals but to reacting to them in an agile and sustainable manner.

Data is essential to celebrating a target successfully achieved and to responding constructively to a failure. But where noise is gifted the status of signal to confirm a fanciful business case, or to protect a heavily invested reputation, then the business is misled, costs increased, profits foregone and investors cheated.

Where employees believe that success and reward is being fudged, whether because of wishful thinking or lack of data skills, or mistakenly through lack of transparency, then cynicism and demotivation will breed virulently. Employees watch the behaviours of their seniors carefully as models of what will lead to their own advancement. Where it is deceit or innumeracy that succeed, that is what will thrive.

Noise and blame

Here is some data of the number of defects caused by production workers last month.

Worker Defects
Al 10
Simone 6
Jose 10
Gabriela 16
Stan 10

What is to be done about Gabriela? Move to an easier job? Perhaps retraining? Or should she be let go? And Simone? Promote to supervisor?

Well, the numbers were just random numbers that I generated. I didn’t add anything in to make Gabriela’s score higher and there was nothing in the way that I generated the data to suggest who would come top or bottom. The data are simply noise. They are the sort of thing that you might observe in a manufacturing plant that presented a “stable system of trouble”. Nothing in the data signals any behaviour, attitude, skill or diligence that Gabriela lacked or wrongly exercised. The next month’s data would likely show a different candidate for dismissal.
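Anybody can repeat the exercise. What follows is a minimal sketch, not the method I actually used for the table: it assumes, purely for illustration, that every worker’s monthly defect count is an independent Poisson draw with the same mean of 10. The helper `poisson` and the seed are my own choices.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) variate by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


workers = ["Al", "Simone", "Jose", "Gabriela", "Stan"]
rng = random.Random(2015)  # fixed seed so that the run is repeatable

# Six months of pure noise: nobody is any better or worse than anybody else.
for month in range(1, 7):
    defects = {w: poisson(rng, 10) for w in workers}
    worst = max(defects, key=defects.get)
    print(month, defects, "worst:", worst)
```

Every month somebody has to come bottom. Run it over enough months and the identity of the “worst” worker changes, though nothing about the workers has.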

Mistaking noise for signal is, like mistaking signal for noise, the path to business underperformance and employee disillusionment. It has a particularly corrosive effect where used, as it might be in Gabriela’s case, to justify termination. The remaining staff will be bemused as to what Gabriela was actually doing wrong and start to attach myriad and irrational doubts to all sorts of things in the business. There may be a resort to magical thinking. The survivors will be less open and less willing to share problems with their supervisors. The business itself has the costs of recruitment to replace Gabriela. The saddest aspect of the whole business is the likelihood that Gabriela’s replacement will perform better than did Gabriela, vindicating the dismissal in the mind of her supervisor. This is the familiar statistical artefact of regression to the mean. An extreme event is likely to be followed by one less extreme. Again, Kahneman has collected sundry examples of managers so deceived by singular human performance and disappointed by its modest follow-up.

It was W Edwards Deming who observed that every time you recruit a new employee you take a random sample from the pool of job seekers. That’s why you get the regression to the mean. It must be true at Amazon too as their human resources executive Mr Tony Galbato explains their termination statistics by admitting that “We don’t always get it right.” Of course, everybody thinks that their recruitment procedures are better than average. That’s a management claim that could well do with rigorous testing by data.
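Regression to the mean in the Gabriela story can be sketched with a small simulation. The assumptions are mine and deliberately crude: five workers and one replacement, all scored by independent draws from the same normal distribution, so that there is no difference in skill whatsoever. The function name `share_improved` is hypothetical.

```python
import random


def share_improved(trials, seed):
    """Fraction of trials in which a replacement, drawn from the same pool,
    has a lower defect score than the worst of five incumbent workers."""
    rng = random.Random(seed)
    improved = 0
    for _ in range(trials):
        month1 = [rng.gauss(10, 3) for _ in range(5)]  # five workers, pure noise
        worst = max(month1)                            # this month's "Gabriela"
        replacement = rng.gauss(10, 3)                 # fresh draw from the same pool
        if replacement < worst:
            improved += 1
    return improved / trials


print(share_improved(10_000, seed=42))
```

With six exchangeable draws, the newcomer beats the worst of the other five with probability 5/6, and the simulated share should come out close to that. The supervisor will feel vindicated about five times in six without having learned anything at all.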

Further, mistaking noise for signal brings the additional business expense of over adjustment, spending money to add costly variation while degrading customer satisfaction. Nobody in the business feels good about that.

Target quality, data quality

I admitted above that the evidence we have about Amazon’s operations is not of the highest quality. I’m not in a position to judge what goes on at Amazon. But all should fix in their minds that setting targets demands rigorous risk assessment, analysis of perverse incentives and intense customer focus.

It is a sad reality that, if you set incentives perversely enough, some individuals will find ways of misreporting data. BNFL’s embarrassment with Kansai Electric and Steven Eaton’s criminal conviction were not isolated incidents.

One thing that especially bothered me about the Amazon report was the soi-disant Anytime Feedback Tool that allowed unsolicited anonymous peer appraisal. Apparently, this formed part of the “data” that determined individual advancement or termination. The description was unchallenged by Amazon’s spokesman (sic) Mr Craig Berman. I’m afraid, and I say this as a practising lawyer, unsourced and unchallenged “evidence” carries the spoor of the Star Chamber and the party purge. I would have thought that a pretty reliable method for generating unreliable data would be to maximise the personal incentives for distortion while protecting it from scrutiny or governance.

Kahneman observed that:

… we pay more attention to the content of messages than to information about their reliability, and as a result end up with a view of the world around us that is simpler and more coherent than the data justify.

It is the perverse confluence of fluctuations and individual psychology that makes statistical science essential, data analytics interesting and business, law and government difficult.

Productivity and how to improve it: I - The foundational narrative

Again, much talk in the UK media recently about weak productivity statistics. Chancellor of the Exchequer (Finance Minister) George Osborne has launched a 15 point macroeconomic strategy aimed at improving national productivity. Some of the points are aimed at incentivising investment and training. There will be few who argue against that though I shall come back to the investment issue when I come to talk about signal and noise. I have already discussed training here. In any event, the strategy is fine as far as these things go. Which is not very far.

There remains the microeconomic task for all of us of actually improving our own productivity and that of the systems we manage. That is not the job of government.

Neither can I offer any generalised system for improving productivity. It will always be industry and organisation dependent. However, I wanted to write about some of the things that you have to understand if your efforts to improve output are going to be successful and sustainable.

  • Customer value and waste.
  • The difference between signal and noise.
  • How to recognise flow and manage a constraint.

Before going on to those in future weeks I first wanted to go back and look at what has become the foundational narrative of productivity improvement, the Hawthorne experiments. They still offer some surprising insights.

The Hawthorne experiments

In 1923, the US electrical engineering industry was looking to increase the adoption of electric lighting in American factories. Uptake had been disappointing despite the claims being made for increased productivity.

[Tests in nine companies have shown that] raising the average initial illumination from about 2.3 to 11.2 foot-candles resulted in an increase in production of more than 15%, at an additional cost of only 1.9% of the payroll.

Earl A Anderson
General Electric
Electrical World (1923)

E P Hyde, director of research at GE’s National Lamp Works, lobbied government for the establishment of a Committee on Industrial Lighting (“the CIL”) to co-ordinate marketing-oriented research. Western Electric volunteered to host tests at their Hawthorne Works in Cicero, IL.

Western Electric came up with a study design that comprised a team of experienced workers assembling relays, winding their coils and inspecting them. Tests commenced in November 1924 with active support from an elite group of academic and industrial engineers including the young Vannevar Bush, who would himself go on to an eminent career in government and science policy. Thomas Edison became honorary chairman of the CIL.

It’s a tantalising historical fact that Walter Shewhart was employed at the Hawthorne Works at the time but I have never seen anything suggesting his involvement in the experiments, nor that of his mentor George G Edwards, nor protégé Joseph Juran. In later life, Juran was dismissive of the personal impact that Shewhart had had on operations there.

However, initial results showed no influence of light level on productivity at all. Productivity rose throughout the test but was wholly uncorrelated with lighting level. Theories about the impact of human factors such as supervision and motivation started to proliferate.

A further schedule of tests was programmed starting in September 1926. Now, the lighting level was to be reduced to near darkness so that the threshold of effective work could be identified. Here is the summary data (from Richard Gillespie Manufacturing Knowledge: A History of the Hawthorne Experiments, Cambridge, 1991).

Hawthorne data-1

It requires no sophisticated statistical analysis to see that the data is all noise and no signal. Much to the disappointment of the CIL, and the industry, there was no evidence that illumination made any difference at all, even down to conditions of near darkness. It’s striking that the highest lighting levels embraced the full range of variation in productivity from the lowest to the highest. What had seemed so self evidently a boon to productivity was purely incidental. It is never safe to assume that a change will be an improvement. As W Edwards Deming insisted, “In God we trust. All others bring data.”

But the data still seemed to show a relentless improvement of productivity over time. The participants were all very experienced in the task at the start of the study so there should have been no learning by doing. There seemed no other explanation than that the participants were somehow subliminally motivated by the experimental setting. Or something.

Hawthorne data-2

That subliminally motivated increase in productivity came to be known as the Hawthorne effect. Attempts to explain it led to the development of whole fields of investigation and organisational theory, by Elton Mayo and others. It really was the foundation of the management consulting industry. Gillespie (supra) gives a rich and intriguing account.

A revisionist narrative

Because of the “failure” of the experiments’ purpose there was a falling off of interest and only the above summary results were ever published. The raw data were believed destroyed. Now “you know, at least you ought to know, for I have often told you so” about Shewhart’s two rules for data presentation.

  1. Data should always be presented in such a way as to preserve the evidence in the data for all the predictions that might be made from the data.
  2. Whenever an average, range or histogram is used to summarise observations, the summary must not mislead the user into taking any action that the user would not take if the data were presented in context.

The lack of any systematic investigation of the raw data led to the development of a discipline myth that every single experimental adjustment had led forthwith to an increase in productivity.

In 2009, Steven Levitt, best known to the public as the author of Freakonomics, along with John List and their research team, miraculously discovered a microfiche of the raw study data at a “small library in Milwaukee, WI” and the remainder in Boston, MA. They went on to analyse the data from scratch (Was there Really a Hawthorne Effect at the Hawthorne Plant? An Analysis of the Original Illumination Experiments, National Bureau of Economic Research, Working Paper 15016, 2009).

LevittHawthonePlot

Figure 3 of Levitt and List’s paper (reproduced above) shows the raw productivity measurements for each of the experiments. Levitt and List show how a simple plot such as this reveals important insights into how the experiments developed. It is a plot that yields a lot of information.

Levitt and List note that, in the first phase of experiments, productivity rose then fell when experiments were suspended. They speculate as to whether there was a seasonal effect with lower summer productivity.

The second period of experiments is that between the third and fourth vertical lines in the figure. Only room 1 experienced experimental variation in this period yet Levitt and List contend that productivity increased in all three rooms, falling again at the end of experimentation.

During the final period, data was only collected from room 1 where productivity continued to rise, even beyond the end of the experiment. Looking at the data overall, Levitt and List find some evidence that productivity responded more to changes in artificial light than to natural light. The evidence that increases in productivity were associated with every single experimental adjustment is weak. To this day, there is no compelling explanation of the increases in productivity.

Lessons in productivity improvement

Deming used to talk of “disappointment in great ideas”, the propensity for things that looked so good on paper simply to fail to deliver the anticipated benefits. Nobel laureate psychologist Daniel Kahneman warns against our individual bounded rationality.

To guard against entrapment by the vanity of imagination we need measurement and data to answer the ineluctable question of whether the change we implemented so passionately resulted in improvement. To be able to answer that question demands the separation of signal from noise. That requires trenchant data criticism.

And even then, some factors may yet be beyond our current knowledge. Bounded rationality again. That is why the trick of continual improvement in productivity is to use the rigorous criticism of historical data to build collective knowledge incrementally.

If you torture the data enough, nature will always confess.

Ronald Coase

Eventually.

Royal babies and the wisdom of crowds

In 2004 James Surowiecki published a book with the unequivocal title The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. It was intended as a gloss on Charles Mackay’s 1841 book Extraordinary Popular Delusions and the Madness of Crowds. Both books are essential reading for any risk professional.

I am something of a believer in the wisdom of crowds. The other week I was fretting about the possible relegation of English Premier League soccer club West Bromwich Albion. It’s an emotional and atavistic tie for me. I always feel there is merit, as part of my overall assessment of risk, in checking online bookmakers’ odds. They surely represent the aggregated risk assessment of gamblers if nobody else. I was relieved that bookmakers were offering typically 100/1 against West Brom being relegated. My own assessment of risk is, of course, contaminated with personal anxiety so I was pleased that the crowd was more phlegmatic.

However, while I was on the online bookmaker’s website, I couldn’t help but notice that they were also accepting bets on the imminent birth of the royal baby, the next child of the Duke and Duchess of Cambridge. It struck me as weird that anyone would bet on the sex of the royal baby. Surely this was a mere coin toss, though I know that people will bet on that. Being hopelessly inquisitive I had a look. I was somewhat astonished to find these odds being offered (this was 22 April 2015, ten days before the royal birth).

        Odds   Implied probability
Girl    1/2    0.67
Boy     6/4    0.40
Total          1.07

Here I have used the usual formula for converting between odds and implied probabilities: odds of m / n against an event imply a probability of n / (m + n) of the event occurring. Of course, the principle of finite additivity requires that probabilities add up to one. Here they don’t and there is an overround of 7%. Like the rest of us, bookmakers have to make a living and I was unsurprised to find a Dutch book.
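The conversion and the overround can be checked with a short sketch. The function name `implied_probability` is my own and I use exact fractions only to keep rounding out of the picture.

```python
from fractions import Fraction


def implied_probability(m, n):
    """Odds of m/n against an event imply a probability of n / (m + n)."""
    return Fraction(n, m + n)


girl = implied_probability(1, 2)  # 1/2 against -> 2/3
boy = implied_probability(6, 4)   # 6/4 against -> 2/5
overround = girl + boy - 1        # the excess over one, here 1/15

print(girl, boy, overround)
print(round(float(overround) * 100, 1), "% overround")
```

The excess of 1/15 is a little under 7%, which matches the total of 1.07 in the table.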

The odds certainly suggested that the crowd thought a girl manifestly more probable than a boy. Bookmakers shorten the odds on the outcome that is attracting the money to avoid a heavy payout on an event that the crowd seems to know something about.

Historical data on sex ratio

I started, at this stage, to doubt my assumption that boy/ girl represented no more than a coin toss, 50:50, an evens bet. As with most things, sex ratio turns out to be an interesting subject. I found this interesting research paper which showed that sex ratio was definitely dependent on factors such as the age and ethnicity of the mother. The narrative of this chart was very interesting.

Sex ratio

However, the paper confirmed that the sex of a baby is independent of previous births, conditioned on the factors identified, and that the ratio of girls to boys is nowhere and at no time greater than 1,100 to 1,000, about 52% girls.

So why the odds?

Bookmakers lengthen the odds on the outcome attracting the smaller value of bets in order to encourage stakes on the less fancied outcomes, on which there is presumably less risk of having to pay out. At odds of 6/4, a punter betting £10 on a boy would receive his stake back plus £15 ( = 6 × £10 / 4 ). If we assume an equal chance of boy or girl then that is an expected return of £12.50 ( = 0.5 × £25 ) for a £10.00 stake. I’m not sure I’d seen such a good value wager since we all used to bet against Tim Henman winning Wimbledon.
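The value of the wager can be put as a one-line calculation. This is a minimal sketch with a hypothetical function name, `expected_return`, and it assumes nothing more than a single bet settled at the quoted odds.

```python
def expected_return(stake, m, n, p_win):
    """Expected amount returned on a stake at odds of m/n, given a
    probability p_win that the bet wins. A losing bet returns nothing."""
    payout_if_win = stake + stake * m / n  # stake back plus winnings
    return p_win * payout_if_win


print(expected_return(10, 6, 4, 0.5))  # 12.5: the £12.50 in the text
```

The bet only breaks even, an expected £10 back for £10 staked, if the chance of a boy is as low as the implied probability of 0.40. Anything above that is value to the punter.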

Ex ante there are two superficially suggestive explanations as to the asymmetry in the odds. At least this is all my bounded rationality could imagine.

  • A lot of people (mistakenly) thought that the run of five male royal births (Princes Andrew, Edward, William, Harry and George) escalated the probability of a girl being next. “It was overdue.”
  • A lot of people believed that somebody “knew something” and that they knew what it was.
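The first explanation is the gambler's fallacy. As a sketch, and assuming for illustration that births are independent and that boy and girl are equally likely, enumerating every possible sequence of six births shows that a run of five boys leaves the chance of a girl next unchanged.

```python
from fractions import Fraction
from itertools import product

# Assumption: each of the 2**6 boy/girl sequences is equally likely,
# i.e. births are independent and 50:50.
sequences = list(product("BG", repeat=6))

# Keep only the sequences that begin with five boys.
after_five_boys = [s for s in sequences if s[:5] == ("B",) * 5]

# Among those, count how often the sixth birth is a girl.
girls_next = sum(1 for s in after_five_boys if s[5] == "G")
p_girl_next = Fraction(girls_next, len(after_five_boys))

print(f"P(girl next | five boys so far) = {p_girl_next}")
```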

In his book about cognitive biases in decision making (Thinking, Fast and Slow, Allen Lane, 2011) Nobel laureate economist Daniel Kahneman describes widespread misconceptions concerning randomness of boy/ girl birth outcomes (at p115). People tend to see regularity in sequences of data as evidence of non-randomness, even where patterns are typical of, and unsurprising in, random events.

I had thought that there could not be enough gamblers fooled by the baseless belief that a long run of boys made the next birth more likely to be a girl. But then Danny Finkelstein reminded me (The (London) Times, Saturday 25 April 2015) of a survey of UK politicians that revealed their limited ability to deal with chance and probabilities. Are politicians more or less competent with probabilities than online gamblers? That is a question for another day. I could add that the survey compared politicians of various parties but we have an on-going election campaign in the UK at the moment so I would, in the interest of balance, invite my voting-age UK readers not to draw any inferences therefrom.

The alternative is the possibility that somebody thought that somebody knew something. The parents avowed that they didn’t know. Medical staff may or may not have. The sort of people who work in VIP medicine in the UK are not the sort of people who divulge information. But one can imagine that a random shift in sentiment, perhaps because of the misconception that a girl was “overdue”, and a consequent drift in the odds, could lead others to infer that there was insight out there. It is not completely impossible. How many other situations in life and business does that model?

It’s a girl!

The wisdom of crowds or pure luck? We shall never know. I think it was Thomas Mann who observed that the best proof of the genuineness of a prophecy was that it turned out to be false. Had the royal baby been a boy we could have been sure that the crowd was mad.

To be complete, Bayes’ theorem tells us that the outcome should enhance our degree of belief in the crowd’s wisdom. But it is a modest increase (a Bayes factor of 2, or about 3 decibans, following Alan Turing’s suggestion) and as we were most sceptical before we remain unpersuaded.
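One way to arrive at those figures, sketched in Python. The assumptions are mine and purely illustrative: a crowd that really knew would have made a girl certain, an ignorant crowd leaves it at 50:50, and the sceptical prior of 0.1 is an arbitrary example value.

```python
import math


def decibans(bayes_factor: float) -> float:
    """Weight of evidence in decibans: 10 * log10 of the Bayes factor."""
    return 10 * math.log10(bayes_factor)


def update(prior: float, bayes_factor: float) -> float:
    """Posterior probability from a prior probability and a Bayes factor."""
    posterior_odds = prior / (1 - prior) * bayes_factor
    return posterior_odds / (1 + posterior_odds)


# Assumption: P(girl | crowd knew) = 1.0 and P(girl | crowd guessing) = 0.5.
bayes_factor = 1.0 / 0.5
weight = decibans(bayes_factor)

# Hypothetical sceptical prior that the crowd knew something.
prior = 0.1
posterior = update(prior, bayes_factor)

print(f"Bayes factor {bayes_factor:.0f} = {weight:.1f} decibans")
print(f"Belief in the crowd's wisdom: {prior:.2f} -> {posterior:.2f}")
```

On these assumptions the sceptic's belief rises, but not by enough to change his mind.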

In his book, Surowiecki identified five factors that can impair crowd intelligence. One of these is homogeneity. Insufficient diversity frustrates the inherent virtue on which the principle is founded. I wonder how much variety there is among online punters? Similarly, where judgments are made sequentially there is a danger of influence. That was surely a factor at work here. There must also have been an element of emotion, the factor that led to all those unrealistically short odds on Henman at Wimbledon on which the wise dined so well.

But I’m trusting that none of that applies to the West Brom odds.

M5 “fireworks crash” – risk identification and reputation management

UK readers will recall this tragic accident in November 2011, when 51 people were injured and seven killed in a collision on a fog-bound motorway.

What marked out the accident from a typical collision in fog was the suggestion that the environmental conditions had been exacerbated by smoke that had drifted onto the motorway from a fireworks display at nearby Taunton Rugby Club.

This suggestion excited a lot of press comment. Geoffrey Counsell, the fireworks professional who had been contracted to organise the event, was subsequently charged with manslaughter. The prosecutor’s allegation was that he had fallen so far below the standard of care he purportedly owed to the motorway traffic that a reasonable person would think a criminal sanction appropriate.

It is very difficult to pick out from the press exactly how this whole prosecution unravelled. Firstly the prosecutors resiled from the manslaughter charge, a most serious matter that in the UK can attract a life sentence. They substituted a charge under section 3(2) of the Health and Safety at Work etc. Act 1974 that Mr Counsell had failed “to conduct his undertaking in such a way as to ensure, so far as is reasonably practicable, that … other persons (not being his employees) who may be affected thereby are not thereby exposed to risks to their health or safety.”

There has been much commentary from judges and others on the meaning of “reasonably practicable” but suffice to say, for the purposes of this blog, that a self-employed person is required to make substantial effort in protecting the public. That said, the section 3 offence carries a maximum sentence of no more than two years’ imprisonment.

The trial on the section 3(2) indictment opened on 18 November 2013. “Serious weaknesses” in the planning of the event were alleged. There were vague press reports about Mr Counsell’s risk assessment but insufficient for me to form any exact view. It does seem that he had not considered smoke drifting onto the motorway and interacting with fog to create an especial hazard to drivers.

A more worrying feature of the prosecution was the press suggestion that an expert meteorologist had based his opinion on a biased selection of witness statements that he had been provided with and which described which way the smoke from the fireworks display had been drifting. I only have the journalistic account of the trial but it looks far from certain that the smoke did in fact drift towards the motorway.

In any event, on 10 December 2013, following the close of the prosecution evidence, the judge directed the jury to acquit Mr Counsell. The prosecutors had brought forward insufficient evidence against Mr Counsell for a jury reasonably to return a conviction, even without any evidence in his defence.

An individual, no matter how expert, is at a serious disadvantage in identifying novel risks. An individual’s bounded rationality will always limit the futures he can conjure and the weight that he gives to them. To be fair to Mr Counsell, he says that he did seek input from the Highways Agency, Taunton Deane Borough Council and Avon and Somerset Police but he says that they did not respond. If that is the case, I am sure that those public bodies will now reflect on how they could have assisted Mr Counsell’s risk assessment the better to protect the motorists and, in fact, Mr Counsell. The judge’s finding, that this was an accident that Mr Counsell could not reasonably have foreseen, feels like a just decision.

Against that, hypothetically, had the fireworks been set by a household-name corporation, it would rightly have felt ashamed at not having anticipated the risk and taken any necessary steps to protect the motorway drivers. There would have been reputational damage. A sufficient risk assessment would have provided the basis for investigating whether the smoke was in fact a cause of the accident and, where appropriate, advancing a robust and persuasive rebuttal of blame.

That is the power of risk assessment. Not only is it a critical foundational element of organisational management, it provides a powerful tool in managing reputation and litigation risk. Unfortunately, unless there is a critical mass of expertise dedicated to risk identification it is more likely that it will provide a predatory regulator with evidence of slipshod practice. Its absence is, of course, damning.

As a matter of good business and efficient leadership, the Highways Agency, Taunton Deane Borough Council, and Avon and Somerset Police ought to have taken Mr Counsell’s risk assessment seriously if they were aware of it. They would surely have known that they were in a better position than Mr Counsell to assess risks to motorists. Fireworks displays are tightly regulated in the UK yet all such regulation has failed to protect the public in this case. Again, I think that the regulators might look to their own role.

Organisations must be aware of external risks. Where they are not engaged with the external assessment of such risks they are really in an oppositional situation that must be managed accordingly. Where they are engaged the external assessments must become integrated into their own risk strategy.

It feels as though Mr Counsell has been unjustly singled out in this tragic matter. There was a rush to blame somebody and I suspect that an availability heuristic was at work. Mr Counsell attracted attention because the alleged causation of the accident seemed so exotic and unusual: the very grounds on which the court held him blameless.

Do I have to be a scientist to assess food safety?

I saw this BBC item on the web before Christmas: Why are we more scared of raw egg than reheated rice? Just after Christmas seemed like a good time to blog about food safety. Actually, the link I followed asked Are some foods more dangerous than others? A question that has a really easy answer.

However, understanding the characteristic risks of various foods and how most safely to prepare them is less simple. Risk theorist John Adams draws a distinction between readily identified inherent and obvious risks, and risks that can only be perceived with the help of science. Food risks fall into the latter category. As far as I can see, “folk wisdom” is no reliable guide here, even partially. The BBC article refers to risks from rice, pasta and salad vegetables which are not obvious. At the same time, in the UK at least, the risk from raw eggs is very small.

Ironically, raw eggs are one food that springs readily to British people’s minds when food risk is raised, largely due to the folk memory of a high profile but ill thought out declaration by a government minister in the 1980s. This is an example of what Amos Tversky and Daniel Kahneman called an availability heuristic: If you can think of it, it must be important.

Food safety is an environment where an individual is best advised to follow the advice of scientists. We commonly receive this filtered, even if only for accessibility, through government agencies. That takes us back to the issue of trust in bureaucracy on which I have blogged before.

I wonder whether governments are in the best position to provide such advice. It is food suppliers who suffer from the public’s misallocated fears. The egg fiasco of the 1980s had a catastrophic effect on UK egg sales. All food suppliers have an interest in a market characterised by a perception that the products are safe. The food industry is also likely to be in the best position to know what is best practice, to improve such practice, to know how to communicate it to their customers, to tailor it to their products and to provide the effective behavioural “nudges” that promote safe handling. Consumers are likely to be cynical about governments, “one size fits all” advice and cycles of academic meta-analysis.

I think there are also lessons here for organisations. Some risks are assessed on the basis of scientific analysis. It is important that the prestige of that origin is communicated to all staff who will be involved in working with risk. The danger for any organisation is that an individual employee might make a reassessment based on local data and their own self-serving emotional response. As I have blogged before, some individuals have particular difficulty in aligning themselves with the wider organisation.

Of course, individuals must also be equipped with the means of detecting when the assumptions behind the science have been violated and initiating an agile escalation so that employee, customer and organisation can be protected while a reassessment is conducted. Social media provide new ways of sharing experience. I note from the BBC article that, in the UK at least, there is no real data on the origins of food poisoning outbreaks.

So the short answer to the question at the head of this blog still turns out to be “yes”. There are some things where we simply have to rely on science if we want to look after ourselves, our families and our employees.

But even scientists are limited by their own bounded rationality. Science is a work in progress. Using that science itself as a background against which to look for novel phenomena and neglected residual effects leverages that original risk analysis into a key tool in managing, improving and growing a business.