Why did the polls get it wrong?

This week has seen much soul-searching by the UK polling industry over their performance leading up to the 2015 UK general election on 7 May. The polls had seemed to predict that the Conservative and Labour Parties were neck and neck on the popular vote. In the actual election, the Conservatives polled 37.8% to Labour’s 31.2%, which translated into a working majority in the House of Commons once the votes were divided among the seats contested. I can assure my readers that it was a shock result. Over breakfast on 7 May I told my wife that the probability of a Conservative majority in the House was nil. I hold my hands up.

An enquiry was set up by the industry led by the National Centre for Research Methods (NCRM). They presented their preliminary findings on 19 January 2016. The principal conclusion was that the failure to predict the voting share was because of biases in the way that the data were sampled and inadequate methods for correcting for those biases. I’m not so sure.

Population → Frame → Sample

The first thing students learn when studying statistics is the critical importance, and practical means, of specifying a sampling frame. If the sampling frame is not representative of the population of concern then simply collecting more and more data will not yield a prediction of greater accuracy. The errors associated with the specification of the frame are inherent to the sampling method. Creating a representative frame is very hard in opinion polling because of the difficulty in contacting particular individuals efficiently. It turns out that Conservative voters are harder than Labour voters to get hold of, so that they can be questioned. The NCRM study concluded that, within the commercial constraints of an opinion poll, there was a lower probability that a Conservative voter would be contacted. They therefore tended to be under-represented in the data causing a substantial bias towards Labour.
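To see why more data does not cure a frame problem, here is a minimal simulation sketch. The population shares and contact probabilities are invented for illustration; the point is only that the bias survives any sample size.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented "true" population shares and contact probabilities.
true_share = {"Con": 0.38, "Lab": 0.31, "Other": 0.31}
contact_p  = {"Con": 0.50, "Lab": 0.70, "Other": 0.60}   # Con voters harder to reach

# A poll run against this frame effectively samples in proportion to
# share * contactability, not to share alone.
parties = list(true_share)
weights = np.array([true_share[p] * contact_p[p] for p in parties])
weights /= weights.sum()

for n in (1_000, 10_000, 100_000):
    sample = rng.choice(parties, size=n, p=weights)
    con = (sample == "Con").mean()
    lab = (sample == "Lab").mean()
    print(f"n = {n:>6}:  Con {con:.3f}   Lab {lab:.3f}")

# However large n becomes, Con settles near 0.32 and Lab near 0.37:
# more data shrinks the noise but leaves the frame bias untouched.
```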

This is a well known problem in polling practice and there are demographic factors that can be used to make a statistical adjustment. Samples can be stratified. NCRM concluded that, in the run up to the 2015 election, there were important biases tending to under state the Conservative vote and the existing correction factors were inadequate. Fresh sampling strategies were needed to eradicate the bias and improve prediction. There are understandable fears that this will make polling more costly. More calls will be needed to catch Conservatives at home.
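By way of illustration, here is a minimal sketch of post-stratification weighting on a single demographic variable. The cell counts and population targets are invented, and real pollsters weight on several variables at once; the 2015 difficulty was precisely that the available weighting variables did not capture contactability.

```python
import pandas as pd

# Invented raw poll, cross-classified by age band, with invented population targets.
poll = pd.DataFrame({
    "age":  ["18-34"] * 300 + ["35-59"] * 450 + ["60+"] * 250,
    "vote": (["Lab"] * 180 + ["Con"] * 120) +
            (["Lab"] * 220 + ["Con"] * 230) +
            (["Lab"] * 90  + ["Con"] * 160),
})
population_share = {"18-34": 0.28, "35-59": 0.42, "60+": 0.30}

# Weight each respondent so that the weighted age mix matches the population.
sample_share = poll["age"].value_counts(normalize=True)
poll["weight"] = poll["age"].map(lambda a: population_share[a] / sample_share[a])

raw = poll.groupby("vote").size() / len(poll)
weighted = poll.groupby("vote")["weight"].sum() / poll["weight"].sum()
print(pd.DataFrame({"raw": raw, "weighted": weighted}))
```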

Of course, that all sounds an eminently believable narrative. These sorts of sampling frame biases are familiar but enormously troublesome for pollsters. However, I wanted to look at the data myself.

Plot data in time order

That is the starting point of all statistical analysis. Polls continued after the election, though less frequently, and I wanted to look at that post-election data alongside the pre-election data. Here is a plot of poll results against time for Conservative and Labour. I have used data from 25 January to the end of 2015.1, 2 I have not managed to jitter the points so there is some overprinting of Conservative by Labour pre-election.
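For anyone who wants to reproduce the plot, and to add the jitter I did not manage, something along these lines should work. I am assuming the Wikipedia tables have been scraped into a CSV with columns date, con and lab; adjust to suit however you have stored the data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

polls = pd.read_csv("polls_2015.csv", parse_dates=["date"])   # assumed columns: date, con, lab

rng = np.random.default_rng(0)

def jitter(series, width=0.2):
    """Add a little vertical noise so coincident points do not overprint."""
    return series + rng.uniform(-width, width, len(series))

fig, ax = plt.subplots()
ax.plot(polls["date"], jitter(polls["con"]), "o", alpha=0.6, label="Conservative")
ax.plot(polls["date"], jitter(polls["lab"]), "o", alpha=0.6, label="Labour")
ax.axvline(pd.Timestamp("2015-05-07"), linestyle="--", color="grey", label="Election")
ax.set_ylabel("Poll share (%)")
ax.legend()
plt.show()
```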

[Figure: Conservative and Labour poll shares plotted against date, 25 January to 31 December 2015]

Now that is an arresting plot. Yet again, plotting against time elucidates the cause system. Something happened on the date of the election. Before the election the polls had the two parties neck and neck. The instant (sic) the election was done there was clear red/blue water between the parties. Applying my (very moderate) level of domain knowledge to the pre-election data, the poll results look stable and predictable. After the election there is a shift to a new datum that, again, remains stable and predictable. The respective arithmetic means are given below.

Party          Mean poll before   Election result   Mean poll after
Conservative   33.3%              37.8%             38.8%
Labour         33.5%              31.2%             30.9%

The mean of the post-election polls is doing fairly well but is markedly different from the pre-election results. Now, it is trite statistics that the variation we observe on a chart is the aggregate of variation from two sources.

  • Variation from the thing of interest; and
  • Variation from the measurement process.
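In symbols, and assuming the two sources are independent, the variances simply add:

σ²(observed) = σ²(thing of interest) + σ²(measurement process)

Here the measurement process is the polling methodology itself.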

As far as I can gather, the sampling methods used by the polling companies have not so far been modified. They were awaiting the NCRM report. They certainly weren’t modified in the few days following the election. The abrupt change on 7 May cannot be because of corrected sampling methods. The misleading pre-election data and the “impressive” post-election polls were derived from common sampling practices. It seems to me difficult to reconcile NCRM’s narrative to the historical data. The shift in the data certainly needs explanation within that account.

What did change on the election date was that a distant intention turned into the recall of a past action. What everyone wants to know in advance is the result of the election. Unsurprisingly, and as we generally find, it is not possible to sample the future. Pollsters, and their clients, have to be content with individuals’ perceptions of how they will vote. The vast majority of people pay very little attention to politics and the general level of interest outside election time is de minimis. Standing in a polling booth with a ballot paper is a very different matter from being asked about intentions some days, weeks or months hence. Most people take voting very seriously. It is not obvious that the same diligence is directed towards answering pollsters’ questions.

Perhaps the problems aren’t statistical at all and are more concerned with what psychologists call affective forecasting: predicting how we will feel and behave under future circumstances. Individuals are notoriously susceptible to all sorts of biases and inconsistencies in such forecasts. It must at least be a plausible source of error that intentions are only imperfectly formed in advance and that their mapping into votes is not straightforward. Is it possible that after the election respondents, once again disengaged from politics, simply recalled how they had voted in May? That would explain the good alignment with the actual election results.

Imperfect foresight of voting intention before the election and 20/25 hindsight after is, I think, a narrative that sits well with the data. There is no reason whatever why internal reflections in the Cartesian theatre of future voting should be an unbiased predictor of actual votes. In fact, I think it would be a surprise, and one demanding explanation, if they were so.

The NCRM report does make some limited reference to post-election re-interviews of contacts. However, this is presented in the context of a possible “late swing” rather than affective forecasting. There are no conclusions I can use.

Meta-analysis

The UK polls took a horrible beating when they signally failed to predict the result of the 1992 election, under-estimating the Conservative lead by around 8%.3 Things then felt better. The 1997 election was a happier affair: Labour led by 13% at the election, with final polls in the range of 10 to 18%.4 In 2001 each poll managed to get the Conservative vote within 3% but all over-estimated the Labour vote, some pollsters by as much as 5%.5 In 2005, the final poll had Labour on 38% and Conservative on 33%; the popular vote was Labour 36.2% and Conservative 33.2%.6 In 2010 the final poll had Labour on 29% and Conservative on 36%, against a popular vote of 29.7%/36.9%.7 The debacle of 1992 was all but forgotten until 2015 brought it back, to the pundits’ dismay.

Given the history and given the inherent difficulties of sampling and affective forecasting, I’m not sure why we are so surprised when the polls get it wrong. Unfortunately for the election strategist they are all we have. That is a common theme with real world data. Because of its imperfections it has to be interpreted within the context of other sources of evidence rather than followed slavishly. The objective is not to be driven by data but to be led by the insights it yields.

References

  1. Opinion polling for the 2015 United Kingdom general election. (2016, January 19). In Wikipedia, The Free Encyclopedia. Retrieved 22:57, January 20, 2016, from https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_2015_United_Kingdom_general_election&oldid=700601063
  2. Opinion polling for the next United Kingdom general election. (2016, January 18). In Wikipedia, The Free Encyclopedia. Retrieved 22:55, January 20, 2016, from https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_next_United_Kingdom_general_election&oldid=700453899
  3. Butler, D & Kavanagh, D (1992) The British General Election of 1992, Macmillan, Chapter 7
  4. — (1997) The British General Election of 1997, Macmillan, Chapter 7
  5. — (2002) The British General Election of 2001, Palgrave-Macmillan, Chapter 7
  6. Kavanagh, D & Butler, D (2005) The British General Election of 2005, Palgrave-Macmillan, Chapter 7
  7. Cowley, P & Kavanagh, D (2010) The British General Election of 2010, Palgrave-Macmillan, Chapter 7

The Iron Law at Volkswagen

So Michael Horn, VW’s US CEO, has made a “sincere apology” for what went on at VW.

And like so many “sincere apologies” he blamed somebody else. “My understanding is that it was a couple of software engineers who put these in.”

As an old automotive hand I have always been very proud of the industry. I have held it up as a model of efficiency, aesthetic aspiration, ambition, enlightenment and probity. My wife will tell you how many times I have responded to tales of workplace chaos with “It couldn’t happen in a car plant”. Fortunately we don’t own a VW but I still feel betrayed by this. Here’s why.

A known risk

Everybody knew, from the infancy of emissions testing (which came along at about the same time as the adoption of engine management systems), about the risks of a “cheat device”. It was obvious to all that engineers might be tempted to manoeuvre a recalcitrant engine through a challenging emissions test by writing software to detect test conditions and modify performance accordingly.

In the better sort of motor company, engineers were left in no doubt that this was forbidden and the issue was heavily policed with code reviews and process surveillance.

This was not something that nobody saw coming, not a blind spot of risk identification.

The Iron Law

I wrote before about the Iron Law of Oligarchy. Decision taking executives in an organisation try not to pass information upwards. That will only result in interference and enquiry. Supervisory boards are well aware of this phenomenon because, during their own rise to the board, they themselves were the senior managers who constituted the oligarchy and who kept all the information to themselves. As I guessed last time I wrote, decisions like this don’t get taken at board level. They are taken out of the line of sight of the board.

Governance

So here we have a known risk. A threat that would likely not be detected in the usual run of line management. And it was of such a magnitude as would inflict hideous ruin on Volkswagen’s value, accrued over decades of hard built customer reputation. Volkswagen, an eminent manufacturer with huge resources, material, human and intellectual. What was the governance function to do?

Borrowing strength again

It would have been simple, actually simple, to secret-shop the occasional vehicle and run it through an on-road emissions test. Any surprising discrepancy between the results and the regulatory tests would then have been a signal that the company was at risk and triggered further investigation. An important check on the integrity of any data is to compare it with cognate data collected by an independent route, data from which it can borrow strength.
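A sketch of the sort of check I have in mind is below. The figures, and the allowance for real driving being harsher than the test cycle, are invented; the point is only that an independent measurement route turns a vague suspicion into a signal that demands investigation.

```python
import statistics

# Invented figures: NOx (g/km) from the regulatory test cycle versus
# independent secret-shop, on-road measurements of the same model.
regulatory_test = 0.16
on_road_checks = [1.10, 0.95, 1.30, 1.05]

ratio = statistics.mean(on_road_checks) / regulatory_test

# A generous allowance for real driving being harsher than the test cycle;
# anything beyond it is a signal that the company is at risk.
ALLOWED_RATIO = 3.0
if ratio > ALLOWED_RATIO:
    print(f"On-road emissions {ratio:.1f}x the type-approval figure: investigate.")
```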

Volkswagen’s governance function simply didn’t do the simple thing. Never have so many ISO 31000 manuals been printed in vain. Theirs were the pot odds of a jaywalker.

Knowledge

In the English breach of trust case of Baden, Delvaux and Lecuit v Société Générale [1983] BCLC 325, Mr Justice Peter Gibson identified five levels of knowledge that might implicate somebody in wrongdoing.

  • Actual knowledge.
  • Wilfully shutting one’s eyes to the obvious (Nelsonian knowledge).
  • Wilfully and recklessly failing to make such enquiries as an honest and reasonable man would make.
  • Knowledge of circumstances that would indicate the facts to an honest and reasonable man.
  • Knowledge of circumstances that would put an honest and reasonable man on enquiry.

I wonder where VW would place themselves in that.

How do you sound when you feel sorry?

… is the somewhat barbed rejoinder to an ungracious apology. Let me explain how to be sorry. There are three “R”s.

  • Remorse: Different from the “regret” that you got caught. A genuine internal emotional reaction. The public are good at spotting when emotions are genuine but it is best evidenced by the following two “R”s.
  • Reparation: Trying to undo the damage. VW will not have much choice about this as far as the motorists are concerned but the shareholders may be a different matter. I don’t think Horn’s director’s insurance will go very far.
  • Reform: This is the barycentre of repentance. Can VW now change the way it operates to adopt genuine governance and systematic risk management?

Mr Horn tells us that he has little control over what happens in his company. That is probably true. I trust that he will remember that at his next remuneration review. If there is one.

When they said, “Repent!”, I wonder what they meant.

Leonard Cohen
The Future

First thoughts on VW’s emissions debacle

It is far too soon to tell exactly what went on at VW, in the wider motor industry, within the respective regulators and within governments. However, the way that the news has come out, and the financial and operational impact that it is likely to have, are enough to encourage all enterprises to revisit their risk management, governance and customer reputation management policies. Corporate scandals are not a new phenomenon, from the collapse of the Medici Bank in 1494, Warren Hastings’ alleged despotism in the British East India Company, down to the FIFA corruption allegations that broke earlier this year. Organisational scandals are as old as organisations. The bigger the organisations get, the bigger the scandals are going to be.

Normal Scandals

In 1984, Charles Perrow published his pessimistic analysis of what he saw as the inevitability of Normal Accidents in complex technologies. I am sure that there is a market for a book entitled Normal Scandals: Living with High-Risk Organisational Structures. But I don’t share Perrow’s pessimism. Life is getting safer. Let’s adopt the spirit of continual improvement to make investment safer too. That’s investment for those of us trying to accumulate a modest portfolio for retirement. Those who aspire to join the super rich will still have to take their chances.

I fully understand that organisations sometimes have to take existential risks to stay in business. The development of Rolls-Royce’s RB211 aero-engine well illustrates what happens when a manufacturer finds itself with proven technologies that are inadequately aligned with the Voice of the Customer. The market will not wait while the business catches up. There is time to develop a response but only if that solution works first time. In the case of Rolls-Royce it didn’t and insolvency followed. However, there was no alternative but to try.

What happened at VW? I just wonder whether the Iron Law of Oligarchy was at work. To imagine that a supervisory board sits around discussing the details of engine management software is naïve. In fact it was the RB211 crisis that condemned such signal failures of management to delegate. Do VW’s woes flow from a decision taken by a middle manager, or a blind eye turned, that escaped an inadequate system of governance? Perhaps a short term patch in anticipation of an ultimate solution?

Cardinal Newman’s contribution to governance theory

John Henry Newman learned about risk management the hard way. Newman was an English Anglican divine who converted to the Catholic Church in 1845. In 1850 Newman became involved in the controversy surrounding Giacinto Achilli, a priest expelled from the Catholic Church for rape and sexual assault but who was making a name for himself in England as a champion of the protestant evangelical cause. Conflict between Catholic and protestant was a significant feature of the nineteenth-century English political landscape. Newman was minded to ensure that Achilli’s background was widely known. He took legal advice from counsel James Hope-Scott about the risks of a libel action from Achilli. Hope-Scott was reassuring and Newman published. The publication resulted in Newman’s prosecution and conviction for criminal libel.

Speculation about what legal advice VW have received as to their emissions strategy would be inappropriate. However, I trust that, if they imagined they were externalising any risk thereby, they checked the value of their legal advisors’ professional indemnity insurance.

Newman certainly seems to have learned his lesson and subsequently had much to teach the modern world about risk management and governance. After the Achilli trial Newman started work on his philosophical apologia, The Grammar of Assent. One argument in that book has had such an impact on modern thinking about evidence and probability that it was quoted in full by Bruno de Finetti in Volume 1 of his 1974 Theory of Probability.

Suppose a thesis (e.g. the guilt of an accused man) is supported by a great deal of circumstantial evidence of different forms, but in agreement with each other; then even if each piece of evidence is in itself insufficient to produce any strong belief, the thesis is decisively strengthened by their joint effect.

De Finetti set out the detailed mathematics and called this the Cardinal Newman principle. It is fundamental to the modern concept of borrowing strength.
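The arithmetic is easy to sketch. With invented likelihood ratios, five independent pieces of evidence, each on its own only three times more likely under guilt than under innocence, combine into odds that no single item comes close to.

```python
# Invented example of the Cardinal Newman principle: independent evidence
# combines by multiplying the odds.
prior_odds = 1.0                      # start at evens
likelihood_ratios = [3, 3, 3, 3, 3]   # five individually weak items of evidence

posterior_odds = prior_odds
for lr in likelihood_ratios:
    posterior_odds *= lr

posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"Posterior odds {posterior_odds:.0f}:1, probability {posterior_prob:.3f}")
# 243:1, about 0.996, from pieces that are individually unconvincing.
```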

The standard means of defeating governance are all well known to oligarchs: regulatory capture; “stake-driving” – taking actions outside the oversight of governance that will not be undone without engaging the regulator in controversy; and “whipsawing” – promising A that approval will be forthcoming from B while telling B that A has relied upon her anticipated, and surely “uncontroversial”, approval. There are plenty of others. Robert Caro’s biography The Power Broker: Robert Moses and the Fall of New York sets out the locus classicus.

Governance functions need to exploit the borrowing strength of diverse data sources to identify misreporting and misconduct. And continually improve how they do that. The answer is trenchant and candid criticism of historical data. That’s the only data you have. A rigorous system of goal deployment and mature use of process behaviour charts delivers a potent stimulus to reluctant data sharers.
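The tools need not be elaborate. Here is a minimal sketch of an individuals (XmR) process behaviour chart check on a reported series; the figures are invented.

```python
import numpy as np

# Invented monthly figures reported up the line. The question for governance is
# whether the latest value is routine variation or a signal worth a question.
reported = np.array([102, 98, 105, 101, 97, 103, 99, 104, 100, 96, 131])

baseline = reported[:-1]
moving_ranges = np.abs(np.diff(baseline))
centre = baseline.mean()
upper = centre + 2.66 * moving_ranges.mean()   # natural process limits for an XmR chart
lower = centre - 2.66 * moving_ranges.mean()

latest = reported[-1]
if not (lower <= latest <= upper):
    print(f"{latest} lies outside [{lower:.1f}, {upper:.1f}]: ask questions.")
```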

Things and actions are what they are and the consequences of them will be what they will be: why then should we desire to be deceived?

Bishop Joseph Butler

 

Toxic

[Image: engine exhaust contrails]

There has been much in the UK press this week about alleged personal injuries from what has been described as “toxic air” in aircraft. Contamination of cabin air with, perhaps, organophosphates from the engines, either ambiently or during “fume events”, is alleged to cause ill health in both air crew and passengers. It seems that pre-action correspondence is being sent and litigation is afoot.

Of course, the issues, engineering, physiological and legal, are complex and await a proper forensic exploration. The courts are actually very good at this sort of thing as I shall go on to discuss below. However, the press coverage reminded me of one of the recurrent themes in this blog, trust in bureaucracy.

Trust

Part of the background to the litigation is found in the work of the Committee on Toxicity (“the CoT”). The CoT consists of working scientists who provide independent advice to the UK government. The CoT looked into the “toxic air” allegations. In their report, the CoT concede that the systems for measuring cabin air quality are not entirely satisfactory. However, the CoT go on to arrive at the following conclusion as to ambient exposure:

For the types of aircraft studied, and in the absence of a major fume event, airborne concentrations of the pollutants that were measured in the study are likely to be very low (well below the levels that might cause symptoms) during most flights. The data do not rule out the possibility of higher concentrations on some flights … or of higher concentrations of other pollutants that were not measured.

— and for the “fume events”:

… the Committee considers that a toxic mechanism for the illness that has been reported in temporal relation to fume incidents is unlikely. Many different chemicals have been identified in the bleed air from aircraft engines, but to cause serious acute toxicity, they would have to occur at very much higher concentrations than have been found to date (although lower concentrations of some might cause an odour or minor irritation of the eyes or airways). Furthermore, the symptoms that have been reported following fume incidents have been wide-ranging (including headache, hot flushes, nausea, vomiting, chest pain, respiratory problems, dizziness and light-headedness), whereas toxic effects of chemicals tend to be more specific. However, uncertainties remain, and a toxic mechanism for symptoms cannot confidently be ruled out.

It’s not unusual for academics to be guarded if asked for an opinion and the CoT certainly don’t regard fume related injuries as impossible. However, having taken the matter as far as they are able with their resources, their honest opinion is that the reported symptoms were not caused by toxic fumes. I have not been able to find any fully argued study that says that they are. And yet, as the BBC points out, there are anecdotes that have to be considered against a background of data that, in itself, does not conclusively exclude the alleged symptoms. The matter is not quite closed but this turns out to be another issue beset with personal attitudes to evidence and risk.

Any lawyer has to be on the side of their client. However, when the BBC interviewed aviation lawyer Frank Cannon I think he went a little further than mere advocacy in his cause. He said:

If you look at the tobacco industry, the asbestos, contaminated blood issues, if you look at all that, the government say it’s perfectly safe, perfectly safe and then “wham”, they suddenly have to admit they got it wrong for so many years.

I am pretty sure that the UK government, at least, never advised that tobacco or asbestos was safe. William Cooke, the pathologist of Wigan infirmary, made arguably the first scientific report of lung disease caused by asbestos in 1924. There had been anecdotal evidence previously but Cooke’s was the first systematic analysis. Regulation and successful litigation soon followed. I am not aware of any serious body of scientific opinion ever saying that airborne asbestos exposure was safe after that point.


As to smoking tobacco, the first statistical evidence associating smoking with cancer seems to have come in 1929 from Fritz Lickint. After Richard Doll’s work from the 1950s onwards I don’t think there was serious scientific dispute.

Of course, in the early years of the twentieth century life was comparatively unregulated. Though an absence of regulatory framework may now appear like a governmental endorsement, that is to apply a very much post-World War II perspective. In any event, governments did respond with regulation, on both smoking and asbestos, even if its rigour is condemned by hindsight. The story of asbestos is a particularly tragic one. The story of contaminated blood is, I admit, more complex. I think it will make an edifying subject for a further blog.

The narrative of a callous, self-serving government bureaucracy only exposed by the heroic endeavours of maverick scientists is an attractive one to many people. Its prototype is Ibsen’s 1882 play An Enemy of the People. The twist in that drama is [spoiler alert!] that the population join the bureaucracy in turning against the scientist, whose credibility goes notably unchallenged by the author.

Attitudes to risk are entangled with emotional responses to broader cultural matters, as I blogged about here. That ecology of personal attitudes also feeds into how individuals react to the outputs of a bureaucracy, even one holding itself out as an exemplar of scientific objectivity, as I blogged about here. It is amid those conflicting cultural responses that forensic examination has a real part to play in resolving the conflicting doubts.

Forensics

Thereza Imanishi-Kari was a postdoctoral researcher in molecular biology at the Massachusetts Institute of Technology. In 1986 a co-worker raised inconsistencies in Imanishi-Kari’s earlier published work that led to allegations that she had fabricated results to validate publicly funded research. In his excellent 1998 book The Baltimore Case, Daniel Kevles details the growing intensity of the allegations against Imanishi-Kari over the following decade, involving the US Congress, the Office of Scientific Integrity and the FBI. Imanishi-Kari was ultimately exonerated by a departmental appeal board constituted of an eminent molecular biologist and two lawyers. The board allowed cross-examination of the relevant experts including those in statistics and document examination. It was that cross-examination that exposed the allegations as without foundation.

As eminent an engineer as George Stephenson found that he could not ask Parliament to approve the building of the Liverpool and Manchester Railway on the basis of faulty surveying that he had not properly supervised. After his cross-examination by Edward Hall Alderson he complained:

I was not long in the witness box before I began to wish for a hole to creep out at.

Certainly in England and Wales, expert evidence only provides guidelines within which the court makes its findings of fact. In the Canadian case of Reynolds v C.S.N. the learned judge, analysing whether a strike-induced shutdown at an aluminium facility had caused plant damage, disregarded the evidence of two statisticians, who could not agree how to calculate a Kaplan-Meier estimator, and preferred that of an engineer who had adopted a superficially less exact approach.

Process improvement

Though every branch of science has been advancing with sure and rapid strides, it is perhaps not too much to say that from the time of Lord Mansfield, and Folkes v Chadd, to the present, there has been a steady decrease in the credit awarded to the testimony of scientific witnesses.

Anonymous
“Expert testimony”
American Law Review (1870)

Throughout the nineteenth century the forensic evidence of scientific experts garnered a poor reputation. Robert Angus Smith, the discoverer of acid rain, refused to take expert work as he regarded it as corrupt beyond remedy and wished not to taint his reputation.

However, English law gradually drew the matter under supervision. The whole process by which English law adapted to embrace the conflicting evidence of specialists, woven through their respective esoteric expertise, is set out by Tal Golan in Chapter Three of his 2004 history of expert evidence, Laws of Men and Laws of Nature. Within the common law world, evaluation of expert evidence continues to evolve. The Australian courts have made important contributions with innovations such as hot tubbing. The common law courts have developed into a sophisticated forum for adjudicating on competing claims as to knowledge, not from an absolute standpoint, but from the pragmatic worldview of allocating resources. For practical people there has to be an end to every dispute.

The life of the law has not been logic; it has been experience… The law embodies the story of a nation’s development through many centuries, and it cannot be dealt with as if it contained only the axioms and corollaries of a book of mathematics.

Oliver Wendell Holmes
The Common Law (1881)

Is data the plural of anecdote?

I seem to hear this intriguing quote everywhere these days.

The plural of anecdote is not data.

There is certainly one website that traces it back to Raymond Wolfinger, a political scientist from Berkeley, who claims to have said sometime around 1969 to 1970:

The plural of anecdote is data.

So, which is it?

Anecdote

My Concise Oxford English Dictionary (“COED”) defines “anecdote” as:

Narrative … of amusing or interesting incident.

Wiktionary gives a further alternative definition.

An account which supports an argument, but which is not supported by scientific or statistical analysis.

[Image: Edward Jenner, portrait by James Northcote]

It’s clear that anecdote itself is a concept without a very exact meaning. It’s a story, not usually reported through an objective channel such as journalism or scientific or historical research, that carries some implication of its own unreliability. Perhaps it is inherently implausible when read against objective background evidence. Perhaps it is hearsay or multiple hearsay.

The anecdote’s suspect reliability is offset by the evidential weight it promises, either as a counter example to a cherished theory or as compelling support for a controversial hypothesis. Lyall Watson’s hundredth monkey story is an anecdote. So, in eighteenth century England, was the folk wisdom, recounted to Edward Jenner (pictured), that milkmaids were generally immune to smallpox.

Data

My COED defines “data” as:

Facts or information, esp[ecially] as basis for inference.

Wiktionary gives a further alternative definition.

Pieces of information.

Again, not much help. But the principal definition in the COED is:

Thing[s] known or granted, assumption or premise from which inferences may be drawn.

The suggestion in the word “data” is that what is given is the reliable starting point from which we can start making deductions or even inductive inferences. Data carries the suggestion of reliability, soundness and objectivity captured in the familiar Arthur Koestler quote.

Without the little hard bits of marble which are called “facts” or “data” one cannot compose a mosaic …

Yet it is common knowledge that “data” cannot always be trusted. Trust in data is a recurring theme in this blog. Cyril Burt’s purported data on the heritability of IQ is a famous case. There are legions of others.

Smart investigators know that the provenance, reliability and quality of data cannot be taken for granted but must be subject to appropriate scrutiny. The modern science of Measurement Systems Analysis (“MSA”) has developed to satisfy this need. The defining characteristic of anecdote is that it has been subject to no such scrutiny.
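To give a flavour, here is a hedged sketch with invented figures: a simple repeatability study and a precision-to-tolerance ratio, one common way of asking whether a measurement system is fit for the decision it has to support.

```python
import numpy as np

# Invented repeatability study: one appraiser measures the same reference item
# ten times, so the spread is pure measurement-system variation.
repeats = np.array([10.02, 9.98, 10.05, 9.97, 10.01, 10.03, 9.99, 10.04, 9.96, 10.00])
tolerance = 0.50   # invented specification width

measurement_sd = repeats.std(ddof=1)
p_to_t = 6 * measurement_sd / tolerance   # one common convention uses 6 standard deviations
print(f"Measurement sd {measurement_sd:.3f}; precision-to-tolerance ratio {p_to_t:.2f}")
# A common rule of thumb treats a ratio much above 0.3 as a gauge that eats too
# much of the tolerance to support the decision at hand.
```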

Evidence

Anecdote and data, as broadly defined above, are both forms of evidence. All evidence is surrounded by a penumbra of doubt and unreliability. Even the most exacting engineering measurement is accompanied by a recognition of its uncertainty and the limitations that places on its use and the inferences that can be drawn from it. In fact, it is exactly because such a measurement comes accompanied by a numerical characterisation of its precision and accuracy that its reliability and usefulness are validated.

It seems inherent in the definition of anecdote that it should not be taken at face value. Happenstance or wishful fabrication, it may not be a reliable basis for inference or, still less, action. However, it was Jenner’s attention to the smallpox story that led him to develop vaccination against smallpox. No mean outcome. Against that, the hundredth monkey story is mere fantastical fiction.

Anecdotes about dogs sniffing out cancer stand at the beginning of the journey of confirmation and exploitation.

Two types of analysis

Part of the answer to the dilemma comes from statistician John Tukey’s observation that there are two kinds of data analysis: Exploratory Data Analysis (“EDA”) and Confirmatory Data Analysis (“CDA”).

EDA concerns the exploration of all the available data in order to suggest some interesting theories. As economist Ronald Coase put it:

If you torture the data long enough, it will confess.

Once a concrete theory or hypothesis is in mind, a rigorous process of data generation allows formal statistical techniques to be brought to bear (“CDA”) in separating the signal in the data from the noise and in testing the theory. People who muddle up EDA and CDA tend to get into difficulties. It is a foundation of statistical practice to understand the distinction and its implications.
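A small simulation makes the point. Here twenty invented candidate factors have no real relationship with the outcome; trawling the same data that suggested the best-looking factor flatters it, and only a fresh, pre-specified test, the CDA step, gives an honest answer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented setup: 20 candidate factors, none of which truly affects the outcome.
n, k = 50, 20
factors = rng.normal(size=(n, k))
outcome = rng.normal(size=n)

# EDA: trawl the data for the factor that looks most related to the outcome.
pvals = [stats.pearsonr(factors[:, j], outcome)[1] for j in range(k)]
best = int(np.argmin(pvals))
print(f"Best-looking factor {best}: p = {pvals[best]:.3f} on the data that suggested it")

# CDA: collect fresh data and test only that one pre-specified hypothesis.
fresh_factor = rng.normal(size=n)
fresh_outcome = rng.normal(size=n)
print(f"Same hypothesis on fresh data: p = {stats.pearsonr(fresh_factor, fresh_outcome)[1]:.3f}")

# With 20 candidates the best of the bunch will often look "significant" by
# chance alone; the fresh, pre-specified test is what separates signal from noise.
```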

Anecdote may be well suited to EDA. That’s how Jenner successfully proceeded though his CDA of testing his vaccine on live human subjects wouldn’t get past many ethics committees today.

However, absent that confirmatory CDA phase, the beguiling anecdote may be no more than the wrecker’s false light.

A basis for action

Tukey’s analysis is useful for the academic or the researcher in an R&D department where the environment is not dynamic and time not of the essence. Real life is more problematic. There is not always the opportunity to carry out CDA. The past does not typically repeat itself so that we can investigate outcomes with alternative factor settings. As economist Paul Samuelson observed:

We have but one sample of history.

History is the only thing that we have any data from. There is no data on the future. Tukey himself recognised the problem and coined the phrase uncomfortable science for inferences from observations whose repetition was not feasible or practical.

In his recent book Strategy: A History (Oxford University Press, 2013), Lawrence Freedman points out the risks of managing by anecdote in “The Trouble with Stories” (pp. 615-618). As Nobel laureate psychologist Daniel Kahneman has investigated at length, our interpretation of anecdote is beset by all manner of cognitive biases, such as the availability heuristic and the base rate fallacy. The traps for the statistically naïve are perilous.

But it would be a fool who would ignore all evidence that could not be subjected to formal validation. With a background knowledge of statistical theory and psychological biases, it is possible to manage trenchantly. Bayes’ theorem suggests that all evidence has its value.

I think that the rather prosaic answer to the question posed at the head of this blog is that data is the plural of anecdote, as it is the singular, but anecdotes are not the best form of data. They may be all you have in the real world. It would be wise to have the sophistication to exploit them.

Trust in data – IV – trusting the team

Today (20 November 2013) I was reading an item in The Times (London) with the headline “We fiddle our crime numbers, admit police”. This is a fairly unedifying business.

The blame is once again laid at the door of government targets and performance related pay. I fear that this is akin to blaming police corruption on the largesse of criminals. If only organised crime would stop offering bribes, the police would not succumb to taking them in consideration of repudiating their office as constable, so the argument might run (pace Brian Joiner). Of course, the argument is nonsense. What we expect of police constables is honesty even, perhaps especially, when temptation presents itself. We expect the police to give truthful evidence in court, to deal with the public fairly and to conduct their investigations diligently and rationally. The public expects the police to behave in this way even in the face of manifest temptation to do otherwise. The public expects the same honest approach to reporting their performance. I think Robert Frank put it well in Passions within Reason.

The honest individual … is someone who values trustworthiness for its own sake. That he might receive a material payoff for such behaviour is beyond his concern. And it is precisely because he has this attitude that he can be trusted in situations where his behaviour cannot be monitored. Trustworthiness, provided it is recognizable, creates valuable opportunities that would not otherwise be available.

Matt Ridley put it starkly in his overview of evolutionary psychology, The Origins of Virtue. He wasn’t speaking of policing in particular.

The virtuous are virtuous for no other reason than that it enables them to join forces with others who are virtuous, for mutual benefit.

What worried me most about the article was a remark from Peter Barron, a former detective chief superintendent in the Metropolitan Police. Should any individual challenge the distortion of data:

You are judged to be not a team player.

“Teamwork” can be a smokescreen for the most appalling bullying. In our current corporate cultures, to be branded as “not a team player” can be the most horrible slur, smearing the individual’s contribution to the overall mission. One can see how such an environment can allow a team’s behaviours and objectives to become misaligned from those of the parent organisation. That is a problem that can often be addressed by management with a proper system of goal deployment.

However, the problem is more severe when the team is in fact well aligned to what are distorted organisational goals. The remedies for this lie in the twin processes of governance and whistleblowing. Neither seem to be working very well in UK policing at the moment but that simply leaves an opportunity for process improvement. Work is underway. The English law of whistleblowing has been amended this year. If you aren’t familiar with it you can find it here.

Governance has to take scrutiny of data seriously. Reported performance needs to be compared with other sources of data. Reporting and recording processes need themselves to be assessed. Where there is no coherent picture questions need to be asked.

Trust in data – III – being honest about honesty

I found this presentation by Dan Ariely intriguing. I suspect that this is originally a TED talk with some patronising cartoons added. You can just listen.

When I started off in operational excellence, learning about the Deming philosophy, my instructors always used to say, “These are honest men’s [sic] tools.” From that point of view Ariely’s presentation is pretty pessimistic. I don’t think I am entirely surprised when I recall Matt Ridley’s summary of evolutionary psychology from his book The Origins of Virtue.

Human beings have some instincts that foster the greater good and others that foster self-interest and anti-social behaviour. We must design a society that encourages the former and discourages the latter.

When wearing a change management hat it’s easy to be sanguine about designing a system or organisation that fosters virtue and the sort of diligent data collection that confronts present reality. However, it is useful to have a toolkit of tactics to build such a system. I think Ariely’s ideas are helpful here.

His idea of “reminders” is something that resonates with maintaining a continual focus on the Voice of the Customer/ Voice of the Business. Periodically exploring with data collectors the purpose of their data collection and the system wide consequences of fabrication is something that seems worthwhile in itself. However, the work Ariely refers to suggests that there might be reasons why such a “nudge” would be particularly effective in improving data trustworthiness.

His idea of “confessions” is a little trickier. I might reflect for a while then blog some more.