Is data the plural of anecdote?

I seem to hear this intriguing quote everywhere these days.

The plural of anecdote is not data.

There is certainly one website that traces it back to Raymond Wolfinger, a political scientist from Berkeley, who claims to have said sometime around 1969 to 1970:

The plural of anecdote is data.

So, which is it?

Anecdote

My Concise Oxford English Dictionary (“COED”) defines “anecdote” as:

Narrative … of amusing or interesting incident.

Wiktionary gives a further alternative definition.

An account which supports an argument, but which is not supported by scientific or statistical analysis.

[Image: Edward Jenner by James Northcote]

It’s clear that anecdote itself is a concept without a very exact meaning. It’s a story, not usually reported through an objective channel such as journalism, or scientific or historical research, that carries some implication of its own unreliability. Perhaps it is inherently implausible when read against objective background evidence. Perhaps it is hearsay or multiple hearsay.

The anecdote’s suspect reliability is offset by the evidential weight it promises, either as a counter example to a cherished theory or as compelling support for a controversial hypothesis. Lyall Watson’s hundredth monkey story is an anecdote. So, in eighteenth century England, was the folk wisdom, recounted to Edward Jenner (pictured), that milkmaids were generally immune to smallpox.

Data

My COED defines “data” as:

Facts or information, esp[ecially] as basis for inference.

Wiktionary gives a further alternative definition.

Pieces of information.

Again, not much help. But the principal definition in the COED is:

Thing[s] known or granted, assumption or premise from which inferences may be drawn.

The suggestion in the word “data” is that what is given is the reliable starting point from which we can start making deductions or even inductive inferences. Data carries the suggestion of reliability, soundness and objectivity captured in the familiar Arthur Koestler quote.

Without the little hard bits of marble which are called “facts” or “data” one cannot compose a mosaic …

Yet it is common knowledge that “data” cannot always be trusted. Trust in data is a recurring theme in this blog. Cyril Burt’s purported data on the heritability of IQ is a famous case. There are legions of others.

Smart investigators know that the provenance, reliability and quality of data cannot be taken for granted but must be subject to appropriate scrutiny. The modern science of Measurement Systems Analysis (“MSA”) has developed to satisfy this need. The defining characteristic of anecdote is that it has been subject to no such scrutiny.

Evidence

Anecdote and data, as broadly defined above, are both forms of evidence. All evidence is surrounded by a penumbra of doubt and unreliability. Even the most exacting engineering measurement is accompanied by a recognition of its uncertainty and the limitations that places on its use and the inferences that can be drawn from it. In fact, it is exactly because such a measurement comes accompanied by a numerical characterisation of its precision and accuracy that its reliability and usefulness are validated.

It seems inherent in the definition of anecdote that it should not be taken at face value. Happenstance or wishful fabrication, it may not be a reliable basis for inference or, still less, action. However, it was Jenner’s attention to the smallpox story that led him to develop vaccination against smallpox. No mean outcome. Against that, the hundredth monkey story is mere fantastical fiction.

Anecdotes about dogs sniffing out cancer stand at the beginning of the journey of confirmation and exploitation.

Two types of analysis

Part of the answer to the dilemma comes from statistician John Tukey’s observation that there are two kinds of data analysis: Exploratory Data Analysis (“EDA”) and Confirmatory Data Analysis (“CDA”).

EDA concerns the exploration of all the available data in order to suggest some interesting theories. As economist Ronald Coase put it:

If you torture the data long enough, it will confess.

Once a concrete theory or hypothesis is in mind, a rigorous process of data generation allows formal statistical techniques to be brought to bear (“CDA”) in separating the signal in the data from the noise and in testing the theory. People who muddle up EDA and CDA tend to get into difficulties. It is a foundation of statistical practice to understand the distinction and its implications.
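The discipline behind the EDA/CDA distinction can be sketched in a few lines of Python. This is a toy illustration with simulated data, not anything from the examples above: trawl one half of the data freely for interesting patterns, then test the single hypothesis that emerges against the held-out half only.

```python
import random
import statistics

random.seed(1)

# Simulated measurements: 200 observations with, in truth, no effect at all.
data = [random.gauss(0, 1) for _ in range(200)]

# EDA: explore the first half freely, looking for anything interesting.
explore, confirm = data[:100], data[100:]
mean_explore = sum(explore) / len(explore)

# Suppose exploration suggests the mean is positive. CDA: test that one
# pre-specified hypothesis on the held-out half only.
mean_confirm = sum(confirm) / len(confirm)
se = statistics.stdev(confirm) / len(confirm) ** 0.5
z = mean_confirm / se  # rough one-sample z statistic

print(f"exploratory mean: {mean_explore:.3f}")
print(f"confirmatory z:   {z:.2f}")
```

Because the confirmatory half played no part in suggesting the hypothesis, a large |z| there is genuine evidence rather than an artefact of the search. Testing on the same data that suggested the theory is the muddle the paragraph above warns against.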

Anecdote may be well suited to EDA. That’s how Jenner successfully proceeded, though his CDA of testing his vaccine on live human subjects wouldn’t get past many ethics committees today.

However, absent that confirmatory CDA phase, the beguiling anecdote may be no more than the wrecker’s false light.

A basis for action

Tukey’s analysis is useful for the academic or the researcher in an R&D department where the environment is not dynamic and time not of the essence. Real life is more problematic. There is not always the opportunity to carry out CDA. The past does not typically repeat itself so that we can investigate outcomes with alternative factor settings. As economist Paul Samuelson observed:

We have but one sample of history.

History is the only thing that we have any data from. There is no data on the future. Tukey himself recognised the problem and coined the phrase “uncomfortable science” for inferences from observations whose repetition was not feasible or practical.

In his recent book Strategy: A History (Oxford University Press, 2013), Lawrence Freedman points out the risks of managing by anecdote in “The Trouble with Stories” (pp. 615-618). As Nobel laureate psychologist Daniel Kahneman has investigated at length, our interpretation of anecdote is beset by all manner of cognitive biases, such as the availability heuristic and the base rate fallacy. The traps for the statistically naïve are perilous.

But it would be a fool who would ignore all evidence that could not be subjected to formal validation. With a background knowledge of statistical theory and psychological biases, it is possible to manage trenchantly. Bayes’ theorem suggests that all evidence has its value.
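The Bayesian point can be made concrete with a toy update, using entirely hypothetical numbers: even weak anecdotal evidence, twice as likely under the hypothesis as not, shifts a low prior belief noticeably without coming close to proving anything.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem: P(H|E) = P(E|H) P(H) / P(E)."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

# Hypothetical figures: a 10% prior that some effect is real, and an
# anecdote of the effect that is twice as likely if it is real (0.6)
# as if it is not (0.3).
prior = 0.10
post = posterior(prior, p_e_given_h=0.6, p_e_given_not_h=0.3)
print(f"{post:.3f}")  # 0.06 / 0.33 ≈ 0.182
```

The anecdote moves belief from 10% to about 18%: evidence with some value, but nothing like confirmation. That is the sense in which all evidence, even the unscrutinised sort, has its price in the Bayesian ledger.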

I think that the rather prosaic answer to the question posed at the head of this blog is that data is the plural of anecdote, as it is the singular, but anecdotes are not the best form of data. They may be all you have in the real world. It would be wise to have the sophistication to exploit them.

Bad Statistics I – the phantom line

I came across this chart on the web recently.

[Scatter chart: national average life expectancy against per capita health spending, 2011, with a fitted straight line]

This really is one of my pet hates: a perfectly informative scatter chart with a meaningless straight line drawn on it.

The scatter chart is interesting. Each individual blot represents a nation state. Its vertical position represents national average life expectancy. I take that to be mean life expectancy at birth, though that is not stated explicitly. The horizontal axis represents annual per capita health spending, though there is no indication as to whether that is adjusted for purchasing power. The whole thing is a snapshot from 2011. The message I take from the chart is that Hungary and Mexico, and I think two smaller blots, represent special causes: they are outside the experience base represented by the balance of the nations. As to the other nations, the chart suggests that average life expectancy doesn’t depend very strongly on health spending.

Of course, there is much more to a thorough investigation of the impact of health spending on outcomes. The chart doesn’t reveal differential performance as to morbidity, or lost hours, or a host of important economic indicators. But it does put forward that one, slightly surprising, message that longevity is not enhanced by health spending. Or at least it wasn’t in 2011 and there is no explanation as to why that year was isolated.

The question is then as to why the author decided to put the straight line through it. As the chart “helpfully” tells me it is a “Linear Trend line”. I guess (sic) that this is a linear regression through the blots, possibly with some weighting as to national population. I originally thought that the size of the blot was related to population but there doesn’t seem to be enough variation in the blot sizes. It looks like there are only two sizes of blot and the USA (population 318.5 million) is the same size as Norway (5.1 million).

The difficulty here is that I can see that the two special cause nations, Hungary and Mexico, have very high leverage. That means that they have a large impact on where the straight line goes, because they are so unusual as observations. The impact of those two atypical countries drags the straight line down to the left and exaggerates the impact that spending appears to have on longevity. It really is an unhelpful straight line.

These lines seem to appear a lot. I think that is because of the ease with which they can be generated in Excel. They are an example of what statistician Edward Tufte called chartjunk. They simply clutter the message of the data.

Of course, the chart here is a snapshot, not a video. If you do want to know how to use scatter charts to explain life expectancy then you need to learn here from the master, Hans Rosling.

There are no lines in nature, only areas of colour, one against another.

Edouard Manet

Trust in data – I

I was listening to the BBC’s election coverage on 2 May (2013) when Nick Robinson announced that UKIP supporters were five times more likely than other voters to believe that the MMR vaccine was dangerous.

I had a search on the web. The following graphic had appeared on Mike Smithson’s PoliticalBetting blog on 21 April 2013.

[Bar chart: belief that the MMR vaccine is unsafe, by voting intention]

It’s not an attractive bar chart. The bars are different colours. There is a “mean” bar that tends to make the variation look less than it is and makes the UKIP bar (next to it) look more extreme. I was, however, intrigued, so I had a look for the original data, which had come from a YouGov survey of 1765 respondents. You can find the data here.

Here is a summary of the salient points of the data from the YouGov website in a table which I think is less distracting than the graphic.

Voting intention     Con.   Lab.   Lib. Dem.   UKIP
No. of respondents    417    518         142    212
                        %      %           %      %
MMR safe               99     85          84     72
MMR unsafe              1      3          12     28
Don't know              0     12           3      0

My first question was: Where had Nick Robinson and Mike Smithson got their numbers from? It is possible that there was another survey I have not found. It is also possible that I am being thick. In any event, the YouGov data raises some interesting questions. This is an exploratory data analysis exercise. We are looking for interesting theories. I don’t think there is any doubt that there is a signal in this data. How do we interpret it? There does look to be some relationship between voting intention and attitude to public safety data.
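That there is a signal can be checked with a quick Pearson chi-squared test of independence. The counts below are my own rough reconstruction from the rounded percentages and respondent numbers above (so they are approximate, and columns need not sum exactly); the test itself is standard.

```python
# Approximate counts reconstructed from rounded survey percentages.
# Rows per party: [MMR safe, MMR unsafe, don't know].
counts = {
    "Con":     [413, 4, 0],
    "Lab":     [440, 16, 62],
    "Lib Dem": [119, 17, 4],
    "UKIP":    [153, 59, 0],
}

table = list(counts.values())
party_totals = [sum(row) for row in table]        # respondents per party
answer_totals = [sum(col) for col in zip(*table)]  # totals per answer
n = sum(party_totals)

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected,
# where expected assumes voting intention and answer are independent.
chi2 = sum(
    (obs - exp) ** 2 / exp
    for row, pt in zip(table, party_totals)
    for obs, at in zip(row, answer_totals)
    for exp in [at * pt / n]
)

df = (len(table) - 1) * (len(answer_totals) - 1)  # (4-1) * (3-1) = 6
print(f"chi-squared = {chi2:.1f} on {df} degrees of freedom")
```

The statistic lands far beyond the 0.1% critical value of 22.46 on 6 degrees of freedom, so the association between voting intention and attitude is not plausibly sampling noise. That is the EDA step; what the relationship means is the interesting question.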

Should anyone be tempted to sneer at people with political views other than their own, it is worth remembering that it is unlikely that anyone surveyed had scrutinised any of the published scientific research on the topic. All will have digested it, most probably at third hand, through the press, the internet, or the water cooler. They may not have any clear idea of the provenance of the assurances as to the vaccination’s safety. They may not have clearly identified issues as to whether what they had absorbed was a purportedly independent scientific study or a governmental policy statement that sought to rely on the science. I suspect that most of my readers have given it no more thought.

The mental process behind the answers probably wouldn’t withstand much analysis. This would be part of Kahneman’s System 1 thinking. However, the question of how such heuristics become established is an interesting one. I suspect there is a factor here that can be labelled “trust in data”.

Trust in data is an issue we all encounter, in business and in life. How do we know when we can trust data?

A starting point for many in this debate is the often cited observation of Brian Joiner that, when presented with a numerical target, a manager has three options: manage the system so as to achieve the target, distort the system so the target is achieved but at the cost of performance elsewhere (possibly not on the dashboard), or simply distort the data. This, no doubt true, observation is then cited in support of the general proposition that management by numerical target is at best ineffective and at worst counterproductive. John Seddon is a particular advocate of the view that, whatever benefits may flow from management by target (and they are seldom championed with any great energy), they are outweighed by the inevitable corruption of the organisation’s data generation and reporting.

It is an unhappy view. One immediate objection is that the broader system cannot operate without targets. Unless the machine part’s diameter is between 49.99 and 50.01 mm it will not fit. Unless chlorine concentrations are below the safe limit, swimmers risk being poisoned. Unless demand for working capital is cut by 10% we will face the consequences of insolvency. Advocates of the target-free world respond that those matters can be characterised as the legitimate voice of the customer/business. It is only arbitrary targets that are corrosive.

I am not persuaded that the legitimate/arbitrary distinction is a real one, nor do I see how the distinction motivates two different kinds of behaviour. I will blog more about this later. Leadership’s urgent task is to ensure that all managers have the tools to measure present reality and work to improve it. Without knowing how much improvement is essential, a manager cannot make rational decisions about the allocation of resources. In that context, when the correct management control is exercised, improving the system is easier than cheating. I shall blog about goal deployment and Hoshin Kanri on another occasion.

Trust in data is just a factor of trust in general. In his popular book on evolutionary psychology and economics, The Origins of Virtue, Matt Ridley observes the following.

Trust is as vital a form of social capital as money is a form of actual capital. … Trust, like money, can be lent (‘I trust you because I trust the person who told me he trusts you’), and can be risked, hoarded or squandered. It pays dividends in the currency of more trust.

Within an organisation, trust in data is something for everybody to work on building collaboratively under diligent leadership. As to the public sphere, trust in data is related to trust in politicians and that may be a bigger problem to solve. It is also a salutary warning as to what happens when there is a failure of trust in leadership.