Data science sold down the Amazon? Jeff Bezos and the culture of rigour

This blog appeared on the Royal Statistical Society website Statslife on 25 August 2015

This recent item in the New York Times has catalysed discussion among managers. The article tells of Amazon’s founder, Jeff Bezos, and his pursuit of rigorous data-driven management. It also tells employees’ own negative stories of how that felt emotionally.

The New York Times says that Amazon is pervaded with abundant data streams that are used to judge individual human performance and which drive reward and advancement. They inform termination decisions too.

The recollections of former employees are not the best source of evidence about how a company conducts its business. Amazon’s share of the retail market is impressive and they must be doing something right. What everybody else wants to know is, what is it? Amazon are very coy about how they operate and there is a danger that the business world at large takes the wrong messages.

Targets

Targets are essential to business. The marketing director predicts that his new advertising campaign will create demand for 12,000 units next year. The operations director looks at her historical production data. She concludes that the process lacks the capability reliably to produce those volumes. She estimates the budget required to upgrade the process and to achieve 12,000 units annually. The executive board considers the business case and signs off the investment. Both marketing and operations directors now have a target.

Targets communicate improvement priorities. They build confidence between interfacing processes. They provide constraints and parameters that prevent the system causing harm. Harm to others or harm to itself. They allow the pace and substance of multiple business processes, and diverse entities, to be matched and aligned.

But everyone who has worked in business sees it as less simple than that. The marketing and operations directors are people.

Signal and noise

Drawing conclusions from data might be an uncontroversial matter were it not for the most common feature of data, fluctuation. Call it variation if you prefer. Business measures do not stand still. Every month, week, day and hour is different. All data features noise. Sometimes it goes up, sometimes down. A whole ecology of occult causes, weakly characterised, unknown and as yet unsuspected, interact to cause irregular variation. They are what cause a coin variously to fall “heads” or “tails”. That variation may often be stable enough, or if you like “exchangeable”, so as to allow statistical predictions to be made, as in the case of the coin toss.

If all data features noise then some data features signals. A signal is a sign, an indicator that some palpable cause has made the data stand out from the background noise. It is that assignable cause which enables inferences to be drawn about what interventions in the business process have had a tangible effect and what future innovations might cement any gains or lead to bigger prospective wins. Signal and noise lead to wholly different business strategies.

The relevance for business is that people, where not exposed to rigorous decision support, are really bad at telling the difference between signal and noise. Nobel laureate economist and psychologist Daniel Kahneman has amassed a lifetime of experimental and anecdotal data capturing noise misinterpreted as signal and judgments in the face of compelling data, distorted by emotional and contextual distractions.

Signal and accountability

It is a familiar trope of business, and government, that extravagant promises are made, impressive business cases set out and targets signed off. Yet the ultimate scrutiny as to whether that envisaged performance was realised often lacks rigour. Noise, with its irregular ups and downs, allows those seeking solace from failure to pick out select data points and cast self-serving narratives on the evidence.

Our hypothetical marketing director may fail to achieve his target but recount how there were two individual months where sales exceeded 1,000, construct elaborate rationales as to why only they are representative of his efforts and point to purported external factors that frustrated the remaining ten reports. Pairs of individual data points can always be selected to support any story. This is Don Wheeler’s classic executive time series.

This is where the ability to distinguish signal and noise is critical. To establish whether targets have been achieved requires crisp definition of business measures, not only outcomes but also the leading indicators that provide context and advise judgment as to prediction reliability. Distinguishing signal and noise requires transparent reporting that allows diverse streams of data criticism. It requires a rigorous approach to characterising noise and a systematic approach not only to identifying signals but to reacting to them in an agile and sustainable manner.

Data is essential to celebrating a target successfully achieved and to responding constructively to a failure. But where noise is gifted the status of signal to confirm a fanciful business case, or to protect a heavily invested reputation, then the business is misled, costs increased, profits foregone and investors cheated.

Where employees believe that success and reward are being fudged, whether because of wishful thinking or lack of data skills, or mistakenly through lack of transparency, then cynicism and demotivation will breed virulently. Employees watch the behaviours of their seniors carefully as models of what will lead to their own advancement. Where it is deceit or innumeracy that succeed, that is what will thrive.

Noise and blame

Here is some data of the number of defects caused by production workers last month.

Worker Defects
Al 10
Simone 6
Jose 10
Gabriela 16
Stan 10

What is to be done about Gabriela? Move to an easier job? Perhaps retraining? Or should she be let go? And Simone? Promote to supervisor?

Well, the numbers were just random numbers that I generated. I didn’t add anything in to make Gabriela’s score higher and there was nothing in the way that I generated the data to suggest who would come top or bottom. The data are simply noise. They are the sort of thing that you might observe in a manufacturing plant that presented a “stable system of trouble”. Nothing in the data signals any behaviour, attitude, skill or diligence that Gabriela lacked or wrongly exercised. The next month’s data would likely show a different candidate for dismissal.
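The point can be checked roughly in a few lines of Python. This is a minimal sketch that treats the five counts as Poisson counts from one stable process, an assumption made purely for illustration since nothing above says how the random numbers were generated. Five observations are far too few to set limits in earnest, and the variable names are hypothetical.

```python
from math import sqrt

# Defect counts from the table above.
defects = {"Al": 10, "Simone": 6, "Jose": 10, "Gabriela": 16, "Stan": 10}

# ASSUMPTION: counts behave like Poisson counts from a single stable process,
# so the standard deviation is estimated as the square root of the mean.
mean = sum(defects.values()) / len(defects)
sigma = sqrt(mean)

# Rough 3-sigma limits; a count cannot fall below zero.
lower = max(0.0, mean - 3 * sigma)
upper = mean + 3 * sigma

# Workers whose counts would stand out from the noise under this assumption.
outside = [name for name, count in defects.items() if not lower <= count <= upper]

print(f"mean {mean:.1f}, limits {lower:.1f} to {upper:.1f}")
print("outside limits:", outside if outside else "nobody")
```

Under that assumption Gabriela’s 16 sits comfortably inside the limits, as does Simone’s 6.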

Mistaking noise for signal is, like mistaking signal for noise, the path to business underperformance and employee disillusionment. It has a particularly corrosive effect where used, as it might be in Gabriela’s case, to justify termination. The remaining staff will be bemused as to what Gabriela was actually doing wrong and start to attach myriad and irrational doubts to all sorts of things in the business. There may be a resort to magical thinking. The survivors will be less open and less willing to share problems with their supervisors. The business itself has the costs of recruitment to replace Gabriela. The saddest aspect of the whole business is the likelihood that Gabriela’s replacement will perform better than did Gabriela, vindicating the dismissal in the mind of her supervisor. This is the familiar statistical artefact of regression to the mean. An extreme event is likely to be followed by one less extreme. Again, Kahneman has collected sundry examples of managers so deceived by singular human performance and disappointed by its modest follow-up.
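Regression to the mean is easy to see in simulation. The sketch below is illustrative only. It assumes that every worker, including any replacement, produces defects from the same binomial process (100 opportunities, 10% chance of a defect each). Those numbers and the function names are invented for the example and describe no real plant.

```python
import random

def monthly_defects(rng, opportunities=100, p_defect=0.1):
    """One worker's defect count for a month. ASSUMED binomial, same for everyone."""
    return sum(rng.random() < p_defect for _ in range(opportunities))

def replacement_beats_worst(rng, workers=5):
    """Dismiss the worst of five workers, then see whether a new hire scores fewer defects."""
    month = [monthly_defects(rng) for _ in range(workers)]
    worst = max(month)
    replacement = monthly_defects(rng)  # drawn from the very same process
    return replacement < worst

rng = random.Random(2015)  # arbitrary seed so the run is repeatable
trials = 2000
wins = sum(replacement_beats_worst(rng) for _ in range(trials))
share = wins / trials
print(f"Replacement looked better than the dismissed worker in {share:.0%} of trials")
```

Nothing distinguishes the replacement from the worker dismissed, yet the replacement looks better most of the time.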

It was W Edwards Deming who observed that every time you recruit a new employee you take a random sample from the pool of job seekers. That’s why you get the regression to the mean. It must be true at Amazon too as their human resources executive Mr Tony Galbato explains their termination statistics by admitting that “We don’t always get it right.” Of course, everybody thinks that their recruitment procedures are better than average. That’s a management claim that could well do with rigorous testing by data.

Further, mistaking noise for signal brings the additional business expense of over-adjustment, spending money to add costly variation while degrading customer satisfaction. Nobody in the business feels good about that.

Target quality, data quality

I admitted above that the evidence we have about Amazon’s operations is not of the highest quality. I’m not in a position to judge what goes on at Amazon. But all should fix in their minds that setting targets demands rigorous risk assessment, analysis of perverse incentives and intense customer focus.

It is a sad reality that, if you set incentives perversely enough, some individuals will find ways of misreporting data. BNFL’s embarrassment with Kansai Electric and Steven Eaton’s criminal conviction were not isolated incidents.

One thing that especially bothered me about the Amazon report was the soi-disant Anytime Feedback Tool that allowed unsolicited anonymous peer appraisal. Apparently, this formed part of the “data” that determined individual advancement or termination. The description was unchallenged by Amazon’s spokesman (sic) Mr Craig Berman. I’m afraid, and I say this as a practising lawyer, unsourced and unchallenged “evidence” carries the spoor of the Star Chamber and the party purge. I would have thought that a pretty reliable method for generating unreliable data would be to maximise the personal incentives for distortion while protecting it from scrutiny or governance.

Kahneman observed that:

… we pay more attention to the content of messages than to information about their reliability, and as a result end up with a view of the world around us that is simpler and more coherent than the data justify.

It is the perverse confluence of fluctuations and individual psychology that makes statistical science essential, data analytics interesting and business, law and government difficult.

#executivetimeseries

Don Wheeler coined the term executive time series. I was just leaving court in Oxford the other day when I saw this announcement on a hoarding. I immediately thought to myself “#executivetimeseries”.

Wheeler introduced the phrase in his 2000 book Understanding Variation: The Key to Managing Chaos. He meant to criticise the habitual way that statistics are presented in business and government. A comparison is made between performance at two instants in time. Grave significance is attached as to whether performance is better or worse at the second instant. Well, it was always unlikely that it would be the same.

The executive time series has the following characteristics.

  • It is applied to some statistic, metric, Key Performance Indicator (KPI) or other measure that will be perceived as important by its audience.
  • Two time instants are chosen.
  • The statistic is quoted at each of the two instants.
  • If the latter is greater than the first then an increase is inferred. A decrease is inferred from the converse.
  • Great significance is attached to the increase or decrease.

Why is this bad?

At its best it provides incomplete information devoid of context. At its worst it is subject to gross manipulation. The following problems arise.

  • Though a signal is usually suggested there is inadequate information to infer this.
  • There is seldom explanation of how the time points were chosen. It is open to manipulation.
  • Data is presented absent its context.
  • There is no basis for predicting the future.

The Oxford billboard is even worse than the usual example because it doesn’t even attempt to tell us over what period the carbon reduction is being claimed.

Signal and noise

Let’s first think about noise. As Daniel Kahneman put it “A random event does not … lend itself to explanation, but collections of random events do behave in a highly regular fashion.” Noise is a collection of random events. Some people also call it common cause variation.

Imagine a bucket of thousands of beads. Of the beads, 80% are white and 20%, red. You are given a paddle that will hold 50 beads. Use the paddle to stir the beads then draw out 50 with the paddle. Count the red beads. Repeat this, let us say once a week, until you have 20 counts. The data might look something like this.

[Figure 1: number of red beads counted in each of the 20 weekly draws]

What we observe in Figure 1 is the irregular variation in the number of red beads. However, it is not totally unpredictable. In fact, it may be one of the most predictable things you have ever seen. Though we cannot forecast exactly how many red beads we will see in the coming week, it will most likely be in the rough range of 4 to 14 with rather more counts around 10 than at the extremities. The odd one below 4 or above 14 would not surprise you I think.

But nothing changed in the characteristics of the underlying process. It didn’t get better or worse. The percentage of reds in the bucket was constant. It is a stable system of trouble. And yet measured variation extended between 4 and 14 red beads. That is why an executive time series is so dangerous. It alleges change while the underlying cause-system is constant.

Figure 2 shows how an executive time series could be constructed in week 3.

[Figure 2: the red bead counts with an executive time series picked out between weeks 2 and 3]

The number of red beads has increased from 4 to 10, a 150% increase. Surely a “significant result”. And it will always be possible to find some managerial initiative between weeks 2 and 3 that can be invoked as the cause. “Between weeks 2 and 3 we changed the angle of inserting the paddle and it has increased the number of red beads by 150%.”

But Figure 2 is not the only executive time series that the data will support. In Figure 3 the manager can claim a 57% reduction from 14 to 6. More than the Oxford banner. Again, it will always be possible to find some factor or incident supposed to have caused the reduction. But nothing really changed.

[Figure 3: the red bead counts with an executive time series showing a reduction from 14 to 6]

The executive can be even more ambitious. “Between week 2 and 17 we achieved a 250% increase in red beads.” Now that cannot be dismissed as a mere statistical blip.

[Figure 4: the red bead counts with an executive time series picked out between weeks 2 and 17]
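The three headline claims are nothing more than percentage changes between two hand-picked counts. A minimal Python sketch of the arithmetic, using only the counts quoted above (the function name is hypothetical):

```python
def percent_change(first, second):
    """Percentage change from the first quoted value to the second."""
    return (second - first) / first * 100

# The three executive time series quoted in the text.
claims = {
    "weeks 2 to 3 (4 to 10)": percent_change(4, 10),
    "reduction (14 to 6)": percent_change(14, 6),
    "weeks 2 to 17 (4 to 14)": percent_change(4, 14),
}

for label, change in claims.items():
    print(f"{label}: {change:+.0f}%")
```

All three come from the same unchanging bucket of beads.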

#executivetimeseries

Data has no meaning apart from its context.

Walter Shewhart

Not everyone who cites an executive time series is seeking to deceive. But many are. So anybody who relies on an executive time series, devoid of context, invites suspicion that they are manipulating the message. This is Langian statistics par excellence. The fallacy of What you see is all there is. It is essential to treat all such claims with the utmost caution. What properly communicates the present reality of some measure is a plot against time that exposes its variation, its stability (or otherwise) and sets it in the time context of surrounding events.

We should call out the perpetrators. #executivetimeseries

Techie note

The data here is generated from a sequence of 20 binomial experiments, each comprising 50 independent Bernoulli trials with probability of “red” equal to 0.2.
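For anyone wanting to reproduce something similar, here is a minimal Python sketch of that recipe. The seed is arbitrary, so the counts will not match the figures above. The 3-sigma limits use the ordinary binomial standard deviation, and the search for the most flattering pair of weeks is added purely to show how an executive time series can be mined from noise.

```python
import random
from math import sqrt

BEADS_PER_PADDLE = 50
P_RED = 0.2
WEEKS = 20

rng = random.Random(1)  # arbitrary seed, not the one behind the original figures

# One count per week: the number of reds among 50 independent draws.
counts = [
    sum(rng.random() < P_RED for _ in range(BEADS_PER_PADDLE))
    for _ in range(WEEKS)
]

# What a stable process with these parameters is expected to deliver.
mean = BEADS_PER_PADDLE * P_RED
sigma = sqrt(BEADS_PER_PADDLE * P_RED * (1 - P_RED))
lower, upper = mean - 3 * sigma, mean + 3 * sigma

# The most flattering "increase" available: any earlier week against any later week.
best_change, start_week, end_week = max(
    ((later - earlier) / earlier * 100, i + 1, j + 1)
    for i, earlier in enumerate(counts)
    for j, later in enumerate(counts)
    if j > i and earlier > 0
)

print("weekly counts:", counts)
print(f"expected mean {mean:.0f}, 3-sigma limits {lower:.1f} to {upper:.1f}")
print(f"largest claimable change: {best_change:+.0f}% between weeks {start_week} and {end_week}")
```

The bucket never changes, yet some pair of weeks will almost always support a dramatic percentage.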

Does noise make you fat?

“A new study has unearthed some eye-opening facts about the effects of noise pollution on obesity,” proclaimed The Huffington Post recently in another piece of poor, uncritical data journalism.

Journalistic standards notwithstanding, in Exposure to traffic noise and markers of obesity (BMJ Occupational and environmental medicine, May 2015) Andrei Pyko and eight (sic) collaborators found “evidence of a link between traffic noise and metabolic outcomes, especially central obesity.” The particular conclusion picked up by the press was that each 5 dB increase in traffic noise could add 2 mm to the waistline.

Not trusting the press I decided I wanted to have a look at this research myself. I was fortunate that the paper was available for free download for a brief period after the press release. It took some finding though. The BMJ insists that you will now have to pay. I do find that objectionable as I see that the research was funded in part by the European Union. We European citizens have all paid once. Why should we have to pay again?

On reading …

I was, though, shocked on reading Pyko’s paper, something the Huffington Post journalists obviously hadn’t done. They state “Lack of sleep causes reduced energy levels, which can then lead to a more sedentary lifestyle and make residents less willing to exercise.” Pyko’s paper says no such thing. The researchers had, in particular, conditioned on level of exercise so that effect had been taken out. It cannot stand as an explanation of the results. Pyko’s narrative concerned noise-induced stress and cortisol production, not lack of exercise.

In any event, the paper is densely written and not at all easy to analyse and understand. I have tried to pick out the points that I found most bothering but first a statistics lesson.

Prediction 101

(Almost) the first thing to learn in statistics is the relationship between population, frame and sample. We are concerned about the population. The frame is the enumerable and accessible set of things that approximate the population. The sample is a subset of the frame, selected in an economic, systematic and well characterised manner.

In Some Theory of Sampling (1950), W Edwards Deming drew a distinction between two broad types of statistical studies, enumerative and analytic.

  • Enumerative: Action will be taken on the frame.
  • Analytic: Action will be on the cause-system that produced the frame.

It is explicit in Pyko’s work that the sampling frame was metropolitan Stockholm, Sweden between the years 2002 and 2006. It was a cross-sectional study. I take it from the institutional funding that the study intended to advise policy makers as to future health interventions. Concern was beyond the population of Stockholm, or even Sweden. This was an analytic study. It aspired to draw generalised lessons about the causal mechanisms whereby traffic noise aggravated obesity so as to support future society-wide health improvement.

How representative was the frame of global urban areas stretching over future decades? I have not the knowledge to make a judgment. The issue is mentioned in the paper but, I think, with insufficient weight.

There are further issues as to the sampling from the frame. Data was taken from participants in a pre-existing study into diabetes that had itself specific criteria for recruitment. These are set out in the paper but intensify the questions of whether the sample is representative of the population of interest.

The study

The researchers chose three measures of obesity, waist circumference, waist-hip ratio and BMI. Each has been put forwards, from time to time, as a measure of health risk.

There were 5,075 individual participants in the study, a sample of 5,075 observations. The researchers performed both a linear regression simpliciter and a logistic regression. For want of time and space I am only going to comment on the former. It is the origin of the headline 2 mm per 5 dB claim.

The researchers have quoted p-values but they haven’t committed the worst of sins as they have shown the size of the effects with confidence intervals. It’s not surprising that they found so many soi-disant significant effects given the sample size.

However, there was little assistance in judging how much of the observed variation in obesity was down to traffic noise. I would have liked to see a good old fashioned analysis of variance table. I could then at least have had a go at comparing variation from the measurement process, traffic noise and other effects. I could also have calculated myself an adjusted R².

Measurement Systems Analysis

Understanding variation from the measurement process is critical to any analysis. I have looked at the World Health Organisation’s definitive 2011 report on the effects of waist circumference on health. Such Measurement Systems Analysis as there is occurs at p7. They report a “technical error” (me neither) of 1.31 cm from intrameasurer error (I’m guessing repeatability) and 1.56 cm from intermeasurer error (I’m guessing reproducibility). They remark that “Even when the same protocol is used, there may be variability within and between measurers when more than one measurement is made.” They recommend further research but I have found none. There is no way of knowing from what is published by Pyko whether the reported effects are real or flow from confounding between traffic noise and intermeasurer variation.

When it comes to waist-hip ratio I presume that there are similar issues in measuring hip circumference. When the two dimensions are divided then the individual measurement uncertainties aggregate. More problems, not addressed.

Noise data

The key predictor of obesity was supposed to be noise. The noise data used were not in situ measurements in the participants’ respective homes. The road traffic noise data were themselves predicted from a mathematical model using “terrain data, ground surface, building height, traffic data, including 24 h yearly average traffic flow, diurnal distribution and speed limits, as well as information on noise barriers”. The model output provided 5 dB contours. The authors then applied some further ad hoc treatments to the data.

The authors recognise that there is likely to be some error in the actual noise levels, not least from the granularity. However, they then seem to assume that this is simply an errors in variables situation. That would do no more than (conservatively) bias any observed effect towards zero. However, it does seem to me that there is potential for much more structured systematic effects to be introduced here and I think this should have been explored further.
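The textbook errors-in-variables effect can be seen in a small simulation. This Python sketch uses wholly synthetic numbers in arbitrary units, not the study’s data, and it assumes the classical case where the error in the predictor is independent of everything else. That is exactly the assumption that a more structured, systematic error would break.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(42)  # arbitrary seed so the run is repeatable
n = 5000
true_slope = 0.4  # hypothetical effect size, chosen only for illustration

true_x = [rng.gauss(0, 1) for _ in range(n)]            # the real exposure
y = [true_slope * x + rng.gauss(0, 1) for x in true_x]  # the outcome
measured_x = [x + rng.gauss(0, 1) for x in true_x]      # exposure with independent error

slope_with_true_x = ols_slope(true_x, y)
slope_with_measured_x = ols_slope(measured_x, y)

print(f"slope using the true exposure:     {slope_with_true_x:.2f}")
print(f"slope using the measured exposure: {slope_with_measured_x:.2f}")
```

With error variance equal to the variance of the true exposure, the estimated slope shrinks to roughly half the true value. Error that is correlated with other features of a location need not behave so conveniently.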

Model criticism

The authors state that they carried out a residuals analysis but they give no details and there are no charts, even in the supplementary material. I would like to have had a look myself as the residuals are actually the interesting bit. Residuals analysis is essential in establishing stability.

In fact, in the current study there is so much data that I would have expected the authors to have saved some of the data for cross-validation. That would have provided some powerful material for model criticism and validation.

Given that this is an analytic study these are all very serious failings. With nine researchers on the job I would have expected some effort on these matters and some attention from whoever was the statistical referee.

Results

Separate results are presented for road, rail and air traffic noise. Again, for brevity I am looking at the headline 2 mm / 5 dB quoted for road traffic noise. Now, waist circumference is dependent on gross body size. Men are bigger than women and have larger waists. Similarly, the tall are larger-waisted than the short. Pyko’s regression does not condition on height (as a gross characterisation of body size).

BMI is a factor that attempts to allow for body size. Pyko found no significant influence on BMI from road traffic noise.

Waist-hip ratio is another parameter that attempts to allow for body size. It is often now cited as a better predictor of morbidity than BMI. That of course is irrelevant to the question of whether noise makes you fat. As far as I can tell from Pyko’s published results, a 5 dB increase in road traffic noise accounted for a 0.16 increase in waist-hip ratio. Now, let us look at this broadly. Consider a woman with waist circumference 85 cm, hip 100 cm, hence waist-hip ratio, 0.85. All pretty typical for the study. Predictively the study is suggesting that a 5 dB increase in road traffic noise might unremarkably take her waist-hip ratio up over 1.0. That seems barely consistent with the results from waist circumference alone where there would only be millimetres of growth. It is incredible physically.

I must certainly have misunderstood what the waist-hip result means but I could find no elucidation in Pyko’s paper.
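The mismatch is plain once the arithmetic is laid out. This Python sketch uses only the figures quoted above, plus one assumption: that hip circumference stays fixed when the waist grows by the headline 2 mm.

```python
waist_cm = 85.0
hip_cm = 100.0
ratio = waist_cm / hip_cm

# Reading the waist-hip result as quoted above: +0.16 per 5 dB.
reported_ratio_increase = 0.16
ratio_after_reported_increase = ratio + reported_ratio_increase

# What the waist circumference result alone would imply per 5 dB.
# ASSUMPTION: hip circumference is unchanged.
waist_increase_cm = 0.2  # the headline 2 mm
implied_ratio_increase = (waist_cm + waist_increase_cm) / hip_cm - ratio

print(f"starting ratio: {ratio:.2f}")
print(f"ratio after reported increase: {ratio_after_reported_increase:.2f}")
print(f"increase implied by 2 mm on the waist: {implied_ratio_increase:.3f}")
print(f"reported / implied: {reported_ratio_increase / implied_ratio_increase:.0f} times")
```

On that reading the two results differ by a factor of about 80, which is why a misreading of the waist-hip figure seems the likelier explanation.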

Policy

Research such as this has to be aimed at advising future interventions to control traffic noise in urban environments. Broadly speaking, 5 dB is a level of noise change that is noticeable to human hearing but no more. All the same, achieving such a reduction in an urban environment is something that requires considerable economic resources. Yet, taking the research at its highest, it only delivers 2 mm on the waistline.

I had many criticisms other than those above and I do not, in any event, consider this study adequate for making any prediction about a future intervention. Nothing in it makes me feel the subject deserves further study. Or that I need to avoid noise to stay slim.

Soccer management – signal, noise and contract negotiation

Some poor data journalism here from the BBC on 28 May 2015, concerning turnover in professional soccer managers in England. “Managerial sackings reach highest level for 13 years” says the headline. A classic executive time series. What is the significance of the 13 years? Other than it being the last year with more sackings than the present.

The data was purportedly from the League Managers’ Association (LMA) and their Richard Bevan thought the matter “very concerning”. The BBC provided a chart (fair use claimed).

[Chart: the BBC’s chart of managerial sackings by season, 2005/6 to 2014/15]

Now, I had a couple of thoughts as soon as I saw this. Firstly, why chart only back to 2005/6? More importantly, this looked to me like a stable system of trouble (for football managers) with the possible exception of this (2014/15) season’s Championship coach turnover. Personally, I detest multiple time series on a common chart unless there is a good reason for doing so. I do not think it the best way of showing variation and/or association.

Signal and noise

The first task of any analyst looking at data is to seek to separate signal from noise. Nate Silver made this point powerfully in his book The Signal and the Noise: The Art and Science of Prediction. As Don Wheeler put it: all data has noise; some data has signal.

Noise is typically the irregular aggregate of many causes. It is predictable in the same way as a roulette wheel. A signal is a sign of some underlying factor that has had so large an effect that it stands out from the noise. Signals can herald a fundamental unpredictability of future behaviour.

If we find a signal we look for a special cause. If we start assigning special causes to observations that are simply noise then, at best, we spend money and effort to no effect and, at worst, we aggravate the situation.

The Championship data

In any event, I wanted to look at the data for myself. I was most interested in the Championship data as that was where the BBC and LMA had been quick to find a signal. I looked on the LMA’s website and this is the latest data I found. The data only records dismissals up to 31 March of the 2014/15 season. There were 16. The data in the report gives the total number of dismissals for each preceding season back to 2005/6. The report separates out “dismissals” from “resignations” but does not say exactly how the classification was made. It can be ambiguous. A manager may well resign because he feels his club have themselves repudiated his contract, a situation known in England as constructive dismissal.

The BBC’s analysis included dismissals right up to the end of each season including 2014/15. Reading from the chart they had 20. The BBC have added some data for 2014/15 that isn’t in the LMA report and not given the source. I regard that as poor data journalism.

I found one source of further data at website The Sack Race. That told me that since the end of March there had been four terminations.

Manager          Club             Termination        Date
Malky Mackay     Wigan Athletic   Sacked             6 April
Lee Clark        Blackpool        Resigned           9 May
Neil Redfearn    Leeds United     Contract expired   20 May
Steve McClaren   Derby County     Sacked             25 May

As far as I can tell, “dismissals” include contract non-renewals and terminations by mutual consent. There are then a further three dismissals, not four. However, Clark left Blackpool amid some corporate chaos. That is certainly a termination that is classifiable either way. In any event, I have taken the BBC figure at face value though I am alerted as to some possible data quality issues here.

Signal and noise

Looking at the Championship data, this was the process behaviour chart, plotted as an individuals chart.

[Chart: individuals chart of Championship manager dismissals by season]

There is a clear signal for the 2014/15 season with an observation, 20 dismissals, above the upper natural process limit of 19.18 dismissals. Where there is a signal we should seek a special cause. There is no guarantee that we will find a special cause. Data limitations and bounded rationality are always constraints. In fact, there is no guarantee that there was a special cause. The signal could be a false positive. Such effects cannot be eliminated. However, signals efficiently direct our limited energy for, what Daniel Kahneman calls, System 2 thinking towards the most promising enquiries.
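For readers unfamiliar with the individuals chart, the natural process limits are conventionally set at the mean plus or minus 2.66 times the average moving range. The Python sketch below shows that calculation. The season counts in it are invented placeholders, deliberately not the LMA figures, and it does not reproduce the 19.18 limit quoted above.

```python
def natural_process_limits(values):
    """Individuals (XmR) chart limits: mean +/- 2.66 * average moving range."""
    mean = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    average_moving_range = sum(moving_ranges) / len(moving_ranges)
    half_width = 2.66 * average_moving_range
    return mean - half_width, mean, mean + half_width

# HYPOTHETICAL counts, made up purely to exercise the function.
made_up_counts = [5, 7, 6, 8, 5, 6, 7, 6, 12]

lower, centre, upper = natural_process_limits(made_up_counts)
signals = [x for x in made_up_counts if x < lower or x > upper]

print(f"centre line {centre:.2f}, limits {lower:.2f} to {upper:.2f}")
print("points outside the limits:", signals if signals else "none")
```

Any observation outside the limits is treated as a signal worth investigating. Everything inside is treated as noise until shown otherwise.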

Analysis

The BBC reports one narrative woven round the data.

Bevan said the current tenure of those employed in the second tier was about eight months. And the demand to reach the top flight, where a new record £5.14bn TV deal is set to begin in 2016, had led to clubs hitting the “panic button” too quickly.

It is certainly a plausible view. I compiled a list of the dismissals and non-renewals, not the resignations, with data from Wikipedia and The Sack Race. I only identified 17 which again suggests some data quality issue around classification. I have then charted a scatter plot of date of dismissal against the club’s then league position.

[Chart: scatter plot of date of dismissal against the club’s league position at the time, Championship 2014/15]

It certainly looks as though risk of relegation is the major driver for dismissal. Aside from that, Watford dismissed Billy McKinlay after only two games when they were third in the league, equal on points with the top two. McKinlay had been an emergency appointment after Oscar Garcia had been compelled to resign through ill health. Watford thought they had quickly found a better manager in Slavisa Jokanovic. Watford ended the season in second place and were promoted to the Premiership.

There were two dismissals after the final game on 2 May by disappointed mid-table teams. Beyond that, the only evidence for impulsive managerial changes in pursuit of promotion is the three mid-season, mid-table dismissals.

                                    Club league position
Manager         Club                On dismissal   At end of season
Nigel Adkins    Reading             16             19
Bob Peeters     Charlton Athletic   14             12
Stuart Pearce   Nottingham Forest   12             14

A table that speaks for itself. I am not impressed by the argument that there has been the sort of increase in panic sackings that Bevan fears. Both Blackpool and Leeds experienced chaotic executive management which will have resulted in an enhanced force of mortality on their respective coaches. That along with the data quality issues and the technical matter I have described below lead me to feel that there was no great enhanced threat to the typical Championship manager in 2014/15.

Next season I would expect some regression to the mean with a lower number of dismissals. Not much of a prediction really but that’s what the data tells me. If Bevan tries to attribute that to the LMA’s activism then I fear that he will be indulging in Langian statistical analysis. Will he be able to resist?

Techie bit

I have a preference for individuals charts but I did also try plotting the data on an np-chart where I found no signal. It is trite service-course statistics that a Poisson distribution with mean λ has standard deviation √λ so an upper 3-sigma limit for a (homogeneous) Poisson process with mean 11.1 dismissals would be 21.1 dismissals. Kahneman has cogently highlighted how people tend to see patterns in data as signals even where they are typical of mere noise. In this case I am aware that the data is not atypical of a Poisson process so I am unsurprised that I failed to identify a special cause.
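The 21.1 figure is simply the mean plus three times its square root. The Python sketch below reproduces it and, as an extra not in the analysis above, works out how likely a homogeneous Poisson process with that mean would be to deliver 20 or more dismissals in a season. The function name is hypothetical.

```python
from math import exp, sqrt

mean_dismissals = 11.1

# Upper 3-sigma limit for a Poisson count: mean + 3 * sqrt(mean).
upper_limit = mean_dismissals + 3 * sqrt(mean_dismissals)

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), found by summing P(X = 0) .. P(X = k - 1)."""
    term = exp(-lam)  # P(X = 0)
    cumulative = 0.0
    for i in range(k):
        cumulative += term
        term *= lam / (i + 1)  # step from P(X = i) to P(X = i + 1)
    return 1.0 - cumulative

p_twenty_or_more = poisson_tail(20, mean_dismissals)

print(f"upper 3-sigma limit: {upper_limit:.1f} dismissals")
print(f"P(20 or more dismissals in a season): {p_twenty_or_more:.4f}")
```

An observation of 20 falls below that limit, consistent with finding no signal on the np-chart.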

A Poisson process with mean 11.1 dismissals is a pretty good model going forwards and that is the basis I would press on any managers in contract negotiations.

Of course, the clubs should remember that when they look for a replacement manager they will then take a random sample from the pool of job seekers. Really!

Data and anecdote revisited – the case of the lime jellybean

I have already blogged about the question of whether data is the plural of anecdote. Then I recently came across the following problem in the late Richard Jeffrey’s marvellous little book Subjective Probability: The Real Thing (2004, Cambridge) and it struck me as a useful template for thinking about data and anecdotes.

The problem looks like a staple of elementary statistics practice exercises.

You are drawing a jellybean from a bag in which you know half the beans are green, all the lime flavoured ones are green and the green ones are equally divided between lime and mint flavours.

You draw a green bean. Before you taste it, what is the probability that it is lime flavoured?

A mathematically neat answer would be 50%. But what if, asked Jeffrey, when you drew the green bean you caught a whiff of mint? Or the bean was a particular shade of green that you had come to associate with “mint”. Would your probability still be 50%?

The given proportions of beans in the bag are our data. The whiff of mint or subtle colouration is the anecdote.

What use is the anecdote?

It would certainly be open to a participant in the bean problem to maintain the 50% probability derived from the data and ignore the inferential power of the anecdote. However, the anecdote is evidence that we have and, if we choose to ignore it simply because it is difficult to deal with, then we base our assessment of risk on a more restricted picture than that actually available to us.

The difficulty with the anecdote is that it does not lead to any compelling inference in the same way as do the mathematical proportions. It is easy to see how the bean proportions would give rise to a quite extensive consensus about the probability of “lime”. There would be more variety in individual responses to the anecdote, in what weight to give the evidence and in what it tended to imply.
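One way to picture that variety is through Bayes' rule. The sketch below is an illustration only: the 50% prior comes from the bean proportions, but the two observers and the weights they attach to the whiff of mint are invented for the example. Neither Jeffrey nor the problem supplies such numbers.

```python
from fractions import Fraction

def posterior_lime(prior_lime, p_whiff_given_lime, p_whiff_given_mint):
    """Probability of lime after smelling mint, by Bayes' rule."""
    prior_mint = 1 - prior_lime
    numerator = prior_lime * p_whiff_given_lime
    return numerator / (numerator + prior_mint * p_whiff_given_mint)

# The data: a green bean is equally likely to be lime or mint.
prior = Fraction(1, 2)

# The anecdote: hypothetical weights two observers might give the whiff.
confident = posterior_lime(prior, Fraction(1, 5), Fraction(3, 5))
sceptical = posterior_lime(prior, Fraction(2, 5), Fraction(3, 5))

print(confident)  # 1/4
print(sceptical)  # 2/5
```

Both observers start from the same 50% and both move towards "mint", but they end up in different places because they disagree about how telling the whiff is.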

That illustrates the tension between data and anecdote. Data tends to consensus. If there is disagreement as to its weight and relevance then the community is likely to divide into camps rather than exhibit a spectrum of views. Anecdote does not lead to such a consensus. Individuals interpret anecdotes in diverse ways and invest them with varying degrees of credence.

Yet, the person who is best at weighing and interpreting the anecdotal evidence has the advantage over the broad community who are in agreement about what the proportion data tells them. It will often be the discipline specialist who is in the best position to interpret an anecdote.

From anecdote to data

One of the things that the “mint” anecdote might do is encourage us to start collecting future data on what we smelled when a bean was drawn. A sequence of such observations, along with the actual “lime/mint” outcome, potentially provides a potent decision support mechanism for future draws. At this point the anecdote has been developed into data.
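As a rough sketch of what such record keeping might look like, the Python below tallies the observed proportion of lime beans for each smell noted at the draw. The records are made up purely for illustration, and the labels "minty" and "none" are my own assumptions about how the smell might be recorded.

```python
from collections import Counter

def lime_rate_by_smell(records):
    """Observed fraction of lime beans for each recorded smell.

    records is an iterable of (smell, flavour) pairs.
    """
    totals = Counter()
    limes = Counter()
    for smell, flavour in records:
        totals[smell] += 1
        if flavour == "lime":
            limes[smell] += 1
    return {smell: limes[smell] / totals[smell] for smell in totals}

# Invented observations: what we smelled, then what the bean turned out to be.
records = [
    ("minty", "mint"), ("minty", "mint"), ("minty", "lime"), ("minty", "mint"),
    ("none", "lime"), ("none", "mint"), ("none", "lime"),
]

rates = lime_rate_by_smell(records)
print(rates["minty"])  # 0.25
```

With only a handful of draws the proportions are of course unstable. The point is simply that once the smell is written down alongside the outcome, it has a defined generating process and can be examined like any other data.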

This may be a difficult process. The whiff of mint or subtle colouration could be difficult to articulate but recognising its significance (sic) is the beginning of operationalising and sharing.

Statistician John Tukey advocated the practice of exploratory data analysis (EDA) to identify such anecdotal evidence before settling on a premature model. As he observed:

The greatest value of a picture is when it forces us to notice what we never expected to see.

Of course, the person who was able to use the single anecdote on its own has the advantage over those who had to wait until they had compelling data. Data that they share with everybody else who has the same idea.

Data or anecdote

When I previously blogged about this I had trouble in coming to any definition that distinguished data and anecdote. Having reflected, I have a modest proposal. Data is the output of some reasonably well-defined process. Anecdote isn’t. It’s not clear how it was generated.

We are not told by what process the proportion of beans was established but I am willing to wager that it was some form of counting.

If we know the process generating evidence then we can examine its biases, non-responses, precision, stability, repeatability and reproducibility. Anecdote we cannot. It is because we can characterise the measurement process, through measurement systems analysis, that we can assess its reliability and make appropriate allowances and adjustments for its limitations. An assessment that most people will agree with most of the time. Because the most potent tools for assessing the reliability of evidence are absent in the case of anecdote, there are inherent difficulties in its interpretation and there will be a spectrum of attitudes from the community.

However, having had our interest pricked by the anecdote, we can set up a process to generate data.

Borrowing strength again

Using an anecdote as the basis for further data generation is one approach to turning anecdote into reliable knowledge. There is another way.

Today in the UK, a jury of 12 found nurse Victorino Chua, beyond reasonable doubt, guilty of poisoning 21 of his patients with insulin. Two died. There was no single compelling piece of evidence put before the jury. It was all largely circumstantial. The prosecution had sought to persuade the jury that those various items of circumstantial evidence reinforced each other and led to a compelling inference.

This is a common situation in litigation where there is no single conclusive piece of data but various pieces of circumstantial evidence that have to be put together. Where these reinforce, they borrow strength from each other.

Anecdotal evidence is not really the sort of evidence we want to have. But those who know how to use it are way ahead of those embarrassed by it.

Data is the plural of anecdote, either through repetition or through borrowing.