UK Election of June 2017 – Polling review

[Figure: Pollin2017Overview – Shewhart chart of the published polls for the June 2017 general election]

Here are all the published opinion polls for the June 2017 UK general election, plotted as a Shewhart chart.

The Conservative lead over Labour had been pretty constant at 16% from February 2017, after May’s Lancaster House speech. The initial Natural Process Limits (“NPLs”) on the chart extend back to that date. Then something odd happened in the polls around Easter. There were several polls above the upper NPL. That does not seem to fit with any surrounding event. Article 50 had been invoked two weeks before and had had no real immediate impact.
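For readers unfamiliar with how NPLs are computed, here is a minimal sketch of the usual individuals (XmR) chart calculation, using a few invented poll leads rather than the real series:

```python
import numpy as np

# Invented Conservative leads (%) from successive polls; not the real data.
leads = np.array([16, 15, 17, 16, 18, 14, 16, 17, 15, 16], dtype=float)

centre = leads.mean()                    # centre line of the individuals chart
moving_ranges = np.abs(np.diff(leads))   # absolute differences between consecutive polls
mr_bar = moving_ranges.mean()            # average moving range

# Wheeler's constant 2.66 converts the average moving range into approximate 3-sigma limits.
upper_npl = centre + 2.66 * mr_bar
lower_npl = centre - 2.66 * mr_bar

print(f"Centre {centre:.1f}%, NPLs {lower_npl:.1f}% to {upper_npl:.1f}%")
# Any poll falling outside the NPLs is a signal worth investigating.
```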

I suspect that the “fugue state” around Easter was reflected in the respective parties’ private polling. It is possible that public reaction to the election announcement somehow locked in the phenomenon for a short while.

Things then seemed to settle down to the 16% lead again. However, the local election results, sitting at the bottom of the range of the polls, ought to have sounded some alarm bells. Local election results are not a reliable predictor of general elections, but the data should not have felt very comforting.

Then the slide in lead begins. But when exactly? A lot of commentators have assumed that it was the badly received Conservative Party manifesto that started the decline. It is not possible to be definitive from the chart but it is certainly arguable that it was the leak of the Labour Party manifesto that started to shift voting intention.

Then the swing from Conservative to Labour continued unabated to polling day.

Polling performance

How did the individual pollsters fare? I have, somewhat arbitrarily, summarised all polls conducted in the 10 days before the election (29 May to 7 June). Here is the plot, along with the actual popular vote, which gave the Conservatives a 2.5% margin over Labour. That is the number that everybody was trying to predict.

[Figure: PollsterPerformance – polls from the final ten days plotted against the actual result]

The red points are the surveys from the 5 days before the election (3 to 7 June). Visually, they seem to be no closer, in general, than the other points (6 to 10 days before). The vertical lines are just an aid for the eye in grouping the points. The absence of “closing in” is confirmed by looking at the mean squared error (MSE) (in %²) for the points over 10 days (31.1) and 5 days (34.8). There is no evidence of polls closing in on the final result. The overall Shewhart chart certainly doesn’t suggest that.
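For the avoidance of doubt, the MSE here is simply the average of the squared differences between each poll’s Conservative lead and the eventual 2.5% margin. A minimal sketch with invented leads (the real figures follow in the table below):

```python
import numpy as np

actual_lead = 2.5  # Conservative lead over Labour in the popular vote (%)

# Hypothetical final-period poll leads for two pollsters; not the published figures.
polls = {
    "Pollster A": [4.0, 3.0, 1.0],
    "Pollster B": [12.0, 9.0, 11.0],
}

for name, leads in polls.items():
    errors = np.array(leads) - actual_lead
    mse = np.mean(errors ** 2)   # mean squared error, in %^2
    print(f"{name}: MSE = {mse:.2f}")
```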

Taking the polls over the 10-day period, then, here is the performance of the pollsters in terms of MSE. Lower MSE is better.

Pollster MSE
Norstat 2.25
Survation 2.31
Kantar Public 6.25
Survey Monkey 8.25
YouGov 9.03
Opinium 16.50
Qriously 20.25
Ipsos MORI 20.50
Panelbase 30.25
ORB 42.25
ComRes 74.25
ICM 78.36
BMG 110.25

The pollsters at Norstat and Survation will have been enjoying their bonuses on the morning after the election. There were a few other commendable performances.

YouGov model

I should also mention the YouGov model (the green line on the Shewhart chart) that has an MSE of 2.25. YouGov conduct web-based surveys against a huge database of around 50,000 registered participants. They also collect, with permission, deep demographic data on those individuals concerning income, profession, education and other factors. There is enough published demographic data from the national census to judge whether that is a representative frame from which to sample.

YouGov did not simply poll and publish the raw, or even adjusted, voting intention. They used their poll to construct a model, perhaps a logistic regression or an artificial neural network, they don’t say, to predict voting intention from demographic factors. They then fed into that model not their own panel’s demographic data but demographic data from the national census. That then gave their published forecast. I have to say that this looks about the best possible method for eliminating sampling-frame effects.
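I do not know the details of YouGov’s model, but the general idea, fit a model of intention on the panel’s demographics and then apply it to census cell counts, can be sketched roughly as follows. The variable names and numbers are all invented:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy panel data: demographics plus stated intention (1 = Conservative, 0 = other).
panel = pd.DataFrame({
    "age_band":  [1, 2, 3, 3, 2, 1, 3, 2],
    "graduate":  [0, 1, 1, 0, 0, 1, 0, 1],
    "intention": [0, 0, 1, 1, 0, 0, 1, 1],
})

model = LogisticRegression().fit(panel[["age_band", "graduate"]], panel["intention"])

# Census cells: the same demographic categories with their population counts.
census = pd.DataFrame({
    "age_band": [1, 1, 2, 2, 3, 3],
    "graduate": [0, 1, 0, 1, 0, 1],
    "count":    [9000, 3000, 8000, 4000, 7000, 5000],
})

# Predict a propensity per cell, then weight by the census counts rather than the panel.
census["p_con"] = model.predict_proba(census[["age_band", "graduate"]])[:, 1]
forecast = np.average(census["p_con"], weights=census["count"])
print(f"Poststratified Conservative share: {forecast:.1%}")
```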

It remains to be seen how widely this approach is adopted next time.


Why did the polls get it wrong?

This week has seen much soul-searching by the UK polling industry over their performance leading up to the 2015 UK general election on 7 May. The polls had seemed to predict that the Conservative and Labour Parties were neck and neck on the popular vote. In the actual election, the Conservatives polled 37.8% to Labour’s 31.2%, leading to a working majority in the House of Commons once the votes were divided among the seats contested. I can assure my readers that it was a shock result. Over breakfast on 7 May I told my wife that the probability of a Conservative majority in the House was nil. I hold my hands up.

An enquiry was set up by the industry led by the National Centre for Research Methods (NCRM). They presented their preliminary findings on 19 January 2016. The principal conclusion was that the failure to predict the voting share was because of biases in the way that the data were sampled and inadequate methods for correcting for those biases. I’m not so sure.

Population → Frame → Sample

The first thing students learn when studying statistics is the critical importance, and practical means, of specifying a sampling frame. If the sampling frame is not representative of the population of concern then simply collecting more and more data will not yield a prediction of greater accuracy. The errors associated with the specification of the frame are inherent to the sampling method. Creating a representative frame is very hard in opinion polling because of the difficulty in contacting particular individuals efficiently. It turns out that Conservative voters are harder than Labour voters to get hold of, so that they can be questioned. The NCRM study concluded that, within the commercial constraints of an opinion poll, there was a lower probability that a Conservative voter would be contacted. They therefore tended to be under-represented in the data causing a substantial bias towards Labour.

This is a well-known problem in polling practice and there are demographic factors that can be used to make a statistical adjustment. Samples can be stratified. NCRM concluded that, in the run-up to the 2015 election, there were important biases tending to understate the Conservative vote, and that the existing correction factors were inadequate. Fresh sampling strategies were needed to eradicate the bias and improve prediction. There are understandable fears that this will make polling more costly. More calls will be needed to catch Conservatives at home.

Of course, that all sounds an eminently believable narrative. These sorts of sampling frame biases are familiar but enormously troublesome for pollsters. However, I wanted to look at the data myself.

Plot data in time order

That is the starting point of all statistical analysis. Polls continued after the election, though with lesser frequency. I wanted to look at that data after the election in addition to the pre-election data. Here is a plot of poll results against time for Conservative and Labour. I have used data from 25 January to the end of 2015 [1, 2]. I have not managed to jitter the points so there is some overprinting of Conservative by Labour pre-election.
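For anyone wanting to reproduce this kind of plot, here is a minimal matplotlib sketch with invented poll values, including the horizontal jitter I failed to apply:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Invented daily poll shares either side of an election on day 100; not the real series.
days = np.arange(200)
con = np.where(days < 100, 33.5, 38.5) + rng.normal(0, 1.5, days.size)
lab = np.where(days < 100, 33.5, 31.0) + rng.normal(0, 1.5, days.size)

jitter = rng.uniform(-0.3, 0.3, days.size)      # small horizontal offset to reduce overprinting
plt.scatter(days + jitter, con, s=12, color="blue", label="Conservative")
plt.scatter(days + jitter, lab, s=12, color="red", label="Labour")
plt.axvline(100, linestyle="--", color="grey")  # election day
plt.xlabel("Days from start of series")
plt.ylabel("Poll share (%)")
plt.legend()
plt.show()
```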

[Figure: Polling201501 – poll results against time for Conservative and Labour, 2015]

Now that is an arresting plot. Yet again plotting against time elucidates the cause system. Something happened on the date of the election. Before the election the polls had the two parties neck and neck. The instant (sic) the election was done there was clear red/blue water between the parties. Applying my (very moderate) level of domain knowledge to the pre-election data, the poll results look stable and predictable. There is a shift after the election to a new datum that remains stable and predictable. The respective arithmetic means are given below.

Party Mean poll before Actual result Mean poll after
Conservative 33.3% 37.8% 38.8%
Labour 33.5% 31.2% 30.9%

The mean of the post-election polls is doing fairly well but is markedly different from the pre-election results. Now, it is trite statistics that the variation we observe on a chart is the aggregate of variation from two sources.

  • Variation from the thing of interest; and
  • Variation from the measurement process.

As far as I can gather, the sampling methods used by the polling companies have not so far been modified. They were awaiting the NCRM report. They certainly weren’t modified in the few days following the election. The abrupt change on 7 May cannot be because of corrected sampling methods. The misleading pre-election data and the “impressive” post-election polls were derived from common sampling practices. It seems to me difficult to reconcile NCRM’s narrative to the historical data. The shift in the data certainly needs explanation within that account.

What did change on the election date was that a distant intention turned into the recall of a past action. What everyone wants to know in advance is the result of the election. Unsurprisingly, and as we generally find, it is not possible to sample the future. Pollsters, and their clients, have to be content with individuals’ perceptions of how they will vote. The vast majority of people pay very little attention to politics at all and the general level of interest outside election time is de minimis. Standing in a polling booth with a ballot paper is a very different matter from being asked about intentions some days, weeks or months hence. Most people take voting very seriously. It is not obvious that the same diligence is directed towards answering pollsters’ questions.

Perhaps the problems aren’t statistical at all and are more concerned with what psychologists call affective forecasting, predicting how we will feel and behave under future circumstances. Individuals are notoriously susceptible to all sorts of biases and inconsistencies in such forecasts. It must at least be a plausible source of error that intentions are only imperfectly formed in advance and mapping into votes is not straightforward. Is it possible that after the election respondents, once again disengaged from politics, simply recalled how they had voted in May? That would explain the good alignment with actual election results.

Imperfect foresight of voting intention before the election and 20/20 hindsight after is, I think, a narrative that sits well with the data. There is no reason whatever why internal reflections in the Cartesian theatre of future voting should be an unbiased predictor of actual votes. In fact, I think it would be a surprise, and one demanding explanation, if they were so.

The NCRM report does make some limited reference to post-election re-interviews of contacts. However, this is presented in the context of a possible “late swing” rather than affective forecasting. There are no conclusions I can use.

Meta-analysis

The UK polls took a horrible beating when they signally failed to predict the result of the 1992 election, under-estimating the Conservative lead by around 8% [3]. Things then felt better. The 1997 election was happier, where Labour led by 13% at the election with final polls in the range of 10 to 18% [4]. In 2001 each poll managed to get the Conservative vote within 3% but all over-estimated the Labour vote, some pollsters by as much as 5% [5]. In 2005, the final poll had Labour on 38% and Conservative on 33%; the popular vote was Labour 36.2% and Conservative 33.2% [6]. In 2010 the final poll had Labour on 29% and Conservative on 36%, with a popular vote of 29.7%/36.9% [7]. The debacle of 1992 was all but forgotten until 2015 arrived, to the pundits’ dismay.

Given the history and given the inherent difficulties of sampling and affective forecasting, I’m not sure why we are so surprised when the polls get it wrong. Unfortunately for the election strategist they are all we have. That is a common theme with real world data. Because of its imperfections it has to be interpreted within the context of other sources of evidence rather than followed slavishly. The objective is not to be driven by data but to be led by the insights it yields.

References

  1. Opinion polling for the 2015 United Kingdom general election. (2016, January 19). In Wikipedia, The Free Encyclopedia. Retrieved 22:57, January 20, 2016, from https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_2015_United_Kingdom_general_election&oldid=700601063
  2. Opinion polling for the next United Kingdom general election. (2016, January 18). In Wikipedia, The Free Encyclopedia. Retrieved 22:55, January 20, 2016, from https://en.wikipedia.org/w/index.php?title=Opinion_polling_for_the_next_United_Kingdom_general_election&oldid=700453899
  3. Butler, D & Kavanagh, D (1992) The British General Election of 1992, Macmillan, Chapter 7
  4. — (1997) The British General Election of 1997, Macmillan, Chapter 7
  5. — (2002) The British General Election of 2001, Palgrave-Macmillan, Chapter 7
  6. Kavanagh, D & Butler, D (2005) The British General Election of 2005, Palgrave-Macmillan, Chapter 7
  7. Cowley, P & Kavanagh, D (2010) The British General Election of 2010, Palgrave-Macmillan, Chapter 7

Data science sold down the Amazon? Jeff Bezos and the culture of rigour

This blog appeared on the Royal Statistical Society website Statslife on 25 August 2015

This recent item in the New York Times has catalysed discussion among managers. The article tells of Amazon’s founder, Jeff Bezos, and his pursuit of rigorous data-driven management. It also tells employees’ own negative stories of how that felt emotionally.

The New York Times says that Amazon is pervaded with abundant data streams that are used to judge individual human performance and which drive reward and advancement. They inform termination decisions too.

The recollections of former employees are not the best source of evidence about how a company conducts its business. Amazon’s share of the retail market is impressive and they must be doing something right. What everybody else wants to know is, what is it? Amazon are very coy about how they operate and there is a danger that the business world at large takes the wrong messages.

Targets

Targets are essential to business. The marketing director predicts that his new advertising campaign will create demand for 12,000 units next year. The operations director looks at her historical production data. She concludes that the process lacks the capability reliably to produce those volumes. She estimates the budget required to upgrade the process and to achieve 12,000 units annually. The executive board considers the business case and signs off the investment. Both marketing and operations directors now have a target.

Targets communicate improvement priorities. They build confidence between interfacing processes. They provide constraints and parameters that prevent the system causing harm. Harm to others or harm to itself. They allow the pace and substance of multiple business processes, and diverse entities, to be matched and aligned.

But everyone who has worked in business sees it as less simple than that. The marketing and operations directors are people.

Signal and noise

Drawing conclusions from data might be an uncontroversial matter were it not for the most common feature of data, fluctuation. Call it variation if you prefer. Business measures do not stand still. Every month, week, day and hour is different. All data features noise. Sometimes it goes up, sometimes down. A whole ecology of occult causes, weakly characterised, unknown and as yet unsuspected, interact to cause irregular variation. They are what cause a coin variously to fall “heads” or “tails”. That variation may often be stable enough, or if you like “exchangeable“, so as to allow statistical predictions to be made, as in the case of the coin toss.

If all data features noise then some data features signals. A signal is a sign, an indicator that some palpable cause has made the data stand out from the background noise. It is that assignable cause which enables inferences to be drawn about what interventions in the business process have had a tangible effect and what future innovations might cement any gains or lead to bigger prospective wins. Signal and noise lead to wholly different business strategies.

The relevance for business is that people, where not exposed to rigorous decision support, are really bad at telling the difference between signal and noise. Nobel laureate economist and psychologist Daniel Kahneman has amassed a lifetime of experimental and anecdotal data capturing noise misinterpreted as signal and judgments in the face of compelling data, distorted by emotional and contextual distractions.

Signal and accountability

It is a familiar trope of business, and government, that extravagant promises are made, impressive business cases set out and targets signed off. Yet the ultimate scrutiny as to whether that envisaged performance was realised often lacks rigour. Noise, with its irregular ups and downs, allows those seeking solace from failure to pick out select data points and cast self-serving narratives on the evidence.

Our hypothetical marketing director may fail to achieve his target but recount how there were two individual months where sales exceeded 1,000, construct elaborate rationales as to why only they are representative of his efforts and point to purported external factors that frustrated the remaining ten months. Pairs of individual data points can always be selected to support any story, Don Wheeler’s classic “executive time series”.

This is where the ability to distinguish signal and noise is critical. To establish whether targets have been achieved requires crisp definition of business measures, not only outcomes but also the leading indicators that provide context and advise judgment as to prediction reliability. Distinguishing signal and noise requires transparent reporting that allows diverse streams of data criticism. It requires a rigorous approach to characterising noise and a systematic approach not only to identifying signals but to reacting to them in an agile and sustainable manner.

Data is essential to celebrating a target successfully achieved and to responding constructively to a failure. But where noise is gifted the status of signal to confirm a fanciful business case, or to protect a heavily invested reputation, then the business is misled, costs increased, profits foregone and investors cheated.

Where employees believe that success and reward are being fudged, whether because of wishful thinking or lack of data skills, or believe it mistakenly because of a lack of transparency, then cynicism and demotivation will breed virulently. Employees watch the behaviours of their seniors carefully as models of what will lead to their own advancement. Where it is deceit or innumeracy that succeeds, that is what will thrive.

Noise and blame

Here is some data on the number of defects caused by production workers last month.

Worker Defects
Al 10
Simone 6
Jose 10
Gabriela 16
Stan 10

What is to be done about Gabriela? Move to an easier job? Perhaps retraining? Or should she be let go? And Simone? Promote to supervisor?

Well, the numbers were just random numbers that I generated. I didn’t add anything in to make Gabriela’s score higher and there was nothing in the way that I generated the data to suggest who would come top or bottom. The data are simply noise. They are the sort of thing that you might observe in a manufacturing plant that presented a “stable system of trouble”. Nothing in the data signals any behaviour, attitude, skill or diligence that Gabriela lacked or wrongly exercised. The next month’s data would likely show a different candidate for dismissal.
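Something like the following reproduces the exercise; I do not claim it is exactly how the numbers above were drawn, but every worker’s count comes from the same Poisson process, so any league table is pure noise:

```python
import numpy as np

rng = np.random.default_rng()  # no seed: a different "worst performer" each run

workers = ["Al", "Simone", "Jose", "Gabriela", "Stan"]
# Every worker draws from the same process: common-cause variation only.
defects = rng.poisson(lam=10, size=len(workers))

for name, count in sorted(zip(workers, defects), key=lambda x: x[1]):
    print(f"{name}: {count}")
# Re-run the script and the ranking reshuffles, though nothing about the workers has changed.
```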

Mistaking signal for noise is, like mistaking noise for signal, the path to business underperformance and employee disillusionment. It has a particularly corrosive effect where used, as it might be in Gabriela’s case, to justify termination. The remaining staff will be bemused as to what Gabriela was actually doing wrong and start to attach myriad and irrational doubts to all sorts of things in the business. There may be a resort to magical thinking. The survivors will be less open and less willing to share problems with their supervisors. The business itself has the costs of recruitment to replace Gabriela. The saddest aspect of the whole business is the likelihood that Gabriela’s replacement will perform better than did Gabriela, vindicating the dismissal in the mind of her supervisor. This is the familiar statistical artefact of regression to the mean. An extreme event is likely to be followed by one less extreme. Again, Kahneman has collected sundry examples of managers so deceived by singular human performance and disappointed by its modest follow-up.
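Regression to the mean is easy to demonstrate by simulation: pick the worst performer in one month of pure noise and their next month will usually look better, though nothing real has changed. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n_sims, n_workers = 10_000, 5
improved = 0

for _ in range(n_sims):
    month1 = rng.poisson(lam=10, size=n_workers)
    month2 = rng.poisson(lam=10, size=n_workers)   # same process, nothing has changed
    worst = month1.argmax()                        # the "Gabriela" of month 1
    if month2[worst] < month1[worst]:
        improved += 1

print(f"Worst performer 'improved' next month in {improved / n_sims:.0%} of simulations")
```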

It was W Edwards Deming who observed that every time you recruit a new employee you take a random sample from the pool of job seekers. That’s why you get the regression to the mean. It must be true at Amazon too as their human resources executive Mr Tony Galbato explains their termination statistics by admitting that “We don’t always get it right.” Of course, everybody thinks that their recruitment procedures are better than average. That’s a management claim that could well do with rigorous testing by data.

Further, mistaking noise for signal brings the additional business expense of over-adjustment, spending money to add costly variation while degrading customer satisfaction. Nobody in the business feels good about that.

Target quality, data quality

I admitted above that the evidence we have about Amazon’s operations is not of the highest quality. I’m not in a position to judge what goes on at Amazon. But all should fix in their minds that setting targets demands rigorous risk assessment, analysis of perverse incentives and intense customer focus.

It is a sad reality that, if you set incentives perversely enough, some individuals will find ways of misreporting data. BNFL’s embarrassment with Kansai Electric and Steven Eaton’s criminal conviction were not isolated incidents.

One thing that especially bothered me about the Amazon report was the soi-disant Anytime Feedback Tool that allowed unsolicited anonymous peer appraisal. Apparently, this formed part of the “data” that determined individual advancement or termination. The description was unchallenged by Amazon’s spokesman (sic) Mr Craig Berman. I’m afraid, and I say this as a practising lawyer, unsourced and unchallenged “evidence” carries the spoor of the Star Chamber and the party purge. I would have thought that a pretty reliable method for generating unreliable data would be to maximise the personal incentives for distortion while protecting it from scrutiny or governance.

Kahneman observed that:

… we pay more attention to the content of messages than to information about their reliability, and as a result end up with a view of the world around us that is simpler and more coherent than the data justify.

It is the perverse confluence of fluctuations and individual psychology that makes statistical science essential, data analytics interesting and business, law and government difficult.

Productivity and how to improve it: I – The foundational narrative

Again, much talk in the UK media recently about weak productivity statistics. Chancellor of the Exchequer (Finance Minister) George Osborne has launched a 15-point macroeconomic strategy aimed at improving national productivity. Some of the points are aimed at incentivising investment and training. There will be few who argue against that, though I shall come back to the investment issue when I come to talk about signal and noise. I have already discussed training here. In any event, the strategy is fine as far as these things go. Which is not very far.

There remains the microeconomic task for all of us of actually improving our own productivity and that of the systems we manage. That is not the job of government.

Neither can I offer any generalised system for improving productivity. It will always be industry and organisation dependent. However, I wanted to write about some of the things that you have to understand if your efforts to improve output are going to be successful and sustainable.

  • Customer value and waste.
  • The difference between signal and noise.
  • How to recognise flow and manage a constraint.

Before going on to those in future weeks I first wanted to go back and look at what has become the foundational narrative of productivity improvement, the Hawthorne experiments. They still offer some surprising insights.

The Hawthorne experiments

In 1923, the US electrical engineering industry was looking to increase the adoption of electric lighting in American factories. Uptake had been disappointing despite the claims being made for increased productivity.

[Tests in nine companies have shown that] raising the average initial illumination from about 2.3 to 11.2 foot-candles resulted in an increase in production of more than 15%, at an additional cost of only 1.9% of the payroll.

Earl A Anderson
General Electric
Electrical World (1923)

E P Hyde, director of research at GE’s National Lamp Works, lobbied government for the establishment of a Committee on Industrial Lighting (“the CIL”) to co-ordinate marketing-oriented research. Western Electric volunteered to host tests at their Hawthorne Works in Cicero, IL.

Western Electric came up with a study design that comprised a team of experienced workers assembling relays, winding their coils and inspecting them. Tests commenced in November 1924 with active support from an elite group of academic and industrial engineers including the young Vannevar Bush, who would himself go on to an eminent career in government and science policy. Thomas Edison became honorary chairman of the CIL.

It’s a tantalising historical fact that Walter Shewhart was employed at the Hawthorne Works at the time but I have never seen anything suggesting his involvement in the experiments, nor that of his mentor George D Edwards, nor of his protégé Joseph Juran. In later life, Juran was dismissive of the personal impact that Shewhart had had on operations there.

However, initial results showed no influence of light level on productivity at all. Productivity rose throughout the test but was wholly uncorrelated with lighting level. Theories about the impact of human factors such as supervision and motivation started to proliferate.

A further schedule of tests was programmed starting in September 1926. Now, the lighting level was to be reduced to near darkness so that the threshold of effective work could be identified. Here is the summary data (from Richard Gillespie Manufacturing Knowledge: A History of the Hawthorne Experiments, Cambridge, 1991).

[Figure: Hawthorne data-1 – summary data from the illumination experiments]

It requires no sophisticated statistical analysis to see that the data is all noise and no signal. Much to the disappointment of the CIL, and the industry, there was no evidence that illumination made any difference at all, even down to conditions of near darkness. It’s striking that the highest lighting levels embraced the full range of variation in productivity from the lowest to the highest. What had seemed so self-evidently a boon to productivity was purely incidental. It is never safe to assume that a change will be an improvement. As W Edwards Deming insisted, “In God we trust. All others bring data.”

But the data still seemed to show a relentless improvement of productivity over time. The participants were all very experienced in the task at the start of the study so there should have been no learning by doing. There seemed no other explanation than that the participants were somehow subliminally motivated by the experimental setting. Or something.

[Figure: Hawthorne data-2 – productivity over the course of the experiments]

That subliminally motivated increase in productivity came to be known as the Hawthorne effect. Attempts to explain it led to the development of whole fields of investigation and organisational theory, by Elton Mayo and others. It really was the foundation of the management consulting industry. Gillespie (supra) gives a rich and intriguing account.

A revisionist narrative

Because of the “failure” of the experiments’ purpose there was a falling off of interest and only the above summary results were ever published. The raw data were believed destroyed. Now “you know, at least you ought to know, for I have often told you so” about Shewhart’s two rules for data presentation.

  1. Data should always be presented in such a way as to preserve the evidence in the data for all the predictions that might be made from the data.
  2. Whenever an average, range or histogram is used to summarise observations, the summary must not mislead the user into taking any action that the user would not take if the data were presented in context.

The lack of any systematic investigation of the raw data led to the development of a discipline myth that every single experimental adjustment had led forthwith to an increase in productivity.

In 2009, Steven Levitt, best known to the public as the co-author of Freakonomics, along with John List and their research team, miraculously discovered a microfiche of the raw study data at a “small library in Milwaukee, WI” and the remainder in Boston, MA. They went on to analyse the data from scratch (Was there Really a Hawthorne Effect at the Hawthorne Plant? An Analysis of the Original Illumination Experiments, National Bureau of Economic Research, Working Paper 15016, 2009).

[Figure: LevittHawthonePlot – Figure 3 from Levitt and List, raw productivity measurements for each experiment]

Figure 3 of Levitt and List’s paper (reproduced above) shows the raw productivity measurements for each of the experiments. Levitt and List show how a simple plot such as this reveals important insights into how the experiments developed. It is a plot that yields a lot of information.

Levitt and List note that, in the first phase of experiments, productivity rose then fell when experiments were suspended. They speculate as to whether there was a seasonal effect with lower summer productivity.

The second period of experiments is that between the third and fourth vertical lines in the figure. Only room 1 experienced experimental variation in this period yet Levitt and List contend that productivity increased in all three rooms, falling again at the end of experimentation.

During the final period, data was only collected from room 1 where productivity continued to rise, even beyond the end of the experiment. Looking at the data overall, Levitt and List find some evidence that productivity responded more to changes in artificial light than to natural light. The evidence that increases in productivity were associated with every single experimental adjustment is weak. To this day, there is no compelling explanation of the increases in productivity.

Lessons in productivity improvement

Deming used to talk of “disappointment in great ideas”, the propensity for things that looked so good on paper simply to fail to deliver the anticipated benefits. Nobel laureate psychologist Daniel Kahneman warns against our individual bounded rationality.

To guard against entrapment by the vanity of imagination we need measurement and data to answer the ineluctable question of whether the change we implemented so passionately resulted in improvement. To be able to answer that question demands the separation of signal from noise. That requires trenchant data criticism.

And even then, some factors may yet be beyond our current knowledge. Bounded rationality again. That is why the trick of continual improvement in productivity is to use the rigorous criticism of historical data to build collective knowledge incrementally.

If you torture the data enough, nature will always confess.

Ronald Coase

Eventually.

Does noise make you fat?

“A new study has unearthed some eye-opening facts about the effects of noise pollution on obesity,” proclaimed The Huffington Post recently in another piece of uncritical data journalism.

Journalistic standards notwithstanding, in Exposure to traffic noise and markers of obesity (BMJ Occupational and environmental medicine, May 2015) Andrei Pyko and eight (sic) collaborators found “evidence of a link between traffic noise and metabolic outcomes, especially central obesity.” The particular conclusion picked up by the press was that each 5 dB increase in traffic noise could add 2 mm to the waistline.

Not trusting the press, I decided I wanted to have a look at this research myself. I was fortunate that the paper was available for free download for a brief period after the press release. It took some finding though. The BMJ insists that you will now have to pay. I do find that objectionable as I see that the research was funded in part by the European Union. We European citizens have all paid once. Why should we have to pay again?

On reading …

I was, though, shocked on reading Pyko’s paper, as the Huffington Post journalists obviously had not. They state “Lack of sleep causes reduced energy levels, which can then lead to a more sedentary lifestyle and make residents less willing to exercise.” Pyko’s paper says no such thing. The researchers had, in particular, conditioned on level of exercise so that effect had been taken out. It cannot stand as an explanation of the results. Pyko’s narrative concerned noise-induced stress and cortisol production, not lack of exercise.

In any event, the paper is densely written and not at all easy to analyse and understand. I have tried to pick out the points that I found most bothering but first a statistics lesson.

Prediction 101

(Almost) the first thing to learn in statistics is the relationship between population, frame and sample. We are concerned about the population. The frame is the enumerable and accessible set of things that approximate the population. The sample is a subset of the frame, selected in an economic, systematic and well characterised manner.

In Some Theory of Sampling (1950), W Edwards Deming drew a distinction between two broad types of statistical studies, enumerative and analytic.

  • Enumerative: Action will be taken on the frame.
  • Analytic: Action will be on the cause-system that produced the frame.

It is explicit in Pyko’s work that the sampling frame was metropolitan Stockholm, Sweden between the years 2002 and 2006. It was a cross-sectional study. I take it from the institutional funding that the study intended to advise policy makers as to future health interventions. Concern was beyond the population of Stockholm, or even Sweden. This was an analytic study. It aspired to draw generalised lessons about the causal mechanisms whereby traffic noise aggravated obesity so as to support future society-wide health improvement.

How representative was the frame of global urban areas stretching over future decades? I have not the knowledge to make a judgment. The issue is mentioned in the paper but, I think, with insufficient weight.

There are further issues as to the sampling from the frame. Data was taken from participants in a pre-existing study into diabetes that had itself specific criteria for recruitment. These are set out in the paper but intensify the questions of whether the sample is representative of the population of interest.

The study

The researchers chose three measures of obesity: waist circumference, waist-hip ratio and BMI. Each has been put forward, from time to time, as a measure of health risk.

There were 5,075 individual participants in the study, a sample of 5,075 observations. The researchers performed both a linear regression simpliciter and a logistic regression. For want of time and space I am only going to comment on the former. It is the origin of the headline 2 mm per 5 dB claim.

The researchers have quoted p-values but they haven’t committed the worst of sins as they have shown the size of the effects with confidence intervals. It’s not surprising that they found so many soi-disant significant effects given the sample size.

However, there was little assistance in judging how much of the observed variation in obesity was down to traffic noise. I would have liked to see a good old-fashioned analysis of variance table. I could then at least have had a go at comparing variation from the measurement process, traffic noise and other effects. I could also have calculated an adjusted R² myself.
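The sort of summary I have in mind is routine to produce. Here is a sketch on synthetic data using statsmodels; it is not Pyko’s data, which I do not have, and the variable names and effect sizes are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5075  # same sample size as the study, purely for illustration

# Synthetic data: a tiny noise effect buried in large unexplained variation.
df = pd.DataFrame({
    "noise_db": rng.uniform(40, 70, n),
    "age": rng.uniform(35, 70, n),
})
df["waist_cm"] = 88 + 0.04 * df["noise_db"] + 0.15 * df["age"] + rng.normal(0, 10, n)

fit = smf.ols("waist_cm ~ noise_db + age", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))      # variance attributable to each term
print(f"Adjusted R-squared: {fit.rsquared_adj:.3f}")
```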

Measurement Systems Analysis

Understanding variation from the measurement process is critical to any analysis. I have looked at the World Health Organisation’s definitive 2011 report on the effects of waist circumference on health. Such Measurement Systems Analysis as there is occurs at p7. They report a “technical error” (me neither) of 1.31 cm from intrameasurer error (I’m guessing repeatability) and 1.56 cm from intermeasurer error (I’m guessing reproducibility). They remark that “Even when the same protocol is used, there may be variability within and between measurers when more than one measurement is made.” They recommend further research but I have found none. There is no way of knowing from what is published by Pyko whether the reported effects are real or flow from confounding between traffic noise and intermeasurer variation.

When it comes to waist-hip ratio I presume that there are similar issues in measuring hip circumference. When the two dimensions are divided then the individual measurement uncertainties aggregate. More problems, not addressed.
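Roughly how the uncertainties aggregate: treating the WHO intra- and inter-measurer figures as independent components and assuming, for want of published figures, a similar error for the hip measurement, a first-order (delta method) sketch gives:

```python
import math

# WHO 2011 figures for waist circumference (cm); the hip values are my assumption.
waist, hip = 85.0, 100.0
waist_sd = math.hypot(1.31, 1.56)   # combine intra- and inter-measurer error in quadrature
hip_sd = waist_sd                   # assumed similar, purely for illustration

ratio = waist / hip
# First-order propagation for a quotient: relative variances add.
ratio_sd = ratio * math.hypot(waist_sd / waist, hip_sd / hip)

print(f"Waist-hip ratio {ratio:.3f} with measurement sd of roughly {ratio_sd:.3f}")
# Compare that sd with the headline effect sizes before getting excited.
```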

Noise data

The key predictor of obesity was supposed to be noise. The noise data used were not in situ measurements in the participants’ respective homes. The road traffic noise data were themselves predicted from a mathematical model using “terrain data, ground surface, building height, traffic data, including 24 h yearly average traffic flow, diurnal distribution and speed limits, as well as information on noise barriers”. The model output provided 5 dB contours. The authors then applied some further ad hoc treatments to the data.

The authors recognise that there is likely to be some error in the actual noise levels, not least from the granularity. However, they then seem to assume that this is simply an errors-in-variables situation. That would do no more than (conservatively) bias any observed effect towards zero. However, it does seem to me that there is potential for much more structured, systematic effects to be introduced here and I think this should have been explored further.
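The classical errors-in-variables effect is easy to simulate: independent noise added to the predictor attenuates the fitted slope towards zero, which is the benign scenario the authors appear to assume. A sketch with invented numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

true_noise_db = rng.uniform(40, 70, n)                  # the exposure we wish we had measured
waist = 80 + 0.4 * (true_noise_db - 55) + rng.normal(0, 5, n)

for meas_error_sd in (0.0, 5.0, 10.0):
    measured = true_noise_db + rng.normal(0, meas_error_sd, n)   # modelled, not in situ, noise
    slope = np.polyfit(measured, waist, 1)[0]
    print(f"measurement error sd {meas_error_sd:4.1f} dB -> fitted slope {slope:.3f}")
# Classical (independent) error only shrinks the slope; structured error could do anything.
```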

Model criticism

The authors state that they carried out a residuals analysis but they give no details and there are no charts, even in the supplementary material. I would like to have had a look myself as the residuals are actually the interesting bit. Residuals analysis is essential in establishing stability.

In fact, in the current study there is so much data that I would have expected the authors to have saved some of the data for cross-validation. That would have provided some powerful material for model criticism and validation.

Given that this is an analytic study these are all very serious failings. With nine researchers on the job I would have expected some effort on these matters and some attention from whoever was the statistical referee.

Results

Separate results are presented for road, rail and air traffic noise. Again, for brevity I am looking at the headline 2 mm / 5 dB quoted for road traffic noise. Now, waist circumference is dependent on gross body size. Men are bigger than women and have larger waists. Similarly, the tall are larger-waisted than the short. Pyko’s regression does not condition on height (as a gross characterisation of body size).

BMI is a factor that attempts to allow for body size. Pyko found no significant influence on BMI from road traffic noise.

Waist-hip ratio is another parameter that attempts to allow for body size. It is often now cited as a better predictor of morbidity than BMI. That, of course, is irrelevant to the question of whether noise makes you fat. As far as I can tell from Pyko’s published results, a 5 dB increase in road traffic noise accounted for a 0.16 increase in waist-hip ratio. Now, let us look at this broadly. Consider a woman with waist circumference 85 cm and hips 100 cm, hence a waist-hip ratio of 0.85. All pretty typical for the study. Predictively, the study is suggesting that a 5 dB increase in road traffic noise might unremarkably take her waist-hip ratio up over 1.0, which, with hips unchanged, implies a waist of over 100 cm, some 16 cm of growth. That seems barely consistent with the results from waist circumference alone, where the same 5 dB yields only millimetres of growth. It is physically incredible.

I must certainly have misunderstood what the waist-hip result means but I could find no elucidation in Pyko’s paper.

Policy

Research such as this has to be aimed at advising future interventions to control traffic noise in urban environments. Broadly speaking, 5 dB is a level of noise change that is noticeable to human hearing but no more. All the same, achieving such a reduction in an urban environment is something that requires considerable economic resources. Yet, taking the research at its highest, it only delivers 2 mm on the waistline.

I had many criticisms other than those above and I do not, in any event, consider this study adequate for making any prediction about a future intervention. Nothing in it makes me feel the subject deserves further study. Or that I need to avoid noise to stay slim.

Data and anecdote revisited – the case of the lime jellybean

I have already blogged about the question of whether data is the plural of anecdote. Then I recently came across the following problem in the late Richard Jeffrey’s marvellous little book Subjective Probability: The Real Thing (2004, Cambridge) and it struck me as a useful template for thinking about data and anecdotes.

The problem looks like a staple of elementary statistics practice exercises.

You are drawing a jellybean from a bag in which you know half the beans are green, all the lime flavoured ones are green and the green ones are equally divided between lime and mint flavours.

You draw a green bean. Before you taste it, what is the probability that it is lime flavoured?

A mathematically neat answer would be 50%. But what if, asked Jeffrey, when you drew the green bean you caught a whiff of mint? Or the bean was a particular shade of green that you had come to associate with “mint”. Would your probability still be 50%?
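One way of making Jeffrey’s point concrete is to treat the whiff of mint as evidence with a likelihood and update the 50% by Bayes’ rule. The likelihood values below are entirely my own invention:

```python
# Prior from the stated proportions: a drawn green bean is lime or mint with equal probability.
p_lime, p_mint = 0.5, 0.5

# Invented likelihoods: how often you'd catch a whiff of mint from each flavour.
p_whiff_given_mint = 0.6
p_whiff_given_lime = 0.1

# Bayes' rule after noticing the whiff.
numerator = p_whiff_given_lime * p_lime
posterior_lime = numerator / (numerator + p_whiff_given_mint * p_mint)
print(f"P(lime | green bean, whiff of mint) = {posterior_lime:.2f}")  # about 0.14
```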

The given proportions of beans in the bag are our data. The whiff of mint or subtle colouration is the anecdote.

What use is the anecdote?

It would certainly be open to a participant in the bean problem to maintain the 50% probability derived from the data and ignore the inferential power of the anecdote. However, the anecdote is evidence that we have and, if we choose to ignore it simply because it is difficult to deal with, then we base our assessment of risk on a more restricted picture than that actually available to us.

The difficulty with the anecdote is that it does not lead to any compelling inference in the same way as do the mathematical proportions. It is easy to see how the bean proportions would give rise to a quite extensive consensus about the probability of “lime”. There would be more variety in individual responses to the anecdote, in what weight to give the evidence and in what it tended to imply.

That illustrates the tension between data and anecdote. Data tends to consensus. If there is disagreement as to its weight and relevance then the community is likely to divide into camps rather than exhibit a spectrum of views. Anecdote does not lead to such a consensus. Individuals interpret anecdotes in diverse ways and invest them with varying degrees of credence.

Yet, the person who is best at weighing and interpreting the anecdotal evidence has the advantage over the broad community who are in agreement about what the proportion data tells them. It will often be the discipline specialist who is in the best position to interpret an anecdote.

From anecdote to data

One of the things that the “mint” anecdote might do is encourage us to start collecting future data on what we smelled when a bean was drawn. A sequence of such observations, along with the actual “lime/ mint” outcome, potentially provides a potent decision support mechanism for future draws. At this point the anecdote has been developed into data.

This may be a difficult process. The whiff of mint or subtle colouration could be difficult to articulate but recognising its significance (sic) is the beginning of operationalising and sharing.

Statistician John Tukey advocated the practice of exploratory data analysis (EDA) to identify such anecdotal evidence before settling on a premature model. As he observed:

The greatest value of a picture is when it forces us to notice what we never expected to see.

Of course, the person who was able to use the single anecdote on its own has the advantage over those who had to wait until they had compelling data. Data that they share with everybody else who has the same idea.

Data or anecdote

When I previously blogged about this I had trouble in coming to any definition that distinguished data and anecdote. Having reflected, I have a modest proposal. Data is the output of some reasonably well-defined process. Anecdote isn’t. It’s not clear how it was generated.

We are not told by what process the proportion of beans was established but I am willing to wager that it was some form of counting.

If we know the process generating evidence then we can examine its biases, non-responses, precision, stability, repeatability and reproducibility. Anecdote we cannot. It is because we can characterise the measurement process, through measurement systems analysis, that we can assess its reliability and make appropriate allowances and adjustments for its limitations. An assessment that most people will agree with most of the time. Because the most potent tools for assessing the reliability of evidence are absent in the case of anecdote, there are inherent difficulties in its interpretation and there will be a spectrum of attitudes from the community.

However, having had our interest piqued by the anecdote, we can set up a process to generate data.

Borrowing strength again

Using an anecdote as the basis for further data generation is one approach to turning anecdote into reliable knowledge. There is another way.

Today in the UK, a jury of 12 found nurse Victorino Chua, beyond reasonable doubt, guilty of poisoning 21 of his patients with insulin. Two died. There was no single compelling piece of evidence put before the jury. It was all largely circumstantial. The prosecution had sought to persuade the jury that those various items of circumstantial evidence reinforced each other and led to a compelling inference.

This is a common situation in litigation where there is no single conclusive piece of data but various pieces of circumstantial evidence that have to be put together. Where these reinforce, they inherit borrowing strength from each other.

Anecdotal evidence is not really the sort of evidence we want to have. But those who know how to use it are way ahead of those embarrassed by it.

Data is the plural of anecdote, either through repetition or through borrowing.

Target and the Targeteers

This blog appeared on the Royal Statistical Society website Statslife on 29 May 2014

John Pullinger, newly appointed head of the UK Statistics Authority, has given a trenchant warning about the “unsophisticated” use of targets. As reported in The Times (London) (“Targets could be skewing the truth, statistics chief warns”, 26 May 2014 – paywall) he cautions:

Anywhere we have had targets, there is a danger that they become an end in themselves and people lose sight of what they’re trying to achieve. We have numbers everywhere but haven’t been well enough schooled on how to use them and that’s where problems occur.

He goes on.

The whole point of all these things is to change behaviour. The trick is to have a sophisticated understanding of what will happen when you put these things out.

Pullinger makes it clear that he is no opponent of targets, but that in the hands of the unskilled they can create perverse incentives, encouraging behaviour that distorts the system they sought to control and frustrating the very improvement they were implemented to achieve.

For example, two train companies are being assessed by the regulator for punctuality. A train is defined as “on-time” if it arrives within 5 minutes of schedule. The target is 95% punctuality.
[Table: TrainTargets – punctuality figures for the two train companies]
Evidently, simple management by target fails to reveal that Company 1 is doing better than Company 2 in offering a punctual service to its passengers. A simple statement of “95% punctuality (punctuality defined as arriving within 5 minutes of timetable)” discards much of the information in the data.
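Since the original table is not reproduced here, a sketch of the sort of comparison I have in mind, with invented delay data: two companies can post much the same headline punctuality while offering very different experiences to their passengers.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000  # invented arrivals for each company

# Company 1: delays tightly clustered near zero; Company 2: a long tail of very late trains.
delays_1 = np.abs(rng.normal(0.0, 2.5, n))
delays_2 = np.concatenate([np.abs(rng.normal(0.0, 2.0, int(n * 0.96))),
                           rng.uniform(10, 40, n - int(n * 0.96))])

for name, delays in (("Company 1", delays_1), ("Company 2", delays_2)):
    on_time = np.mean(delays <= 5)
    print(f"{name}: {on_time:.0%} within 5 min, "
          f"mean delay {delays.mean():.1f} min, worst {delays.max():.0f} min")
# Both report roughly 95% "punctuality"; the target alone hides the difference passengers feel.
```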

Further, when presented with a train that has slipped outside the 5 minute tolerance, a manager held solely to the target of 95% has no incentive to stop the late train from slipping even further behind. Certainly, if it puts further trains at risk of lateness, there will always be a temptation to strip it of all priority. Here, the target is not only a barrier to effective measurement and improvement, it is a threat to the proper operation of the railway. That is the point that Pullinger was seeking to make about the behaviour induced by the target.

And again, targets often provide only a “snapshot” rather than the “video” that discloses the information in the data that can be used for planning and managing an enterprise.

I am glad that Pullinger was not hesitant to remind users that proper deployment of system measurement requires an appreciation of psychology. Nobel Laureate psychologist Daniel Kahneman warns of the inherent human trait of thinking that What you see is all there is (WYSIATI). On their own, targets do little to guard against such bounded rationality.

In support of a corporate programme of improvement and integrated in a culture of rigorous data criticism, targets have manifest benefits. They communicate improvement priorities. They build confidence between interfacing processes. They provide constraints and parameters that prevent the system causing harm. Harm to others or harm to itself. What is important is that the targets do not become a shield to weak managers who wish to hide their lack of understanding of their own processes behind the defence that “all targets were met”.

However, all that requires some sophistication in approach. I think the following points provide a basis for auditing how an organisation is using targets.

Risk assessment

Targets should be risk assessed, anticipating realistic psychology and envisaging the range of behaviours the targets are likely to catalyse.

Customer focus

Anyone tasked with operating to a target should be periodically challenged with a review of the Voice of the Customer and how their own role contributes to the organisational system. The target is only an aid to the continual improvement of the alignment between the Voice of the Process and the Voice of the Customer. That is the only game in town.

Borrowed validation

Any organisation of any size will usually have independent data of sufficient borrowing strength to support mutual validation. There was a very good recent example of this in the UK where falling crime statistics, about which the public were rightly cynical and incredulous, were effectively validated by data collection from hospital emergency departments (Violent crime in England and Wales falls again, A&E data shows).

Over-adjustment

Mechanisms must be in place to deter over-adjustment, what W Edwards Deming called “tampering”, where naïve pursuit of a target adds variation and degrades performance.

Discipline

Employees must be left in no doubt that lack of care in maintaining the integrity of the organisational system and pursuing customer excellence will not be excused by mere adherence to a target, no matter how heroic.

Targets are for the guidance of the wise. To regard them as anything else is to ask them to do too much.