Science journal bans p-values

Interesting news here that the psychology journal Basic and Applied Social Psychology (BASP) has banned the use of p-values in the academic research papers that it will publish in the future.

The dangers of p-values are widely known though their use seems to persist in any number of disciplines, from the Higgs boson to climate change.

There has been some wonderful recent advocacy deprecating p-values, from Deirdre McCloskey and Regina Nuzzo among others. BASP editor David Trafimow has indicated that the journal will not now publish formal hypothesis tests (of the Neyman-Pearson type) or confidence intervals purporting to support experimental results. I presume that appeals to “statistical significance” are proscribed too. Trafimow has no dogma as to what people should do instead but is keen to encourage descriptive statistics. That is good news.

However, Trafimow does say something that worries me.

… as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem.

It is trite statistics that merely increasing sample size, in the sense of the raw number of observations, is no guarantee of reducing sampling error. If the sample is not rich enough to capture all the relevant sources of variation then data is amassed in vain. A common example is that of inter-laboratory studies of analytical techniques. A researcher who takes 10 observations from Laboratory A and 10 from Laboratory B really only has two observations, at least as far as the really important and dominant sources of variation are concerned. Increasing the number of observations to 100 from each laboratory would simply be a waste of resources.
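For anyone who wants to see the arithmetic, here is a minimal simulation sketch of that two-laboratory situation. It is mine, not drawn from any particular study, and the between- and within-laboratory standard deviations are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(42)

def se_of_grand_mean(n_per_lab, n_labs=2, sigma_lab=5.0, sigma_within=1.0, reps=10_000):
    """Monte Carlo standard error of the grand mean when each observation is
    a laboratory effect plus within-laboratory noise (sigma values are
    illustrative assumptions, not real inter-laboratory data)."""
    means = []
    for _ in range(reps):
        lab_effects = rng.normal(0.0, sigma_lab, size=n_labs)
        obs = rng.normal(lab_effects[:, None], sigma_within, size=(n_labs, n_per_lab))
        means.append(obs.mean())
    return float(np.std(means))

for n in (10, 100):
    print(f"{n} observations per laboratory: SE of grand mean ≈ {se_of_grand_mean(n):.2f}")
# With only two laboratories the between-laboratory component dominates, so
# going from 10 to 100 observations per laboratory barely changes the answer.
```

Under these assumptions the standard error is pinned at roughly sigma_lab divided by the square root of the number of laboratories, however many observations each laboratory contributes; only sampling more laboratories would reduce it.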

But that is not all there is to it. Sampling error only addresses how well we have represented the sampling frame. In any reasonably interesting statistical work, and certainly in any attempt to manage risk, we are only interested in the future. The critical question before we can engage in any, even tentative, statistical inference is “Is the data representative of the future?”. That requires that the data has the statistical property of exchangeability. Some people prefer the more management-oriented term “stable and predictable”. That’s why I wish Trafimow hadn’t used the word “stable”.

Assessment of stability and predictability is fundamental to any prediction or data-based management. It demands confident use of process-behaviour charts and trenchant scrutiny of the sources of variation that drive the data. It is the necessary starting point of all reliable inference. A taste for p-values is a major impediment to clear thinking on the matter. They do not help. It would be encouraging to believe that scepticism was on the march but I don’t think prohibition is the best means of education.
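For readers who have not met them, the arithmetic behind an individuals (XmR) process-behaviour chart is modest. Here is a minimal sketch using the conventional 2.66 and 3.268 constants for two-point moving ranges; the data are invented.

```python
import numpy as np

def xmr_limits(x):
    """Natural process limits for an individuals (XmR) process-behaviour chart,
    using the conventional constants 2.66 and 3.268 for two-point moving ranges."""
    x = np.asarray(x, dtype=float)
    mr = np.abs(np.diff(x))        # moving ranges between consecutive observations
    centre, mr_bar = x.mean(), mr.mean()
    return {
        "centre": centre,
        "lower_limit": centre - 2.66 * mr_bar,
        "upper_limit": centre + 2.66 * mr_bar,
        "mr_upper_limit": 3.268 * mr_bar,
    }

# Invented weekly figures for some measure of interest.
data = [52, 48, 55, 50, 47, 53, 49, 51, 46, 54]
limits = xmr_limits(data)
signals = [v for v in data if not limits["lower_limit"] <= v <= limits["upper_limit"]]
print(limits)
print("Points outside the natural process limits:", signals)
```

Points outside the natural process limits, or non-random patterns within them, are the cue to go looking for an assignable source of variation before trusting the data to say anything about the future.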

 

How to use data to scare people …

… and how to use data for analytics.

Crisis hit GP surgeries forced to turn away millions of patients

That was the headline on the Royal College of General Practitioners (“RCGP” – UK family physicians) website today. The catastrophic tone was elaborated in The (London) Times: Millions shut out of doctors’ surgeries (paywall).
The GPs’ alarm was based on data from the GP Patient Survey, which is a survey conducted on behalf of the National Health Service (“NHS”) by pollsters Ipsos MORI. The study is conducted by way of a survey questionnaire sent out to selected NHS patients. You can find the survey form here. Ipsos MORI’s careful analysis is here.

Participants were asked to recall their experience of making an appointment last time they wanted to. From this, the GPs have extracted the material for their blog’s lead paragraph.

GP surgeries are so overstretched due to the lack of investment in general practice that in 2015 on more than 51.3m occasions patients in England will be unable to get an appointment to see a GP or nurse when they contact their local practice, according to new research.

Now, this is not analysis. For the avoidance of doubt, the Ipsos MORI report cited above does not suffer from such tendentious framing. The RCGP blog features the following tropes of Langian statistical method.

  • Using emotive language such as “crisis”, “forced” and “turn away”.
  • Stating the cause of the avowed problem, “lack of investment”, without presenting any supporting argument.
  • Quoting an absolute number of affected patients rather than a percentage which would properly capture individual risk.
  • Casually extrapolating to a future round number, over 50 million.
  • Seeking to bolster their position by citing “new research”.
  • Failing to recognise the inevitable biases that beset human descriptions of past events.

Humans are notoriously susceptible to bias in how they recall and report past events. Psychologist Daniel Kahneman has spent a lifetime mapping out the various cognitive biases that afflict our thinking. The Ipsos MORI survey appears to me rigorously designed but no degree of rigour can eliminate the frailties of human memory, especially about an uneventful visit to the GP. An individual is much more likely to recall a frustrating attempt to make an appointment than a straightforward encounter.

Sometimes, such survey data will be the best we can do and, though flawed in itself, will be the least bad guide to action. As Charles Babbage observed:

Errors using inadequate data are much less than those using no data at all.

Yet the GPs’ use of this external survey data to support their funding campaign looks particularly out of place in this situation. This is a case where there is a better source of evidence. The point is that the problem under investigation lies entirely within the GPs’ own domain. The GPs themselves are in a vastly superior position to collect data on frustrated appointments, within their own practices. Data can be generated at the moment an appointment is sought. Memory biases and patient non-responses can be eliminated. The reasons for any diary difficulties can be recorded as they are encountered. And investigated before the trail has gone cold. Data can be explored within the practice, improvements proposed, gains measured, solutions shared on social media. The RCGP could play the leadership role of aggregating the data and fostering sharing of ideas.
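By way of illustration only, data capture at the moment an appointment is sought could be as simple as the following sketch. The field names and file layout are hypothetical, my own invention rather than anything prescribed by the RCGP or the NHS.

```python
import csv
from datetime import date, datetime, timedelta

# Hypothetical record layout for logging every appointment request as it happens.
FIELDS = ["requested_at", "appointment_offered", "days_to_offered_slot", "reason_if_not_met"]

def log_request(path, offered, offered_date=None, reason=""):
    """Append one appointment request to a practice-level CSV at the moment it is made."""
    now = datetime.now()
    days = (offered_date - now.date()).days if offered and offered_date else None
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:              # write the header once, on first use
            writer.writeheader()
        writer.writerow({
            "requested_at": now.isoformat(timespec="minutes"),
            "appointment_offered": offered,
            "days_to_offered_slot": days,
            "reason_if_not_met": reason,
        })

# A frustrated request, with the reason recorded before the trail goes cold.
log_request("appointments_log.csv", offered=False, reason="no appointment available within the week")
# A satisfied request, offered a slot three days ahead.
log_request("appointments_log.csv", offered=True, offered_date=date.today() + timedelta(days=3))
```

The data then sits where the practice can interrogate it, plot it on a process-behaviour chart and act on what it shows.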

It is only with local data generation that the capability of an appointments system can be assessed. Constraints can be identified, managed and stabilised. It is only when the system is shown to be incapable that a case can be made for investment. And the local data collected is exactly the data needed to make that case. Not only does such data provide a compelling visual narrative of the appointment system’s inability to heal itself but, when supported by rigorous analysis, it liquidates the level of investment and creates its own business case. Rigorous criticism of data inhibits groundless extrapolation. At the very least, local data would have provided some borrowing strength to validate the patient survey.

Looking to external data to support a case when there is better data to be had internally, both to improve now what is in place and to support the business case for new investment, is neither pretty nor effective. And it is not analysis.

The Productivity Paradox

This last week saw a further report from the Bank of England that UK productivity has fallen inexplicably behind the nation’s aspirations. There is a compelling picture of the development of productivity over time on the Office for National Statistics (“ONS”) website here.

There is general puzzlement, and disquiet, among UK economists as to why productivity is not improving. It seems to suggest that cutting the costs of production is not at the top of UK business agendas. It’s true that there are other important things to worry about: design and redesign of products and services, reputation, customer experience, workplace engagement, safety and sustainability.

But I suspect that there is nothing more important than productivity. It is only by learning how to do more with less that resources can be freed up to develop novel income streams. Even on matters of safety and environment, it is the efficient organisation that finds the resources to take those matters seriously.

The road to increased productivity is well mapped out. The continual improvement of the alignment between the Voice of the Process and the Voice of the Customer, by means of diligent criticism of historical data, is an open secret.

The dark side of discipline

W Edwards Deming was very impressed with Japanese railways. In Out of the Crisis (1986) he wrote this.

The economy of a single plan that will work is obvious. As an example, may I cite a proposed itinerary in Japan:

          1725 h Leave Taku City.
          1923 h Arrive Hakata.
Change trains.
          1924 h Leave Hakata [for Osaka, at 210 km/hr]

Only one minute to change trains? You don’t need a whole minute. You will have 30 seconds left over. No alternate plan was necessary.

My friend Bob King … while in Japan in November 1983 received these instructions to reach by train a company that he was to visit.

          0903 h Board the train. Pay no attention to trains at 0858, 0901.
          0957 h Off.

No further instruction was needed.

Deming seemed to assume that these outcomes were delivered by a capable and, moreover, stable system. That may well have been the case in 1983. However, by 2005 matters had drifted.

The other night I watched, recorded from the BBC, the documentary Brakeless: Why Trains Crash, about the Amagasaki rail crash on 25 April 2005. I fear that it is no longer available on BBC iPlayer. However, most of the documentaries in this BBC Storyville strand are independently produced and usually have some limited theatrical release or are available elsewhere. I now see that the documentary is available here on Dailymotion.

The documentary painted a system of “discipline” on the railway where drivers were held directly responsible for outcomes, above all punctuality. This was not a documentary aimed at engineers, but the first thing missing for me was any risk assessment of the way the railway was run. Perhaps it was there, but it is difficult to see what thought process would lead to a failure to mitigate the risks of production pressures.

However, beyond that, for me the documentary raised some important issues of process discipline. We must be very careful when we make anyone working within a process responsible for its outputs. That sounds a strange thing to say, but Paul Jennings at Rolls-Royce always used to remind me that “You can’t work on outcomes”.

The difficulty that the Amagasaki train drivers had was that the railway was inherently subject to sources of variation over which the drivers had no control. In the face of those sources of variation, they were pressured to maintain the discipline of a punctual timetable. The way they did that was to transgress other dimensions of process discipline, in the Amagasaki case, speed limits.

Anybody at work must diligently follow the process given to them. But if that process does not deliver the intended outcome then that is the responsibility of the manager who owns the process, not the worker. When a worker, with the best of intentions, seeks independently to modify the process, they are in a poor position, constrained as they are by their own bounded rationality. They will inevitably be trapped by System 1 thinking.

Of course, it is great when workers can get involved with the manager’s efforts to align the voice of the process with the voice of the customer. However, the experimentation stops when they start operating the process live.

Fundamentally, it is a moral certainty that purblind pursuit of a target will lead to over-adjustment by the worker, what Deming called “tampering”. That in turn leads to increased costs, aggravated risk and vitiated consumer satisfaction.
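A small simulation in the spirit of Deming’s funnel experiment makes the point. The adjustment rule below, moving the setting by the negative of the last deviation from target, is the classic “Rule 2”, and the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_process(n=10_000, sigma=1.0, tamper=False):
    """Simulate a process that is on target and subject only to common-cause noise.
    With tamper=True the operator adjusts the setting after every result by the
    negative of the last deviation from target (funnel experiment, Rule 2)."""
    setting = 0.0
    results = np.empty(n)
    for i in range(n):
        results[i] = setting + rng.normal(0.0, sigma)
        if tamper:
            setting -= results[i]       # chase the target, point by point
    return results

print("hands off :", round(run_process(tamper=False).std(), 2))
print("tampering :", round(run_process(tamper=True).std(), 2))
# Reacting to common-cause variation as though it were signal inflates the
# standard deviation by a factor of about sqrt(2).
```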

Target and the Targeteers

This blog appeared on the Royal Statistical Society website Statslife on 29 May 2014

John Pullinger, newly appointed head of the UK Statistics Authority, has given a trenchant warning about the “unsophisticated” use of targets. As reported in The Times (London) (“Targets could be skewing the truth, statistics chief warns”, 26 May 2014 – paywall) he cautions:

Anywhere we have had targets, there is a danger that they become an end in themselves and people lose sight of what they’re trying to achieve. We have numbers everywhere but haven’t been well enough schooled on how to use them and that’s where problems occur.

He goes on.

The whole point of all these things is to change behaviour. The trick is to have a sophisticated understanding of what will happen when you put these things out.

Pullinger makes it clear that he is no opponent of targets, but that in the hands of the unskilled they can create perverse incentives, encouraging behaviour that distorts the system they sought to control and frustrating the very improvement they were implemented to achieve.

For example, two train companies are being assessed by the regulator for punctuality. A train is defined as “on-time” if it arrives within 5 minutes of schedule. The target is 95% punctuality.
[Table: arrival-time performance of the two companies against the 5-minute tolerance and the 95% target]
Evidently, simple management by target fails to reveal that Company 1 is doing better than Company 2 in offering a punctual service to its passengers. A simple statement of “95% punctuality (punctuality defined as arriving within 5 minutes of timetable)” discards much of the information in the data.
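To make that concrete, here is an illustration with made-up delay distributions, not the figures in the table above: both companies clear the 95% target, yet the fuller description shows who actually runs the more punctual railway.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented arrival delays in minutes, not the figures from the table above.
# Company 1 runs close to timetable; Company 2 hugs the 5-minute tolerance.
company_1 = rng.normal(loc=0.5, scale=1.5, size=1000)
company_2 = rng.normal(loc=3.5, scale=0.8, size=1000)

for name, delays in (("Company 1", company_1), ("Company 2", company_2)):
    on_time = 100 * np.mean(delays <= 5)
    print(f"{name}: {on_time:.1f}% within 5 minutes, "
          f"mean delay {delays.mean():.1f} min, 95th percentile {np.percentile(delays, 95):.1f} min")
# Both companies report the target met; the mean delay and the tail of the
# distribution tell passengers a very different story.
```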

Further, when presented with a train that has slipped outside the 5 minute tolerance, a manager held solely to the target of 95% has no incentive to stop the late train from slipping even further behind. Certainly, if it puts further trains at risk of lateness, there will always be a temptation to strip it of all priority. Here, the target is not only a barrier to effective measurement and improvement, it is a threat to the proper operation of the railway. That is the point that Pullinger was seeking to make about the behaviour induced by the target.

And again, targets often provide only a “snapshot” rather than the “video” that discloses the information in the data that can be used for planning and managing an enterprise.

I am glad that Pullinger was not hesitant to remind users that proper deployment of system measurement requires an appreciation of psychology. Nobel Laureate psychologist Daniel Kahneman warns of the inherent human trait of thinking that What you see is all there is (WYSIATI). On their own, targets do little to guard against such bounded rationality.

In support of a corporate programme of improvement and integrated in a culture of rigorous data criticism, targets have manifest benefits. They communicate improvement priorities. They build confidence between interfacing processes. They provide constraints and parameters that prevent the system causing harm. Harm to others or harm to itself. What is important is that the targets do not become a shield to weak managers who wish to hide their lack of understanding of their own processes behind the defence that “all targets were met”.

However, all that requires some sophistication in approach. I think the following points provide a basis for auditing how an organisation is using targets.

Risk assessment

Targets should be risk assessed, anticipating realistic psychology and envisaging the range of behaviours the targets are likely to catalyse.

Customer focus

Anyone tasked with operating to a target should be periodically challenged with a review of the Voice of the Customer and how their own role contributes to the organisational system. The target is only an aid to the continual improvement of the alignment between the Voice of the Process and the Voice of the Customer. That is the only game in town.

Borrowed validation

Any organisation of any size will usually have independent data of sufficient borrowing strength to support mutual validation. There was a very good recent example of this in the UK where falling crime statistics, about which the public were rightly cynical and incredulous, were effectively validated by data collection from hospital emergency departments (Violent crime in England and Wales falls again, A&E data shows).

Over-adjustment

Mechanisms must be in place to deter over-adjustment, what W Edwards Deming called “tampering”, where naïve pursuit of a target adds variation and degrades performance.

Discipline

Employees must be left in no doubt that lack of care in maintaining the integrity of the organisational system and pursuing customer excellence will not be excused by mere adherence to a target, no matter how heroic.

Targets are for the guidance of the wise. To regard them as anything else is to ask them to do too much.