Trust in data – IV – trusting the team

Today (20 November 2013) I was reading an item in The Times (London) with the headline “We fiddle our crime numbers, admit police”. This is a fairly unedifying business.

The blame is once again laid at the door of government targets and performance-related pay. I fear that this is akin to blaming police corruption on the largesse of criminals. If only organised crime would stop offering bribes, so the argument might run (pace Brian Joiner), the police would not succumb to taking them in repudiation of their office as constable. Of course, the argument is nonsense. What we expect of police constables is honesty even, perhaps especially, when temptation presents itself. We expect the police to give truthful evidence in court, to deal with the public fairly and to conduct their investigations diligently and rationally. The public expects the police to behave in this way even in the face of manifest temptation to do otherwise. The public expects the same honest approach to reporting their performance. I think Robert Frank put it well in Passions within Reason.

The honest individual … is someone who values trustworthiness for its own sake. That he might receive a material payoff for such behaviour is beyond his concern. And it is precisely because he has this attitude that he can be trusted in situations where his behaviour cannot be monitored. Trustworthiness, provided it is recognizable, creates valuable opportunities that would not otherwise be available.

Matt Ridley put it starkly in his overview of evolutionary psychology, The Origins of Virtue. He wasn’t speaking of policing in particular.

The virtuous are virtuous for no other reason than that it enables them to join forces with others who are virtuous, for mutual benefit.

What worried me most about the article was a remark from Peter Barron, a former detective chief superintendent in the Metropolitan Police. Should any individual challenge the distortion of data:

You are judged to be not a team player.

“Teamwork” can be a smokescreen for the most appalling bullying. In our current corporate cultures, to be branded as “not a team player” can be the most horrible slur, smearing the individual’s contribution to the overall mission. One can see how such an environment can allow a team’s behaviours and objectives to become misaligned from those of the parent organisation. That is a problem that can often be addressed by management with a proper system of goal deployment.

However, the problem is more severe when the team is in fact well aligned to what are distorted organisational goals. The remedies for this lie in the twin processes of governance and whistleblowing. Neither seems to be working very well in UK policing at the moment, but that simply leaves an opportunity for process improvement. Work is underway. The English law of whistleblowing has been amended this year. If you aren’t familiar with it you can find it here.

Governance has to take scrutiny of data seriously. Reported performance needs to be compared with other sources of data. Reporting and recording processes need themselves to be assessed. Where there is no coherent picture, questions need to be asked.

Trust in data – III – being honest about honesty

I found this presentation by Dan Ariely intriguing. I suspect that this is originally a TED talk with some patronising cartoons added. You can just listen.

When I started off in operational excellence learning about the Deming philosophy, my instructors always used to say, “These are honest men’s [sic] tools.” From that point of view Ariely’s presentation is pretty pessimistic. I don’t think I am entirely surprised when I recall Matt Ridley’s summary of evolutionary psychology from his book The Origins of Virtue.

Human beings have some instincts that foster the greater good and others that foster self-interest and anti-social behaviour. We must design a society that encourages the former and discourages the latter.

When wearing a change management hat it’s easy to be sanguine about designing a system or organisation that fosters virtue and the sort of diligent data collection that confronts present reality. However, it is useful to have a toolkit of tactics to build such a system. I think Ariely’s ideas are helpful here.

His idea of “reminders” is something that resonates with maintaining a continual focus on the Voice of the Customer / Voice of the Business. Periodically exploring with data collectors the purpose of their data collection and the system-wide consequences of fabrication is something that seems worthwhile in itself. However, the work Ariely refers to suggests that there might be reasons why such a “nudge” would be particularly effective in improving data trustworthiness.

His idea of “confessions” is a little trickier. I might reflect for a while then blog some more.

The graph of doom – one year on

I recently came across the chart (sic) below on this web site.

[Image: the “graph of doom” chart]

It’s apparently called the “graph of doom”. It first came to public attention in May 2012 in the UK newspaper The Guardian. It purports to show how the London Borough of Barnet’s spending on social services will overtake the Borough’s total budget some time around 2022.

At first sight the chart doesn’t offend too much against the principles of graphical excellence as set down by Edward Tufte in his book The Visual Display of Quantitative Information. The bars could probably have been better replaced by lines and that would have saved some expensive, coloured non-data ink. That is a small quibble.

The most puzzling thing about the chart is that it shows very little data. I presume that the figures for 2010/11 are actuals. The 2011/12 figures may be provisional. But the rest of the area of the chart shows predictions. There is a lot of ink on this chart showing predictions and very little showing actual data. Further, the chart does not distinguish, graphically, between actual data and predictions. I worry that that might lend the dramatic picture more authority than it is really entitled to. The visible trend lies wholly in the predictions.

Some past history would have exposed variation in both funding and spending and enabled the viewer to set the predictions in that historical context. A chart showing a converging trend of historical data projected into the future is more impressive than a chart showing historical stability with all the convergence found in the future prediction. This chart does not tell us which is the actual picture.

Further, I suspect that this is not the first time the author has made a prediction of future funds or demand. What would interest me, were I in the position of decision maker, is some history of how those predictions have performed in the past.

We are now more than one year on from the original chart and I trust that the 2012/13 data is now available. Perhaps the authors have produced an updated chart but it has not made its way onto the internet.

The chart shows hardly any historical data. Such data would have been useful to a decision maker. The ink devoted to predictions could have been saved. All that was really needed was to say that spending was projected to exceed total income around 2022. Some attempt at quantifying the uncertainty in that prediction would also have been useful.

Graphical representations of data carry a potent authority. Unfortunately, when on the receiving end of most PowerPoint presentations we don’t have long to deconstruct them. We invest a lot of trust in the author of a chart when we take it at face value. That ought to be the chart’s function: to communicate the information in the data efficiently and as dramatically as the data and its context justify.

I think that the following principles can usefully apply to the charting of predictions and forecasts.

  • Use ink on data rather than speculation.
  • Ditto for chart space.
  • Chart predictions using a distinctive colour or symbol so as to be less prominent than measured data.
  • Use historical data to set predictions in context.
  • Update chart as soon as predictions become data.
  • Ensure everybody who got the original chart gets the updated chart.
  • Leave the prediction on the updated chart.

The last point is what really sets predictions in context.
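By way of illustration, here is a minimal sketch of such a chart in Python with matplotlib. The figures are invented purely to show the layout; they are not the Barnet data. Measured data gets solid, prominent lines; projections get fainter, dashed ones; and the projection stays on the chart once actuals arrive.

```python
# A minimal sketch (invented figures, not the Barnet data) showing
# historical values as solid lines and projections as fainter,
# dashed lines, so that measured data dominates the ink.
import matplotlib.pyplot as plt

years_actual = [2008, 2009, 2010, 2011]        # measured data
budget_actual = [310, 305, 300, 295]           # £m, invented
spend_actual = [150, 155, 160, 165]            # £m, invented
years_pred = [2011, 2015, 2020, 2025]          # projections
budget_pred = [295, 270, 240, 210]             # £m, invented
spend_pred = [165, 185, 215, 250]              # £m, invented

fig, ax = plt.subplots()
# Solid, prominent lines for what was actually measured
ax.plot(years_actual, budget_actual, "k-o", label="Total budget (actual)")
ax.plot(years_actual, spend_actual, "b-o", label="Social care spend (actual)")
# Dashed, lighter lines for predictions, clearly distinguished
ax.plot(years_pred, budget_pred, "k--", alpha=0.4, label="Total budget (projected)")
ax.plot(years_pred, spend_pred, "b--", alpha=0.4, label="Social care spend (projected)")
ax.set_xlabel("Year")
ax.set_ylabel("£m")
ax.legend()
plt.show()
```

The point of keeping the old projection visible alongside the arriving actuals is that the viewer can judge the forecaster’s track record at a glance.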

Note: I have tagged this post “Data visualization”, adopting the US spelling which I feel has become standard English.

Trust in data – II

I just picked up on this, now not so recent, news item about the prosecution of Steven Eaton. Eaton was gaoled for falsifying data in clinical trials. His prosecution was pursuant to the Good Laboratory Practice Regulations 1999. The Regulations apply to chemical safety assessments and come to us, in the UK, from that supra-national body the OECD. Sadly, I have managed to find few details other than the press reports. I have had a look at the website of the prosecuting Medicines and Healthcare products Regulatory Agency but found nothing beyond the press release. I thought about a request under the Freedom of Information Act 2000 but wonder whether an exemption is being claimed pursuant to section 31.

It’s a shame because it would have been an opportunity to compare and contrast with another notable recent case of industrial data fabrication, that concerning BNFL and the Kansai Electric contract. Fortunately, in that case, the HSE made public a detailed report.

In the BNFL case, technicians had fabricated measurements of the diameters of fuel pellets in nuclear fuel rods, it appears principally out of boredom at doing the actual job. The customer spotted it, BNFL didn’t. The matter caused huge reputational damage to BNFL and resulted in the shipment of nuclear fuel rods, necessarily under armed escort, being turned around mid-ocean and returned to the supplier.

For me, the important lesson of the BNFL affair is that businesses must avoid a culture where employees decide which parts of the job are important and interesting to them, what is called intrinsic motivation. Intrinsic motivation is related to a sense of cognitive ease. That sense rests, as Daniel Kahneman has pointed out, on an ecology of unknown and unknowable beliefs and prejudices. No doubt the technicians had encountered nothing but boringly uniform products. They took that as a signal to stop measuring and to conceal the fact that they had stopped, and they felt a sense of cognitive ease in doing so.

However, nobody in the supply chain is entitled to ignore the customer’s wishes. Businesses need to foster the extrinsic motivation of the voice of the customer. That is what defines a job well done. Sometimes it will be irksome and involve a lot of measuring pellets whose dimensions look just the same as the last batch. We simply have to get over it!

The customer wanted the data collected, not simply as a sterile exercise in box-ticking, but as a basis for diligent surveillance of the manufacturing process and as a critical component of managing the risks attendant on real-world nuclear industry operations. The customer showed that a proper scrutiny of the data, exactly what they had thought BNFL would perform as part of the contract, would have exposed its inauthenticity. BNFL were embarrassed, not only by their lack of management control of their own technicians, but by the exposure of their own incapacity to scrutinise data and act on its signal message. Even if all the pellets had been of perfect dimension, the customer would be legitimately appalled that so little critical attention was being paid to keeping them so.

Data that is properly scrutinised, as part of a system of objective process management and with the correct statistical tools, will readily be exposed if it is fabricated. That is part of incentivising technicians to do the job diligently. Dishonesty must not be tolerated. However, it is essential that everybody in an organisation understands the voice of the customer and understands the particular way in which they themselves add value. A scheme of goal deployment weaves the threads of the voice of the customer together with those of individual process management tactics. That is what provides an individual’s insight into how their work adds value for the customer. That is what provides the “nudge” towards honesty.
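As a hedged illustration of the kind of scrutiny I mean (the HSE report contains the real detail), fabricated measurements are often too uniform: people who invent figures rarely reproduce the variation a real process exhibits. Given a trusted historical estimate of the process variation, a crude check might look like the sketch below; the measurements and sigma are entirely hypothetical.

```python
# A rough sketch: test whether a batch of reported measurements is
# suspiciously *uniform* compared with known process variation.
# sigma_process would come from trusted historical data; the
# measurements here are invented for illustration.
import numpy as np
from scipy import stats

sigma_process = 0.010          # mm, assumed known from trusted history
reported = np.array([8.191, 8.192, 8.191, 8.192, 8.191,
                     8.192, 8.191, 8.191, 8.192, 8.191])  # invented

n = len(reported)
s2 = reported.var(ddof=1)
# Under honest measurement, (n - 1) * s^2 / sigma^2 ~ chi-squared(n - 1).
# A very small statistic means less variation than the process can deliver.
chi2_stat = (n - 1) * s2 / sigma_process**2
p_too_uniform = stats.chi2.cdf(chi2_stat, df=n - 1)
print(f"P(variation this small by chance) = {p_too_uniform:.4g}")
```

A tiny probability here is not proof of dishonesty, but it is exactly the sort of signal that diligent scrutiny would pick up and investigate.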

Risks of paediatric heart surgery in the NHS

Before posting, I thought I would let the controversy around this topic, and in particular the anxieties and policy changes around Leeds General Infirmary, die down. However, I had a look at this report and found there were some interesting things in it worth blogging about.

Readers will remember that there was anxiety in the UK about mortality rates from paediatric surgery and whether differential mortality rates across the limited number of hospitals were evidence of relative competence and, moreover, patient safety. For a time Leeds General Infirmary suspended all such surgery. The report I’ve been looking at was a re-analysis of the data after some early data quality problems had been resolved. Leeds was exonerated and recommenced surgery.

The data analysed is from 2009 to 2012. The headline graphic in the report is this. The three letter codes indicate individual hospitals.

[Image: summary chart of mortality outcomes by hospital, from the report]

I like this chart as it makes an important point. There is nothing, in itself, significant about having the highest mortality rate. There will always be exactly two hospitals at the extremes of any league table. The task of data analysis is to tell us whether that is simply a manifestation of the noise in the system or whether it is a signal of an underlying special cause. Nate Silver makes these points very well in his book The Signal and the Noise. Leeds General Infirmary had the greatest number of deaths, relative to expectations, but then somebody had to. It may feel emotionally uncomfortable being at the top but it is no guide to executive action.

Statisticians like the word “significant” though I detest it. It is a “word worn smooth by a million tongues”. The important idea is that of a sign or signal that stands out in unambiguous contrast to the noise. As Don Wheeler observed, all data has noise, some data also has signals. Even the authors of the report seem to have lost confidence in the word as they enclose it in quotes in their introduction. However, what this report is all about is trying to separate signal from noise. Against all the variation in outcomes in paediatric heart surgery, is there a signal? If so, what does the signal tell us and what ought we to do?

The authors go about their analysis using p-values. I agree with Stephen Ziliak and Deirdre McCloskey in their criticism of p-values. They are deeply unreliable as a guide to action. I do not think they do much harm the way they are used in this report, but I would have preferred to see the argument made in a different way.

The methodology of the report starts out by recognising that the procedural risks will not be constant for all hospitals. Factors such as differential distributions of age, procedural complexity and the patient’s comorbidities will affect the risk. The report’s analysis is predicated on a model (PRAiS) that predicts the number of deaths to be expected in a given year as a function of these sorts of variables. The model is based on historical data, I presume from before 2009. I shall call this the “training” data. The PRAiS model endeavours to create a “level playing field”. If the PRAiS adjusted mortality figures are stable and predictable then we are simply observing noise. The noise is the variation that the PRAiS model cannot explain. It is caused by factors as yet unknown and possibly unknowable. What we are really interested in is whether any individual hospital in an individual year shows a signal, a mortality rate that is surprising given the PRAiS prediction.
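In outline, and only in outline, the arithmetic of such risk adjustment looks like the sketch below. I have no access to the PRAiS model itself, so predicted_risk() is a purely hypothetical stand-in for its case-mix calculation; the point is simply that a hospital’s expected deaths are the sum of per-operation predicted risks.

```python
# Outline of risk adjustment. predicted_risk() is a hypothetical
# stand-in for the PRAiS model, whose internals are not shown here.
from dataclasses import dataclass

@dataclass
class Case:
    age_months: float
    complexity_score: float   # stand-ins for the PRAiS case-mix factors
    comorbidities: int

def predicted_risk(case: Case) -> float:
    """Hypothetical per-operation risk of death given the case mix."""
    risk = 0.02 + 0.01 * case.complexity_score + 0.005 * case.comorbidities
    if case.age_months < 3:          # assume neonates carry higher risk
        risk += 0.02
    return min(risk, 1.0)

def expected_deaths(cases: list[Case]) -> float:
    # The "level playing field": sum of per-case predicted risks.
    return sum(predicted_risk(c) for c in cases)

# Observed deaths are then compared against this expectation; a
# hospital taking on harder cases gets a proportionately higher
# expectation, rather than being penalised for its case mix.
```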

The authors break down the PRAiS adjusted data by year and hospital. They then take a rather odd approach to the analysis. In each year, they make a further adjustment to the observed deaths based on the overall mortality rate for all hospitals in that year. I fear that there is no clear explanation as to why this was adopted.

I suppose that this enables them to make an annual comparison between hospitals. However, it does have some drawbacks. Any year-on-year variation not explained by the PRAiS model is part of the common cause variation, the noise, in the system. It ought to have been stable and predictable over the data with which the model was “trained”. It seems odd to adjust data on the basis of noise. If there were a deterioration common to all hospitals, it would not be picked up in the authors’ p-values. Further, a potential signal of deterioration in one hospital might be masked by a moderately bad, but unsurprising, year in general.
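A toy numerical example, if I have understood the adjustment correctly, shows the masking. Suppose every hospital ran 20% worse than the PRAiS expectation in one year; re-centring on that year’s overall rate makes the deterioration vanish from the comparison. The figures are invented.

```python
# Toy illustration of how re-centring on each year's overall rate
# can mask a deterioration common to all hospitals. Figures invented.
expected = [20, 30, 25, 40]    # PRAiS-expected deaths, four hospitals
observed = [24, 36, 30, 48]    # every hospital 20% worse than expected

overall_ratio = sum(observed) / sum(expected)          # = 1.2
# Scaling each expectation to the year's overall mortality means each
# hospital is judged against 1.2x its original expectation:
adjusted_expected = [e * overall_ratio for e in expected]
print(adjusted_expected)       # [24.0, 36.0, 30.0, 48.0]
# Observed now equals the adjusted expectation everywhere: the common
# deterioration has vanished from the comparison.
```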

What the analysis does mask is a likely signal here suggesting a general improvement in mortality rates common across the hospitals. Look at 2009/10, for example. Most hospitals reported fewer deaths than the PRAiS model predicted. The few that didn’t barely exceeded the prediction.

[Image: 2009/10 deaths by hospital against PRAiS predictions]

In total, over the three years and 9930 procedures studied, the PRAiS model predicted 291 deaths. There were 243. For what it’s worth, I get a p-value of 0.002. Taking that at face value, there is a signal that mortality has dropped. Not a fact that I would want to disguise.
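For the curious, assuming a simple Poisson model for total deaths (the report may use something more refined), that p-value can be reproduced in a couple of lines:

```python
# Quick check of the p-value, assuming total deaths ~ Poisson(291).
from scipy import stats

expected, observed = 291, 243
p = stats.poisson.cdf(observed, expected)   # P(X <= 243 | mean 291)
print(f"p = {p:.4f}")                       # ~0.002, as quoted above
```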

The plot that I would like to have seen, as an NHS user, would be a chart of PRAiS adjusted annual deaths against time for the “training” data. That chart should then have natural process limits (“NPLs”) added, calculated from the PRAiS adjusted deaths. This must show stable and predictable PRAiS adjusted deaths. Otherwise, the usefulness of the model and the whole approach is compromised. The NPLs could then be extended forwards in time and subsequent PRAiS adjusted mortalities charted on an annual basis. There would be individual hospital charts and a global chart. New points would be added annually.
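For concreteness, here is a minimal sketch of the NPL calculation on hypothetical training figures, using the standard XmR (individuals chart) constant of 2.66:

```python
# Natural process limits for annual PRAiS-adjusted deaths, computed
# XmR-style from the "training" years. The figures are hypothetical.
import numpy as np

training = np.array([102, 95, 108, 99, 104, 97])   # adjusted deaths per year

centre = training.mean()
moving_ranges = np.abs(np.diff(training))
mr_bar = moving_ranges.mean()
# 2.66 is the standard XmR constant (3 / d2, with d2 = 1.128 for n = 2)
upper_npl = centre + 2.66 * mr_bar
lower_npl = max(centre - 2.66 * mr_bar, 0)

print(f"Centre {centre:.1f}, NPLs [{lower_npl:.1f}, {upper_npl:.1f}]")
# Once fixed from the training data, the limits are extended forward,
# not re-fitted each year; new annual points are judged against them,
# and a point outside the limits is a signal to investigate.
```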

I know that there is a complexity with the varying number of patients each year but, if plotted in the aggregate and by hospital, there is not enough variation, I think, to cause a problem.

The chart I suggest has some advantages. It would show performance over time in a manner transparent to NHS users. Every time the data comes in issue we could look and see that we have the same chart as last time we looked with new data added. We could see the new data in the context of the experience base. That helps build trust in data. There would be no need for an ad hoc analysis every time a question was raised. Further, the “training” data would give us the residual process variation empirically. We would not have to rely on simplifying assumptions such as the Poisson distribution when we are looking for a surprise.

There is a further point. The authors of the report recognise a signal against two criteria, an “Alert area” and an “Alarm area”. I’m not sure how clinicians and managers are meant to respond differently to a signal in each area. It is suggestive of the old-fashioned “warning limits” that used to be found on some control charts. However, the authors of the report compound matters by then stating that hospitals “approaching the alert threshold may deserve additional scrutiny and monitoring of current performance”. The simple truth is that, as Terry Weight used to tell me, a signal is a signal is a signal. As soon as we see a signal we protect the customer and investigate its cause. That’s all there is to it. There is enough to do in applying that tactic diligently. Overcomplicating the urgency of response does not, I think, help people to act effectively on data. If we act when there is no signal then we have a strategy that will make outcomes worse.

Of course, I may have misunderstood the report and I’m happy for the authors to post here and correct me.

If we wish to make data the basis for action then we have to move from reactive ad hoc analysis to continual and transparent measurement along with a disciplined pattern of response. Medical safety strikes me as exactly the sort of system that demands such an approach.