Risks of paediatric heart surgery in the NHS

I thought I would let the controversy around this topic, and in particular the anxieties and policy changes around Leeds General Infirmary, die down before posting. However, I had a look at this report and found some interesting things in it worth blogging about.

Readers will remember that there was anxiety in the UK about mortality rates from paediatric surgery and whether differential mortality rates among the limited number of hospitals performing it were evidence of relative competence and, moreover, of patient safety. For a time Leeds General Infirmary suspended all such surgery. The report I’ve been looking at was a re-analysis of the data after some early data quality problems had been resolved. Leeds was exonerated and recommenced surgery.

The data analysed is from 2009 to 2012. The headline graphic in the report is this. The three-letter codes indicate individual hospitals.

[Figure: the report’s headline summary chart, showing each hospital’s deaths relative to PRAiS expectations]

I like this chart as it makes an important point. There is nothing, in itself, significant about having the highest mortality rate. There will always be exactly two hospitals at the extremes of any league table. The task of data analysis is to tell us whether that is simply a manifestation of the noise in the system or whether it is a signal of an underlying special cause. Nate Silver makes these points very well in his book The Signal and the Noise. Leeds General Infirmary had the greatest number of deaths, relative to expectations, but then somebody had to. It may feel emotionally uncomfortable being at the top but it is no guide to executive action.

Statisticians like the word “significant”, though I detest it. It is a “word worn smooth by a million tongues”. The important idea is that of a sign or signal that stands out in unambiguous contrast to the noise. As Don Wheeler observed, all data has noise; some data also has signals. Even the authors of the report seem to have lost confidence in the word, as they enclose it in quotes in their introduction. However, what this report is all about is trying to separate signal from noise. Against all the variation in outcomes in paediatric heart surgery, is there a signal? If so, what does the signal tell us and what ought we to do?

The authors go about their analysis using p-values. I agree with Stephen Ziliak and Deirdre McCloskey in their criticism of p-values: they are deeply unreliable as a guide to action. I do not think they do much harm in the way they are used in this report, but I would have preferred to see the argument made in a different way.

The methodology of the report starts out by recognising that the procedural risks will not be constant for all hospitals. Factors such as differential distributions of age, procedural complexity and the patient’s comorbidities will affect the risk. The report’s analysis is predicated on a model (PRAiS) that predicts the number of deaths to be expected in a given year as a function of these sorts of variables. The model is based on historical data, I presume from before 2009. I shall call this the “training” data. The PRAiS model endeavours to create a “level playing field”. If the PRAiS adjusted mortality figures are stable and predictable then we are simply observing noise. The noise is the variation that the PRAiS model cannot explain. It is caused by factors as yet unknown and possibly unknowable. What we are really interested in is whether any individual hospital in an individual year shows a signal, a mortality rate that is surprising given the PRAiS prediction.
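The report does not reproduce the model’s internals, but risk-adjustment models of this kind generally assign each procedure a predicted probability of death and sum those probabilities to give the expected deaths for a hospital-year. A minimal sketch of that arithmetic, with entirely hypothetical per-case risks:

```python
import numpy as np

# Illustration only: the real PRAiS model is a published risk model whose
# coefficients are not reproduced here. The invented risks below simply show
# how per-case predicted probabilities roll up into expected deaths.
rng = np.random.default_rng(0)
predicted_risk = rng.uniform(0.005, 0.08, size=350)  # 350 hypothetical procedures

expected_deaths = predicted_risk.sum()   # the PRAiS-style expectation
observed_deaths = 9                      # hypothetical observed count

print(f"Expected deaths: {expected_deaths:.1f}, observed: {observed_deaths}")
```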

The authors break down the PRAiS adjusted data by year and hospital. They then take a rather odd approach to the analysis. In each year, they make a further adjustment to the observed deaths based on the overall mortality rate across all hospitals in that year. I fear that there is no clear explanation of why this approach was adopted.

I suppose that this enables them to make an annual comparison between hospitals. However, it does have some drawbacks. Any year-on-year variation not explained by the PRAiS model is part of the common cause variation, the noise, in the system. It ought to have been stable and predictable over the data with which the model was “trained”. It seems odd to adjust data on the basis of noise. If there were a deterioration common to all hospitals, it would not be picked up in the authors’ p-values. Further, a potential signal of deterioration in one hospital might be masked by a moderately bad, but unsurprising, year in general.
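The report does not spell out the mechanics of that annual adjustment, but if it amounts to re-basing each hospital’s expectation to the year’s overall observed rate, a toy calculation shows how a deterioration common to every hospital would simply vanish (all figures below are invented):

```python
import numpy as np

# Hypothetical PRAiS expected deaths for five hospitals in one year
prais_expected = np.array([30.0, 45.0, 60.0, 25.0, 40.0])

# Suppose every hospital deteriorates uniformly by 20%
observed = prais_expected * 1.20

# A re-basing adjustment scales expectations to the year's overall rate
annual_factor = observed.sum() / prais_expected.sum()   # = 1.20
readjusted_expected = prais_expected * annual_factor

# After re-adjustment every hospital looks exactly "as expected"
print(np.round(observed / readjusted_expected, 2))      # [1. 1. 1. 1. 1.]
```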

What the analysis does mask is a likely signal suggesting a general improvement in mortality rates common across the hospitals. Look at 2009-10, for example. Most hospitals reported fewer deaths than the PRAiS model predicted; the few that didn’t barely exceeded the prediction.

[Figure: chart of 2009-10 observed deaths against PRAiS predictions by hospital]

In total, over the three years and 9930 procedures studied, the PRAiS model predicted 291 deaths. There were 243. For what it’s worth, I get a p-value of 0.002. Taking that at face value, there is a signal that mortality has dropped. Not a fact that I would want to disguise.
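For transparency, here is roughly how that figure can be reproduced, treating the total deaths over the period as Poisson with mean equal to the PRAiS prediction (the same simplifying assumption I grumble about below):

```python
from scipy.stats import poisson

expected = 291   # PRAiS predicted deaths, 2009-12
observed = 243   # deaths actually recorded

# One-sided probability of seeing 243 or fewer deaths if the true mean
# really were 291, under a simple Poisson model for the total count
p_value = poisson.cdf(observed, expected)
print(f"P(X <= {observed} | mean = {expected}) = {p_value:.4f}")  # roughly 0.002
```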

The plot that I would like to have seen, as an NHS user, would be a chart of PRAiS adjusted annual deaths against time for the “training” data. That chart should then have natural process limits (“NPLs”) added, calculated from the PRAiS adjusted deaths. This must show stable and predictable PRAiS adjusted deaths. Otherwise, the usefulness of the model and the whole approach is compromised. The NPLs could then be extended forwards in time and subsequent PRAiS adjusted mortalities charted on an annual basis. There would be individual hospital charts and a global chart. New points would be added annually.
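By way of a sketch (not the report’s method), this is how natural process limits are conventionally calculated for an individuals (XmR) chart, using invented annual PRAiS adjusted death totals for the “training” years and Wheeler’s usual 2.66 screening constant:

```python
import numpy as np

# Invented PRAiS adjusted annual death totals for the "training" years,
# purely to show the arithmetic of natural process limits
training = np.array([105.0, 98.0, 112.0, 101.0, 95.0, 108.0])

centre = training.mean()
mr_bar = np.abs(np.diff(training)).mean()   # average year-on-year moving range

# Natural process limits for an individuals (XmR) chart
upper_npl = centre + 2.66 * mr_bar
lower_npl = centre - 2.66 * mr_bar
print(f"Centre {centre:.1f}, NPLs {lower_npl:.1f} to {upper_npl:.1f}")

# Each new year's PRAiS adjusted total is then plotted against the extended
# limits; a point outside them is a signal worth investigating
new_year = 123.0
print("Signal" if not lower_npl <= new_year <= upper_npl else "Noise")
```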

I know that there is a complication in the varying number of patients each year but, whether plotted in aggregate or by hospital, there is not enough variation, I think, to cause a problem.

The chart I suggest has some advantages. It would show performance over time in a manner transparent to NHS users. Every time the data came into question we could see the same chart as last time we looked, with the new data added. We could see the new data in the context of the experience base. That helps build trust in data. There would be no need for an ad hoc analysis every time a question was raised. Further, the “training” data would give us the residual process variation empirically. We would not have to rely on simplifying assumptions such as the Poisson distribution when we are looking for a surprise.

There is a further point. The authors of the report recognise a signal against two criteria, an “Alert area” and an “Alarm area”. I’m not sure how clinicians and managers are supposed to respond to a signal in these respective areas. It is suggestive of the old-fashioned “warning limits” that used to be found on some control charts. However, the authors of the report compound matters by then stating that hospitals “approaching the alert threshold may deserve additional scrutiny and monitoring of current performance”. The simple truth is that, as Terry Weight used to tell me, a signal is a signal is a signal. As soon as we see a signal we protect the customer and investigate its cause. That’s all there is to it. There is enough to do in applying that tactic diligently. Overcomplicating the urgency of response does not, I think, help people to act effectively on data. If we act when there is no signal then we have a strategy that will make outcomes worse.

Of course, I may have misunderstood the report and I’m happy for the authors to post here and correct me.

If we wish to make data the basis for action then we have to move from reactive ad hoc analysis to continual and transparent measurement along with a disciplined pattern of response. Medical safety strikes me as exactly the sort of system that demands such an approach.

George Box and Response Surface Methods

News of George Box’s death escaped me while I was on vacation earlier this year and I thought it about time I commented on a huge statistical career. There are plenty of thorough obituaries on the web and I’m sure that the RSS will do a splendid job in due course. It is sad that there was no obituary in the Fleet Street press for somebody who has made such an eminent contribution to science and technology. Box’s particular talents were formed through his English training and learning on the job. Perhaps his neglect on the national stage is a measure of the extent to which the biggest ideas work gradually and organically, away from the grandstanding of the celebrity culture.

The word statistician feels inadequate to describe Box’s work. He was a man actively engaged in seeking novel methodologies for solving practical problems. Many of his solutions embraced what we conventionally think of as statistics. However, his work always seems that of somebody who looked for methodological solutions and sometimes found them in statistics, rather than a statistician looking to sell his product. Box described himself as “an accidental statistician” and I have a soft spot for people who arrive at their destinations by unconventional routes.

I was spurred on to reflect on Box’s work by Tim Davis’ worthwhile advocacy of dimensional analysis in experimental design. Box himself lamented that engineers are often hypnotised by statistical tools, discarding their engineering knowledge in the process with little regret. There is a gap between the engineer incubating a, possibly ill-formed, problem and a statistically inspired structured investigation. Sometimes it’s a hazardous leap between the two. From the far side it can be difficult to look back and see what motivated the investigation. The more bridges we can find across that gap the better. I think few have approached the effectiveness with which Box pontificated (in the exact sense of the word).

One of Box’s greatest contributions was his advocacy of Response Surface Methods (“RSMs”). I think some of my most enjoyable statistical experiences were back in my automotive industry days when we were using RSMs with computer models to optimise design details on mechanical components. We were looking to improve durability and reduce warranty costs. I recall one project where we exploited an elastic-plastic model of a feature that took 16 hours to run on the company’s CRAY supercomputer, a situation where even computer experiments needed a structured investigation.
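For readers who have not met RSMs, here is a minimal sketch of the kind of second-order fit involved: an invented two-factor central composite design with an invented durability response (the real work involved far more factors, a proper error analysis and, of course, those 16-hour runs):

```python
import numpy as np

# Invented two-factor central composite design in coded units, with an
# invented "durability" response. Illustration only.
X = np.array([
    [-1, -1], [1, -1], [-1, 1], [1, 1],                  # factorial points
    [-1.414, 0], [1.414, 0], [0, -1.414], [0, 1.414],    # axial points
    [0, 0], [0, 0], [0, 0],                              # centre points
])
y = np.array([78, 84, 81, 89, 77, 88, 80, 86, 90, 91, 89], dtype=float)

# Second-order model matrix: 1, x1, x2, x1^2, x2^2, x1*x2
x1, x2 = X[:, 0], X[:, 1]
M = np.column_stack([np.ones(len(y)), x1, x2, x1**2, x2**2, x1 * x2])

# Fit the response surface by ordinary least squares
beta, *_ = np.linalg.lstsq(M, y, rcond=None)
b0, b1, b2, b11, b22, b12 = beta

# Stationary point: set the gradient of the fitted quadratic to zero
B = np.array([[2 * b11, b12], [b12, 2 * b22]])
g = np.array([b1, b2])
x_stat = np.linalg.solve(B, -g)

print("Coefficients:", np.round(beta, 2))
print("Stationary point (coded units):", np.round(x_stat, 2))
# The eigenvalues of B say whether that point is a maximum, minimum or
# saddle, and hence whether to stop or keep exploring sequentially.
```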

As I said, Tim had got me thinking and I returned to a frustrating book that I had put down years ago: Walter G Vincenti’s What Engineers Know and How They Know It (1990, Johns Hopkins UP). Vincenti was an eminent aerospace engineer and the book is a fascinating history of a number of notable events in aerospace design. I do have a problem with this book: Vincenti seems rather dismissive of statistics. There are no statisticians in the index! There is, however, a compelling chapter on W F Durand and E P Lesley’s First World War propeller experiments, which were executed through quite a nice little factorial design. Durand’s trials and tribulations in managing the experimentation show that the statistics is really the easy bit. You can find the full report here. It is well worth reading.

Vincenti is rather dismissive of Durand’s statistical skill and relegates it to a footnote. He doesn’t really acknowledge Durand’s methodological sophistication. The truly frustrating thing about the book is the difficulty in drawing generalised conclusions that answer the question in the title. However, Vincenti does come up with the suggestion that “parameter variation”, in the broadest sense, is a key part of the engineering learning process. I think it a disappointing takeaway, as the descriptive part of the book is much richer than the conclusion suggests. Perhaps I will come back to this.

One of Box’s key insights was engineers’ need for immediacy and sequentiality in the parameter variation process (“Statistics as a Catalyst to Learning by Scientific Method Part II – A Discussion”, Journal of Quality Technology, 31(1), 1999, pp16-29).

Psychologist Daniel Kahneman has described two ways of thinking that typify human decision making. System 1 is instinctive, fluent, heuristic and integrated with the experience base. System 1 is overconfident and often leads us astray. System 2 employs reflective, considered analysis. It can, when properly guided by statistical theory, guard against the hazards of System 1. Problems such as “What factors determine this process output?” are difficult. Kahneman observes that often, when confronted with difficult problems, System 1 substitutes a simpler problem such as “What factors are we currently relying on to control this process?”. Experts think they are answering the first question when they are in fact answering the second. Box’s requirement for immediacy allows engineers to exploit their, sometimes misleading, experience base while subjecting it to a rigorous experimental test in a rapid and efficient manner.

Experimental results feed into System 2 thinking. However, the human mind is still much too confident in adopting explanations that are in reality merely plausible rather than probable. The requirement of sequentiality allows analysis of those explanations in a rich and diverse context that puts them to a rigorous test.

One of the fascinations of engineering research is exploring the partially known. Jon Schmidt made the following remark about structural engineering but I think it applies to engineering in general. It certainly applied to mechanical engineering in my automotive days.

Structural engineering is the art of modelling materials we do not wholly understand into shapes we cannot precisely analyse so as to withstand forces we cannot properly assess in such a way that the public at large has no reason to suspect the extent of our ignorance.

The application of statistics, and in particular RSMs, to engineering is one of the great tools we have for decision making under uncertainty. Modern psychology has tended to confirm Box’s instincts, learned on the job, about the tools that best support human decision making and guard against its inadequacies. Box remains a role model in developing strategies for operational excellence.

Another driving metaphor

Welcome – I think this is going to be a shadow blog for a while until everything is in place.

I note that in today’s (London) Times Stephan Shakespeare purports to quote Sir Terry Leahy as saying:

An enterprise without data is like a car with no headlights at night.

Actually, I couldn’t find it on the web, not even with Google. I wonder if Mr Shakespeare has misremembered this famous quote.

In any event, I have never encountered an enterprise without data. The problem faced by most enterprises is turning that data into information and exploiting it in making business decisions. I fear that all those decisions are made under besetting uncertainties that have little in common with a halogen beam into the future.

Data, no matter how Big, is not enough. It will not help you steer the car unless you know how to add value to it. In doing that, awareness of uncertainty and variation is a key insight.