I started my grown-up working life on a project seeking to predict extreme ocean currents off the north west coast of the UK. As a result I follow environmental disasters very closely. I fear that it’s natural that incidents in my own country have particular salience. I don’t want to minimise disasters elsewhere in the world when I talk about recent flooding in the north of England. It’s just that they are close enough to home for me to get a better understanding of the essential features.
The causes of the flooding are multi-factorial and many of the factors are well beyond my expertise. However, The Times (London) reported on 28 December 2015 that “Some scientists say that [the UK Environment Agency] has been repeatedly caught out by the recent heavy rainfall because it sets too much store by predictions based on historical records” (p7). Setting store by predictions based on historical records is very much where my hands-on experience of statistics began.
The starting point of prediction is extreme value theory, developed by Sir Ronald Fisher and L H C Tippett in the 1920s. Extreme value analysis (EVA) aims to put probabilistic bounds on events outside the existing experience base by positing that such events follow a special form of probability distribution. Historical data can be used to fit such a distribution using the usual statistical estimation methods. Prediction is then based on a double extrapolation: first in the exact form of the tail of the extreme value distribution, and second from the past data to the future. As the old saying goes, “Interpolation is (almost) always safe. Extrapolation is always dangerous.”
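To make the fitting and extrapolation concrete, here is a minimal sketch in Python. It rests on several assumptions that are not in the account above: it uses the Gumbel distribution (one member of the extreme value family), it estimates the parameters by the method of moments rather than by maximum likelihood, and the "annual maxima" are synthetic numbers generated purely for illustration. The function names are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(annual_maxima):
    """Method-of-moments estimates of Gumbel location and scale.

    Assumption: the maxima follow a Gumbel distribution, whose mean is
    location + gamma * scale and whose variance is (pi * scale)^2 / 6.
    """
    mean = statistics.fmean(annual_maxima)
    sd = statistics.stdev(annual_maxima)
    scale = sd * math.sqrt(6) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale


def return_level(location, scale, return_period_years):
    """Level exceeded with probability 1/return_period in any one year."""
    p_non_exceed = 1.0 - 1.0 / return_period_years
    return location - scale * math.log(-math.log(p_non_exceed))


# Synthetic annual maxima, for illustration only (not real flood data).
rng = random.Random(2015)
maxima = [50.0 - 10.0 * math.log(-math.log(rng.random())) for _ in range(40)]

location, scale = fit_gumbel_moments(maxima)
for period in (10, 100, 1000):
    level = return_level(location, scale, period)
    print(f"{period:>5}-year return level: {level:.1f}")
```

With only 40 observations, the 1000-year figure lies far outside the experience base. It is produced entirely by the assumed form of the tail, which is the first of the two extrapolations.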
EVA rests on some non-trivial assumptions about the process under scrutiny. No statistical method yields more than was input in the first place. If we are to be allowed to extrapolate beyond the experience base then there are inevitably some assumptions. Where the real-world process doesn’t follow those assumptions the extrapolation is compromised. To some extent there is no cure for this other than to come to a rational decision about the sensitivity of the analysis to the assumptions and to apply a substantial safety factor to the physical engineering solutions.
One of those assumptions also plays to the dimension of extrapolation from past to future. Statisticians often demand that the data be independent and identically distributed. However, that is a weird thing to demand of data. Real world data is hardly ever independent as every successive observation provides more information about the distribution and alters the probability of future observations. We need a better idea to capture process stability.
Historical data can only be projected into the future if it comes from a process that is “sufficiently regular to be predictable”. That regularity is effectively characterised by the property of exchangeability. Deciding whether data is exchangeable demands not only statistical evidence of its past regularity but also domain knowledge of the physical process that it measures. The exchangeability must continue into the predictable future if historical data is to provide any guide. In the matter of flooding, knowledge of hydrology, climatology, planning, engineering and law, in addition to local knowledge about economics and infrastructure changes already in development, is essential. Exchangeability is always a judgment. And a critical one.
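The statistical half of that judgment can be illustrated with a simple permutation check. If a series is exchangeable, shuffling its order should make no difference to any order-dependent summary, so an observed trend that is rarely matched by shuffled copies is evidence against past regularity. The sketch below is one possible check under stated assumptions: the trend statistic, the function names and the synthetic series are all chosen for illustration. It says nothing about the future or about the physical process, which remain matters of domain knowledge.

```python
import random
import statistics


def trend_statistic(series):
    """Sum of cross-products between time index and value (unnormalised)."""
    n = len(series)
    t_mean = (n - 1) / 2
    x_mean = statistics.fmean(series)
    return sum((t - t_mean) * (x - x_mean) for t, x in enumerate(series))


def permutation_p_value(series, n_permutations=2000, seed=0):
    """Share of shuffled orderings whose trend is at least as strong
    as the one observed. Small values count against exchangeability."""
    rng = random.Random(seed)
    observed = abs(trend_statistic(series))
    shuffled = list(series)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(trend_statistic(shuffled)) >= observed:
            count += 1
    # Add-one correction so the estimate is never exactly zero.
    return (count + 1) / (n_permutations + 1)


# Synthetic series, for illustration only: noise with and without a drift.
rng = random.Random(1)
stable = [rng.gauss(0.0, 1.0) for _ in range(40)]
drifting = [x + 0.1 * t for t, x in enumerate(stable)]

print("stable series p-value:  ", permutation_p_value(stable))
print("drifting series p-value:", permutation_p_value(drifting))
```

A large p-value here is not proof of exchangeability. It shows only that this particular statistic found no trend in the historical record.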
Predicting extreme floods is a complex business and I send my good wishes to all involved. It is an example of something that is essentially a team enterprise as it demands the co-operative inputs of diverse sets of experience and skills.
In many ways this is an exemplary model of how to act on data. There is no mechanistic process of inference that stands outside a substantial knowledge of what is being measured. The secret of data analysis, which often hinges on judgments about exchangeability, is to visualize the data in a compelling and transparent way so that it can be subjected to collaborative criticism by a diverse team.