Data versus modelling

Life can only be understood backwards; but it must be lived forwards.

Søren Kierkegaard

Journalist James Forsyth was brave enough to write the following in The Spectator on 4 July 2020, in the context of reform of the UK civil service.

The new emphasis on data must properly distinguish between data and modelling. Data has real analytical value – it enables robust discussion of what has worked and what has not. Modelling is a far less exact science. In this [Covid-19] crisis, as in the 2008 financial crisis, models have been more of a hindrance than a help.

Now, this glosses over a number of issues that I have gone on about a lot on this blog. It’s a good opportunity for me to challenge again what I think I have learned from a career in data, modelling and evidence.

Data basics

Pick up your undergraduate statistics text. Turn to page one. You will find this diagram.

[Diagram: population, sampling frame and sample]

The population, and be assured I honestly hate that term but I am stuck with it, is the collection of all things or events, individuals, that I passionately want to know about. All that I am willing to pay money to find out about. Many practical facets of life prevent me from measuring every single individual. Sometimes it’s worth the effort and that’s called a census. Then I know everything, subject to the performance of my measurement process. And if you haven’t characterised that beforehand you will be in trouble. #MSA

In many practical situations, we take a sample. Even then, not every single individual in the population will be available for sampling within my budget. Suppose I want to market soccer merchandise to all the people who support West Bromwich Albion. I have no means to identify who all those people are. I might start with season ticket holders, or individuals who have bought a ticket online from the club in the past year, or paid for multiple West Brom games on subscription TV. I will not even have access to all those. Some may have opted to protect their details from marketing activities under the UK GDPR. What is left, no matter how I choose to define it, is called the sampling frame. That is the collection of individuals that I have access to and can interrogate, in principle. The sampling frame is all those items I can put on a list from one to whatever. I can interrogate any of them. I will probably, just because of cost, take a subset of the frame as my sample. As a matter of pure statistical theory, I can analyse and quantify the uncertainty in my conclusions that arises from the limited extent of my sampling within the frame, at least if I have adopted one of the canonical statistical sampling plans.
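To make the "tame" part concrete, here is a minimal sketch of one canonical plan, a simple random sample drawn without replacement from a listed frame, with the usual finite population correction. The frame contents, the sample size and the function name srs_proportion are all hypothetical, invented purely for illustration. Note what the standard error does and does not cover: it quantifies sample-to-frame uncertainty only.

```python
import math
import random


def srs_proportion(frame, sample_size, seed=0):
    """Estimate a proportion from a simple random sample of the frame.

    `frame` is a list of 0/1 responses, one per listed individual.
    Returns (estimate, standard error). The standard error describes
    uncertainty about the FRAME only, not about the wider population.
    """
    rng = random.Random(seed)
    sample = rng.sample(frame, sample_size)  # sampling without replacement
    n, N = sample_size, len(frame)
    p_hat = sum(sample) / n
    fpc = (N - n) / N  # finite population correction
    se = math.sqrt(fpc * p_hat * (1 - p_hat) / (n - 1))
    return p_hat, se


# Hypothetical frame: 20,000 contactable individuals, 1 = would buy a scarf.
frame = [1] * 6000 + [0] * 14000

p_hat, se = srs_proportion(frame, 400)
print(f"sample of 400: {p_hat:.3f} +/- {2 * se:.3f}")

# A census of the frame leaves no sampling uncertainty at all.
p_all, se_all = srs_proportion(frame, len(frame))
print(f"census of frame: {p_all:.3f} +/- {2 * se_all:.3f}")
```

When the sample is the whole frame the correction drives the standard error to zero, which is the census case: everything is known about the frame, subject to the measurement process.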

However, statistical theory tells me nothing about the uncertainty that arises in extrapolating (yes it is!) from frame to population. Many supporters will not show up in my frame, those who follow from the sports bar for example. Some in the frame may not even be supporters but parents who buy tickets for offspring who have rebelled against family tradition. In this illustration, I have a suspicion that the differences between frame and population are not so great. Nearly all the people in my frame will be supporters and neglecting those outside it may not be so great a matter. The overlap between frame and population is large, even though it may not be perfect. However, in general, extrapolation from frame to population is a matter for my subjective subject matter insight, market and product knowledge. Statistical theory is the trivial bit. Using domain knowledge to go from frame to population is the hard work. Not only is it hard work, it bears the greater part of the risk.
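A rough way to see where the risk sits is a sensitivity table rather than a formula with a standard error. The sketch below is an assumption-laden illustration, not a statistical method: the coverage of the frame and the behaviour of supporters outside it are numbers I would have to supply from market knowledge, and the function name population_rate and every figure in it are hypothetical.

```python
def population_rate(frame_rate, coverage, outside_rate):
    """Blend the rate measured in the frame with a GUESSED rate outside it.

    coverage     : assumed share of the population that is inside the frame
    outside_rate : assumed rate among those the frame cannot reach
    Neither of the last two can be estimated from the sample.
    """
    return coverage * frame_rate + (1 - coverage) * outside_rate


frame_rate = 0.30  # hypothetical estimate from sampling the frame

# Vary the two judgement calls and watch the answer move.
for coverage in (0.9, 0.7, 0.5):
    for outside_rate in (0.30, 0.15, 0.05):
        rate = population_rate(frame_rate, coverage, outside_rate)
        print(f"coverage={coverage:.1f} outside={outside_rate:.2f} "
              f"-> population rate {rate:.3f}")
```

If the spread across plausible judgement calls is much wider than the sampling error within the frame, then the domain knowledge, not the arithmetic, is carrying the decision.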

Enumerative and analytic statistics

W Edwards Deming was certainly the most famous statistician of the twentieth century. So long ago now. He made a famous distinction between two types of statistical study.

Enumerative study: A statistical study in which action will be taken on the material in the frame being studied.

Analytic study: A statistical study in which action will be taken on the process or cause-system that produced the frame being studied. The aim being to improve practice in the future.

Suppose that a company manufactures 1000 overcoats for sale online. An inspector checks each overcoat of the 1000 to make sure it has all three buttons. All is well. The 1000 overcoats are released for sale. No way to run a business, I know, but an example of an enumerative study. The 1000 overcoats are the frame. The inspector has sampled 100% of them. Action has been taken on the 1000 overcoats, the 1000 overcoats that were, themselves, the sampling frame. Sadly, this is what so many people think statistics is all about. There is no ambiguity here in extrapolating from frame to population as the frame is the population.

Deming’s definition of an analytic study is a bit more obscure with its reference to cause systems. But let’s take a case that is, at once, extreme and routine.

When we are governing or running a commercial enterprise or a charity, we are in the business of predicting the future. The past has happened and we are stuck with it. This is what our world looks like.

[Diagram: the frame is the past, the population is the future, with no overlap]

The frame available for sampling is the historical past. The data that you have is a sample from that past frame. The population you want to know about is the future. There is no area of overlap between past and future, between frame and population. All that stuff in statistics books about enumerative studies, that is most of the contents, will not help you. Issues of extrapolating from sample to frame, the tame statistical matters in the textbooks, are dwarfed by the audacity of projecting the frame onto an ineffable future.

And, as an aside, just think about what that means when we are drawing conclusions about future human health from past experiments on mice.

What Deming pointed towards, with his definition of analytic study, is that, in many cases, we have enough faith to believe that both the past and future are determined by a common system of factors, drivers, mechanisms, phenomena and causes, physicochemical and economic, likely interacting in a complicated but regular way. This is what Deming meant by the cause system.

Managing and governing are both about pulling levers to effect change. Dwelling on the past will only yield beneficial future change if exploited, mercilessly, to understand the cause system. To characterise what are the levers that will deliver future beneficial outcomes. That was Deming’s big challenge.

The inexact science of modelling

And to predict, we need a model of the cause system. This is unavoidable. Sometimes we are able to use the simplest model of all. That the stream of data we are bothered about is exchangeable, or if you prefer stable and predictable. As I have stressed so many times before on this blog, to do that we need:

  • Trenchant criticism of the experience base that shows an historical record of exchangeability; and
  • Enough subject matter insight into the cause system to believe that such exchangeability will be maintained, at least into an immediate future where foresight would be valuable.
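One common way to criticise an experience base is an individuals (XmR) process behaviour chart. The sketch below is only that, a sketch: it computes the conventional natural process limits (mean plus or minus 2.66 times the average moving range) from a baseline and flags later points that fall outside them. The data and the function names xmr_limits and signals are hypothetical. A clean chart does not prove exchangeability, and it says nothing about the second bullet, which remains a matter of subject matter insight.

```python
from statistics import mean


def xmr_limits(values):
    """Natural process limits for an individuals chart.

    Uses the conventional scaling of 2.66 times the mean moving range.
    Returns (lower limit, centre line, upper limit).
    """
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    centre = mean(values)
    spread = 2.66 * mean(moving_ranges)
    return centre - spread, centre, centre + spread


def signals(new_values, baseline):
    """Indices of new observations falling outside the baseline limits."""
    lower, _, upper = xmr_limits(baseline)
    return [i for i, x in enumerate(new_values) if x < lower or x > upper]


# Hypothetical weekly figures from a period believed to be stable.
baseline = [52, 48, 50, 51, 49, 50, 53, 47, 50, 50]
lower, centre, upper = xmr_limits(baseline)
print(f"limits: {lower:.2f} to {upper:.2f}, centre {centre:.2f}")

# Eternal vigilance: keep checking new data against the same limits.
new_values = [51, 49, 58, 50]
print("signals at positions:", signals(new_values, baseline))
```

A signal is a prompt to go and look at the cause system, not a verdict. The prediction "next week will fall within the limits" is only as good as the belief that nothing material has changed.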

Here, there is no need quantitatively to map out the cause system in detail. We are simply relying on its presumed persistence into the future. It’s still a model. Of course, the price of extrapolation is eternal vigilance. Philip Tetlock drew similar conclusions in Superforecasting.

But often we know that critical influences on the past are prey to change and variation. Climates, councils, governments … populations, tastes, technologies, creeds and resources never stand still. As visible change takes place we need to be able to map its influence onto those outcomes that bother us. We need to be able to do that in advance. Predicting sales of diesel motor vehicles based on historical data will have little prospect of success unless we know that they are being regulated out of existence, in the UK at least. And we have to account for that effect. Quantitatively. This requires more sophisticated modelling. But it remains essential to any form of prediction.
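As a toy illustration of the diesel point, compare a forecast that assumes the past simply persists with one that folds in a known change to the cause system. Every number here is invented, the phase-out schedule is a hypothetical input that would come from reading the regulations rather than from the sales history, and the function names are mine.

```python
def naive_forecast(history, horizon):
    """Assume the future looks like the historical average."""
    level = sum(history) / len(history)
    return [level] * horizon


def adjusted_forecast(history, permitted_share):
    """Scale the naive forecast by the share of sales still permitted.

    `permitted_share` has one entry per future period and is supplied
    from knowledge of the regulation, not estimated from `history`.
    """
    base = naive_forecast(history, len(permitted_share))
    return [b * s for b, s in zip(base, permitted_share)]


# Hypothetical annual sales, in thousands of vehicles.
history = [100, 104, 98, 102, 96]

# Hypothetical phase-out: full sales, then half, then none.
permitted_share = [1.0, 0.5, 0.0]

print("naive:   ", naive_forecast(history, 3))
print("adjusted:", adjusted_forecast(history, permitted_share))
```

Nothing in the historical record distinguishes the two forecasts. The difference comes entirely from a model of how the cause system is about to change.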

I looked at some of the critical ideas in modelling here, here and here.

Data v models

The purpose of models is not to fit the data but to sharpen the questions.

Samuel Karlin

Nothing is more useless than the endless collection of data without a will to action. Action takes place in the present with the intention of changing the future. To use historical data to inform our actions we need models. Forsyth wants to know what has worked in the past and what has not. That was then, this is now. And it is not even now we are bothered about but the future. Uncritical extrapolation is not robust analysis. We need models.

If we don’t understand these fundamental issues then models will seem more a hindrance than a help.

But … eternal vigilance.
