I have just been reading Superforecasting: The Art and Science of Prediction by Philip Tetlock and Dan Gardner. The book has attracted much attention and enthusiasm in the press. It makes a bold claim that some people, superforecasters, though inexpert in the conventional sense, are possessed of the ability to make predictions with a striking degree of accuracy, that those individuals exploit a strategy for forecasting applicable even to the least structured evidence, and that the method can be described and learned. The book summarises results of a study sponsored by US intelligence agencies as part of the Good Judgment Project but, be warned, there is no study data in the book.
I haven’t found any really good distinction between forecasting and prediction so I might swap between the two words arbitrarily here.
What was being predicted?
The forecasts/predictions in question were all in the field of global politics and economics. For example, a question asked in January 2011 was:
Will Italy restructure or default on its debt by 31 December 2011?
This is a question that invited a yes/no answer. However, participants were encouraged to answer with a probability, a number between 0% and 100% inclusive. If they were certain of the outcome they could answer 100%, if certain that it would not occur, 0%. The participants were allowed, I think encouraged, to update and re-update their forecasts at any time. So, as far as I can see, a forecaster who predicted 60% for Italian debt restructuring in January 2011 could revise that to 0% in December, even up to the 30th. Each update was counted as a separate forecast.
The study looked for “Goldilocks” problems, not too difficult, not too easy but just right.
Bruno de Finetti was very sniffy about using the word “prediction” in this context and preferred the word “prevision”. It didn’t catch on.
Who was studied?
The study was conducted by means of a tournament among volunteers. I gather that the participants wanted to be identified and thereby personally associated with their scores. Contestants had to be college graduates and, as a preliminary, had to complete a battery of standard cognitive and general knowledge tests designed to characterise their given capabilities. The competitors in general fell in the upper 30 percent of the general population for intelligence and knowledge. When some book reviews comment on how the superforecasters included housewives and unemployed factory workers I think they give the wrong impression. This was a smart, self-selecting, competitive group with an interest in global politics. As far as I can tell, backgrounds in mathematics, science and computing were typical. It is true that most were amateurs in global politics.
With such a sampling frame, of what population is it representative? The authors recognise that problem though don’t come up with an answer.
Forecasters were assessed using Brier scores. I fear that Brier scores fail to be intuitive, are hard to understand and raise all sorts of conceptual difficulties. I don’t feel that they are sufficiently well explained, challenged or justified in the book. Suppose that a competitor predicts a probability p for the Italian default of 60%. Rewrite this as a probability in the range 0 to 1 for convenience, 0.6. If the competitor accepts finite additivity then the probability of “no default” is 1 – 0.6 = 0.4. Now suppose that outcomes f are coded as 1 when confirmed and 0 when disconfirmed. That means that if a default occurs then f(default) = 1 and f(no default) = 0. If there is no default then f(default) = 0 and f(no default) = 1. It’s not easy to get. We then take the difference between each p and its f, square the differences and sum them. This is illustrated below for the case of “no default”, where (0.6 – 0)² + (0.4 – 1)² = 0.36 + 0.36 yields a Brier score of 0.72.
|Event|p|f|(p – f)²|
|Default|0.6|0|0.36|
|No default|0.4|1|0.36|
|Sum|||0.72|
Suppose we were dealing with a fair coin toss. Nobody would criticise a forecasting probability of 50% for heads and 50% for tails. The long run Brier score would be 0.5 (think about it). Brier scores were averaged for each competitor and used as the basis of ranking them. If a competitor updated a prediction then every fresh update was counted as an individual prediction and each prediction was scored. More on this later. An average of 0.5 would be similar to a chimp throwing darts at a target. That is about how well expert professional forecasters had performed in a previous study. The lower the score the better. Zero would be perfect foresight.
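The arithmetic can be checked with a few lines of Python. This is only a sketch of the two-outcome scoring rule as described above, not code from the study, and the function name `brier_score` is my own.

```python
def brier_score(probabilities, outcomes):
    """Sum of squared differences between forecast probabilities p and
    outcome codes f (1 = confirmed, 0 = disconfirmed) for one question."""
    return sum((p - f) ** 2 for p, f in zip(probabilities, outcomes))

# Italian debt question, events ordered as (default, no default).
italy_forecast = (0.6, 0.4)
score_if_no_default = brier_score(italy_forecast, (0, 1))
score_if_default = brier_score(italy_forecast, (1, 0))

# Fair coin forecast at 50/50, events ordered as (heads, tails).
coin_if_heads = brier_score((0.5, 0.5), (1, 0))
coin_if_tails = brier_score((0.5, 0.5), (0, 1))

print(round(score_if_no_default, 2))  # 0.72
print(round(score_if_default, 2))     # 0.32
print(coin_if_heads, coin_if_tails)   # 0.5 0.5
```

The coin scores 0.5 whichever way it lands, which is why the long run average for a 50/50 forecast is 0.5.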
I would have liked to have seen some alternative analyses and I think that a Hosmer-Lemeshow statistic or detailed calibration study would in some ways have been more intuitive and yielded more insight.
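By a calibration study I mean something like the following sketch: group the forecasts into probability bands and compare the average forecast in each band with how often the event actually happened. This is not the Hosmer-Lemeshow statistic itself, only the table that such a test starts from. The function name and the data are invented for illustration.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins and compare the
    mean forecast in each bin with the observed frequency of the event.

    Returns rows of (bin index, count, mean forecast, observed frequency)."""
    bins = defaultdict(list)
    for p, f in zip(forecasts, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 joins the top bin
        bins[index].append((p, f))
    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_forecast = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(f for _, f in pairs) / len(pairs)
        table.append((index, len(pairs), mean_forecast, observed))
    return table

# Made-up forecasts and outcomes, purely for illustration.
example_forecasts = [0.1, 0.1, 0.1, 0.1, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]
example_outcomes = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
example_table = calibration_table(example_forecasts, example_outcomes, n_bins=5)

for index, count, mean_forecast, observed in example_table:
    print(index, count, round(mean_forecast, 2), round(observed, 2))
```

A well calibrated forecaster would show the mean forecast and the observed frequency close together in every band.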
What were the results?
The results are not given in the book, only some anecdotes. Competitor Doug Lorch, a former IBM programmer it says, answered 104 questions in the first year and achieved a Brier score of 0.22. He was fifth on the drop list. The top 58 competitors, the superforecasters, had an average Brier score of 0.25 compared with 0.37 for the balance. In the second year, Lorch joined a team of other competitors identified as superforecasters and achieved an average Brier score of 0.14. He beat a prediction market of traders dealing in futures in the outcomes, by 40% the book says, though it does not explain what that means.
I don’t think that I find any of that, in itself, persuasive. However, there is a limited amount of analysis here on the (old) Good Judgment Project website. Despite the reservations I have set out above there are some impressive results, in particular this chart.
The competitors’ Brier scores were measured over the first 25 questions. The 100 with the lowest scores were identified, the blue line. The chart then shows the performance of that same group of competitors over the subsequent 175 questions. Group membership is not updated. It is the same 100 competitors as performed best at the start who are plotted across the whole 200 questions. The red line shows the performance of the worst 100 competitors from the first 25 questions, again with the same cohort plotted for all 200 questions.
Unfortunately, it is not raw Brier scores that are plotted but standardised scores. The scores have been adjusted so that the mean is zero and standard deviation one. That actually adds nothing to the chart but obscures somewhat how it is interpreted. I think that violates all Shewhart’s rules of data presentation.
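To see what standardisation throws away, here is a small sketch with two invented sets of raw scores. I am assuming the usual z-score, subtracting the mean and dividing by the standard deviation. The website does not say exactly how the scores were adjusted.

```python
from statistics import mean, pstdev

def standardise(scores):
    """Rescale scores to mean zero and standard deviation one (z-scores)."""
    centre = mean(scores)
    spread = pstdev(scores)
    return [(s - centre) / spread for s in scores]

# Two hypothetical sets of raw Brier scores. The gap between best and
# worst is ten times wider in the second set than in the first.
narrow = [0.30, 0.31, 0.32, 0.33, 0.34]
wide = [0.10, 0.20, 0.30, 0.40, 0.50]

z_narrow = standardise(narrow)
z_wide = standardise(wide)

print([round(z, 3) for z in z_narrow])
print([round(z, 3) for z in z_wide])
```

Both sets standardise to the same values, so a chart of standardised scores cannot show whether the best competitors were ahead by a little or by a lot.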
That said, over the first 25 questions the blue cohort outperform the red. Then that same superiority of performance is maintained over the subsequent 175 questions. We don’t know how large the difference in performance is because of the standardisation. However, the superiority of performance is obvious. If that is a mere artefact of the data then I am unable to see how. Despite the way that data is presented and my difficulties with Brier scores, I cannot think of any interpretation other than there being a cohort of superforecasters who were, in general, better at prediction than the rest.
Tetlock comes up with some tentative explanations as to the consistent performance of the best. In particular he notes that the superforecasters updated their predictions more frequently than the remainder. Each of those updates was counted as a fresh prediction. I wonder how much of the variation in Brier scores is accounted for by variation in the time of making the forecast? If superforecasters are simply more active than the rest, making lots of forecasts once the outcome is obvious then the result is not very surprising.
That may well not be the case as the book contends that superforecasters predicting 300 days in the future did better than the balance of competitors predicting 100 days. However, I do feel that the variation arising from the time a prediction was made needs to be taken out of the data so that the variation in, shall we call it, foresight can be visualised. The book is short on actual analysis and I would like to have seen more. Even in a popular management book.
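One simple way of taking the timing out, sketched here with invented records, would be to average Brier scores separately for each band of days before the question resolved, so that like is compared with like. The record layout and the function name are my assumptions. I have no idea how the study data is actually structured.

```python
from collections import defaultdict

def mean_brier_by_lead_time(records, bucket_days=100):
    """Average Brier score for each (group, lead-time bucket).

    Each record is (group, days_before_resolution, brier_score)."""
    totals = defaultdict(lambda: [0.0, 0])
    for group, days, score in records:
        bucket = (days // bucket_days) * bucket_days  # 0, 100, 200, ...
        totals[(group, bucket)][0] += score
        totals[(group, bucket)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Invented records, for illustration only.
records = [
    ("super", 310, 0.30), ("super", 120, 0.20),
    ("super", 5, 0.02), ("super", 3, 0.02),
    ("rest", 320, 0.45), ("rest", 110, 0.35), ("rest", 8, 0.05),
]
by_lead_time = mean_brier_by_lead_time(records)

for key in sorted(by_lead_time):
    print(key, round(by_lead_time[key], 3))
```

If the superforecasters still came out ahead within each band then their advantage could not be put down to late, easy forecasts.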
The data on the website on purported improvements from training is less persuasive, a bit of an #executivetimeseries.
Some of the recommendations for being a superforecaster are familiar ideas from behavioural psychology. Be a fox not a hedgehog, don’t neglect base rates, be astute to the distinction between signal and noise, read widely and richly, etc.
There was one unanticipated and intriguing result. The superforecasters updated their predictions not only frequently but by fine degrees, perhaps from 61% to 62%. I think that some further analysis is required to show that that is not simply an artefact of the measurement. Because Brier scores have a squared term they would be expected to punish the variation in large adjustments.
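The effect of the squared term can be sketched as follows. Suppose, purely hypothetically, that the event really does have a probability of 0.6 and compare the expected two-outcome Brier score for forecasts that miss that figure by one point and by ten points. The function name and the numbers are mine.

```python
def expected_brier(q, p):
    """Expected two-outcome Brier score for stating probability q when the
    event actually occurs with probability p."""
    score_if_event = (q - 1) ** 2 + ((1 - q) - 0) ** 2
    score_if_no_event = (q - 0) ** 2 + ((1 - q) - 1) ** 2
    return p * score_if_event + (1 - p) * score_if_no_event

true_p = 0.6  # assumed for illustration, never known in practice
baseline = expected_brier(0.60, true_p)
small_miss = expected_brier(0.61, true_p) - baseline  # forecast off by 1 point
large_miss = expected_brier(0.70, true_p) - baseline  # forecast off by 10 points

print(round(baseline, 4))    # 0.48
print(round(small_miss, 4))  # 0.0002
print(round(large_miss, 4))  # 0.02
```

A miss ten times as large costs a hundred times as much in expected score, so forecasters who move in small steps would be favoured by the scoring rule whatever the quality of their foresight.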
However, taking the conclusion at face value, it has some important consequences for risk assessment which often proceeds by broadly granular ranking on a rating scale of 1 to 5, say. The study suggests that the best predictions will be those where careful attention is paid to fine gradations in probability.
Of course, continual updating of predictions is essential to even the most primitive risk management though honoured more often in the breach than the observance. I shall come back to the significance of this for risk management in a future post.
There is also an interesting discussion about making predictions in teams but I shall have to come back to that another time.
The amateurs out-performed the professionals on global politics. I wonder if the same result would have been encountered against experts in structural engineering.
Professor Tetlock invites you to join the programme at http://www.goodjudgment.com.