Bryan Caplan  

Superforecasting: Supremely Awesome


I'm already an unabashed Tetlock fanboy.  But his latest book, Superforecasting: The Art and Science of Prediction (co-authored with Dan Gardner but still written in the first person) takes my fandom to new levels.  Quick version: Philip Tetlock organized one of several teams competing to make accurate predictions about matters we normally leave to intelligence analysts.  Examples: "Will the president of Tunisia flee to a cushy exile in the next month?"  "Will an outbreak of H5N1 in China kill more than ten in the next month?"  "Will the euro fall below $1.20 in the next twelve months?"  Tetlock then crowdsourced, carefully measuring participants' traits as well as their forecasts.  He carefully studied the traits of the top 2% of performers, dubbing them "superforecasters."  And he ran lots of robustness checks and side experiments along the way.

Quick punchline:
The strongest predictor of rising into the ranks of superforecasters is perpetual beta, the degree to which one is committed to belief updating and self-improvement.  It is roughly three times as powerful a predictor as its closest rival, intelligence.
But no punchline can do justice to the richness of this book.  A few highlights...

An "obvious" claim we should all internalize:
Obviously, a forecast without a time frame is absurd.  And yet, forecasters routinely make them...  They're not being dishonest, at least not usually.  Rather, they're relying on a shared implicit understanding, however rough, of the timeline they have in mind.  That's why forecasts without timelines don't appear absurd when they are made.  But as time passes, memories fade, and tacit time frames that seemed so obvious to all become less so. 
The outrageous empirics of how humans convert qualitative claims into numerical probabilities:
In March 1951 National Intelligence Estimate (NIE) 29-51 was published.  "Although it is impossible to determine which course of action the Kremlin is likely to adopt," the report concluded, "we believe that the extent of [Eastern European] military and propaganda preparations indicate that an attack on Yugoslavia in 1951 should be considered a serious possibility." ...But a few days later, [Sherman] Kent was chatting with a senior State Department official who casually asked, "By the way, what did you people mean by the expression 'serious possibility'?  What kind of odds did you have in mind?"  Kent said he was pessimistic.  He felt the odds were about 65 to 35 in favor of an attack.  The official was startled.  He and his colleagues had taken "serious possibility" to mean much lower odds.

Disturbed, Kent went back to his team.  They had all agreed to use "serious possibility" in the NIE so Kent asked each person, in turn, what he thought it meant.  One analyst said it meant odds of about 80 to 20, or four times more likely than not that there would be an invasion.  Another thought it meant odds of 20 to 80 - exactly the opposite.  Other answers were scattered between these extremes.  Kent was floored.
A deep finding that could easily reverse if widely known:
How can we be sure that when Brian Labatte makes an initial estimate of 70% but then stops himself and adjusts it to 65% the change is meaningful?  The answer lies in the tournament data.  Barbara Mellers has shown that granularity predicts accuracy: the average forecaster who sticks with tens - 20%, 30%, 40% - is less accurate than the finer-grained forecaster who uses fives - 20%, 25%, 30% - and still less accurate than the even finer-grained forecaster who uses ones - 20%, 21%, 22%.  As a further test, she rounded forecasts to make them less granular... She then recalculated Brier scores and discovered that superforecasters lost accuracy in response to even the smallest-scale rounding, to the nearest .05, whereas regular forecasters lost little even from rounding four times as large, to the nearest 0.2.
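Mellers's rounding test is easy to replicate in miniature. Here's a minimal sketch (my own illustration with made-up forecasts and outcomes, not data from the book): the Brier score is just the mean squared error between probability forecasts and 0/1 outcomes, and re-scoring after rounding to coarser grids shows how much accuracy the extra granularity was carrying.

```python
# Illustrative sketch of the rounding test described above.
# The forecasts and outcomes below are made up for demonstration.

def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def round_to(forecasts, step):
    """Round each forecast to the nearest multiple of `step`."""
    return [round(f / step) * step for f in forecasts]

forecasts = [0.63, 0.07, 0.81, 0.22, 0.94, 0.38]  # hypothetical fine-grained forecasts
outcomes = [1, 0, 1, 0, 1, 1]                     # what actually happened

print(brier(forecasts, outcomes))                  # raw Brier score
print(brier(round_to(forecasts, 0.05), outcomes))  # after mild rounding
print(brier(round_to(forecasts, 0.2), outcomes))   # after coarse rounding
```

If a forecaster's fine distinctions carry real information, the rounded scores degrade; if not, they barely move.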
More highlights coming soon - the book is packed with them.  But if any book is worth reading cover to cover, it's Superforecasting.

COMMENTS (19 to date)
Steve Reilly writes:

Funny, I'm reading it right now and just turned to check Facebook for a second when I saw your post. But do you consider yourself a hedgehog?

Sam writes:

I haven't read Superforecasting cover to cover, only skimmed. But to counter the glowing review by Bryan, it's worth emphasizing that one major claim of the book is highly counterintuitive and perhaps empirically suspect. Namely, Tetlock argues that superforecasters working in teams, with beliefs aggregated using certain statistical techniques, can outperform the consensus of a prediction market (with "non-super" forecasters participating as traders) that incentivizes (to some degree) participants to express their true subjective beliefs. As far as I can tell, he does have some empirical evidence for this, and I believe the result might even replicate under an identical experimental design. But I'm not that confident it is a *robust* finding -- it's not clear to me that one can conclude, for example, that "non-super" forecasters participating in a state-of-the-art prediction market would collectively perform worse than superforecasters participating (as a team, and with statistical aggregation techniques) in a predictive survey. To be fair, I think Tetlock displays an admirable degree of epistemic modesty in discussing these issues. But one should be cautious in interpreting his findings nonetheless. I don't think the Good Judgment Project findings are dispositive with respect to the relative importance of mechanism design (i.e. setting up even ordinary participants with a set of good incentives) versus idiosyncratic psychological factors (i.e. the thought process of "normals" versus superforecasters), assuming that one's ultimate objective is in fact predictive accuracy.

Tim Minto writes:

Sam, I can provide you with a bit more background on your concern that comparing "superforecasters" in teams vs. "non-supers" in a prediction market is an apples-to-oranges comparison.

I was a "superforecaster" in the teams condition. In season 4 of the experiment - which happened largely after the main content of the book was written - the experiment did divide the supers into 2 groups: 130 working as 10 13-person teams, and 130 in a market condition. The result is that (after their aggregation algorithms were applied), the bottom-line aggregated forecast of the teams condition did beat the market condition. But I don't know the levels of statistical significance on that. I expect that the research team will be producing a more detailed report on these season 4 findings in the near future.

Khodge writes:

I like your quick punchline. That is the approach I use in the stock market. I find that I have a good feel for a stock if I keep an eye on it. In my desk job, on the other hand, I usually only pay attention at month end and I quickly lose the feel of the movement when I miss a month or two.

I'm looking forward to reading the book.

Bedarz Iliaci writes:

the odds were about 65 to 35

What do numerical odds actually mean? And are the numerical odds any more significant or informative than the term "serious possibility"?

Isn't there an error in quantifying what cannot be quantified by its very nature?

Floccina writes:

So what would be a superforecast for world temperatures, sea levels, and ocean acidification damage? I ask because the forecasters are so far apart that I do not know what to think. I have lost confidence in both sides of the AGW debate. I want to believe Matt Ridley and Indur Goklany, but with so many seemingly rational people on the other side, I have little confidence.

Khodge writes:

Floccina raises an interesting point as it relates to superforecasting teams. At what point do you have groupthink rather than something valuable?

John Davies writes:

Barbara Mellers has shown that granularity predicts accuracy: the average forecaster who sticks with tens - 20%, 30%, 40% - is less accurate than the finer-grained forecaster who uses fives - 20%, 25%, 30% - and still less accurate than the even finer-grained forecaster who uses ones - 20%, 21%, 22%.

All very good and interesting. But there's a danger of Goodhart's Law coming into play. If this becomes generally known, every purveyor of bad and made-up forecasts is going to start quoting numbers to three decimal places, to sound credible.

Richard writes:

What do numerical odds actually mean? And are the numerical odds any more significant or informative than the term "serious possibility"

Isn't there an error in quantifying what can not be quantified by its very nature?

I think it's quite clear. If you take all the times that someone predicted a 65% chance of something happening, then the event in question should've happened 65% of the time. The more the results deviate from that, the worse the forecaster is.
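Concretely (a toy sketch I made up, not anything from the book), you can bucket a forecaster's record by stated probability and compare each bucket to the observed frequency of the event:

```python
# Illustrative calibration check: for each stated probability, what fraction
# of the time did the event actually occur?  Data below is hypothetical.
from collections import defaultdict

def calibration(forecasts, outcomes):
    """Map each stated probability to the observed frequency of the event."""
    buckets = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        buckets[f].append(o)
    return {f: sum(os) / len(os) for f, os in buckets.items()}

# Hypothetical record: four forecasts at 65% (event occurred 3 of 4 times),
# two forecasts at 90% (event occurred both times).
forecasts = [0.65, 0.65, 0.65, 0.65, 0.9, 0.9]
outcomes = [1, 1, 1, 0, 1, 1]

print(calibration(forecasts, outcomes))  # {0.65: 0.75, 0.9: 1.0}
```

A well-calibrated forecaster's observed frequencies would sit close to the stated probabilities across all buckets.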

Jon Murphy writes:

Thanks for the tip! I'm going to buy this book tonight. I'm an economic forecaster and I'd love to see how I can use this for my job.

Sam writes:

@Tim, Thanks for the additional info. I'm definitely curious about the statistical significance of the findings of the season 4 experiment you describe.

My impression from the book (and from the paper by Atanasov et al on markets versus panels) was that the comparison was based upon the market condition in earlier seasons. One concern I had was that the super teams got to "skim" the top forecasters from the market condition. It sounds like the season 4 experiment you describe does not suffer from that problem.

But there is still the matter of making statistical adjustments to the team's forecasts and not the market's. Why is that a fair comparison? Prediction markets are also known to suffer from certain biases, such as favorites versus long-shots. These just happen not to be the same as for predictive surveys. (For example, applying the extremization algorithm to market forecasts would probably reduce their accuracy.)

Finally, even the season 4 experiment you described was with play money. Despite the existence of a small literature (basically 1 empirical study) claiming that play money is just as effective as real money in incentivizing PM participants, I think this is far from rigorously established.

An experiment that would be interesting: give 100 superforecasters $10K each, and give $1M to Phil Tetlock. Tetlock gets exclusive access to the statistically aggregated forecasts of another 100 superforecasters (working in teams, etc). Now let Tetlock trade in a prediction market with the 100 individual supers (who of course might spontaneously choose to self-organize into teams, and to apply statistical aggregation / adjustments to their team members' predictions). After some number of years, each individual super gets to keep their $10K, plus or minus any trading profits/losses, while each member of Tetlock's panel gets 1/100th of Tetlock's $1M (plus or minus any profits/losses). What would the final distribution of assets look like? If you think Tetlock's side would "win", how much more than $10K would you pay for the right to be a member of his team?

Tim Minto writes:

@Sam: I think your points are quite valid here - my guess is that the season 4 research will show that teams-vs.-markets may be a promising avenue for further research, but won't conclusively say much beyond that. Another thing to consider (also discussed in the book) is that even if they are "super", 130 forecasters isn't a deep and liquid enough market to capture all the characteristics of, say, the daily market in AAPL or EURUSD.

One other interesting feature of teams-vs.-markets is that (if I recall correctly, and I don't have access to all the research numbers - I was just a lab rat here), the entire numerical difference between the two conditions, over 100+ questions, could be accounted for in 3 questions that were pretty "fuzzy" from a resolution standpoint: we didn't know for sure that the event happened at the time, and only with clarifying info released after the fact was an outcome determined:
- Would an ISIS-claimed attack take place in Saudi Arabia within a specified time period? (Yes, ultimately)
- Would a post-Charlie Hebdo terrorist attack meeting certain descriptive characteristics (related to geography and reporting in open-source media) occur within the following two months? (Yes, ultimately)
- Would China conduct blue-water naval exercises beyond a defined geographic area within a specified time period? (Yes, ultimately)

More research needed to see if these were dumb luck (these are only 3 questions, and anything can happen in 3 coin flips), or if there is some characteristic of the teams-based discussion process which somehow elicited good ideas that couldn't get drawn out of a market.

Hmmm writes:

Floccina---why do you "want" to believe either side? That will bias your judgment. Right now, it appears the 'earth is not warming' crowd are about done. Still leaves open the 'are humans notably causing it' questions, the 'can we do anything about it considering costs-to-benefits' questions, and the 'is it so bad anyway' questions. I expect that time will tell the 'deniers' were materially wrong, but that's (probably) a long time and so for now you have to just look at obtained data, look for problems with such data, and see where it leads you.

Right now, the ocean salinity and heat maps say--get ready for a hot and wet winter California! But decades are small sample sizes in this game, so (1) predictions about a particular season, or (2) trend analysis are difficult.

Sam writes:

@Hmmm, I agree with you about the difficulty of predictions. I think even NOAA has very low confidence in their seasonal forecast for CA weather this winter. There are several posts discussing ENSO impacts on US precipitation. The story is quite complicated, with the impacts depending on the magnitude of the El Nino event. Since there are only two analogs in the modern record for the current El Nino event (1982, 1998), suffice it to say that the data don't speak very loudly about what to expect.

@Tim, I agree, liquidity is definitely an issue. In fact, I think 130 *active* participants might make the GJP experiment one of the more liquid PMs in history, which is kind of sad. Right now there are relatively active markets in a number of political questions; one of the most liquid is "Will Biden run?" which has about $130K of open interest. Since the site limits each participant to $850 of risk on a question, there are at least 160 or so bettors, but I'd estimate the actual number as closer to 1,500. Still very small potatoes compared to most financial markets, even for small illiquid stocks.

Also, you make a good point about the confounding effect of ambiguous questions. I did very poorly in the most recent season of GJP (on the public PM); in a last-ditch attempt to salvage my account balance, ultimately I decided to go for broke betting against a post-Charlie Hebdo terrorist attack, so I remember that particular resolution well! (I've found that my own forecasting skills are much worse in foreign affairs than in other domains like science, sports, domestic politics, etc. That's probably the most important thing I learned from the IARPA tournament!)

DK writes:

I would disagree that only three questions decided it all. I just looked at the research spreadsheet that was copied to the X-teams scores file: Teams aggregation (sLogit) topped Market on 111 questions and lost to Market on 24 questions (there was one tie). There were 20 questions where Teams algorithm won with over 0.1 margin and 15 questions where Teams lost by more than 0.1 margin. The sum of the top three highest margins of winning is remarkably similar: 1.1548 for Market wins and 1.1523 for Market losses.

Several "Supermarket" participants I talked to (like Tim, I myself was in the team condition) told me that they felt like they cared more about doing well in this play money market than in the real market. Also, there were a few financial market questions in Season 4: WTI spot price, Nikkei index, Gold spot price, Euro/USD exchange rate, VSTOXX index, etc. We don't have any official data on how our forecasts compared with others, but my personal impression is that the "supers" were doing pretty well on these.

Bedarz Iliaci writes:


the event in question should've happened 65% of the time.

True in physics but it would not work for non-repeatable events, would it?
Such as forecasting a particular war, for instance.

Richard writes:
True in physics but it would not work for non-repeatable events, would it? Such as forecasting a particular war, for instance.

True enough, but if you want to judge how good a forecaster is you can just look at their results across different predictions and compare them to others.

Sandy Sillman writes:

I am also a "Superforecaster" (and was a teammate of Tim Minto). I also happen to be an atmospheric scientist in "real life" (now retired), so I want to address Floccina's question.

About climate change in general, I think the question (inside the US, at least) is a bit personal: who do you believe and why? I can only assure you that the IPCC (the organization that issues climate prognoses) is extremely cautious and conservative in outlook. They only state conclusions when there is overwhelming evidence in support, and always based on publications that have appeared in peer-reviewed scientific literature. Scientists aren't perfect, and do occasionally succumb to their own biases. But they face a very critical audience - other scientists - and are very careful on that account. Furthermore, the more extreme or speculative results always get weeded out in the IPCC, which is a cautious and consensus document by nature. I think if you compare the care and precision of scientists who work in the field with the carelessness of skeptics, who operate more in the realm of politics and rumor, it should be clear who is more honest and credible.

There is a huge difference between "science writers" and real scientists. Writers, often with their own agenda, can get away with loads of garbage. You can't get away with much when you submit an article to a peer-reviewed science journal. So you have on one side people like Michael Mann, who have worked through the math used to represent the earth's atmosphere in climate models and the statistical difficulties involved in obtaining estimates of past climate from fossils, worked through the data, and gone through the extensive review and criticism needed to get work on these published. On the other side you have writers like Matt Ridley, who get to go through Michael Mann's work (including every private email he has ever written) and pick out pieces that can be made to look bad, often out of context, to be published in political magazines where the only thing that matters is a smooth presentation.

Sadly, I think the global warming threat (and associated acidification of the oceans) is real and is almost impossible to exaggerate.

How Superforecasters might approach this: first, it is impossible to evaluate how well people forecast climate change in this type of tournament, because any meaningful question would have a ten-year time horizon at least, a bit tricky to squeeze into a 4-year study.  The tournament did ask two climate-related questions, but they were short term (what would happen over the next few months), more similar to weather forecasts than climate change.  Quite typically, in these questions I think most Superforecasters would put aside the question of climate change (as in what might happen over the next 30 years) and focus on what will happen over the next 3 months.  Second, most Superforecasters approach questions by seeking out forecasts from experts - not me, but true experts on 3-month forecasts.  We tend to be "inside information" hounds.  The climate questions in the tournament also allowed us to do our own numerical projections: how did rates of Arctic summer ice melt vary in recent years and could we use that to project a rate for this year, for example.  But even here we often used results from standard weather/climate forecast models (the same type of models that forecast global warming).  As Superforecasters, we use climate scientists.  We don't try to become climate scientists ourselves.

The question of what source to trust became important - and murky - in many questions. For example, we were asked to forecast elections in Guinea, Sierra Leone, Chile, and other places about which none of us knew anything beforehand. We often had to go through unfamiliar newspapers and decide which ones were fact-based, which ones were well-intentioned but with blind spots, and which ones were partisan rags.

So I don't know what I can say as a superforecaster, since everyone needs to use their own brains to figure out these things (and, as Tetlock's book reported, Superforecasters fell all over the political spectrum, all with different sources of information). But I sure can say a lot as a climate scientist.

I was a bit dubious about the claim that granularity predicts accuracy. My usual assumption is that extra granularity means that the people involved don't understand significant figures.

On the other hand, one simple way to increase your accuracy is to take the average of other people's estimates. Such an average would be far more granular and more accurate than a raw estimate.

On the gripping hand, forecasters who copy each other are not exactly useful.
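A toy simulation of the averaging point (my own, under an assumed simple noise model; nothing like this appears in the book): if several forecasters each see the true probability plus independent noise, their average is both finer-grained and, on average, more accurate (lower Brier score) than any single forecaster.

```python
# Toy simulation: averaging independent noisy forecasts reduces Brier score.
# The noise model (uniform +/- 0.25, clipped to [0, 1]) is an arbitrary assumption.
import random

random.seed(1)

def brier(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def noisy(p):
    """One forecaster's estimate: truth plus independent noise, clipped to [0, 1]."""
    return min(max(p + random.uniform(-0.25, 0.25), 0.0), 1.0)

n_events, n_forecasters = 2000, 5
truths = [random.random() for _ in range(n_events)]           # true event probabilities
outcomes = [1 if random.random() < p else 0 for p in truths]  # realized events

panels = [[noisy(p) for _ in range(n_forecasters)] for p in truths]
averaged = [sum(panel) / n_forecasters for panel in panels]   # crowd average
single = [panel[0] for panel in panels]                       # one lone forecaster

print(brier(single, outcomes))    # one noisy forecaster
print(brier(averaged, outcomes))  # crowd average: lower (better)
```

Of course, as the comment notes, this only works when the forecasters' errors are genuinely independent; forecasters who copy each other add granularity without adding information.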
