Can Policymakers Trust Forecasters? | IFP

Introduction

In their 1990 book Uncertainty, Morgan and Henrion investigated experts’ ability to predict the future, concluding that “elicited expert judgments may be seriously flawed, but are often the only game in town.” But newer research argues for other ways to predict the future, besides asking experts. A major finding of the young field studying “epistemic institutions” like forecasting is that some nonexperts match or beat experts at predicting the future.

Consider Philip Tetlock’s 2005 claim that most experts performed little better than chance (and no better than “dilettantes”); or the 2013 competition in which the best amateur forecasters were said to beat CIA agents by a full 30%; or internet users outdoing an aggregate of epidemiologists during the COVID-19 pandemic. One might expect these top forecasters to provide a clear benefit to policymakers, enabling them to reduce error in planning for future events.

However, on reviewing the available evidence, our confidence in the “forecaster advantage” took a hit. Although the top judgment-based forecasters seem slightly better than domain experts, the handful of studies available don’t find that forecasters as a general class maintain the huge advantage mentioned above.

Additionally, even policymakers friendly to forecasting have not found it the top priority in a crisis. As superforecaster Michael Story observed about UK COVID-19 policy:

“During the pandemic, Dominic Cummings said some of the most useful stuff that he received and circulated in the British government was not forecasting. It was qualitative information explaining the general model of what’s going on, which enabled decision-makers to think more clearly about their options for action and the likely consequences. If you’re worried about a new disease outbreak, you don’t just want a percentage probability estimate about future case numbers, you want an explanation of how the virus is likely to spread, what you can do about it, how you can prevent it.”

Despite all this, the superforecaster phenomenon appears real: the very best forecasters are more skilled than others at predicting, in arbitrary domains, even given little prior knowledge. But where does that leave policymakers looking for insight into the future? If you’re a policymaker, how can you know when to seek expert counsel, invoke statistical models, query the forecasters, or do all of the above? Our problem is matching the right platform to the right policymaker, based on the information they need, and augmenting one method of forecasting with the others.

Of these potential sources of predictions, generalist forecasters can indeed outperform experts, but more research is needed about the size of their advantage and the domains they can be expected to work best in. But we know enough to proceed: generalist forecasting seems to have utility for policymakers, even without definitive answers to those questions. Existing research suggests that combining sources in decision-making can improve outcomes and that forecasting can be an effective tool for improving government decision-making, even if it turns out to be only somewhat more accurate than experts.

We now review the options: experts, statistical models, judgment-based forecasters, and prediction markets.

Asking the experts

You know them already: the academics, the consultants, the standard sources for policymakers. But Tetlock’s 2005 book Expert Political Judgment found that “area experts” (regional specialists in, say, Afghanistan) were in fact not much better at predicting the future of their chosen region than random guessing. An unsourced leak to the Washington Post even claimed that expert intelligence analysts (like NSA and CIA agents) were outperformed by a full 30% by the best amateurs informed by Google and common sense.

The playing field is actually a bit more level than these eye-catching stats imply. A closer look at the 30% claim led us to an unpublished study that indeed describes a 35% difference – but the paper’s comparison between experts and superforecasters was not apples-to-apples: the aggregation method used for the forecasters is known to produce better results than the one used for the expert analysts.¹ A fair comparison found a 10% advantage for the amateurs, not reaching statistical significance.

Accuracy and calibration aren’t the only reasons you might value expertise. This is perhaps especially true of academics, who often focus on producing explanations of a phenomenon rather than predictions alone. Engineers can use such explanations to control systems, and this control is often the main purpose of our seeking information. In a real-world problem – like deciding what factors to consider when engineering quake-resistant buildings – the geologist will always be a necessary input.

To recap: experts are probably not as uniformly bad at prediction as the first wave of Tetlock-inspired skepticism indicated, and in any case their predictive acumen does not exhaust their value. Nevertheless, experts are not the only reliable source for forecasts. And we see signs that experts can improve greatly if exposed to better methods of forecasting.

Asking statistical models

Policymakers regularly look to statistical models created by data analysts as another source of forecasts. In 2005, when Expert Political Judgment was published, statistical models still performed better than the few human forecasters who had been sampled. But later studies found that the best humans outperform statistical models.

Humans also have a clear edge in generality: data-driven methods are only as good as the data they’re driven by; and often there is no data. Conversely, statistical models do extremely well when predicting continuous trends (as opposed to a chaotic system like weather), when fed piles of clean and fresh data, and when modeling a stable quantity (as opposed to something “anti-inductive” like a financial market or an adversary reacting to predictions about it). Most of the world is missing at least one of these conditions, putting fundamental limits on the use of models.

Statistical models are likely to see major improvements, though. Data on everything is increasing in availability, and progress in machine learning is exploding in many fields; this progress should reach forecasting models sooner or later.² A huge consequence of better formal models comes from the domino effect: better model outputs are an input to better human forecasting, and vice versa.

Asking top forecasters

Judgment-based forecasting (as opposed to statistical forecasts) took off over the last decade after the power of talented amateurs was prominently demonstrated in a 2010 tournament run by the national security organization IARPA. Importantly, it turned out that forecasting skill was partially trainable and quite readily identifiable.

Forecasters aren’t magic, of course, and sometimes national intelligence merits its name. Consider the invasion of Ukraine: beginning in early December 2021, U.S. intelligence said the risk of invasion was “notable,” and later “very high.” It took Metaculus users until mid-February to give it more than 50% probability; on 18th February 2022, days before the Battle of Kyiv, forecasters gave only 15% to the proposition “Will Russian troops enter Kyiv?”

But at their best, forecasters provide astounding results. One of the earliest warnings about COVID-19 came from an anonymous user of Metaculus, a public forecasting platform. At the time, many public health experts and journalists were ignoring or downplaying the risk. Metaculus users overall also beat a panel of experts at predicting how many COVID-19 cases and deaths the U.S. would have had by certain dates. Overall we see good reason to suspect a 10% or more advantage for the best forecasters over experts not trained in probabilistic forecasting.

One reason for the advantage may be that experts are sometimes unable to reveal their true opinion due to professional obligations or reputational risks. For example, some departments that fund science may face institutional incentives to present a rosy picture about the bets they’ve made, whereas external forecasters can take a less biased perspective. Relatedly, forecaster performance tends to be more legible, because forecasters need a public track record to gain credibility in the first place; publications and institutional affiliation do much of this work for an expert. Scoring their performance not only creates a useful feedback loop for the forecasters, but also allows outsiders to identify top performers.
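
To make "scoring their performance" concrete, here is a minimal sketch using the Brier score, a common scoring rule for probability forecasts: the mean squared difference between the stated probability and the 0/1 outcome, where lower is better. The function name and the two track records are invented for illustration, and individual platforms may well use other scoring rules.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.

    0.0 is a perfect score; always answering 50% scores 0.25.
    """
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need equally sized, non-empty inputs")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical track records on the same four resolved yes/no questions.
outcomes = [1, 0, 1, 1]
confident_forecaster = [0.9, 0.2, 0.8, 0.7]
hedging_forecaster = [0.5, 0.5, 0.5, 0.5]

confident_score = brier_score(confident_forecaster, outcomes)
hedging_score = brier_score(hedging_forecaster, outcomes)
print(f"confident: {confident_score:.3f}, hedging: {hedging_score:.3f}")
```

Because both forecasters answered the same questions, their scores are directly comparable, which is the kind of legibility a public track record offers.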

This highlights a counterintuitive benefit to consulting forecasters: even if the best expert makes better predictions than the average forecaster in a given domain, you may want to seek out forecasters in addition to experts, if forecasters can be evaluated more easily than experts in that domain — since then your choice is between listening to the best forecaster versus an average expert.³

Recent work on COVID-19 predictions underscores the mutual benefit of pairing forecasting skill with domain knowledge. A talk from Hypermind describes a project that tracked experts and experienced forecasters (and people who were both) across 217 infectious disease forecasting problems. Experienced forecasters were 3% better than the expert average, while, as expected, experts who were also experienced at forecasting performed best of all.⁴ Giving experts training in forecasting methods appears to make them better than their non-trained expert counterparts and generalist forecasters alike.

One forecasting formula is to take some smart generalists, team them up to discuss each question, then pool their final beliefs into one strong “aggregate” forecast. When organized into teams collaborating on forecasts, superforecasters seem to have a competitive advantage over studied domain experts.
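
As a rough illustration of what "pooling" can mean, the sketch below shows a few common ways to combine individual probabilities into one aggregate: a simple mean, a weighted mean, the median, and the geometric mean of odds. The team forecasts and the weights are hypothetical, and these are generic pooling rules rather than the specific methods used in the studies discussed here.

```python
import math
from statistics import median


def mean_pool(probs, weights=None):
    """Weighted arithmetic mean of probabilities (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(probs)
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)


def geometric_odds_pool(probs):
    """Average the log-odds, then convert back to a probability."""
    log_odds = [math.log(p / (1 - p)) for p in probs]  # requires 0 < p < 1
    mean_log_odds = sum(log_odds) / len(log_odds)
    return 1 / (1 + math.exp(-mean_log_odds))


# Hypothetical team forecasts for one yes/no question.
team = [0.6, 0.7, 0.9]
# Hypothetical weights, e.g. derived from each member's past accuracy.
past_performance_weights = [1.0, 1.0, 2.0]

simple = mean_pool(team)
weighted = mean_pool(team, past_performance_weights)
middle = median(team)
pooled_odds = geometric_odds_pool(team)
print(simple, weighted, middle, pooled_odds)
```

The choice of rule matters: the same three forecasts yield noticeably different aggregates, which is why comparisons between groups are only fair when both are pooled the same way.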

Another way to aggregate forecasts is to use a prediction market. Prediction markets, which create betting pools for particular forecasts, are thought to be valuable because of the assumption that markets aggregate information well: in particular, that they synthesize public information and are reasonably resistant to manipulation. To make a market mislead people, you need to dump money into the wrong end of the bet – which creates an automatic stimulus for other users to bet against you, correct the manipulation, and thus collect your money. Popular markets (with millions of dollars wagered per market) should typically be safe from this attack.
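
The incentive to correct a manipulated price can be shown with simple arithmetic. The sketch below assumes a stylized binary contract that pays $1 if the event happens, with a "no" contract costing one minus the "yes" price; fees and the time value of money are ignored, and all numbers are hypothetical.

```python
def expected_profit_per_contract(price_yes, believed_prob_yes, side):
    """Expected profit from buying one $1-payout binary contract.

    side is "yes" or "no"; a "no" contract is assumed to cost 1 - price_yes.
    """
    if side == "yes":
        return believed_prob_yes - price_yes
    if side == "no":
        return (1 - believed_prob_yes) - (1 - price_yes)
    raise ValueError("side must be 'yes' or 'no'")


# A manipulator has pushed "yes" up to 0.60; an informed trader believes 0.30.
profit_no = expected_profit_per_contract(0.60, 0.30, "no")
profit_yes = expected_profit_per_contract(0.60, 0.30, "yes")
print(f"buy 'no': {profit_no:+.2f}, buy 'yes': {profit_yes:+.2f}")
```

Under these assumptions, the further the manipulator pushes the price from what informed traders believe, the larger the expected profit from betting against them.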

Other industries pool judgments at enormous scale. Stock markets produce forecasts as a byproduct: people buy stocks predicting that prices will rise, or short them predicting they will fall. This can predict a surprising amount about the world – witness the crash in Russian stocks a full week before the Ukraine invasion of 2022.

However, prediction markets have been promising for a long time – which may actually be concerning. Here’s an excited Economist piece from 2005, featuring Nobel laureates; here’s one from 1995. Markets haven’t been living up to their promise: why?

The devil may be in the details here. Prediction markets need to be large, liquid, legal, and accessible in order to function reliably. Meeting all of these criteria in the current regulatory climate is difficult and expensive, and failure might mean a market performs no better than any other aggregation method. Other factors include: is the market a peer-to-peer exchange, or run by a market-maker who subsidizes the forecasts? How much money is at stake (since higher stakes lead to increased motivation, which increases accuracy)? Is it easy to create questions and find them a betting audience?

One should, however, expect all markets to have slightly wrong prices, since income-maximizing users wouldn’t take trades if they couldn’t expect to make something off them. Market forecasts should therefore be off by a decent rate of return.
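
A simplified way to see the size of this effect: if a trader demands some annual return on money tied up in a contract, the most they will pay is the expected payout discounted by that return. The function name, the 5% required return, and the one-year horizon below are all illustrative assumptions; the sketch ignores fees and risk aversion.

```python
def max_price_worth_paying(believed_prob, required_return, years_to_resolution):
    """Highest price for a $1-payout contract that still earns the required annual return."""
    return believed_prob / (1 + required_return) ** years_to_resolution


# A trader thinks the event is 95% likely, wants 5% a year, and must wait one year.
ceiling = max_price_worth_paying(0.95, 0.05, 1)
print(f"will not pay more than {ceiling:.3f}")
```

In this toy case the trader stops buying at a price a few points below their actual belief, so the market price understates the probability by roughly the required return.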

For a mixture of practical, scaling, and regulatory reasons, the prediction market dream is yet to arrive and not obviously getting closer. In 2022, one of the largest prediction market platforms, PredictIt, was ordered to shutter its U.S. business (though a temporary stay was granted). The end of U.S. operations is usually a death knell, as in the case of Intrade and DAGGRE. Prediction markets are over-regulated, expensive, and hard to get right.⁵

Combinations

In practice, all the above forms of prediction are constantly blended. Markets are just one way of arranging forecasters (and often include domain experts as participants). The original judgment-based forecasting study arranged U.S. intelligence experts into a prediction market. Great forecasters often use statistical models and statements by domain experts as inputs; good experts use models too. Bayesian analysis often uses human judgment to initialize a data-driven model. And everyone uses polls where they can.

Furthermore, not only do these groups often overlap in the real world, but mixing the four sources often helps. One trend that cropped up repeatedly in reviewing the research was the advantage of teams, crowds, and aggregations of collective predictions. This fits even at the cognitive level: Tetlock’s catchiest finding was that “foxy” thinkers, who base their opinions on a broad range of theories and sources, consistently beat single-theory “hedgehog” thinkers. The same aggregation approach wins at all scales – having more sources of information helps groups as much as individuals; and a single forecaster can improve their accuracy by being skeptical of their own intuitions and intentionally integrating very different viewpoints.

Beyond mere correctness

It’s difficult to rank the different kinds of forecasting, but we’ve pulled together the available good studies here. Tentatively, we assert the following: in domains researchers have studied (like politics), top forecasters average a small advantage over both experts and markets. Current markets are also somewhat better than statistical models, as well as the experts that participated. Ensembles are likely the best of all.

Research comparing top forecasters with top experts could clarify particular domains or question types where forecasters have better predictive ability. Forecasting can help generate insights, describe risks, highlight areas of contention, and more. But to leverage these abilities properly, more research is essential on the relative advantages of different predictive methods for making decisions.

But there’d be much to like about forecasting even if it didn’t beat experts in accuracy and calibration. For example:

Judgment-based forecasting provides uncertainty estimates: not just “I think this will happen,” but “We think it’s 70% likely,” and “People disagree about how likely it is by 30%, on average.” Uncertainty estimates help a community reason better about likelihoods and expected costs.
Forecasts create common knowledge, a public good. Public forecasts (like prediction markets) can provide an informed starting point for discussions, and can prevent cases where private information all points one way, but no one wants to risk saying it first.
Judgment-based forecasting is scalable. Since forecasters are less oversubscribed than experts in any particular area, it seems a reasonable assumption that our experts have less time to make and update forecasts. As a result, generalist forecasters can massively expand the range and volume of what gets forecast. It’s fine to use superteams and expert committees if one is only predicting a few hundred important questions a year. But the full vision of what forecasting could do for society involves many thousands of questions.
Forecasting is relatively resistant to manipulation. In a world where forecasting becomes common, bettors and clever rogues spend their time making predictions about COVID-19 or policy effects, and media outlets take accuracy and calibration seriously, all kinds of powerful actors would likely want to control predictions. But prediction markets are structured to foil them when they do (with the crucial caveat that the market has a larger volume than the attacker is willing to spend).
Forecasting allows us to identify ‘epistemic peers’: policymakers can choose to listen to people who make fewer and smaller mistakes. Society’s existing method of doing this is seeing who makes it through a PhD, but track records have many advantages, as a more direct measure, and, for forecasters willing to be publicly tested, we can check their scores in seconds. For instance, Scott Adams is a popular pundit, but has a track record somewhat worse than a coin flip. Similarly, Metaculus’ Public Figure Predictions project looks at the people that dominate the media and helps us assess whether they should dominate.
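
Checking a track record for calibration, as opposed to raw accuracy, can be sketched as follows: group a forecaster's past predictions into probability buckets and compare each bucket with how often those predictions came true. The function name, the bucket width, and the data are hypothetical.

```python
from collections import defaultdict


def calibration_table(probabilities, outcomes, bucket_percent=10):
    """Group forecasts into percentage buckets and report how often each came true.

    Returns {bucket_lower_bound_percent: (number_of_forecasts, observed_frequency)}.
    """
    buckets = defaultdict(list)
    for p, outcome in zip(probabilities, outcomes):
        percent = round(p * 100)  # whole percentages avoid float edge cases
        lower = min(percent // bucket_percent * bucket_percent, 100 - bucket_percent)
        buckets[lower].append(outcome)  # a 100% forecast joins the top bucket
    return {
        lower: (len(hits), sum(hits) / len(hits))
        for lower, hits in sorted(buckets.items())
    }


# Hypothetical resolved forecasts from one person (1 = happened, 0 = did not).
probabilities = [0.72, 0.75, 0.70, 0.78, 0.20, 0.25, 0.90, 0.95]
outcomes = [1, 1, 0, 1, 0, 0, 1, 1]

table = calibration_table(probabilities, outcomes)
for lower, (count, frequency) in table.items():
    print(f"{lower}-{lower + 10}%: {count} forecasts, {frequency:.0%} came true")
```

For a well-calibrated forecaster, the observed frequency in each bucket should fall roughly within that bucket's range; with only a handful of forecasts per bucket, as here, the comparison is too noisy to conclude much.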

How can policymakers use forecasters?

In her memoir, former National Security Council senior director Samantha Power reflected on the 2011 decision to intervene in Libya (which she endorsed): “We could hardly expect to have a crystal ball when it came to accurately predicting outcomes in places where the culture was not our own.” In hindsight, of course, the decision to intervene proved catastrophic, but critics warned against the decision even at the time. Could external forecasters have helped policymakers better understand the likelihood of Libyan collapse?

Even if forecasters don’t always outperform subject-matter experts, they can be helpful to policymakers, though how helpful depends largely on each policymaker’s particular needs. Policymakers need more than accuracy and calibration: they also need generally applicable insights or patterns. Some are looking to manage worst-case outcomes; others, like science funders, value forecasts for “portfolio balancing,” analyzing which parts of R&D funding are relatively risky bets. Still other policymakers may find forecasting helpful because of its decomposition of issues into clean chains of events: by breaking up domains of expertise into specific and testable questions, forecasting can help experts refine what questions they actually want to ask.

In some cases, the rationale behind a forecast may be more valuable than the forecast itself, because it flags hidden assumptions a policymaker may not have considered, or provides a contrarian “red-teaming” viewpoint. It may be helpful to use outside forecasts as a metric for improving one’s own internal forecasting ability, or to build internal forecasts to aggregate employee insights. Simple secret polls of your colleagues are underrated, given their practicality. It’s fairly difficult and expensive to track down a domain expert, but you can always poll your own office, and this might be better if you’re in a hurry.

More grandly: many people want to improve society’s critical abilities — how accurate we are, how proportional our confidence is, and how robust to manipulation we are. To that end, combinations of the above groups and methods are often the winning strategy: cross-pollinating experts and forecasters; teaching experts simple rationality techniques; teaching forecasters more about the domains they’re most interested in; developing better ways to aggregate predictions; and encouraging more collaboration could all produce unprecedented information sources.

What does all this imply for policymakers? They should keep track records and quantify their confidence in memos. They should consider implementing simple, anonymous internal polling, as a quick way of getting a sanity check on a plan. They should pay attention to public predictions with great track records. And they should consider devoting part of their planning and strategy budgets to contracting top external forecasters – it could pay off enormously.

You yourself can take advantage of the fundamental insight of forecasting: that sometimes outsiders match or beat the insiders. Take your own ideas seriously, keep track of your exact positions, and vote your credences. If others were right, update towards them. Don’t just do your own research: find out if you should. Join in, and take note of what those who join in say. You might catch something everyone missed; you might be the first in the world to notice.

Gavin Leech and Misha Yagudin run Arb, a research consultancy. Misha co-leads Samotsvety, a forecasting boutique.

Gavin Leech helps lead Arb, a research consultancy in the mould of Rethink Priorities, RAND, and IIASA, where he focuses on forecasting, metascience, history, and other important niche areas.

Misha Yagudin co-runs Arb Research, a research consultancy, and Samotsvety Forecasting, and co-founded and advises quantifiedintuitions.org, a research non-profit focused on epistemics.

¹ In particular, the amateur forecasts were weighted according to each forecaster’s past performance, while the expert market used weights proportional to each analyst’s current subjective confidence. Various other helpful tricks were also applied to the amateurs.
² Zou et al. (2022) describe the trend: when general text-producing “language models” are asked forecasting questions, they currently give answers far below human performance – but they improve fairly fast with more data and bigger models.
³ Obviously when a domain is extremely technical, like research mathematics, people without deep technical expertise will most likely not even understand the question and so will not make sensible predictions above baseline. Similarly, untrained intuition fails fast in fields like physics. There is much less good research on forecasting competence than expected, and studies often don’t control for important factors, such as crowd size, whether both groups are answering the same questions, and whether their forecasts were updated with the same frequency. The available evidence suggests experts in general can be outperformed by simple models – and that top forecasters in turn outperform these models. However, the research says little about whether top experts are outperformed by top forecasters, a question with massive implications for policymakers. There is simply not enough data comparing experts and judgment-based, generalist forecasters to provide confidence about their respective predictive value.
⁴ Note that this might undersell the forecaster advantage, since these experts had all self-selected into forecasting.
⁵ If you’re interested in trying out judgment-based forecasting, we’ve collated a list of active platforms here.