Against "Nothing Ever Happens"
Tetlock, Rossi, Chudjak, and out-of-sample predictions
Remember Covid?
In retrospect, it’s shocking how bad experts were at predicting how the pandemic would play out. Vox put out a tweet on January 31, 2020 that asked and answered its own question: “Is this going to be a deadly pandemic? No.”
The same month, the WHO said that preliminary investigations found no evidence of human-to-human transmission of the disease and Anthony Fauci declared that Covid was “not a major threat to people in the United States.” BuzzFeed ran a piece with the headline “Don’t Worry about the Coronavirus, Worry About the Flu” and on February 1, 2020 the Washington Post similarly published something with the headline “Get a grippe, America. The flu is a much bigger threat than coronavirus, for now.”
Not that I knew much better. Even in mid-March, when I learned that my high school would shut down for two weeks, I figured we’d be back within a month.
But there were some people out there who were better-equipped to see that Covid could spiral into something devastating. Just a week after that Vox tweet, one of its own reporters, Kelsey Piper, wrote a piece in which she mostly highlighted everything that we didn’t know about the disease:
The coronavirus killed fewer people than the flu did in January. But it might kill more in February — and unlike the flu, its scope and effects are poorly understood and hard to guess at…
It is unclear whether China will be able to get the outbreak under control or whether it will cause a series of epidemics throughout the country. It’s also unclear whether other countries — especially those with weak health systems — will be able to quickly identify any cases in their country and avoid Wuhan-scale outbreaks.
The point is, it’s simply too soon to assert we’ll do well on both those fronts — and if we fail, then the coronavirus death toll could well climb up into the tens of thousands… That’s just far too much uncertainty to assure people that they have nothing to worry about.
Piper did not successfully predict that Covid would turn into a destabilizing global pandemic. In fact, even the disastrous outcome she mentions here involves tens of thousands of deaths rather than the millions that we later saw. Still, her coverage stands out as unusually prescient for its time because of the epistemic humility it shows. She took great pains to explain that it was very difficult to model how things would proceed given all of the information we lacked about this novel virus, and that we should proceed with caution because of those unknowns.
Epistemic humility is especially valuable when considering something without much precedent, like Covid-19. But it’s frankly a great place to start when confronting any issue because we don’t understand nearly as much about the world as we think we do.
Consider Philip Tetlock’s study of political experts. Over the course of decades, he collected 82,000 predictions from 284 experts in their fields: economists, political scientists, journalists, intelligence analysts. Their predictions barely outperformed chance, and did worse than simple statistical models that extrapolated recent trends into the future. The years they spent thinking about their area of expertise meant little when it came to forecasting how the world would evolve.
Tetlock borrowed philosopher Isaiah Berlin’s construction of the hedgehog and the fox to understand different approaches to forecasting. His experts were the hedgehogs, knowing one thing very well and trying to apply that thing to the world. But Tetlock also worked with “superforecasters,” people who may not be experts in any one area but who are unusually good at modeling the future. These foxes know many little things rather than one big thing, and are willing to update their beliefs when they take in new information.
The idea that our understanding of the world around us is deeply flawed isn’t limited to political forecasting. Progressive sociologist Peter Rossi spent decades evaluating social programs and arrived at a heuristic that he called the “Iron Law of Evaluation”: the expected net impact of any large-scale social program is approximately zero.[1] Of course, there are many programs that we have found to be promising, which is why he came up with a second “Stainless Steel” law: that the better designed the impact assessment of any social program, the more likely the resulting estimate of net impact is to be zero.
Rossi believed some programs did work. This brings us to another of his metallic laws, the Zinc Law, which states that only those programs that are likely to fail are evaluated. He called Social Security our most successful social program, and noted that it had never been evaluated rigorously because its impact is so obvious.
When we consider these laws as a reminder that far more programs fail than we might initially expect, they’ve aged well. As methods have advanced, programs we’d believed promising have yielded null results under rigorous evaluation. DARE had no measurable impact on drug use. Microloans, once hailed as a poverty-reduction breakthrough, produced modest-to-null effects in a wave of randomized controlled trials. The replication crisis has extended this pattern into psychology and economics, where we’ve repeatedly overestimated effect sizes.
This brings us to a fourth and final metallic law, which we can call Chudjak’s Aluminum Law: “nothing ever happens.” Aluminum isn’t as sturdy as iron or stainless steel, but that fits here because Chudjak’s Aluminum Law is a lazier heuristic than either of Rossi’s. It is right about some things: most predicted catastrophes don’t arrive, most breathless takes are wrong, and “things continue roughly as they were” is a reasonable default for most questions on most days.

But aluminum bends, especially when we’re dealing with something out-of-sample. Without a relevant historical comparison, too often experts default to “nothing ever happens” until they have incontrovertible evidence that something is in fact happening. Scott Alexander wrote the canonical piece on probabilistic thinking during Covid: people were told a global pandemic might arrive, radically different in scope from anything they had seen, and they waited for proof before believing it. As he put it, the reasoning was essentially “there’s no reason to panic, there are currently only ten cases in the US.” Alexander notes that this should sound like “there’s no reason to panic, the asteroid heading for Earth is still several weeks away.”
His alternative is a willingness to act on calibrated estimates. If we believed that the asteroid only had a 10% chance of hitting the Earth, we wouldn’t tell people to stop worrying—we would prepare. Every week spent confirming the threat is a week you could have been preparing for it.
In addition to the Aluminum Law, there’s the problem of the reflexive hedge: insisting that we just don’t know and refusing to form a view. That may look like epistemic humility, but it isn’t; it’s a way to avoid being wrong without having to commit to being right. The better approach is developing some model for how things will go and being willing to revise it when the world pushes back. Tetlock’s best forecasters weren’t agnostic. They held views and changed them as evidence came in.
Another one of Tetlock’s studies found that only one trait was correlated with forecasting accuracy among his experts: fame. And it was an inverse correlation, with more famous experts making less accurate predictions.
This work was done in the 80s and 90s, when a famous expert would engage with the public through cable news and newspapers. Experts who would make more confident assertions were favored in those environments. Our institutions were selecting against the epistemic humility that is so important for forecasting.
Today that selection pressure has intensified. News reaches most people through social media, and algorithmic feedback loops are much tighter than the editorial judgment of TV producers and newspaper editors in Tetlock’s era. The rewards to confidence over calibration are larger, faster, and more systematic than they were.[2]
The same dynamic operates in politics, and arguably more intensely. Political power requires attention, and attention in the current environment goes to confidence. A politician who speaks with epistemic humility is at a structural disadvantage against one who projects certainty. The structures that put people into power select against this kind of calibrated thinking.[3]
This matters more right now than it usually would, because we’re moving into unprecedented territory on several fronts at once. War looks different than it did a decade ago, with drone swarms, cyber operations, and proxy conflicts whose dynamics don’t map onto Cold War templates. The information environment has been radically restructured by the internet and its algorithmic sorting, and is about to be restructured again by AI-generated content at scale. And then there’s AI itself, which is not like previous economic transitions. The steam engine and the internet were tools that extended human capability. What happens when cognition itself becomes a metered utility, something you buy by the token? It’s very hard to model, although others have tried.
Google executive James Manyika believes that “AI is the Industrial Revolution plus the Enlightenment.” It would have been impossible to model the changes from the Industrial Revolution while it was happening. Humanity had simply never seen such rapid changes in the standard of living. According to the best data we have today, GDP per capita had been essentially flat for thousands of years before the Industrial Revolution began. In the 250 years since, it has grown exponentially, and as a result our lives today have little resemblance to those of people who lived before 1800.
We don’t know if this transformation will be like that one. But if we’re reasoning from first principles, then the ability to make cognition practically free is the kind of input shock that should produce an immense shift. So even if you believe there’s only a modest probability that we see anything like “Enlightenment + Industrial Revolution,” that scenario should dominate the planning calculus, because its magnitude swamps the others.
How do we start to evaluate the likelihood of such an event?
Prediction markets aren’t a panacea, and I have real concerns with their current implementation. Still, at their best they aggregate forecasts better than any individual expert can, and their prices track reality better than pundit consensus. Platforms like Polymarket and Kalshi have proven well-calibrated across tens of thousands of resolved questions: when they price something at 70%, it happens about 70% of the time.
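To make "well-calibrated" concrete, here is a minimal sketch of how one could check calibration on a set of resolved questions: bucket the forecasts by stated probability and compare each bucket's average forecast with how often the event actually happened. The function name `calibration_table` and the toy data are hypothetical and invented for illustration; this is not how Polymarket or Kalshi report their own numbers, and the data is not real market data.

```python
from collections import defaultdict

def calibration_table(forecasts, n_bins=10):
    """Bucket (probability, outcome) pairs into equal-width bins and return
    (mean forecast, observed frequency, count) for each non-empty bin."""
    bins = defaultdict(list)
    for prob, happened in forecasts:
        # Clamp so that a forecast of exactly 1.0 lands in the top bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((prob, happened))
    table = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_forecast = sum(p for p, _ in items) / len(items)
        observed = sum(1 for _, h in items if h) / len(items)
        table.append((round(mean_forecast, 2), round(observed, 2), len(items)))
    return table

# Made-up toy data: ten questions priced at 70%, five priced at 20%.
toy = (
    [(0.7, True)] * 7 + [(0.7, False)] * 3
    + [(0.2, True)] * 1 + [(0.2, False)] * 4
)

for mean_forecast, observed, count in calibration_table(toy):
    print(f"forecast ~{mean_forecast:.0%}, happened {observed:.0%} of the time (n={count})")
```

A perfectly calibrated source would show the two percentages matching in every bucket. With real data the buckets need many resolved questions each before the comparison means much.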
Markets achieve this calibration by properly aligning incentives. When there’s money on the line, a shaky consensus held up by the naive notion that “nothing ever happens” will begin to fall apart. And what these markets are currently pricing in is remarkable.
Computer scientists Scott Aaronson (former OpenAI employee, working on AI safety) and Boaz Barak (current OpenAI employee, also working on AI safety) came up with a “Five Worlds of AI” framework for considering how the technology might evolve, and how our world will evolve with it. Forecasters on Metaculus[4] currently assign roughly the following probabilities for which of those five worlds we will be in by 2050[5]:
10% chance: AI-Fizzle. AI progress plateaus and the technology ends up comparable to nuclear power in societal impact.
30% chance: Futurama. AI progress continues and we see changes similar in scope to those of the Industrial Revolution. AI systems are better than human experts at a wide range of tasks but are still largely used as tools by humans, and AI has a large positive impact on society.
30% chance: AI Dystopia. Similar technological scale to Futurama, but the outcome is bad. Power concentrates in governments and corporations, mass surveillance maintains that power, and large-scale job losses follow.
18% chance: Singularia. AIs recursively self-improve, effectively becoming an alien species with their own goals. They happen to be benevolent, granting humanity material abundance.
12% chance: Paperclipalypse. Same technological trajectory as Singularia, but the AIs are either opposed to human existence or indifferent to it in ways that cause our extinction.
These numbers suggest that we should all be thinking a lot about how to steer our world towards the good outcomes and away from the catastrophic ones, but they aren’t necessarily representative of some ground truth. Maybe you think 269 Metaculus forecasters are too immersed in a community with a distorted view of AI. Plenty of serious thinkers land in very different places: Eliezer Yudkowsky, for instance, puts the probability of Paperclipalypse near 99%, while Zvi Mowshowitz is around 70% and Dean Ball is under 1%.
But absent strong and well-researched personal convictions, prediction markets are the best source of information we have.
The numbers matter because the relative probabilities shape what the right response is. If we're headed for AI-Fizzle, deemed the least likely at 10%, political regulation barely matters.[6] If we're headed for Industrial-Revolution-scale change, regulation is crucial to steer us toward Futurama and away from AI-Dystopia. And if we're headed for Singularia or Paperclipalypse, regulation may have limited purchase, but mandated safety work and slower development could still shift the outcome.
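The three cases in the paragraph above can be tallied directly from the five rough probabilities listed earlier. This is a small sketch of that arithmetic; the group labels (`fizzle`, `industrial_scale`, `singularity_scale`) are shorthand introduced here for illustration, not Aaronson and Barak's terms, and the inputs are the approximate figures quoted in this piece rather than a live Metaculus feed.

```python
# Rough probabilities quoted in the text for the five worlds (they sum to 100%).
WORLDS = {
    "AI-Fizzle": 0.10,
    "Futurama": 0.30,
    "AI Dystopia": 0.30,
    "Singularia": 0.18,
    "Paperclipalypse": 0.12,
}

# Hypothetical group labels matching the three regulatory cases discussed above.
GROUPS = {
    "fizzle": ["AI-Fizzle"],
    "industrial_scale": ["Futurama", "AI Dystopia"],
    "singularity_scale": ["Singularia", "Paperclipalypse"],
}

def group_totals(worlds, groups):
    """Sum the probability mass of the worlds in each group."""
    return {
        name: round(sum(worlds[w] for w in members), 2)
        for name, members in groups.items()
    }

totals = group_totals(WORLDS, GROUPS)

# Probability mass on the two outcomes the framework describes as bad.
bad_outcomes = round(WORLDS["AI Dystopia"] + WORLDS["Paperclipalypse"], 2)

for name, p in totals.items():
    print(f"{name}: {p:.0%}")
print(f"bad outcomes (AI Dystopia + Paperclipalypse): {bad_outcomes:.0%}")
```

On these figures, roughly 60% of the probability sits on Industrial-Revolution-scale change, where the text argues regulation matters most, and about 42% sits on the two bad outcomes combined.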
But the institutions that should be using this kind of information rarely have the incentive to do so.
Even Kelsey Piper, the Vox reporter who warned in early February of 2020 that there was a chance that the coronavirus could spiral out of our control, was affected by this misalignment.
In late March of 2020, after America had become the country with the most confirmed Covid cases in the world, she reflected on Twitter that her private intuition had been more accurate than her public writing. She’d told her family in early February to plan for quarantine, but in print she’d pulled her punches; she didn’t want to sound alarmist and didn’t want to step out ahead of public health officials who were still saying the risk was low.
This is illustrative of the quandary facing news organizations: they have many different variables to optimize for. Accurately forecasting the future is somewhere on the list, but so are institutional deference, keeping your audience happy, and search engine optimization. Any news outlet’s output is shaped by the whole bundle, which is why Piper was simultaneously privately convinced that Covid would lead to national quarantines and publicly hedging.
Similar problems face politicians and practically every other institution that matters. We’ve established that our systems select against calibrated people. But a second problem is that even when calibrated people get through, their incentives are misaligned. Piper is a well-calibrated thinker, and her private analysis was on the money, but she felt the structural pressure to hedge anyway.
The Aluminum Law has a payoff structure that makes it locally rational even when it’s globally catastrophic. Flagging a tail risk early costs you credibility in the short term, even if you turn out to be right, as no one wants to second-guess the experts. But when our society collectively misjudges tail risks, everyone pays the price.
The local cost is concentrated and immediate, and the global cost is diffuse and delayed. Anyone responding rationally to those incentives will hedge. Distance from institutions makes it easier to acknowledge tail risks. So does having a place to look up the actual probabilities.
Prediction markets start to bridge that gap. They convert the value of being right into a local payoff that any individual forecaster can capture. This is an environment that flips the asymmetry, where being calibrated pays and being confidently wrong costs you.
Will leaders use these tools enough? Probably not, because of the many forces selecting against humble leaders who would defer to such tools. But normalizing their application moves the equilibrium. The alternative is to keep running on poor epistemics. We are entering an out-of-sample world, and seeing our environment clearly is crucial.
Don’t be like Chudjak.
[1] It’s important to note that Rossi worked extensively on social programs during his career! He was a liberal who wanted programs to work, spent his career evaluating them rigorously, and arrived at his iron law as an honest finding rather than an ideological one. As Charles Fain Lehman put it, Rossi’s “association with policy pessimism” was “in spite of his politics and his vocation.”
His view was that social programs often fail because it is very difficult to design effective programs, and that the people who end up designing these programs usually do not have the skills and knowledge needed. In essence, the competitive pressures that elevate people to the position of designing programs do not select for the high degree of expertise needed to make a program effective.
[2] A 2021 study from Serra-Garcia and Gneezy found that non-replicable papers in top journals actually get cited more often than replicable ones, even after failed reproductions of the non-replicable paper! This is a similar bias to the one Tetlock found: attention selects against calibration across domains.
[3] There’s another Alexander piece that talks about why our leaders are so often “legibly mediocre,” in which he highlights that leaders are optimizing not merely for being right but for being right while also holding onto power. This connects to Rossi’s Iron Law: for a social program to make a difference we need highly skilled people to design it, and the “legibly mediocre” experts that we often have designing those programs just aren’t good enough.
See also Zvi Mowshowitz’s response, which argues that the people with the best world models are filtered out long before they would have access to power.
[4] Unlike Polymarket and Kalshi, Metaculus uses play money. The academic record is mixed on whether play-money markets are as well-calibrated as real-money ones. It’s believed that play-money markets still have properly aligned incentives because, within the community of forecasters using the platform, social status is tied to well-calibrated predictions.
One additional source of concern: the probabilities can get a bit wonky when some of the outcomes involve human extinction. In that scenario, there would be no payout for Metaculus users who bet accurately. They’re still incentivized to bet in the direction that they think the odds will move between now and then, but that incentive is flawed. This particular Metaculus market flags this and requests that forecasters predict their true beliefs.
[5] A note on timing: the resolution date for this market is 2050, but we'll likely know which world we're in much sooner. Metaculus forecasters put the median for a “general-level AI system” in 2032, and the handful of years after AGI should settle most of the ambiguity.
[6] Unless, of course, AI fizzling is endogenous to regulation. Aaronson & Barak note this possibility, asking the question: “Could AI also fizzle by political fiat?” If we believe that there’s even a very low probability of AI Dystopia or Paperclipalypse occurring, then we should be trying to make AI fizzle by political fiat, although that may be effectively impossible.





