Earlier today (on August 7th), I realized that I was about to miss a golden opportunity to quickly test how calibrated I was about AI progress by predicting some GPT-5 benchmark scores. Upon realizing this, I dashed off some quick predictions based on a combination of then-current top benchmark scores and my intuition about how big a jump GPT-5 would be over o3, Grok 4, Claude Opus 4, and Gemini 2.5 Pro. You can see the predictions and current resolutions in the next section. My one-sentence summary of their underlying sentiment is “progress continues apace along the trendline but without a discontinuous jump.”
My predictions
If I had more time, I would’ve predicted on even more things. I especially regret not predicting METR’s time horizon benchmark results. But, as they say, done is better than perfect, so here’s what we ended up with.
Analysis & takeaways
How did I do?
Even without all my predictions having resolved, I feel comfortable concluding that I did decently well. As you can see below, I predicted meaningful but incremental progress on SWE-Bench Verified and FrontierMath (tiers 1-3). In the first case, my maximum-probability range turned out to contain GPT-5’s actual performance right in the middle. For FrontierMath, GPT-5 just made it into my second most likely, slightly higher range when using “pro” mode and tools. Finally, I predicted that GPT-5 wouldn’t make a massive jump over Grok 4 on ARC-AGI 2, and was correct. For the unresolved predictions, my current bet is that GPT-5 will top the creative writing leaderboard, is a toss-up for the Connections one, and will not place top 10 in the world on Codeforces. So my biggest whiff was potentially overestimating progress on NYT Connections.
As mentioned above, I generated my predictions via hand-wavy trend-line extrapolation (sketched below). For each benchmark, I basically looked at how the most recent top models did relative to their predecessors and used that to make a rough guess of how much better GPT-5 would be, assuming a similar trend. With the caveat that this is a very small n, my methodology worked reasonably well for predicting GPT-5’s performance.
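To make the hand-waving slightly more concrete, here’s a minimal sketch of the kind of extrapolation I mean. The scores, the “fraction of remaining headroom” heuristic, and the function name are all illustrative assumptions for this post, not the exact numbers or procedure I used.

```python
# Illustrative only: guess the next model's score on a 0-100 benchmark by
# assuming it closes the same fraction of remaining headroom as the last jump.

def extrapolate_next_score(scores: list[float], ceiling: float = 100.0) -> float:
    """Naive trend extrapolation from a short history of benchmark scores."""
    if len(scores) < 2:
        raise ValueError("Need at least two prior scores to extrapolate a trend.")
    prev, last = scores[-2], scores[-1]
    prior_headroom = ceiling - prev
    # Fraction of the available headroom the last jump closed.
    fraction_closed = (last - prev) / prior_headroom if prior_headroom > 0 else 0.0
    # Apply that same fraction to the headroom that remains now.
    return last + fraction_closed * (ceiling - last)

# Made-up SWE-Bench-Verified-style numbers for three successive top models:
print(extrapolate_next_score([55.0, 65.0, 72.0]))  # ~77.6
```

The shrinking-step assumption is just one way to encode “progress continues but benchmarks saturate”; in practice I eyeballed the jumps rather than running anything this explicit.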
In terms of what this implies about AI progress, my main takeaway is kind of boring - GPT-5 is another impressive but on-trend step on a fast but predictable climb across a range of coding, math, and agentic benchmarks. It’s not a 0-to-1, game-changing inflection point, but it’s also not a sign of the brick wall we’ve been hearing about for years.
If we look beyond the benchmarks I made predictions for, the story is a bit more mixed, but overall holds up. On certain benchmarks not captured by my predictions, like HealthBench and evaluations of economically important tasks, it made bigger jumps.

On others, like OpenAI PRs, it didn’t significantly improve over o3 and ChatGPT agent.

But, on the whole, especially factoring in the METR time horizon evaluation, “on trend” feels right as a conclusion.

Going meta
As I’ve written about previously (paywalled), I have become a lot more AGI-pilled over the past two years. I wish I could easily translate my “on trend” conclusion to an update about what to expect over the next few years, but it’s surprisingly hard. Ultimately, the big question I want to know the answer to is whether predictable gains on these benchmarks result in general agents that can act as drop-in remote workers, catalyze recursive self-improvement, and act as geniuses in data centers.
I increasingly feel like the popular static benchmarks aren’t sufficient for guessing that in advance. On one hand, if a model came out tomorrow that magically saturated all these benchmarks, then that would update me towards rapid progress. On the other, if jumps on these benchmarks continue to be big but predictable by trends, then I think the additional information gained from exactly predicting the pace of their progress from here through to saturation is small.
Instead, going forward, I plan to pay attention to a mix of harder static benchmarks like SciCode and RE-Bench, complemented by dynamic benchmarks like Claude (and now other models) Plays Pokemon, VideoGameBench, and the BALROG benchmark. The former have been more resistant to progress to date and capture skills that are more relevant to my own and AI researchers’ work. The latter complement the former by measuring currently missing capabilities like long-horizon control, memory over hours, continual adaptation, and tool use within non-stationary environments. Similar to Dwarkesh, I suspect that for the purpose of monitoring general AI progress, these are the capabilities we should be paying even closer attention to.
What about the value of this exercise itself? Were the 30 minutes I spent nailing down predictions worth it? The 90 minutes I spent writing this post?
Philip Tetlock and others have written about how making predictions helps forecasters filter out the noise and really focus on what they concretely expect, and that resonates here. It’s very easy for me to get caught up in the fervor of how back we are or how over it is with every launch, but stepping back and focusing on predictions can help me escape the all-or-nothing mindset. For that reason, while I suspect this specific instance was not that valuable in isolation, the habit of making these sorts of predictions and reflecting on them feels valuable.
On the other hand, if I’m going to do such exercises in the future, I’ll spend more time picking prediction targets whose resolution can meaningfully move my views. As is probably clear from the above, my “on trend” conclusion provided more than zero new information, but it certainly didn’t maximize my information gain either.
Finally, when it comes down to it, if nothing else, making these predictions lets me feel superior to all the people who talk about how right they were in hindsight without showing the epistemic virtue of predicting beforehand! That’s priceless. (Kidding, mostly…)