2026 AI Predictions
“People keep asking if I’m back, and I haven’t really had an answer, but yeah, I’m thinking I’m back.”
Happy New Year!
The past few years, I haven’t done annual predictions because I felt like they were sort of missing the point. I wasn’t gaining clarity on what to expect or how I could improve my thinking. I was mostly just extrapolating trends (i.e. look at progress in recent years and assume the trend will continue). But as 2026 approaches, I’ve felt an urge to try and snapshot my current thinking about AI progress. Lo and behold, once I started writing down my thoughts, I found that predictions focused on validating/invalidating my views felt useful again.
This post contains my overall view and the resulting predictions from a longer thinking exercise. Enjoy!
Note: For all predictions, default resolution date is end of 2026 unless specified otherwise.
Overall View
I am feeling the AGI even more than I was when I first fessed up a year and a half ago. From where I sit, the predictions coming from the “there is no wall” camp have remained much more accurate than those coming from the “hitting a wall” camp. I feel like we are solidly in a slow takeoff world barring some exogenous shock that cuts off the compute faucet. That said, I also find myself violently agreeing with Seb Krier that optimists continue to overestimate how fast diffusion will be and underestimate path dependency.
Putting these two together, I expect what I think of as “jagged AGI” to emerge over the next few years: systems that are superhuman at some tasks, mediocre at others, and weirdly bad at things that seem like they should be easy. Maybe memory gets solved next year; maybe continual learning remains a bottleneck; maybe something I’m not tracking turns out to matter most. But I expect jaggedness to persist for at least 2-3 years. Beyond that, the picture gets blurry for me.
Predictions
In spreadsheet form. Thanks, Claude!
My predictions focus on areas I am both interested in and on which I think I have a differentiated perspective. That means notably omitted areas include whether AI is a bubble and what will happen to the job market. In both cases, I have some opinions1 but don’t think they’re particularly well informed or novel.
Scaling
Best estimates for SoTA models at the end of 2026 suggest training FLOPs >=10x those of the current SoTA models (Claude Opus 4.5, Gemini 3 Pro, GPT 5.2 (although the latter is likely smaller than the other two), Grok 4.1): 65%
My methodology here admittedly may be a bit messy. For example, with GPT 4.5, I’m pretty confident it was at least a 2-3x scale-up from GPT 4 due to a mix of rumors, articles, and other hints. Worst case, if I can’t resolve this, I’ll just mark it as ambiguous.
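For concreteness, the back-of-envelope math behind this kind of estimate is usually the standard ~6·N·D approximation for dense training compute (N parameters, D training tokens). The parameter and token counts below are made-up illustrative values, not estimates of any real model:

```python
# Rough training-compute comparison using the common ~6 * N * D
# FLOPs approximation (N = parameters, D = training tokens).
# All model numbers below are ILLUSTRATIVE assumptions, not leaks.

def train_flops(params: float, tokens: float) -> float:
    """Approximate total training FLOPs as 6 * N * D."""
    return 6.0 * params * tokens

# Hypothetical current-SoTA run: 1T params on 20T tokens.
current = train_flops(1e12, 20e12)   # 1.2e26 FLOPs

# Hypothetical end-of-2026 run: 3T params on 70T tokens.
future = train_flops(3e12, 70e12)    # 1.26e27 FLOPs

print(f"scale-up: {future / current:.1f}x")  # >= 10x resolves YES
```

The messiness is entirely in the inputs: N and D for frontier models are inferred from rumors and hints, so the ratio carries wide error bars even though the arithmetic is trivial.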
Lab leaders still saying scaling continues to pay off for at least one of pre-training or RL at end of 2026: 70%
Epoch Capabilities Index trend continues to look linear at the end of next year: 80%
Coding
Dario’s prediction that 90% of code will be written by LLMs will be obviously true for a large fraction (50+%) of working programmers in the tech industry: 80%
SWE-Bench Verified will be clearly saturated (>90% performance): 90%
SciCode (source) and SWE-Bench Pro (source) will both have top scores above 70%: 70%
I will reliably be able to delegate, with only an initial prompt (or back-and-forth a la Plan Mode), coding tasks that would have taken me half a day to do manually: 85%
My default workflow for writing code will involve managing >1 agent in parallel: 90%
>=5 agents in parallel: 45%
Best METR 50% time horizon score (source):
>=16 hours (trend continues): 65%
>=24 hours (trend accelerates): 25%
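The “trend continues” threshold falls out of a simple doubling-time extrapolation. The current-horizon and doubling-time numbers below are assumptions for illustration only; plug in METR’s published fits for a real estimate:

```python
# Extrapolate METR's 50%-success time horizon under exponential growth.
# ASSUMED inputs for illustration: a ~4-hour best horizon today and a
# ~6-month doubling time (substitute values from METR's published fits).

def horizon_after(months: float, current_hours: float = 4.0,
                  doubling_months: float = 6.0) -> float:
    """Projected 50% time horizon (hours) after `months` on-trend."""
    return current_hours * 2 ** (months / doubling_months)

print(f"{horizon_after(12):.0f} hours")  # 12 months out -> 16 hours
```

Under these assumptions, “trend continues” lands right at the 16-hour line by end of 2026, while “trend accelerates” requires the doubling time to shrink.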
Biology and the sciences
SoTA models score above human level on all 6 LAB-Bench tasks (source: system cards/other leaderboards): 90%
FrontierScience-Olympiad best score >90%: 65%
FrontierScience-Research best score:
>50%: 75%
>70%: 60%
FrontierMath Tiers 1-3 best score (source):
>60%: 80%
>80%: 50%
FrontierMath Tier 4 best score (source):
>40%: 70%
>60%: 40%
AI makes meaningful mathematical breakthrough (e.g. solves a Millennium Problem or similarly impactful open conjecture) where the AI did the majority of the work: 35%
More (ex 1, 2) minor biological discoveries (e.g. ROCK inhibitor example) made primarily by an LLM (i.e. AF3 predictions or literature search contributions don’t count): 85%
Major biological discovery (e.g. discovery of CRISPR) made with major contribution from an LLM (i.e. AF3 predictions or literature search contributions don’t count): 25%
I view at least one SoTA model as equal to or better than the best scientists I know at reasoning about hard scientific questions: 25%
The majority of the time I disagree with SoTA models on biological Qs, I assume they are right and I am wrong (and find that to be true when I dig in deeper): 70%
This one may be tough to resolve, but I think it’s still useful to write down even if I end up not being able to resolve it definitively.
Continual learning
SoTA general model able to speedrun Pokemon (competitive with good speedrunners in terms of number of actions): 30%
Best overall BALROG score (source) >70%: 55%
Best BALROG NetHack score (source) >50%: 45%
Context windows of at least some publicly available (including through subscription), SoTA frontier models are:
>1M tokens: 90%
>5M tokens: 40%
>50M tokens: 20%
I use an assistant product with an understanding of my preferences that compares to what a human assistant could develop after a quarter of working together: 30%
At least one popular coding agent has a form of memory/long-term learning that works well (as judged by me) beyond what could be achieved via updating a static AGENTS.md/CLAUDE.md file: 60%
Frontier lab rolls out a user-facing product (i.e. not a fine-tuning API) that I believe involves weight updates on the day/week timescale: 25%
A model consistently writes blog posts from a single prompt that I think are as good as or better than those of my favorite internet writers: 30%
Real-world messiness and economic value
Best GDPVal score:
>80%: 80%
>90%: 65%
Best measured RLI score:
>10%: 85%
>50%: 40%
>80%: 20%
I have used an AI personal assistant product daily for 2+ months and have stuck with it: 65%
I trust said product to manage my calendar and respond to emails, with the product itself choosing when to escalate for permission: 30%
I assess that the majority of my close, non-programmer friends/family with white collar jobs have “felt the AGI” with respect to their work: 55%
Acknowledgements
Andrew Lindner, Matt Ritter, Will Baird, Willy, and Nathan Frey for thorough and helpful feedback.


For the bio predictions saying "by an LLM (i.e. AF3 predictions or literature search contribution don’t count)", would you count an LLM agent that uses AF3 or other specialized models via tool calls?
Cool predictions
> My default workflow for writing code will involve managing >=5 agents in parallel
This feels like more of a prediction about your relationship with AI tools than about AI progress? I think their independent time horizon would need to increase very drastically before managing five becomes even feasible (e.g. by default you’re giving them tasks that each require 1+ hour to complete, and the instructions take no more than 5 minutes each to give). However, that’s just the precondition. You’d also need to get used to it before you can juggle that many.