AI already runs for hours and solves real problems on its own. The limit isn't the model. It's how ready we are to let it work.
Models now run eight-plus hours and finish the job. What task do you have that runs longer than a few minutes?

The moment a benchmark is published, it's beaten. The real question isn't whether AI can solve it. It's which benchmark you want AI to solve.

Humans steer at every layer; models build at every layer. Agents generate data, the data trains the model, and a better model runs a better company. A recursive loop.

The model gets ~2× better every year, and everyone rides that same wave. Your durable edge is what you stack on top. Leverage your own data and the advantage compounds. Start now; the gap only widens.
Work that used to take two weeks now takes two days, and will soon take two hours. The clock is collapsing, and it isn't slowing down.
Their pipeline is streamlined for AI. They embrace it instead of fighting it. Things break, they fix fast, and the cost of failing is almost nothing.
Source: The Product Compass. Anthropic releases, Feb–Mar 2026.

Not the model, not the framework, not the difficulty of the task. The bottleneck is how well we've set up our systems to let AI do what it does best: solve problems.
See it all in motion. Then we get into the how.
Ask for "a snake game" and you get one. Add music, sound, a scoreboard, and a jet-orange theme, and it's far better. You specified what "correct" means.
No setup, no boilerplate, one plain sentence. Copy it into your agent and a working, hosted app appears.
Same game. Now you define what "good" means, and the model follows every word.
AI output feels generic only because you didn't make it yours. Add your voice and your design, and the sameness disappears.
Answer five quick questions and get a reusable you: your tone, captured once.
Point an agent at a site you love. It reads the look and saves a design.md any agent can reuse.
A brand-new artifact: three slides on the game you just built, in your voice and your look.
A handful of markdown files give an agent the context a new hire would need: who's who, what matters, how you win. Drop in people.md, okr.md, strategy.md, and every task lands in context.
Turn your voice and design into agent-readable files and they carry task to task: the same identity behind your decks, emails, and apps. Define who you are, not each artifact.
This time an email, not an app. Same identity, a brand-new surface.
Capture how you want things done once, as a skill. Then you stop repeating yourself: the agent applies your voice and standards on every task. A few percent better, every week.
Capture it once and never prompt it again. Every build already sounds like you.

Tell AI who's reading, their role and their background, and one idea becomes two explanations. The same concept, reframed for an exec and an engineer, each in their own vocabulary.
One paper, two readers. Here's a real one: Unlimited OCR (arXiv 2606.23050). Let AI reframe it for each, in their own words.
Same paper, Unlimited OCR, read two ways. The exec gets the outcome. The engineer gets the mechanism.
One model now parses 40+ pages in a single pass at 93% accuracy: six points past the best open baseline and ~35% faster on long files. Cost stays flat as documents grow, so high-volume back-office paperwork gets dramatically cheaper to automate.
Every output token attends to all image tokens but only to the last 128 output tokens, so the KV cache is bounded (Lm + n) instead of growing with length, giving constant memory and per-token latency. That's what lets it decode 40+ pages in a single pass.
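A minimal sketch of why that cache stays bounded, assuming the bound reads as "image tokens plus a fixed window of recent output tokens" (that reading of Lm and n is an assumption, and `simulate_cache` is a hypothetical helper, not code from the paper):

```python
from collections import deque


def simulate_cache(num_image_tokens, num_steps, window=128):
    """Track KV-cache size while decoding with a sliding output window.

    Assumption: image-token entries are kept for the whole decode, while
    only the most recent `window` output-token entries are retained.
    """
    image_kv = list(range(num_image_tokens))  # stays fixed during decoding
    recent_kv = deque(maxlen=window)          # oldest entry is evicted automatically
    sizes = []
    for step in range(num_steps):
        recent_kv.append(step)                # cache the newly generated token
        sizes.append(len(image_kv) + len(recent_kv))
    return sizes


sizes = simulate_cache(num_image_tokens=1000, num_steps=500)
print(sizes[0], sizes[-1], max(sizes))
```

The size grows for the first 128 steps and then stays flat, however long the output gets.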
Grab a coffee, stretch, reset. We pick up right after.
One word, Remotion, is the difference between a long Slack wall and a flashy 30-second video. Text is ambiguous; visuals are less so. Your voice and metaphors carry in too.
Hand AI a Slack message and get back a 30-second Remotion video.
AI made us busier, not freer. Attention is now the scarce resource. So don't ship 80 pages of slop. Make it dense, lead with hierarchy, and you'll earn the feedback that's hardest to get.
Turn a long report into a single dense HTML page that earns feedback.
The unlock isn't control. It's giving AI a goal and letting it run.
Tiny nudges, big gains. The same loop that taught DeepSeek-R1 to keep thinking longer all on its own.

Point it at a real question and let it run. We'll sharpen the technique on the next slides.
Assuming the agent nails the query on the first try is delusional; the web is too vast. So loop it: search, evaluate the source, find the gaps, then dig deeper or adjust. A depth-first beam search.
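The loop can be sketched roughly as follows. This is an illustration, not a real agent: `research_loop` and its `search`, `evaluate`, and `find_gaps` callables are hypothetical stand-ins you would back with a search tool and a model, and the beam width, depth, and score threshold are assumed values.

```python
def research_loop(question, search, evaluate, find_gaps,
                  max_depth=3, beam_width=2, min_score=0.5):
    """Search, score each source, collect gaps, then follow the best gaps deeper."""
    findings = []
    frontier = [question]
    seen = {question}                      # queries already scheduled or pruned
    for _ in range(max_depth):
        candidates = []
        for query in frontier:
            for source in search(query):
                score = evaluate(source)
                if score < min_score:
                    continue               # weak source: drop it
                findings.append(source)
                for gap in find_gaps(source):
                    if gap not in seen:
                        seen.add(gap)
                        candidates.append((score, gap))
        if not candidates:
            break                          # nothing left to dig into
        # Keep only the follow-up queries that came from the strongest sources.
        candidates.sort(key=lambda item: item[0], reverse=True)
        frontier = [gap for _, gap in candidates[:beam_width]]
    return findings
```

Capping both depth and beam width keeps the run bounded: the agent digs into the most promising gaps rather than every one it finds.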
One living report, five passes. It finds its own gaps and digs deeper.
Give it a goal and let it loop: create, review, revise, repeat. A model rarely catches its own faults while generating. But force it to review, again and again, and quality climbs with every pass.
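As a rough sketch of that create, review, revise cycle: the `create`, `review`, and `revise` callables are hypothetical placeholders for model calls, and the pass limit is an assumed default.

```python
def refine(goal, create, review, revise, max_passes=5):
    """Draft once, then alternate review and revision until no issues remain."""
    draft = create(goal)
    history = [draft]                 # every version, useful for comparing passes
    for _ in range(max_passes):
        issues = review(draft)        # a separate, forced review step
        if not issues:
            break                     # the reviewer found nothing left to fix
        draft = revise(draft, issues)
        history.append(draft)
    return draft, history
```

Keeping review as its own step, separate from generation, is the point: the model is asked to look for faults rather than expected to notice them while writing.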
Give AI an objective, not a task, plus tight guardrails: a 5-min experiment, revert-if-worse, keep-if-better. It runs for hours; you wake to a result better than weeks of your own work.

Three files, one cycle: a human writes the strategy in program.md, the agent edits train.py, runs a 5-min experiment, measures, and keeps the commit only if it improved. 80–100 times a night, zero intervention.
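The keep-if-better cycle reduces to a small ratchet. The sketch below is a simplification under stated assumptions: it holds the best candidate in memory instead of committing and reverting in git, and `ratchet`, `propose`, and `evaluate` are hypothetical names standing in for the agent's edit to train.py and the 5-minute experiment.

```python
import random


def ratchet(initial, propose, evaluate, iterations=100):
    """Keep a candidate only if it scores strictly better than the current best."""
    best, best_score = initial, evaluate(initial)
    log = [best_score]
    for _ in range(iterations):
        candidate = propose(best)          # stand-in for the agent's edit
        score = evaluate(candidate)        # stand-in for the short experiment
        if score > best_score:
            best, best_score = candidate, score   # keep the improvement
        # otherwise drop the candidate, which is the "revert-if-worse" step
        log.append(best_score)
    return best, best_score, log


# Toy objective: move a number toward 3 using random nudges.
rng = random.Random(0)
best, score, log = ratchet(
    0.0,
    propose=lambda x: x + rng.uniform(-1, 1),
    evaluate=lambda x: -(x - 3) ** 2,
    iterations=50,
)
print(best, score)
```

Because a worse candidate is never kept, the logged score can only hold or rise, which is why a long unattended run is safe to leave overnight.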

Clone the repo, give it a goal, and let it ratchet on its own.
Run on the UCI Heart Disease set: 47 iterations, keep-if-better. val_auc ratchets from 0.889 to 0.9114, then plateaus at a strong local optimum.
AutoResearch · UCI Heart Disease · run-20260505-1213

Three tests: the outcome is measurable, you can iterate fast, and the goal is clear. Pass all three and AI can run the loop itself, on ads, prompts, even org design.
It runs hundreds of experiments you'd never try by hand, and the winners are often counter-intuitive. You hand it goals, taste and guardrails; it hands back strategies you didn't know existed. The exchange runs both ways.
The open question we're chasing next. The benchmark is public, so take a look.
github.com/fjfok/REST-bench
As generation gets nearly free, verifying what's actually right becomes the constraint. Prosus's edge: a billion customers, a verification layer at a scale almost no one else can match.
Each step raises the ceiling of what AI can do for us.
Turn processes, goals, and knowledge into structured text. Documentation becomes the fuel.
Share context, skills, and blueprints across teams and portfolio companies. No gatekeeping.
If you set goals, make sure you have the right metrics to track them. Think sensitive metrics that correlate with the goal.
The north star. Not today's reality, but every step pushes us closer. AutoResearch, AutoPMF, and other AI loops optimise on top of structured context, at machine speed.