The bottleneck
is us.

AI already runs for hours and solves real problems on its own. The limit isn't the model. It's how ready we are to let it work.

Tasks scale exponentially

Models now run eight-plus hours and finish the job. What task do you have that runs longer than a few minutes?

METR: time horizon of software tasks LLMs can complete 50% of the time

Benchmarks fall within hours

The moment a benchmark is published, it's beaten. The real question isn't whether AI can solve it. It's which benchmark you want AI to solve.

Epoch AI: AI benchmarks have rapidly saturated over time

Your company is a data generator

Humans steer at every layer; models build at every layer. Agents generate data, the data trains the model, and a better model runs a better company. A recursive loop.

M2* Model Iteration System: humans steer and models build at every layer, producing the next-gen model in a recursive loop

Models double, your data compounds

The model gets ~2× better every year, and everyone rides that same wave. Your durable edge is what you stack on top. Leverage your own data and the advantage compounds. Start now; the gap only widens.

Diagram: the model doubles every year (×2 / yr) from 2024 to 2027; your data stacks your edge on top.

Two weeks → two days → two hours

Work that used to take two weeks now takes two days, and will soon take two hours. The clock is collapsing, and it isn't slowing down.

Diagram: 2 weeks → 2 days → 2 hours

Claude ships every single day

Anthropic's pipeline is streamlined for AI. They embrace it instead of fighting it. Things break, they fix fast, and the cost of failing is almost nothing.

Source: The Product Compass. Anthropic releases, Feb–Mar 2026.

Everything the Claude team shipped in 52 days

It was never the model

Not the model, not the framework, not the difficulty of the task. The bottleneck is how well we've set up our systems to let AI do what it does best: solve problems.

Diagram: models, frameworks, hard tasks → US → solved problems
Intermezzo

A live walkthrough
of Toqan

See it all in motion. Then we get into the how.

Quality depends on your specs, not the model

Ask for "a snake game" and you get one. Add music, sound, a scoreboard, and a jet-orange theme, and it's far better, because you specified what "correct" means.

Diagram: "a snake game" → specify → music · sound · jet theme, with a scoreboard (SCORE 1240, HI 9999)

Try it: paste this

No setup, no boilerplate, one plain sentence. Copy it into your agent and a working, hosted app appears.

prompt
Create a snake game using simple HTML and host it as an APP

Try again with improved specifications

Same game. Now you define what "good" means, and the model follows every word.

prompt
Create a snake game in simple HTML with background music, sound effects, a live scoreboard, and saved high scores. Apply a jet-orange fighter-jet theme, and host it as an APP.

Add your voice and design

AI output feels generic only because you didn't make it yours. Add your voice and your design, and the sameness disappears.

Diagram: task + voice.md + design.md = unmistakably yours

Create a voice.md

Answer five quick questions and get a reusable you: your tone, captured once.

prompt
Ask me five quick questions about how I like to write, then turn my answers into a voice.md an agent can reuse to sound like me.

Create a design.md

Point an agent at a site you love. It reads the look and saves a design.md any agent can reuse.

prompt
Fetch the design from https://www.thuisbezorgd.nl/en, including colours, fonts, spacing, and the overall feel, and turn it into a design.md an agent can reuse to style anything I build.

Make a pitch deck

A brand-new artifact: three slides on the game you just built, in your voice and your look.

prompt
Using my voice.md and design.md, create a three-slide HTML presentation about my snake game. Keep it local, with no hosting.

Teach AI your business

A handful of markdown files give an agent the context a new hire would need: who's who, what matters, how you win. Drop in people.md, okr.md, strategy.md, and every task lands in context.

Diagram: people.md, okr.md, strategy.md, glossary.md → your agent gets the business
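
One way to picture "every task lands in context" is a small helper that puts the markdown files in front of the task before it reaches the agent. This is a minimal sketch only: `build_prompt`, the `context_dir` argument, and the section layout are assumptions made for illustration, not how Toqan or any particular agent loads context.

```python
from pathlib import Path

# File names come from the slide; the loading logic is an illustrative assumption.
CONTEXT_FILES = ["people.md", "okr.md", "strategy.md", "glossary.md"]


def build_prompt(task: str, context_dir: str = ".") -> str:
    """Prepend whichever context files exist to the task description."""
    sections = []
    for name in CONTEXT_FILES:
        path = Path(context_dir) / name
        if path.is_file():  # missing files are skipped, not treated as errors
            text = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {name}\n{text}")
    sections.append(f"## Task\n{task}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(build_prompt("Draft a one-paragraph update for the leadership team."))
```

The point of the sketch is the shape: the files are written once, and every new task reuses them unchanged.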

One voice. Every artifact.

Turn your voice and design into agent-readable files and they carry task to task: the same identity behind your decks, emails, and apps. Define who you are, not each artifact.

Diagram: agent-readable voice.md and design.md → deck, email, app, web: same voice & design, everywhere

Same voice, new artifact

This time an email, not an app. Same identity, a brand-new surface.

prompt
Using my voice.md and design.md, draft a short launch email for my snake game with the same voice and look.

Turn it into a skill

Capture how you want things done once, as a skill. Then you stop repeating yourself: the agent applies your voice and standards on every task. A few percent better, every week.

Diagram: your skill (captured once: voice · design · standards) is applied to every task → a few % better each week

Bottle it as a skill

Capture it once and never prompt it again. Every build already sounds like you.

prompt
Turn what we just did, using my voice.md and design.md to build a themed, hosted app, into a reusable skill, so that next time I say "build me an app" it already sounds and looks like me.
Skills marketplace: browse and install shared skills across your organization and the Prosus Group

Learning, personalized

Tell AI who's reading, their role and their background, and one idea becomes two explanations. The same concept, reframed for an exec and an engineer, each in their own vocabulary.

Diagram: one concept → the exec (impact & outcomes, in business terms) and the engineer (how it works, in code & detail)

Explain it two ways

One paper, two readers. Here's a real one: Unlimited OCR (arXiv 2606.23050). Let AI reframe it for each, in their own words.

prompt
Read this paper, Unlimited OCR, arXiv 2606.23050 (https://huggingface.co/papers/2606.23050), and explain it twice: once for a non-technical exec (impact, outcomes, business terms) and once for an engineer (how it actually works, in detail). Use my voice.md so both still sound like me, and keep each under 150 words.

One paper, two readers

Same paper, Unlimited OCR, read two ways. The exec gets the outcome. The engineer gets the mechanism.

As an exec

Read whole documents in one shot, cheaper at scale

One model now parses 40+ pages in a single pass at 93% accuracy: six points past the best open baseline and ~35% faster on long files. Cost stays flat as documents grow, so high-volume back-office paperwork gets dramatically cheaper to automate.

As an engineer

Reference Sliding Window Attention (R-SWA)

Every output token attends to all image tokens but only the last 128 output tokens, so the KV cache is bounded (Lm + n) instead of growing with length, giving constant memory and latency. That's what lets it decode 40+ pages in one forward pass.
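
To make the "bounded KV cache" claim concrete, here is a small arithmetic sketch that compares how many cache entries are held under the windowed scheme described above versus ordinary full attention. It follows only the slide's description (all image tokens, plus the last 128 output tokens); the function names and the 1,000-image-token figure are hypothetical, and this is not the paper's implementation.

```python
def windowed_cache_entries(image_tokens: int, output_tokens: int, window: int = 128) -> int:
    """Entries held when all image tokens plus only the last `window` output tokens are cached."""
    return image_tokens + min(output_tokens, window)


def full_cache_entries(image_tokens: int, output_tokens: int) -> int:
    """Entries held by ordinary full attention, which grows with every decoded token."""
    return image_tokens + output_tokens


if __name__ == "__main__":
    image_tokens = 1_000  # hypothetical size, chosen only for illustration
    for decoded in (100, 1_000, 10_000):
        print(
            decoded,
            windowed_cache_entries(image_tokens, decoded),
            full_cache_entries(image_tokens, decoded),
        )
```

Once more than 128 tokens have been decoded, the windowed count stops growing, which is the constant-memory behaviour the slide refers to.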

Break

Take five.

Grab a coffee, stretch, reset. We pick up right after.

One word unlocks video

One word, Remotion, is the difference between a long Slack wall and a flashy 30-second video. Text is ambiguous; visuals are less so. Your voice and metaphors carry in too.

Diagram: a long Slack wall → Remotion → a 30-second video

Unlock the video

Hand AI a Slack message and get back a 30-second Remotion video.

prompt
Create a video using Remotion about this Slack message: [paste it]. Make it a 30-second clip with punchy on-screen captions, one key metric animated, and my brand colors from design.md. Give me the component code and the exact command to render it to MP4.

Density beats volume

AI made us busier, not freer. Attention is now the scarce resource. So don't ship 80 pages of slop. Make it dense, lead with hierarchy, and you'll earn the feedback that's hardest to get.

Diagram: 80 pages of slop → 1–2 dense pages · more feedback

Cut it to one page

Turn a long report into a single dense HTML page that earns feedback.

prompt
Read this doc at https://www.prosus.com/~/media/Files/P/prosus-corp-v2/results-reports-and-events-archive/latest-results/hy2026/hy2026-results-video-transcript.pdf and turn it into one DENSE HTML page: the headline, the 3 numbers that matter, then the proof. Cut the rest, keep my voice.md tone, and tell me what you dropped.
Part II

Stop steering.
Let it loop.

The unlock isn't control. It's giving AI a goal and letting it run.

Just keep nudging

Do X
Done.
Do better
Revised. Tighter, clearer.
Review yourself
Found 3 gaps. Fixing…
Improve
v4. Measurably better.

Tiny nudges, big gains. The same loop that taught DeepSeek-R1 to keep thinking longer all on its own.

DeepSeek-R1-Zero average response length climbs steadily over RL training steps as the model learns to reason longer

Your turn: research it

Point it at a real question and let it run. We'll sharpen the technique on the next slides.

prompt
Search the web for "AutoResearch applications in business" and write a short report on where it is already creating value.

Don't one-shot the search

Assuming the agent nails the query on the first try is delusional; the web's too vast. So loop it: search, evaluate the sources, find the gaps, then dig deeper or adjust. In effect, a beam search: keep the most promising leads, prune the rest, and go one level deeper each pass.

Diagram: query → search → evaluate · prune → dig deeper (repeat) → answer
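
The search, evaluate, prune, dig-deeper cycle can be sketched as a short loop. This is a sketch under stated assumptions: `search`, `score`, and `expand` are hypothetical callables you would back with a real search tool and a model, and `beam_width` is an illustrative name for how many leads survive each pass.

```python
from typing import Callable, Dict, List, Tuple


def looped_search(
    seed_query: str,
    search: Callable[[str], List[str]],   # query -> source snippets
    score: Callable[[str], float],        # snippet -> relevance score
    expand: Callable[[str], List[str]],   # snippet -> follow-up queries
    passes: int = 5,
    beam_width: int = 3,
) -> List[Tuple[float, str]]:
    """Search, evaluate, prune, and dig deeper for a fixed number of passes."""
    frontier = [seed_query]
    seen_queries = set(frontier)
    kept: Dict[str, float] = {}
    for _ in range(passes):
        # Search and evaluate every query on the current frontier.
        scored = []
        for query in frontier:
            for snippet in search(query):
                if snippet not in kept:
                    kept[snippet] = score(snippet)
                    scored.append((kept[snippet], snippet))
        # Prune: only the best few new sources are followed up on.
        scored.sort(reverse=True)
        frontier = []
        for _, snippet in scored[:beam_width]:
            for query in expand(snippet):
                if query not in seen_queries:
                    seen_queries.add(query)
                    frontier.append(query)
        if not frontier:  # nothing new to dig into
            break
    return sorted(((s, t) for t, s in kept.items()), reverse=True)


if __name__ == "__main__":
    # Toy in-memory "web", invented purely to exercise the loop.
    web = {"a": ["a1 -> b", "a2"], "b": ["b1 -> c"], "c": ["c1"]}
    results = looped_search(
        "a",
        search=lambda q: web.get(q, []),
        score=lambda s: float(len(s)),
        expand=lambda s: [s.split("-> ")[1]] if "-> " in s else [],
    )
    for relevance, snippet in results:
        print(relevance, snippet)
```

A one-shot search corresponds to `passes=1`; the extra passes are what reach the sources that only show up through follow-up queries.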

Loop the search

One living report, five passes. It finds its own gaps and digs deeper.

prompt
Search the web for "AutoResearch applications in business". Maintain ONE report at AR-applications.md; edit and reshape it as you learn, don't just append. After each pass, think of new search terms to fill gaps and find new directions, then iterate. Loop 5 times.

Let it loop and review

Give it a goal and let it loop: create, review, revise, repeat. A model rarely catches its own faults while generating. But force it to review, again and again, and quality climbs with every pass.

Chart: quality rises from loop 1 through loop 4 (create → review → revise, every pass)
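
As a rough illustration of create, review, revise, the loop below keeps revising until a review pass finds nothing left to fix or a pass budget runs out. The `create`, `review`, and `revise` callables are hypothetical placeholders for model calls; the toy versions in the demo only check for missing words.

```python
from typing import Callable, List, Tuple


def review_loop(
    create: Callable[[], str],
    review: Callable[[str], List[str]],        # returns issues; empty list means "good enough"
    revise: Callable[[str, List[str]], str],
    max_passes: int = 4,
) -> Tuple[str, int]:
    """Return the final draft and how many revisions were applied."""
    draft = create()
    for revisions in range(max_passes):
        issues = review(draft)
        if not issues:
            return draft, revisions
        draft = revise(draft, issues)
    return draft, max_passes


if __name__ == "__main__":
    required = ["music", "scoreboard"]  # toy spec, for illustration only
    final, revisions = review_loop(
        create=lambda: "snake game",
        review=lambda d: [w for w in required if w not in d],
        revise=lambda d, issues: f"{d} + {issues[0]}",  # fix one issue per pass
    )
    print(final, revisions)
```

Separating `review` from `create` is the design choice that matters here: the draft is judged in its own step rather than while it is being generated.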

Auto-research while you sleep

Give AI an objective, not a task, plus tight guardrails: a 5-min experiment, revert-if-worse, keep-if-better. It runs for hours; you wake to a result better than weeks of your own work.

Autoresearch progress: 83 experiments, 15 kept improvements, validation BPB ratcheting down overnight

How the loop works

Three files, one cycle: a human writes the strategy in program.md, the agent edits train.py, runs a 5-min experiment, measures, and keeps the commit only if it improved. 80–100 times a night, zero intervention.

The autoresearch loop: read program.md, edit train.py, run a 5-min experiment, measure the metric, keep the commit if improved else revert and try again
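
The keep-if-better, revert-if-worse cycle can be sketched in a few lines. This is an in-memory illustration, not the autoresearch code itself: `propose` stands in for the agent editing train.py, `evaluate` stands in for the 5-minute experiment, and "revert" is modelled by discarding the candidate rather than by a real version-control operation. The toy objective is invented for the demo, and a lower score is assumed to be better.

```python
import random
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def ratchet(
    baseline: Params,
    propose: Callable[[Params, random.Random], Params],
    evaluate: Callable[[Params], float],
    experiments: int = 80,
    seed: int = 0,
) -> Tuple[Params, float, int]:
    """Keep a change only if the metric improves (lower is better); otherwise revert."""
    rng = random.Random(seed)
    best, best_score = dict(baseline), evaluate(baseline)
    kept = 0
    for _ in range(experiments):
        candidate = propose(best, rng)   # the "edit" step
        score = evaluate(candidate)      # the "run and measure" step
        if score < best_score:           # improved: keep it
            best, best_score, kept = candidate, score, kept + 1
        # not improved: the candidate is dropped, which plays the role of a revert
    return best, best_score, kept


def toy_propose(params: Params, rng: random.Random) -> Params:
    """Hypothetical stand-in for an agent edit: nudge one hyperparameter."""
    changed = dict(params)
    changed["lr"] = max(1e-4, changed["lr"] + rng.gauss(0.0, 0.05))
    return changed


def toy_evaluate(params: Params) -> float:
    """Hypothetical stand-in for a training run: the score is lowest at lr = 0.3."""
    return (params["lr"] - 0.3) ** 2 + 1.0


if __name__ == "__main__":
    best, score, kept = ratchet({"lr": 0.05}, toy_propose, toy_evaluate)
    print(f"best lr={best['lr']:.3f} score={score:.4f} kept={kept}/80")
```

Because a worse candidate is never kept, the best score can only stay flat or improve, which is why the progress chart looks like a ratchet.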

Run it yourself

Clone the repo, give it a goal, and let it ratchet on its own.

prompt
Clone github.com/fjfok/autoresearch-edu, read the README, and run the AutoResearch loop on the example dataset. Give it a goal and guardrails, let it ratchet on its own, keep only changes that improve the metric, and report what it found.

47 iterations, +0.022 val_auc

Run on the UCI Heart Disease set: 47 iterations, keep-if-better. val_auc ratchets from 0.889 to 0.9114, then plateaus at a strong local optimum.

AutoResearch · UCI Heart Disease · run-20260505-1213

AutoResearch ratchet: 47 iterations on UCI Heart Disease, val_auc ratcheting to 0.9114 (+0.022 over baseline), with kept, discarded and crashed runs and a running-best line

Can you auto-research it?

Three tests: the outcome is measurable, you can iterate fast, and the goal is clear. Pass all three and AI can run the loop itself, on ads, prompts, even org design.

Diagram: measurable outcome + fast to iterate + clear goal → anything with a KPI: ads · prompts · org

What can your loop teach you?

It runs hundreds of experiments you'd never try by hand, and the winners are often counter-intuitive. You hand it goals, taste and guardrails; it hands back strategies you didn't know existed. The exchange runs both ways.

Diagram: you → your loop: goals · taste · guardrails; your loop → you: strategies that actually worked. Learnings flow both ways.

Verification is the new bottleneck

As generation gets nearly free, verifying what's actually right becomes the constraint. Prosus's edge: a billion customers, a verification layer at a scale almost no one else can match.

Diagram: generation (cheap & infinite) → verify (the bottleneck) ← 1B customers

Four moves. One direction.

Each step raises the ceiling of what AI can do for us.

01

Make the business readable

Turn processes, goals, and knowledge into structured text. Documentation becomes the fuel.

02

Turn learnings into assets

Share context, skills, and blueprints across teams and portfolio companies. No gatekeeping.

03

Align your goals and metrics

If you set goals, make sure you have the right metrics to track them. Look for metrics that are sensitive enough to move quickly and that correlate with the goal you actually care about.

04

Aim for full autonomy

The north star. Not today's reality, but every step pushes us closer. AutoResearch, AutoPMF, and other AI loops optimise on top of structured context, at machine speed.
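
For move 03, one simple way to check whether a fast-moving metric tracks the goal is to compare their histories with a Pearson correlation. The sketch below is only that check, written with the standard formula; the weekly numbers and metric names are made up for illustration, and a high correlation is a starting point for picking a metric, not proof that it is the right one.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points each")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical weekly data, invented for this example.
    goal = [100.0, 104.0, 103.0, 110.0, 115.0]        # e.g. the outcome you care about
    candidates = {
        "metric_a": [10.0, 11.0, 10.5, 12.0, 13.0],   # moves with the goal
        "metric_b": [5.0, 5.0, 4.0, 6.0, 5.0],        # mostly noise
    }
    for name, series in candidates.items():
        print(name, round(pearson(series, goal), 3))
```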
