People walk into AI consults with their eyes shining. They've been told their business is about to be replaced by a sentient toaster, that the singularity is queued behind their morning coffee, and that whoever ships the next chatbot wins the universe.
I love the enthusiasm. I have to gently dismantle most of it.
Here's the unromantic truth: today's frontier AI is not particularly intelligent. What it is, and this part is genuinely impressive, is the fastest, broadest, most ridiculously well-read intern humanity has ever produced. It has read everything. It has understood almost none of it. And the gap between those two things is where my consulting work lives.
The smoking gun: a benchmark a child beats
If you only click one link in this post, click this one: the ARC-AGI leaderboard.
ARC-AGI is the benchmark created by François Chollet specifically to be the thing AI can't brute-force. It's a set of small visual puzzles (grid in, grid out) that require you to spot a pattern from a couple of examples and apply it. A bright eight-year-old can solve most of them in an afternoon. The third generation, ARC-AGI-3, raises the bar further: tiny custom-built 2D games with no instructions. You have to figure out the rules by playing.
The entire industry's current ceiling on that task, one a primary-schooler shrugs off, is 0.6%. Sixty hundredths of one percent. The models that write your sales emails, draft your contracts, and refactor your codebase cannot work out the rules of a game they haven't seen before, which is, depending on how you squint, the actual definition of intelligence.
ARC-AGI is the rare formal benchmark that has resisted brute-force memorisation since 2019. The test puzzles are kept private. There is no Stack Overflow answer to copy. There is no Reddit thread to scrape. The model has to think, and that turns out to be the part it's worst at.
So what is AI actually doing?
Stripped of the marketing, a large language model is doing one thing astonishingly well: compressing the internet into a probability function. You give it some text, and it produces the statistically most plausible continuation, conditioned on every byte of human writing it ate during training.
That sounds reductive. It is reductive. It's also why these systems work. Almost every model you've used since ChatGPT launched is the same core idea trained on more data with more compute. The current race is a scaling race, not an intelligence race, and so far nothing has come along that genuinely changes that.
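To make "probability function" concrete, here is a toy sketch in Python. It is a word-level bigram counter, not how a production LLM is built: the corpus, the function names, and the one-word context are all assumptions chosen for illustration. The shape of the idea is the same, though: count what tends to follow what, then return the likeliest continuation.

```python
from collections import Counter, defaultdict

# Tiny stand-in for "the internet" (illustrative only).
CORPUS = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."

def train_bigram(text):
    """Count which word follows which, then normalise counts to probabilities."""
    tokens = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {word: n / sum(following.values()) for word, n in following.items()}
        for prev, following in counts.items()
    }

def most_plausible_next(model, word):
    """Return the highest-probability continuation, or None for an unseen word."""
    options = model.get(word)
    if not options:
        return None
    return max(options, key=options.get)

model = train_bigram(CORPUS)
print(most_plausible_next(model, "the"))    # the most frequent follower of "the"
print(most_plausible_next(model, "zebra"))  # never seen in training
```

The toy returns nothing at all for a word it never saw. A real model is far better at interpolating between things it has seen, but it will always return something, which is part of why novel problems produce fluent, confident answers rather than an honest blank.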
What you're paying for when you call GPT-5 or Claude Opus 4.7 is:
- Breadth. The model has read more medicine, law, code, history and Tagalog poetry than any human alive. Any time your problem benefits from "what does the world generally know about X?", it's a superpower.
- Speed. It can produce a passable first draft of almost anything in seconds. First drafts are where most knowledge work bottlenecks.
- Tool use. Modern models score very high on operating computers (clicking buttons, calling APIs, running code) because they've seen millions of examples of how those tools work. They are excellent imitators of established workflows.
What you are not paying for:
- Genuine reasoning from first principles. When the problem is novel, the model degrades to confident-sounding nonsense. This is the famous "hallucination". It's not a bug; it's the architecture doing exactly what it was built to do.
- On-the-spot learning. An LLM can't really update from one conversation to the next. Every chat starts cold. There's interesting work being done on this front, but nothing in production has cracked it yet.
- Discovery without a leash. Point an agent at a vague goal with no direction and it will produce something. It will also, frequently, produce the wrong thing very efficiently.
Why this is good news for your business
If frontier models were genuinely intelligent, you wouldn't be reading this. You'd be reading a redundancy notice.
Because they aren't, the value you extract from AI is almost entirely bottlenecked by the human pointing it at the right problems. Three things that don't bottleneck you anymore:
- Knowledge. The models carry more than your team will ever need.
- Price. A tier-one API call costs cents. Most workflows are cheaper than the coffee that fuels them.
- Setup. You can install a serious agentic stack on your laptop in under an hour.
What does bottleneck you:
- Knowing which problem to give the AI. This is taste, judgement, and business context: the things models still can't fake.
- Building the scaffolding around the AI so it produces consistent output rather than artisanal prose every Tuesday.
- Evaluating the output. If you can't tell when the model is confidently wrong, you cannot deploy it safely. Period.
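The last two bullets are the ones that sound abstract, so here is a minimal Python sketch of what they can look like. Everything in it is an assumption for illustration: the `category`/`confidence` schema, the function names, and `fake_model`, which stands in for whatever real API call you would make. The scaffolding rejects replies that aren't in the agreed shape and retries; the evaluation scores the model against examples a human has already labelled.

```python
import json

REQUIRED_KEYS = {"category", "confidence"}  # hypothetical output schema

def parse_reply(raw):
    """Return the reply as a dict if it is valid JSON with the required keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    return data

def ask_with_retries(call_model, prompt, max_attempts=3):
    """Call the model until its reply passes validation; give up after max_attempts."""
    for _ in range(max_attempts):
        reply = parse_reply(call_model(prompt))
        if reply is not None:
            return reply
    return None

def accuracy(call_model, labelled_examples):
    """Fraction of human-labelled examples where the model's category matches."""
    correct = 0
    for prompt, expected in labelled_examples:
        reply = ask_with_retries(call_model, prompt)
        if reply is not None and reply["category"] == expected:
            correct += 1
    return correct / len(labelled_examples)

def fake_model(prompt):
    """Stand-in for a real model call (hypothetical behaviour)."""
    if "refund" in prompt:
        return '{"category": "billing", "confidence": 0.9}'
    return "Sure! Here's my answer..."  # free-form prose fails validation

examples = [("I want a refund", "billing"), ("The app crashes on login", "bug")]
print(accuracy(fake_model, examples))
```

The design point is that both pieces sit outside the model. The validator doesn't care how eloquent the reply was, and the accuracy number only means something because a person supplied the expected answers.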
This is the part nobody putting up billboards about "AI transformation" wants to say out loud, so I will: the most expensive component of any working AI deployment is still the human pilot. The model is the engine. The pilot decides where the plane goes, when to pull up, and whether the runway is even pointing the right direction.
The honest test for "real" intelligence
I've adopted a personal benchmark, and you can borrow it: I'll start calling AI "intelligent" when a frontier model crosses 5% on ARC-AGI-3 without bespoke fine-tuning. Not 100%, 5%. The threshold where the system has demonstrably learned something from scratch on a problem it has never seen.
Until then, what we have is a sublime statistical mimic. Worth every cent. Just not the thing the press release said it was.
Where I come in
If your business is going to get value out of AI in 2026, it will be because someone, internal or external, knows the difference between the parts of the system that are genuinely capable and the parts that are theatre. That's the work I do. Two-hour proving ground, no retainer, no PowerPoint. We pick a real bottleneck, we build something that works, and you keep it.
If that sounds like the conversation you've been meaning to have, drop me a line.