Three Reports Dropped on the Same Day. Together, They Tell a Story None of Them Tells Alone.
Yesterday, three groups of people who spend their professional lives studying AI published three very different reports within hours of each other. One came from an outside researcher making his best guesses about what AI can actually do right now. Another came from Anthropic's own red team, describing what their newest model found when they pointed it at the world's most secure software. The third came from a researcher at METR, the organization that builds the benchmarks we use to measure AI capabilities, who says those benchmarks are breaking.
Individually, each report is worth reading. Together, they tell a story that should change how you think about AI in 2026. Not because AI is about to take your job. Not because it's all hype. Because we've entered a phase where nobody, including the people building these systems, can tell you exactly what AI can and can't do.
That's not a throwaway observation. It has real consequences for how you spend money, evaluate vendors, and plan your next twelve months.
The Outside View: Ryan Greenblatt's Honest Accounting
Ryan Greenblatt is an AI safety researcher at Redwood Research who has unusually good visibility into what's happening inside the major AI labs. On April 7, he published "My picture of the present in AI", a long and detailed forecast of where things stand. Not where the marketing departments say they stand. Where they actually stand.
His most important claim is about productivity. The real AI speed-up at the best AI companies in the world is about 1.6x. Not 5x. Not 10x. Not 20x. Those inflated numbers come from asking the wrong question. People ask "how long would this task take without AI?" when they should ask "what speed-up would make you indifferent to having AI tools?" The first question invites exaggeration. The second one forces honesty. When you ask it correctly, the answer at frontier AI labs, staffed by some of the best engineers on Earth, is 1.6x.
Greenblatt also notes that AI-generated work tends to be sloppier, less reliable, and less well understood than human-only work. I've seen this firsthand. During my time leading teams at Riot Games and Wargaming, I learned that the real cost of sloppy work isn't the initial output. It's the downstream debugging, the misunderstandings that compound, the technical debt that accrues quietly until it doesn't. AI work carries that same hidden tax, and most productivity estimates ignore it completely.
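That hidden tax is easy to show with arithmetic. Here's a toy calculation (every number below is invented for illustration, not taken from Greenblatt's report) showing how a headline "5x" speed-up collapses once you count the cleanup time that follows the AI draft:

```python
# Toy illustration (all numbers hypothetical): a raw AI speed-up
# shrinks once downstream cleanup and debugging are counted.

def effective_speedup(baseline_hours, ai_draft_hours, cleanup_hours):
    """Speed-up after including the time spent fixing AI output."""
    return baseline_hours / (ai_draft_hours + cleanup_hours)

baseline = 10.0   # hours for a human-only version of the task
ai_draft = 2.0    # hours to get a "done-looking" AI draft: 5x on paper
cleanup = 4.25    # hours of review, debugging, and rework afterwards

print(baseline / ai_draft)                             # headline claim: 5.0x
print(effective_speedup(baseline, ai_draft, cleanup))  # honest number: 1.6x
```

The indifference question forces you to count the cleanup column; the "how long would this take without AI?" question quietly drops it.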
But here's where his report gets really interesting. Greenblatt made a specific prediction about a model called Mythos at Anthropic. He predicted it would show dramatically improved capabilities, particularly in cybersecurity. He estimated a 60% chance that within six months, a well-engineered AI agent scaffold could create a working end-to-end exploit against a top-10 consumer software target with $1M in inference compute.
He published that prediction on the same day Anthropic showed he was right.
The Inside View: What Anthropic's Red Team Found
Hours after Greenblatt's post went live, Anthropic's Frontier Red Team published their assessment of Claude Mythos Preview's cybersecurity capabilities. The results are the kind of thing that makes you read the paragraph twice.
Mythos found thousands of zero-day vulnerabilities in every major operating system and every major web browser. It found a 27-year-old bug in OpenBSD, a system whose entire reputation is built on security. It found a 16-year-old bug in FFmpeg, one of the most aggressively fuzzed codebases in the world. These are systems that generations of the best human security researchers have been poking at for decades.
The numbers that matter most involve a direct comparison. The previous best model, Opus 4.6, attempted to turn Firefox vulnerabilities into working exploits and succeeded about 2 times out of several hundred attempts. Mythos Preview succeeded 181 times out of 250.
From 2 out of hundreds to 181 out of 250. That's not incremental improvement. That's a qualitative shift in what the model can do.
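You can sanity-check how non-incremental that jump is with standard confidence intervals. The article only says Opus 4.6 succeeded "about 2 times out of several hundred attempts," so the 300 below is a placeholder for that unstated denominator, not a reported figure:

```python
# Rough sanity check on the exploit numbers using Wilson score intervals.
# Opus 4.6's denominator is unstated in the source; 300 is a placeholder.
from math import sqrt

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    center = p + z * z / (2 * trials)
    spread = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    denom = 1 + z * z / trials
    return (center - spread) / denom, (center + spread) / denom

opus_low, opus_high = wilson_interval(2, 300)        # previous model
mythos_low, mythos_high = wilson_interval(181, 250)  # Mythos Preview

# Even with generous error bars, the ranges are nowhere near each other:
# Opus tops out around 2%, while Mythos bottoms out around 67%.
print(f"Opus 4.6:       {opus_low:.1%} to {opus_high:.1%}")
print(f"Mythos Preview: {mythos_low:.1%} to {mythos_high:.1%}")
```

However you fill in the missing denominator, the two intervals don't come close to overlapping. That's the statistical signature of a capability jump, not noise.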
And here's what makes it genuinely unsettling. These capabilities were not specifically trained. Nobody at Anthropic sat down and said "let's teach Mythos to hack browsers." The cybersecurity performance emerged as a side effect of general improvements in coding, reasoning, and autonomy. Mythos got better at finding exploits because it got better at thinking about code. That distinction matters enormously, because it means we cannot predict in advance which capabilities will emerge from the next round of general improvements.
Anthropic chose not to release Mythos publicly. Instead, they launched Project Glasswing, a partnership with Amazon, Apple, Microsoft, CrowdStrike, and the Linux Foundation to use the model defensively. When a company whose entire business model depends on selling AI access decides to withhold a product, that decision tells you something about what they found during testing. I've been in enough executive conversations about product launches to know that companies don't leave revenue on the table unless the internal data is genuinely alarming.
The Measuring Stick: METR Says the Rulers Are Broken
The third report came from LawrenceC, a researcher at METR, the leading external organization that builds the benchmarks used to evaluate AI capabilities. METR's entire purpose is to answer the question "how capable is this model?" And LawrenceC's post is essentially a public admission that their tools are failing.
Opus 4.6, the model before Mythos, already succeeds at over 80% of METR's time horizon tasks. The confidence interval for Opus 4.6's time horizon ranges from 12 hours to 60 hours. Think about what that means. Their best estimate of what this model can do autonomously spans a 5x range. That's not a measurement. That's a shrug with error bars.
When Anthropic's formal evaluations maxed out for Opus 4.6, the decision about whether to classify it as ASL-4 (their highest danger level) came down to a survey of 16 Anthropic employees. Not a benchmark. Not a rigorous evaluation framework. A poll. I spent years building people analytics systems at companies with tens of thousands of employees, and I can tell you that a survey of 16 people from a single organization, no matter how smart they are, is not a reliable instrument for decisions of this magnitude.
Creating harder benchmarks isn't a simple fix. Getting human baselines for just 50 new hard tasks would cost over $1 million in labor alone. And by the time those benchmarks are designed, validated, and deployed, the next model may have already surpassed them. LawrenceC's concluding observation lands hard. "We need to figure out what to do when we live in a world where the natural pace of AI development is faster than what we can easily measure."
Why These Three Matter Together
Read separately, each report is interesting. Read together on the same day, they form a picture that none of them draws on its own.
Greenblatt's 1.6x productivity number and Mythos's cybersecurity performance are not contradictory. They describe different parts of the same animal. The 1.6x is the average speed-up across all engineering tasks at frontier AI companies. The cyber results represent peak performance on tasks that perfectly match AI's strengths: verifiable outcomes, massive parallelization, and deep benefit from exhaustive knowledge of code patterns. Both numbers are real. The gap between them is the actual story. AI capabilities in 2026 are extraordinarily uneven. Brilliant in some domains. Mediocre in others. And the boundary between brilliant and mediocre shifts with every new model in ways that nobody, including the builders, can fully predict.
METR's benchmarking crisis explains why both of those findings are simultaneously more and less alarming than they appear. More alarming because if we can't measure capabilities, we can't bound risk. We literally don't know what Mythos can do that Anthropic hasn't tested yet. Less alarming because the tasks where benchmarks are saturating (well-defined coding problems, security vulnerabilities, structured technical work) are not the same tasks that most businesses care about. Judgment calls, relationship management, navigating ambiguity, working with incomplete information: these are the core of most small business operations, and AI remains genuinely mediocre at them.
The timing of all three publications matters. An outside researcher speculated about Mythos by name. The evaluation community said the next model would break their tools. And then Anthropic dropped the model and its red team assessment the same day. This was a coordinated information moment. The AI safety community saw this coming and was building the intellectual framework for understanding it in real time. That level of anticipation and coordination is new, and it's worth paying attention to, because it means the people who study AI risk are not being blindsided. They're keeping pace, even as their measurement tools fall behind.
What This Means for Your Business
If the organizations whose entire purpose is measuring AI capabilities are publicly saying "our tools aren't keeping up," then anyone selling you a specific ROI number for AI adoption is selling you something they literally cannot verify. That includes consultants. That includes me. If I told you "AI will make your operations 40% more efficient," I'd be making that number up. Anyone who gives you a precise figure right now is either guessing or lying, and you should ask them which one.
The honest answer to "what will AI do for my business?" is uncomfortable but more useful than fake certainty. AI will probably help more than you expect in some areas and less than you expect in others, and the only way to find out which is which is to run actual experiments with your actual workflows. Pick a specific task. Time it without AI. Time it with AI. Compare the quality. Your own data from your own operations is more reliable than any vendor benchmark right now, because even the best external benchmarks are falling apart.
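The experiment the paragraph above describes fits in a few lines. This is a minimal sketch with invented sample numbers; swap in your own timings, and note that the cleanup column (time spent fixing AI output before it's usable) is the one most people forget to record:

```python
# Minimal do-it-yourself AI benchmark for one recurring task.
# All sample numbers are invented; substitute your own measurements.
from statistics import median

# Minutes per attempt: (task time, cleanup/fix-up time afterwards).
without_ai = [(50, 0), (62, 0), (55, 0)]
with_ai    = [(20, 18), (25, 12), (18, 22)]

def total_minutes(trials):
    """Median total time per attempt, cleanup included."""
    return median(work + cleanup for work, cleanup in trials)

speedup = total_minutes(without_ai) / total_minutes(with_ai)
print(f"measured speed-up for this task: {speedup:.2f}x")
```

Three timed attempts per condition won't give you statistical rigor, but it will give you something no vendor deck can: a number measured on your task, your data, and your quality bar.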
The cybersecurity implications are real but probably not your immediate problem. Mythos-level capabilities are aimed at Chrome, Linux, Safari, and major infrastructure. If you're running a business in Las Cruces or El Paso, you're downstream of those systems, not the direct target. The organizations that maintain that infrastructure (Apple, Google, Microsoft, the Linux Foundation) are the ones who need to act, and Project Glasswing suggests they already are. Your job remains what it's always been. Keep your software updated. Use strong, unique passwords. Enable multi-factor authentication. That advice hasn't changed, even if the reason it matters has gotten more urgent.
The most useful takeaway from all three reports is a question you can carry into every AI conversation for the rest of the year. When someone tells you AI is "5x faster" or "will replace X by Y date," ask them how they measured it. Ask what baseline they used. Ask whether they accounted for the quality difference between AI output and human output. Right now, even the best measurement organizations in the world are struggling to answer those questions about the systems they built the tests for. If METR can't reliably benchmark Opus 4.6, the person on LinkedIn claiming AI will replace accountants by 2027 almost certainly can't back up their claim either.
Sitting with the Contradiction
We're in a strange moment. AI is simultaneously more capable than most people realize and less transformative than the hype suggests. Mythos found bugs that human security researchers missed for 27 years. And the real productivity speed-up at the best AI companies in the world is 1.6x. Both of those things are true at the same time, about the same generation of technology.
Sitting with that contradiction, rather than collapsing it into a simple story of either doom or magic, is the most honest and useful thing you can do right now. The people selling you certainty in either direction don't have the data to support it. The people who do have the data are telling you, publicly and on the record, that their data isn't good enough.
If you're in the Borderland and want to talk through what any of this means for your specific situation, you know where to find me.
Frequently Asked Questions
Should I be worried about AI cyber threats to my business?
Not from Mythos specifically, which Anthropic is withholding from public release and channeling through defensive partnerships. The more practical concern is that the underlying techniques will eventually diffuse into the broader threat landscape, raising the baseline sophistication of attacks over the next few years. The defensive playbook for a small business hasn't changed yet, but it will get more important to follow it rigorously. If you've been putting off that password audit or dragging your feet on MFA, now's a reasonable time to stop procrastinating.
If nobody can measure AI properly, how do I evaluate AI tools for my business?
Ignore the vendor's published benchmarks entirely. They're measuring performance on standardized tasks that probably don't resemble your workflow. Instead, design your own test. Take a task you do every week, do it both ways (with and without the AI tool), and compare the results across three dimensions: speed, quality, and how much cleanup the AI output needs before it's actually usable. That cleanup time is where most vendor claims fall apart.
Is AI really only 1.6x faster?
That number is the best-case scenario at companies like Anthropic and OpenAI, where the engineers are highly skilled at prompting, the workflows are already optimized for AI integration, and the tasks involve code and technical writing. For most small businesses working on less structured tasks, the real number for any given workflow is probably lower on average but could spike much higher for narrow, well-defined tasks like formatting data, drafting template-based content, or generating first-pass reports.
What does "emerged without being trained for" mean? Should that worry me?
Most AI capabilities are intentional. Engineers choose training data and objectives to make models good at specific things. Emergent capabilities are different. They show up as unplanned side effects of making the model generally smarter. It's roughly analogous to how a child learning to read well might unexpectedly become good at crossword puzzles. Nobody taught that skill explicitly. The concern is that emergent capabilities are, by definition, the ones nobody planned for and nobody tested in advance. On the positive side, this also means the next generation of AI tools may develop unexpected strengths that genuinely help your industry in ways nobody is currently predicting.
These articles are from the AI safety community. Is this fear-mongering?
It's closer to the opposite. Greenblatt's productivity numbers are dramatically lower than what the AI hype machine claims. The METR researcher is publicly admitting that their own tools have limitations, which is not what a fear-monger does (fear-mongers claim their measurements are precise and scary). And Anthropic voluntarily left money on the table by withholding a model from commercial release. Collectively, this reads like a group of technically rigorous people trying to be honest about a genuinely complicated situation, not like people with an agenda trying to frighten you into action.