AI Agents Promised to Do Your Work. They Can't Yet.

Here's what's wild: Wall Street wiped $285 billion off the SaaS sector because of an AI agent that literally stops working when you close your laptop.

I'm talking about Anthropic's Claude Computer Use (they call it "co-work"), which launched in January and immediately triggered an existential panic among enterprise software companies. The promise was intoxicating—an AI that could navigate your actual computer, open your actual files, and just... do things. No coding required. Point it at a spreadsheet, tell it what you need, walk away.

Microsoft saw the threat immediately. Within weeks, it had built its own version (confusingly also called "co-work") on top of Claude's infrastructure, despite having invested roughly $13 billion in OpenAI. That alone should tell you how serious this got.

But here's the tension that AI strategist Nate B. Jones surfaces in his recent breakdown: the agent that caused all this chaos is still in research preview. It's being rolled out cautiously. And it has limitations that would be laughable in any production software.

"If I shut this laptop, co-work just goes to sleep. It doesn't do anything. You can't engage with it. It's done," Jones explains. "That is not something that would be remotely doable for any SaaS application."

So what's actually happening in the gap between the promise and the reality?

The Three Questions That Separate Real Agents From Hype

Jones proposes a framework that cuts through the noise. Before you commit to any AI agent, ask:

Does it have persistent memory, or does every session start from zero?
Does it produce artifacts you can inspect, edit, and build upon?
Does the architecture let context compound over time?

The interesting part is that even Claude Computer Use—the tool that triggered a quarter-trillion-dollar market correction—only scores about 1.5 out of 3 on this rubric. Yes, it produces great artifacts (especially in Excel). But persistent memory is iffy at best, and context doesn't really compound across sessions.

"That's one of the things that's ironic," Jones notes. "This agent capability is so powerful, so addictive, so demand-driven that even if the answer to these three hard questions is like one and a half out of three, you still jump on it."

That demand is real. Anthropic has had to adjust Claude's usage limits multiple times because people can't stop using it, limitations and all.
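To make the arithmetic behind that "1.5 out of 3" concrete, here is a minimal sketch of how the three questions could be scored. The yes/partial/no weights and the mapping of Claude's agent onto them are assumptions made for illustration; Jones does not publish a scoring formula, and the names `rubric_score`, `SCORES`, and `QUESTIONS` are hypothetical.

```python
# Hypothetical scoring of the three-question rubric.
# The 1 / 0.5 / 0 weights are an assumption, not Jones's published method.
SCORES = {"yes": 1.0, "partial": 0.5, "no": 0.0}

QUESTIONS = ("persistent_memory", "inspectable_artifacts", "compounding_context")


def rubric_score(answers: dict[str, str]) -> float:
    """Sum the per-question scores; every question must be answered."""
    missing = [q for q in QUESTIONS if q not in answers]
    if missing:
        raise ValueError(f"unanswered questions: {missing}")
    return sum(SCORES[answers[q]] for q in QUESTIONS)


# One reading of the article's description of Claude's agent (assumed mapping):
claude_agent = {
    "persistent_memory": "partial",   # "iffy at best"
    "inspectable_artifacts": "yes",   # strong artifacts, especially in Excel
    "compounding_context": "no",      # does not really compound across sessions
}

print(rubric_score(claude_agent))  # 1.5
```

Other readings are possible (for example, partial credit on compounding context and none on memory), which is part of why the score is best treated as a rough gauge rather than a measurement.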

Why Coding Agents Worked First (And What That Reveals)

There's a reason AI agents conquered code before they conquered anything else: verifiability.

Code either runs or it doesn't. The feedback is immediate and binary. You don't need to interpret whether it's "good enough" or "mostly right." This is why tools like Cursor and GitHub Copilot established product-market fit so quickly—the domain itself provided the quality signal.
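A small sketch of what that binary signal looks like in practice, assuming a hypothetical helper called `passes_all` that runs a candidate function against known input/output pairs. It is not how Cursor or Copilot evaluate code; it only illustrates why the check is cheap to automate.

```python
def passes_all(func, cases) -> bool:
    """Binary quality signal: True only if every case yields the expected value."""
    for args, expected in cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False  # code that crashes fails the check
    return True


def add_good(a, b):
    return a + b


def add_buggy(a, b):
    return a - b  # deliberate bug


cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

print(passes_all(add_good, cases))   # True
print(passes_all(add_buggy, cases))  # False
```

There is no comparably short function that returns True or False for "this meeting summary is correct."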

Knowledge work is messier. How do you verify that an agent correctly summarized a meeting? Prepped you for a client call? Synthesized market research? The success criteria are fuzzy, which makes the engineering problem far harder.

This verifiability gap is why every agent company is scrambling to solve the same three problems Jones identified. Memory, artifacts, and compounding context aren't just nice-to-haves—they're the minimum viable infrastructure for outcome-focused work.

Three Contenders Rushing Into the Gap

Jones walks through several alternatives to Claude Computer Use, each taking different bets on what matters:

Lindy targets busy executives with a pitch that sounds perfect: describe what you want in plain language, walk away, and Lindy builds and operates the workflow. Founder Flo Crivello has high-profile endorsers calling it life-changing.

But Jones's experience is more mixed. The tool scores a 2.4 out of 5 on Trustpilot, with users complaining that credits burn mysteriously and complex tasks fail without clear explanations. "The interface for debugging, the interface for editing is not great," Jones observes. Lindy has persistent memory (qualified yes), but artifacts are opaque and context compounding is inconsistent.

The verdict? Lindy has found a niche between Zapier and Claude Computer Use—easier than Zapier, more focused than Claude—but it's not a deep outcomes agent yet.

Sauna (formerly Wordware) is the most conceptually interesting bet. After raising $30 million to build an IDE for AI agents, they pivoted hard when they realized something fundamental: "People don't wake up thinking I want to build an automation today. Instead, they wake up thinking, I have too much to do."

So they rebuilt as "Cursor for knowledge work," positioning memory not as a feature but as foundational infrastructure. "Memory as a substrate, not as a toggle," Jones emphasizes. The insight driving this: knowledge workers won't become programmers in the AI future—they'll need to write better specs.

The problem? Sauna is very early and very demo-heavy. Jones can't verify whether it actually delivers on any of the three key questions yet. But he's watching it because "it outlines where the industry is going."

Google Opal is the free option nobody's talking about. It's Google Labs' prompt-to-workflow builder, recently upgraded with Gemini 2.0 Flash powering agentic capabilities. The biggest advantage is obvious: zero barrier to entry.

But the disadvantages are also obvious—it's Google, which means (a) data lock-in concerns, and (b) a high probability this becomes another abandoned experiment. The memory feature "looks a lot like a spreadsheet," Jones notes, "and a spreadsheet is not going to be durable enough for the kind of memory you need for long-running agentic outcomes."

Still, for lightweight workflow automation? Free is compelling.

What the Market Is Actually Telling Us

The $285 billion SaaS sell-off wasn't irrational—it was premature.

Investors saw Claude Computer Use producing tangible work artifacts and extrapolated to a future where traditional SaaS tools become obsolete. But the tech that triggered that panic can't survive you closing your laptop. The gap between vision and execution is massive.

What's genuinely interesting is that even with a 1.5/3 score on fundamental capabilities, demand is overwhelming. Product-market fit exists even when the product barely works. That suggests we're looking at a category that will eventually deliver on its promise—just not yet, and not with the tools that exist today.

The builders who succeed will solve the unsexy infrastructure problems: durable memory systems, artifact generation that's truly editable, and architectures that let context compound across sessions. Those aren't demo-friendly features, but they're the difference between hype and utility.
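As a rough illustration of what "durable memory" means at its simplest, the sketch below keeps an agent's notes in a JSON file so that a second session picks up where the first left off. Everything here is hypothetical (the `FileMemory` class, the file name, the sample notes), and none of the products discussed is known to work this way; a real system would also need concurrency control, retrieval, and pruning.

```python
import json
import tempfile
from pathlib import Path


class FileMemory:
    """Hypothetical memory store: notes persist in a JSON file between sessions."""

    def __init__(self, path: Path):
        self.path = path
        # Load whatever earlier sessions saved; start empty otherwise.
        self.notes = json.loads(path.read_text()) if path.exists() else []

    def remember(self, note: str) -> None:
        self.notes.append(note)
        self.path.write_text(json.dumps(self.notes))  # write through on every update


store = Path(tempfile.mkdtemp()) / "memory.json"

session_one = FileMemory(store)
session_one.remember("Client prefers quarterly summaries")
del session_one  # the "laptop closes": in-process state is gone

session_two = FileMemory(store)  # a fresh session reloads the saved notes
session_two.remember("Q3 deck approved")
print(session_two.notes)
```

The second session sees both notes, which is the smallest possible version of context compounding across sessions.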

For now, the agents we have are powerful enough to spook markets and change behavior, but fragile enough that they sleep when you close your laptop. The question isn't whether outcome-focused agents will arrive—it's who'll build the boring infrastructure that makes them actually work.

Yuki Okonkwo is Buzzrag's AI & Machine Learning Correspondent.