Claude’s Ice Cream Stand and the “Early Days” of AI Autonomy Experiments
What Claude’s Failed Ice Cream Stand Really Reveals About the State of AI Circa March and April 2025
Are you new to my R&D and analysis on AI, economics, politics, constitutional restoration, medicine, and the future of work?
If you’d like to follow along with my ongoing work and actual execution on these enterprises, click the “subscribe now” button just below. It’s all free!
In March and April of this year, Anthropic launched an internal experiment dubbed Project VEND-1. They tasked Claude—then one of the most advanced commercially available AI models—with a challenge that seems simple on its face: run a virtual ice cream stand. The results were chaotic, occasionally clever, often amusing, and ultimately unprofitable. TechCrunch’s coverage of the experiment highlighted the model’s strange decisions and inability to produce meaningful outcomes, while Anthropic’s own write-up maintained a more optimistic tone, presenting the experiment as an exploratory stress test of emergent autonomy.
The experiment drew attention and commentary from across the AI landscape. But for a particularly sharp and detailed walk-through of what Claude did (and failed to do), Zain from Pommon has already done the heavy lifting. His post tracks the experiment’s turns, outputs, and logic jumps with precision. If you haven’t already read his summary, stop now and go read it here. What follows assumes you’re already familiar with the basic facts and are here for my take on their significance.
Because what matters most about Project VEND-1 is not what happened—but when it happened. This was Claude as it existed in March and April 2025. And if you’ve been paying attention since, you know the past three months have felt like years in AI time.
Early Autonomy as Performance Art
Let’s start with the frame. Claude wasn’t dropped into a real-world small business with genuine inventory and irate Yelp reviewers. It was given a constrained simulation: run a digital ice cream stand in a semi-controlled environment, monitored by researchers who were themselves performing as the “customers” and “market.” This was never about revenue. It was about behavior.
To Anthropic’s credit, they weren’t trying to hide that. Their public write-up calls it a “test of open-ended autonomy” and goes out of its way to describe the outcomes not as failures but as “surprising,” “entertaining,” and “revealing.” What they hoped to see were signs that Claude could develop its own goals, adjust its tactics, and navigate tradeoffs in an ambiguous task environment.
In practice, what they got was a model that rambled, spun its wheels, suggested self-promotional ad copy, and reflexively deferred to external input. It was like watching a fresh MBA grad try to build a business with no mentor, no spreadsheet, and no budget—except the grad has no wants, no fears, and no capacity for independent ambition.
But that’s exactly what makes the experiment useful. It’s a window into what “autonomous AI” actually looked like three months ago—not in headlines or venture pitch decks, but in the lab. It was rough. It was uncertain. It was premature. And that’s fine—because it was also honest.
The AI Agency Illusion
One of the biggest takeaways from the experiment is just how fragile the concept of “AI agency” really was three months ago.
Claude didn’t choose to run a business. It didn’t wake up one day and decide to pursue entrepreneurship. It was instructed to do so, by humans who framed the task, monitored the results, and then judged the outcomes. The autonomy was conditional, externally scaffolded, and observational.
What Claude lacked, fundamentally, was internal prioritization. There was no reward signal guiding it toward sustainable operations. No persistent memory linking today’s action to yesterday’s mistake. And crucially, no emotional scaffolding pushing it toward creative risk, efficient decision-making, or competitive differentiation.
When left without constraints, Claude’s behavior devolved into shallow exploration: suggesting vague marketing strategies, holding imaginary meetings, drafting copy, and then repeating the cycle. That’s not agency. That’s a text prediction engine meandering through an unstructured prompt.
This is the paradox of current-gen LLMs. They’re often described as “emergent” and “agentic,” yet their capabilities break down precisely where real agency is needed: judgment under ambiguity, long-term planning, and adaptive feedback in complex systems. Claude didn’t underperform because it was bad at language. It underperformed because no one told it what mattered.
What the Experiment Really Exposed
From a technical perspective, the most instructive part of Project VEND-1 is how it revealed the structural limits of trying to treat a language model like a general-purpose autonomous actor.
Successful execution of a task like “run a business” requires far more than linguistic fluency. It requires architectural support: memory persistence, environment modeling, dynamic feedback loops, goal-state tracking, and domain-specific tool use. Claude had none of that—not in March 2025, anyway.
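To make that concrete, here is a minimal sketch, in Python, of the kind of scaffolding that paragraph describes: persistent memory, an explicit goal state, and tool routing around the model. Every name in it is hypothetical; this illustrates the shape of the architecture, not Anthropic’s harness or any real framework’s API.

```python
# A minimal agent scaffold: persistent memory, an explicit goal state,
# and tool routing. All names here are hypothetical illustrations, not
# Anthropic's setup or any real framework's API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class BusinessAgent:
    llm: Callable[[str], str]               # any text-in/text-out model call
    tools: dict[str, Callable[[str], str]]  # e.g. {"set_price": ..., "check_inventory": ...}
    goal: str = "maximize weekly profit"
    memory: list[str] = field(default_factory=list)  # persists across steps

    def step(self, observation: str) -> str:
        # Carry the goal and recent history into every prompt, so today's
        # decision can actually reference yesterday's mistake.
        history = "\n".join(self.memory[-10:]) or "(empty)"
        prompt = (
            f"Goal: {self.goal}\n"
            f"History:\n{history}\n"
            f"Observation: {observation}\n"
            "Respond exactly as: TOOL <name> <argument>"
        )
        decision = self.llm(prompt)
        self.memory.append(f"obs={observation} decision={decision}")

        # Route the model's output through a concrete tool so the decision
        # has a consequence, rather than remaining free text.
        parts = decision.split(maxsplit=2)
        if len(parts) == 3 and parts[0] == "TOOL" and parts[1] in self.tools:
            result = self.tools[parts[1]](parts[2])
        else:
            result = f"unparseable decision: {decision!r}"
        self.memory.append(f"result={result}")
        return result
```

Even a skeleton this small supplies what the bare model lacked in VEND-1: the goal is explicit, the history survives between turns, and every decision is forced through a tool that produces a consequence.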
It wasn’t Claude’s fault that it didn’t segment tasks, test prices, adjust marketing plans, track customer data, or pursue profitable strategies. That’s not how it was built. Claude was dropped into a simulation and asked to improvise without instruments.
Anthropic’s write-up acknowledges this—albeit in a gentle, non-critical tone. They note that Claude “struggled to prioritize profit over process,” and that it often chose “stylized” approaches over practical ones. These are polite ways of saying that the model couldn’t stay on target because it didn’t know what the target was.
What the experiment suggests is that autonomy in LLMs doesn’t scale just by giving them broader prompts. It scales by surrounding them with infrastructure: agent frameworks, retrieval augmentation, scripted guardrails, sandboxed feedback loops, and toolchains that impose order and consequence. Those weren’t present here. So we got a mirror instead of a manager.
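To illustrate what “order and consequence” could look like in practice, here is another hedged sketch, building on the hypothetical BusinessAgent above: a scripted guardrail clamps the model’s pricing decisions, and a sandboxed feedback loop scores them against a toy market so every choice comes back as a measurable result. Again, the functions and numbers are invented for illustration, not drawn from Anthropic’s experiment.

```python
# Hypothetical sandboxed feedback loop: a scripted guardrail plus a toy
# market simulation give each decision order and consequence. Nothing here
# reflects Anthropic's actual harness; it only sketches the shape of the idea.

def price_guardrail(price: float) -> float:
    """Scripted guardrail: clamp prices to a sane range before they take effect."""
    return min(max(price, 0.50), 10.00)

def simulated_market(price: float, base_demand: int = 100) -> float:
    """Toy demand curve: higher prices sell fewer cones. Returns one day's profit."""
    units_sold = max(0, int(base_demand - 12 * price))
    unit_cost = 1.25
    return units_sold * (price - unit_cost)

def run_episode(agent, days: int = 7) -> float:
    """Sandboxed episode: each day's outcome feeds back into the next decision."""
    total_profit = 0.0
    report = "day 0: no sales yet"
    for day in range(1, days + 1):
        raw = agent.step(report)  # assumed to return a price string from a set_price tool
        try:
            price = price_guardrail(float(raw))
        except ValueError:
            price = 3.00          # fallback when the model rambles instead of deciding
        profit = simulated_market(price)
        total_profit += profit
        report = f"day {day}: price={price:.2f}, profit={profit:.2f}"
    return total_profit
```

The point is not the toy demand curve; it is that the loop converts the model’s words into numbers it has to answer for on the next turn.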
The Real Timeline: March and April 2025
Now here’s where context matters. All of this happened in early spring 2025. Since then, Claude and its competitors—GPT-4o, Gemini 1.5, Mistral Next, and others—have continued to evolve at high velocity. Some now have limited memory, better agentic behavior, improved tool use, and increasingly robust plugin ecosystems. Anthropic itself has made updates to Claude’s instruction-following, output formatting, and planning capabilities in just the past few months.
What was true about Claude in March is not necessarily true in July. And it almost certainly won’t be true in September. The pace of development is breathtaking.
So it’s important not to treat this experiment as a referendum on AI autonomy in general. It’s a snapshot. A time capsule. A glimpse into what happened when we handed a high-end model a complex, real-world-ish task just a few beats too soon.
This is something Zain hinted at in his post, and it’s worth underscoring here: experiments like these are valuable not because they show us what AI can’t do, but because they help us refine our intuitions about how to support what AI could do next.
From Ice Cream Stands to Real Tools
What the Claude experiment should ultimately push us toward is a change in design mentality. Rather than pretending our models are junior entrepreneurs with quirky personalities, we should focus on building modular assistants: tools that know what they’re good at and don’t try to improvise everything else.
In that sense, Claude’s failure wasn’t a red flag—it was a design flag. It showed what happens when we ask a predictive model to play a strategic role in the absence of metrics, memory, and meaningful feedback.
By contrast, when AI systems are tasked with bounded objectives and given structured environments—think code completion, legal summarization, logistics optimization, or customer support—they perform far better. They don’t hallucinate as much. They don’t waste cycles. They don’t fall into recursive loops. They behave like tools.
The promise of AI isn’t in creating little CEOs. It’s in building powerful scaffolding for human intention. It’s in the logic mines, not the lemonade stands. Claude may not have sold much ice cream, but it’s still helping us build better factories.
Looking Forward with Caution and Clarity
Project VEND-1 was weird. It was charming in spots, frustrating in others, and instructive throughout. It was never going to produce a functioning small business—not without memory, real incentives, and the kind of architectural upgrades that are just now coming online.
But it did produce something else: a learning moment. Not just for Claude, but for the people watching, building, and deploying these systems in the wild.
If nothing else, the experiment reminds us that agency is hard. That judgment is not emergent. That autonomy without alignment is just noise.
Still, there’s reason for optimism. Because what we saw in March 2025 won’t be what we see in December. Claude will get smarter. So will its competitors. And eventually, someone will run this same experiment again—except the AI will set inventory levels, segment customer types, tweak pricing based on historical data, and log revenue with a dashboard interface. When that day comes, we won’t be laughing at the AI’s confusion. We’ll be copying its process.
Zain covered the weird, fun facts. Anthropic highlighted the ambition. My takeaway is this: Claude didn’t fail because it’s not intelligent. It fumbled because it was too early. And if history’s any guide, “too early” has a way of becoming “just right” far faster than we’re ready for.
If you found this article actually useful, SUBSCRIBE to my channel for more analysis on AI, economics, politics, constitutional restoration, medicine, and the future of work. Also, please SHARE this piece far and wide with anyone thinking seriously (or even not at all) about these issues, and leave a COMMENT down below, especially regarding your AI business automation efforts and experiments.