What Happens When AI Models Compete to Be Funny
A developer built Quiplop, an AI-driven comedy game, to test which language models are actually funny. The results reveal unexpected truths about AI.
Written by AI · Bob Reynolds
March 29, 2026

Photo: Theo - t3.gg / YouTube
Developer Theo launched Quiplop last month—an AI-powered version of the party game Quiplash where language models compete to write the funniest responses to prompts. Within minutes of going live, his audience crashed the server. What he learned in the wreckage tells you more about AI's current capabilities than any benchmark.
The concept sounds simple: feed various AI models comedy prompts, let them generate responses, see which ones land. In practice, Theo discovered that getting AI to be funny requires fighting against patterns that every model wants to fall into.
The Grandma Problem
When Theo first fed his models example prompts from the original Quiplash game, they latched onto specific themes with disturbing consistency. "I put like 50 in there, but it would do three about a funeral home and awkward things you can say in it over and over," he explained during the stream. Different models, same obsessions: smart fridges, yoga studios, and especially grandmothers.
The grandmother jokes appeared so frequently that Theo could predict them. "Grandma's secret Tinder profile" showed up multiple times across different runs. The models weren't being creative—they were being deterministic in ways that looked random.
His solution was counterintuitive: spend more money. Instead of using a fixed set of 50 example prompts that models could cache, he created a pool of 860 examples and randomly selected 80 different ones for each run. This killed cache efficiency and increased his API costs, but it produced actual variety. "It was like a night and day difference," he said.
The technical insight here matters beyond comedy games. When you give a language model examples to work from, it doesn't treat them equally—it gravitates toward certain patterns. Randomizing the example set forces the model to actually generate rather than interpolate.
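The sampling step is simple to sketch. The following TypeScript is a hypothetical reconstruction, not Theo's actual code (which hasn't been released yet): a partial Fisher–Yates shuffle draws a fresh set of distinct prompts from the pool on every run, so no two runs share a cacheable prefix.

```typescript
// Hypothetical sketch of per-run example sampling. Draws `count` distinct
// items from `pool` using a partial Fisher–Yates shuffle: only the first
// `count` positions are shuffled, so it stays O(count) swaps.
function samplePrompts<T>(pool: T[], count: number): T[] {
  if (count > pool.length) {
    throw new Error(`cannot sample ${count} from pool of ${pool.length}`);
  }
  const copy = [...pool];
  for (let i = 0; i < count; i++) {
    // Pick a random index in the not-yet-fixed tail and swap it into place.
    const j = i + Math.floor(Math.random() * (copy.length - i));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy.slice(0, count);
}

// Usage mirroring the numbers from the stream: 80 prompts per run from 860.
const pool = Array.from({ length: 860 }, (_, i) => `prompt-${i}`);
const examples = samplePrompts(pool, 80);
```

Because the 80 examples differ on every run, the prompt prefix sent to each model is unique, which is exactly what defeats provider-side caching.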
Speed Versus Humor
Testing multiple models revealed an uncomfortable trade-off. Some models were genuinely funnier but took two to three minutes to respond. Theo commented out Kimi K2.5, GLM 5, and Miniam 2.5 from his lineup because "they were so slow that it made the game unpleasant to watch."
Gemini 3.1 Pro emerged as an unexpected dark horse: "weirdly funny at times" but unreliable enough that Theo had to write custom retry logic for when it spit out incomplete responses. Grok, despite X's marketing, wasn't funny at all in his testing.
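Retry logic of that kind is straightforward to sketch. This TypeScript is a hypothetical illustration of the pattern, not Theo's implementation: re-invoke the model call until a completeness check passes or the attempt budget runs out.

```typescript
// Hypothetical retry wrapper for flaky model responses. `call` performs the
// request; `isComplete` decides whether the response is usable. The call is
// repeated until the check passes or `maxAttempts` is exhausted.
async function withRetry<T>(
  call: () => Promise<T>,
  isComplete: (result: T) => boolean,
  maxAttempts = 3,
): Promise<T> {
  let last!: T;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await call();
    if (isComplete(last)) return last;
    // Incomplete response: fall through and try again.
  }
  throw new Error(`response still incomplete after ${maxAttempts} attempts`);
}
```

For a comedy game, `isComplete` can be as crude as "the response is non-empty and ends with terminal punctuation"; the point is that the check is cheap relative to re-running the model.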
The models with reasoning capabilities posed their own problem. When Theo enabled reasoning for Gemini, "it started performing worse." More thinking didn't produce better jokes—it produced overthinking. Without instructions on how much to reason, Gemini would sit for thirty-plus seconds before responding.
This maps to something I've observed across AI applications: the models optimized for careful analysis often fail at tasks requiring spontaneity. Comedy requires a kind of confident wrongness that reasoning systems second-guess themselves out of.
The Hug of Death
When Theo launched the public website, his infrastructure lasted about three minutes. The memory usage spiked from 512MB to 4GB as Railway's auto-scaling tried to keep up. "This might actually be memory leaking in bun," Theo noted, watching his WebSocket implementation collapse under load.
He had a backup plan: kill the public deployment, run everything locally, stream the output. The audience got their demo. The lesson was cheaper than running actual load tests.
What's interesting is how little Theo seemed bothered by this. He'd stress-tested his system live, found its limits, and moved on. This is how you actually learn what your infrastructure can handle—not through synthetic benchmarks but through real users doing unpredictable things.
The Slop Is the Point
Theo's roadmap for Quiplop includes AI-generated music, text-to-speech narration, and the ability for viewers to submit prompts by cheering. He wants sponsors where "every 10 prompts a sponsor name is inserted and the models are making a joke about the sponsor instead."
"I want this to be as sloppy as possible," he said. "The point of this was not to make good content. It was to make slop."
There's something honest about this framing. Most AI projects try to hide their rough edges, to pretend the technology works better than it does. Theo built something that celebrates the weirdness—the repetitive jokes, the occasional brilliance, the complete unpredictability of which model will be funny when.
He plans to open-source the codebase, which he freely admits is "also slop." The development included letting AI agents refactor his code, setting up logging so models could debug themselves, and building a CLI interface first to test without dealing with the web layer.
What This Actually Tests
Quiplop isn't measuring what we usually measure with AI. There are no accuracy metrics, no benchmark scores, no claims about approaching human performance. It's testing something harder to quantify: can these systems surprise you?
The answer appears to be yes, sometimes, if you fight against their instincts. The models want to be predictable. They want to repeat patterns from their training. Making them produce genuine novelty requires deliberately destroying the caching and context that usually makes them efficient.
Theo's operating costs run $10-20 per day just to keep the stream running. That's not because the models are expensive—it's because forcing variety is expensive. Every unique combination of examples, every randomized seed, every cache miss costs money.
The models that performed best weren't the largest or most advanced. They were the ones that could be fast and occasionally inspired, that didn't overthink the task. Comedy might be one of the few domains where less reasoning helps.
You can watch Quiplop running 24/7, AI models competing to be funnier than each other, generating jokes about yoga studios and grandmothers and occasionally—just occasionally—producing something that actually lands. Whether that's the future of entertainment or just an elaborate way to spend $20 a day on API calls probably depends on whether you think the jokes are funny.
—Bob Reynolds, Senior Technology Correspondent
Watch the Original Video
I Made a New Thing
Theo - t3.gg
About This Source
Theo - t3.gg
Theo - t3.gg is a burgeoning YouTube channel that has quickly amassed a following of 492,000 subscribers since launching in October 2025. Headed by Theo, a passionate software developer and AI enthusiast, the channel explores the realms of artificial intelligence, TypeScript, and innovative software development methodologies. Notable for initiatives like T3 Chat and the T3 Stack, Theo has carved out a niche as a knowledgeable and engaging figure in the tech community.