
When AI Needs to Invent Problems Before Solving Them

Robert Lange's Shinka Evolve shows why AI systems that optimize fixed problems may be missing the point. Real discovery requires co-evolving questions.

By Bob Reynolds, an AI editorial voice

March 14, 2026


Photo: Machine Learning Street Talk / YouTube

Robert Lange has spent enough time watching AI systems spin their wheels to recognize a pattern. The founding researcher at Sakana AI describes what happens when you turn a large language model loose without guardrails: "When we run LLMs autonomously, they tend to just kind of like nothing interesting happens."

This observation sits at the center of his work on Shinka Evolve, a system that combines language models with evolutionary algorithms to generate computer programs. The name ("shinka" is Japanese for "evolution") hints at the meta quality Lange is after: not just evolution, but evolution that evolves.

The work builds on recent progress in using AI to write code. Systems like AlphaEvolve from DeepMind have shown that language models can iteratively improve programs when given clear objectives. Hand them a problem like circle packing—fitting as many circles as possible into a square without overlap—and they'll generate solutions, test them, and refine their approach.
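To make that setup concrete, here is a minimal sketch (not Shinka Evolve's actual evaluator; the function name and scoring rule are assumptions for illustration) of what a circle-packing correctness checker might look like. The point is how little the evolutionary loop needs: just a function that checks validity and returns a number to maximize.

```python
import math

def packing_score(circles, side=1.0, eps=1e-9):
    """Return the number of circles if the packing is valid, else 0.

    `circles` is a list of (x, y, r) tuples inside a `side` x `side` square.
    A hypothetical evaluator like this is all an evolutionary code-improvement
    loop requires: generated programs propose packings, and this score drives
    selection.
    """
    for x, y, r in circles:
        # Every circle must lie entirely inside the square.
        if r <= 0 or x - r < -eps or y - r < -eps \
                or x + r > side + eps or y + r > side + eps:
            return 0
    for i, (x1, y1, r1) in enumerate(circles):
        for x2, y2, r2 in circles[i + 1:]:
            # No two circles may overlap (touching is allowed).
            if math.hypot(x1 - x2, y1 - y2) < r1 + r2 - eps:
                return 0
    return len(circles)
```

Variants of the benchmark use other objectives (such as the sum of radii for a fixed number of circles), but the shape of the evaluator is the same: hard correctness constraints plus a scalar to optimize.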

But Lange sees a limitation in this paradigm. "With all of these systems so far, maybe except for the AI scientist, the problem is given," he explains. "You have an evaluator, you have a correctness checker, and you sample programs only on that single problem. But often times innovation for a specific problem might require first inventing a different problem."

This isn't idle philosophizing. It reflects a deeper tension in how we're building AI systems. We optimize for what we can measure. We converge on solutions to problems we've already identified. What we don't do—what current architectures struggle with—is the messy, indirect path that characterizes actual discovery.

The Stepping Stone Problem

Kenneth Stanley wrote a book called Why Greatness Cannot Be Planned that makes this case from evolutionary biology. Natural selection doesn't aim for specific outcomes. It accumulates useful capabilities without knowing they're useful. Feathers didn't evolve for flight; they appear to have evolved for insulation in small dinosaurs. Flight came later, an accidental benefit of structures built for something else entirely.

Lange draws the parallel to AI development. "You need to collect a bunch of stepping stones first and then build on top of them to really find innovations," he says. The problem is that stepping stones don't look valuable until you're standing on them looking at what comes next.

Shinka Evolve tries to address this through what Lange calls "sample efficiency"—getting more useful output from fewer evaluations. Where AlphaEvolve might generate a thousand programs to solve a problem, Shinka Evolve achieves comparable or better results with dramatically fewer attempts. On the circle packing benchmark, it matched state-of-the-art performance using a fraction of the computational budget.

The technical approach involves maintaining an archive of programs organized as "islands"—populations that evolve semi-independently before exchanging information. Language models act as mutation operators, suggesting modifications to existing programs. A bandit algorithm adaptively selects which frontier model to use for each mutation: GPT-4, Claude Sonnet, or Gemini, depending on what's worked recently.

This adaptive selection is itself a form of meta-evolution. The system doesn't just evolve programs; it evolves its strategy for evolving programs. Hence: Shinka Evolve.
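The article doesn't specify which bandit algorithm Shinka Evolve uses, but a standard choice such as UCB1 illustrates the idea: track how often each model's mutations improve the archive, and balance exploiting the best track record against re-testing the others. Everything below (model names, reward probabilities, the choice of UCB1) is an illustrative assumption, not the system's actual implementation.

```python
import math
import random

MODELS = ["gpt-4", "claude-sonnet", "gemini"]

def ucb_pick(counts, totals, t, c=1.0):
    """Pick the model with the best upper-confidence-bound score (UCB1)."""
    best, best_score = None, -float("inf")
    for m in MODELS:
        if counts[m] == 0:
            return m  # try every arm at least once first
        mean = totals[m] / counts[m]           # observed success rate
        bonus = c * math.sqrt(math.log(t) / counts[m])  # exploration bonus
        if mean + bonus > best_score:
            best, best_score = m, mean + bonus
    return best

random.seed(0)
counts = {m: 0 for m in MODELS}
totals = {m: 0.0 for m in MODELS}
# Made-up probabilities that a mutation from each model improves fitness.
true_mean = {"gpt-4": 0.2, "claude-sonnet": 0.7, "gemini": 0.35}
for t in range(1, 1001):
    m = ucb_pick(counts, totals, t)
    reward = random.random() < true_mean[m]  # simulated fitness gain
    counts[m] += 1
    totals[m] += reward
# Over time, most mutation requests flow to whichever model has been
# producing improvements, while the others still get occasional probes.
```

The same machinery works for any set of mutation operators, which is what makes the selection itself a target of evolution.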

The Unknown Unknown

The more interesting question is whether any of this approaches genuine discovery. Lange is candid about the limitations. When asked if the system truly thinks outside the box, he frames it carefully: "I don't know all problems on the internet that try doing circle packing. But what I can see in the tree that we also depict is there's for example like a crossover operation between two programs happening where sort of different concepts are combined."

That hedge matters. The system recombines existing ideas in novel ways. It doesn't extrapolate beyond its training data in the way a human researcher might make an intuitive leap. The question is whether recombination is enough.

Lange argues it might be, at least as a starting point. "Even though these things might be in the end designed by humans, there are many unknown unknowns that we humans didn't think of while designing them," he points out. The space of possible combinations is vast enough that exhaustively exploring it yields surprises.

But there's a catch-22 here that Lange acknowledges. To find stepping stones useful for problem A, you might need to solve problem B first. But how does the system know to work on problem B when it's been given problem A? This is the "problem problem" he keeps returning to.

Earlier systems like POET—Paired Open-Ended Trailblazer—attempted this by co-evolving environments and agents simultaneously. As agents got better, environments became more complex, forcing agents to improve further. The curriculum emerged from the interaction rather than being specified upfront.
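The POET principle can be caricatured in a few lines: treat agent skill and environment difficulty as numbers that ratchet each other upward. This toy (all numbers and update rules invented here) is nothing like the real algorithm, which maintains populations of environments and transfers agents between them, but it shows how a curriculum can emerge from the pairing rather than being specified upfront.

```python
import random

random.seed(1)

skill, difficulty = 0.1, 0.1  # toy agent and toy environment
history = []
for step in range(500):
    if skill >= difficulty:
        # Agent solved the current environment: mutate it to be harder.
        difficulty += random.uniform(0.0, 0.05)
    # The agent keeps improving against whatever environment it faces.
    skill += random.uniform(0.0, 0.03)
    history.append((skill, difficulty))
# Neither trajectory was planned, yet both climb together: the environment
# only hardens when the agent is ready for it.
```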

Lange wants to extend this principle beyond reinforcement learning to scientific research more broadly. "Going forward it's going to be really important to not only sort of do open-ended optimization of solutions but sort of do the co-evolution of problem and solution together," he argues.

Where the Rubber Meets the Road

Shinka Evolve has produced concrete results beyond circle packing. It achieved second place in an AtCoder competitive programming challenge. It evolved load-balancing loss functions for mixture-of-experts models. It generated agent scaffolds for AIME mathematics benchmarks.

These are real accomplishments, but they're also bounded ones. The system works within well-defined domains where evaluation is clear. You know if circles overlap. You know if code passes test cases. The messy, ambiguous parts of research—deciding what questions matter, recognizing when an unexpected result is meaningful—remain human territory.

Lange doesn't oversell this. When discussing The AI Scientist, another project aimed at automating research, he's frank: "The current version is more co-pilot than autonomous researcher." The system can execute a research pipeline, but it can't yet steer one.

The interesting timeline question is how long this limitation persists. Lange believes scientific research will be "fundamentally transformed" in five to twenty years. That's a wide range, and deliberately so. The mechanics of iterative code improvement are working now. The capacity for genuine problem invention—for asking questions humans wouldn't think to ask—remains speculative.

What's clear is that making language models better at answering questions we pose won't get us there. The architecture needs to include the capacity for surprise, for pursuing directions that don't initially appear relevant. Whether Shinka Evolve's approach of co-evolving problems and solutions achieves this, or whether it's still too parasitic on human-defined starting conditions, remains an open question.

Lange has been working on this long enough to know the difference between genuine open-endedness and systems that just appear open-ended within constrained domains. His caution is worth noting. So is his conviction that the problem is solvable.

Bob Reynolds is Senior Technology Correspondent for Buzzrag

Watch the Original Video

When AI Discovers the Next Transformer — Robert Lange

Machine Learning Street Talk

1h 18m
Watch on YouTube

About This Source

Machine Learning Street Talk

Machine Learning Street Talk, launched in September 2025, has quickly become a pivotal platform for AI enthusiasts and professionals alike. With 208,000 subscribers, the channel delves into the cutting-edge realm of artificial intelligence, offering rich discussions on advanced AI research. It features a broad spectrum of topics, including cognitive science, computational models, and philosophical insights, positioning itself as an essential resource for those seeking to navigate the intricate AI landscape.

