A Practical Checklist for Writing Better AI Agent Skills

Developer hell has a new address. You know tutorial hell—the infinite loop of half-finished guides that left you knowing less than when you started. You know framework hell—the JavaScript ecosystem's long-running performance art piece where a new tool drops every ten minutes and you're supposed to somehow keep up. Well, welcome to skill hell: a growing pile of freely available AI agent skills that you can download, fork, and contribute to, but that you fundamentally cannot evaluate. You can't tell a good one from a bad one. And so you keep trying new ones, keep getting results that fall short of what the skill promises, and keep wondering if the problem is you.

Matt, the maintainer behind one of the more widely used engineering skill repositories (Matt PCO Skills), presented a talk at the AI Engineer World's Fair (remotely, family matters intervening) that tries to solve exactly this. His pitch: the field doesn't yet have a shared rubric for skill quality. There's no framework for looking at a skill and diagnosing what's working and what isn't. So he built one. It's a four-part checklist: trigger, structure, steering, pruning. And it's worth unpacking carefully, because the design decisions inside each category are less obvious than they first appear.

The Trigger Problem Is Actually a Trade-off Problem

The first question Matt asks about any skill is deceptively simple: who invokes it—the user or the model?

User-invoked skills are invisible to the agent unless the user explicitly calls them (think /skill-name). Model-invoked skills carry a description that lives permanently in the agent's context window, allowing the agent to pull in the full skill file when it decides the moment is right.

The intuitive response is that model-invoked skills are strictly better—more flexible, less demanding of the user. Matt pushes back on this. Every model-invoked skill you add contributes to what he calls "context load": more tokens burned on every request, more things for the agent to weigh. Stack a hundred of them and you have a hundred descriptions cluttering the agent's working memory. The alternative—all user-invoked skills—eliminates that problem but replaces it with cognitive load: now the human has to remember what's in the toolkit and when to reach for it.
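To put a rough number on the context-load side of that trade-off, here is a back-of-the-envelope sketch in Python. It is illustrative only: the four-characters-per-token ratio is a crude assumption rather than a real tokenizer, and the skill names and descriptions are invented.

```python
import math

# Assumption: roughly 4 characters per token. A heuristic, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Crude token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def context_load(descriptions: dict[str, str]) -> int:
    """Tokens that model-invoked skill descriptions add to every request."""
    return sum(estimate_tokens(d) for d in descriptions.values())


# Hypothetical model-invoked skills: name -> description kept in context.
model_invoked = {
    f"skill-{i}": f"Use this skill when the user asks for help with task {i}."
    for i in range(100)
}

per_request = context_load(model_invoked)
print(f"{len(model_invoked)} descriptions: about {per_request} tokens per request")
print(f"over 1,000 requests: about {per_request * 1000} tokens")
```

User-invoked skills contribute nothing to this total, which is the point of the trade: the cost moves out of the context window and into the user's memory.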

His own preference runs toward user-invoked, and he's explicit about why: "Every time you have a model invoked skill, it basically you get a cost in unpredictability because every time you have a context pointer pointing from one resource to another, the model may just choose not to follow it." He'd rather absorb the cognitive load himself than introduce a new failure mode to evaluate—specifically, the nasty problem of having to run evals just to confirm whether your skill is being called at all.

This is a genuine design tension, not a clear win for either side. Matt's approach suits expert users who know their tooling deeply. The "superpowers" skill set he contrasts with his own takes the opposite bet: let the model decide, free the user from memorization. Which philosophy fits your workflow depends on how much you trust the model's judgment versus your own.

The procedural knowledge framework underpinning skills makes this trade-off even sharper—because the value of a skill is entirely contingent on it being invoked at the right moment, by whoever or whatever is doing the invoking.

Structure: Two Units, One Constraint

Once you've sorted out the trigger, Matt's framework breaks the internal anatomy of a skill into two components: steps (the sequential procedure) and reference (supporting material those steps need). A skill can be all steps, all reference, or a mix—but naming the distinction forces cleaner design.

His 2PRD skill illustrates the mix well: three steps (find context, confirm test seams with the user, write the PRD) plus two reference sections (what a test seam is, a PRD template). Clean, auditable, small.

"Small" is the load-bearing word here. Matt's primary structural constraint is that the main skill.md file should be as minimal as possible. Every word is a token. Smaller files are easier to maintain and cheaper to run. And the mechanism for keeping things small is what he calls context pointers: if reference material is only relevant to one branch of a skill's possible execution paths, move it behind a pointer to a separate file. The agent fetches it only when needed, rather than hauling it into context on every invocation.

His domain modeling skill demonstrates the branching case: it might update a glossary, or create architectural decision records, or do neither. Three branches means three different reference needs—none of which belong unconditionally in the main skill file.
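A minimal sketch of the idea, assuming a hypothetical layout in which the main skill text holds only the steps and each branch's reference material lives in its own file. The class, function, and file names are invented for illustration, and in a real agent it is the model, not this code, that decides whether to follow a pointer.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional
import tempfile


@dataclass
class Skill:
    """A small main file plus pointers to reference files (hypothetical layout)."""
    main_text: str
    pointers: dict[str, Path] = field(default_factory=dict)


def build_context(skill: Skill, branch: Optional[str] = None) -> str:
    """Return the main text, plus the reference file for `branch` if one exists."""
    parts = [skill.main_text]
    if branch is not None and branch in skill.pointers:
        parts.append(skill.pointers[branch].read_text(encoding="utf-8"))
    return "\n\n".join(parts)


# Throwaway files standing in for real reference documents.
tmp = Path(tempfile.mkdtemp())
(tmp / "glossary.md").write_text("How to update the glossary.", encoding="utf-8")
(tmp / "adr.md").write_text("How to write a decision record.", encoding="utf-8")

domain_modeling = Skill(
    main_text="Steps: inspect the domain, then decide what (if anything) to update.",
    pointers={"glossary": tmp / "glossary.md", "adr": tmp / "adr.md"},
)

print(build_context(domain_modeling))              # neither branch: main text only
print(build_context(domain_modeling, "glossary"))  # glossary reference pulled in
```

The run that takes neither branch pays only for the main text, and the glossary run never loads the decision-record material.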

Nufar Gaspar's coverage of trigger design and branching logic approaches this from a slightly different angle, treating the skill's entry conditions as a first-class design concern rather than an afterthought. Together these perspectives suggest that the biggest structural mistakes in skills are errors of inclusion, not errors of omission.

Steering: The Vocabulary You're Already Using, Made Deliberate

This is where Matt's framework gets genuinely interesting, and where most developers will have a small recognition moment.

The core technique is what he calls leading words—terms that compress a large amount of behavioral meaning into a small token footprint. Drop the right leading word into a skill, and the agent will repeat it back in its reasoning traces, reinforcing the intended behavior as it goes.

His example: agents have a persistent tendency to code layer-by-layer—database, schemas, API, frontend, in sequence—rather than building thin vertical slices that enable early feedback. You can write a paragraph telling the agent not to do this. Or you can use the phrase "vertical slice," which the agent already has rich associations with from its training, and watch it surface in the reasoning output: "Okay, we're going to do this as a thin vertical slice."

"English is a pretty wide API in terms of different functions you can call," Matt notes, and the analogy is apt. Leading words are essentially function calls into the model's prior knowledge. You're not explaining the concept—you're invoking it.

The second steering technique is more counterintuitive: hide the future from the agent. Matt observed that in any two-step skill (ask clarifying questions, then create a plan), the agent would consistently under-invest in the first step because it could see the second step waiting. It would ask a couple of quick questions and then rush to the plan. His fix: split them into two separate skills. The agent now sees only the clarifying-questions phase, has no finish line to race toward, and does the thorough legwork the situation actually requires.
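One way to picture the split, as a sketch rather than Matt's actual tooling: treat a skill as a named list of steps and emit one skill per step, so the text loaded for the first phase never mentions the second. The `Skill` class and the `plan-feature` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    steps: list[str]


def split_by_step(skill: Skill) -> list[Skill]:
    """Turn each step into its own skill so no later step is visible."""
    return [
        Skill(name=f"{skill.name}-{i}", steps=[step])
        for i, step in enumerate(skill.steps, start=1)
    ]


def render(skill: Skill) -> str:
    """The text the agent would see when this skill is loaded."""
    lines = [f"# {skill.name}"]
    lines += [f"{n}. {step}" for n, step in enumerate(skill.steps, start=1)]
    return "\n".join(lines)


combined = Skill(
    name="plan-feature",  # hypothetical skill name
    steps=["Ask the user clarifying questions.", "Create a plan."],
)

for part in split_by_step(combined):
    print(render(part))
    print()
```

The cost is that the user now has to invoke the second skill explicitly, which is consistent with his stated preference for absorbing cognitive load himself.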

"It's not always necessary to split skills into individual steps," he clarifies, "but in particular cases where you really want an extra chunk of leg work. It really there's no technique like it."

Pruning: What Bloat Actually Looks Like

The final checklist pass is about removing what shouldn't be there. Matt identifies three specific failure modes.

Duplication. Every piece of reference material should have a single source of truth inside the skill. If the same concept appears in two places, you're not reinforcing it—you're just making the skill harder to maintain.

Sediment. This is the collaborative-document problem: multiple people contribute to a shared markdown file, nobody feels empowered to delete anything, and the skill gradually accumulates irrelevant, outdated, or redundant content. Sediment is a governance issue wearing a technical costume.

No-ops. These are the most insidious—instructions that look like they should influence behavior but don't. Matt's test: delete the paragraph and see if the agent's output meaningfully changes. If a block of text telling the agent to write a detailed commit message gets removed and the agent still writes a detailed commit message, that block was never doing real work. No-ops are especially common in agent-generated skills, where the model tends toward thoroughness that doesn't translate into behavioral change.

The deletion test is a usefully empirical approach in a space that's often more vibes than rigor. It doesn't require evals. It just requires willingness to cut.
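The bookkeeping for the deletion test can be sketched in a few lines of Python. Everything here is an assumption for illustration: `run_agent` stands in for however you actually invoke your agent, `changed` stands in for the judgment of whether two outputs differ meaningfully (which may simply be a person reading them), and `fake_agent` is a deterministic stub, not a real model.

```python
from typing import Callable


def split_paragraphs(skill_text: str) -> list[str]:
    """Split a skill file into paragraphs separated by blank lines."""
    return [p.strip() for p in skill_text.split("\n\n") if p.strip()]


def find_no_ops(
    skill_text: str,
    run_agent: Callable[[str], str],
    changed: Callable[[str, str], bool],
) -> list[str]:
    """Return paragraphs whose removal does not meaningfully change the output."""
    paragraphs = split_paragraphs(skill_text)
    baseline = run_agent("\n\n".join(paragraphs))
    no_ops = []
    for i, paragraph in enumerate(paragraphs):
        without = paragraphs[:i] + paragraphs[i + 1:]
        output = run_agent("\n\n".join(without))
        if not changed(baseline, output):
            no_ops.append(paragraph)
    return no_ops


# Stub agent: always writes a detailed commit message, and only works in
# vertical slices when the skill text tells it to.
def fake_agent(skill_text: str) -> str:
    style = "vertical slices" if "vertical slice" in skill_text else "layer by layer"
    return f"Built the feature in {style}; wrote a detailed commit message."


skill = (
    "Build the feature as a thin vertical slice.\n\n"
    "Always write a detailed commit message."
)

print(find_no_ops(skill, fake_agent, changed=lambda a, b: a != b))
```

With the stub, only the commit-message paragraph is flagged, mirroring Matt's example. A real agent is not deterministic, so an exact string comparison would be too strict; in practice `changed` would need to tolerate run-to-run variation.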

The checklist Matt presents—trigger, structure, steering, pruning—is a start at a shared vocabulary the field has been operating without. Whether it becomes that vocabulary depends on whether the community that maintains these repos decides to adopt it, fork it, argue with it, or ignore it in favor of the next promising framework that surfaces in a few months.

Given that we're in skill hell, the last option carries a certain grim plausibility.

Dev Kapoor covers open source software and developer communities for Buzzrag.