Edited by humans. Written by AI.

A Practical Checklist for Writing Better AI Agent Skills

Matt's four-part framework—trigger, structure, steering, pruning—offers the shared rubric developers need to escape the growing chaos of AI agent skill hell.

Dev Kapoor

June 30, 2026 · 8 min read

Developer hell has a new address. You know tutorial hell—the infinite loop of half-finished guides that left you knowing less than when you started. You know framework hell—the JavaScript ecosystem's long-running performance art piece where a new tool drops every ten minutes and you're supposed to somehow keep up. Well, welcome to skill hell: a growing pile of freely available AI agent skills that you can download, fork, and contribute to, but that you fundamentally cannot evaluate. You can't tell a good one from a bad one. And so you keep trying new ones, keep getting results that fall short of what the skill promises, and keep wondering if the problem is you.

Matt Pocock, the maintainer behind one of the more widely used engineering skill repositories, presented a talk at the AI Engineer World's Fair (remotely, family matters intervening) that tries to solve exactly this. His pitch: the field doesn't yet have a shared rubric for skill quality. There's no framework for looking at a skill and diagnosing what's working and what isn't. So he built one. It's a four-part checklist: trigger, structure, steering, pruning. And it's worth unpacking carefully, because the design decisions inside each category are less obvious than they first appear.


The Trigger Problem Is Actually a Trade-off Problem

The first question Matt asks about any skill is deceptively simple: who invokes it—the user or the model?

User-invoked skills are invisible to the agent unless the user explicitly calls them (think /skill-name). Model-invoked skills carry a description that lives permanently in the agent's context window, allowing the agent to pull in the full skill file when it decides the moment is right.

The intuitive response is that model-invoked skills are strictly better—more flexible, less demanding of the user. Matt pushes back on this. Every model-invoked skill you add contributes to what he calls "context load": more tokens burned on every request, more things for the agent to weigh. Stack a hundred of them and you have a hundred descriptions cluttering the agent's working memory. The alternative—all user-invoked skills—eliminates that problem but replaces it with cognitive load: now the human has to remember what's in the toolkit and when to reach for it.
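To make the trade-off concrete, here is a minimal Python sketch that tallies both costs for a toolkit. Everything in it is an assumption for illustration: the `Skill` record, the two skill names, and especially the four-characters-per-token estimate, which stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    model_invoked: bool  # False means the user must call it explicitly


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token.
    # Swap in a real tokenizer before trusting any budget.
    return max(1, len(text) // 4)


def context_load(skills: list[Skill]) -> int:
    """Tokens added to every request by model-invoked descriptions."""
    return sum(estimate_tokens(s.description) for s in skills if s.model_invoked)


def cognitive_load(skills: list[Skill]) -> int:
    """Number of skills the human has to remember to invoke."""
    return sum(1 for s in skills if not s.model_invoked)


# Hypothetical toolkit: one skill of each kind.
skills = [
    Skill("plan-feature", "Turn the conversation into a plan.", model_invoked=False),
    Skill(
        "update-glossary",
        "Use when domain terms change and the glossary needs updating.",
        model_invoked=True,
    ),
]

print("context load (tokens per request):", context_load(skills))
print("cognitive load (skills to remember):", cognitive_load(skills))
```

Flipping a skill's `model_invoked` flag moves its cost from one tally to the other, which is the whole tension in miniature.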

His own preference runs toward user-invoked, and he's explicit about why: "Every time you have a model-invoked skill, you get a cost in unpredictability, because every time you have a context pointer pointing from one resource to another, the model may just choose not to follow it." He'd rather absorb the cognitive load himself than introduce a new failure mode to evaluate: specifically, the nasty problem of having to run evals just to confirm whether your skill is being called at all.

This is a genuine design tension, not a clear win for either side. Matt's approach suits expert users who know their tooling deeply. The "superpowers" skill set he contrasts with his own takes the opposite bet: let the model decide, free the user from memorization. Which philosophy fits your workflow depends on how much you trust the model's judgment versus your own.

The procedural knowledge framework underpinning skills makes this trade-off even sharper—because the value of a skill is entirely contingent on it being invoked at the right moment, by whoever or whatever is doing the invoking.


Structure: Two Units, One Constraint

Once you've sorted out the trigger, Matt's framework breaks the internal anatomy of a skill into two components: steps (the sequential procedure) and reference (supporting material those steps need). A skill can be all steps, all reference, or a mix—but naming the distinction forces cleaner design.

His 2PRD skill illustrates the mix well: three steps (find context, confirm test seams with the user, write the PRD) plus two reference sections (what a test seam is, a PRD template). Clean, auditable, small.
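One way to picture the steps/reference split is as a tiny data model. This Python sketch is an assumption-laden illustration: the `SkillDoc` class, the heading layout, and the placeholder reference text are invented here, and only the three steps and two reference titles come from the talk as described.

```python
from dataclasses import dataclass, field


@dataclass
class SkillDoc:
    name: str
    steps: list[str] = field(default_factory=list)           # ordered procedure
    reference: dict[str, str] = field(default_factory=dict)  # title -> supporting text

    def render(self) -> str:
        """Render the skill as markdown-style text (layout is hypothetical)."""
        lines = [f"# {self.name}", ""]
        if self.steps:
            lines.append("## Steps")
            lines += [f"{i}. {step}" for i, step in enumerate(self.steps, start=1)]
            lines.append("")
        for title, body in self.reference.items():
            lines += [f"## Reference: {title}", body, ""]
        return "\n".join(lines).rstrip() + "\n"


prd_skill = SkillDoc(
    name="prd",
    steps=[
        "Find the relevant context.",
        "Confirm the test seams with the user.",
        "Write the PRD.",
    ],
    reference={
        # Placeholder bodies: the real wording is not given in the article.
        "What a test seam is": "(placeholder definition)",
        "PRD template": "(placeholder template)",
    },
)

print(prd_skill.render())
```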

"Small" is the load-bearing word here. Matt's primary structural constraint is that the main skill.md file should be as minimal as possible. Every word is a token. Smaller files are easier to maintain and cheaper to run. And the mechanism for keeping things small is what he calls context pointers: if reference material is only relevant to one branch of a skill's possible execution paths, move it behind a pointer to a separate file. The agent fetches it only when needed, rather than hauling it into context on every invocation.

His domain modeling skill demonstrates the branching case: it might update a glossary, or create architectural decision records, or do neither. Three branches means three different reference needs—none of which belong unconditionally in the main skill file.
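A rough sketch of that lazy loading, in Python. The directory layout, file names, and `build_context` helper are hypothetical, and in practice the agent itself decides whether to follow a pointer. This only shows what ends up in context on each branch.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical layout: a small main file plus one reference file per branch.
POINTERS = {
    "glossary": "reference/glossary.md",
    "adr": "reference/adr.md",
}


def build_context(skill_dir: Path, branch: Optional[str]) -> str:
    """Return the main skill text plus only the reference this branch needs."""
    parts = [(skill_dir / "skill.md").read_text(encoding="utf-8")]
    pointer = POINTERS.get(branch) if branch else None
    if pointer is not None:
        parts.append((skill_dir / pointer).read_text(encoding="utf-8"))
    return "\n\n".join(parts)


# Build a throwaway skill directory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "reference").mkdir()
    (root / "skill.md").write_text("Decide: glossary, ADR, or neither.", encoding="utf-8")
    (root / "reference" / "glossary.md").write_text("Glossary format rules.", encoding="utf-8")
    (root / "reference" / "adr.md").write_text("ADR template.", encoding="utf-8")

    neither = build_context(root, None)
    glossary = build_context(root, "glossary")
    adr = build_context(root, "adr")

print(len(neither), len(glossary), len(adr))
```

The "neither" branch pays only for the main file, which is the point of keeping reference material behind pointers.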

Nufar Gaspar's coverage of trigger design and branching logic approaches this from a slightly different angle, treating the skill's entry conditions as a first-class design concern rather than an afterthought. Together these perspectives suggest that the biggest structural mistakes in skills aren't errors of omission; they're errors of inclusion.


Steering: The Vocabulary You're Already Using, Made Deliberate

This is where Matt's framework gets genuinely interesting, and where most developers will have a small recognition moment.

The core technique is what he calls leading words—terms that compress a large amount of behavioral meaning into a small token footprint. Drop the right leading word into a skill, and the agent will repeat it back in its reasoning traces, reinforcing the intended behavior as it goes.

His example: agents have a persistent tendency to code layer-by-layer—database, schemas, API, frontend, in sequence—rather than building thin vertical slices that enable early feedback. You can write a paragraph telling the agent not to do this. Or you can use the phrase "vertical slice," which the agent already has rich associations with from its training, and watch it surface in the reasoning output: "Okay, we're going to do this as a thin vertical slice."

"English is a pretty wide API in terms of different functions you can call," Matt notes, and the analogy is apt. Leading words are essentially function calls into the model's prior knowledge. You're not explaining the concept—you're invoking it.

The second steering technique is more counterintuitive: hide the future from the agent. Matt observed that in any two-step skill (ask clarifying questions, then create a plan), the agent would consistently under-invest in the first step because it could see the second step waiting. It asks a couple of quick questions and then rushes to the plan. His fix: split them into two separate skills. The agent now sees only the clarifying-questions phase, has no finish line to race toward, and does the thorough legwork the situation actually requires.

"It's not always necessary to split skills into individual steps," he clarifies, "but in particular cases where you really want an extra chunk of legwork, there's really no technique like it."


Pruning: What Bloat Actually Looks Like

The final checklist pass is about removing what shouldn't be there. Matt identifies three specific failure modes.

Duplication. Every piece of reference material should have a single source of truth inside the skill. If the same concept appears in two places, you're not reinforcing it—you're just making the skill harder to maintain.

Sediment. This is the collaborative-document problem: multiple people contribute to a shared markdown file, nobody feels empowered to delete anything, and the skill gradually accumulates irrelevant, outdated, or redundant content. Sediment is a governance issue wearing a technical costume.

No-ops. These are the most insidious—instructions that look like they should influence behavior but don't. Matt's test: delete the paragraph and see if the agent's output meaningfully changes. If a block of text telling the agent to write a detailed commit message gets removed and the agent still writes a detailed commit message, that block was never doing real work. No-ops are especially common in agent-generated skills, where the model tends toward thoroughness that doesn't translate into behavioral change.

The deletion test is a usefully empirical approach in a space that's often more vibes than rigor. It doesn't require evals. It just requires willingness to cut.
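The deletion test is simple enough to sketch as a loop. This is an illustration under assumptions, not Matt's tooling: `run_agent` and `same_behavior` are placeholders you would supply, and the fake agent below exists only so the example runs. Real agents are nondeterministic, so an exact string comparison is usually too strict; expect to compare several runs or judge by hand.

```python
from typing import Callable


def find_noops(
    skill_text: str,
    run_agent: Callable[[str], str],
    same_behavior: Callable[[str, str], bool],
) -> list[str]:
    """Return paragraphs whose removal does not change the agent's output."""
    paragraphs = [p for p in skill_text.split("\n\n") if p.strip()]
    baseline = run_agent(skill_text)
    noops = []
    for i, paragraph in enumerate(paragraphs):
        # Rebuild the skill with exactly one paragraph deleted.
        ablated = "\n\n".join(paragraphs[:i] + paragraphs[i + 1:])
        if same_behavior(baseline, run_agent(ablated)):
            noops.append(paragraph)
    return noops


def fake_agent(skill_text: str) -> str:
    # Stand-in for a real agent call: it always writes a detailed commit
    # message and only changes its plan when told to use a vertical slice.
    style = "vertical slice" if "vertical slice" in skill_text else "layer by layer"
    return f"plan: {style}; commit message: detailed"


skill = (
    "Build each feature as a thin vertical slice.\n\n"
    "Write a detailed commit message."
)
noops = find_noops(skill, fake_agent, lambda a, b: a == b)
print(noops)
```

With this stand-in agent, only the commit-message paragraph is flagged, mirroring the example from the talk.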


The checklist Matt presents—trigger, structure, steering, pruning—is a start at a shared vocabulary the field has been operating without. Whether it becomes that vocabulary depends on whether the community that maintains these repos decides to adopt it, fork it, argue with it, or ignore it in favor of the next promising framework that surfaces in a few months.

Given that we're in skill hell, the last option carries a certain grim plausibility.


Dev Kapoor covers open source software and developer communities for Buzzrag.
