All posts

You Can't Prompt Your Way Out of a Missing Ground Truth

This week I thought I was making progress. The pipeline was running, tests were being generated, the dashboard looked green.

Then I actually looked at the output.

The pipeline was producing 40, 50 acceptance criteria per ticket. Most were unnecessary - duplicated across viewports, checking things that didn't need checking, and some were testing content that wasn't even on the page.

I spent a few days trying to fix it with better prompts. Clearer instructions. More explicit rules. It helped a little. It didn't solve it.

No prompt is good enough when the agent has no idea what's actually on the page.

It Started with My Own Code

Before I touched the AgenticQA pipeline output, I needed to clean up the pipeline itself. The codebase was getting messy - duplicated logic, agents doing too many things, the usual entropy that builds up when you're moving fast.

So I created a code-review skill in Claude Code - a plain text file that encodes the engineering principles I actually care about:

Every time Claude reviews my code, it loads that skill and checks against those principles. It's just a markdown file in .claude/skills/code-review/SKILL.md, but it changed how Claude works with my codebase. Consistent standards, every time, without me having to repeat myself.

And that's when the idea hit me.

What If My Agents Had Skills Too?

My pipeline is built on the Claude Agent SDK. Each step - reading the ticket, exploring the page, generating criteria, writing tests - is its own agent. And those agents were the ones producing all the noisy, hallucinated output.

If a skill file could make Claude Code follow my coding standards, could the same thing work for my pipeline agents? Could I write a qa-best-practices skill that encodes our QA rules - observe don't speculate, no duplicates, every test tied to real evidence - and have each agent follow it?

The SDK Actually Supports This

Turns out, the Claude Agent SDK has first-class support for skills. You put a SKILL.md file in .claude/skills/, configure two options, and the model discovers and loads the skill automatically:

for await (const message of query({ prompt: "Run QA pipeline for ticket QA-57", options: { settingSources: ["project"], // discover skills from .claude/skills/ allowedTools: ["Skill", "Read", "Bash"] // enable the Skill tool } }))

Skills are on-demand: the model reads the description, decides when the skill is relevant, and loads it. Clean, elegant, and built right into the SDK.

But I didn't use it.

Why I Chose Not To

Here's the thing about on-demand loading: the model decides when to use the skill. That's great for optional capabilities - "use the PDF skill when someone uploads a PDF." But my QA standards aren't optional. They need to apply to every single run, unconditionally.

I also considered using a project-level rules file (CLAUDE.md), which gets loaded automatically for every agent. But my pipeline has eight agents, and only three of them generate criteria and tests. The executor and bug reporter don't need criteria-writing rules cluttering their context.

So I went with the simplest thing that actually fit: I load the SKILL.md file at startup and append its content to the system prompt of the three agents that need it. Per-agent targeting. Unconditional application. No hoping the model decides to load it.

"First-class" doesn't always mean "right for your case." Understand the mechanism before adopting it.

Before and After

Same ticket. Same page. Same pipeline. Here's the output before and after adding the QA skill:

Before
Footer does not cause horizontal scrolling on desktop (>=1024px)
Footer does not cause horizontal scrolling on tablet (768-1023px)
Footer does not cause horizontal scrolling on mobile (<768px)
Footer branding uses dark background consistent with design system
No footer elements overlap adjacent page content sections
Footer link with no hover state is flagged as missing interactivity
Footer layout is consistent in Chrome when page loads
After
Verify all 10 Solutions column footer links resolve with HTTP 200
Verify all 4 Integrations column links: Source Code, ChatOps, Issue Management, AI
Verify all 3 legal bar links: Privacy Policy, Terms of Use, Status
Verify all 4 social media icon links resolve: LinkedIn, X, Facebook, Instagram

Fewer tests. But when they fail now, something actually broke.

The Real Lesson

I only landed on the right design because I stopped and asked Claude: "is this the best way to load a skill in the Agent SDK?" That one question surfaced three mechanisms - skills, rule files, system prompt injection - and the conversation that followed is what made the choice deliberate instead of accidental.

Using Claude to interrogate your own code and design decisions - not just generate them - is underrated. The shift from "write me this" to "why would I do this" is where the real leverage is.

#agenticQA  #llmtesting  #claudeagentsdk  #buildinpublic  #qaautomation  #ai