June 13, 2026 · The Constraint

How to find what is actually broken in a prompt that looks complete

You have written the guardrails. You confirmed the output format. You added a schema. The prompt still produces wrong answers, and the wrong answers are confident. Here is what breaks a prompt that appears structurally sound.

The prompt is not missing a guardrail. It is not missing an output format. The schema is correct. The problem is that the prompt uses terms whose meanings it never defined: terms like “feat,” “fix,” “breaking,” “critical,” “high priority.” The model fills those definitions from its training distribution. The training distribution does not know your codebase’s conventions. It knows the aggregate of millions of changelogs, bug trackers, and release notes from teams whose conventions differ from yours in ways that are invisible until the wrong classification ships to production.

The wrong answers are confident because the model is not guessing. It is applying a definition. The definition is just not the one you meant. This failure has a name: Semantic Drift Vector. The process for diagnosing it and rebuilding the prompt has a name too: BYOP.

Why a structurally complete prompt can still produce wrong answers

A language model generates by sampling from a probability distribution over possible next tokens. Your prompt narrows that distribution. Add a clear target and the distribution tightens. Add context and it tightens further. But if you never specify what the output categories mean, the model fills those definitions from its training prior.

For widely used labels, the model’s prior is strong and internally consistent, but it does not match your specific convention. Conventional Commits labels are particularly vulnerable: the model has seen enough changelogs to be confidently, systematically wrong about edge cases in your codebase. A JSON schema validates that "type": "feat" is present as a string. It says nothing about which diffs count as feat by your team’s definition. The schema ratifies the structure. It cannot catch semantic drift inside that structure.

This is why the failure is hard to detect. The outputs are syntactically correct. They pass validation. The error is in the semantic layer: the model applied its own definition of the label rather than yours. The closer a term is to a widely-shared standard, the stronger the prior, and the harder the drift is to see until you compare a set of outputs against ground truth.
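A minimal sketch of that gap, using only the standard library. The validator, the sample output, and the maintainer label are all hypothetical stand-ins; the point is only that a shape check and a meaning check are two separate comparisons.

```python
ALLOWED_TYPES = {"feat", "fix", "refactor", "chore"}


def is_structurally_valid(entry: dict) -> bool:
    """Check shape only: field names, value types, and enum membership."""
    return (
        set(entry) == {"type", "scope", "description", "breaking"}
        and entry["type"] in ALLOWED_TYPES
        and isinstance(entry["scope"], str)
        and isinstance(entry["description"], str)
        and isinstance(entry["breaking"], bool)
    )


# Hypothetical model output for a diff that only reorganized private helpers.
model_output = {
    "type": "feat",
    "scope": "parser",
    "description": "Reorganize tokenizer helpers",
    "breaking": False,
}

# Hypothetical label a maintainer would assign under the team's convention.
team_label = "refactor"

structurally_valid = is_structurally_valid(model_output)
semantically_correct = model_output["type"] == team_label

print("passes schema:", structurally_valid)
print("matches team convention:", semantically_correct)
```

The first check needs nothing but the output. The second needs a label that exists only in the team’s heads until someone writes it down.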

The four diagnostic questions

BYOP (Bring Your Own Prompt) is a diagnostic rebuild protocol. You bring a failing production prompt. The protocol walks it through four questions, each targeting a different layer where a structurally complete prompt can still fail. The questions are ordered by the layer they probe: intent, then context, then structural guardrails, then semantic guardrails.

Q1: Is the intent stated, specific, and unambiguous? Not just present. Specific enough that a domain expert reading the prompt in isolation would produce the same output as another domain expert. If two engineers would classify the same diff differently given only the intent statement, the intent is underspecified even if it is grammatically complete.

Q2: Is the context complete? Does the prompt supply every input the model needs to match the domain’s definition of correct: not just the shape of the input, but the conventions that govern the output? Missing context is often invisible because the model fills the gap with a plausible default. The default is not wrong in the abstract. It is wrong for your codebase.

Q3: Are the guardrails structural or semantic? Structural guardrails constrain format. A JSON schema tells the model what shape the output must take. It does not say what the values mean. Structural guardrails do not catch semantic error; they ratify it.

Q4: Are the output categories defined in the prompt, or are they inherited from training? Any output category that is not defined in the prompt will be defined by the model’s prior, and for widely used labels that prior is strong, internally consistent, and unaware of your convention. If Q3 or Q4 fails, the prompt has a Semantic Drift Vector: a gap between what the term means in your domain and what the model infers it means from training.
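Q4 is the one question that can be partly mechanized. The sketch below is a rough heuristic, not part of the protocol: it assumes a definition is written as a line of the form `label: some text`, and the function name and sample prompts are hypothetical.

```python
import re


def undefined_labels(prompt: str, labels: list[str]) -> list[str]:
    """Return the labels that have no 'label: definition' line in the prompt."""
    missing = []
    for label in labels:
        # A label counts as defined only if a line starts with it,
        # followed by a colon and some non-empty text on the same line.
        pattern = rf"^[ \t]*{re.escape(label)}[ \t]*:[ \t]*\S"
        if not re.search(pattern, prompt, flags=re.MULTILINE):
            missing.append(label)
    return missing


labels = ["feat", "fix", "refactor", "chore", "breaking"]

# Hypothetical prompt that names the labels but never defines them.
original_prompt = """Generate a structured changelog entry from the supplied git diff.
Output JSON with fields type (feat | fix | refactor | chore), scope,
description, breaking."""

# Hypothetical prompt fragment that defines only two of the five labels.
partial_prompt = """DEFINITIONS
feat:     adds a capability a consumer of the public API could not previously invoke.
breaking: true if a consumer of the public API must change their code.
"""

missing_in_original = undefined_labels(original_prompt, labels)
missing_in_partial = undefined_labels(partial_prompt, labels)

print(missing_in_original)
print(missing_in_partial)
```

A check like this only confirms that a definition exists. Whether the definition matches the team’s convention is still a human judgment.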

The changelog classifier: a concrete case study

An engineer builds a changelog generation assistant for a TypeScript monorepo. The role is clear. The intent states “generate a structured changelog entry from the supplied git diff.” The output format specifies a JSON object with four fields: type (one of feat | fix | refactor | chore), scope, description, and breaking (boolean). This passes visual inspection. It passes schema validation on every run.

Three weeks in, the QA pass catches three recurring failures. Refactors with no user-facing effect are labelled feat. A diff that removed a public method was marked "breaking": false. Descriptions match the schema but copy commit message text verbatim rather than characterizing the actual change.

The BYOP diagnostic:

Q1 (Intent): Weak. “Generate a structured changelog entry” does not define what a correct entry looks like. The verb “generate” does no work.
Q2 (Context): Missing. The model knows nothing about this monorepo’s conventions. Without an explicit default for uncertain cases, the model tends to resolve ambiguity toward the less disruptive classification, which is how a removed public method ends up marked as non-breaking.
Q3 (Guardrails): Structural only. The JSON schema passes. Semantic guardrail: absent.
Q4 (Output categories): Inherited from training. feat, fix, refactor, chore, breaking: none defined. The model uses its Conventional Commits prior.

Failure layer identified: Semantic Drift Vector on every output category, meaning the four type values and the breaking flag.

The rebuilt prompt:

ROLE: Changelog generation assistant for a TypeScript monorepo.

INTENT
Classify the supplied git diff into exactly one of the defined change
types and produce a structured changelog entry using the definitions below.
Do not use the commit message text in the description field; characterize
the change from the diff.

DEFINITIONS
feat:     adds a capability a consumer of the public API could not
          previously invoke. Internal refactors, performance improvements,
          and test changes are not feat.
fix:      corrects incorrect behavior without altering the public API
          surface. A fix does not change what callers can call or how.
refactor: restructures implementation without altering observable behavior
          or the public API. No behavior change, no new capability.
chore:    dependency updates, config changes, build tooling. No code
          behavior changes.
breaking: true if a consumer of the public API must change their code to
          accommodate this diff. When uncertain, mark true.

GUARDRAILS
- If the diff modifies the public API surface and you are not certain
  whether the change is breaking, default to true.
- Do not copy commit message text into the description. Write from the diff.

OUTPUT FORMAT
{
  "type": "feat" | "fix" | "refactor" | "chore",
  "scope": string,
  "description": string,
  "breaking": boolean
}

Same model. Same diff. The misclassification rate drops because the model is no longer using its training prior for the output categories. The definitions replaced the prior.
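One way to keep the definitions and the enumerated values from drifting apart over time is to render both from a single mapping. This is a sketch under that assumption, not how the rebuilt prompt above was produced; `TYPE_DEFINITIONS` and both helper names are hypothetical, and the definition text is shortened from the prompt.

```python
# Hypothetical single source of truth for the four type values.
TYPE_DEFINITIONS = {
    "feat": "adds a capability a consumer of the public API could not previously invoke.",
    "fix": "corrects incorrect behavior without altering the public API surface.",
    "refactor": "restructures implementation without altering observable behavior or the public API.",
    "chore": "dependency updates, config changes, build tooling. No code behavior changes.",
}


def render_definitions(definitions: dict[str, str]) -> str:
    """Render a DEFINITIONS block with one aligned 'label: text' line per label."""
    width = max(len(label) for label in definitions) + 1  # label plus colon
    lines = ["DEFINITIONS"]
    for label, text in definitions.items():
        lines.append(f"{(label + ':').ljust(width)} {text}")
    return "\n".join(lines)


def render_type_enum(definitions: dict[str, str]) -> str:
    """Render the allowed values for the output format from the same mapping."""
    return " | ".join(f'"{label}"' for label in definitions)


definitions_block = render_definitions(TYPE_DEFINITIONS)
type_enum = render_type_enum(TYPE_DEFINITIONS)

print(definitions_block)
print(f'"type": {type_enum},')
```

With this arrangement, adding a fifth type without a definition is not possible, because the enum has no other source.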

Where the same failure pattern shows up

The changelog classifier is one instance of a pattern that recurs across any system that uses enumerated output categories without defining them.

Support ticket routing. A routing agent classifies inbound tickets by priority: Critical, High, Medium, Low. The schema is present. The labels are not defined. The model routes using its training prior for what “Critical” means in a support context. Your team’s definition of Critical (anything blocking a paying customer from completing a transaction) is narrower than the training prior (anything severe). The result: too many tickets land in Critical, the queue is overloaded, and the tier loses meaning.

Document review pipelines. A review agent flags clauses in contracts as Compliant, Flag, or Reject. “Flag” is not defined. The model fills it with its own threshold for what warrants flagging. Your threshold is stricter: you would flag clauses the model lets through. Clauses that should go to legal review are classified Compliant and released. The schema never triggers a validation error. The semantic drift is invisible until the compliance audit.

In each case, the same Q4 failure: output categories exist in the prompt as labels but are defined nowhere. The model applies its prior. The prior is not wrong in the abstract. It is wrong for the specific domain convention in use.

What is BYOP

BYOP (Bring Your Own Prompt) is a diagnostic rebuild protocol: take a failing production prompt, walk it through a structured four-question diagnostic (intent, context, structural guardrails, semantic guardrails), identify the failure layer, and produce a rebuilt version that targets the specific cause.

In use: “The changelog classifier was misclassifying for three weeks before anyone ran a BYOP pass on it. The diagnostic found the Semantic Drift Vector in under ten minutes.”

Where it does not apply: prompts that are failing because of model capability limits rather than prompt design. BYOP diagnoses structural and semantic failures in the prompt itself. If the task requires reasoning the model cannot perform at any specification level, the diagnostic will surface that limit. Adding definitions will not fix a capability problem. BYOP identifies whether the failure is in the prompt; model selection is a separate decision that follows the diagnosis.

What BYOP does not solve

The BYOP rebuild replaces the model’s training prior with explicit definitions. That is the fix for a Semantic Drift Vector. It is not a guarantee that the rebuilt prompt is correct.

The rebuild tells the model what “breaking” means in your domain. It does not verify that the model applied that definition correctly on any given diff. The output is still a classification produced by a probabilistic system. The rebuilt prompt raises the baseline; it does not establish a floor.

Establishing a floor requires a different move: defining, before you test the prompt, what a correct output looks like. Not structurally, not by schema, but by the domain’s criteria for truth. If you cannot write those criteria down before you run the prompt, you cannot evaluate whether the prompt is performing correctly. You can only observe outputs and guess. That prior criterion has a name: Ground Truth Contract. It is not a test suite, though a test suite is one implementation of it. It is a contract that specifies what “correct” means for this prompt before you look at any output.
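The test-suite form of a Ground Truth Contract could look something like the sketch below. Everything in it is a hypothetical illustration: the case names, the diffs, and `stub_classify`, which stands in for a real model call so the example runs on its own. Each case pins only the fields the team has agreed on in advance.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    name: str
    diff: str
    expected: dict  # only the fields the contract pins down for this case


# Written before any model output is inspected.
CONTRACT = [
    Case(
        name="private helper rename",
        diff="- function _tok(s)\n+ function _tokenize(s)",
        expected={"type": "refactor", "breaking": False},
    ),
    Case(
        name="public method removed",
        diff="- export function parseLegacy(input: string)",
        expected={"breaking": True},
    ),
]


def evaluate(classify: Callable[[str], dict], contract: list[Case]) -> list[tuple]:
    """Return (case name, field, expected, actual) for every mismatch."""
    failures = []
    for case in contract:
        output = classify(case.diff)
        for field, want in case.expected.items():
            got = output.get(field)
            if got != want:
                failures.append((case.name, field, want, got))
    return failures


def stub_classify(diff: str) -> dict:
    """Stand-in for the model call; always answers feat and non-breaking."""
    return {"type": "feat", "scope": "core", "description": "n/a", "breaking": False}


failures = evaluate(stub_classify, CONTRACT)
for failure in failures:
    print(failure)
```

The ordering is what matters here: the expected values exist before the classifier runs, so a mismatch is a finding rather than an opinion formed after reading the output.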

One action this week

Take any production prompt with output categories: a classifier, a triage tool, a routing agent, a severity labeller. Find every output category your prompt specifies, whether schema fields, enumerated values, or named instruction labels. For each one, ask whether the definition is written in the prompt or inherited from training. If the answer is “inherited,” write the definition now, in one sentence, using your domain’s criteria. Add it to the prompt under a DEFINITIONS block. Run the same inputs that produced the outputs you already have. Compare.

The gap between the old outputs and the new ones is a Semantic Drift Vector made visible.
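The comparison step can be as small as the sketch below. It assumes both runs are stored as mappings from an input identifier to the output object; the identifiers, the sample outputs, and the function name are hypothetical.

```python
def drift_report(old: dict, new: dict, fields=("type", "breaking")) -> dict:
    """For inputs present in both runs, list the fields whose value changed."""
    changed = {}
    for input_id in sorted(old.keys() & new.keys()):
        diffs = {
            field: (old[input_id].get(field), new[input_id].get(field))
            for field in fields
            if old[input_id].get(field) != new[input_id].get(field)
        }
        if diffs:
            changed[input_id] = diffs
    return changed


# Hypothetical outputs from the original prompt.
old_outputs = {
    "diff-1": {"type": "feat", "breaking": False},
    "diff-2": {"type": "fix", "breaking": False},
    "diff-3": {"type": "chore", "breaking": False},
}

# Hypothetical outputs for the same inputs after adding a DEFINITIONS block.
new_outputs = {
    "diff-1": {"type": "refactor", "breaking": False},
    "diff-2": {"type": "fix", "breaking": True},
    "diff-3": {"type": "chore", "breaking": False},
}

report = drift_report(old_outputs, new_outputs)
for input_id, diffs in report.items():
    print(input_id, diffs)
```

A changed label shows where the definition moved the model. It does not show which of the two labels is correct; that still needs a comparison against labels the team agrees on.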

The full vocabulary for diagnosing and rebuilding production AI prompts, all 25 precision terms, is on the Method page, free and public. For engineers who want the complete diagnostic toolkit (12 deployable system prompts, including SP-06 (Three-Constraint Rule Diagnostic) and SP-11 (Ground Truth Contract Builder)), the Agent Control Architecture Pack includes five fully worked rebuilds and three AGENTS.md templates. The Constraint newsletter ships weekly: one issue, one technique, one precise term. Subscribe free.
