TANAY.SHAH
// PUBLISHED 2026-05-09 · 7 MIN READ

Structured Outputs vs Tool Calling for LLM Data Extraction: Pick by Intent, Not by Habit

Both endpoints take a JSON schema. Both return validated structured data. They are not the same primitive, and picking the wrong one is the most common root cause of 'this LLM extraction is unreliable in production' that I see in code review. The rule: structured outputs for extraction and classification, tool calling for triggering actions; and structural validity is not semantic correctness. Here's the framework I use.

Two LLM API features look almost identical to a beginner. Both take a JSON schema. Both return validated structured data. Both hit 99-100% schema-validity rates as of early 2026 thanks to constrained decoding. They are not the same feature, and the agent products I've reviewed in the last six months show the same pattern in code review: someone reached for tool calling for what's really a data-extraction problem, or for structured outputs for what's really an action-triggering problem, and the resulting reliability is worse than it should be. The rule is small and I keep having to write it down: structured outputs for extraction, tool calling for actions, never the other way around.

What each primitive is actually for

  • Structured outputs (response_format with type: json_schema in OpenAI / output_config in Anthropic): the model returns a single JSON document that conforms to a supplied schema. The model is making one decision: 'what's the right structured representation of the answer?' There's no side effect, no external tool, no callback. The output is the answer. (Both request shapes are sketched in code after this list.)
  • Tool calling (tools / functions): the model returns a (tool_name, tool_args) pair signaling 'please call this tool with these args, then send me the result.' The model is making two decisions in sequence: 'which tool fits this turn' (selection) and 'what arguments fit the schema for that tool' (parameterization). The agent loop runs the tool, returns the result, and the model continues.
  • JSON mode (json_object): obsolete in 2026 except as a compatibility fallback. It guarantees the output is valid JSON; it does not enforce a schema. The model can return any-shape JSON. The schema-adherence rate is much lower than constrained decoding modes. Don't ship new code on it; migrate existing code off it when convenient.
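
To make the difference concrete, here's a minimal sketch of the two request shapes using the OpenAI Python SDK, whose parameter names I can vouch for (the Anthropic equivalents live in the docs cited in References). The schema and tool below are illustrative, not from a real product:

from openai import OpenAI

client = OpenAI()

ticket_schema = {
    "type": "object",
    "properties": {
        "issue_type": {"type": "string", "enum": ["billing", "bug", "how_to"]},
        "summary": {"type": "string"},
    },
    "required": ["issue_type", "summary"],
    "additionalProperties": False,
}

# Structured outputs: one decision ("what's the right structured answer?")
extraction = client.chat.completions.create(
    model="gpt-4o",  # placeholder model id
    messages=[{"role": "user", "content": "Classify this ticket: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "ticket", "schema": ticket_schema, "strict": True},
    },
)

# Tool calling: two decisions ("which tool?", then "which args?")
action = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Refund order 123."}],
    tools=[{
        "type": "function",
        "function": {
            "name": "issue_refund",
            "description": "Refund an order by id.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }],
)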

Why the choice matters for reliability

Tool calling is selection plus parameterization. The selection step has a real failure mode: the model picks the wrong tool given the user's request. As I covered in the tool-search post, accuracy on tool selection drops materially as the catalog grows past a handful of tools. If your problem is 'extract a structured representation of this contract,' you don't need the model to pick anything. You're forcing it to make a selection decision that doesn't have a meaningful choice. Structured outputs skips the selection step entirely.

Going the other direction: structured outputs for an action-triggering problem ('classify whether to refund this customer') still gives you a valid JSON object with a 'should_refund: true' field, but that's an answer, not an instruction. The agent loop has to interpret the JSON and dispatch to a refund tool itself. That dispatch is bespoke code; the agent-loop primitives that handle tool_use blocks (logging, audit, human-in-the-loop hooks) never fire, because the model didn't emit a tool_use block. You lose the platform affordances for exactly the action routing the tool-call path was designed to handle.

Concrete decision examples

  • Extract structured fields from a contract PDF (parties, effective date, term length, governing law): structured outputs. Single decision, no action.
  • Classify a support ticket into a fixed taxonomy of issue types: structured outputs. The class label is the answer.
  • Summarize a meeting transcript into a typed JSON of action items: structured outputs. Each action item is a typed object; the response is the array.
  • Look up a customer's order history before drafting a reply: tool calling. The model needs to invoke a database query, then reason about the result.
  • Send a Slack message to the on-call channel about an alert: tool calling. The model is taking an action with a side effect.
  • Hybrid (most common production case): tool call to fetch data, then structured-output a summary of it. Two model turns; first uses tool calling to trigger the fetch, second uses structured outputs to format the answer.

Structural validity is not semantic correctness

The biggest production trap with structured outputs in 2026 is the false confidence the constrained decoding guarantee gives you. The model will return a JSON object that matches your schema 99-100% of the time. It will also occasionally fill the fields with wrong values. Constrained decoding prevents schema violation; it does not prevent semantic error. A schema that says {governing_law: string} will get a string back; the string can still say 'California' when the actual contract specifies New York.

The defense is a validation layer on top of the constrained decoding. Pydantic (or Zod, or any cross-field validator) catches the cases the schema can't express: 'effective_date must be before expiration_date,' 'amount must be positive,' 'governing_law must be one of the 50 US state names.' Schema-level constraints (regex, enums, ranges) catch the obvious; cross-field validators catch the subtle. Both layers are cheap; running them is the difference between 'we trust the LLM's output' and 'we test the LLM's output.'
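
A minimal sketch of the two layers in Pydantic v2. The field names follow the contract example above; the specific rules (a three-state subset, the date ordering) are illustrative business logic, not anything a provider enforces:

from datetime import date
from typing import Literal

from pydantic import BaseModel, model_validator

class Contract(BaseModel):
    # Layer 1: schema-level constraints (types, enums) that constrained
    # decoding can also enforce at generation time.
    governing_law: Literal["New York", "California", "Delaware"]  # illustrative subset
    effective_date: date
    expiration_date: date
    amount_usd: float

    # Layer 2: cross-field rules the JSON schema can't express.
    @model_validator(mode="after")
    def check_business_rules(self) -> "Contract":
        if self.effective_date >= self.expiration_date:
            raise ValueError("effective_date must be before expiration_date")
        if self.amount_usd <= 0:
            raise ValueError("amount must be positive")
        return self

# Contract.model_validate_json(llm_response) raises on semantic violations
# that a schema-valid response can still contain.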

The library landscape, briefly

  • Pydantic: pure validation. Doesn't know about LLMs. Right when you control the LLM call yourself and just need to validate the response.
  • Instructor: thin wrapper that takes a Pydantic model, makes the LLM call, validates the response, and retries on failure. Right when you want one-line LLM-extraction calls in Python (usage sketch after this list).
  • BAML: a domain-specific language for declaring LLM-function interfaces, with cross-language code generation and 'Schema Aligned Parsing' that recovers structured data even from partially malformed model output. Right when multiple services in different languages consume the same LLM contracts, or when you want strong static guarantees over runtime validation.
  • Outlines: constrained-decoding library at the model-inference level. Forces the model's tokens to conform to a regex or grammar. Right for self-hosted models where you control inference; less relevant when calling hosted APIs that already do this.
  • Native provider features (OpenAI structured outputs, Anthropic output_config): right when you're calling the API directly and want to skip the wrapper layer. The libraries above wrap these underneath; the choice is wrapper-or-not, not which provider.
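
The one-line claim for Instructor, made concrete (model id and prompt are placeholders):

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Ticket(BaseModel):
    issue_type: str
    summary: str

# instructor patches the client so response_model drives the schema,
# the validation, and the retry-on-validation-failure in one call.
client = instructor.from_openai(OpenAI())

ticket = client.chat.completions.create(
    model="gpt-4o",
    response_model=Ticket,
    max_retries=2,  # re-prompts with the validation error on failure
    messages=[{"role": "user", "content": "Classify this ticket: ..."}],
)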

What I ship in production

# Pattern A: pure extraction
# Use case: extract structured fields from unstructured input
# (no side effect, single model call)
#
#   ┌──────────────────┐    response_format=json_schema
#   │  Anthropic API   │ ◄──────── Pydantic schema
#   │  + Claude Sonnet │
#   └────────┬─────────┘
#            │
#            ▼
#   ┌──────────────────┐    schema valid (99%+ guaranteed)
#   │  Pydantic        │    semantic checks (cross-field,
#   │  validator       │    enum subsets, business rules)
#   └────────┬─────────┘
#            │
#            ▼
#   ┌──────────────────┐
#   │  typed result    │  → application code
#   └──────────────────┘
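
The diagram shows the Anthropic path; here's the same Pattern A pipeline sketched against the OpenAI SDK's parse helper instead, since that signature is stable and takes a Pydantic model directly (swap in the Anthropic structured-outputs call per the docs in References). Contract is the model from the validation section above:

from openai import OpenAI

client = OpenAI()

def extract_contract(text: str) -> Contract:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",  # placeholder model id
        messages=[
            {"role": "system", "content": "Extract the contract fields."},
            {"role": "user", "content": text},
        ],
        response_format=Contract,  # Pydantic model from the validation section
    )
    # .parsed is the schema-valid object re-validated by Pydantic, so the
    # cross-field business rules fire here; a semantic violation raises
    # instead of flowing into application code.
    return completion.choices[0].message.parsed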


# Pattern B: action triggering
# Use case: model decides to invoke an external tool
#
#   ┌──────────────────┐    tools=[refund, lookup, ...]
#   │  Anthropic API   │
#   │  + Claude Sonnet │
#   └────────┬─────────┘
#            │
#            ▼ tool_use block
#   ┌──────────────────┐
#   │  agent loop      │ → invoke tool with audit + HITL gate
#   │  (orchestrator)  │
#   └──────────────────┘
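
A minimal sketch of the Pattern B loop with the Anthropic SDK. dispatch() stands in for your tool registry plus the audit log and HITL gate; the model id is a placeholder:

import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "issue_refund",
    "description": "Refund an order by id.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def agent_loop(messages: list) -> anthropic.types.Message:
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model id
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response  # final answer, no action requested
        messages.append({"role": "assistant", "content": response.content})
        results = []
        for block in response.content:
            if block.type == "tool_use":
                # the audit log and human-in-the-loop gate sit here,
                # before anything actually executes
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": dispatch(block.name, block.input),  # hypothetical registry
                })
        messages.append({"role": "user", "content": results})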


# Pattern C: hybrid (most common)
# Tool call to fetch, structured-output to summarize
#   tool_call(get_orders) → tool_result → structured_output(summary)
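
Stitched together, Pattern C is just the two sketches above composed across turns. The names here (answer_with_orders, extract_structured, OrderSummary) are hypothetical placeholders, not a real library API:

def answer_with_orders(question: str) -> "OrderSummary":
    # Turn(s) 1..n: tool calling; the model triggers the fetch, the
    # agent loop from Pattern B executes it and feeds the result back.
    transcript = agent_loop([{"role": "user", "content": question}])
    # Final turn: structured outputs; no tools offered, one schema'd
    # answer (an extract_contract-style call from Pattern A, with an
    # OrderSummary model instead of Contract).
    return extract_structured(OrderSummary, transcript)  # hypothetical helper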

What I'd skip

  • JSON mode without schema enforcement. If the provider supports structured outputs (all the major ones do as of early 2026), use it. JSON mode just guarantees parseability, not adherence; the gap matters in production.
  • Hand-rolled JSON-Schema-from-Pydantic-via-string-templates. Use the libraries' built-in conversion (Pydantic's model_json_schema(), the Instructor / Anthropic / OpenAI helpers). Hand-rolling is a class of subtle bug that the library has already fixed.
  • Schema-and-pray. The 'we'll use structured outputs and trust the result' approach is the production version of the false-confidence trap. The validation layer is not optional; it's where semantic errors get caught.
  • Mixing the two endpoints in the same call. Anthropic's API can do tool calling and force a JSON-schema'd final response in the same conversation; the temptation to combine them in one model call is real and the failure mode (the model produces a tool call when you wanted an extraction, or vice versa) is real. Use them on separate turns.

What I would change in my own pipeline

  • Centralize the Pydantic models in a single shared module. Today some are duplicated across files because they appeared in different services first. A single source-of-truth schema directory would reduce the per-service drift and make BAML migration (if I ever go that route) easier.
  • Add property-based testing on the Pydantic schemas. Hypothesis (Python) or fast-check (TS) can generate adversarial inputs and assert the schema accepts or rejects them as intended. Useful for catching the 'this schema is too permissive' class of bug (sketch after this list).
  • Treat structured-output schemas as a versioned wire format. When I tweak a field, downstream consumers shouldn't break silently. A schema-version field plus a contract test in CI is a small investment with a measurable failure-mode reduction.
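
The property-based-testing sketch promised above, using Hypothesis against the Contract model from earlier. The property probed (arbitrary strings must fail the governing_law constraint) is illustrative:

import pytest
from hypothesis import given, strategies as st
from pydantic import ValidationError

ALLOWED_LAWS = {"New York", "California", "Delaware"}

# Property: any string outside the allowed set must NOT validate.
# If one sneaks through, the schema is more permissive than we claim.
@given(st.text())
def test_governing_law_rejects_arbitrary_strings(s):
    if s in ALLOWED_LAWS:
        return  # the allowed values are, of course, allowed
    with pytest.raises(ValidationError):
        Contract(
            governing_law=s,
            effective_date="2026-01-01",
            expiration_date="2027-01-01",
            amount_usd=100.0,
        )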

The bigger lesson

The two LLM extraction primitives look interchangeable in the docs and behave very differently in production. Picking by intent (extract vs act) instead of by habit (tools-for-everything or json-mode-for-everything) is the architectural choice that determines whether your extraction pipeline is reliable. The constrained-decoding guarantee is real and helpful and not sufficient; the validation layer above it is what separates 'the model usually returns what we want' from 'the model returns what we want and we know it does.'

If a hiring manager asks me how I think about LLM data extraction in 2026, this is the answer. Not 'we use Pydantic,' but 'here's the primitive that fits the use case, here's the validation that catches the semantic errors the schema can't, here's the test surface that gives us confidence the pipeline does what we say it does.' Reliability in production LLM systems is mostly the work of separating 'looks right' from 'is right' at every layer.

References

  • OpenAI: 'Introducing Structured Outputs in the API' + response_format json_schema docs
  • Anthropic: structured outputs / output_config docs (platform.claude.com/docs/build-with-claude/structured-outputs)
  • Vellum: 'When should I use function calling, structured outputs or JSON mode' (2026 guide)
  • BAML: Schema Aligned Parsing (boundaryml.com/blog/schema-aligned-parsing)
  • Instructor (567-labs/instructor): Pydantic-based LLM wrapper
  • Outlines: constrained decoding library for self-hosted models
  • TechSy / BuildMVPFast: 2026 production guides on JSON mode vs function calling vs structured outputs