Anonymizing PII Client-Side Before It Reaches the LLM (Why I Don't Trust the Gateway)
The 2026 default for LLM PII protection is a server-side gateway: Presidio plus LiteLLM, Lakera Guard, Skyflow. The gateway sees the PII, redacts it, forwards the scrubbed prompt. That's a meaningful improvement over plaintext, but the gateway still has the data. The fat-client pattern moves redaction to the device: PII never leaves the trust boundary, not even to the gateway. Here's the design, the threat model that justifies it, and why I'd defend it for healthcare and legal AI products.
Open the 2026 architecture diagram for any enterprise LLM product handling sensitive data and you'll see the same shape: thin client → server-side gateway (Presidio behind LiteLLM, or Lakera Guard, or Skyflow) → LLM provider. The gateway redacts PII, forwards a scrubbed prompt to Claude or GPT, and re-hydrates the response on the way back. It's a meaningful improvement over plaintext, it satisfies most compliance audits, and it's the answer most teams ship. It also has a property that bothered me enough to design around it: the gateway sees the PII. The gateway holds plaintext PII in memory at the moment of redaction, and a compromised gateway leaks exactly that. That window matters for the highest-tier enterprise customers, and it's the reason I shipped a fat-client anonymization pattern instead.
Where the trust boundary actually sits
The server-side gateway pattern is correct in spirit and limited in scope. It moves the trust boundary from 'the LLM provider' to 'our gateway plus the LLM provider's redacted-input pathway.' That's better, not best. For workloads where the customer's contract reads 'PII never leaves our infrastructure' or 'PII never touches a third-party server in any form,' the gateway pattern technically passes audit while leaving a real attack surface: the gateway server is a network-attached process holding plaintext PII for the duration of the redaction step.
Healthcare and legal customers in particular ask for the stricter answer. They're not satisfied that the gateway redacts before forwarding; they want assurance that the gateway never holds the data in the first place. That's a constraint the gateway pattern can't satisfy because the gateway is where redaction happens by design.
The fat-client pattern, in pictures
┌─────────────────────────────────────────────┐
│ CLIENT (browser, native app, or desktop) │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ on-device NER model (GLiNER, etc) │ │
│ │ detects PII spans │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ local mapping store (in-memory) │ │
│ │ PERSON_0 = "Jane Smith" │ │
│ │ ADDR_0 = "123 Main St" │ │
│ │ DOB_0 = "1972-03-14" │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ▼ scrubbed prompt only │
└─────────────────┼────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ SERVER (thin orchestrator) │
│ never sees PII at any point │
│ forwards scrubbed prompt to LLM │
└──────────────────┬──────────────────────────┘
│
▼
┌──────────────┐
│ LLM (Claude │ receives prompt with placeholders,
│ / GPT, etc) │ returns response with same placeholders
└──────┬───────┘
│
▼ scrubbed response
┌─────────────────────────────────────────────┐
│ CLIENT re-hydrates from local mapping │
│ user sees the response with real PII │
└─────────────────────────────────────────────┘

Three things make this work as a real pattern, not a research demo:
- ▸On-device NER. The detection model runs on the user's machine. Not a wrapper around a server-side detector. GLiNER (a recent generalist NER model) is small enough (under 200 MB quantized) to run in-browser via ONNX or in a native iOS/macOS app via CoreML. Detection is sub-100ms for typical prompt sizes.
- ▸Local mapping store. The PERSON_0 → 'Jane Smith' mapping lives in client memory, never on a server. It's per-session, ephemeral, cleared on app close. There is no central store the gateway can be breached to leak.
- ▸Schema-stable placeholder format. The placeholders are typed and indexed (PERSON_0, ADDR_1, DOB_2) so the LLM can reason about them as semantic objects rather than opaque tokens. The LLM understands 'PERSON_0 met PERSON_1 at ADDR_0 on DATE_0' just fine; it doesn't need to know the actual names. The store and scrub step are sketched after this list.
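To make the mechanics concrete, here's a minimal TypeScript sketch of the mapping store and scrub step. detectEntities is a hypothetical wrapper around whatever on-device model you run (GLiNER exported to ONNX, say); its name and output shape are assumptions for illustration, not a real API.

```typescript
// Minimal sketch of the client-side scrub step. detectEntities() is a
// HYPOTHETICAL wrapper around the on-device NER model; its name and
// output shape are illustrative, not a real library API.

type EntityType = "PERSON" | "ADDR" | "DOB" | "EMAIL" | "PHONE";

interface Detection {
  type: EntityType;
  start: number; // character offsets into the prompt
  end: number;
}

// Assumed to run entirely on-device; spans assumed non-overlapping.
declare function detectEntities(text: string): Promise<Detection[]>;

class MappingStore {
  // placeholder -> original value, e.g. "PERSON_0" -> "Jane Smith"
  private toValue = new Map<string, string>();
  // original value -> placeholder, so a repeat mention reuses its placeholder
  private toPlaceholder = new Map<string, string>();
  private counters = new Map<EntityType, number>();

  placeholderFor(type: EntityType, value: string): string {
    const existing = this.toPlaceholder.get(value);
    if (existing) return existing;
    const n = this.counters.get(type) ?? 0;
    this.counters.set(type, n + 1);
    const placeholder = `${type}_${n}`;
    this.toValue.set(placeholder, value);
    this.toPlaceholder.set(value, placeholder);
    return placeholder;
  }

  valueFor(placeholder: string): string | undefined {
    return this.toValue.get(placeholder);
  }
}

async function scrub(prompt: string, store: MappingStore): Promise<string> {
  const detections = await detectEntities(prompt);
  // Replace right-to-left so earlier span offsets stay valid.
  detections.sort((a, b) => b.start - a.start);
  let out = prompt;
  for (const d of detections) {
    const value = out.slice(d.start, d.end);
    out = out.slice(0, d.start) + store.placeholderFor(d.type, value) + out.slice(d.end);
  }
  return out;
}
```

The value-to-placeholder index is what makes cross-turn consistency cheap: 'Jane Smith' mentioned in a later turn resolves to the same PERSON_0 it was assigned the first time.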
The threat model that justifies it
The fat-client pattern is overkill for most workloads. It's correct when the threat model includes any of these:
- ▸Server-side compromise. A successful attack on the orchestrator should leak nothing about the user's PII because the orchestrator never had any. This is the core 'zero data retention at the application layer' guarantee.
- ▸Insider risk on the platform team. The platform engineers who SSH into the gateway should not be able to read user PII even with their elevated access, because the gateway memory never holds it.
- ▸Regulatory contracts that say 'PII does not leave customer infrastructure.' Healthcare AI products selling into HIPAA-bound providers, legal AI selling to firms with attorney-client privilege concerns, EU products under GDPR Article 32 obligations.
- ▸Third-party LLM provider trust. Even with ZDR contracts, the LLM provider sees the model input. If the model input contains PII (even briefly during inference), the customer's contract may forbid it. Sending only placeholders sidesteps the question entirely.
If your threat model doesn't include those, the gateway pattern is fine and simpler. The fat client is engineering investment that pays off only when the audit demands it.
The trade-offs I accept
- ▸Bigger client. Shipping a 200 MB NER model adds to the app's download size. For browser apps this is a real cost; the model loads on first use and is cached in IndexedDB thereafter. For native apps the model ships in the bundle.
- ▸Detection quality varies by language and entity type. The on-device model is necessarily smaller than a server-side equivalent. Coverage is good for English personal names, addresses, phones, emails, SSNs, and DOBs; it's weaker for niche entity types (medical record numbers, custom identifiers) and lower-resource languages. The mitigation is layering: a regex pass for structured PII (high precision, low recall) on top of the NER pass for unstructured text (lower precision, higher recall) gives defense in depth that neither layer provides alone.
- ▸Mapping consistency across turns. If the user mentions 'Jane Smith' in turn 1 and again in turn 7, the placeholder should be PERSON_0 both times so the LLM can reason about the two mentions as the same entity. The mapping store is per-session, persists across turns, and is queried before each scrub to reuse existing mappings (the value-to-placeholder index in the sketch above). This is more work than a stateless gateway scrub.
- ▸Re-hydration in the UI is the developer's responsibility. The server returns text with placeholders; the client has to substitute back. For streaming responses this means the substitution runs on each chunk before render, and a placeholder can arrive split across two chunks. It's about 30 lines of client code, but it's code that doesn't exist in the gateway pattern; a sketch follows this list.
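Here's a minimal sketch of the streaming case, reusing the MappingStore shape from the earlier sketch (only its valueFor lookup is needed). The easy-to-miss part is the hold-back: any trailing run of characters that could be the start of a placeholder is buffered until the next chunk completes or disproves it.

```typescript
// Sketch of streaming re-hydration. Assumes the placeholder grammar and
// MappingStore from the earlier sketch; only valueFor() is needed here.

const PLACEHOLDER = /\b(?:PERSON|ADDR|DOB|EMAIL|PHONE)_\d+\b/g;

class Rehydrator {
  private buffer = "";
  constructor(private store: { valueFor(p: string): string | undefined }) {}

  // Returns text safe to render now. A trailing run that could be the
  // start of a placeholder ("...PERS" | "ON_0...") is held back until
  // the next chunk arrives.
  push(chunk: string): string {
    this.buffer += chunk;
    const partial = this.buffer.match(/[A-Z][A-Z_0-9]*$/);
    const safeEnd = partial ? partial.index! : this.buffer.length;
    const safe = this.buffer.slice(0, safeEnd);
    this.buffer = this.buffer.slice(safeEnd);
    return safe.replace(PLACEHOLDER, (p) => this.store.valueFor(p) ?? p);
  }

  // Call once the stream ends to drain whatever was held back.
  flush(): string {
    const rest = this.buffer.replace(PLACEHOLDER, (p) => this.store.valueFor(p) ?? p);
    this.buffer = "";
    return rest;
  }
}
```

The hold-back is conservative: an ordinary all-caps word at the end of a chunk is also delayed by one chunk, which is invisible at streaming speeds.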
How it compares to the named alternatives
- ▸Microsoft Presidio + LiteLLM gateway: server-side, mature, broad entity coverage, free open source. Right when the gateway is acceptable in your trust boundary.
- ▸Lakera Guard: server-side AI gateway, claims sub-50ms latency, 98%+ detection accuracy, includes prompt-injection guards. Right when you want commercial support and broader threat coverage.
- ▸Skyflow: privacy vault with format-preserving tokenization, polymorphic encryption. Right when you have a structured PII corpus that needs to remain queryable while encrypted; their position is explicitly that proxy-based gateways are limited.
- ▸On-prem local LLM: not really a 'PII redaction' pattern, but a 'send PII to a model that's also on-prem' pattern. Solves the same problem differently. Right when you can host a capable enough model locally; expensive for frontier-quality output.
- ▸Fat-client (this post): client-side detection + scrubbing + re-hydration. PII never reaches the server. Right when the contract says PII never leaves the user's machine, or when the threat model includes server compromise.
What the LLM sees, and why it still works
A common pushback on this pattern: 'won't the model give worse answers when it sees PERSON_0 instead of Jane Smith?' Mostly no, in my testing. The model treats PERSON_0 as a typed referent and reasons about it correctly: 'summarize the meeting between PERSON_0 and PERSON_1 at ADDR_0' produces a coherent summary that says 'PERSON_0 met with PERSON_1 at ADDR_0,' which the client re-hydrates to 'Jane Smith met with John Doe at 123 Main St.' The model doesn't need to know the actual names to reason about who-met-with-whom.
Where it does break: tasks that depend on the actual identity (look up Jane Smith's email) or on cultural-context inference about names (assess gender / nationality from name patterns). Those tasks legitimately need the PII, and the right answer is to gate them differently (look up the email locally, then send only the result; never send the inference task to a third-party model). The fat-client architecture supports this naturally because the local mapping store is the source of truth the client can query.
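Concretely, the gating looks something like this. localContactsLookup is a hypothetical stand-in for whatever local data source the client has; nothing here is a real API. The point is the resolution path: placeholder → real name (local store) → local lookup → re-scrubbed result.

```typescript
// Sketch of gating an identity-dependent task locally.
// localContactsLookup() is HYPOTHETICAL, illustrating a local-only query.
declare function localContactsLookup(name: string): Promise<string | undefined>;

interface Store {
  valueFor(placeholder: string): string | undefined;
  placeholderFor(type: "EMAIL", value: string): string;
}

async function resolveEmailLocally(
  personPlaceholder: string, // e.g. "PERSON_0"
  store: Store,
): Promise<string | undefined> {
  const realName = store.valueFor(personPlaceholder);
  if (!realName) return undefined;
  const email = await localContactsLookup(realName); // PII never leaves the device
  // If the result must reach the LLM at all, it goes out as a placeholder too.
  return email ? store.placeholderFor("EMAIL", email) : undefined;
}
```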
What I would change
- ▸Move the mapping store to a sealed-keychain-backed local cache that survives across sessions while remaining encrypted at rest. Today the store is in-memory only; for long-running session continuity this is too aggressive an eviction policy.
- ▸Add a confidence-tier output from the NER step: high-confidence detections are scrubbed silently; low-confidence detections trigger a UI confirmation ('Did you mean for this to be redacted? [yes / no]'). This catches the cases where the model misses an unusual entity without forcing the user to review every detection. A sketch follows this list.
- ▸Treat the placeholder schema as a versioned wire format. Today it's documented in code. Versioning would let new placeholder types ship to the client and server independently without coordinated deploys.
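The confidence-tier split is small in code. A sketch, with an illustrative threshold (0.85 is not a tuned value; "score" is the NER model's confidence in [0, 1]):

```typescript
// Sketch of the confidence tiers; threshold and field names are illustrative.
interface ScoredDetection {
  type: string;
  start: number;
  end: number;
  score: number; // model confidence in [0, 1]
}

function triage(detections: ScoredDetection[], threshold = 0.85) {
  return {
    auto: detections.filter((d) => d.score >= threshold), // scrub silently
    confirm: detections.filter((d) => d.score < threshold), // ask the user
  };
}
```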
The bigger lesson
Privacy architecture for AI is mostly about where the trust boundary sits. The gateway pattern is great for products where the boundary can be 'our server.' The fat-client pattern is the answer when the boundary has to be 'the user's device.' Picking the right one is a customer-contract question dressed up as an engineering question; the engineering question is 'what does the architecture have to look like to honor that contract.' Most teams skip the customer-contract step and end up with the wrong default.
If a hiring manager asks me how I think about privacy in AI infrastructure, this is the framing. Not 'we use Presidio because it's the standard,' but 'here's the contract, here's where the trust boundary sits, here's the architecture that places redaction inside that boundary.' That's the thinking that turns a compliance checkbox into a defensible product.
References
- ▸Microsoft Presidio: open-source PII redaction framework (microsoft.github.io/presidio)
- ▸GLiNER: generalist NER model suitable for on-device deployment
- ▸LiteLLM proxy + Presidio integration tutorial (docs.litellm.ai)
- ▸Lakera Guard: API-first AI gateway; LiteLLM guardrail integration docs (docs.litellm.ai/docs/proxy/guardrails/lakera_ai)
- ▸Skyflow: privacy vault and tokenization architecture (skyflow.com)
- ▸Building a Zero-Data-Retention Layer for Production LLM Agents (tanayshah.dev/blog/zero-data-retention-agents/)
- ▸Jarvis case study (tanayshah.dev/projects/jarvis/)