When to Use SSE vs WebSocket for AI Agent Streaming (and Why I Use Both)
The 'just use WebSocket' default for any real-time AI feature is wrong. Server-Sent Events is the right protocol for server-to-client token streaming (chat, agent output, tool results). WebSocket is the right protocol for client-to-server audio capture and any genuinely bidirectional channel. Same product, two protocols, and the choice between them is a memory and battery decision before it's a feature decision.
Pick any 'how to build an AI chat with streaming' tutorial in 2026 and you'll find an early choice that gets glossed over: SSE (Server-Sent Events) or WebSocket. The honest version of the answer is 'it depends on the direction of traffic.' The shorthand most engineers reach for is 'WebSocket because it's more powerful.' That shorthand is the wrong default. WebSocket is more capable; SSE is more correct for the dominant use case (token streaming from a server to a client). Picking the more powerful one anyway costs you memory at the server, battery on the user's phone, and operational headaches that the simpler protocol just doesn't have. This is the protocol decision I've made on every agent product I've shipped, and the framework I use to decide.
What each protocol actually is, briefly
- ▸SSE: a one-way stream from server to client over a long-lived HTTP/1.1 (or HTTP/2) response. The server writes chunks delimited by `\n\n`; the browser's EventSource API parses them. Auto-reconnect is in the spec; resume-from-last-event with the Last-Event-ID header is in the spec. No new transport, no new auth, no new firewall hole.
- ▸WebSocket: a full-duplex protocol negotiated over an HTTP Upgrade handshake, then bidirectional binary or text frames. Persistent connection; server can push, client can push. Built on a stateful protocol that infrastructure (load balancers, proxies) handles differently than plain HTTP.
- ▸Long polling: client makes a request, server holds it open until there's data, responds, client reconnects. Polyfill territory in 2026. It's mentioned only because some old infra still requires it; it's not a legitimate choice for new work.
The default I land on: SSE for server → client, WebSocket for everything else
┌─────────────────────┐                    ┌─────────────────────┐
│       client        │                    │       server        │
│  (web / iOS / CLI)  │                    │                     │
└──────────┬──────────┘                    └──────────┬──────────┘
           │                                          │
           │  POST /chat (HTTP request)               │
           ├─────────────────────────────────────────►│
           │                                          │
           │  text/event-stream (SSE response)        │
           │◄─────────────────────────────────────────┤
           │  data: {"chunk":"Hello"}                 │
           │  data: {"chunk":" world"}                │
           │  data: {"done":true}                     │
           │                                          │
           │  ◄── this is 99% of LLM chat traffic     │
           │      (server → client streaming text)    │
           │                                          │
           ╠══════════════════════════════════════════╣
           │                                          │
           │  WebSocket /audio (for voice capture)    │
           │◄════════════════════════════════════════►│
           │  binary audio frames upstream            │
           │  transcript frames downstream            │
           │                                          │
           │  ◄── bidirectional, low-latency,         │
           │      worth the operational cost          │
           │                                          │

Why SSE wins for chat / agent output streaming
- ▸The traffic is one-way. The agent's output streams server → client. The client doesn't push tokens to the server; it sends one HTTP POST to start the stream, then reads. WebSocket's bidirectional capability is unused, so you're paying the full cost of the more complex protocol for no benefit.
- ▸Auto-reconnect is in the spec. EventSource opens a new connection automatically on drop, sends the Last-Event-ID header, and the server can resume from where it left off. WebSocket reconnect is application code: you write it, you test it, you carry the bug surface. SSE gives you durable streaming for free.
- ▸Mobile battery is real. Real-world measurements put WebSocket at 2-3x the battery drain of HTTP streaming on mobile, primarily because the WebSocket keepalive (every 25-30 seconds) holds the cellular radio awake, while SSE leans on TCP keepalives that play nicer with radio sleep. For an iOS app where the user has a chat open in the background, this is the difference between 'the app is fine' and 'the app shows up in battery settings as a power hog.'
- ▸Server memory at scale. A WebSocket connection costs roughly 70 KiB of server memory even when idle. At 1,000 concurrent users that's about 68 MiB, fine. At 100,000 it's about 6.7 GiB, a real budget conversation. SSE connections close cleanly between turns; only the active stream holds memory.
- ▸Auth is HTTP. Whatever auth your REST API uses (cookies, bearer tokens, mTLS) just works with SSE because it's the same transport. WebSocket auth is its own little world (tokens in subprotocols, cookies, query params) and the answer is always more code than 'use the auth you already have.'
- ▸Vercel AI SDK 5+ default. The default useChat transport is SSE. Building on the path the framework already paves means fewer custom adapters and fewer surprises in the eval harness.
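The one-way shape above is easy to sketch framework-agnostically. Here is a minimal token-to-SSE generator (pure Python; the helper name and JSON payloads are my own, chosen to mirror the frames in the diagram — this is a sketch, not the Vercel AI SDK's wire format):

```python
import json
from typing import Iterable, Iterator

def sse_token_stream(chunks: Iterable[str], start_id: int = 0) -> Iterator[str]:
    """Wrap model output chunks as SSE frames with monotonically increasing ids.
    The ids are what make EventSource's automatic Last-Event-ID resume possible."""
    seq = start_id
    for chunk in chunks:
        yield f"id: {seq}\ndata: {json.dumps({'chunk': chunk})}\n\n"
        seq += 1
    # Terminal frame so the client knows the turn is complete, not dropped.
    yield f"id: {seq}\ndata: {json.dumps({'done': True})}\n\n"
```

Wiring this into any HTTP framework is one streaming response with `Content-Type: text/event-stream`; nothing here is WebSocket-shaped because nothing flows upstream after the initial POST.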
When WebSocket is the right answer
The same set of reasons inverted. Use WebSocket when the workload looks like:
- ▸Voice / audio capture from the client. You're sending continuous binary frames from the device's microphone to the server in real time. SSE doesn't support upstream traffic; WebSocket is purpose-built for it. This is the path I use for sales-coaching audio capture.
- ▸Bidirectional voice agents (OpenAI Realtime API, etc.). The server is sending audio to the client and the client is sending audio back, each at sub-second cadence. WebSocket (or WebRTC for browser) is the only realistic transport.
- ▸Genuinely interactive agents where the human and the model are exchanging messages every few hundred milliseconds. Most chat doesn't qualify; the model's response is the slow side and the user types one full message at a time. Multiplayer-style agent UIs where users and the model are co-editing something interactively do qualify.
- ▸Push from server to client when there's no active request. SSE requires the client to initiate the connection. If you need the server to push a notification ('your long-running task is done') without a corresponding request open, you either need WebSocket, or you need a separate notification channel (push notifications, polling).
The hybrid pattern (and why both is the right answer)
The agent products I've shipped use both, on different routes:
- ▸/api/chat — POST + SSE response. Token streaming for chat / agent output. 99% of the traffic.
- ▸/api/audio — WebSocket. Audio capture upstream, transcript chunks downstream. Used only by the live-coaching path; ordinary chat sessions never open this.
- ▸/api/events — SSE with Last-Event-ID for resume. Used by clients that want real-time updates on server-side events (a teammate sent a message, a long-running tool finished). Cheaper than WebSocket and has native resume.
Three routes, two protocols, each chosen for the traffic shape. The cost is one extra protocol surface; the benefit is the chat path stays cheap to operate while the audio path gets the full bidirectional capability where it actually pays for itself.
Common gotchas in production
- ▸HTTP/1.1 SSE has a 6-connection-per-origin browser limit. If you open SSE plus three other tabs, you can starve out other requests. The fix: serve SSE over HTTP/2 (which multiplexes), or use a separate origin / subdomain for the streaming endpoint.
- ▸Proxies and load balancers can buffer SSE responses, breaking the streaming. Set `X-Accel-Buffering: no` (nginx) or the equivalent in your proxy of choice. Default Vercel and Cloudflare configs forward SSE correctly, but a plain nginx in front of a service does not.
- ▸WebSocket sticky sessions. If you scale a WebSocket service horizontally, the user's connection has to land on the same backend that knows about their session. This is a load-balancer config concern that doesn't exist for SSE (each request can land anywhere because the state is on the SSE response, not in the connection).
- ▸Last-Event-ID resume requires the server to remember enough history. If your server is stateless (chat history in Postgres, per the append-only-events post elsewhere on this blog), the resume implementation is a query: events after a given seq. If the server uses in-memory queues, resume is a more delicate problem.
- ▸Don't try to multiplex unrelated streams over a single WebSocket. The temptation to 'one connection for everything' is real and the failure mode (one slow consumer blocking others, ambiguous reconnect semantics) is also real. One protocol per logical stream.
What I would change if I were rebuilding the streaming layer
- ▸Adopt HTTP/2 for SSE everywhere. The 6-per-origin browser limit on HTTP/1.1 is silly to fight; HTTP/2 multiplexing makes it disappear and the deploy cost is small.
- ▸Treat the WebSocket route as a separate service. Today it lives in the same FastAPI app as everything else. The WebSocket lifecycle (long-lived, sticky, memory-heavy) deserves its own deploy unit so a chat-traffic spike doesn't compete for memory with an audio session.
- ▸Add a heartbeat comment frame to the SSE stream every 15 seconds. Browsers and proxies sometimes silently drop a stream that looks idle; a `: ping` comment keeps it lit without changing the wire shape.
- ▸Standardize the resume contract: every streaming endpoint accepts Last-Event-ID and returns events from that ID forward. Today some routes do this and some don't; making it universal removes a class of subtle reconnect bugs.
The bigger lesson
Real-time AI products are full of choices that look small in the architecture diagram and dominate the operating cost. Streaming protocol is one of them. Picking the more powerful protocol everywhere because 'we might need it' is the most common shape this mistake takes; you pay the operational cost on every connection, including the 99% that never use the bidirectional capability. Picking by traffic direction is the simpler, cheaper, more honest design. The right answer is not 'WebSocket' or 'SSE'; it's 'both, on different routes, chosen by traffic shape.'
If a hiring manager asks me how I think about real-time streaming for AI products, this is the answer. Not 'we use WebSocket because it's modern,' but 'here's the traffic direction, here's the protocol that fits it, here's the route layout that keeps memory and battery costs sane.' That's what production-grade streaming infrastructure looks like in 2026.
References
- ▸Vercel AI SDK 5: SSE as default useChat transport (ai-sdk.dev)
- ▸OpenAI: Realtime API with WebSocket / WebRTC for voice
- ▸Ably: 'WebSockets vs Server-Sent Events: Key differences and which to use'
- ▸WebSocket.org: WebSocket vs SSE comparison guide
- ▸germano.dev: 'Server-Sent Events: the alternative to WebSockets you should be using'
- ▸RxDB: 'WebSockets vs Server-Sent-Events vs Long-Polling vs WebRTC vs WebTransport'
- ▸Procedure.tech: 'The Streaming Backbone of LLMs: Why SSE Still Wins in 2026'