TANAY.SHAH
● PRODUCTION · 2025 · REAL-TIME DATA INFRASTRUCTURE · UPDATED 2026-05-10

MercuryStream.

A real-time market data pipeline that solves the problems exchanges warn you about.

// 01 — WHY I BUILT IT
THE PROBLEM

Three problems keep showing up in financial data engineering, and nobody talks about them until they bite you. First: feed formats change without notice. Cboe's PITCH spec explicitly reserves the right to add message types and grow message length without warning. Your decoder breaks, your pipeline goes silent, and you find out from angry traders, not from monitoring.

Second: "replay yesterday" is a real production requirement. KX's kdb+tick architecture uses tickerplant logs specifically for replay — recovering after failures, onboarding new subscribers, reproducing bugs. It is not optional in production finance, and most homegrown pipelines skip it.

Third: consumer lag is a silent killer. Confluent calls it out as one of the most common production Kafka issues — slow consumers back up queues, memory explodes, and you start processing stale prices. Most pipelines don't even surface this as a metric until things are already on fire.

// 02 — THE APPROACH
THE WORK

MercuryStream is built around three answers, one per pain point. Schema drift gets detected at the hot-path level via key/type validation; malformed events get sampled to a side channel so they don't block ingest. Replay is solved with a Flight Recorder: a 10-minute rolling in-memory window that dumps 5K events before + 3K events after any anomaly to disk — like an airplane black box, but for your data pipeline.
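To make the first answer concrete, here is a minimal sketch of a hot-path key/type check with malformed events sampled off to a side channel. The field names, expected types, and 1% sample rate are illustrative, not the production schema:

```python
import random

# Illustrative required schema: field name -> expected type (not the real feed schema).
REQUIRED_FIELDS = {"type": str, "product_id": str, "price": str, "sequence": int}
SIDE_CHANNEL_SAMPLE_RATE = 0.01  # keep ~1% of malformed events for inspection


def schema_violations(event: dict) -> list[str]:
    """Return key/type violations; an empty list means the event passes."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing key: {field}")
        elif not isinstance(event[field], expected):
            problems.append(f"{field}: expected {expected.__name__}, got {type(event[field]).__name__}")
    return problems


def admit(event: dict, side_channel: list, metrics: dict) -> bool:
    """Validate on the hot path; malformed events are counted and sampled, never block ingest."""
    problems = schema_violations(event)
    if not problems:
        return True
    metrics["schema_violations"] = metrics.get("schema_violations", 0) + 1
    if random.random() < SIDE_CHANNEL_SAMPLE_RATE:
        side_channel.append({"event": event, "violations": problems})
    return False
```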

Consumer lag is handled with explicit drop-oldest backpressure. When a consumer falls behind, old events get dropped to keep data fresh. Drops are counted and exposed as metrics, not silently swallowed. You always know how much you've lost.
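A minimal sketch of the drop-oldest idea on top of asyncio's bounded queue; the class name and interface are illustrative, not the production implementation:

```python
import asyncio


class DropOldestQueue:
    """Bounded queue that evicts the oldest item instead of blocking the producer."""

    def __init__(self, maxsize: int) -> None:
        if maxsize <= 0:
            raise ValueError("queue must have a hard bound")
        self._q: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.dropped = 0  # exposed as a metric, never silently swallowed

    def put(self, item) -> None:
        while True:
            try:
                self._q.put_nowait(item)
                return
            except asyncio.QueueFull:
                try:
                    self._q.get_nowait()  # evict the stalest event to keep data fresh
                    self.dropped += 1
                except asyncio.QueueEmpty:
                    pass  # a consumer raced us; retry the put

    async def get(self):
        return await self._q.get()
```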

The hot path runs at 57K events/sec sustained on a single connection with sub-2ms p99 latency end-to-end. Memory stays under 100MB per container because all queues are bounded.

// 03 — KEY DECISIONS
WHAT I CHOSE & WHY
DECISION · 01

Black-box recording, not continuous logging

Most pipelines record everything to disk 'just in case.' At 50K events/sec that's ~4GB/hour of JSON, plus disk I/O contention with the hot path, plus needle-in-haystack debugging. MercuryStream records nothing during normal operation — the ring buffer lives in memory (~5MB). A dump fires only when an anomaly trips. You get exactly the context to reproduce; no more, no less.
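A minimal sketch of the black-box mechanism, sized by event count rather than by time to keep it short; the file naming and the trip hook are illustrative:

```python
import json
import time
from collections import deque

PRE_EVENTS = 5_000   # context kept before the anomaly
POST_EVENTS = 3_000  # context captured after the anomaly


class FlightRecorder:
    """In-memory ring buffer; writes to disk only when an anomaly trips."""

    def __init__(self) -> None:
        self._ring: deque = deque(maxlen=PRE_EVENTS)  # rolling pre-anomaly window
        self._post_remaining = 0
        self._bundle: list = []

    def record(self, event: dict) -> None:
        self._ring.append(event)
        if self._post_remaining > 0:
            self._bundle.append(event)
            self._post_remaining -= 1
            if self._post_remaining == 0:
                self._dump()

    def trip(self) -> None:
        """Called by the detector: freeze the pre-context, start capturing post-context."""
        self._bundle = list(self._ring)
        self._post_remaining = POST_EVENTS

    def _dump(self) -> None:
        path = f"incident-{int(time.time())}.json"
        with open(path, "w") as fh:
            json.dump(self._bundle, fh)
        self._bundle = []
```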

DECISION · 02

Drop-oldest, not block-the-producer

When a consumer falls behind, the choice is: block ingest (data goes stale upstream), buffer indefinitely (OOM in 30 seconds), or drop. Drop-oldest keeps latency bounded and exposes the loss as a metric. Hidden drops are worse than measured ones.

DECISION · 03

Bounded everything

Every queue, every buffer, every cache has a hard upper bound enforced at construction time. Production financial data doesn't give you a neat, rare tail in queue depth; it gives you a log-normal distribution that spikes during volatility. Bounding is the only way to keep the pipeline alive when the market does something interesting.
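A small sketch of what construction-time bounds can look like; the names and numbers are placeholders, not the production limits:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Hard limits fixed at construction; every queue, buffer, and cache is built from these."""
    ingest_queue: int = 10_000
    detector_queue: int = 10_000
    dedup_window: int = 50_000
    ring_buffer: int = 5_000

    def __post_init__(self) -> None:
        for name, value in vars(self).items():
            if value <= 0:
                raise ValueError(f"{name} must be a positive hard bound")


BOUNDS = Bounds()  # a misconfigured bound fails at startup, not during a volatility spike
```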

// 04 — ARCHITECTURE
HOW IT FITS TOGETHER

Three-pipeline staged design: P1 ingester (one WebSocket connection, JSON decode, schema check) → P2 anomaly detector (sequence, dedup, latency, drift checks) → P3 flight recorder (rolling buffer, dump on trip). Each stage is bounded; backpressure is drop-oldest at every queue boundary.
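Roughly how the stages can be wired with structured concurrency; the stage bodies are stubs, and a plain bounded asyncio.Queue stands in for the drop-oldest boundary sketched earlier:

```python
import asyncio


async def ingest(out_q: asyncio.Queue) -> None:
    """P1: WS connect, JSON decode, schema check. Stubbed with a placeholder event."""
    await out_q.put({"type": "heartbeat"})


async def detect(in_q: asyncio.Queue) -> None:
    """P2: sequence / dedup / latency / drift checks; would trip the flight recorder (P3)."""
    _ = await in_q.get()


async def run_pipeline() -> None:
    q12 = asyncio.Queue(maxsize=10_000)    # bounded P1 -> P2 boundary
    async with asyncio.TaskGroup() as tg:  # structured concurrency: no orphaned tasks
        tg.create_task(ingest(q12))
        tg.create_task(detect(q12))


asyncio.run(run_pipeline())
```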

// FIG. SYSTEM DIAGRAM
MercuryStream: three-stage cryptocurrency anomaly detection pipeline.
COINBASE WS · market data feed · 57K events/sec
→ P1 · INGESTER · WS connect · JSON decode · schema check (key/type) · async/bounded
→ P2 · DETECTOR · sequence · dedup (LRU) · latency p99 · drift · emit anomalies
→ P3 · FLIGHT RECORDER · 10-min ring buffer (in-mem) · ~5 MB · zero disk I/O · dump on anomaly
↳ drop-oldest backpressure between every stage
⇒ INCIDENT BUNDLE · 5K pre + 3K post · self-contained · replayable
// 05 — STATE OF THE ART
2026 BLEEDING-EDGE TECH
Python 3.12 asyncio (TaskGroup + ExceptionGroup)

Modern structured-concurrency primitives — no more orphaned tasks, no swallowed exceptions across pipeline stages.
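A quick illustration of why this matters: if one stage fails, its siblings are cancelled and the failure surfaces as an ExceptionGroup you can catch with except*. The stage names here are made up:

```python
import asyncio


async def flaky_stage() -> None:
    raise RuntimeError("decoder hit an unknown message type")


async def healthy_stage() -> None:
    await asyncio.sleep(3600)  # cancelled automatically when its sibling fails


async def main() -> None:
    try:
        async with asyncio.TaskGroup() as tg:
            tg.create_task(flaky_stage())
            tg.create_task(healthy_stage())
    except* RuntimeError as group:  # nothing is silently swallowed
        for exc in group.exceptions:
            print(f"stage failed: {exc}")


asyncio.run(main())
```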

Drop-oldest backpressure (LMAX Disruptor lineage)

Bounded ring buffers with lock-free single-producer / multi-consumer semantics. Same pattern HFT shops have used for a decade, applied to retail-scale pipelines.

Black-box flight recorder

Conditional persistence — record nothing during normal operation, dump 8K-event window only on anomaly. Cuts storage by 99%+ vs. continuous logging.

// 06 — MEASURED
NUMBERS THAT MATTER
Throughput        57K events/sec    Sustained, single connection
p99 latency       <2 ms             End-to-end through pipeline
Memory            <100 MB           Per container, bounded queues
Incident bundle   8K events         5K pre + 3K post
// 07 — IF I DID IT AGAIN
LESSONS · WHAT I'D CHANGE
  • The Flight Recorder pattern generalizes: any time you have an event stream where 99% of events are boring and 1% are debug-worthy, in-memory ring buffer + on-anomaly dump is dramatically cheaper than continuous logging.
  • If I rewrote this, the dedup LRU would be tunable per asset class — crypto has different retransmit characteristics than equities (see the sketch after this list).
  • Schema drift detection should publish a versioned schema artifact, not just emit a warning. Right now drift gets caught but the diff is human-only.
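A sketch of what that per-asset-class tuning could look like; the capacities and asset-class names are illustrative, and the current code uses a single global window:

```python
from collections import OrderedDict


class DedupLRU:
    """Fixed-capacity seen-set: remembers the last `capacity` event ids."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("dedup window must be bounded")
        self._capacity = capacity
        self._seen: OrderedDict[str, None] = OrderedDict()

    def is_duplicate(self, event_id: str) -> bool:
        if event_id in self._seen:
            self._seen.move_to_end(event_id)  # refresh recency
            return True
        self._seen[event_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)    # evict the oldest id
        return False


# Hypothetical per-asset-class sizing, reflecting different retransmit behavior.
DEDUP_WINDOWS = {"crypto": DedupLRU(50_000), "equities": DedupLRU(10_000)}
```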
// 08 — STACK
THE TOOLS
LANGUAGE    Python 3.12 (asyncio)
TRANSPORT   WebSockets
STORE       MongoDB
RUNTIME     Docker (multi-stage)
DATA        Coinbase WS feed
DESIGN      Bounded ring buffers · LRU dedup