TANAY.SHAH
● PRODUCTION · 2025 · REAL-TIME DATA INFRASTRUCTURE · UPDATED 2026-05-10

MercuryStream.

A real-time market data pipeline that solves the problems exchanges warn you about.

// 01 — WHY I BUILT IT
THE PROBLEM

Three problems keep showing up in financial data engineering, and nobody talks about them until they bite you. First: feed formats change without notice. Cboe's PITCH spec explicitly reserves the right to add message types and grow message length without warning. Your decoder breaks, your pipeline goes silent, and you find out from angry traders, not from monitoring.

Second: "replay yesterday" is a real production requirement. KX's kdb+tick architecture uses tickerplant logs specifically for replay — recovering after failures, onboarding new subscribers, reproducing bugs. It is not optional in production finance, and most homegrown pipelines skip it.

Third: consumer lag is a silent killer. Confluent calls it out as one of the most common production Kafka issues — slow consumers back up queues, memory explodes, and you start processing stale prices. Most pipelines don't even surface this as a metric until things are already on fire.

// 02 — THE APPROACH
THE WORK

MercuryStream is built around three answers, one per pain point. Schema drift gets detected at the hot-path level via key/type validation; malformed events get sampled to a side channel so they don't block ingest. Replay is solved with a Flight Recorder: a 10-minute rolling in-memory window that dumps 5K events before + 3K events after any anomaly to disk — like an airplane black box, but for your data pipeline.
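To make the first answer concrete, here is a minimal sketch of a hot-path key/type check with malformed events sampled off to a side channel. The field names, expected types, and 1% sample rate are illustrative, not the production schema:

```python
import random

# Illustrative required schema: field name -> expected type (not the real feed schema).
REQUIRED_FIELDS = {"type": str, "product_id": str, "price": str, "sequence": int}
SIDE_CHANNEL_SAMPLE_RATE = 0.01  # keep ~1% of malformed events for inspection


def schema_violations(event: dict) -> list[str]:
    """Return key/type violations; an empty list means the event passes."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in event:
            problems.append(f"missing key: {field}")
        elif not isinstance(event[field], expected):
            problems.append(f"{field}: expected {expected.__name__}, got {type(event[field]).__name__}")
    return problems


def admit(event: dict, side_channel: list, metrics: dict) -> bool:
    """Validate on the hot path; malformed events are counted and sampled, never block ingest."""
    problems = schema_violations(event)
    if not problems:
        return True
    metrics["schema_violations"] = metrics.get("schema_violations", 0) + 1
    if random.random() < SIDE_CHANNEL_SAMPLE_RATE:
        side_channel.append({"event": event, "violations": problems})
    return False
```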

Consumer lag is handled with explicit drop-oldest backpressure. When a consumer falls behind, old events get dropped to keep data fresh. Drops are counted and exposed as metrics, not silently swallowed. You always know how much you've lost.
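A minimal sketch of the drop-oldest idea on top of asyncio's bounded queue; the class name and interface are illustrative, not the production implementation:

```python
import asyncio


class DropOldestQueue:
    """Bounded queue that evicts the oldest item instead of blocking the producer."""

    def __init__(self, maxsize: int) -> None:
        if maxsize <= 0:
            raise ValueError("queue must have a hard bound")
        self._q: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.dropped = 0  # exposed as a metric, never silently swallowed

    def put(self, item) -> None:
        while True:
            try:
                self._q.put_nowait(item)
                return
            except asyncio.QueueFull:
                try:
                    self._q.get_nowait()  # evict the stalest event to keep data fresh
                    self.dropped += 1
                except asyncio.QueueEmpty:
                    pass  # a consumer raced us; retry the put

    async def get(self):
        return await self._q.get()
```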

The hot path runs at 57K events/sec sustained on a single connection with sub-2ms p99 latency end-to-end. Memory stays under 100MB per container because all queues are bounded.

// 03 — KEY DECISIONS
WHAT I CHOSE & WHY
DECISION · 01

Black-box recording, not continuous logging

Most pipelines record everything to disk 'just in case.' At 50K events/sec that's ~4GB/hour of JSON, plus disk I/O contention with the hot path, plus needle-in-haystack debugging. MercuryStream records nothing during normal operation — the ring buffer lives in memory (~5MB). A dump fires only when an anomaly trips. You get exactly the context to reproduce; no more, no less.
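A minimal sketch of the black-box mechanism, sized by event count rather than by time to keep it short; the file naming and the trip hook are illustrative:

```python
import json
import time
from collections import deque

PRE_EVENTS = 5_000   # context kept before the anomaly
POST_EVENTS = 3_000  # context captured after the anomaly


class FlightRecorder:
    """In-memory ring buffer; writes to disk only when an anomaly trips."""

    def __init__(self) -> None:
        self._ring: deque = deque(maxlen=PRE_EVENTS)  # rolling pre-anomaly window
        self._post_remaining = 0
        self._bundle: list = []

    def record(self, event: dict) -> None:
        self._ring.append(event)
        if self._post_remaining > 0:
            self._bundle.append(event)
            self._post_remaining -= 1
            if self._post_remaining == 0:
                self._dump()

    def trip(self) -> None:
        """Called by the detector: freeze the pre-context, start capturing post-context."""
        self._bundle = list(self._ring)
        self._post_remaining = POST_EVENTS

    def _dump(self) -> None:
        path = f"incident-{int(time.time())}.json"
        with open(path, "w") as fh:
            json.dump(self._bundle, fh)
        self._bundle = []
```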

DECISION · 02

Drop-oldest, not block-the-producer

When a consumer falls behind, the choice is: block ingest (data goes stale upstream), buffer indefinitely (OOM in 30 seconds), or drop. Drop-oldest keeps latency bounded and exposes the loss as a metric. Hidden drops are worse than measured ones.

DECISION · 03

Bounded everything

Every queue, every buffer, every cache has a hard upper bound enforced at construction time. Production financial data doesn't give you a neat, rare tail in queue depth; it gives you a log-normal distribution that spikes during volatility. Bounding is the only way to keep the pipeline alive when the market does something interesting.
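A small sketch of what construction-time bounds can look like; the names and numbers are placeholders, not the production limits:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Hard limits fixed at construction; every queue, buffer, and cache is built from these."""
    ingest_queue: int = 10_000
    detector_queue: int = 10_000
    dedup_window: int = 50_000
    ring_buffer: int = 5_000

    def __post_init__(self) -> None:
        for name, value in vars(self).items():
            if value <= 0:
                raise ValueError(f"{name} must be a positive hard bound")


BOUNDS = Bounds()  # a misconfigured bound fails at startup, not during a volatility spike
```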

// 04 — ARCHITECTURE
HOW IT FITS TOGETHER

Three-pipeline staged design: P1 ingester (one WebSocket connection, JSON decode, schema check) → P2 anomaly detector (sequence, dedup, latency, drift checks) → P3 flight recorder (rolling buffer, dump on trip). Each stage is bounded; backpressure is drop-oldest at every queue boundary.
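Roughly how the stages can be wired with structured concurrency; the stage bodies are stubs, and a plain bounded asyncio.Queue stands in for the drop-oldest boundary sketched earlier:

```python
import asyncio


async def ingest(out_q: asyncio.Queue) -> None:
    """P1: WS connect, JSON decode, schema check. Stubbed with a placeholder event."""
    await out_q.put({"type": "heartbeat"})


async def detect(in_q: asyncio.Queue) -> None:
    """P2: sequence / dedup / latency / drift checks; would trip the flight recorder (P3)."""
    _ = await in_q.get()


async def run_pipeline() -> None:
    q12 = asyncio.Queue(maxsize=10_000)    # bounded P1 -> P2 boundary
    async with asyncio.TaskGroup() as tg:  # structured concurrency: no orphaned tasks
        tg.create_task(ingest(q12))
        tg.create_task(detect(q12))


asyncio.run(run_pipeline())
```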

// FIG. SYSTEM DIAGRAM
MercuryStream: three-stage cryptocurrency anomaly detection pipeline.
COINBASE WS · market data feed · 57K events/sec
→ P1 · INGESTER · WS connect · JSON decode · schema check (key/type) · async/bounded
→ P2 · DETECTOR · sequence · dedup (LRU) · latency p99 · drift · emit anomalies
→ P3 · FLIGHT RECORDER · 10-min ring buffer (in-mem) · ~5 MB · zero disk I/O · dump on anomaly
↳ drop-oldest backpressure between every stage
⇒ INCIDENT BUNDLE · 5K pre + 3K post · self-contained · replayable
// 05 — STATE OF THE ART
2026 BLEEDING-EDGE TECH
Python 3.12 asyncio (TaskGroup + ExceptionGroup)

Modern structured-concurrency primitives — no more orphaned tasks, no swallowed exceptions across pipeline stages.
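A quick illustration of why this matters: if one stage fails, its siblings are cancelled and the failure surfaces as an ExceptionGroup you can catch with except*. The stage names here are made up:

```python
import asyncio


async def flaky_stage() -> None:
    raise RuntimeError("decoder hit an unknown message type")


async def healthy_stage() -> None:
    await asyncio.sleep(3600)  # cancelled automatically when its sibling fails


async def main() -> None:
    try:
        async with asyncio.TaskGroup() as tg:
            tg.create_task(flaky_stage())
            tg.create_task(healthy_stage())
    except* RuntimeError as group:  # nothing is silently swallowed
        for exc in group.exceptions:
            print(f"stage failed: {exc}")


asyncio.run(main())
```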

Drop-oldest backpressure (LMAX Disruptor lineage)

Bounded ring buffers with lock-free single-producer / multi-consumer semantics. Same pattern HFT shops have used for a decade, applied to retail-scale pipelines.

Black-box flight recorder

Conditional persistence — record nothing during normal operation, dump 8K-event window only on anomaly. Cuts storage by 99%+ vs. continuous logging.

// 06 — MEASURED
NUMBERS THAT MATTER
Throughput        57K events/sec    Sustained, single connection
p99 latency       <2 ms             End-to-end through pipeline
Memory            <100 MB           Per container, bounded queues
Incident bundle   8K events         5K pre + 3K post
// 07 — IF I DID IT AGAIN
LESSONS · WHAT I'D CHANGE
  • The Flight Recorder pattern generalizes: any time you have an event stream where 99% of events are boring and 1% are debug-worthy, in-memory ring buffer + on-anomaly dump is dramatically cheaper than continuous logging.
  • If I rewrote this, the dedup LRU would be tunable per asset class — crypto has different retransmit characteristics than equities (see the sketch after this list).
  • Schema drift detection should publish a versioned schema artifact, not just emit a warning. Right now drift gets caught but the diff is human-only.
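A sketch of what that per-asset-class tuning could look like; the capacities and asset-class names are illustrative, and the current code uses a single global window:

```python
from collections import OrderedDict


class DedupLRU:
    """Fixed-capacity seen-set: remembers the last `capacity` event ids."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("dedup window must be bounded")
        self._capacity = capacity
        self._seen: OrderedDict[str, None] = OrderedDict()

    def is_duplicate(self, event_id: str) -> bool:
        if event_id in self._seen:
            self._seen.move_to_end(event_id)  # refresh recency
            return True
        self._seen[event_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)    # evict the oldest id
        return False


# Hypothetical per-asset-class sizing, reflecting different retransmit behavior.
DEDUP_WINDOWS = {"crypto": DedupLRU(50_000), "equities": DedupLRU(10_000)}
```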
// 08 — STACK
THE TOOLS
LANGUAGE    Python 3.12 (asyncio)
TRANSPORT   WebSockets
STORE       MongoDB
RUNTIME     Docker (multi-stage)
DATA        Coinbase WS feed
DESIGN      Bounded ring buffers · LRU dedup