MercuryStream.
A real-time market data pipeline that solves the problems exchanges warn you about.
Three problems kept showing up in financial data engineering that nobody talks about until they bite you. First: feed formats change without notice. Cboe's PITCH spec literally reserves the right to add message types and grow message length without warning. Your decoder breaks, your pipeline goes silent, and you find out from angry traders, not from monitoring.
Second: "replay yesterday" is a real production requirement. KX's kdb+tick architecture uses tickerplant logs specifically for replay — recovering after failures, onboarding new subscribers, reproducing bugs. It is not optional in production finance, and most homegrown pipelines skip it.
Third: consumer lag is a silent killer. Confluent calls it out as one of the most common production Kafka issues — slow consumers back up queues, memory explodes, and you start processing stale prices. Most pipelines don't even surface this as a metric until things are already on fire.
MercuryStream is built around three answers, one per pain point. Schema drift gets detected at the hot-path level via key/type validation; malformed events get sampled to a side channel so they don't block ingest. Replay is solved with a Flight Recorder: a 10-minute rolling in-memory window that dumps 5K events before + 3K events after any anomaly to disk — like an airplane black box, but for your data pipeline.
Consumer lag is handled with explicit drop-oldest backpressure. When a consumer falls behind, old events get dropped to keep data fresh. Drops are counted and exposed as metrics, not silently swallowed. You always know how much you've lost.
The hot path runs at 57K events/sec sustained on a single connection with sub-2ms p99 latency end-to-end. Memory stays under 100MB per container because all queues are bounded.
Black-box recording, not continuous logging
Most pipelines record everything to disk 'just in case.' At 50K events/sec that's ~4GB/hour of JSON, plus disk I/O contention with the hot path, plus needle-in-haystack debugging. MercuryStream records nothing during normal operation — the ring buffer lives in memory (~5MB). A dump fires only when an anomaly trips. You get exactly the context to reproduce; no more, no less.
Drop-oldest, not block-the-producer
When a consumer falls behind, the choice is: block ingest (data goes stale upstream), buffer indefinitely (OOM in 30 seconds), or drop. Drop-oldest keeps latency bounded and exposes the loss as a metric. Hidden drops are worse than measured ones.
Bounded everything
Every queue, every buffer, every cache has a hard upper bound enforced at construction time. Production financial data doesn't have a tail event in queue size — it has a log-normal distribution that spikes during volatility. Bounding is the only way to keep the pipeline alive when the market does something interesting.
Three-pipeline staged design: P1 ingester (one WebSocket connection, JSON decode, schema check) → P2 anomaly detector (sequence, dedup, latency, drift checks) → P3 flight recorder (rolling buffer, dump on trip). Each stage is bounded; backpressure is drop-oldest at every queue boundary.
Modern structured-concurrency primitives — no more orphaned tasks, no swallowed exceptions across pipeline stages.
Bounded ring buffers with lock-free single-producer / multi-consumer semantics. Same pattern HFT shops have used for a decade, applied to retail-scale pipelines.
Conditional persistence — record nothing during normal operation, dump 8K-event window only on anomaly. Cuts storage by 99%+ vs. continuous logging.
- →The Flight Recorder pattern generalizes: any time you have an event stream where 99% of events are boring and 1% are debug-worthy, in-memory ring buffer + on-anomaly dump is dramatically cheaper than continuous logging.
- →If I rewrote this, the dedup LRU would be tunable per asset class — crypto has different retransmit characteristics than equities.
- →Schema drift detection should publish a versioned schema artifact, not just emit a warning. Right now drift gets caught but the diff is human-only.