Building a Black-Box Flight Recorder for Streaming Anomalies
Logging every event in a high-volume streaming pipeline drowns the storage system and obscures the rare event you actually need. The flight-recorder pattern (rolling ring buffer, snapshot on trigger) trades retention for context: you keep the last N seconds in memory and dump them on anomaly. Here's the three-stage Kafka design I shipped, and the math behind 5K-events-before / 3K-events-after.
If you write a real-time pipeline that ingests a high-rate event stream (a market feed, a telemetry firehose, a SaaS event bus), the day comes when something weird happens at 3:14 PM and you spend the next two days trying to reconstruct what was happening at 3:14 PM. The naive answer is 'log everything, then grep.' At any reasonable streaming volume, this either drowns your storage or makes the rare event impossible to find inside the noise. The right answer is borrowed from a different industry: an aircraft black box. Roll a buffer of recent events, snapshot it on trigger, throw the rest away. This post is about what that pattern looks like for a Kafka-based streaming pipeline in 2026, and the design decisions that made the version I shipped actually useful.
Prior art: JFR, Go's flight recorder, bitdrift
The pattern is older than streaming pipelines. Java Flight Recorder (open-sourced into the JDK in Java 11) has offered a circular in-memory event buffer with on-demand snapshots for years. Go 1.25 added an equivalent for execution traces in the standard library. Bitdrift, the mobile observability company, built their entire product around the 'capture on anomaly' shape: a ring buffer at the edge, a control plane that decides when to upload it. Whenever a category of problem keeps being solved the same way across runtimes, languages, and product categories, that's prior art worth borrowing.
What none of these prior implementations cover is what changes when the buffer has to live across a distributed pipeline rather than inside a single process. That's the bit Mercury Stream had to solve.
The three-stage pipeline
┌────────────────────┐
│ Coinbase WS feed │ high-rate, can spike, drops messages,
│ (raw JSON frames) │ sends size-0 markers, drifts timestamps
└─────────┬──────────┘
│
▼
┌────────────────────┐ Kafka topic: raw_events
│ Stage 1: Ingester │ bounded queue, drop-oldest on overflow
│ (normalize + │ schema-drift detector samples violators
│ schema check) │
└─────────┬──────────┘
│
▼
┌────────────────────┐ Kafka topic: events_clean
│ Stage 2: Detector │ sliding-window outlier detection
│ (statistical │ over price + volume; emit ANOMALY events
│ outlier scan) │ to anomaly_triggers topic
└─────────┬──────────┘
│
▼
┌────────────────────┐ on each ANOMALY trigger:
│ Stage 3: Recorder │ fetch [t-5K events, t+3K events]
│ (ring buffer + │ write incident bundle to S3
│ incident dump) │ indexed by anomaly_id + timestamp
└────────────────────┘

Three stages instead of one for the same reason any pipeline gets staged: each stage has its own bounded queue, its own backpressure policy, and its own failure mode. A bug in the detector doesn't take down ingestion. A surge from the upstream feed doesn't compromise the recorder's snapshot integrity. The cost is two extra Kafka hops, which on a localhost broker is sub-millisecond and on a real cluster is a few milliseconds. Cheap.
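To make the staging concrete, here is a minimal sketch of the shape of one stage: consume from its input topic, process, produce to its output topic. It assumes the confluent-kafka Python client, a localhost broker, and the topic names from the diagram; the actual client, wiring, and normalize step in the shipped pipeline may differ.

```python
# Minimal sketch of one stage's consume -> process -> produce loop.
# Assumes the confluent-kafka client and a localhost broker; topic names
# follow the diagram, everything else is illustrative.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "stage1-ingester",
    "auto.offset.reset": "latest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["raw_events"])

def normalize(frame: dict) -> dict:
    # Stand-in for the normalize + schema-check step.
    return frame

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    event = normalize(json.loads(msg.value()))
    # Each stage owns its own output topic, so a failure here never
    # reaches back into the stage upstream of it.
    producer.produce("events_clean", json.dumps(event).encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks without blocking
```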
Drop-oldest, not block
Each stage's input queue uses a drop-oldest backpressure policy, not block. This is the choice most production teams get wrong on the first attempt. The intuition pulls toward 'block': we don't want to lose data. But for time-series streaming where the value of an event decays with age (a 30-second-old crypto price is useless for trading), block-on-backpressure produces the worst possible behavior under load: the consumer falls behind, the queue fills, the producer slows, the producer starts buffering upstream of itself, and now the entire pipeline is processing data that's already stale by the time it reaches the consumer.
Drop-oldest inverts that. When the queue is full, the oldest events are dropped. Drops are counted in a metric, not hidden. The freshness of what reaches the consumer is preserved. The stale events you would have processed under block-mode are explicitly the ones you don't want.
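Here is a minimal sketch of drop-oldest as an in-process bounded queue, using Python's collections.deque with maxlen (which evicts the oldest item on overflow). The class name and the metric wiring are illustrative; the point is that drops are counted, not hidden.

```python
# Sketch of a drop-oldest bounded queue between a consumer thread and a
# processing thread. deque(maxlen=N) evicts the oldest item on overflow;
# evictions are counted so drops show up in metrics. Names are illustrative.
import collections
import threading

class DropOldestQueue:
    def __init__(self, maxlen: int):
        self._buf = collections.deque(maxlen=maxlen)
        self._lock = threading.Lock()
        self.dropped = 0  # exported as a counter metric in practice

    def put(self, event) -> None:
        with self._lock:
            if len(self._buf) == self._buf.maxlen:
                self.dropped += 1  # oldest event is about to be evicted
            self._buf.append(event)

    def get(self):
        with self._lock:
            return self._buf.popleft() if self._buf else None
```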
The 5K-before / 3K-after replay window
The replay window is the design decision that does the most product work for the least architectural effort. The numbers (5,000 events before the anomaly, 3,000 events after) are not arbitrary. They came from three constraints I had to balance:
- Ring buffer memory budget. The recorder holds the last N events in memory at all times. Each event is a normalized JSON struct of about 400 bytes. 5K events is ~2 MB. 50K events is ~20 MB per recorder instance. Once you start carrying multiple gigabytes of in-memory state per pod, your operational cost goes up faster than your debugging value does.
- Time horizon. At a sustained 200 events per second on a busy market pair, 5K events is about 25 seconds of pre-anomaly context. 25 seconds is enough to see the upstream pattern that led to the anomaly without burying the analyst in 'normal' background data.
- Post-anomaly confirmation window. 3K events of post-anomaly tail is about 15 seconds at the same rate. 15 seconds is enough to see whether the anomaly was a one-off spike, a sustained shift, or the leading edge of a longer event. Less than that and you can't tell; more than that and you're paying memory for diminishing context value.
The exact 5K / 3K split is specific to the rate and economics of this particular pipeline. The general rule is: pick the pre-window long enough to see the cause, pick the post-window long enough to confirm the effect, then pick numbers small enough to fit in the ring buffer without crowding out throughput.
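Here is a sketch of how a ring buffer plus trigger logic can implement that window, using the 5K / 3K numbers from above. FlightRecorder, dump_bundle, and the trigger handling are illustrative names rather than the pipeline's actual interfaces; overlapping triggers each get their own pending post-window.

```python
# Sketch of the recorder's ring buffer and trigger logic, assuming the
# 5K-before / 3K-after window from the text. On a trigger, the current
# buffer contents become the pre-window; the recorder keeps collecting
# until the post-window fills, then hands the bundle off.
import collections

PRE_WINDOW = 5_000
POST_WINDOW = 3_000

class FlightRecorder:
    def __init__(self, dump_bundle):
        self._ring = collections.deque(maxlen=PRE_WINDOW)
        self._pending = []        # [(pre_events, post_events, trigger), ...]
        self._dump = dump_bundle  # e.g. writes the incident bundle to S3

    def on_event(self, event: dict) -> None:
        self._ring.append(event)
        still_open = []
        for pre, post, trigger in self._pending:
            post.append(event)
            if len(post) >= POST_WINDOW:
                self._dump(trigger, pre + post)  # up to 8K-event bundle
            else:
                still_open.append((pre, post, trigger))
        self._pending = still_open

    def on_trigger(self, trigger: dict) -> None:
        # Freeze a copy of the pre-anomaly window; keep filling the tail.
        self._pending.append((list(self._ring), [], trigger))
```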
Schema-drift detection: sample, don't block
The first version of the ingester rejected any event that failed the schema check. This is the wrong default for two reasons. First, the upstream is allowed to evolve the schema. Coinbase's docs explicitly state they may add new message types or grow message lengths without notice. Rejecting on schema mismatch means a feed update breaks the pipeline silently until someone notices the gap.
Second, you want to know what changed. The right design is: validate against the expected schema, accept on success, sample on failure. Sampling means: log the first N violators per minute (with the full raw event), pass every violator downstream as an 'unknown' event with a flag, and never block the pipeline on a schema check. The sampled violators become a queue an engineer can investigate without sitting at a terminal at 3 AM.
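A sketch of what sample-don't-block can look like in the ingester, under the assumption that the schema check itself runs elsewhere and hands this code a pass/fail verdict. DriftSampler, log_violator, and the per-minute budget are illustrative.

```python
# Sketch of validate-then-sample for schema drift: accept valid events,
# flag violators as "unknown", and keep the full raw payload for only
# the first few violators per minute. Names and the budget are illustrative.
import time

SAMPLES_PER_MINUTE = 10

def log_violator(raw: dict) -> None:
    print("schema violator sample:", raw)  # stand-in for structured logging

class DriftSampler:
    def __init__(self) -> None:
        self._window_start = 0.0
        self._sampled = 0

    def handle(self, raw: dict, is_valid: bool) -> dict:
        if is_valid:
            return raw
        now = time.monotonic()
        if now - self._window_start >= 60:
            self._window_start, self._sampled = now, 0
        if self._sampled < SAMPLES_PER_MINUTE:
            self._sampled += 1
            log_violator(raw)  # full raw event goes to the drift queue
        # Never block or reject: the event continues downstream, flagged.
        return {**raw, "schema_status": "unknown"}
```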
Storage and retention for incident bundles
Each anomaly snapshot lands in S3 as a single newline-delimited JSON file (events) plus a small JSON manifest (anomaly metadata, detector reasoning, time range, event count). S3 is overkill for the data volume but right for the access pattern: the bundles are written once, read at most a handful of times during incident analysis, and then archived. Bundles transition to Glacier after 30 days and are deleted after 90, in line with our compliance window.
The other thing the manifest carries is the detector's reasoning: which sliding-window statistic exceeded which threshold, computed against which baseline. Without that, the bundle is a pile of events; with it, the bundle is a hypothesis you can test against the events. This is the smallest engineering change with the highest debugging payoff: you get to show up to the incident already knowing what the system thought happened.
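For illustration, here is roughly what writing one incident bundle could look like: an NDJSON events file plus a manifest that records the detector's reasoning. The bucket name, key layout, field names, and the boto3 call are assumptions, not the pipeline's actual schema (which, as noted below, is still free-form today).

```python
# Sketch of one incident bundle: events.ndjson plus a manifest carrying
# the detector's reasoning. Bucket, key layout, field names, and the
# "ts" timestamp key are illustrative assumptions.
import json
import boto3

s3 = boto3.client("s3")

def write_bundle(anomaly: dict, events: list[dict], bucket: str = "incident-bundles") -> None:
    prefix = f"{anomaly['anomaly_id']}/{anomaly['triggered_at']}"
    manifest = {
        "anomaly_id": anomaly["anomaly_id"],
        "time_range": [events[0]["ts"], events[-1]["ts"]],
        "event_count": len(events),
        # The detector's reasoning: which statistic crossed which
        # threshold, computed against which baseline window.
        "detector": {
            "statistic": anomaly["statistic"],
            "observed": anomaly["observed"],
            "threshold": anomaly["threshold"],
            "baseline_window": anomaly["baseline_window"],
        },
    }
    ndjson = "\n".join(json.dumps(e) for e in events)
    s3.put_object(Bucket=bucket, Key=f"{prefix}/events.ndjson", Body=ndjson.encode("utf-8"))
    s3.put_object(Bucket=bucket, Key=f"{prefix}/manifest.json", Body=json.dumps(manifest).encode("utf-8"))
```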
What I would change
- Push the ring buffer down into the detector stage rather than keeping it in a separate recorder service. Crossing a Kafka topic to fetch the pre-anomaly context window adds latency that doesn't pay for itself, and the ordering guarantees you need (events around the trigger, in seq order) are easier to maintain in a single process. The original split was a vestige of an early design where the detector and recorder were on different teams.
- Add a 'soft trigger' tier alongside the hard ANOMALY trigger. A 'soft trigger' (say, a statistic exceeded but didn't cross the dump threshold) records a metadata-only summary, not a full event bundle. This gives you a much denser signal across normal-but-interesting market conditions without paying full S3 cost.
- Treat the manifest as a first-class schema with versioning. Today it's a free-form JSON object, and the detector evolves over time. Future-me will thank present-me for adding a manifest_version field and validating downstream consumers against it.
The bigger lesson
The temptation to over-engineer streaming observability is real. Most teams either capture nothing or capture everything; both are wrong for different reasons. Capture-on-anomaly says: the right unit of recording is the rare event, not the firehose, and the right context for the rare event is a bounded window on either side of it. That framing turns the storage problem from O(events ingested) into O(anomalies times window size), which is two or three orders of magnitude smaller, and it makes the bundles you keep actually findable.
If a hiring manager asks me how I think about observability for streaming systems, this is the design I describe. The pattern generalizes: anywhere you have a high-rate stream, a sparse class of interesting events, and a finite engineering team that has to debug them, the flight recorder pattern is the cheapest design that doesn't lie to you under load.
References
- Oracle: Java Flight Recorder (JFR) runtime guide and Streaming API
- Go 1.25: Flight Recorder addition to the standard library (go.dev/blog)
- bitdrift: 'time travel, science fiction no more' on ring-buffer mobile telemetry (bitdrift.io)
- Streamkap / Conduktor / RisingWave: backpressure strategies in streaming systems
- Coinbase Developer Docs: Advanced Trade WebSocket overview, sequence-gap handling
- CoinAPI.io: 'Why Real-Time Crypto Data Is Harder Than It Looks' (2026)
- Mercury Stream — case study (tanayshah.dev/projects/mercury-stream/)