From Days to Hours: Migrating a 20M-Record Wikipedia ML Pipeline From Sync to Async
The naive asyncio.gather rewrite gets you a 3x speedup and a 503 from the upstream API. The right pattern is bounded concurrency behind a semaphore, httpx, and a retry policy that respects the rate-limit headers the upstream actually returns. Here's the version that took ingestion from days to hours, and the four bottlenecks that turned out not to be I/O at all.
When I joined the lab at UMD in 2024 the data collection script for the project I inherited was a single Python file that iterated over Wikipedia's deletion-discussion archive and pulled comments one at a time with requests.get. Twenty-million-plus records eventually, full-corpus refresh on a quarterly cadence, total wall-clock time per refresh: about three days. Three days where the script could die at any moment, get rate-limited, hit a TLS hiccup, or just get killed by the lab's overnight reboot. I rewrote the collection layer to async over the next month. The refresh that took three days started taking under five hours. Here's what worked, what didn't, and which bottlenecks turned out to have nothing to do with I/O.
Why sync persists in ML data engineering
Most ML pipelines you'll inherit in 2026 are still synchronous, even ones that pull from rate-limited APIs and process millions of records. The reasons are reasonable in isolation: requests is mature and stable, the failure modes are well-understood, and most ML graduate students learn pandas and HTTP libraries before they learn asyncio. The cost is silent. You don't notice the pipeline is slow until you have to do a full refresh on a deadline.
The trap is that the team often jumps from 'this is too slow' to 'we need Spark / Kafka / Beam' without considering the middle answer. For 20M API calls to a single upstream, you don't need a distributed cluster. You need bounded-concurrency async over a single machine with a thoughtful retry policy. That's a Python-process-sized problem, and the asyncio standard library is the right tool for it.
The naive rewrite, and why it fails
The first instinct of anyone reading the asyncio docs is asyncio.gather over the full URL list. Don't do this. With 20M URLs, three things break:
- Memory: gather() materializes every coroutine object before the first request fires. At 20M coroutines you'll OOM the box before a single byte arrives.
- Upstream rate limits: gather() with no concurrency cap fires every coroutine the moment the loop services it. You'll see a few hundred milliseconds of fast responses, then a flood of 429s as the upstream's rate limiter kicks in.
- Connection saturation: httpx's default client has a 100-connection pool. 20M concurrent requests through a 100-connection pool isn't 'concurrent' in the sense you wanted; it's 100 requests at a time while millions of coroutines sit parked, waiting for a connection.
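The shape of the mistake, as a minimal sketch (fetch_one and urls are stand-ins for the real fetch logic and corpus index):

```python
import asyncio
import httpx

async def fetch_one(client: httpx.AsyncClient, url: str) -> bytes:
    resp = await client.get(url)
    resp.raise_for_status()
    return resp.content

async def main(urls: list[str]) -> None:
    async with httpx.AsyncClient() as client:
        # Unpacking the generator materializes 20M coroutine objects up
        # front (OOM), then fires them all with no cap (429 flood) while
        # the 100-connection pool serializes the actual I/O.
        results = await asyncio.gather(*(fetch_one(client, u) for u in urls))
```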
I wasted a week on this version. The benchmark looked great on a 1,000-URL test set. It melted on the full corpus.
The version that worked
┌──────────────────┐
│ URL generator │ ◄── lazy, never materialized
│ (yields URL str)│ (Python generator over the corpus index)
└────────┬─────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ bounded worker pool (asyncio.Semaphore(N)) │
│ ──── coroutines acquire token, fetch, release │
│ N tuned to upstream rate-limit budget │
└────────┬────────────────────────────────────────┘
│
▼
┌──────────────────┐ on 429: exponential backoff +
│ httpx.AsyncClient│ honor Retry-After header
│ (HTTP/2, pool=N)│ on 5xx: retry with jitter
└────────┬─────────┘ on 4xx: log + skip (not retried)
│
▼
┌──────────────────┐
│ asyncpg COPY │ ◄── batched 1000 rows per write
│ (Postgres bulk) │
└──────────────────┘

Five concrete details made this fast enough:
- Generator-based URL feed, not a list. The pipeline never materializes 20M URL strings in memory. The generator yields lazily; the worker pool consumes lazily; the database writer flushes in batches. Memory stays flat at a few hundred megabytes regardless of corpus size.
- asyncio.Semaphore(N) where N is tuned to the upstream's rate-limit budget, not the local CPU count. Wikipedia's API at the time accepted around 50 RPS sustained from a single user agent. I picked N=40 to leave headroom and never saw a 429 in production. Picking N from CPU count is the bug; picking N from upstream budget is the design.
- httpx with HTTP/2 enabled, single AsyncClient instance reused across the entire run. HTTP/1.1 allows one in-flight request per connection, so each connection serializes behind its slowest response; HTTP/2 multiplexes them. Same N concurrency, ~30% latency reduction in steady state. (A sketch of these first three pieces follows this list.)
- Retry policy that respects Retry-After. The most common bug I see in async scrapers in code review: catch the 429, sleep a fixed amount, retry. The right behavior is parsing Retry-After (or X-RateLimit-Reset on APIs that emit it), backing off the entire pool for that duration, then resuming. Pool-wide backoff prevents a thundering-herd retry that triggers another 429 cascade. (Also sketched below.)
- asyncpg COPY into Postgres in batches of 1000. The naive version inserts row-by-row with INSERT, which makes the database the bottleneck instead of the network. COPY (the bulk-load form) is roughly 100x faster for batch ingest and was, in my measurements, the difference between 'database keeps up with the network' and 'we wait for the database every batch.'
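Here's a minimal sketch of the core loop covering the first three details. The TaskGroup/semaphore pairing bounds both in-flight requests and pending tasks at N while consuming the URL generator lazily; iter_urls and the example.org URLs are hypothetical stand-ins, and http2=True assumes httpx was installed with its http2 extra:

```python
import asyncio
from collections.abc import Iterator

import httpx

CONCURRENCY = 40  # sized to the upstream rate budget, not local CPU count

def iter_urls() -> Iterator[str]:
    # Hypothetical stand-in for the corpus index: yields lazily, so 20M
    # URL strings are never resident at once.
    for page_id in range(20_000_000):
        yield f"https://example.org/api/page/{page_id}"

async def fetch_one(client: httpx.AsyncClient, url: str,
                    sem: asyncio.Semaphore) -> None:
    try:
        resp = await client.get(url)
        resp.raise_for_status()
        # ... parse and hand off to the batched DB writer ...
    finally:
        sem.release()

async def run() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    # One client for the whole run: connection reuse + HTTP/2 multiplexing.
    async with httpx.AsyncClient(http2=True) as client:
        async with asyncio.TaskGroup() as tg:  # Python 3.11+
            for url in iter_urls():
                await sem.acquire()  # blocks: at most N tasks alive at once
                tg.create_task(fetch_one(client, url, sem))

if __name__ == "__main__":
    asyncio.run(run())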
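And a condensed sketch of the retry policy from the diagram. RateLimitGate is my name for the pool-wide pause, not a library class: every worker awaits the gate before sending, and the first worker to see a 429 closes it for the server-requested duration:

```python
import asyncio
import email.utils
import random
import time

import httpx

class RateLimitGate:
    """Pool-wide pause: workers await the gate before each request;
    a 429 closes it for everyone until the backoff window passes."""
    def __init__(self) -> None:
        self._open = asyncio.Event()
        self._open.set()

    async def wait(self) -> None:
        await self._open.wait()

    async def pause_for(self, seconds: float) -> None:
        if self._open.is_set():  # first worker to see the 429 pauses the pool
            self._open.clear()
            await asyncio.sleep(seconds)
            self._open.set()

def retry_after_seconds(resp: httpx.Response, default: float = 5.0) -> float:
    raw = resp.headers.get("Retry-After")
    if raw is None:
        return default
    if raw.isdigit():  # delta-seconds form
        return float(raw)
    parsed = email.utils.parsedate_to_datetime(raw)  # HTTP-date form
    return max(0.0, parsed.timestamp() - time.time())

async def fetch_with_retry(client: httpx.AsyncClient, gate: RateLimitGate,
                           url: str, attempts: int = 5) -> httpx.Response | None:
    for attempt in range(attempts):
        await gate.wait()
        resp = await client.get(url)
        if resp.status_code == 429:          # honor Retry-After, pool-wide
            await gate.pause_for(retry_after_seconds(resp))
            continue
        if resp.status_code >= 500:          # transient: retry with jitter
            await asyncio.sleep(2 ** attempt + random.random())
            continue
        if resp.status_code >= 400:          # permanent 4xx: log + skip
            return None
        return resp
    return None
```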
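Finally, the batched COPY writer, assuming a hypothetical comments table with the columns shown: fetch workers push row tuples onto a queue, and a single writer coroutine flushes every 1,000 rows through asyncpg's copy_records_to_table (its binding for COPY):

```python
import asyncio
import asyncpg

BATCH = 1000
COLUMNS = ["page_id", "comment", "fetched_at"]  # hypothetical schema

async def writer(queue: asyncio.Queue, dsn: str) -> None:
    conn = await asyncpg.connect(dsn)
    batch: list[tuple] = []
    try:
        while True:
            row = await queue.get()   # row tuples from the fetch workers
            if row is None:           # shutdown sentinel: flush and stop
                break
            batch.append(row)
            if len(batch) >= BATCH:
                await conn.copy_records_to_table(
                    "comments", records=batch, columns=COLUMNS)
                batch.clear()
        if batch:                     # flush the final partial batch
            await conn.copy_records_to_table(
                "comments", records=batch, columns=COLUMNS)
    finally:
        await conn.close()
```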
The four bottlenecks that weren't I/O
I expected this to be a story about I/O concurrency. Some of it was. But the bottlenecks I actually had to chase were not network ones:
- JSON parsing in the hot path. The Wikipedia revision payloads are large, and Python's stdlib json is slow. Switching to orjson cut parse time by 5x, and the pipeline's wall clock by 18% on the full run. This was not a network optimization: the same bytes moved over the same connections, just decoded faster. (The swap is one line; see below.)
- DNS resolution under load. With 40 workers churning through short-lived connections, every new socket meant a fresh lookup, and the system resolver started throttling. The fix was to stop opening fresh sockets: aiohttp has a built-in DNS cache, and with httpx you get the same effect by raising the keepalive limits so pooled connections survive the whole run (sketched below). DNS lookups dropped from O(requests) to O(unique hosts).
- Logging overhead. The first version logged every request to a file. At pipeline throughput that's on the order of a thousand file writes per second, each serialized through the handler lock in Python's logging module. I moved to a queued log handler with a separate flusher, and the pipeline got 12% faster (the stdlib version is below).
- GC pressure from response bodies. httpx returns bytes objects, and at sustained throughput the allocator was thrashing. Calling response.aclose() promptly inside the worker (not relying on garbage collection) and processing each response into the database write before yielding control halved both retained memory and GC pause times.
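The orjson swap, for reference; it parses straight from the response bytes, skipping the str decode that resp.text would force:

```python
import httpx
import orjson

def parse_revision(resp: httpx.Response) -> dict:
    # orjson.loads accepts bytes directly; no intermediate str object.
    return orjson.loads(resp.content)
```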
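For the DNS fix, the httpx knobs are the client's pool limits; the specific values here are illustrative, not the ones I shipped:

```python
import httpx

# Keep every pooled connection alive for the run so each host resolves
# once, not once per request.
limits = httpx.Limits(max_connections=40,
                      max_keepalive_connections=40,
                      keepalive_expiry=60.0)
client = httpx.AsyncClient(http2=True, limits=limits)
```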
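And the logging fix. I used a flusher coroutine; the stdlib near-equivalent is QueueHandler plus QueueListener, which drains the queue on a background thread so the hot path never touches the file handler's lock:

```python
import logging
import logging.handlers
import queue

def setup_queued_logging(path: str) -> logging.handlers.QueueListener:
    q: queue.SimpleQueue = queue.SimpleQueue()
    file_handler = logging.FileHandler(path)
    listener = logging.handlers.QueueListener(q, file_handler)
    listener.start()
    # Workers now log to the in-memory queue: cheap and uncontended.
    logging.getLogger().addHandler(logging.handlers.QueueHandler(q))
    return listener  # call .stop() at shutdown to flush remaining records
```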
None of those would show up in a 100-URL benchmark. All of them dominated at production scale. That's worth internalizing about async data pipelines: the first 10x is concurrency; the next 2-3x is the non-network details that get exposed once concurrency is fixed.
What I would do differently if I rebuilt it for 100M+
- Producer-consumer with asyncio.Queue, not gather over a generator. At 20M records the generator approach is fine. At 100M, you want backpressure-aware queue consumers so a slow database writer can pause the URL fetchers without dropping work. The shape: a producer task fills a bounded asyncio.Queue, N consumer tasks drain it, and the queue size caps memory (sketched after this list).
- Process-level horizontal scaling with a coordination layer. Once one Python process is saturated on CPU (orjson + asyncpg + httpx is heavy), add a second process and shard the URL space deterministically (e.g., hash(url) % N). Coordinate via a small Redis key set so re-runs don't re-fetch already-completed records. This is the production answer; for 20M records on a deadline I never needed it.
- Treat the retry policy as a first-class component, not a wrapper around requests. Mine grew organically into a 200-line module that handles 429s, 503s, network drops, partial reads, and idempotency. If I rebuilt, I'd start with a structured retry library (tenacity, backoff) and only customize what genuinely needs custom behavior (a tenacity skeleton follows below).
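The backpressure shape, as a minimal sketch; handle is a hypothetical stand-in for the fetch/parse/write path:

```python
import asyncio
from collections.abc import Iterable

QUEUE_SIZE = 10_000   # caps memory; the producer blocks when consumers lag
WORKERS = 40

async def handle(url: str) -> None:
    ...  # hypothetical stand-in: fetch, parse, enqueue for the DB writer

async def producer(q: asyncio.Queue, urls: Iterable[str]) -> None:
    for url in urls:
        await q.put(url)      # suspends here whenever the queue is full
    for _ in range(WORKERS):
        await q.put(None)     # one shutdown sentinel per consumer

async def consumer(q: asyncio.Queue) -> None:
    while (url := await q.get()) is not None:
        await handle(url)

async def main(urls: Iterable[str]) -> None:
    q: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_SIZE)
    async with asyncio.TaskGroup() as tg:
        tg.create_task(producer(q, urls))
        for _ in range(WORKERS):
            tg.create_task(consumer(q))
```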
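And the structured-retry starting point I'd reach for, sketched with tenacity (its retry decorator works on coroutines). Note this naive predicate also retries 4xx, since raise_for_status raises HTTPStatusError for both; the 429/Retry-After gate above is the part that would stay custom:

```python
import httpx
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_random_exponential)

@retry(retry=retry_if_exception_type((httpx.TransportError,
                                      httpx.HTTPStatusError)),
       wait=wait_random_exponential(multiplier=1, max=60),  # capped jittered backoff
       stop=stop_after_attempt(6))
async def fetch(client: httpx.AsyncClient, url: str) -> httpx.Response:
    resp = await client.get(url)
    resp.raise_for_status()
    return resp
```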
What this enabled downstream
Faster ingestion was the obvious win. The less obvious win was iteration speed on the model. With sync ingestion taking three days, I'd iterate on the classifier (a fine-tuned Llama-3.2 with LoRA) on stale data, ship a checkpoint that looked fine in eval, then watch the next refresh discover a class shift the model hadn't seen and lose accuracy. With async ingestion taking five hours, I could refresh, retrain, and re-evaluate inside a single afternoon. The model went from 75% accuracy to 81% over six iteration cycles, none of which would have happened on the three-day cadence.
If a hiring manager asks me what 'data engineering for ML' looks like in practice, this is the project I describe. Not because async is special, but because the bottleneck shifted four times during the project (network → JSON → DNS → GC), and recognizing each shift was the work. The end result: 'days to hours' is the headline, but the actual lesson is that the first profile is wrong and you have to re-profile every time you fix something.
References
- Capital One Tech: "Async processing in Python for faster data pipelines"
- Towards Data Science: "Blazing Hot Python AsyncIO Pipelines" (2026)
- death.andgravity: "Limiting concurrency in Python asyncio: the story of async imap_unordered"
- rednafi.com: "Limit concurrency with semaphore in Python asyncio"
- DEV Community: practitioner case studies on asyncio scrapers shipped to production
- orjson: fast JSON parsing for Python (github.com/ijl/orjson)
- asyncpg: high-performance PostgreSQL driver for Python (github.com/MagicStack/asyncpg)