TANAY.SHAH
// PUBLISHED 2026-05-10 · 8 MIN READ

Building a Sub-2-Second Sales Coach: Two-Path Architecture for Real-Time Conversation AI

Live conversation coaching has a sub-2-second latency budget. Post-call analysis benefits from 30 seconds of frontier-model reasoning. The same agent can't satisfy both. The pattern that worked is two parallel paths feeding off the same diarized stream: a fast path on Llama 3.3 + Groq for in-the-moment nudges, a slow path on Claude Opus / Gemini 2.5 for the manager's ride-along artifact.

Conversation intelligence in 2026 splits cleanly into two products that look like one. There's the live-coaching product (a rep on a sales call gets a nudge in their earbud before they miss an objection) and there's the post-call product (a manager opens an artifact the next morning that summarizes the call, scores it against a rubric, and identifies coaching moments). Most published architectures treat these as variations of the same pipeline. They're not. The latency budgets are off by an order of magnitude, and a single agent loop optimized for one is the wrong shape for the other. The pattern that worked when I shipped this kind of system is two parallel paths off a shared diarized stream. Here's the design and the math.

Why one path can't satisfy both

Voice AI research consistently finds the same human latency thresholds: under 500-700ms feels like a conversation, over 1 second feels like talking to a computer, and over 2 seconds the human starts talking over the agent and abandons the interaction. For live coaching that nudges a rep in real time, the budget is bounded by 'the moment already passed' on one side and 'the rep still had time to recover before saying the wrong thing' on the other. Sub-2 seconds is the practical ceiling; sub-1 second is where it actually feels useful.

Frontier models (Claude Opus 4.7, Gemini 2.5 Pro) do not produce a usable answer in under 2 seconds under any realistic load; substantive reasoning over a meaningful chunk of context takes 5-15 seconds. Their answer quality, however, is what makes the post-call artifact worth opening; a smaller model's summary is technically a summary, but it isn't the artifact a sales manager will actually read.

Forcing one model to do both is the bug. Either the live path is slow (rep ignores the nudges because they arrived after the moment) or the post-call artifact is shallow (manager ignores the report because it reads like an Outlook email summary). The two-path design admits this and uses the right model for each budget.

The architecture

┌──────────────────────────────────────────────────┐
│  rep's phone (mobile capture)                    │
│  WebSocket audio stream → cloud                  │
└────────────────────┬─────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────────────┐
│  Deepgram Nova-3 streaming diarization           │
│   - sub-300ms partial transcripts                │
│   - speaker labels per word                      │
│   - timestamped utterances                       │
└────────────┬──────────────────────────┬──────────┘
             │                          │
             ▼                          ▼
┌─────────────────────┐    ┌──────────────────────┐
│  PATH A: live coach │    │  PATH B: post-call   │
│  budget: < 2s       │    │  budget: < 30s       │
│                     │    │                      │
│  Llama 3.3 70B on   │    │  Claude Opus 4.7 OR  │
│  Groq (~0.26s avg)  │    │  Gemini 2.5 Pro      │
│  ~10-moment-type    │    │  per-tenant rubric   │
│  taxonomy classifier│    │  multi-step reasoning│
│                     │    │                      │
│  → earbud nudge     │    │  → ride-along        │
│    (text or haptic) │    │    artifact for mgr  │
└──────────┬──────────┘    └──────────┬───────────┘
           │                          │
           └──────────┬───────────────┘
                      │
                      ▼
       ┌────────────────────────────────┐
       │  Postgres + pgvector           │
       │   - diarized utterances (raw)  │
       │   - moment-type tags (Path A)  │
       │   - artifact records (Path B)  │
       │   - embeddings (both)          │
       │   - every claim grounded to    │
       │     {start_ts, end_ts,         │
       │      speaker_id}               │
       └────────────────────────────────┘
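
Read as code rather than boxes, the shared-stream fan-out is small. Below is a minimal asyncio sketch under assumed interfaces: one Utterance per diarized partial, a queue as the transport, and classify_moment / build_artifact as stand-ins for the two paths (none of these are the production names):

import asyncio
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker_id: str     # assigned by the diarizer
    text: str
    start_ts: float     # seconds from call start
    end_ts: float


async def classify_moment(window_text: str) -> None:
    """Path A entry point; stubbed here, sketched in the Path A section below."""


async def build_artifact(transcript_text: str) -> None:
    """Path B entry point; stubbed here, sketched in the Path B section below."""


def render(utts: list[Utterance]) -> str:
    """Flatten utterances into the 'speaker: text [start-end]' lines the models see."""
    return "\n".join(f"{u.speaker_id}: {u.text} [{u.start_ts:.1f}-{u.end_ts:.1f}]" for u in utts)


async def fan_out(utterances: "asyncio.Queue[Utterance | None]") -> None:
    """Consume the shared diarized stream and feed both paths from it."""
    window: list[Utterance] = []       # rolling ~60s context for the live classifier
    transcript: list[Utterance] = []   # the full call, for the post-call artifact
    while (utt := await utterances.get()) is not None:   # None = end-of-call sentinel
        transcript.append(utt)
        window = [u for u in window if utt.end_ts - u.start_ts <= 60.0] + [utt]
        # Path A is fire-and-forget, so nothing slower than itself can block it.
        asyncio.create_task(classify_moment(render(window)))
    # Path B runs once, after the call ends, over the full transcript.
    await build_artifact(render(transcript))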

Path A: the live coach

Path A is a small, constrained, fast classifier. The model (Llama 3.3 70B on Groq) is given the rolling diarized transcript window (the last 30-60 seconds) and a fixed taxonomy of about 10 moment types: 'objection raised,' 'price asked,' 'competitor mentioned,' 'commitment offered,' 'silence too long,' 'monologue too long,' and so on. The output is a single classification (or 'nothing right now') plus a one-sentence nudge. End-to-end latency is 200-700ms after the diarization partial arrives, which puts the full round trip from 'rep finished saying the trigger phrase' to 'haptic on the rep's watch' in the 800ms-1.5s range, inside the sub-2-second budget.

The constrained taxonomy is the design choice that makes this work. Asking 'what should the rep do right now' open-ended would force a frontier model and blow the latency budget. Asking 'is this a moment of type X / Y / Z, and which one' is a classification task a 70B model on Groq can answer reliably in under a second. The delivery surface (earbud, watch haptic) further constrains the output: the rep doesn't have time to read a paragraph; the nudge has to fit in 6-12 words.
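
Here is a minimal sketch of that constrained call, assuming Groq's Python SDK and its OpenAI-style chat interface; the taxonomy, prompt, and model id are illustrative, not the production config:

import json
from groq import Groq   # assumes the groq Python SDK is installed

# Illustrative taxonomy; the production list is a deployment config, not code.
MOMENT_TYPES = [
    "objection_raised", "price_asked", "competitor_mentioned",
    "commitment_offered", "silence_too_long", "monologue_too_long",
    "none",
]

SYSTEM_PROMPT = (
    "You watch the last 30-60 seconds of a diarized sales call. "
    "Reply with JSON containing 'moment' (one of: " + ", ".join(MOMENT_TYPES) + ") "
    "and 'nudge' (a coaching nudge of at most 12 words, or null)."
)

client = Groq()   # reads GROQ_API_KEY from the environment

def classify_moment(window_text: str) -> dict:
    """One fast, constrained classification over the rolling transcript window."""
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",   # Groq model id at time of writing; check the catalog
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": window_text},
        ],
        temperature=0.0,   # classification, not generation
        max_tokens=60,     # the nudge has to fit in an earbud
    )
    return json.loads(resp.choices[0].message.content)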

Path B: the post-call deep dive

Path B is the slow, careful, per-tenant analysis that produces the artifact a manager actually reads. The model (Claude Opus 4.7 or Gemini 2.5 Pro, per tenant config) gets the full diarized transcript, the per-tenant rubric (what does this customer's coaching philosophy define as good behavior, what counts as risky, what's the framework for scoring), and a longer reasoning budget. The output is a structured artifact with per-section scores, citations, and recommended coaching moments.

The per-tenant rubric is the multiplier here. A generic 'how was the call' summary has ChatGPT-template energy. A summary written against the customer's actual sales playbook ('we coach toward MEDDIC; did the rep cover Decision Criteria? Did they identify the Economic Buyer?') is the kind of artifact a manager will reference in the next 1:1. The same model, given the same prompt minus the rubric, produces a dramatically less useful artifact. Spend the engineering on the rubric, not on the model.
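
A minimal sketch of the slow path, assuming the Anthropic Python SDK; the model id, the rubric string, and the artifact keys are placeholders for what the per-tenant config actually supplies:

import json
from anthropic import Anthropic   # assumes the anthropic Python SDK is installed

client = Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def build_artifact(transcript: str, rubric: str) -> dict:
    """Slow-path analysis: full diarized transcript + per-tenant rubric -> structured artifact."""
    system = (
        "You are a sales-call reviewer. Score the call against the rubric below. "
        "Ground every claim in the transcript with start_ts, end_ts, and speaker_id. "
        "Return JSON with keys: summary, scores, coaching_moments.\n\n"
        "RUBRIC:\n" + rubric
    )
    resp = client.messages.create(
        model="claude-opus-4-7",   # illustrative id; the actual model is per-tenant config
        max_tokens=4000,
        system=system,
        messages=[{"role": "user", "content": transcript}],
    )
    return json.loads(resp.content[0].text)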

Grounding: every claim has a timestamp and a speaker

The non-negotiable design constraint, in both paths, is that every claim the system makes has to be grounded in the transcript. 'The rep did not address the customer's objection at 14:32:08' is useful. 'The rep did not handle objections well' is not. The diarized transcript provides the start_ts, end_ts, and speaker_id for every utterance; the schema for both Path A's moment classifications and Path B's artifact references requires those fields. The downstream UI uses them to make every claim clickable: 'see the moment' jumps the audio playback to the right second.
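
One way to enforce that constraint is at the schema layer, so an ungrounded claim can't be persisted at all. A minimal Pydantic sketch; the class and field names beyond start_ts / end_ts / speaker_id are assumptions, not the production schema:

from pydantic import BaseModel, Field   # any schema layer with required fields works here


class GroundedClaim(BaseModel):
    """Every claim either path emits must carry the span it came from."""
    start_ts: float = Field(ge=0)   # seconds from call start
    end_ts: float = Field(ge=0)
    speaker_id: str                 # e.g. "rep" / "customer", assigned by the diarizer


class MomentTag(GroundedClaim):
    """One Path A classification row."""
    moment_type: str
    nudge: str | None = None


class ArtifactClaim(GroundedClaim):
    """One claim inside the Path B artifact; the UI makes it clickable."""
    section: str                    # e.g. "objection_handling"
    text: str
    score: float | None = None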

Without grounding, the artifact is a vibes-based hot take that managers learn to ignore. With grounding, the artifact is a tool that holds up under scrutiny because every claim is reproducible from the source. This is the one piece of the design I would not negotiate on if I rebuilt it.

How this differs from Gong / Chorus / Cresta / Balto

The incumbent conversation-intelligence platforms (Gong, Chorus, Outreach, Cresta) historically anchored on post-call analysis with real-time as a newer feature. Balto is the explicit real-time-first competitor. Most of them now have both paths; the difference is in execution:

  • Most platforms run a single transcription pipeline and tap off it for both products. Same diarization layer feeding both. This is a sound choice.
  • Most platforms use a single LLM tier for the live nudges. The choice of tier (small fast vs. medium accurate) determines whether the live experience feels useful. Cresta and Balto have publicly described their tradeoffs; the small-fast path with a constrained taxonomy is the consensus.
  • Most platforms do not ground claims to timestamps as an enforced schema constraint. They ground when convenient. The product difference is large: a coaching artifact where every claim is clickable feels real; one where claims float is rapidly distrusted.
  • The two-vendor model strategy (Llama on Groq for live, Anthropic / Google for post-call) is less common than 'one vendor for everything' and is the place a small team can outperform an incumbent that's locked into a single vendor stack.

What I would change if I rebuilt the pipeline

  • Move the moment-type taxonomy to a per-tenant config rather than a global one. The 10 default types cover most workflows; a customer with an unusual sales motion (consultative deep-tech, transactional retail, regulated healthcare) will have a different list. Per-tenant lists need a fine-tune of the live classifier, which is non-trivial but not exotic. A sketch of what that config could look like follows this list.
  • Add a 'soft trigger' tier on Path A that records moments without dispatching a nudge to the rep. Today either we nudge or we drop it. A 'noted, not delivered' tier gives the post-call artifact richer signal without paying the rep's attention budget.
  • Move the artifact rendering into a template DSL the customer can edit. Today the schema is fixed; a manager who wants 'show me only the objection-handling moments' has to file a feature request. A small templating layer would push this control down to the customer.
  • Add per-claim confidence scores to the post-call artifact. The frontier model knows when it's reaching versus when it's confident; surfacing that to the manager prevents over-trusting borderline claims.
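
A sketch of what the per-tenant config from the first two items could look like; every name and shape here is hypothetical, and the point is only that the taxonomy and the trigger tiers become per-tenant data rather than code:

from dataclasses import dataclass, field


@dataclass
class MomentType:
    name: str
    tier: str = "nudge"   # "nudge" = deliver to the rep; "note" = record for Path B only


@dataclass
class TenantCoachingConfig:
    tenant_id: str
    rubric: str                                   # Path B scoring framework, e.g. MEDDIC
    moment_types: list[MomentType] = field(default_factory=list)


# Example: a regulated-healthcare tenant swaps in its own moment type and
# demotes "monologue_too_long" to the noted-not-delivered tier.
acme_health = TenantCoachingConfig(
    tenant_id="acme-health",
    rubric="MEDDIC plus compliance-disclosure checkpoints",
    moment_types=[
        MomentType("objection_raised"),
        MomentType("compliance_disclosure_missed"),
        MomentType("monologue_too_long", tier="note"),
    ],
)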

The bigger lesson

Real-time + post-call is a multi-vendor, multi-budget design problem disguised as a single-product question. The temptation to use one model and one path produces a product that's mediocre at both. Splitting the paths and matching the model tier to the latency budget is a small architectural choice that produces a much better felt experience on both surfaces. The work is in the diarization layer, the moment-type taxonomy, the per-tenant rubric, and the grounding schema. The model choice is the easy part once you've done the work; before that, no model does the job.

If a hiring manager asks me how I think about real-time conversation AI in 2026, this is the framing. Not 'we use Deepgram and Claude,' but 'here's the latency split, here's why two paths is the cheapest design that produces a usable result on both surfaces, here's where the engineering discipline (grounding, taxonomy, rubric) actually lives.' That's what production-grade conversation AI looks like.

References

  • Deepgram: Nova-3 streaming diarization, sub-300ms latency, 441x real-time (developers.deepgram.com)
  • Groq: Llama 3.3 70B at 0.26s avg latency, 276 tokens/sec (groq.com)
  • Anthropic: Claude Opus 4.7 (platform.claude.com)
  • Google: Gemini 2.5 Pro long-context reasoning
  • Cresta blog: 'Engineering for Real-Time Voice Agent Latency'
  • Sierra: 'Engineering low-latency voice agents'
  • Twilio: 'Core Latency in AI Voice Agents'
  • Real-Time Sales Coaching Agent — case study (tanayshah.dev/projects/real-time-sales-coaching-agent/)