Real-Time Sales-Conversation Coaching Agent
Architecture sketch for an agent that listens to in-person field sales conversations and delivers grounded coaching — live during the call and deep after it.
Speech analytics for field sales is one of the most under-tooled corners of the AI agent market. Phone-based Gong / Chorus / Wingman dominate the SDR space because the audio path is clean (single channel, lossless, server-side). In-person sales — the HVAC quote in someone's basement, the roofing pitch on a contractor's tablet, the solar consult in a kitchen — has none of those luxuries. The audio is recorded on a phone, the speakers physically interrupt each other, the room reverbs, and the manager isn't there to ride along.
The product opportunity is real. A field-sales rep doing 8 in-home consults a day burns through ~6 hours of conversation that nobody reviews. Their manager can ride along on maybe one a week. The performance variance between top and median reps is enormous and almost entirely conversational: what discovery questions they asked, when they introduced price, how they handled the price-objection moment. Closing this loop with an AI coach could plausibly be worth a 30-40% close-rate lift for the median rep, comparable in business impact to a direct headcount hire.
The interesting architectural question isn't whether to do it — it's how. A coaching agent that just transcribes is a transcription product, not a coaching product. A coaching agent that runs the entire post-call analysis pipeline live blows the latency budget. The right design splits the loop into two paths with different latency / cost / capability tradeoffs.
Two-path architecture. Path A is the live coach: sub-2-second feedback during the conversation, surfaced to the rep via an unobtrusive earbud or watch haptic. Its job is narrow: detect a small set of high-leverage moments (rep talked too long without asking a question; customer raised a price objection that wasn't addressed; rep skipped the warranty section) and nudge. The model has to be fast; lossy output is acceptable.
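To make "nudge, don't narrate" concrete, here is a minimal sketch of Path A's output gate, assuming the moment detector emits a label plus a confidence score. The class, thresholds, and cooldown are illustrative, not a real implementation:

```python
import time
from dataclasses import dataclass, field


@dataclass
class NudgeGate:
    """Hypothetical gate between the live detector and the rep's earbud/watch.

    Fires only when the detector is confident AND the rep hasn't been nudged
    recently: precision over recall, because interruptions are expensive.
    """
    min_confidence: float = 0.85   # below this, stay silent
    cooldown_s: float = 45.0       # at most one nudge per ~45s of conversation
    _last_fire: float = field(default=0.0, repr=False)

    def should_fire(self, moment_type: str, confidence: float) -> bool:
        # moment_type could key per-type cooldowns; a single global one here.
        now = time.monotonic()
        if confidence < self.min_confidence:
            return False
        if now - self._last_fire < self.cooldown_s:
            return False
        self._last_fire = now
        return True


gate = NudgeGate()
if gate.should_fire("price_objection_unaddressed", confidence=0.91):
    ...  # push the haptic / earbud cue to the rep's device
```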
Path B is the post-call deep dive. It runs after the conversation ends, takes 30-60 seconds, and produces the manager-facing artifact: full transcript with speaker labels, scored against a per-tenant rubric, with every coaching point grounded to a specific timestamped utterance. This is where the long-context model earns its keep. It's also where the manager's virtual-ridealong workflow lives: they read the artifact, drop comments, and surface the call to other reps as a teaching example.
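A sketch of the artifact's structured shape, assuming Pydantic-style models. The field names are guesses; the invariant that every coaching point carries a timestamped, speaker-attributed citation comes straight from the design, and it's what the front-end and the audit loop both hang off:

```python
from pydantic import BaseModel


class Citation(BaseModel):
    start_ts: float   # seconds from call start
    end_ts: float
    speaker_id: str   # as assigned by the diarizer, e.g. "rep" / "customer"
    quote: str        # verbatim utterance text


class CoachingPoint(BaseModel):
    rubric_step: str            # which playbook step this scores against
    score: int                  # per the tenant's scoring weights
    comment: str                # the actual coaching language
    citations: list[Citation]   # must be non-empty; enforced before publish


class CallArtifact(BaseModel):
    call_id: str
    rubric_version: str         # pin which rubric version scored this call
    points: list[CoachingPoint]
```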
Both paths share the same upstream: a streaming multi-speaker diarization layer (Deepgram Nova-3 or AssemblyAI Universal-2 are the 2026 gold standard for in-person audio) feeding a rolling conversation context window. The diarizer publishes utterances tagged with speaker ID and timestamp. Path A subscribes to the stream and runs a small, latency-tuned reasoning model (think Llama 3.3 70B on Groq for sub-100ms first-token, or Claude Haiku for slightly higher cost / better reasoning). Path B waits for the call-ended event and runs the full conversation through Claude Opus or Gemini 2.5 Pro with the per-tenant rubric in context.
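The upstream contract both paths share might look like this: a minimal sketch assuming the diarizer publishes one event per utterance with a confidence score attached (all names are illustrative):

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker_id: str
    text: str
    start_ts: float
    end_ts: float
    diarization_confidence: float  # surfaced so consumers can bail out


class RollingWindow:
    """Hot-path view for Path A: only the last N utterances ever exist here."""

    def __init__(self, max_utterances: int = 30):
        self._buf: deque[Utterance] = deque(maxlen=max_utterances)

    def push(self, u: Utterance) -> None:
        self._buf.append(u)

    def as_prompt_context(self) -> str:
        # Compact transcript slice the live model sees on every ~200ms tick.
        return "\n".join(
            f"[{u.speaker_id} @ {u.start_ts:.1f}s] {u.text}" for u in self._buf
        )
```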
Multi-speaker diarization is the load-bearing layer
Every coaching point, live or post-call, has to anchor to a specific utterance from a specific speaker at a specific timestamp. If the diarizer attributes the customer's price objection to the rep, the coaching is wrong and the rep loses trust in the system. In-person audio with reverb, interruptions, and phone-mic compression makes this 10x harder than phone audio. The right move is to over-invest here: pay for the best speech-to-text vendor, run a second-pass re-alignment with a smaller acoustic model, and surface confidence scores so downstream models can bail out when diarization quality drops below threshold.
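A sketch of that bail-out rule, assuming per-utterance confidence scores from the diarizer; both thresholds are placeholders to tune against labeled audio:

```python
# Placeholder thresholds, to be tuned against labeled audio.
BAIL_OUT_MEAN_CONF = 0.70   # below this, live coaching goes silent
REALIGN_MEAN_CONF = 0.85    # below this, queue second-pass re-alignment


def diarization_health(recent_confidences: list[float]) -> str:
    """Degrade gracefully: a wrong speaker attribution is worse than silence."""
    if not recent_confidences:
        return "ok"
    mean_conf = sum(recent_confidences) / len(recent_confidences)
    if mean_conf < BAIL_OUT_MEAN_CONF:
        return "suppress_live_coaching"   # Path A stops nudging entirely
    if mean_conf < REALIGN_MEAN_CONF:
        return "queue_realignment"        # Path B re-aligns before scoring
    return "ok"
```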
Per-tenant rubric, not a global one
Every contractor's sales playbook is different. A roofing company cares about the inspection-walk step. A solar company cares about the financing-options step. A pest-control company cares about the contract-length step. A global rubric averaged across customers becomes useless within months as it drifts toward the lowest common denominator. The right design is per-tenant rubric configuration — the tenant uploads their playbook, the system extracts a structured rubric (steps, expected utterances, scoring weights), and the post-call model scores against THAT rubric. Rubric versioning is a first-class concern; you'll change them.
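The extracted rubric could be shaped like this (a sketch; the field names are assumptions, but the pieces it carries, steps, expected utterances, scoring weights, and a first-class version field, are the ones named above):

```python
from pydantic import BaseModel


class RubricStep(BaseModel):
    step_id: str                    # e.g. "inspection_walk", "financing_options"
    description: str
    expected_utterances: list[str]  # phrasings that count as covering the step
    weight: float                   # contribution to the overall call score


class TenantRubric(BaseModel):
    tenant_id: str
    version: str          # bump on every edit; every artifact pins a version
    steps: list[RubricStep]
```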
Live coach uses a constrained taxonomy; post-call coach is open-ended
This is the hardest tradeoff in the design. You can let the LLM freely pick coaching moments — high recall, high cost, high false-positive rate, occasional brilliant insights. Or you can constrain to a fixed taxonomy of ~10 high-leverage moment types — high precision, lower recall, very fast, predictable. Live coaching wants the constrained version (the rep's working memory budget is small; you must not interrupt them with a hallucinated suggestion). Post-call coaching wants the open-ended version (the manager has time to filter false positives; the value is in catching the unusual moment a fixed taxonomy would miss). Two models, two prompts, two evaluation harnesses.
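For the live path, the constrained taxonomy can literally be an enum with an explicit abstain option; forcing the model to choose from a closed set (or say "none") is what makes the false-positive rate predictable. A sketch: the text above names three of these moment types, the rest are illustrative:

```python
from enum import Enum


class LiveMoment(str, Enum):
    MONOLOGUE_TOO_LONG = "monologue_too_long"              # no question asked for too long
    PRICE_OBJECTION_UNADDRESSED = "price_objection_unaddressed"
    SKIPPED_REQUIRED_STEP = "skipped_required_step"        # e.g. the warranty section
    NO_DISCOVERY_QUESTION = "no_discovery_question"        # illustrative
    TALK_RATIO_IMBALANCE = "talk_ratio_imbalance"          # illustrative
    NONE = "none"                                          # explicit abstain option


LIVE_SYSTEM_PROMPT = (
    "You are a silent sales coach. Given the last few utterances, output "
    "exactly one label from this list, or 'none' if nothing applies: "
    + ", ".join(m.value for m in LiveMoment)
)
```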
Citation grounding is the trust layer
Every single coaching point in the manager-facing artifact is a clickable citation back to the timestamped utterance that triggered it. No exceptions. This is non-negotiable for two reasons: managers won't trust the system without it, and rubric drift only becomes detectable when humans can audit individual coaching points against ground truth. Architecturally this means the post-call model's structured output must include `start_ts`, `end_ts`, `speaker_id` for every claim — and the front-end must render those as audio-scrub anchors.
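One way to enforce the no-exceptions rule mechanically: a grounding check between the model's structured output and the stored transcript, run before the artifact is persisted. A sketch, with the dict shapes and timestamp tolerance as assumptions:

```python
def validate_grounding(points: list[dict], transcript: list[dict],
                       tolerance_s: float = 1.0) -> list[str]:
    """Return violations; an empty list means the artifact is publishable."""
    errors = []
    for p in points:
        if not p["citations"]:
            errors.append(f"uncited coaching point: {p['comment'][:60]!r}")
            continue
        for c in p["citations"]:
            matched = any(
                u["speaker_id"] == c["speaker_id"]
                and abs(u["start_ts"] - c["start_ts"]) <= tolerance_s
                for u in transcript
            )
            if not matched:
                errors.append(f"citation does not match any utterance: {c}")
    return errors
```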
Architecture summary
Streaming diarization layer feeds a rolling context window. Path A (live coach) is a hot subscriber: small reasoning model + constrained-taxonomy detector running every ~200ms over the last N utterances, output gated by a confidence threshold. Path B (post-call deep dive) is a cold consumer: triggered on the call-ended event, it runs Claude Opus or Gemini 2.5 Pro over the full transcript + per-tenant rubric and emits structured JSON with timestamp-grounded claims. Both write into a shared call-record store (Postgres + pgvector for utterance embeddings).
Key decisions, and why:
- Diarization vendor (Deepgram Nova-3 / AssemblyAI Universal-2): 2026 SOTA for streaming in-person audio. Reverb-tolerant, speaker diarization built in, sub-300ms latency. The load-bearing layer for any in-person sales coaching agent.
- Two-path split: live coaching must beat the rep's working-memory cycle (~2s); post-call analysis can run a long-context model for 30-60s and still feel instant to the manager. Treating these as the same problem is the most common architecture mistake in this space.
- Per-tenant rubric: contractor-by-contractor sales playbooks vary radically. A per-tenant rubric beats a global one on accuracy and on customer perception of 'this AI actually gets my business.' Versioning is required because rubrics get edited.
- Two models, two prompts, two evaluation harnesses: different precision/recall targets at different points in the loop. Forcing both into one model collapses both objectives.
- Citation grounding: the trust layer. Every coaching point must be auditable back to the source utterance (`start_ts`, `end_ts`, `speaker_id`). Without this, the system is not adoptable by managers.
- Live model (Llama 3.3 70B on Groq): sub-100ms first-token at the price tier that makes a per-call cost model work. Claude Haiku is the alternative if the live coach needs better reasoning at the cost of ~300ms latency.
- Post-call model (Claude Opus / Gemini 2.5 Pro): long-context reasoning over the full conversation + per-tenant rubric. Prompt-cache the rubric (it's stable across calls); pay only for the conversation tokens.
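The rubric-caching note maps directly onto Anthropic's prompt caching: mark the stable system block with `cache_control` and repeat calls for the same tenant pay the full input rate only for the conversation tokens. A sketch; the model string is a placeholder:

```python
import anthropic

client = anthropic.Anthropic()


def score_call(rubric_text: str, transcript_text: str):
    return client.messages.create(
        model="claude-opus-4-20250514",  # placeholder; use the current Opus model
        max_tokens=4096,
        system=[
            {
                "type": "text",
                "text": (
                    "Score this sales call against the rubric below. Ground "
                    "every coaching point in a timestamped utterance.\n\n"
                    + rubric_text
                ),
                # Cached across calls for this tenant; only the transcript
                # tokens below are billed as fresh input.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": transcript_text}],
    )
```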
- The hardest unsolved tradeoff in this space: do you let the LLM pick which moments to coach on (high recall, hallucination risk), or constrain it to a fixed taxonomy of high-leverage moments (high precision, miss the unusual)? My intuition is constrained for the live path, open-ended for the post-call path, with a feedback loop that promotes recurring open-ended moments into the constrained taxonomy over time (a toy sketch of that loop follows this list). But it's not obvious: the answer depends on whether the system is graded on rep adoption (live precision matters most) or manager efficiency (post-call recall matters most). Probably worth instrumenting both and A/B testing.
- If I were starting from scratch, I'd build the post-call path first. It runs at a slower clock speed (30s instead of 2s), its failure mode is gentler (a manager can filter bad coaching; a rep mid-conversation just gets confused by it), and it generates the labeled data the live coach needs to fine-tune against. Live coaching without good post-call data is unsolvable: there's nothing to evaluate against.
- Diarization quality is the silent killer. Most teams underinvest here because Deepgram / AssemblyAI 'just work' at the demo level. They start failing at the long tail: thick accents, three-speaker conversations, basement reverb, cell-phone audio compression. A small second-pass re-alignment model running over the streaming output catches ~70% of the misalignment errors at meaningfully lower cost than upgrading the primary STT vendor.
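The promotion loop mentioned in the first bullet could start as something this simple: count normalized open-ended labels across a tenant's recent calls and nominate recurring ones for human review before they join the live taxonomy. A toy sketch with made-up thresholds:

```python
from collections import Counter

# Made-up promotion thresholds; tune per tenant call volume.
MIN_CALLS_SEEN = 25     # label must appear on at least this many calls...
MIN_CALL_RATE = 0.10    # ...and on at least 10% of calls in the window


def nominate_for_taxonomy(labels_per_call: list[list[str]]) -> list[str]:
    """labels_per_call: normalized open-ended moment labels, one list per call."""
    n_calls = len(labels_per_call)
    counts = Counter(label for labels in labels_per_call for label in set(labels))
    return [
        label
        for label, seen in counts.items()
        if seen >= MIN_CALLS_SEEN and seen / n_calls >= MIN_CALL_RATE
    ]
```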