# Voice Agents

> Voice-agent analytics that work on any transcript. No audio storage required. Latency, interruptions, silences, call-end, and prosody analyzers, plus a synced audio timeline on every call.

Voice analytics in Greenflash works on the transcript. We never see, store, or process the audio file. Most voice analytics tools want you to ship them recordings; we don't. If a conversation has voice-shaped data like speakers and per-turn timing, voice analytics runs on it automatically.

<img src="https://mintcdn.com/greenflash/tGxprFDzRCoEawRJ/public/content-images/voice_call_detail.png?fit=max&auto=format&n=tGxprFDzRCoEawRJ&q=85&s=0f924c6ad3307294a735229b7ab81b73" alt="Voice call detail with synced audio + interactive timeline" width="2518" height="570" data-path="public/content-images/voice_call_detail.png" />

<Tip>
  Audio never touches Greenflash. When your provider supplies a recording URL, we render an audio player that streams directly from your storage. We analyze the transcript and the timing; the audio file stays yours.
</Tip>

## Auto-detection

You don't have to flag a conversation as "voice" or change anything about how you log. Greenflash looks at the transcript shape and per-turn timestamps, and if it looks like a voice call, the voice analyzers run. Already-instrumented conversations pick up voice analytics retroactively, with no backfill needed.

What this means in practice:

* If you already POST to `/v1/messages` with per-turn timing, voice analytics is already running on those calls.
* If you use Vapi, Retell, ElevenLabs, Bland, Synthflow, or Simple.ai, point the provider's webhook at us and the rest is automatic.
* If you're on something custom (LiveKit, Pipecat, OpenAI Realtime, in-house infra), send the transcript and we'll figure it out from there.

<Tip>
  Voice detection needs per-turn timing. If every message in a conversation shares the same timestamp, the voice analyzers won't fire, even if the call really did happen over voice.
</Tip>
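
A quick local check before sending can catch the flat-timestamp case described in the tip. The sketch below is only a sanity check under the assumption that timing lives in `voice.startedAt`; the function name `has_per_turn_timing` is hypothetical, and Greenflash's actual detection logic is not documented here and may differ.

```python
from typing import Any


def has_per_turn_timing(messages: list[dict[str, Any]]) -> bool:
    """Return True if the turns carry at least two distinct voice.startedAt values.

    Local sanity check only; this is not Greenflash's detection heuristic.
    """
    starts = [
        m["voice"]["startedAt"]
        for m in messages
        if isinstance(m.get("voice"), dict) and "startedAt" in m["voice"]
    ]
    # If every timestamp is identical (or missing), there is no per-turn timing.
    return len(set(starts)) >= 2
```

If this returns `False` for a call that really happened over voice, the timestamps are probably being stamped at log time rather than at speech time.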

## Canonical transcript shape

If you're building an adapter, this is the shape to target. Per-turn timing is the only hard requirement; everything else upgrades the analytics you get back.

**Minimum** (auto-detection fires, latency/interruption/silence analyzers run):

```json
{
  "role": "assistant",
  "content": "Hi, this is Aria — how can I help you today?",
  "voice": {
    "startedAt": 1700000000000,
    "durationMs": 2400
  }
}
```

`startedAt` is Unix epoch milliseconds. `durationMs` is how long that turn was spoken. Speaker comes from the standard message `role` (`"user"` / `"assistant"`) — you don't need a separate `voice.speaker` for chat-shaped transcripts.

<Tip>
  `voice.speaker` only exists for cases where `role` isn't enough: diarization-only data with no agent/user mapping (e.g. `"Speaker 0"`, `"Speaker 1"`), or transcripts logged via `messageType` instead of `role` where you still want to flag the speaker. If you're already sending `role`, leave `speaker` out.
</Tip>
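
If your stack reports turn boundaries as second offsets from the start of the call rather than absolute timestamps, a small helper can convert them into the minimum shape shown above. This is a sketch: `to_voice_message`, `call_started_at_ms`, `start_s`, and `end_s` are hypothetical names for illustration, not part of the Greenflash API.

```python
from typing import Any


def to_voice_message(
    role: str,
    content: str,
    call_started_at_ms: int,
    start_s: float,
    end_s: float,
) -> dict[str, Any]:
    """Build a message in the minimum voice shape from call-relative offsets."""
    if end_s < start_s:
        raise ValueError("end_s must not be earlier than start_s")
    return {
        "role": role,
        "content": content,
        "voice": {
            # Unix epoch milliseconds for the moment the turn started.
            "startedAt": call_started_at_ms + round(start_s * 1000),
            # How long the turn was spoken, in milliseconds.
            "durationMs": round((end_s - start_s) * 1000),
        },
    }
```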

**Recommended** (unlocks the full set of voice analyzers and timeline annotations):

| Field                                    | Why it helps                                                                                              |
| ---------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| `voice.endedAt`                          | Lets us draw exact turn boundaries when `durationMs` isn't available                                      |
| `voice.responseLatencyMs`                | Drives the time-to-first-audio (TTFA) and agent-response latency view; we'll derive it from gaps if missing |
| `voice.asrConfidence`                    | Surfaces low-confidence turns and feeds the ASR latency analyzer                                          |
| `voice.wasInterrupted` / `voice.bargeIn` | Direct signal into the interruption analyzer (we also detect from overlap, but explicit is more accurate) |
| `voice.silenceBeforeMs`                  | Direct signal into the silence analyzer                                                                   |
| `voice.prosody`                          | Enables the prosody analyzer (sentiment, arousal, emotion); skipped if absent                             |

At the conversation level, send a `voiceCall` object with `durationMs`, `endedReason`, `recordingUrl` (see below), and `latency` aggregates if your platform exposes them. Full field reference is in the [Public API schema](/features/public-api).
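
To make the "derive it from gaps" fallback for `voice.responseLatencyMs` concrete, here is one plausible reading of it, sketched under assumptions: the latency of an assistant turn is taken as the gap between the end of the preceding user turn and the start of the assistant turn. The function name is hypothetical and Greenflash's own derivation may differ.

```python
from typing import Any


def derive_response_latency_ms(messages: list[dict[str, Any]]) -> dict[int, int]:
    """Map each assistant turn index to its response latency in milliseconds.

    Illustrative only. An explicit voice.responseLatencyMs wins; otherwise the
    value is the gap since the preceding user turn ended.
    """
    latencies: dict[int, int] = {}
    for i in range(1, len(messages)):
        prev, cur = messages[i - 1], messages[i]
        if prev["role"] != "user" or cur["role"] != "assistant":
            continue
        explicit = cur["voice"].get("responseLatencyMs")
        if explicit is not None:
            latencies[i] = explicit
            continue
        prev_end = prev["voice"]["startedAt"] + prev["voice"]["durationMs"]
        # Clamp at zero: an agent that starts before the user finishes is an
        # overlap, not a negative latency.
        latencies[i] = max(0, cur["voice"]["startedAt"] - prev_end)
    return latencies
```

Sending the explicit field when your platform measures it avoids any ambiguity about where a turn "ends".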

## Five voice analyzers

Five analyzers run on every voice conversation:

| Analyzer          | What it surfaces                                                                                          |
| ----------------- | --------------------------------------------------------------------------------------------------------- |
| **Latency**       | Per-turn Time-To-First-Audio (TTFA), ASR, LLM, and TTS outliers; flags regressions when you change models |
| **Interruptions** | Barge-in clusters and overlapping speech that signal user frustration                                     |
| **Silences**      | Long gaps where the agent stalled or the user dropped off mid-call                                        |
| **Call-end**      | Classifies how the call ended (completed, abandoned, escalated, failed)                                   |
| **Prosody**       | Sentiment shifts, vocal arousal, and emotion swings across the call                                       |

These are deterministic by design: same transcript in, same result out. We chose this so you can compare week-over-week numbers without LLM drift muddying the comparison. The one exception is call-end, which calls an LLM only as a tiebreaker on cases the rules can't decide.
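
As an illustration of what a rule-based, deterministic pass over per-turn timing can look like, the sketch below flags overlapping speech and long gaps. It is not Greenflash's implementation: the function name and the 3000 ms silence threshold are assumptions chosen for the example.

```python
from typing import Any


def flag_turns(
    messages: list[dict[str, Any]],
    silence_threshold_ms: int = 3000,  # hypothetical threshold
) -> list[tuple[int, str]]:
    """Return (turn index, flag) pairs derived purely from turn timing."""
    flags: list[tuple[int, str]] = []
    for i in range(1, len(messages)):
        prev, cur = messages[i - 1]["voice"], messages[i]["voice"]
        gap = cur["startedAt"] - (prev["startedAt"] + prev["durationMs"])
        if gap < 0:
            # This turn started before the previous one finished.
            flags.append((i, "interruption"))
        elif gap >= silence_threshold_ms:
            # Nobody spoke for at least the threshold.
            flags.append((i, "silence"))
    return flags
```

Because the function depends only on its inputs, running it twice on the same transcript always yields the same flags, which is the property that makes week-over-week comparisons meaningful.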

## Conversation pathologies (voice + text)

Three pathology detectors fire on every conversation, voice or text:

* **Clarification loops:** the agent re-asks the same clarifying question across turns.
* **Repeated information:** the agent re-states context the user already gave.
* **Agent monologue:** long stretches of agent speech with no user input.

These exist because they're real failure modes we kept seeing in conversation reviews. They show up everywhere, but voice transcripts make them painfully obvious because the conversational shape is so visible.
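
To give a feel for the agent-monologue signal, the sketch below measures the longest run of consecutive assistant turns by word count, which works for both voice and text transcripts. The function name and the choice of word count as the measure are assumptions for illustration; the detector Greenflash runs is not specified here.

```python
from typing import Any


def longest_agent_stretch_words(messages: list[dict[str, Any]]) -> int:
    """Return the word count of the longest run of uninterrupted assistant turns."""
    longest = current = 0
    for message in messages:
        if message["role"] == "assistant":
            current += len(message["content"].split())
            longest = max(longest, current)
        else:
            # Any user turn ends the current stretch.
            current = 0
    return longest
```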

## Voice Call section on conversation detail

Open any voice conversation and the page renders a Voice Call section that's auto-hidden for text. Three pieces share the surface, and it's worth keeping them straight:

* **Call flow timeline (synthesized from the transcript).** Per-turn bars colored by speaker, sized by duration, and marked with interruption and silence flags. Latency overlays (TTFA, ASR, LLM, TTS) sit on each turn. Greenflash draws this view from the per-turn timing on your messages — there's no audio waveform involved, and it renders whether or not a recording exists.
* **Audio player (the real recording).** When `voiceCall.recordingUrl` is set, an HTML `<audio>` element streams the file directly from your URL. When audio is loaded, the synthesized timeline above doubles as a scrubber and click-to-seek lights up; the playback head moves through the bars in lockstep with the audio.
* **Call-end classification.** The analyzer's label is shown alongside the platform's raw `endedReason`, so you see both views side by side.

<img src="https://mintcdn.com/greenflash/tGxprFDzRCoEawRJ/public/content-images/voice_timeline.png?fit=max&auto=format&n=tGxprFDzRCoEawRJ&q=85&s=3dd382fc735e320392a02166b4311c74" alt="Synced audio player + interactive timeline" width="2530" height="906" data-path="public/content-images/voice_timeline.png" />

## The `voiceCall` object

`voiceCall` is the conversation-level metadata block for voice calls — call duration, ended reason, latency aggregates, platform-supplied success, structured outputs, and the recording URL. It lives on the conversation alongside `properties` and is what the voice-aware UI keys off of.

The call flow timeline, latency annotations, interruptions, silences, and call-end classification are all reconstructed from the transcript and per-message `voice` timing — they render whether or not you ship a recording. The audio player is the only piece that depends on `voiceCall.recordingUrl`.

### `recordingUrl` and audio playback

The audio player plays the **actual recording from your provider or storage**. It is not synthesized — there is no TTS step, and we don't reconstruct audio from the transcript. Greenflash renders an `<audio>` element pointed at the URL you give us; the listener's browser streams the bytes directly from your origin. We never download, proxy, transcode, or cache the file.

A few implications follow from that:

* **The URL has to be publicly reachable from the listener's browser.** Signed URLs are fine, but they need to stay valid for as long as you want playback to work. If you expire or rotate them, the player will break for older calls.
* **No URL means no audio and no scrubbing — but the synthesized flow still renders.** The call flow timeline, per-turn latency overlays, interruption/silence flags, and call-end classification are all transcript-derived, so a call with no `recordingUrl` still shows the full Voice Call section. Click-to-seek and the playback head are the only things that go away.
* **Audio stays in your storage.** Greenflash never copies the file. If the recording is deleted or the bucket goes away, playback goes with it; the transcript and analytics remain.
* **Webhook integrations map this automatically.** When you point Vapi, Retell, ElevenLabs, Bland, Synthflow, or Simple.ai at our webhook, the provider's recording URL is forwarded into `voiceCall.recordingUrl` for you, so playback works as long as the provider's hosted recording is reachable.

## Webhook integrations

For the major voice platforms, paste a Greenflash URL into your provider's webhook field and skip the SDK. Under the hood each one is just an adapter over `/v1/messages`, so you get the same auth, sampling, and analyses you'd get from the direct API.

| Provider              | Webhook URL                                                                 |
| --------------------- | --------------------------------------------------------------------------- |
| **Vapi**              | `https://www.greenflash.ai/api/v1/integrations/vapi?productId=<uuid>`       |
| **Retell**            | `https://www.greenflash.ai/api/v1/integrations/retell?productId=<uuid>`     |
| **ElevenLabs Agents** | `https://www.greenflash.ai/api/v1/integrations/elevenlabs?productId=<uuid>` |
| **Bland AI**          | `https://www.greenflash.ai/api/v1/integrations/bland?productId=<uuid>`      |
| **Synthflow**         | `https://www.greenflash.ai/api/v1/integrations/synthflow?productId=<uuid>`  |
| **Simple.ai**         | `https://www.greenflash.ai/api/v1/integrations/simpleai?productId=<uuid>`   |

Find your product UUID under **Settings → Products** in the Greenflash app, and add an `Authorization: Bearer gf_<your-api-key>` header in the provider's webhook config. That's the whole setup.
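
If you provision provider webhooks from a script, the URL pattern in the table and the header above can be assembled as follows. This is a convenience sketch: `webhook_config` is a hypothetical helper, and the slugs are copied from the table rather than fetched from any API.

```python
from urllib.parse import urlencode

# Provider slugs as they appear in the webhook URL table.
PROVIDER_SLUGS = {"vapi", "retell", "elevenlabs", "bland", "synthflow", "simpleai"}


def webhook_config(provider: str, product_id: str, api_key: str) -> dict:
    """Return the webhook URL and headers to paste into the provider's config."""
    if provider not in PROVIDER_SLUGS:
        raise ValueError(f"unknown provider slug: {provider!r}")
    query = urlencode({"productId": product_id})
    return {
        "url": f"https://www.greenflash.ai/api/v1/integrations/{provider}?{query}",
        "headers": {"Authorization": f"Bearer {api_key}"},
    }
```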

## Direct API option

For voice stacks not listed above (LiveKit, Pipecat, OpenAI Realtime, custom infra), POST directly to `/v1/messages` with `voiceCall` and per-message `voice` objects:

```json
{
  "productId": "your-product-uuid",
  "externalConversationId": "call_abc123",
  "externalUserId": "+15555550100",
  "voiceCall": {
    "platform": "other",
    "platformCallId": "call_abc123",
    "durationMs": 47500,
    "recordingUrl": "https://recordings.example.com/call_abc123.mp3",
    "endedReason": "user_hangup",
    "latency": { "e2eMs": 950, "asrMs": 220, "llmMs": 480, "ttsMs": 250 },
    "callSuccessful": true
  },
  "messages": [
    {
      "externalMessageId": "call_abc123:0",
      "role": "assistant",
      "content": "Hi, this is Aria — how can I help you today?",
      "voice": { "startedAt": 1700000000000, "durationMs": 2400 }
    },
    {
      "externalMessageId": "call_abc123:1",
      "role": "user",
      "content": "I'd like to book a demo.",
      "voice": { "startedAt": 1700000003100, "durationMs": 1800, "asrConfidence": 0.97 }
    }
  ]
}
```

Every field on `voiceCall` is optional, and so is everything on per-message `voice` beyond the per-turn timing. Auto-detection fires from that timing alone.
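
A minimal sketch of sending that payload from Python with only the standard library follows. Two things here are assumptions rather than documented facts: the full endpoint URL (inferred from the host used by the webhook URLs) and the use of the same `Authorization: Bearer` header as the webhook setup. Check both against the Public API reference before relying on them.

```python
import json
import urllib.request
from typing import Any

# Assumed endpoint: the host is inferred from the webhook URLs, not confirmed.
API_URL = "https://www.greenflash.ai/api/v1/messages"


def build_request(
    payload: dict[str, Any], api_key: str, url: str = API_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the voice payload."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )
```

To actually send it, pass the result to `urllib.request.urlopen(request)` and handle `urllib.error.HTTPError` for non-2xx responses.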

## Use cases

### Diagnosing a latency regression

You switched LLMs last Tuesday and CSAT slipped this week. Open the product page and the Recent Regressions card surfaces a jump in voice latency p95. Click into the affected calls. The per-turn latency annotations show TTS quietly adding 400ms after the model swap. Roll back, and the regression resolves.

### Catching abandoned calls

Open the Call-end analyzer's *abandoned* bucket and read three calls. Two of them show the same agent monologue at minute three: the prompt has the agent over-explaining before the user can confirm what they actually wanted. Tighten the prompt, and the abandoned-call rate drops.

### Comparing voice providers

Run the same prompt on Vapi and Retell for a week. Build a [User Segment](/features/user-segments) for each provider using conversation-property filters, then compare CQI, friction pillar scores, and voice latency p95 side by side. Now you can decide which one ships, with numbers.

## Next steps

<CardGroup cols={2}>
  <Card title="Custom Analyses" icon="flask-conical" href="/features/custom-analyses">
    Define guardrails that fire on voice signals: long silences, interruption clusters, unusual `endedReason` patterns.
  </Card>

  <Card title="User Segments" icon="users-round" href="/features/user-segments">
    Slice voice users by call success rate, sentiment trajectory, and call-end classification.
  </Card>

  <Card title="Linear" icon="linear" href="/integrations/linear">
    Push flagged voice conversations into Linear. The issue includes a deep link to the synced audio and timeline view.
  </Card>

  <Card title="Public API" icon="code" href="/features/public-api">
    Full schema reference for the canonical voice payload.
  </Card>
</CardGroup>
