@coderabbitai review
Walkthrough

This pull request integrates ElevenLabs as a transcription provider throughout the system, replacing Mistral as Option 2. Changes include a new backend ElevenLabs provider implementation, an API wrapper with speaker identification enhancement, a frontend service and UI component, configuration setup, and supporting utilities.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant WebUI
    participant SpeechRecAPI
    participant ElevenLabsAPI
    participant SpeakerDB
    participant AudioBackend
    User->>WebUI: Upload audio + select ElevenLabs mode
    WebUI->>SpeechRecAPI: POST /elevenlabs/v1/transcribe (audio, diarize=true)
    SpeechRecAPI->>ElevenLabsAPI: Forward audio + params
    ElevenLabsAPI->>ElevenLabsAPI: Transcribe with speaker diarization
    ElevenLabsAPI-->>SpeechRecAPI: Return transcript + word-level data + speakers
    alt enhanceSpeakers requested
        SpeechRecAPI->>SpeechRecAPI: Extract segments & build embeddings
        SpeechRecAPI->>SpeakerDB: Query speaker embeddings
        SpeakerDB-->>SpeechRecAPI: Return stored speaker embeddings
        SpeechRecAPI->>SpeechRecAPI: Match speaker segments → identified speakers
        SpeechRecAPI->>SpeechRecAPI: Annotate response with speaker metadata
    end
    SpeechRecAPI-->>WebUI: Enhanced response (transcript + speakers + confidence)
    WebUI-->>User: Display transcript with identified speakers
```

```mermaid
sequenceDiagram
    participant BackendApp
    participant TranscriptionFactory
    participant ElevenLabsProvider
    participant ElevenLabsAPI
    BackendApp->>TranscriptionFactory: get_transcription_provider("elevenlabs")
    TranscriptionFactory->>TranscriptionFactory: Read ELEVENLABS_API_KEY from env
    TranscriptionFactory->>ElevenLabsProvider: __init__(api_key)
    TranscriptionFactory-->>BackendApp: Return ElevenLabsProvider instance
    BackendApp->>ElevenLabsProvider: transcribe(audio_bytes, sample_rate, diarize=true)
    ElevenLabsProvider->>ElevenLabsProvider: Convert PCM to WAV
    ElevenLabsProvider->>ElevenLabsAPI: POST multipart/form-data (WAV + config)
    ElevenLabsAPI-->>ElevenLabsProvider: Return JSON (transcript, words, speakers)
    ElevenLabsProvider->>ElevenLabsProvider: Parse response → extract segments + speaker data
    ElevenLabsProvider-->>BackendApp: Return structured dict (text, segments, diarization)
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The changes are substantial and span multiple layers (backend services, API routes, frontend services, UI) with heterogeneous logic patterns. Key factors: a new ElevenLabs provider implementation with PCM-to-WAV conversion and API integration, speaker identification enhancement logic, a parser utility for response formatting, and frontend service/state management integration. While individual files follow consistent patterns, the overall scope and cross-layer coordination require careful verification of end-to-end flow, error handling, and API contract consistency between layers.

Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
✅ Actions performed

Review triggered.
Actionable comments posted: 9
🧹 Nitpick comments (17)
extras/speaker-recognition/src/simple_speaker_recognition/utils/elevenlabs_parser.py (3)
70-81: Sort by time and keep unknown-speaker words instead of dropping them.
- Ensure chronological grouping even if input is unsorted.
- Don't discard words lacking `speaker_id`; label them as `speaker_unknown` to preserve the transcript.

```diff
 def _group_words_by_speaker(self, words: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
@@
-        if not words:
+        if not words:
             return []
-        segments = []
+        # Defensive: ensure time order
+        words = sorted(words, key=lambda w: w.get('start', 0.0))
+
+        segments = []
         current_segment = None
@@
-        for word in words:
-            speaker_id = word.get('speaker_id')
-            if speaker_id is None:
-                continue
-            speaker_label = f"speaker_{speaker_id}"
+        for word in words:
+            speaker_id = word.get('speaker_id')
+            speaker_label = f"speaker_{speaker_id}" if speaker_id is not None else "speaker_unknown"
```

Also applies to: 85-92
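A quick standalone check of the behavior this suggestion produces (hypothetical word dicts, not the PR's actual inputs):

```python
# Hypothetical ElevenLabs-style word dicts; deliberately unsorted, one missing speaker_id.
words = [
    {'start': 1.0, 'end': 1.4, 'text': 'world', 'speaker_id': 0},
    {'start': 0.0, 'end': 0.4, 'text': 'hello', 'speaker_id': 0},
    {'start': 2.0, 'end': 2.6, 'text': 'hmm'},  # no speaker_id: keep it, don't drop it
]

words = sorted(words, key=lambda w: w.get('start', 0.0))  # defensive time ordering

segments, current = [], None
for w in words:
    sid = w.get('speaker_id')
    label = f"speaker_{sid}" if sid is not None else "speaker_unknown"
    if current and current['speaker'] == label:
        current['end'] = w['end']
        current['text'] += ' ' + w['text']
    else:
        current = {'speaker': label, 'start': w['start'], 'end': w['end'], 'text': w['text']}
        segments.append(current)

print(segments[0]['text'])     # hello world
print(segments[1]['speaker'])  # speaker_unknown
```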
53-56: Compute total_duration robustly (don't rely on the last word).

Handles unsorted inputs safely.
```diff
-        total_duration = 0.0
-        if filtered_words:
-            total_duration = filtered_words[-1].get('end', 0.0)
+        total_duration = max((w.get('end', 0.0) for w in filtered_words), default=0.0)
```
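A small check of why the `max(...)` form is safer (hypothetical timing dicts):

```python
# Unsorted word timings: the last list element is not the latest word.
filtered_words = [
    {'start': 2.0, 'end': 2.5},
    {'start': 0.0, 'end': 0.4},
]

last_based = filtered_words[-1].get('end', 0.0)  # 0.4: wrong on unsorted input
robust = max((w.get('end', 0.0) for w in filtered_words), default=0.0)  # 2.5

# Empty input: the default avoids a ValueError from max()
empty_case = max((w.get('end', 0.0) for w in []), default=0.0)  # 0.0

print(last_based, robust, empty_case)  # 0.4 2.5 0.0
```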
23-34: Add IO/JSON error handling and Path support; set encoding.

Improves resilience and type flexibility.
```diff
-from typing import Any, Dict, List, Optional, Tuple
+from typing import Any, Dict, List, Optional, Tuple, Union
@@
-    def parse_elevenlabs_json(self, json_path: str) -> Dict[str, Any]:
+    def parse_elevenlabs_json(self, json_path: Union[str, Path]) -> Dict[str, Any]:
@@
-        with open(json_path, 'r') as f:
-            data = json.load(f)
+        json_path = Path(json_path)
+        try:
+            with json_path.open('r', encoding='utf-8') as f:
+                data = json.load(f)
+        except (FileNotFoundError, json.JSONDecodeError) as e:
+            logger.error("Failed to read/parse ElevenLabs JSON at %s: %s", json_path, e)
+            raise ValueError(f"Invalid ElevenLabs JSON at {json_path}") from e
```

backends/advanced/src/advanced_omi_backend/models/transcription.py (1)
39-39: Verify no references to the removed MISTRAL provider remain: COMPLETE

The breaking change successfully removed MISTRAL from the active provider enum (transcription.py). No code references `TranscriptionProvider.MISTRAL` or the old enum variant.

However, two orphaned code fragments should be cleaned up:
- backends/advanced/src/advanced_omi_backend/models/conversation.py:23 – Remove the unused `MISTRAL = "mistral"` enum member (this enum is not referenced anywhere)
- backends/advanced/src/advanced_omi_backend/app_config.py:56 – Remove the unused `self.mistral_api_key = os.getenv("MISTRAL_API_KEY")` (this attribute is never referenced)

These are not breaking issues, but they should be removed to avoid confusion and maintain consistency with the provider change.
extras/speaker-recognition/init.py (1)
251-257: Add summary output for ElevenLabs key

Good CLI hook. Also show its status in the final summary (like Deepgram) to reduce confusion.
```diff
@@ def show_summary(self):
         """Show configuration summary"""
@@
-        self.console.print(f"✅ HTTPS Enabled: {self.config.get('REACT_UI_HTTPS', 'false')}")
-        if self.config.get('DEEPGRAM_API_KEY'):
-            self.console.print(f"✅ Deepgram API Key: Configured")
+        self.console.print(f"✅ HTTPS Enabled: {self.config.get('REACT_UI_HTTPS', 'false')}")
+        if self.config.get('DEEPGRAM_API_KEY') or self.read_existing_env_value("DEEPGRAM_API_KEY"):
+            self.console.print("✅ Deepgram API Key: Configured")
+        if self.config.get('ELEVENLABS_API_KEY') or self.read_existing_env_value("ELEVENLABS_API_KEY"):
+            self.console.print("✅ ElevenLabs API Key: Configured")
```

backends/advanced/init.py (1)
169-189: Hide API key input (avoid echoing secrets)
`Prompt.ask` echoes the ElevenLabs key. Prefer hidden input (getpass), as for a password. This is a UX/security improvement; optional but recommended.

Example change:

```python
# Reuse existing if present...
# When asking for a new key, prefer hidden input:
api_key = self.prompt_password("ElevenLabs API key") if not existing_key else (api_key_input or existing_key)
```

If you need "leave empty to skip" semantics, introduce a helper that uses getpass but allows an empty return.
backends/advanced/src/advanced_omi_backend/services/transcription/elevenlabs.py (1)
74-83: Harden HTTP error handling and logging

Use `response.raise_for_status()` and `logger.exception(...)`; avoid a broad `except Exception`.

```diff
@@
-        async with httpx.AsyncClient(timeout=timeout_config) as client:
-            response = await client.post(
+        async with httpx.AsyncClient(timeout=timeout_config) as client:
+            response = await client.post(
                 self.url, headers=headers, data=data, files=files
             )
-
-            if response.status_code == 200:
-                result = response.json()
+            response.raise_for_status()
+            result = response.json()
@@
-            else:
-                logger.error(f"ElevenLabs API error: {response.status_code} - {response.text}")
-                return {"text": "", "words": [], "segments": []}
+    except httpx.HTTPStatusError as e:
+        logger.exception(f"HTTP error calling ElevenLabs API: {e.response.status_code}")
+        return {"text": "", "words": [], "segments": []}
@@
-    except httpx.TimeoutException as e:
-        logger.error(f"Timeout during ElevenLabs API call: {e}")
+    except httpx.TimeoutException as e:
+        logger.exception("Timeout during ElevenLabs API call")
         return {"text": "", "words": [], "segments": []}
-    except Exception as e:
-        logger.error(f"Error calling ElevenLabs API: {e}")
+    except httpx.HTTPError as e:
+        logger.exception("Network error during ElevenLabs API call")
         return {"text": "", "words": [], "segments": []}
```

Also applies to: 120-129
extras/speaker-recognition/webui/src/services/elevenlabs.ts (2)
65-71: Align diarization params with the backend/API.

Use `diarize` (and optionally `timestamps_granularity`) to match the wrapper/ElevenLabs, not `enable_speaker_diarization`.

```diff
-    // Enable speaker diarization
-    formData.append('enable_speaker_diarization', 'true')
+    // Enable speaker diarization
+    formData.append('diarize', 'true')
+    formData.append('timestamps_granularity', 'word')
```
144-149: Text concatenation misses spaces; running average is biased.
- Add a space when appending words.
- Compute a length-weighted average instead of repeatedly halving.
```diff
-      currentSegment.end = word.end
-      currentSegment.text += word.text
-      // Update confidence as running average
-      currentSegment.confidence = (currentSegment.confidence + confidence) / 2
+      currentSegment.end = word.end
+      currentSegment.text += ` ${word.text}`
+      // Track count and update running average
+      // @ts-expect-error internal counter not in interface
+      currentSegment._count = (currentSegment._count ?? 1) + 1
+      // @ts-expect-error internal counter not in interface
+      currentSegment.confidence = ((currentSegment.confidence * (currentSegment._count - 1)) + confidence) / currentSegment._count
```

backends/advanced/Docs/elevenlabs-integration.md (3)
562-579: Add languages to all fenced code blocks.

At least one fence lacks a language (markdownlint MD040). Use `text` for ASCII diagrams.

````diff
-```
+```text
 User uploads audio file
   ↓
 [Inference Page UI]
 ...
````

---

375-379: Prevent false-positive secret leaks in examples.

The `Authorization: Bearer YOUR_JWT_TOKEN` header can trigger scanners. Use a neutral placeholder.

```diff
- -H "Authorization: Bearer YOUR_JWT_TOKEN" \
+ -H "Authorization: Bearer <AUTH_TOKEN_PLACEHOLDER>" \
```
393-399: Confidence mapping: document the saturation used in code.

The code clamps `abs(logprob)` to 1.0; the docs should reflect this for consistency.

```diff
-'confidence': 1.0 - abs(word_obj.get('logprob', 0))
+'confidence': 1.0 - min(abs(word_obj.get('logprob', 0)), 1.0)
```

extras/speaker-recognition/src/simple_speaker_recognition/api/routers/elevenlabs_wrapper.py (5)
136-169: Avoid O(n²) indexing on words; track original indices.

Use enumerate to keep indices and build segments without `list.index`.

```diff
-    # Filter only actual words (skip spacing and audio events)
-    filtered_words = [w for w in words if w.get('type') == 'word']
+    # Keep original indices for later updates
+    filtered_pairs = [(i, w) for i, w in enumerate(words) if w.get('type') == 'word']
@@
-    speaker_segments = []
-    if filtered_words:
+    speaker_segments = []
+    if filtered_pairs:
         current_segment = None
@@
-        for word in filtered_words:
-            speaker_id = word.get('speaker_id')
+        for original_idx, word in filtered_pairs:
+            speaker_id = word.get('speaker_id')
             if speaker_id is None:
                 continue
@@
-                current_segment = {
+                current_segment = {
                     'speaker_id': speaker_id,
-                    'start_time': word.get('start', 0.0),
-                    'end_time': word.get('end', 0.0),
-                    'word_indices': [filtered_words.index(word)]
+                    'start_time': word.get('start', 0.0),
+                    'end_time': word.get('end', 0.0),
+                    'word_indices': [original_idx]
                 }
             else:
                 # Extend current segment
-                current_segment['end_time'] = word.get('end', 0.0)
-                current_segment['word_indices'].append(filtered_words.index(word))
+                current_segment['end_time'] = word.get('end', 0.0)
+                current_segment['word_indices'].append(original_idx)
```
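The core of the fix in isolation (hypothetical word dicts): `enumerate` preserves positions in the original list, whereas `list.index` rescans and also returns the wrong index for duplicate entries:

```python
words = [
    {'type': 'word', 'text': 'hi'},
    {'type': 'spacing', 'text': ' '},
    {'type': 'word', 'text': 'hi'},  # duplicate content
]

# O(n) filter that keeps original indices
filtered_pairs = [(i, w) for i, w in enumerate(words) if w.get('type') == 'word']
indices = [i for i, _ in filtered_pairs]
print(indices)  # [0, 2]

# list.index compares by equality, so the duplicate maps to index 0 both times
filtered_words = [w for w in words if w.get('type') == 'word']
print([filtered_words.index(w) for w in filtered_words])  # [0, 0]
```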
225-239: Replace nested rescans with direct index updates.

Now that `word_indices` are original indices, update in O(1).

```diff
-            for word_idx in segment_info["word_indices"]:
-                if word_idx < len(filtered_words):
-                    # Find the original index in enhanced_words
-                    original_word = filtered_words[word_idx]
-                    for i, w in enumerate(enhanced_words):
-                        if w is original_word or (w.get('start') == original_word.get('start') and w.get('text') == original_word.get('text')):
-                            enhanced_words[i].update({
-                                "identified_speaker_id": segment_result["speaker_id"],
-                                "identified_speaker_name": segment_result["speaker_name"],
-                                "speaker_identification_confidence": segment_result["confidence"],
-                                "speaker_status": segment_result["status"]
-                            })
-                            break
+            for original_idx in segment_info["word_indices"]:
+                if 0 <= original_idx < len(enhanced_words):
+                    enhanced_words[original_idx].update({
+                        "identified_speaker_id": segment_result["speaker_id"],
+                        "identified_speaker_name": segment_result["speaker_name"],
+                        "speaker_identification_confidence": segment_result["confidence"],
+                        "speaker_status": segment_result["status"]
+                    })
```
243-257: Exception logging and fallback updates.

Use `log.exception` and the same O(1) index update in the error path.

```diff
-        except Exception as e:
-            log.warning(f"Error identifying segment {segment_idx}: {e}")
+        except Exception as e:
+            log.exception(f"Error identifying segment {segment_idx}: {e}")
@@
-            for word_idx in segment_info["word_indices"]:
-                if word_idx < len(filtered_words):
-                    original_word = filtered_words[word_idx]
-                    for i, w in enumerate(enhanced_words):
-                        if w is original_word or (w.get('start') == original_word.get('start') and w.get('text') == original_word.get('text')):
-                            enhanced_words[i].update({
-                                "identified_speaker_id": None,
-                                "identified_speaker_name": None,
-                                "speaker_identification_confidence": 0.0,
-                                "speaker_status": SpeakerStatus.ERROR.value
-                            })
-                            break
+            for original_idx in segment_info["word_indices"]:
+                if 0 <= original_idx < len(enhanced_words):
+                    enhanced_words[original_idx].update({
+                        "identified_speaker_id": None,
+                        "identified_speaker_name": None,
+                        "speaker_identification_confidence": 0.0,
+                        "speaker_status": SpeakerStatus.ERROR.value
+                    })
```
291-300: Prefer `log.exception` on broad except; keep the error chain.

Improves observability and exception chaining.

```diff
-    except Exception as e:
-        log.error(f"Error during speaker identification: {e}")
+    except Exception as e:
+        log.exception(f"Error during speaker identification: {e}")
```
387-391: Error handling: use `log.exception` and chain the raise.

Aligns with Ruff TRY400/B904 and keeps the original traceback.

```diff
-    except Exception as e:
-        log.error(f"Error processing request: {e}")
-        raise HTTPException(status_code=500, detail=f"Error processing request: {str(e)}")
+    except Exception as e:
+        log.exception(f"Error processing request: {e}")
+        raise HTTPException(status_code=500, detail=f"Error processing request: {e!s}") from e
```
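The practical difference, in a minimal stdlib demonstration: `log.exception` records the active traceback, which `log.error` does not:

```python
import io
import logging

logger = logging.getLogger("demo")
stream = io.StringIO()
logger.addHandler(logging.StreamHandler(stream))

try:
    1 / 0
except Exception:
    logger.error("Error processing request")      # message only
    logger.exception("Error processing request")  # message + traceback

output = stream.getvalue()
print("ZeroDivisionError" in output)  # True, contributed by logger.exception
```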
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- CLAUDE.md (1 hunks)
- backends/advanced/.env.template (1 hunks)
- backends/advanced/Docs/elevenlabs-integration.md (1 hunks)
- backends/advanced/init.py (2 hunks)
- backends/advanced/src/advanced_omi_backend/models/transcription.py (1 hunks)
- backends/advanced/src/advanced_omi_backend/services/transcription/__init__.py (4 hunks)
- backends/advanced/src/advanced_omi_backend/services/transcription/elevenlabs.py (1 hunks)
- extras/speaker-recognition/.env.template (1 hunks)
- extras/speaker-recognition/init.py (3 hunks)
- extras/speaker-recognition/src/simple_speaker_recognition/api/routers/__init__.py (1 hunks)
- extras/speaker-recognition/src/simple_speaker_recognition/api/routers/elevenlabs_wrapper.py (1 hunks)
- extras/speaker-recognition/src/simple_speaker_recognition/api/service.py (3 hunks)
- extras/speaker-recognition/src/simple_speaker_recognition/utils/elevenlabs_parser.py (1 hunks)
- extras/speaker-recognition/webui/src/components/ProcessingModeSelector.tsx (4 hunks)
- extras/speaker-recognition/webui/src/services/elevenlabs.ts (1 hunks)
- extras/speaker-recognition/webui/src/services/speakerIdentification.ts (5 hunks)
- wizard.py (1 hunks)
🧰 Additional context used
🪛 Gitleaks (8.28.0)
backends/advanced/Docs/elevenlabs-integration.md
[high] 375-376: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.
(curl-auth-header)
🪛 markdownlint-cli2 (0.18.1)
backends/advanced/Docs/elevenlabs-integration.md
562-562: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.1)
backends/advanced/src/advanced_omi_backend/services/transcription/elevenlabs.py
124-124: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
126-126: Do not catch blind exception: Exception
(BLE001)
127-127: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
extras/speaker-recognition/src/simple_speaker_recognition/api/routers/elevenlabs_wrapper.py
243-243: Do not catch blind exception: Exception
(BLE001)
291-291: Do not catch blind exception: Exception
(BLE001)
292-292: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
306-306: Do not perform function call File in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
389-389: Do not catch blind exception: Exception
(BLE001)
390-390: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
391-391: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
391-391: Use explicit conversion flag
Replace with conversion flag
(RUF010)
backends/advanced/src/advanced_omi_backend/services/transcription/__init__.py
96-98: Avoid specifying long messages outside the exception class
(TRY003)
101-101: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: integration-tests
🔇 Additional comments (24)
extras/speaker-recognition/src/simple_speaker_recognition/utils/elevenlabs_parser.py (3)
12-22: Overall structure looks solid.

Clear API, sensible segment model, and helpful metadata.
57-68: The review comment's suggested refactoring would break external callers.

The proposed diff removes `'total_duration'` from the return value. However, external code like `laptop_client.py:443` uses `summary.get('total_duration', 0)`, expecting it at the top level. Removing it would cause those calls to silently return `0` instead of the actual duration, introducing a bug.

If deduplication is desired, the options are:
- Remove from `metadata`, keep at top level (preserves the API)
- Remove from top level, keep in `metadata` (requires updating all callers across multiple files, beyond this PR's scope)
Likely an incorrect or invalid review comment.
41-45: Parser correctly implements the ElevenLabs API specification; no issues found.

Verification confirms the parser correctly accesses all documented fields from the ElevenLabs Speech-to-Text API response:
- Top-level: language_code, language_probability, text, words array
- Word-level fields: type ("word", "spacing", "audio_event"), start, end, text, speaker_id, logprob
The implementation (lines 31–84) safely retrieves each field using `.get()` with defaults, correctly filters words by type, and properly converts logprob to confidence. No field-name mismatches or missing accesses were detected.

extras/speaker-recognition/src/simple_speaker_recognition/api/service.py (3)
34-34: LGTM! Consistent configuration pattern.

The ElevenLabs API key field follows the same pattern as the Deepgram configuration, maintaining consistency in the codebase.
56-59: LGTM! Environment override follows established pattern.

The ElevenLabs API key environment override is consistent with the Deepgram implementation above it.
140-140: LGTM! Router integration follows existing patterns.

The ElevenLabs router import and registration are consistent with other routers in the service.
Also applies to: 150-150
backends/advanced/.env.template (2)
53-56: LGTM! Clear documentation for ElevenLabs integration.

The ElevenLabs configuration is well documented, with a helpful URL for obtaining API keys. The option numbering and structure are consistent with other transcription providers.
60-62: LGTM! Provider documentation updated correctly.

The transcription provider comment now accurately includes ElevenLabs as an available option alongside Deepgram and Parakeet.
wizard.py (1)
186-191: LGTM! Consistent API key propagation pattern.

The ElevenLabs API key handling follows the same pattern as the Deepgram key propagation above it, ensuring consistency in the wizard's configuration flow.
extras/speaker-recognition/.env.template (1)
42-42: LGTM! Environment variable follows naming conventions.

The ElevenLabs API key variable is correctly placed with other external service credentials and uses a consistent placeholder format.
CLAUDE.md (1)
289-291: LGTM! Documentation updated consistently.

The CLAUDE.md transcription provider documentation now reflects ElevenLabs as Option 2, consistent with the changes in `.env.template`. The feature descriptions (99 languages, speaker diarization) align with ElevenLabs' known capabilities.
453-454: LGTM! Grid layout updated for three modes.

The grid columns and slice adjustment correctly accommodate the addition of the ElevenLabs mode alongside the existing two modes.
471-477: LGTM! Good UX addition for requirements visibility.

The requirements badge provides clear visual feedback about API key dependencies, improving user awareness before selecting a processing mode.
62-70: Speaker limit claim verified; the code is accurate.

ElevenLabs' speaker diarization supports up to 32 speakers, confirming the configuration in the code is correct. The ElevenLabs mode setup follows the existing pattern and requires no changes.
extras/speaker-recognition/src/simple_speaker_recognition/api/routers/__init__.py (1)
8-8: LGTM! Router export follows established patterns.

The ElevenLabs router import and public export are consistent with the existing router structure, maintaining clean module organization.
Also applies to: 17-17
extras/speaker-recognition/init.py (2)
423-423: LGTM: invoked ElevenLabs setup in run()

The call order fits alongside other provider setups.
457-458: CLI flag added

Nice. Please ensure README/Docs mention `--elevenlabs-api-key`.
23-25: Wiring/export looks correct

The import and `__all__` exposure for ElevenLabsProvider are consistent.
Also applies to: 141-142
backends/advanced/init.py (1)
139-146: Provider list update

ElevenLabs option added and described; looks good.
extras/speaker-recognition/webui/src/services/speakerIdentification.ts (4)
99-101: Mode switch wiring

The new mode case is correctly routed to the ElevenLabs processor.
228-280: ElevenLabs processing path looks consistent
- Mapping aligns with other modes; summary computed via ElevenLabs helpers.
- Error rethrow integrates with outer friendly messages.
Confirm that `options.maxSpeakers` is respected by your backend path; otherwise consider removing it from the call to avoid confusion.
577-581: UX: Mode description

Clear and concise; matches functionality.
9-9: All ElevenLabs service exports verified successfully.

The import statement in `speakerIdentification.ts` line 9 correctly references all four items:
- `transcribeWithElevenLabs` (async function, line 52)
- `processElevenLabsResponse` (function, line 109)
- `calculateConfidenceSummary` (function, line 173, aliased as `calculateElevenLabsConfidenceSummary`)
- `ElevenLabsResponse` (interface, line 18)
200-208: The `validate_confidence` function always returns a `float` and never returns `None`.

The implementation explicitly handles the `None` case by assigning `0.0` before the return statement, and the function guarantees a clamped float between 0.0 and 1.0. The comparison `confidence >= confidence_threshold` at line 206 is safe and requires no guard.

Likely an incorrect or invalid review comment.
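A sketch consistent with the behavior described above (not the PR's exact code): `None` becomes `0.0` and the result is clamped to [0, 1]:

```python
def validate_confidence(confidence) -> float:
    """None -> 0.0; anything else is clamped into [0.0, 1.0]."""
    if confidence is None:
        confidence = 0.0
    return max(0.0, min(1.0, float(confidence)))

print(validate_confidence(None))  # 0.0
print(validate_confidence(1.7))   # 1.0
print(validate_confidence(-0.2))  # 0.0
print(validate_confidence(0.42))  # 0.42

# The threshold comparison is therefore always float vs float
print(validate_confidence(None) >= 0.5)  # False
```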
> **Comparison**: Deepgram Nova-3 costs ~$0.36/hour (pay-as-you-go)
🧩 Analysis chain
Pricing claim may drift. Please verify.
Confirm latest Deepgram/ElevenLabs pricing before merging.
🌐 Web query:
Verify current prices for ElevenLabs Scribe v1 (hourly effective rates across tiers) and Deepgram Nova-3 pay‑as‑you‑go.
💡 Result:
I checked current public pricing (as of Oct 21, 2025):
- ElevenLabs — Scribe v1 (product/developer docs): hourly effective rates by subscription tier are listed on ElevenLabs' pricing table (example: Starter = $0.40/hour included; Creator = $0.35/hour included; Pro = $0.33/hour included; Scale = $0.30/hour included; Business = $0.22/hour included). The table also shows per-additional-hour top-up rates for some tiers. [1]
- Deepgram — Nova‑3 pay‑as‑you‑go (Deepgram pricing page):
- Pre‑recorded (batch) Nova‑3 (monolingual): $0.0043 per minute (≈ $0.258/hr); Nova‑3 multilingual batch: $0.0052/min.
- Streaming (real‑time) Nova‑3 (monolingual): $0.0077/min (≈ $0.462/hr); streaming multilingual: $0.0092/min. (Deepgram lists separate Pay‑As‑You‑Go and Growth rates on their pricing page.) [2]
Sources:
[1] ElevenLabs Scribe v1 docs/pricing.
[2] Deepgram pricing page (Nova‑3 batch and streaming rates).
---
🏁 Script executed:
```shell
#!/bin/bash
# Read the file to get full context around lines 53-54
cat "backends/advanced/Docs/elevenlabs-integration.md"
```

Length of output: 29346
Deepgram Nova-3 pricing claims are inaccurate; requires correction in two locations.
Based on current Deepgram pricing (as of Oct 21, 2025), the documented figure of ~$0.36/hour does not match actual rates:
- Batch (pay-as-you-go, pre-recorded): $0.0043/min = ~$0.26/hour
- Streaming (real-time): $0.0077/min = ~$0.46/hour
Locations to update:
- Line 53-54: Change "~$0.36/hour (pay-as-you-go)" to "~$0.26/hour (batch)", or specify streaming if applicable
- Comparison table (later in the file, "Cost" row): Update the Deepgram cost from "~$0.36/hr" to match the correct rate for the intended use case
Clarify whether the comparison targets batch or streaming, as pricing differs significantly.
🤖 Prompt for AI Agents
In backends/advanced/Docs/elevenlabs-integration.md around lines 53-54, the
Deepgram Nova-3 hourly cost is incorrect (~$0.36/hr); update the text to specify
whether it refers to batch or streaming and use the correct rate (batch:
~$0.26/hr, streaming: ~$0.46/hr). Also update the later Comparison table "Cost"
row to match the same chosen use case (replace "~$0.36/hr" with "~$0.26/hr
(batch)" or "~$0.46/hr (streaming)"), and add a brief parenthetical note
clarifying that Deepgram pricing differs between batch (pre-recorded) and
streaming (real-time).
```python
elif provider_name == "elevenlabs":
    if not elevenlabs_key:
        raise RuntimeError(
            "ElevenLabs transcription provider requested but ELEVENLABS_API_KEY not configured"
        )
    logger.info(f"Using ElevenLabs transcription provider in {mode} mode")
    if mode == "streaming":
        raise RuntimeError("ElevenLabs does not support streaming mode - use batch mode")
    return ElevenLabsProvider(elevenlabs_key)
```
Doc/message updates and auto-select warning
- Update docstring to list 'elevenlabs' (and 'offline') to reflect supported providers.
- Current auto-select warning omits ElevenLabs; users with ELEVENLABS_API_KEY and no explicit provider will see a misleading message. Clarify guidance to set TRANSCRIPTION_PROVIDER=elevenlabs.
```diff
@@
-    Args:
-        provider_name: Name of the provider ('deepgram', 'parakeet').
+    Args:
+        provider_name: Name of the provider ('deepgram', 'parakeet', 'elevenlabs', 'offline').
@@
-        logger.warning(
-            "No transcription provider configured (DEEPGRAM_API_KEY or PARAKEET_ASR_URL required)"
-        )
+        logger.warning(
+            "No transcription provider configured. Set TRANSCRIPTION_PROVIDER to "
+            "'deepgram', 'parakeet', 'offline', or 'elevenlabs' (and provide the respective credentials)."
+        )
```

Also applies to: 126-129, 38-41
🧰 Tools
🪛 Ruff (0.14.1)
96-98: Avoid specifying long messages outside the exception class
(TRY003)
101-101: Avoid specifying long messages outside the exception class
(TRY003)
```python
words.append({
    'word': word_obj.get('text', ''),
    'start': word_obj.get('start', 0),
    'end': word_obj.get('end', 0),
    'confidence': 1.0 - abs(word_obj.get('logprob', 0)),  # Convert logprob to confidence
    'speaker': word_obj.get('speaker_id'),
})

# Extract speaker segments if diarization is enabled
```
Confidence mapping from logprob is incorrect
`1 - abs(logprob)` can go negative or exceed 1. Prefer a probability field if provided, or convert the logprob via `exp()` and clamp to [0, 1].
```diff
@@
-        words.append({
+        import math
+        prob = word_obj.get('prob')
+        if prob is None:
+            lp = word_obj.get('logprob')
+            prob = math.exp(lp) if lp is not None else None
+        confidence = None
+        if prob is not None:
+            try:
+                confidence = max(0.0, min(1.0, float(prob)))
+            except Exception:
+                confidence = None
+        words.append({
             'word': word_obj.get('text', ''),
             'start': word_obj.get('start', 0),
             'end': word_obj.get('end', 0),
-            'confidence': 1.0 - abs(word_obj.get('logprob', 0)),  # Convert logprob to confidence
+            'confidence': confidence,
             'speaker': word_obj.get('speaker_id'),
         })
```

🤖 Prompt for AI Agents
In
backends/advanced/src/advanced_omi_backend/services/transcription/elevenlabs.py
around lines 97 to 105, the current confidence mapping uses 1 - abs(logprob)
which can produce values <0 or >1; change it to prefer an explicit probability
field if present (e.g., word_obj.get('probability')), otherwise convert logprob
to probability using math.exp(logprob) (or 0.0 if missing), then clamp the
result to [0.0, 1.0] and assign that as the confidence; ensure you import math
and handle missing/None values safely so confidence is always within [0,1].
```python
async with aiohttp.ClientSession() as session:
    async with session.post(
        url,
        headers=headers,
        data=form_data
    ) as response:
        if response.status != 200:
            error_text = await response.text()
            log.error(f"ElevenLabs API error: {response.status} - {error_text}")
            raise HTTPException(
                status_code=response.status,
                detail=f"ElevenLabs API error: {error_text}"
            )

        result = await response.json()
        log.info("Successfully received ElevenLabs response")
        return result
```
Add client timeouts for external ElevenLabs call.
Unbounded waits can hang workers. Set explicit ClientTimeout.
```diff
-    async with aiohttp.ClientSession() as session:
-        async with session.post(
+    timeout = aiohttp.ClientTimeout(total=600, connect=30, sock_read=540, sock_connect=30)
+    async with aiohttp.ClientSession(timeout=timeout) as session:
+        async with session.post(
             url,
             headers=headers,
             data=form_data
         ) as response:
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
timeout = aiohttp.ClientTimeout(total=600, connect=30, sock_read=540, sock_connect=30)
async with aiohttp.ClientSession(timeout=timeout) as session:
    async with session.post(
        url,
        headers=headers,
        data=form_data
    ) as response:
        if response.status != 200:
            error_text = await response.text()
            log.error(f"ElevenLabs API error: {response.status} - {error_text}")
            raise HTTPException(
                status_code=response.status,
                detail=f"ElevenLabs API error: {error_text}"
            )

        result = await response.json()
        log.info("Successfully received ElevenLabs response")
        return result
```
🤖 Prompt for AI Agents
In
extras/speaker-recognition/src/simple_speaker_recognition/api/routers/elevenlabs_wrapper.py
around lines 75-92, the aiohttp.ClientSession is created without any timeout
which can lead to unbounded waits; create an aiohttp.ClientTimeout (e.g.,
ClientTimeout(total=30, connect=5, sock_read=20) or use a configurable value
from settings) and pass it into ClientSession via the timeout parameter so the
external ElevenLabs call will fail fast on network issues; keep other logic
intact and ensure proper error handling still awaits response.text() when status
!= 200.
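The fail-fast behaviour the review asks for can be illustrated with the standard library alone; `asyncio.wait_for` enforces the same kind of deadline that `aiohttp.ClientTimeout(total=...)` applies to the whole request. This is a minimal sketch, not the actual wrapper code — `slow_call` is a hypothetical stand-in for the external API call:

```python
import asyncio

async def slow_call() -> str:
    # Stand-in for an unbounded external API call.
    await asyncio.sleep(10)
    return "done"

async def main() -> str:
    try:
        # Bound the total wait, mirroring aiohttp.ClientTimeout(total=...).
        return await asyncio.wait_for(slow_call(), timeout=0.05)
    except asyncio.TimeoutError:
        return "timed out"

print(asyncio.run(main()))  # → timed out
```

With a timeout in place, a stalled upstream call raises instead of pinning a worker indefinitely.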
```python
def _logprob_to_confidence(self, logprob: float) -> float:
    """Convert ElevenLabs logprob to confidence score (0-1).

    Args:
        logprob: Log probability from ElevenLabs

    Returns:
        Confidence score between 0 and 1
    """
    # ElevenLabs returns log probability (negative values closer to 0 are more confident)
    # Convert to confidence: closer to 0 = higher confidence
    return 1.0 - min(abs(logprob), 1.0)
```
Fix confidence mapping; current formula collapses to 0 for most words.
For typical logprobs (e.g., ≤ -1), 1 - min(abs(logprob), 1) yields 0, flattening confidence and degrading downstream scoring. Map logprob back to probability instead.
Apply:

```diff
+import math
@@
 def _logprob_to_confidence(self, logprob: float) -> float:
@@
-    # ElevenLabs returns log probability (negative values closer to 0 are more confident)
-    # Convert to confidence: closer to 0 = higher confidence
-    return 1.0 - min(abs(logprob), 1.0)
+    # ElevenLabs returns log probability (log p). Convert to probability in [0,1].
+    if logprob is None:
+        return 0.0
+    try:
+        # Clamp to [0,1] and handle extreme/invalid values.
+        return max(0.0, min(1.0, math.exp(float(logprob))))
+    except (TypeError, ValueError, OverflowError):
+        return 0.0
```

🤖 Prompt for AI Agents
In
extras/speaker-recognition/src/simple_speaker_recognition/utils/elevenlabs_parser.py
around lines 165-177, the current mapping 1.0 - min(abs(logprob), 1.0) collapses
most logprobs to 0; instead convert the log probability back to a probability
and clamp to [0,1]. Replace the formula by computing prob = math.exp(logprob)
(import math if missing), then return max(0.0, min(1.0, prob)) so very negative
logprobs yield low confidence and values near 0 yield confidence near 1.0.
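As a standalone sanity check of the suggested mapping (function name here is illustrative, not the wrapper's method), log-probabilities near 0 map to confidence near 1, very negative values decay toward 0, and invalid input falls back to 0.0:

```python
import math

def logprob_to_confidence(logprob) -> float:
    """Convert a log probability to a confidence score clamped to [0, 1]."""
    if logprob is None:
        return 0.0
    try:
        return max(0.0, min(1.0, math.exp(float(logprob))))
    except (TypeError, ValueError, OverflowError):
        return 0.0

print(logprob_to_confidence(0.0))     # → 1.0
print(logprob_to_confidence(-1.0))    # ≈ 0.368, where the old formula gave 0.0
print(logprob_to_confidence(None))    # → 0.0
print(logprob_to_confidence(1000.0))  # overflow in exp → 0.0
```

Unlike `1 - min(abs(logprob), 1)`, this preserves ordering across the whole logprob range instead of flattening everything at or below -1 to zero.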
```typescript
// Determine endpoint based on enhancement option
const endpoint = options.enhanceSpeakers
  ? '/elevenlabs/v1/transcribe'
  : 'https://api.elevenlabs.io/v1/speech-to-text'
```
Avoid direct browser calls to ElevenLabs; route via backend wrapper only.
Direct calls risk CORS issues and expose provider credentials. Use only the backend /elevenlabs/v1/transcribe endpoint.
Apply:

```diff
-const endpoint = options.enhanceSpeakers
-  ? '/elevenlabs/v1/transcribe'
-  : 'https://api.elevenlabs.io/v1/speech-to-text'
+const endpoint = '/elevenlabs/v1/transcribe'
```

📝 Committable suggestion
```typescript
// Determine endpoint based on enhancement option
const endpoint = '/elevenlabs/v1/transcribe'
```
🤖 Prompt for AI Agents
In extras/speaker-recognition/webui/src/services/elevenlabs.ts around lines 72
to 76, the code chooses the external ElevenLabs API URL for non-enhanced
transcribe calls which exposes provider endpoints/credentials and can cause
CORS; always route requests through the backend wrapper. Change the logic so
endpoint is always the backend path '/elevenlabs/v1/transcribe' (remove or
ignore direct 'https://api.elevenlabs.io/v1/speech-to-text' usage), and ensure
any client-side code uses only that backend endpoint; if needed, update callers
to pass the enhancement option to the backend rather than switching URLs in the
browser.
```typescript
if (options.speakerConfidenceThreshold !== undefined) {
  params.similarity_threshold = options.speakerConfidenceThreshold.toString()
}
```
Query param mismatch with backend.
Wrapper expects `speaker_confidence_threshold`; current code sends `similarity_threshold`.
```diff
-  params.similarity_threshold = options.speakerConfidenceThreshold.toString()
+  params.speaker_confidence_threshold = options.speakerConfidenceThreshold.toString()
```

📝 Committable suggestion
```typescript
if (options.speakerConfidenceThreshold !== undefined) {
  params.speaker_confidence_threshold = options.speakerConfidenceThreshold.toString()
}
```
🤖 Prompt for AI Agents
In extras/speaker-recognition/webui/src/services/elevenlabs.ts around lines 84
to 86, the code sets params.similarity_threshold but the backend wrapper expects
speaker_confidence_threshold; change the param key to
params.speaker_confidence_threshold and keep converting the numeric option to
string (e.g., options.speakerConfidenceThreshold.toString()) so the request
sends the correct query parameter name and value.
```typescript
headers: {
  'Content-Type': 'multipart/form-data',
  ...(endpoint.includes('elevenlabs.io') && {
    'xi-api-key': import.meta.env.VITE_ELEVENLABS_API_KEY || ''
  })
},
```
Multipart boundary + secret exposure.

- Do not set `Content-Type` for `FormData` (breaks the multipart boundary).
- Do not send `xi-api-key` from the browser.
```diff
-headers: {
-  'Content-Type': 'multipart/form-data',
-  ...(endpoint.includes('elevenlabs.io') && {
-    'xi-api-key': import.meta.env.VITE_ELEVENLABS_API_KEY || ''
-  })
-},
+headers: {
+  // Let the browser set multipart boundary automatically.
+},
```

🤖 Prompt for AI Agents
In extras/speaker-recognition/webui/src/services/elevenlabs.ts around lines 90
to 95, the code forces a Content-Type which breaks multipart boundaries and also
exposes the ElevenLabs secret from the browser; remove the manual 'Content-Type'
header when sending FormData (let the browser set the boundary) by conditionally
omitting that header if the body is a FormData, and stop sending the xi-api-key
from client-side code—move API key usage to a server-side proxy or backend
endpoint that your web client calls instead so the secret is never included in
browser requests.
```typescript
export function calculateConfidenceSummary(segments: ElevenLabsSegment[]): {
  total_segments: number
  high_confidence: number
  medium_confidence: number
  low_confidence: number
} {
  return {
    total_segments: segments.length,
    high_confidence: segments.filter(s => s.confidence >= 0.8).length,
    medium_confidence: segments.filter(s => s.confidence >= 0.6 && s.confidence < 0.8).length,
    low_confidence: segments.filter(s => s.confidence >= 0.4 && s.confidence < 0.6).length
  }
}
```
Confidence buckets skip segments < 0.4.
Ensure categories are exhaustive.
```diff
   return {
     total_segments: segments.length,
     high_confidence: segments.filter(s => s.confidence >= 0.8).length,
     medium_confidence: segments.filter(s => s.confidence >= 0.6 && s.confidence < 0.8).length,
-    low_confidence: segments.filter(s => s.confidence >= 0.4 && s.confidence < 0.6).length
+    low_confidence: segments.filter(s => s.confidence < 0.6).length
   }
```

🤖 Prompt for AI Agents
In extras/speaker-recognition/webui/src/services/elevenlabs.ts around lines
173–185, the current confidence buckets skip segments with confidence < 0.4;
make the categories exhaustive by adding a very_low_confidence bucket (counting
segments with s.confidence < 0.4) or by adjusting the existing low_confidence
range to include values below 0.4; update the returned object shape accordingly
and ensure tests/consumers are updated to use the new/changed field.
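The exhaustiveness point is easy to verify: with the suggested `< 0.6` catch-all, every segment lands in exactly one bucket and the bucket counts sum to the total. A minimal Python sketch of the same bucketing logic (field names mirror the TypeScript function; this is an illustration, not the webui code):

```python
def summarize_confidence(confidences):
    """Bucket confidence values so every value falls in exactly one category."""
    return {
        "total_segments": len(confidences),
        "high_confidence": sum(1 for c in confidences if c >= 0.8),
        "medium_confidence": sum(1 for c in confidences if 0.6 <= c < 0.8),
        # Catch-all lower bucket: nothing below 0.4 is silently dropped.
        "low_confidence": sum(1 for c in confidences if c < 0.6),
    }

summary = summarize_confidence([0.95, 0.7, 0.5, 0.1])
print(summary)
# Buckets always sum to the total.
assert (summary["high_confidence"] + summary["medium_confidence"]
        + summary["low_confidence"]) == summary["total_segments"]
```

The original `0.4 <= c < 0.6` range would have counted only one of the two low segments here, leaving the summary inconsistent with `total_segments`.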