This PR introduces ElevenLabs support #136

Closed
AnkushMalaker wants to merge 1 commit into main from feat/eleven-labs

Conversation

@AnkushMalaker
Collaborator

@AnkushMalaker AnkushMalaker commented Oct 21, 2025

Summary by CodeRabbit

  • New Features

    • Added ElevenLabs as a new transcription provider, supporting 99 languages with speaker diarization capabilities.
    • Integrated speaker identification enhancement that works with ElevenLabs transcription results.
    • Added new processing mode in the UI for ElevenLabs-enhanced transcription with optional speaker enhancement.
  • Documentation

    • Added comprehensive ElevenLabs integration guide.
  • Configuration

    • Updated transcription provider configuration: replaced Mistral with ElevenLabs as the alternative provider option.

@AnkushMalaker
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Oct 21, 2025

Walkthrough

This pull request integrates ElevenLabs as a transcription provider throughout the system, replacing Mistral as Option 2. Changes include a new backend ElevenLabs provider implementation, API wrapper with speaker identification enhancement, frontend service and UI component, configuration setup, and supporting utilities.

Changes

Cohort / File(s) Summary
Backend Configuration
CLAUDE.md, backends/advanced/.env.template, backends/advanced/Docs/elevenlabs-integration.md
Updated documentation and environment template to support ElevenLabs as transcription provider; adds ELEVENLABS_API_KEY configuration and replaces Mistral provider description.
Backend Initialization
backends/advanced/init.py
Updated provider selection flow to prompt for and configure ElevenLabs API key instead of Mistral; sets TRANSCRIPTION_PROVIDER to "elevenlabs" when option is selected.
Backend Service Layer
backends/advanced/src/advanced_omi_backend/models/transcription.py, backends/advanced/src/advanced_omi_backend/services/transcription/__init__.py, backends/advanced/src/advanced_omi_backend/services/transcription/elevenlabs.py
Added ElevenLabsProvider class implementing batch transcription via ElevenLabs Scribe v1 API; integrated into provider factory with environment-based selection; replaced MISTRAL enum member with ELEVENLABS.
Speaker Recognition Backend Configuration
extras/speaker-recognition/.env.template, extras/speaker-recognition/init.py
Added ELEVENLABS_API_KEY environment variable and setup method to speaker-recognition CLI.
Speaker Recognition API Layer
extras/speaker-recognition/src/simple_speaker_recognition/api/routers/__init__.py, extras/speaker-recognition/src/simple_speaker_recognition/api/routers/elevenlabs_wrapper.py, extras/speaker-recognition/src/simple_speaker_recognition/api/service.py
Added ElevenLabs router with endpoint for transcription and optional speaker identification enhancement; wired router into API service; added elevenlabs_api_key to Settings.
Speaker Recognition Utilities
extras/speaker-recognition/src/simple_speaker_recognition/utils/elevenlabs_parser.py
New ElevenLabsParser class to parse ElevenLabs JSON output, segment by speaker, extract speaker statistics, and convert to speaker identification format.
Speaker Recognition Frontend Services
extras/speaker-recognition/webui/src/services/elevenlabs.ts, extras/speaker-recognition/webui/src/services/speakerIdentification.ts
Added ElevenLabs transcription service with speaker response processing; integrated 'elevenlabs-enhanced' mode into speaker identification workflow.
Speaker Recognition Frontend UI
extras/speaker-recognition/webui/src/components/ProcessingModeSelector.tsx
Added ElevenLabs Transcribe mode to processing mode selector; updated grid layout to display three modes per row.
Integration Script
wizard.py
Extended setup wizard to pass ElevenLabs API key to speaker-recognition service if available.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant WebUI
    participant SpeechRecAPI
    participant ElevenLabsAPI
    participant SpeakerDB
    participant AudioBackend

    User->>WebUI: Upload audio + select ElevenLabs mode
    WebUI->>SpeechRecAPI: POST /elevenlabs/v1/transcribe (audio, diarize=true)
    SpeechRecAPI->>ElevenLabsAPI: Forward audio + params
    ElevenLabsAPI->>ElevenLabsAPI: Transcribe with speaker diarization
    ElevenLabsAPI-->>SpeechRecAPI: Return transcript + word-level data + speakers
    
    alt enhanceSpeakers requested
        SpeechRecAPI->>SpeechRecAPI: Extract segments & build embeddings
        SpeechRecAPI->>SpeakerDB: Query speaker embeddings
        SpeakerDB-->>SpeechRecAPI: Return stored speaker embeddings
        SpeechRecAPI->>SpeechRecAPI: Match speaker segments → identified speakers
        SpeechRecAPI->>SpeechRecAPI: Annotate response with speaker metadata
    end
    
    SpeechRecAPI-->>WebUI: Enhanced response (transcript + speakers + confidence)
    WebUI-->>User: Display transcript with identified speakers
sequenceDiagram
    participant BackendApp
    participant TranscriptionFactory
    participant ElevenLabsProvider
    participant ElevenLabsAPI

    BackendApp->>TranscriptionFactory: get_transcription_provider("elevenlabs")
    TranscriptionFactory->>TranscriptionFactory: Read ELEVENLABS_API_KEY from env
    TranscriptionFactory->>ElevenLabsProvider: __init__(api_key)
    TranscriptionFactory-->>BackendApp: Return ElevenLabsProvider instance
    
    BackendApp->>ElevenLabsProvider: transcribe(audio_bytes, sample_rate, diarize=true)
    ElevenLabsProvider->>ElevenLabsProvider: Convert PCM to WAV
    ElevenLabsProvider->>ElevenLabsAPI: POST multipart/form-data (WAV + config)
    ElevenLabsAPI-->>ElevenLabsProvider: Return JSON (transcript, words, speakers)
    ElevenLabsProvider->>ElevenLabsProvider: Parse response → extract segments + speaker data
    ElevenLabsProvider-->>BackendApp: Return structured dict (text, segments, diarization)
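The factory flow in the diagram above can be sketched in Python as follows; the body is a minimal illustration of env-based provider selection under the names shown in the diagram, not the repository's exact code.

```python
import os


class ElevenLabsProvider:
    """Minimal stand-in for the real provider class."""

    def __init__(self, api_key: str):
        self.api_key = api_key


def get_transcription_provider(name: str) -> ElevenLabsProvider:
    """Env-based selection, mirroring the factory flow in the diagram."""
    if name == "elevenlabs":
        # Read ELEVENLABS_API_KEY from the environment, as the factory does
        api_key = os.getenv("ELEVENLABS_API_KEY")
        if not api_key:
            raise ValueError("ELEVENLABS_API_KEY is not set")
        return ElevenLabsProvider(api_key)
    raise ValueError(f"Unknown transcription provider: {name}")
```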

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The changes are substantial and span multiple layers (backend services, API routes, frontend services, UI) with heterogeneous logic patterns. Key factors: new ElevenLabs provider implementation with PCM-to-WAV conversion and API integration, speaker identification enhancement logic, parser utility for response formatting, and frontend service/state management integration. While individual files follow consistent patterns, the overall scope and cross-layer coordination require careful verification of end-to-end flow, error handling, and API contract consistency between layers.

Possibly related PRs

Suggested reviewers

  • thestumonkey

Poem

🐰 Hops with glee through ElevenLabs door,
Transcription flows where Mistral was before,
Speaker voices dance in segments bright,
From WAV to words, each one polished right,
The rabbit grins—full stack delight!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "This PR introduces ElevenLabs support" directly and accurately reflects the primary objective of the changeset. The raw_summary shows extensive additions of ElevenLabs integration across multiple components: a new ElevenLabsProvider class for batch transcription in the backends/advanced service, integration into the transcription provider factory, configuration updates, ElevenLabs router and wrapper in the speaker-recognition service, UI components for the new mode, and supporting utilities. The title is specific and clearly communicates the main change—adding ElevenLabs as a transcription provider option across the system—without vague terminology or misleading claims. While the wording includes "This PR" (slightly redundant for a PR title), it remains concise and understandable enough for a teammate reviewing commit history.
Docstring Coverage ✅ Passed Docstring coverage is 93.55% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/eleven-labs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai
Contributor

coderabbitai bot commented Oct 21, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (17)
extras/speaker-recognition/src/simple_speaker_recognition/utils/elevenlabs_parser.py (3)

70-81: Sort by time and keep unknown-speaker words instead of dropping them.

  • Ensure chronological grouping even if input is unsorted.
  • Don’t discard words lacking speaker_id; label them as speaker_unknown to preserve transcript.
 def _group_words_by_speaker(self, words: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
@@
-        if not words:
+        if not words:
             return []
 
-        segments = []
+        # Defensive: ensure time order
+        words = sorted(words, key=lambda w: w.get('start', 0.0))
+
+        segments = []
         current_segment = None
@@
-        for word in words:
-            speaker_id = word.get('speaker_id')
-            if speaker_id is None:
-                continue
-            speaker_label = f"speaker_{speaker_id}"
+        for word in words:
+            speaker_id = word.get('speaker_id')
+            speaker_label = f"speaker_{speaker_id}" if speaker_id is not None else "speaker_unknown"

Also applies to: 85-92
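As a standalone sanity check of the suggested behavior (time-sorted input, speaker_unknown fallback instead of dropped words), the grouping can be sketched like this; the segment dict shape here is illustrative, not the parser's actual output format.

```python
from typing import Any, Dict, List


def group_words_by_speaker(words: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group ElevenLabs-style word dicts into per-speaker segments."""
    if not words:
        return []
    # Defensive: ensure chronological order even if input is unsorted
    words = sorted(words, key=lambda w: w.get('start', 0.0))
    segments: List[Dict[str, Any]] = []
    current = None
    for word in words:
        speaker_id = word.get('speaker_id')
        # Keep unknown-speaker words instead of discarding them
        label = f"speaker_{speaker_id}" if speaker_id is not None else "speaker_unknown"
        if current is None or current['speaker'] != label:
            current = {
                'speaker': label,
                'start': word.get('start', 0.0),
                'end': word.get('end', 0.0),
                'text': word.get('text', ''),
            }
            segments.append(current)
        else:
            current['end'] = word.get('end', 0.0)
            current['text'] += ' ' + word.get('text', '')
    return segments
```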


53-56: Compute total_duration robustly (don’t rely on last word).

Handles unsorted inputs safely.

-        total_duration = 0.0
-        if filtered_words:
-            total_duration = filtered_words[-1].get('end', 0.0)
+        total_duration = max((w.get('end', 0.0) for w in filtered_words), default=0.0)

23-34: Add IO/JSON error handling and Path support; set encoding.

Improves resilience and type flexibility.

-from typing import Any, Dict, List, Optional, Tuple
+from typing import Any, Dict, List, Optional, Tuple, Union
@@
-    def parse_elevenlabs_json(self, json_path: str) -> Dict[str, Any]:
+    def parse_elevenlabs_json(self, json_path: Union[str, Path]) -> Dict[str, Any]:
@@
-        with open(json_path, 'r') as f:
-            data = json.load(f)
+        json_path = Path(json_path)
+        try:
+            with json_path.open('r', encoding='utf-8') as f:
+                data = json.load(f)
+        except (FileNotFoundError, json.JSONDecodeError) as e:
+            logger.error("Failed to read/parse ElevenLabs JSON at %s: %s", json_path, e)
+            raise ValueError(f"Invalid ElevenLabs JSON at {json_path}") from e
backends/advanced/src/advanced_omi_backend/models/transcription.py (1)

39-39: Verify no references to the removed MISTRAL provider remain: COMPLETE

The breaking change successfully removed MISTRAL from the active provider enum (transcription.py). No code references TranscriptionProvider.MISTRAL or the old enum variant.

However, two orphaned code fragments should be cleaned up:

  • backends/advanced/src/advanced_omi_backend/models/conversation.py:23 – Remove the unused MISTRAL = "mistral" enum member (this enum is not referenced anywhere)
  • backends/advanced/src/advanced_omi_backend/app_config.py:56 – Remove the unused self.mistral_api_key = os.getenv("MISTRAL_API_KEY") (this attribute is never referenced)

These are not breaking issues but should be removed to avoid confusion and maintain consistency with the provider change.

extras/speaker-recognition/init.py (1)

251-257: Add summary output for ElevenLabs key

Good CLI hook. Also show its status in the final summary (like Deepgram) to reduce confusion.

@@
     def show_summary(self):
         """Show configuration summary"""
@@
-        self.console.print(f"✅ HTTPS Enabled: {self.config.get('REACT_UI_HTTPS', 'false')}")
-        if self.config.get('DEEPGRAM_API_KEY'):
-            self.console.print(f"✅ Deepgram API Key: Configured")
+        self.console.print(f"✅ HTTPS Enabled: {self.config.get('REACT_UI_HTTPS', 'false')}")
+        if self.config.get('DEEPGRAM_API_KEY') or self.read_existing_env_value("DEEPGRAM_API_KEY"):
+            self.console.print("✅ Deepgram API Key: Configured")
+        if self.config.get('ELEVENLABS_API_KEY') or self.read_existing_env_value("ELEVENLABS_API_KEY"):
+            self.console.print("✅ ElevenLabs API Key: Configured")
backends/advanced/init.py (1)

169-189: Hide API key input (avoid echoing secrets)

Prompt.ask echoes the ElevenLabs key. Prefer hidden input (getpass) like a password. This is a UX/security improvement; optional but recommended.

Example change:

# Reuse existing if present...
# When asking for a new key, prefer hidden input:
api_key = self.prompt_password("ElevenLabs API key") if not existing_key else (api_key_input or existing_key)

If you need “leave empty to skip” semantics, introduce a helper that uses getpass but allows empty return.
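One possible shape for such a helper, with an injectable reader so it can be unit-tested without a TTY; the name prompt_secret and its signature are hypothetical, not the project's API.

```python
from getpass import getpass
from typing import Callable, Optional


def prompt_secret(label: str,
                  reader: Callable[[str], str] = getpass) -> Optional[str]:
    """Prompt for a secret without echoing it to the terminal.

    Empty input means "skip" and returns None. `reader` defaults to
    getpass but is injectable for testing.
    """
    value = reader(f"{label} (leave empty to skip): ").strip()
    return value or None
```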

backends/advanced/src/advanced_omi_backend/services/transcription/elevenlabs.py (1)

74-83: Harden HTTP error handling and logging

Use response.raise_for_status() and logger.exception(...); avoid broad except Exception.

@@
-            async with httpx.AsyncClient(timeout=timeout_config) as client:
-                response = await client.post(
+            async with httpx.AsyncClient(timeout=timeout_config) as client:
+                response = await client.post(
                     self.url,
                     headers=headers,
                     data=data,
                     files=files
                 )
-
-                if response.status_code == 200:
-                    result = response.json()
+                response.raise_for_status()
+                result = response.json()
@@
-                else:
-                    logger.error(f"ElevenLabs API error: {response.status_code} - {response.text}")
-                    return {"text": "", "words": [], "segments": []}
+        except httpx.HTTPStatusError as e:
+            logger.exception(f"HTTP error calling ElevenLabs API: {e.response.status_code}")
+            return {"text": "", "words": [], "segments": []}
@@
-        except httpx.TimeoutException as e:
-            logger.error(f"Timeout during ElevenLabs API call: {e}")
+        except httpx.TimeoutException as e:
+            logger.exception("Timeout during ElevenLabs API call")
             return {"text": "", "words": [], "segments": []}
-        except Exception as e:
-            logger.error(f"Error calling ElevenLabs API: {e}")
+        except httpx.HTTPError as e:
+            logger.exception("Network error during ElevenLabs API call")
             return {"text": "", "words": [], "segments": []}

Also applies to: 120-129

extras/speaker-recognition/webui/src/services/elevenlabs.ts (2)

65-71: Align diarization params with backend/API.

Use diarize (and optionally timestamps_granularity) to match wrapper/ElevenLabs, not enable_speaker_diarization.

-    // Enable speaker diarization
-    formData.append('enable_speaker_diarization', 'true')
+    // Enable speaker diarization
+    formData.append('diarize', 'true')
+    formData.append('timestamps_granularity', 'word')

144-149: Text concatenation misses spaces; running average is biased.

  • Add a space when appending words.
  • Compute a length-weighted average instead of repeatedly halving.
-      currentSegment.end = word.end
-      currentSegment.text += word.text
-      // Update confidence as running average
-      currentSegment.confidence = (currentSegment.confidence + confidence) / 2
+      currentSegment.end = word.end
+      currentSegment.text += ` ${word.text}`
+      // Track count and update running average
+      // @ts-expect-error internal counter not in interface
+      currentSegment._count = (currentSegment._count ?? 1) + 1
+      // @ts-expect-error internal counter not in interface
+      currentSegment.confidence = ((currentSegment.confidence * (currentSegment._count - 1)) + confidence) / currentSegment._count
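The count-weighted mean suggested above can be sanity-checked with a small Python equivalent of the same arithmetic:

```python
def weighted_running_average(values):
    """Incremental mean weighting each sample equally, unlike repeated
    halving ((avg + x) / 2), which over-weights the most recent values."""
    avg, count = 0.0, 0
    for v in values:
        count += 1
        avg = ((avg * (count - 1)) + v) / count
    return avg
```

For the confidences [0.9, 0.5, 0.1] this yields the true mean 0.5, whereas repeated halving gives 0.4.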
backends/advanced/Docs/elevenlabs-integration.md (3)

562-579: Add languages to all fenced code blocks.

At least one fence lacks a language (markdownlint MD040). Use text for ASCII diagrams.

-```
+```text
 User uploads audio file
         ↓
 [Inference Page UI]
 ...

375-379: Prevent false-positive secret leaks in examples.

The Authorization: Bearer YOUR_JWT_TOKEN header can trigger scanners. Use a neutral placeholder.

-  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
+  -H "Authorization: Bearer <AUTH_TOKEN_PLACEHOLDER>" \

393-399: Confidence mapping: document the saturation used in code.

Code clamps abs(logprob) to 1.0; docs should reflect this for consistency.

-'confidence': 1.0 - abs(word_obj.get('logprob', 0))
+'confidence': 1.0 - min(abs(word_obj.get('logprob', 0)), 1.0)
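The corrected mapping in isolation, mirroring the saturation described above:

```python
def logprob_to_confidence(logprob: float) -> float:
    """Map a word logprob to [0.0, 1.0], saturating |logprob| at 1.0."""
    return 1.0 - min(abs(logprob), 1.0)
```

Without the min() clamp, a logprob below -1.0 would produce a negative confidence.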
extras/speaker-recognition/src/simple_speaker_recognition/api/routers/elevenlabs_wrapper.py (5)

136-169: Avoid O(n²) indexing on words; track original indices.

Use enumerate to keep indices and build segments without list.index.

-            # Filter only actual words (skip spacing and audio events)
-            filtered_words = [w for w in words if w.get('type') == 'word']
+            # Keep original indices for later updates
+            filtered_pairs = [(i, w) for i, w in enumerate(words) if w.get('type') == 'word']
@@
-            speaker_segments = []
-            if filtered_words:
+            speaker_segments = []
+            if filtered_pairs:
                 current_segment = None
@@
-                for word in filtered_words:
-                    speaker_id = word.get('speaker_id')
+                for original_idx, word in filtered_pairs:
+                    speaker_id = word.get('speaker_id')
                     if speaker_id is None:
                         continue
@@
-                        current_segment = {
+                        current_segment = {
                             'speaker_id': speaker_id,
-                            'start_time': word.get('start', 0.0),
-                            'end_time': word.get('end', 0.0),
-                            'word_indices': [filtered_words.index(word)]
+                            'start_time': word.get('start', 0.0),
+                            'end_time': word.get('end', 0.0),
+                            'word_indices': [original_idx]
                         }
                     else:
                         # Extend current segment
-                        current_segment['end_time'] = word.get('end', 0.0)
-                        current_segment['word_indices'].append(filtered_words.index(word))
+                        current_segment['end_time'] = word.get('end', 0.0)
+                        current_segment['word_indices'].append(original_idx)
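The enumerate-based approach can be demonstrated standalone; this sketch follows the field names in the diff but is a simplified illustration, not the wrapper's full logic.

```python
from typing import Any, Dict, List


def build_segments_with_indices(words: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Build speaker segments carrying original word indices in O(n),
    instead of recovering them later via list.index, which is O(n^2)."""
    # Keep original indices while filtering out spacing/audio-event entries
    filtered_pairs = [(i, w) for i, w in enumerate(words) if w.get('type') == 'word']
    segments: List[Dict[str, Any]] = []
    current = None
    for original_idx, word in filtered_pairs:
        speaker_id = word.get('speaker_id')
        if speaker_id is None:
            continue
        if current is None or current['speaker_id'] != speaker_id:
            current = {
                'speaker_id': speaker_id,
                'start_time': word.get('start', 0.0),
                'end_time': word.get('end', 0.0),
                'word_indices': [original_idx],
            }
            segments.append(current)
        else:
            current['end_time'] = word.get('end', 0.0)
            current['word_indices'].append(original_idx)
    return segments
```

With the original indices in hand, annotating the full words list later is a direct `enhanced_words[idx].update(...)` rather than a nested rescan.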

225-239: Replace nested rescans with direct index updates.

Now that word_indices are original indices, update in O(1).

-                    for word_idx in segment_info["word_indices"]:
-                        if word_idx < len(filtered_words):
-                            # Find the original index in enhanced_words
-                            original_word = filtered_words[word_idx]
-                            for i, w in enumerate(enhanced_words):
-                                if w is original_word or (w.get('start') == original_word.get('start') and w.get('text') == original_word.get('text')):
-                                    enhanced_words[i].update({
-                                        "identified_speaker_id": segment_result["speaker_id"],
-                                        "identified_speaker_name": segment_result["speaker_name"],
-                                        "speaker_identification_confidence": segment_result["confidence"],
-                                        "speaker_status": segment_result["status"]
-                                    })
-                                    break
+                    for original_idx in segment_info["word_indices"]:
+                        if 0 <= original_idx < len(enhanced_words):
+                            enhanced_words[original_idx].update({
+                                "identified_speaker_id": segment_result["speaker_id"],
+                                "identified_speaker_name": segment_result["speaker_name"],
+                                "speaker_identification_confidence": segment_result["confidence"],
+                                "speaker_status": segment_result["status"]
+                            })

243-257: Exception logging and fallback updates.

Use log.exception and same O(1) index update in error path.

-                except Exception as e:
-                    log.warning(f"Error identifying segment {segment_idx}: {e}")
+                except Exception as e:
+                    log.exception(f"Error identifying segment {segment_idx}: {e}")
@@
-                    for word_idx in segment_info["word_indices"]:
-                        if word_idx < len(filtered_words):
-                            original_word = filtered_words[word_idx]
-                            for i, w in enumerate(enhanced_words):
-                                if w is original_word or (w.get('start') == original_word.get('start') and w.get('text') == original_word.get('text')):
-                                    enhanced_words[i].update({
-                                        "identified_speaker_id": None,
-                                        "identified_speaker_name": None,
-                                        "speaker_identification_confidence": 0.0,
-                                        "speaker_status": SpeakerStatus.ERROR.value
-                                    })
-                                    break
+                    for original_idx in segment_info["word_indices"]:
+                        if 0 <= original_idx < len(enhanced_words):
+                            enhanced_words[original_idx].update({
+                                "identified_speaker_id": None,
+                                "identified_speaker_name": None,
+                                "speaker_identification_confidence": 0.0,
+                                "speaker_status": SpeakerStatus.ERROR.value
+                            })

291-300: Prefer log.exception on broad except; keep error chain.

Improve observability and exception chaining.

-    except Exception as e:
-        log.error(f"Error during speaker identification: {e}")
+    except Exception as e:
+        log.exception(f"Error during speaker identification: {e}")

387-391: Error handling: use log.exception and chain the raise.

Align with Ruff TRY400/B904 and keep the original traceback.

-    except Exception as e:
-        log.error(f"Error processing request: {e}")
-        raise HTTPException(status_code=500, detail=f"Error processing request: {str(e)}")
+    except Exception as e:
+        log.exception(f"Error processing request: {e}")
+        raise HTTPException(status_code=500, detail=f"Error processing request: {e!s}") from e
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d407036 and 316ac2b.

📒 Files selected for processing (17)
  • CLAUDE.md (1 hunks)
  • backends/advanced/.env.template (1 hunks)
  • backends/advanced/Docs/elevenlabs-integration.md (1 hunks)
  • backends/advanced/init.py (2 hunks)
  • backends/advanced/src/advanced_omi_backend/models/transcription.py (1 hunks)
  • backends/advanced/src/advanced_omi_backend/services/transcription/__init__.py (4 hunks)
  • backends/advanced/src/advanced_omi_backend/services/transcription/elevenlabs.py (1 hunks)
  • extras/speaker-recognition/.env.template (1 hunks)
  • extras/speaker-recognition/init.py (3 hunks)
  • extras/speaker-recognition/src/simple_speaker_recognition/api/routers/__init__.py (1 hunks)
  • extras/speaker-recognition/src/simple_speaker_recognition/api/routers/elevenlabs_wrapper.py (1 hunks)
  • extras/speaker-recognition/src/simple_speaker_recognition/api/service.py (3 hunks)
  • extras/speaker-recognition/src/simple_speaker_recognition/utils/elevenlabs_parser.py (1 hunks)
  • extras/speaker-recognition/webui/src/components/ProcessingModeSelector.tsx (4 hunks)
  • extras/speaker-recognition/webui/src/services/elevenlabs.ts (1 hunks)
  • extras/speaker-recognition/webui/src/services/speakerIdentification.ts (5 hunks)
  • wizard.py (1 hunks)
🧰 Additional context used
🪛 Gitleaks (8.28.0)
backends/advanced/Docs/elevenlabs-integration.md

[high] 375-376: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.

(curl-auth-header)

🪛 markdownlint-cli2 (0.18.1)
backends/advanced/Docs/elevenlabs-integration.md

562-562: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.1)
backends/advanced/src/advanced_omi_backend/services/transcription/elevenlabs.py

124-124: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


126-126: Do not catch blind exception: Exception

(BLE001)


127-127: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

extras/speaker-recognition/src/simple_speaker_recognition/api/routers/elevenlabs_wrapper.py

243-243: Do not catch blind exception: Exception

(BLE001)


291-291: Do not catch blind exception: Exception

(BLE001)


292-292: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


306-306: Do not perform function call File in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


389-389: Do not catch blind exception: Exception

(BLE001)


390-390: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


391-391: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


391-391: Use explicit conversion flag

Replace with conversion flag

(RUF010)

backends/advanced/src/advanced_omi_backend/services/transcription/__init__.py

96-98: Avoid specifying long messages outside the exception class

(TRY003)


101-101: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: integration-tests
🔇 Additional comments (24)
extras/speaker-recognition/src/simple_speaker_recognition/utils/elevenlabs_parser.py (3)

12-22: Overall structure looks solid.

Clear API, sensible segment model, and helpful metadata.


57-68: The review comment's suggested refactoring would break external callers.

The proposed diff removes 'total_duration' from the return value. However, external code like laptop_client.py:443 uses summary.get('total_duration', 0) expecting it at the top level. Removing it would cause those calls to silently return 0 instead of the actual duration, introducing a bug.

If deduplication is desired, the options are:

  1. Remove from metadata, keep at top-level (preserves API)
  2. Remove from top-level, keep in metadata (requires updating all callers across multiple files—beyond this PR's scope)

Likely an incorrect or invalid review comment.


41-45: Parser correctly implements ElevenLabs API specification—no issues found.

Verification confirms the parser correctly accesses all documented fields from the ElevenLabs Speech-to-Text API response:

  • Top-level: language_code, language_probability, text, words array
  • Word-level fields: type ("word", "spacing", "audio_event"), start, end, text, speaker_id, logprob

The implementation (lines 31–84) safely retrieves each field using .get() with defaults, correctly filters words by type, and properly converts logprob to confidence. No field name mismatches or missing accesses detected.

extras/speaker-recognition/src/simple_speaker_recognition/api/service.py (3)

34-34: LGTM! Consistent configuration pattern.

The ElevenLabs API key field follows the same pattern as the Deepgram configuration, maintaining consistency in the codebase.


56-59: LGTM! Environment override follows established pattern.

The ElevenLabs API key environment override is consistent with the Deepgram implementation above it.


140-140: LGTM! Router integration follows existing patterns.

The ElevenLabs router import and registration are consistent with other routers in the service.

Also applies to: 150-150

backends/advanced/.env.template (2)

53-56: LGTM! Clear documentation for ElevenLabs integration.

The ElevenLabs configuration is well-documented with a helpful URL for obtaining API keys. The option numbering and structure are consistent with other transcription providers.


60-62: LGTM! Provider documentation updated correctly.

The transcription provider comment now accurately includes ElevenLabs as an available option alongside Deepgram and Parakeet.

wizard.py (1)

186-191: LGTM! Consistent API key propagation pattern.

The ElevenLabs API key handling follows the same pattern as the Deepgram key propagation above it, ensuring consistency in the wizard's configuration flow.

extras/speaker-recognition/.env.template (1)

42-42: LGTM! Environment variable follows naming conventions.

The ElevenLabs API key variable is correctly placed with other external service credentials and uses a consistent placeholder format.

CLAUDE.md (1)

289-291: LGTM! Documentation updated consistently.

The CLAUDE.md transcription provider documentation now reflects ElevenLabs as Option 2, consistent with the changes in .env.template. The feature descriptions (99 languages, speaker diarization) align with ElevenLabs' known capabilities.

extras/speaker-recognition/webui/src/components/ProcessingModeSelector.tsx (3)

453-454: LGTM! Grid layout updated for three modes.

The grid columns and slice adjustment correctly accommodate the addition of the ElevenLabs mode alongside the existing two modes.


471-477: LGTM! Good UX addition for requirements visibility.

The requirements badge provides clear visual feedback about API key dependencies, improving user awareness before selecting a processing mode.


62-70: Speaker limit claim verified—code is accurate.

ElevenLabs' speaker diarization supports up to 32 speakers, confirming the configuration in the code is correct. The ElevenLabs mode setup follows the existing pattern and requires no changes.

extras/speaker-recognition/src/simple_speaker_recognition/api/routers/__init__.py (1)

8-8: LGTM! Router export follows established patterns.

The ElevenLabs router import and public export are consistent with the existing router structure, maintaining clean module organization.

Also applies to: 17-17

extras/speaker-recognition/init.py (2)

423-423: LGTM: invoked ElevenLabs setup in run()

The call order fits alongside other provider setups.


457-458: CLI flag added

Nice. Please ensure README/Docs mention --elevenlabs-api-key.

backends/advanced/src/advanced_omi_backend/services/transcription/__init__.py (1)

23-25: Wiring/export looks correct

Import and all exposure for ElevenLabsProvider are consistent.

Also applies to: 141-142

backends/advanced/init.py (1)

139-146: Provider list update

ElevenLabs option added and described—looks good.

extras/speaker-recognition/webui/src/services/speakerIdentification.ts (4)

99-101: Mode switch wiring

New mode case is correctly routed to ElevenLabs processor.


228-280: ElevenLabs processing path looks consistent

  • Mapping aligns with other modes; summary computed via ElevenLabs helpers.
  • Error rethrow integrates with outer friendly messages.

Confirm that options.maxSpeakers is respected by your backend path; otherwise consider removing it from the call to avoid confusion.


577-581: UX: Mode description

Clear and concise; matches functionality.


9-9: All ElevenLabs service exports verified successfully. The import statement in speakerIdentification.ts line 9 correctly references all four items:

  • transcribeWithElevenLabs (async function, line 52)
  • processElevenLabsResponse (function, line 109)
  • calculateConfidenceSummary (function, line 173, aliased as calculateElevenLabsConfidenceSummary)
  • ElevenLabsResponse (interface, line 18)
extras/speaker-recognition/src/simple_speaker_recognition/api/routers/elevenlabs_wrapper.py (1)

200-208: No guard needed; validate_confidence always returns a float.

The validate_confidence function always returns a float and never returns None. The implementation explicitly handles the None case by assigning 0.0 before the return statement, and the function guarantees a clamped float between 0.0 and 1.0. The comparison confidence >= confidence_threshold at line 206 is safe and requires no guard.

Likely an incorrect or invalid review comment.

Comment on lines +53 to +54
**Comparison**: Deepgram Nova-3 costs ~$0.36/hour (pay-as-you-go)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Pricing claim may drift. Please verify.

Confirm latest Deepgram/ElevenLabs pricing before merging.


🌐 Web query:

Verify current prices for ElevenLabs Scribe v1 (hourly effective rates across tiers) and Deepgram Nova-3 pay‑as‑you‑go.

💡 Result:

I checked current public pricing (as of Oct 21, 2025):

  • ElevenLabs — Scribe v1 (product/developer docs): hourly effective rates by subscription tier are listed on ElevenLabs’ pricing table (example: Starter = $0.40/hour included; Creator = $0.35/hour included; Pro = $0.33/hour included; Scale = $0.30/hour included; Business = $0.22/hour included). The table also shows per‑additional‑hour top‑up rates for some tiers. [1]

  • Deepgram — Nova‑3 pay‑as‑you‑go (Deepgram pricing page):

    • Pre‑recorded (batch) Nova‑3 (monolingual): $0.0043 per minute (≈ $0.258/hr); Nova‑3 multilingual batch: $0.0052/min.
    • Streaming (real‑time) Nova‑3 (monolingual): $0.0077/min (≈ $0.462/hr); streaming multilingual: $0.0092/min. (Deepgram lists separate Pay‑As‑You‑Go and Growth rates on their pricing page.) [2]

Sources:
[1] ElevenLabs Scribe v1 docs/pricing.
[2] Deepgram pricing page (Nova‑3 batch and streaming rates).



---

🏁 Script executed:

```shell
#!/bin/bash
# Read the file to get full context around lines 53-54
cat "backends/advanced/Docs/elevenlabs-integration.md"

Length of output: 29346


Deepgram Nova-3 pricing claims are inaccurate; requires correction in two locations.

Based on current Deepgram pricing (as of Oct 21, 2025), the documented figure of ~$0.36/hour does not match actual rates:

  • Batch (pay-as-you-go, pre-recorded): $0.0043/min = ~$0.26/hour
  • Streaming (real-time): $0.0077/min = ~$0.46/hour

Locations to update:

  1. Line 53-54: Change "$0.36/hour (pay-as-you-go)" to "$0.26/hour (batch)" or specify streaming if applicable
  2. Comparison table (later in file, "Cost" row): Update Deepgram cost from "~$0.36/hr" to match the correct rate for the intended use case

Clarify whether the comparison targets batch or streaming, as pricing differs significantly.

🤖 Prompt for AI Agents
In backends/advanced/Docs/elevenlabs-integration.md around lines 53-54, the
Deepgram Nova-3 hourly cost is incorrect (~$0.36/hr); update the text to specify
whether it refers to batch or streaming and use the correct rate (batch:
~$0.26/hr, streaming: ~$0.46/hr). Also update the later Comparison table "Cost"
row to match the same chosen use case (replace "~$0.36/hr" with "~$0.26/hr
(batch)" or "~$0.46/hr (streaming)"), and add a brief parenthetical note
clarifying that Deepgram pricing differs between batch (pre-recorded) and
streaming (real-time).
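
The per-minute to per-hour figures above are plain arithmetic; a minimal sketch of the conversion (rates hard-coded from the quoted pricing, not fetched):

```python
# Convert a per-minute transcription rate to an effective hourly rate.
def per_hour(rate_per_min: float) -> float:
    return round(rate_per_min * 60, 3)

batch = per_hour(0.0043)      # Nova-3 batch (pre-recorded)
streaming = per_hour(0.0077)  # Nova-3 streaming (real-time)
print(batch, streaming)       # 0.258 0.462
```

Neither mode rounds to the doc's "~$0.36/hour", which is why both locations need the correction.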

Comment on lines +94 to +103
elif provider_name == "elevenlabs":
if not elevenlabs_key:
raise RuntimeError(
"ElevenLabs transcription provider requested but ELEVENLABS_API_KEY not configured"
)
logger.info(f"Using ElevenLabs transcription provider in {mode} mode")
if mode == "streaming":
raise RuntimeError("ElevenLabs does not support streaming mode - use batch mode")
return ElevenLabsProvider(elevenlabs_key)


⚠️ Potential issue | 🟡 Minor

Doc/message updates and auto-select warning

  • Update docstring to list 'elevenlabs' (and 'offline') to reflect supported providers.
  • Current auto-select warning omits ElevenLabs; users with ELEVENLABS_API_KEY and no explicit provider will see a misleading message. Clarify guidance to set TRANSCRIPTION_PROVIDER=elevenlabs.
@@
-    Args:
-        provider_name: Name of the provider ('deepgram', 'parakeet').
+    Args:
+        provider_name: Name of the provider ('deepgram', 'parakeet', 'elevenlabs', 'offline').
@@
-            logger.warning(
-                "No transcription provider configured (DEEPGRAM_API_KEY or PARAKEET_ASR_URL required)"
-            )
+            logger.warning(
+                "No transcription provider configured. Set TRANSCRIPTION_PROVIDER to "
+                "'deepgram', 'parakeet', 'offline', or 'elevenlabs' (and provide the respective credentials)."
+            )

Also applies to: 126-129, 38-41

🧰 Tools
🪛 Ruff (0.14.1)

96-98: Avoid specifying long messages outside the exception class

(TRY003)


101-101: Avoid specifying long messages outside the exception class

(TRY003)

Comment on lines +97 to +105
words.append({
'word': word_obj.get('text', ''),
'start': word_obj.get('start', 0),
'end': word_obj.get('end', 0),
'confidence': 1.0 - abs(word_obj.get('logprob', 0)), # Convert logprob to confidence
'speaker': word_obj.get('speaker_id'),
})

# Extract speaker segments if diarization is enabled

⚠️ Potential issue | 🟠 Major

Confidence mapping from logprob is incorrect

1 - abs(logprob) can go negative or >1. Prefer probability if provided, or convert logprob via exp() and clamp to [0,1].

@@
-                                words.append({
+                                import math
+                                prob = word_obj.get('prob')
+                                if prob is None:
+                                    lp = word_obj.get('logprob')
+                                    prob = math.exp(lp) if lp is not None else None
+                                confidence = None
+                                if prob is not None:
+                                    try:
+                                        confidence = max(0.0, min(1.0, float(prob)))
+                                    except Exception:
+                                        confidence = None
+                                words.append({
                                     'word': word_obj.get('text', ''),
                                     'start': word_obj.get('start', 0),
                                     'end': word_obj.get('end', 0),
-                                    'confidence': 1.0 - abs(word_obj.get('logprob', 0)),  # Convert logprob to confidence
+                                    'confidence': confidence,
                                     'speaker': word_obj.get('speaker_id'),
                                 })
🤖 Prompt for AI Agents
In
backends/advanced/src/advanced_omi_backend/services/transcription/elevenlabs.py
around lines 97 to 105, the current confidence mapping uses 1 - abs(logprob)
which can produce values <0 or >1; change it to prefer an explicit probability
field if present (e.g., word_obj.get('probability')), otherwise convert logprob
to probability using math.exp(logprob) (or 0.0 if missing), then clamp the
result to [0.0, 1.0] and assign that as the confidence; ensure you import math
and handle missing/None values safely so confidence is always within [0,1].

Comment on lines +75 to +92
async with aiohttp.ClientSession() as session:
async with session.post(
url,
headers=headers,
data=form_data
) as response:
if response.status != 200:
error_text = await response.text()
log.error(f"ElevenLabs API error: {response.status} - {error_text}")
raise HTTPException(
status_code=response.status,
detail=f"ElevenLabs API error: {error_text}"
)

result = await response.json()
log.info("Successfully received ElevenLabs response")
return result


⚠️ Potential issue | 🔴 Critical

Add client timeouts for external ElevenLabs call.

Unbounded waits can hang workers. Set explicit ClientTimeout.

-    async with aiohttp.ClientSession() as session:
-        async with session.post(
+    timeout = aiohttp.ClientTimeout(total=600, connect=30, sock_read=540, sock_connect=30)
+    async with aiohttp.ClientSession(timeout=timeout) as session:
+        async with session.post(
             url,
             headers=headers,
             data=form_data
         ) as response:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```python
timeout = aiohttp.ClientTimeout(total=600, connect=30, sock_read=540, sock_connect=30)
async with aiohttp.ClientSession(timeout=timeout) as session:
    async with session.post(
        url,
        headers=headers,
        data=form_data
    ) as response:
        if response.status != 200:
            error_text = await response.text()
            log.error(f"ElevenLabs API error: {response.status} - {error_text}")
            raise HTTPException(
                status_code=response.status,
                detail=f"ElevenLabs API error: {error_text}"
            )

        result = await response.json()
        log.info("Successfully received ElevenLabs response")
        return result
```
🤖 Prompt for AI Agents
In
extras/speaker-recognition/src/simple_speaker_recognition/api/routers/elevenlabs_wrapper.py
around lines 75-92, the aiohttp.ClientSession is created without any timeout
which can lead to unbounded waits; create an aiohttp.ClientTimeout (e.g.,
ClientTimeout(total=30, connect=5, sock_read=20) or use a configurable value
from settings) and pass it into ClientSession via the timeout parameter so the
external ElevenLabs call will fail fast on network issues; keep other logic
intact and ensure proper error handling still awaits response.text() when status
!= 200.
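
The diff above uses aiohttp.ClientTimeout, which is the right fix for this client. The same fail-fast idea can be illustrated with only the standard library via asyncio.wait_for (a sketch, not the wrapper's actual code; slow_upstream is a hypothetical stand-in for the ElevenLabs call):

```python
import asyncio

async def slow_upstream() -> str:
    # Hypothetical stand-in for an external API call that hangs.
    await asyncio.sleep(10)
    return "transcript"

async def fetch_with_deadline() -> str:
    try:
        # Bound the wait so a stuck upstream cannot hang the worker.
        return await asyncio.wait_for(slow_upstream(), timeout=0.1)
    except asyncio.TimeoutError:
        return "timed out"

print(asyncio.run(fetch_with_deadline()))  # prints "timed out"
```

Whichever mechanism is used, the key point is that the deadline is enforced by the caller, not left to the remote service.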

Comment on lines +165 to +177
def _logprob_to_confidence(self, logprob: float) -> float:
"""Convert ElevenLabs logprob to confidence score (0-1).

Args:
logprob: Log probability from ElevenLabs

Returns:
Confidence score between 0 and 1
"""
# ElevenLabs returns log probability (negative values closer to 0 are more confident)
# Convert to confidence: closer to 0 = higher confidence
return 1.0 - min(abs(logprob), 1.0)


⚠️ Potential issue | 🟠 Major

Fix confidence mapping; current formula collapses to 0 for most words.

For typical logprobs (e.g., ≤ -1), 1 - min(abs(logprob), 1) yields 0, flattening confidence and degrading downstream scoring. Map logprob back to probability instead.

Apply:

+import math
@@
 def _logprob_to_confidence(self, logprob: float) -> float:
@@
-        # ElevenLabs returns log probability (negative values closer to 0 are more confident)
-        # Convert to confidence: closer to 0 = higher confidence
-        return 1.0 - min(abs(logprob), 1.0)
+        # ElevenLabs returns log probability (log p). Convert to probability in [0,1].
+        if logprob is None:
+            return 0.0
+        try:
+            # Clamp to [0,1] and handle extreme/invalid values.
+            return max(0.0, min(1.0, math.exp(float(logprob))))
+        except (TypeError, ValueError, OverflowError):
+            return 0.0
🤖 Prompt for AI Agents
In
extras/speaker-recognition/src/simple_speaker_recognition/utils/elevenlabs_parser.py
around lines 165-177, the current mapping 1.0 - min(abs(logprob), 1.0) collapses
most logprobs to 0; instead convert the log probability back to a probability
and clamp to [0,1]. Replace the formula by computing prob = math.exp(logprob)
(import math if missing), then return max(0.0, min(1.0, prob)) so very negative
logprobs yield low confidence and values near 0 yield confidence near 1.0.
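
A minimal, runnable version of the exp-and-clamp conversion both logprob comments ask for (the function name is illustrative, not the codebase's actual method):

```python
import math

def logprob_to_confidence(logprob):
    """Map a log probability to a confidence in [0, 1].

    exp(logprob) recovers the probability; clamping guards against
    out-of-range or invalid inputs.
    """
    if logprob is None:
        return 0.0
    try:
        return max(0.0, min(1.0, math.exp(float(logprob))))
    except (TypeError, ValueError, OverflowError):
        return 0.0

print(logprob_to_confidence(0.0))   # 1.0 (log p = 0 means p = 1)
print(logprob_to_confidence(-1.0))  # ~0.368, where 1 - min(abs(logprob), 1) gave 0.0
print(logprob_to_confidence(None))  # 0.0
```

Unlike the original formula, this preserves a usable spread of confidence values for typical negative logprobs.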

Comment on lines +72 to +76
// Determine endpoint based on enhancement option
const endpoint = options.enhanceSpeakers
? '/elevenlabs/v1/transcribe'
: 'https://api.elevenlabs.io/v1/speech-to-text'


⚠️ Potential issue | 🔴 Critical

Avoid direct browser calls to ElevenLabs; route via backend wrapper only.

Direct calls risk CORS issues and expose provider credentials. Use only the backend /elevenlabs/v1/transcribe endpoint.

Apply:

-    const endpoint = options.enhanceSpeakers
-      ? '/elevenlabs/v1/transcribe'
-      : 'https://api.elevenlabs.io/v1/speech-to-text'
+    const endpoint = '/elevenlabs/v1/transcribe'
📝 Committable suggestion


Suggested change

```typescript
// Determine endpoint based on enhancement option
const endpoint = '/elevenlabs/v1/transcribe'
```
🤖 Prompt for AI Agents
In extras/speaker-recognition/webui/src/services/elevenlabs.ts around lines 72
to 76, the code chooses the external ElevenLabs API URL for non-enhanced
transcribe calls which exposes provider endpoints/credentials and can cause
CORS; always route requests through the backend wrapper. Change the logic so
endpoint is always the backend path '/elevenlabs/v1/transcribe' (remove or
ignore direct 'https://api.elevenlabs.io/v1/speech-to-text' usage), and ensure
any client-side code uses only that backend endpoint; if needed, update callers
to pass the enhancement option to the backend rather than switching URLs in the
browser.

Comment on lines +84 to +86
if (options.speakerConfidenceThreshold !== undefined) {
params.similarity_threshold = options.speakerConfidenceThreshold.toString()
}

⚠️ Potential issue | 🔴 Critical

Query param mismatch with backend.

Wrapper expects speaker_confidence_threshold; current code sends similarity_threshold.

-        params.similarity_threshold = options.speakerConfidenceThreshold.toString()
+        params.speaker_confidence_threshold = options.speakerConfidenceThreshold.toString()
📝 Committable suggestion


Suggested change

```typescript
if (options.speakerConfidenceThreshold !== undefined) {
  params.speaker_confidence_threshold = options.speakerConfidenceThreshold.toString()
}
```
🤖 Prompt for AI Agents
In extras/speaker-recognition/webui/src/services/elevenlabs.ts around lines 84
to 86, the code sets params.similarity_threshold but the backend wrapper expects
speaker_confidence_threshold; change the param key to
params.speaker_confidence_threshold and keep converting the numeric option to
string (e.g., options.speakerConfidenceThreshold.toString()) so the request
sends the correct query parameter name and value.

Comment on lines +90 to +95
headers: {
'Content-Type': 'multipart/form-data',
...(endpoint.includes('elevenlabs.io') && {
'xi-api-key': import.meta.env.VITE_ELEVENLABS_API_KEY || ''
})
},

⚠️ Potential issue | 🔴 Critical

Multipart boundary + secret exposure.

  • Do not set Content-Type for FormData (breaks boundary).
  • Do not send xi-api-key from the browser.
-      headers: {
-        'Content-Type': 'multipart/form-data',
-        ...(endpoint.includes('elevenlabs.io') && {
-          'xi-api-key': import.meta.env.VITE_ELEVENLABS_API_KEY || ''
-        })
-      },
+      headers: {
+        // Let the browser set multipart boundary automatically.
+      },
🤖 Prompt for AI Agents
In extras/speaker-recognition/webui/src/services/elevenlabs.ts around lines 90
to 95, the code forces a Content-Type which breaks multipart boundaries and also
exposes the ElevenLabs secret from the browser; remove the manual 'Content-Type'
header when sending FormData (let the browser set the boundary) by conditionally
omitting that header if the body is a FormData, and stop sending the xi-api-key
from client-side code—move API key usage to a server-side proxy or backend
endpoint that your web client calls instead so the secret is never included in
browser requests.

Comment on lines +173 to +185
export function calculateConfidenceSummary(segments: ElevenLabsSegment[]): {
total_segments: number
high_confidence: number
medium_confidence: number
low_confidence: number
} {
return {
total_segments: segments.length,
high_confidence: segments.filter(s => s.confidence >= 0.8).length,
medium_confidence: segments.filter(s => s.confidence >= 0.6 && s.confidence < 0.8).length,
low_confidence: segments.filter(s => s.confidence >= 0.4 && s.confidence < 0.6).length
}
}

⚠️ Potential issue | 🟡 Minor

Confidence buckets skip segments < 0.4.

Ensure categories are exhaustive.

   return {
     total_segments: segments.length,
     high_confidence: segments.filter(s => s.confidence >= 0.8).length,
     medium_confidence: segments.filter(s => s.confidence >= 0.6 && s.confidence < 0.8).length,
-    low_confidence: segments.filter(s => s.confidence >= 0.4 && s.confidence < 0.6).length
+    low_confidence: segments.filter(s => s.confidence < 0.6).length
   }
🤖 Prompt for AI Agents
In extras/speaker-recognition/webui/src/services/elevenlabs.ts around lines
173–185, the current confidence buckets skip segments with confidence < 0.4;
make the categories exhaustive by adding a very_low_confidence bucket (counting
segments with s.confidence < 0.4) or by adjusting the existing low_confidence
range to include values below 0.4; update the returned object shape accordingly
and ensure tests/consumers are updated to use the new/changed field.
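
The exhaustiveness property is easy to check: with the adjusted buckets, every segment lands in exactly one category, so the counts sum to the total. A Python sketch mirroring the TypeScript logic:

```python
def summarize(confidences):
    # Buckets partition [0, 1]: every confidence falls in exactly one.
    return {
        "total_segments": len(confidences),
        "high_confidence": sum(c >= 0.8 for c in confidences),
        "medium_confidence": sum(0.6 <= c < 0.8 for c in confidences),
        "low_confidence": sum(c < 0.6 for c in confidences),
    }

summary = summarize([0.95, 0.7, 0.5, 0.3])
print(summary)
# The 0.3 segment is now counted instead of silently dropped.
assert (summary["high_confidence"] + summary["medium_confidence"]
        + summary["low_confidence"]) == summary["total_segments"]
```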

@AnkushMalaker AnkushMalaker deleted the feat/eleven-labs branch December 18, 2025 19:39