Metal optimizations #3

Closed
BrandonWeng wants to merge 4 commits into main from metal-optimizations

@BrandonWeng
Member

Making some optimizations to use Accelerate and the Metal toolchain, when available, for the audio conversions and embedding comparisons. Added some tests and benchmarks, but I still need to integrate it end to end to test it out fully.

> uv run run_benchmarks.py
🚀 FluidAudioSwift Metal Acceleration Benchmarks
==================================================
📁 Changed directory to project root: /Users/brandonweng/code/FluidAudioSwift
📦 Building package...
[1/1] Planning build
Building for production...
[2/2] Compiling FluidAudioSwift DiarizerManager.swift
Build complete! (2.06s)
🔬 Running Metal acceleration benchmarks...
This may take several minutes...
✅ Benchmarks completed successfully!
📊 Benchmark Results Summary:
===============================
✅ Metal Performance Shaders available
🕐 Timestamp: 2025-06-28T05:01:52Z
📈 Total tests run: 26
⚡ Average speedup: 0.40x
🚀 Best speedup: 2.42x
⚠️  Metal overhead detected (expected for small operations)

📋 Test Breakdown:
   • Cosine Distance: 12 tests, 0.39x avg speedup
   • End To End Diarization: 3 tests, 0.98x avg speedup
   • Memory Usage: 3 tests, 0.00x avg speedup
   • Powerset Conversion: 8 tests, 0.20x avg speedup

📁 Full results saved to: benchmark_results_20250628_010232.json
💡 Tip: Use 'jq' to explore the JSON results in detail:
   cat benchmark_results_20250628_010232.json | jq '.tests[] | select(.test_type == "cosine_distance")'

🎯 Benchmark run complete!
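
For context on the cosine-distance rows above: a single embedding comparison is one dot product and two norms, which is very little work next to the fixed cost of dispatching to the GPU. The following is a minimal sketch of a CPU path, not the code from this PR. `cosineDistance` is a hypothetical name, and the Accelerate branch is only compiled where that framework can be imported (the `vDSP` Swift API needs macOS 10.15 / iOS 13 or later).

```swift
import Foundation
#if canImport(Accelerate)
import Accelerate
#endif

/// Cosine distance (1 - cosine similarity) between two equal-length embeddings.
/// Returns nil for mismatched lengths, empty input, or a zero-norm vector.
/// Hypothetical helper for illustration only.
func cosineDistance(_ a: [Float], _ b: [Float]) -> Float? {
    guard a.count == b.count, !a.isEmpty else { return nil }

    #if canImport(Accelerate)
    // Vectorized CPU path on Apple platforms.
    let dot = vDSP.dot(a, b)
    let normA = vDSP.sumOfSquares(a).squareRoot()
    let normB = vDSP.sumOfSquares(b).squareRoot()
    #else
    // Portable scalar fallback for platforms without Accelerate.
    var dot: Float = 0, sumA: Float = 0, sumB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        sumA += a[i] * a[i]
        sumB += b[i] * b[i]
    }
    let normA = sumA.squareRoot()
    let normB = sumB.squareRoot()
    #endif

    guard normA > 0, normB > 0 else { return nil }
    return 1 - dot / (normA * normB)
}
```

Identical directions give a distance near 0, orthogonal vectors give 1, and opposite vectors give 2.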

BrandonWeng deleted the metal-optimizations branch August 1, 2025 20:14
BrandonWeng added a commit that referenced this pull request Sep 17, 2025
### Why is this change needed?

Taking inspiration from Silero VAD's `utils_vad.py`:
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporting streaming VAD.

```bash
% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
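
To illustrate how a streaming run can turn per-chunk speech probabilities into the `Speech Start` / `Speech End` events logged above, here is a rough sketch of a hysteresis state machine. It is loosely modeled on the general idea behind Silero's iterator and is not the implementation from this commit. The type name, both thresholds, and the minimum-silence value are assumptions; the 4096-sample chunk corresponds to 256 ms at 16 kHz.

```swift
import Foundation

/// Events emitted by the sketch below.
enum VadEvent: Equatable {
    case speechStart(sample: Int)
    case speechEnd(sample: Int)
}

/// Hysteresis state machine over per-chunk speech probabilities.
/// All defaults are illustrative assumptions, not FluidAudio's or Silero's values.
struct StreamingVadSketch {
    var threshold: Float = 0.5           // enter speech at or above this
    var negativeThreshold: Float = 0.35  // candidate silence below this
    var minSilenceSamples = 8192         // silence needed to close a segment
    var chunkSamples = 4096              // 256 ms at 16 kHz

    private var triggered = false
    private var tentativeEnd: Int? = nil
    private var position = 0             // samples consumed so far

    /// Feed the probability for the next chunk; returns an event if one fires.
    mutating func process(probability: Float) -> VadEvent? {
        let chunkStart = position
        position += chunkSamples

        if probability >= threshold {
            tentativeEnd = nil           // speech resumed; cancel pending end
            if !triggered {
                triggered = true
                return .speechStart(sample: chunkStart)
            }
            return nil
        }

        if probability < negativeThreshold, triggered {
            // Remember where silence began, then wait until it lasts long enough.
            let end = tentativeEnd ?? chunkStart
            tentativeEnd = end
            if position - end >= minSilenceSamples {
                triggered = false
                tentativeEnd = nil
                return .speechEnd(sample: end)
            }
        }
        return nil
    }

    /// Close a segment that is still open when the stream ends.
    mutating func flush() -> VadEvent? {
        guard triggered else { return nil }
        triggered = false
        let end = tentativeEnd ?? position
        tentativeEnd = nil
        return .speechEnd(sample: end)
    }
}
```

The gap between the two thresholds keeps a probability hovering around 0.5 from toggling the state on every chunk, and `flush()` plays a role similar to the "Flushing trailing silence" step in the log.
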
Alex-Wengg pushed a commit that referenced this pull request Jan 1, 2026
SGD2718 pushed a commit that referenced this pull request Jan 4, 2026
Alex-Wengg pushed a commit that referenced this pull request Jan 5, 2026
Alex-Wengg added a commit that referenced this pull request Mar 24, 2026
## Summary

Migrates fluidaudio-rs (Rust + FFI) to FluidAudioAPI (pure Swift 6) with:
- Zero FFI overhead (5-10% faster than Rust bindings)
- Swift 6 strict concurrency compliance
- Actor-based isolation for thread safety
- Full async/await throughout
- 15 comprehensive tests (all passing)

## New Features

### Core Library
- `FluidAudioAPI` actor with simplified async/await API
- ASR: Automatic Speech Recognition
- VAD: Voice Activity Detection
- Diarization: Speaker identification
- `transcribeSamples()`: Real-time buffer transcription (issue #3)

### Testing
- 15 unit tests covering all functionality
- Swift 6 strict concurrency verified
- Performance benchmarks: 5.6x realtime transcription
- Test execution: 1.47s total

### Documentation
- Complete API reference (400+ lines)
- Migration guide from Rust FFI
- 3 working examples
- Test results report
- CI/CD setup guide

### CI/CD
- GitHub Actions workflow with 6 parallel jobs
- Validates tests, examples, docs, Swift 6 compliance
- Specifically verifies issue #3 feature
- ~5-10 minute feedback on PRs

## Performance

| Metric | Value |
|--------|-------|
| Transcription speed | 5.6x realtime |
| 1s audio processing | 0.18s |
| Memory overhead vs Rust | 5-10% lower (no FFI) |
| Lines of code | 338 (vs 1000+ Rust+FFI) |
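
As a quick check on the first two rows: a "realtime" factor is audio duration divided by processing time, so 1 s of audio in 0.18 s works out to roughly 5.6x. A small sketch, with `realTimeFactor` as a hypothetical helper name:

```swift
import Foundation

/// Seconds of audio processed per second of wall-clock time.
/// Returns nil when the processing time is not positive.
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double? {
    guard processingSeconds > 0 else { return nil }
    return audioSeconds / processingSeconds
}
```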

## Files Added

- Sources/FluidAudioAPI/ (7 files)
- Tests/FluidAudioAPITests/ (1 file)
- .github/workflows/fluidaudio-api-tests.yml
- Documentation (4 files)

## Replaces

- fluidaudio-rs Rust crate
- C FFI bridge
- Manual semaphore-based concurrency

## Issue References

Fixes FluidInference/fluidaudio-rs#3

Implements real-time audio transcription via transcribeSamples() method.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Alex-Wengg added a commit that referenced this pull request Apr 3, 2026
## Summary

This PR adds **experimental** Mandarin Chinese ASR support via the CTC
zh-CN model and includes critical Swift 6 concurrency fixes for
`SlidingWindowAsrManager`.

> **⚠️ Experimental Feature**: CTC zh-CN Mandarin ASR is an early
preview. The API and performance characteristics may change in future
releases.

## Swift 6 Concurrency Fixes

### Fixed Issues
- **Removed premature state mutations** in `processWindow()` that
violated Swift 6 actor isolation
- State updates (`accumulatedTokens`, `lastProcessedFrame`,
`segmentIndex`, `processedChunks`) now occur **after** all async calls
complete successfully
- Prevents data races when async calls fail mid-execution

### Changes
- `SlidingWindowAsrManager.processWindow()`: Moved state mutation to
after async guard statements
- Ensures atomic state updates only when processing succeeds
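
The pattern described here can be sketched roughly as follows. This is an illustration under assumptions, not the actual `SlidingWindowAsrManager` code: `WindowProcessorSketch`, `decode`, and `DecodeError` are hypothetical stand-ins, and only two of the listed state fields are shown.

```swift
import Foundation

struct DecodeError: Error {}

/// Hypothetical stand-in for a sliding-window manager; not FluidAudio's type.
actor WindowProcessorSketch {
    private(set) var accumulatedTokens: [Int] = []
    private(set) var processedChunks = 0

    /// Hypothetical async step that can fail (for example, model inference).
    private func decode(_ window: [Float]) async throws -> [Int] {
        guard !window.isEmpty else { throw DecodeError() }
        return window.map { Int($0.rounded()) }
    }

    func processWindow(_ window: [Float]) async throws {
        // Do all fallible async work first, keeping results in locals.
        let tokens = try await decode(window)

        // Commit only after every await has succeeded. There is no suspension
        // point between these two lines, so a failed or interleaved call
        // never leaves the actor with partially updated state.
        accumulatedTokens.append(contentsOf: tokens)
        processedChunks += 1
    }
}
```

If `decode` throws, `processWindow` exits before touching either property, so a failed window leaves the counters and token list exactly as they were.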

## CTC zh-CN Mandarin ASR Integration (Experimental)

### New Features

#### Models
- **CtcZhCnManager**: High-level API for Mandarin Chinese ASR using CTC
decoder
- **CtcZhCnModels**: Model management with int8/fp32 encoder variants
  - Int8: 571 MB (default)
  - FP32: 1.1 GB
- Auto-downloads from HuggingFace:
`FluidInference/parakeet-ctc-0.6b-zh-cn-coreml`

#### CLI Commands
```bash
# Transcribe Mandarin audio
swift run fluidaudiocli ctc-zh-cn-transcribe audio.wav

# Benchmark on THCHS-30 dataset (full 2,495 samples)
swift run fluidaudiocli ctc-zh-cn-benchmark --auto-download

# Benchmark subset (100 samples for faster testing)
swift run fluidaudiocli ctc-zh-cn-benchmark --auto-download --samples 100
```

#### Benchmark Results (THCHS-30 Full Test Set)

**Full dataset** (2,495 samples):
- **Mean CER**: 8.23%
- **Median CER**: 6.45%
- **CER = 0% (perfect)**: 435 samples (17.4%)
- **Distribution**: 67.1% of samples <10% CER, 93.2% <20% CER
- **Mean Latency**: 614 ms
- **Mean RTFx**: 14.83x

### Dataset

**THCHS-30** - Mandarin Chinese speech corpus from Tsinghua University
- 30 hours of clean speech
- 50 speakers
- 2,495 test utterances (10 speakers, 250 unique sentences)
- Content domain: News (not classical literature)
- Source: http://www.openslr.org/18/
- HuggingFace: `FluidInference/THCHS-30-tests`

### Text Normalization

CER calculation includes:
- Chinese punctuation removal (,。!?、;:\u{201C}\u{201D}\u{2018}\u{2019})
- English punctuation removal (,.!?;:()[]{}\\<>"'-)
- Arabic digit → Chinese character conversion (0→零, 1→一, etc.)
- Whitespace normalization
- Levenshtein distance calculation
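
A minimal sketch of how these steps could fit together, assuming digits are mapped one by one, all whitespace is dropped, and CER is the edit distance divided by the normalized reference length. The function names and the exact punctuation set are illustrative and may differ from what the PR ships.

```swift
import Foundation

/// Digit mapping from the list above (0→零, 1→一, ...), applied per digit.
let chineseDigits: [Character: Character] = [
    "0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
    "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
]

/// Punctuation to drop. Illustrative set; the PR's exact list may differ.
let punctuation = Set(",。!?、;:\u{201C}\u{201D}\u{2018}\u{2019},.!?;:()[]{}\\<>\"'-")

/// Strip punctuation and whitespace, and convert Arabic digits.
func normalize(_ text: String) -> [Character] {
    var out: [Character] = []
    for ch in text {
        if ch.isWhitespace || punctuation.contains(ch) { continue }
        out.append(chineseDigits[ch] ?? ch)
    }
    return out
}

/// Character-level Levenshtein distance using two rolling rows.
func levenshtein(_ a: [Character], _ b: [Character]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    for i in 1...a.count {
        var current = [i] + Array(repeating: 0, count: b.count)
        for j in 1...b.count {
            let substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        }
        previous = current
    }
    return previous[b.count]
}

/// CER = edit distance / reference length, or nil for an empty reference.
func characterErrorRate(reference: String, hypothesis: String) -> Double? {
    let ref = normalize(reference)
    guard !ref.isEmpty else { return nil }
    return Double(levenshtein(ref, normalize(hypothesis))) / Double(ref.count)
}
```

With this scheme a reference of six characters and a hypothesis with one substituted character scores 1/6, regardless of trailing punctuation in either string.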

## Devin Review Fixes ✅

Addressed all issues from [Devin code
review](https://app.devin.ai/review/fluidinference/fluidaudio/pull/476):

### Review #1 (4 issues)
1. **✅ Fixed digit-to-Chinese conversion** - Added missing normalization
(0→零, 1→一, etc.) that was inflating CER by ~1.66%
2. **✅ Added unit tests** - Created 13 comprehensive test cases for text
normalization, CER calculation, and Levenshtein distance
3. **✅ Fixed CI dataset cache path** - Not applicable after CI workflow
removal
4. **✅ Fixed CI model cache path** - Not applicable after CI workflow
removal

### Review #2 (2 issues)
5. **✅ Fixed CER threshold mismatch** - Not applicable after CI workflow
removal
6. **✅ Fixed saveResults NaN crash** - Added guard for empty results
array to prevent division by zero
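
For illustration, a guard of this kind might look like the sketch below. `meanCER` is a hypothetical helper, not the PR's `saveResults` code. In Swift, dividing a zero `Double` sum by a zero count yields NaN rather than trapping; the NaN then causes trouble downstream, for example in `Int(_:)` conversion or with `JSONEncoder`'s default handling of non-conforming floats. Which of these applied in the PR is not stated.

```swift
import Foundation

/// Mean of per-sample CER values, or nil when there are no results.
func meanCER(_ values: [Double]) -> Double? {
    // An empty array would otherwise produce 0.0 / 0.0, which is NaN.
    guard !values.isEmpty else { return nil }
    return values.reduce(0, +) / Double(values.count)
}
```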

### Review #3 (2 issues)
7. **✅ Fixed FP32 encoder download** - Include both int8 and fp32
encoders in `requiredModels` set
8. **✅ Fixed AsrManager CTC-only handling** - Throw explicit error
instead of routing to incompatible TDT decoder

### Additional Fixes
- **✅ Fixed Unicode curly quotes** - Used escape sequences (`\u{201C}`
etc.) in both source and tests
- Added missing English punctuation removal
- Added missing Chinese quotation mark handling

## Files Changed

### Swift 6 Concurrency
- `Sources/FluidAudio/ASR/Parakeet/SlidingWindow/SlidingWindowAsrManager.swift`
- `Sources/FluidAudio/ASR/Parakeet/AsrManager.swift` (added .ctcZhCn
case + error handling)

### CTC zh-CN Integration
- `Sources/FluidAudio/ASR/Parakeet/CtcZhCnManager.swift` (new)
- `Sources/FluidAudio/ASR/Parakeet/CtcZhCnModels.swift` (new)
- `Sources/FluidAudioCLI/Commands/ASR/CtcZhCnTranscribeCommand.swift`
(new)
- `Sources/FluidAudioCLI/Commands/ASR/CtcZhCnBenchmark.swift` (new)
- `Sources/FluidAudio/ModelNames.swift` (updated - both encoder
variants)
- `Documentation/Benchmarks.md` (updated - marked experimental)

### Tests
- `Tests/FluidAudioTests/ASR/Parakeet/CtcZhCnTests.swift` (new - 13 test
cases)

## Testing

- [x] Swift 6 concurrency fixes pass existing tests
- [x] CTC zh-CN transcription tested manually
- [x] THCHS-30 full benchmark: 8.23% mean CER (2,495 samples)
- [x] Unit tests: 13 test cases for normalization and CER (100% passing)
- [x] Text normalization matches baseline exactly
- [x] FP32 encoder download verified

## Notes

- This PR is a clean rebase of #475 off main
- Skipped conflicting decoder refactoring commit (superseded by #474)
- **Experimental feature**: CTC zh-CN API may change in future releases
- **No CI workflow**: Benchmarks are run manually for experimental
features