Migrate speaker diarization to OSS SDK by Alex-Wengg · Pull Request #1 · FluidInference/FluidAudio

Alex-Wengg · 2025-06-22T04:19:30Z

Migrates the speaker diarization functionality from Slipbox to an open source SDK. This creates a standalone, reusable component that other developers can integrate.

- Extract SpeakerDiarizationManager as independent Swift package
- Add SherpaOnnx wrapper integration
- Include model auto-download functionality
- Slipbox branch using this library is SeamlessAudioSwift
- main Sources/SeamlessAudioSwift/SeamlessAudioSwift.swift

usage
let manager = SpeakerDiarizationManager()
await manager.initialize()
let segments = try await manager.performSegmentation(audioSamples)

… binary files

- Migrate diarizer functionality from slipbox repo
- Set up proper Swift Package structure with SherpaOnnx integration
- Configure Git LFS for all .a library files to avoid GitHub size limits
- Add comprehensive test suite
- Fix module map and linker settings for proper C/Swift interop

BrandonWeng · 2025-06-22T20:56:49Z

@@ -0,0 +1,59 @@
+// swift-tools-version: 5.9


Let's use 6.1

BrandonWeng

Great starting point! We can merge this first and make improvements as needed:

- Properly support different Apple platforms; currently we only support macOS
- Import and support our CoreML diarization models
- Improve our benchmarks too

- Add .DS_Store and .swiftpm to .gitignore to exclude system files
- Remove existing .DS_Store files from tracking
- Remove .swiftpm directory from tracking (auto-generated by Xcode)
- Add comprehensive README with proper attribution to SherpaOnnx
- Include installation instructions, usage examples, and model attribution
- Credit K2-FSA team and sherpa-onnx project for underlying libraries

- Update swift-tools-version from 5.9 to 6.0
- Remove Sendable conformance from SpeakerDiarizationManager to fix concurrency errors
- Exclude lib/ directory from SherpaOnnxWrapper target to avoid unhandled file warnings
- Update README requirements to reflect Swift 6.0+ and Xcode 16.0+ requirements
- All tests passing with Swift 6.0 strict concurrency checking

- Updated swift-tools-version from 6.0 to 6.1 in Package.swift
- Added .swift-version file specifying Swift 6.1.2
- Updated README.md to require Swift 6.1+
- Verified all tests pass with Swift 6.1.2

### Why is this change needed?

Taking inspiration from the Silero VAD utilities (https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py), this updates our segmentation implementation and adds support for streaming VAD.

```bash
% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze] • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze] • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze] • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze] • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze] • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze] • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze] • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze] • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze] • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze] • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze] • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze] • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav -af silencedetect=noise=-30dB:d=0.5 -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
  ...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
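The streaming run reports start and end events as they are confirmed, rather than finished segments. A minimal sketch of how such events can be derived from per-chunk speech probabilities, loosely modelled on the hysteresis idea in Silero's `utils_vad.py`. The type names, thresholds, and the 0.256 s chunk duration are assumptions for illustration and are not FluidAudio's API.

```swift
/// Event emitted by the hypothetical streaming detector (times in seconds).
enum VadEvent: Equatable {
    case speechStart(Double)
    case speechEnd(Double)
}

/// Hypothetical hysteresis state machine over per-chunk speech probabilities.
struct StreamingVad {
    var threshold = 0.5           // enter speech at or above this probability (assumed)
    var negativeThreshold = 0.35  // count silence only below this probability (assumed)
    var minSilence = 0.5          // seconds of silence needed to close a segment (assumed)
    var chunkDuration = 0.256     // seconds of audio per probability (assumed)

    private var inSpeech = false
    private var silenceStart: Double? = nil
    private var time = 0.0

    /// Feed one probability; returns an event once a boundary is confirmed.
    mutating func process(_ probability: Double) -> VadEvent? {
        let chunkStart = time
        time += chunkDuration

        if probability >= threshold {
            silenceStart = nil  // speech resumed, cancel any pending end
            if !inSpeech {
                inSpeech = true
                return .speechStart(chunkStart)
            }
            return nil
        }
        // Probabilities between the two thresholds change nothing.
        guard inSpeech, probability < negativeThreshold else { return nil }
        let start = silenceStart ?? chunkStart
        silenceStart = start
        if time - start >= minSilence {
            inSpeech = false
            silenceStart = nil
            return .speechEnd(start)  // the segment ends where the silence began
        }
        return nil
    }
}

// Usage: one quiet chunk, two speech chunks, then enough silence to close.
var vad = StreamingVad()
let events = [0.1, 0.9, 0.9, 0.1, 0.1].compactMap { vad.process($0) }
print(events)
```

Using two thresholds keeps a probability that hovers around 0.5 from toggling the state on every chunk. A real implementation would also need a flush step for audio that ends mid-speech, as the log's "Flushing trailing silence" line suggests.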


## Summary

This PR adds **experimental** Mandarin Chinese ASR support via the CTC zh-CN model and includes critical Swift 6 concurrency fixes for `SlidingWindowAsrManager`.

> **⚠️ Experimental Feature**: CTC zh-CN Mandarin ASR is an early preview. The API and performance characteristics may change in future releases.

## Swift 6 Concurrency Fixes

### Fixed Issues

- **Removed premature state mutations** in `processWindow()` that violated Swift 6 actor isolation
- State updates (`accumulatedTokens`, `lastProcessedFrame`, `segmentIndex`, `processedChunks`) now occur **after** all async calls complete successfully
- Prevents data races when async calls fail mid-execution

### Changes

- `SlidingWindowAsrManager.processWindow()`: Moved state mutation to after async guard statements
- Ensures atomic state updates only when processing succeeds

## CTC zh-CN Mandarin ASR Integration (Experimental)

### New Features

#### Models

- **CtcZhCnManager**: High-level API for Mandarin Chinese ASR using CTC decoder
- **CtcZhCnModels**: Model management with int8/fp32 encoder variants
  - Int8: 571 MB (default)
  - FP32: 1.1 GB
- Auto-downloads from HuggingFace: `FluidInference/parakeet-ctc-0.6b-zh-cn-coreml`

#### CLI Commands

```bash
# Transcribe Mandarin audio
swift run fluidaudiocli ctc-zh-cn-transcribe audio.wav

# Benchmark on THCHS-30 dataset (full 2,495 samples)
swift run fluidaudiocli ctc-zh-cn-benchmark --auto-download

# Benchmark subset (100 samples for faster testing)
swift run fluidaudiocli ctc-zh-cn-benchmark --auto-download --samples 100
```

#### Benchmark Results (THCHS-30 Full Test Set)

**Full dataset** (2,495 samples):

- **Mean CER**: 8.23%
- **Median CER**: 6.45%
- **CER = 0% (perfect)**: 435 samples (17.4%)
- **Distribution**: 67.1% of samples <10% CER, 93.2% <20% CER
- **Mean Latency**: 614 ms
- **Mean RTFx**: 14.83x

### Dataset

**THCHS-30** - Mandarin Chinese speech corpus from Tsinghua University

- 30 hours of clean speech
- 50 speakers
- 2,495 test utterances (10 speakers, 250 unique sentences)
- Content domain: News (not classical literature)
- Source: http://www.openslr.org/18/
- HuggingFace: `FluidInference/THCHS-30-tests`

### Text Normalization

CER calculation includes:

- Chinese punctuation removal (，。！？、；：\u{201C}\u{201D}\u{2018}\u{2019})
- English punctuation removal (,.!?;:()[]{}\\<>"'-)
- Arabic digit → Chinese character conversion (0→零, 1→一, etc.)
- Whitespace normalization
- Levenshtein distance calculation

## Devin Review Fixes

✅ Addressed all issues from [Devin code review](https://app.devin.ai/review/fluidinference/fluidaudio/pull/476):

### Review #1 (4 issues)

1. **✅ Fixed digit-to-Chinese conversion** - Added missing normalization (0→零, 1→一, etc.) that was inflating CER by ~1.66%
2. **✅ Added unit tests** - Created 13 comprehensive test cases for text normalization, CER calculation, and Levenshtein distance
3. **✅ Fixed CI dataset cache path** - Not applicable after CI workflow removal
4. **✅ Fixed CI model cache path** - Not applicable after CI workflow removal

### Review #2 (2 issues)

5. **✅ Fixed CER threshold mismatch** - Not applicable after CI workflow removal
6. **✅ Fixed saveResults NaN crash** - Added guard for empty results array to prevent division by zero

### Review #3 (2 issues)

7. **✅ Fixed FP32 encoder download** - Include both int8 and fp32 encoders in `requiredModels` set
8. **✅ Fixed AsrManager CTC-only handling** - Throw explicit error instead of routing to incompatible TDT decoder

### Additional Fixes

- **✅ Fixed Unicode curly quotes** - Used escape sequences (`\u{201C}` etc.) in both source and tests
- Added missing English punctuation removal
- Added missing Chinese quotation mark handling

## Files Changed

### Swift 6 Concurrency

- `Sources/FluidAudio/ASR/Parakeet/SlidingWindow/SlidingWindowAsrManager.swift`
- `Sources/FluidAudio/ASR/Parakeet/AsrManager.swift` (added .ctcZhCn case + error handling)

### CTC zh-CN Integration

- `Sources/FluidAudio/ASR/Parakeet/CtcZhCnManager.swift` (new)
- `Sources/FluidAudio/ASR/Parakeet/CtcZhCnModels.swift` (new)
- `Sources/FluidAudioCLI/Commands/ASR/CtcZhCnTranscribeCommand.swift` (new)
- `Sources/FluidAudioCLI/Commands/ASR/CtcZhCnBenchmark.swift` (new)
- `Sources/FluidAudio/ModelNames.swift` (updated - both encoder variants)
- `Documentation/Benchmarks.md` (updated - marked experimental)

### Tests

- `Tests/FluidAudioTests/ASR/Parakeet/CtcZhCnTests.swift` (new - 13 test cases)

## Testing

- [x] Swift 6 concurrency fixes pass existing tests
- [x] CTC zh-CN transcription tested manually
- [x] THCHS-30 full benchmark: 8.23% mean CER (2,495 samples)
- [x] Unit tests: 13 test cases for normalization and CER (100% passing)
- [x] Text normalization matches baseline exactly
- [x] FP32 encoder download verified

## Notes

- This PR is a clean rebase of #475 off main
- Skipped conflicting decoder refactoring commit (superseded by #474)
- **Experimental feature**: CTC zh-CN API may change in future releases
- **No CI workflow**: Benchmarks are run manually for experimental features
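The text normalization and CER steps listed in that description can be sketched roughly as follows. This is an illustrative sketch and not the benchmark's actual code. The function names are hypothetical, and it assumes that digits are mapped one by one and that whitespace is dropped entirely before the character-level comparison.

```swift
/// Hypothetical normalizer: strips punctuation and whitespace, maps digits to Chinese.
func normalize(_ text: String) -> [Character] {
    let digits: [Character: Character] = [
        "0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
        "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
    ]
    // Chinese and English punctuation from the PR description, curly quotes escaped.
    let punctuation = Set("，。！？、；：\u{201C}\u{201D}\u{2018}\u{2019},.!?;:()[]{}\\<>\"'-")
    return text.compactMap { ch -> Character? in
        if ch.isWhitespace || punctuation.contains(ch) { return nil }
        return digits[ch] ?? ch
    }
}

/// Classic two-row Levenshtein distance over characters.
func levenshtein(_ a: [Character], _ b: [Character]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    for (i, ca) in a.enumerated() {
        var current = [i + 1] + Array(repeating: 0, count: b.count)
        for (j, cb) in b.enumerated() {
            let substitution = previous[j] + (ca == cb ? 0 : 1)
            current[j + 1] = min(substitution, previous[j + 1] + 1, current[j] + 1)
        }
        previous = current
    }
    return previous[b.count]
}

/// Character error rate: edit distance divided by the reference length.
func cer(reference: String, hypothesis: String) -> Double {
    let ref = normalize(reference)
    let hyp = normalize(hypothesis)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    return Double(levenshtein(ref, hyp)) / Double(ref.count)
}

// Usage: the digit and the full stop no longer count as errors after normalization.
let matching = cer(reference: "我有3个苹果。", hypothesis: "我有三个苹果")
let oneWrong = cer(reference: "今天天气好", hypothesis: "今天天汽好")
print(matching, oneWrong)
```

Without the digit mapping, a reference containing `3` and a hypothesis containing `三` would count as a substitution, which is the kind of inflation the review fix describes.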

… (Issues #1 & #4) (#502)

## Summary

This PR addresses two architectural issues from the consolidated report (#457):

1. **Issue #1: File Organization** - Reorganizes batch managers into `SlidingWindow/`, grouped by algorithm (TDT vs CTC)
2. **Issue #4: Decoder State Management** - Exposes decoder state explicitly, removing per-source state routing

Both changes improve architecture clarity and eliminate hidden complexity.

---

## Issue #1: File Organization ✅

**Problem**: Batch managers scattered at `Parakeet/` root, unclear relationship to `SlidingWindowAsrManager`

**Solution**: Moved 34 files into `SlidingWindow/`, organized by decoding algorithm

### File Moves (24 source files + 10 test files)

**TDT Batch Processing** → `SlidingWindow/TDT/`:

- AsrManager.swift, AsrManager+*.swift (3 extensions), AsrModels.swift, ChunkProcessor.swift
- TdtJaManager.swift, TdtJaModels.swift

**TDT Infrastructure** → `SlidingWindow/TDT/Decoder/`:

- TdtDecoderV2/V3, TdtConfig, TdtDecoderState, BlasIndex, etc. (12 files)

**CTC Language Models** → `SlidingWindow/CTC/`:

- CtcJaManager/Models, CtcZhCnManager/Models

### New Structure

```
SlidingWindow/
├── SlidingWindowAsrManager.swift (public API)
├── SlidingWindowAsrSession.swift
│
├── TDT/ ← All TDT batch processing
│   ├── AsrManager.swift (multilingual, internal engine)
│   ├── TdtJaManager.swift (Japanese)
│   └── Decoder/ (TDT infrastructure)
│
└── CTC/ ← All CTC batch + language variants
    ├── CtcJaManager.swift (Japanese)
    └── CtcZhCnManager.swift (Chinese)
```

### Documentation

- Updated `Documentation/ASR/DirectoryStructure.md` with new structure
- Added section explaining algorithm-based organization (TDT vs CTC)

---

## Issue #4: Decoder State Management ✅

**Problem**: AsrManager maintained hidden per-source decoder states:

- Mixed model management with application-level state routing
- Limited to 2 simultaneous transcriptions (microphone/system)
- State not visible in method signatures

**Solution**: Expose decoder state explicitly via `inout` parameters

### API Changes (Breaking)

**Before**:

```swift
let result = try await manager.transcribe(audio, source: .microphone)
```

**After**:

```swift
var state = TdtDecoderState.make()
let result = try await manager.transcribe(audio, decoderState: &state)
```

### Changed Methods

All public transcription methods now require `decoderState: inout TdtDecoderState`:

- `transcribe(_ audioBuffer:, decoderState:)`
- `transcribe(_ url:, decoderState:)`
- `transcribeDiskBacked(_ url:, decoderState:)`
- `transcribe(_ audioSamples:, decoderState:)`

### Removed Methods

- `resetDecoderState()` - callers create fresh state with `TdtDecoderState.make()`
- `resetDecoderState(for:)` - no longer needed
- Internal `initializeDecoderState(for:)` - removed

### Internal Changes

- **AsrManager+Transcription**: Updated to use `inout` state
- **SlidingWindowAsrManager**: Manages own `decoderState` property
- **ChunkProcessor**: Added `decoderState` parameter
- **TdtDecoderState**: Made `public` for external use

### Updated Call Sites

- **CLI**: 5 commands (AsrBenchmark, FleursBenchmark, CtcEarningsBenchmark, TranscribeCommand, TTSCommand)
- **Tests**: AsrManagerTests, StressTests

### Benefits

- ✅ **Explicit state management** - Caller controls state lifecycle
- ✅ **Unlimited concurrency** - No limit on simultaneous transcriptions
- ✅ **Clearer architecture** - AsrManager manages models, not app state
- ✅ **Better testing** - State is visible, not hidden

---

## Testing

✅ **All tests pass**:

- CI tests: 13/13 passed
- AsrManager tests: 57/57 passed
- ChunkProcessor tests: 40/40 passed
- CtcJa tests: 23/23 passed

✅ **Build succeeds** with zero errors

✅ **CLI commands** work correctly

## Migration Notes

**Issue #1**: Zero code changes required. Swift Package Manager treats all of `Sources/FluidAudio/` as a single module, so moving files between subdirectories requires no import changes.

**Issue #4**: Breaking API change. Update all `transcribe()` calls to create and pass decoder state explicitly (see examples above). Most users use `SlidingWindowAsrManager` (high-level API) which handles state internally, so no migration is needed.

---

## Impact Summary

**Before**:

- 15 files at Parakeet root (unclear organization)
- Hidden per-source state routing
- Limited to 2 concurrent transcriptions

**After**:

- 3 files at Parakeet root (shared utilities only)
- Algorithm-based organization (TDT vs CTC)
- Explicit state management, unlimited concurrency
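To illustrate why passing decoder state via `inout` removes the two-source limit, here is a toy sketch. `ToyDecoderState` and `ToyTranscriber` are hypothetical stand-ins invented for this example. They are not `TdtDecoderState` or `AsrManager`, and the "decoding" is only a placeholder that drops repeated tokens.

```swift
/// Hypothetical stand-in for a decoder state; one value per audio stream.
struct ToyDecoderState {
    var lastToken: Int? = nil
    var decodedCount = 0
    static func make() -> ToyDecoderState { ToyDecoderState() }
}

/// Hypothetical engine that holds no per-stream state of its own.
struct ToyTranscriber {
    /// Placeholder decoding: drops a token equal to the previous one,
    /// carrying that context across calls in `state`.
    func transcribe(_ chunk: [Int], decoderState state: inout ToyDecoderState) -> [Int] {
        var output: [Int] = []
        for token in chunk where token != state.lastToken {
            output.append(token)
            state.lastToken = token
            state.decodedCount += 1
        }
        return output
    }
}

// Usage: any number of independent streams can share one engine.
let engine = ToyTranscriber()
var micState = ToyDecoderState.make()
var systemState = ToyDecoderState.make()

let micFirst = engine.transcribe([1, 1, 2], decoderState: &micState)
let micSecond = engine.transcribe([2, 3], decoderState: &micState)     // continues the mic stream
let systemFirst = engine.transcribe([2, 3], decoderState: &systemState) // fresh context
print(micFirst, micSecond, systemFirst)
```

Because the caller owns each state value, resetting a stream amounts to creating a new state, which matches the removal of `resetDecoderState()` described above.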

Alex-Wengg added 3 commits June 22, 2025 00:04

Working branch

de1ea4c

Move libonnxruntime.a to Git LFS

c0b42c8

BrandonWeng reviewed Jun 22, 2025


Comment thread Package.swift Outdated

@@ -0,0 +1,59 @@

// swift-tools-version: 5.9


Let's use 6.1

BrandonWeng reviewed Jun 22, 2025


Comment thread .gitattributes

BrandonWeng approved these changes Jun 22, 2025


Alex-Wengg added 3 commits June 23, 2025 18:10

Upgrade to Swift 6.1.2

6bee0f9


Alex-Wengg merged commit 9c56c81 into main Jun 23, 2025

BrandonWeng deleted the beta branch June 24, 2025 19:47

rohithjnayak mentioned this pull request Oct 31, 2025

Add support for x86_64 architecture #173

Closed

Alex-Wengg added a commit that referenced this pull request Jan 1, 2026

Merge pull request #1 from FluidInference/beta

51b5f2e

Migrate speaker diarization to OSS SDK

claude bot mentioned this pull request Feb 15, 2026

feat: integrate Qwen3-ForcedAligner-0.6B for per-word timestamp alignment #315

Closed

8 tasks

Alex-Wengg mentioned this pull request Apr 3, 2026

Add experimental CTC zh-CN Mandarin ASR #476

Merged

6 tasks

This was referenced Apr 8, 2026

Code architecture inconsistencies, tech debt & out of place #457

Open

refactor: Reorganize batch managers + expose decoder state explicitly (Issues #1 & #4) #502

Merged


Alex-Wengg merged 6 commits into main from beta
