Add local voice input (Whisper) with push-to-talk and visual feedback #178

Merged: pluginslab merged 10 commits into dev from feature/voice-mode on Mar 22, 2026

Conversation

@moritzbappert
Collaborator

Summary

  • Adds on-device speech-to-text via Whisper Tiny (Xenova/whisper-tiny, ~40 MB ONNX) running entirely in a Web Worker — no audio ever leaves the browser
  • Space bar hold-to-record (push-to-talk) with a 200 ms threshold to distinguish a quick tap (inserts space) from a deliberate hold (starts recording); OS key-repeat events suppressed during recording to prevent ghost spaces
  • Pulsing red border glow on the input wrapper during recording; a transcribing wave overlay centred over the textarea while Whisper runs
  • Fixes ReAct agent crash when the LLM returns a tool_call action without a tool field (toolId.includes TypeError)

Changes

  • src/extensions/services/whisper-worker.js — new Web Worker: Whisper pipeline (WASM/q8), ISO→Whisper language map, warmup message, debug logging
  • src/extensions/components/VoiceButton.jsx — new component: forwardRef + useImperativeHandle exposing start/stop, 30 s countdown with auto-stop, warning state ≤10 s, transcribing dots
  • src/extensions/components/ChatInput.jsx — VoiceButton integration, voiceState tracking, Space push-to-talk with 200 ms timer, Enter-to-send, transcribing overlay, recording glow modifier
  • src/extensions/components/ChatContainer.jsx — placeholder updated to hint at Space shortcut; incorporates upstream Thinking… state from fix/issue-147
  • src/extensions/styles/main.scss — voice button styles, recording/transcribing animations, input wrapper --recording glow keyframe
  • src/extensions/services/react-agent.js — guard against undefined toolId at call site and in executeTool
  • webpack.config.js — whisper-worker entry as self-contained bundle (no code splitting)
  • package.json / package-lock.json — adds @huggingface/transformers
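
The warmup message listed for whisper-worker.js can be pictured as a cached model promise shared by all requests. The following is a minimal sketch, not the worker's actual code: the message shapes (`{ type: 'warmup' }`, `{ type: 'transcribe', audio }`), the reply shapes, and the injected `loadModel` function are all assumptions made for illustration.

```javascript
// Hypothetical sketch of a worker-side message handler with a cached model.
// `loadModel` stands in for whatever builds the Whisper pipeline; it is
// injected so the caching logic can be exercised without the real model.
function createWorkerHandler(loadModel) {
  let modelPromise = null; // shared by warmup and transcribe requests

  function getModel() {
    if (modelPromise === null) {
      // The executor runs synchronously, so loadModel is called at most once
      // even if several messages arrive before the model has finished loading.
      modelPromise = new Promise((resolve) => resolve(loadModel()));
      // Allow a later retry if loading failed (e.g. a network error).
      modelPromise.catch(() => {
        modelPromise = null;
      });
    }
    return modelPromise;
  }

  return async function handleMessage(message) {
    switch (message.type) {
      case 'warmup':
        await getModel();
        return { type: 'ready' };
      case 'transcribe': {
        const model = await getModel();
        const result = await model(message.audio);
        return { type: 'result', text: result.text };
      }
      default:
        return { type: 'error', message: `Unknown message type: ${message.type}` };
    }
  };
}
```

In a real worker the returned handler would be wired to `self.onmessage` and its result sent back with `postMessage`; a warmup sent on component mount then means the first transcribe request reuses the already-pending load.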

Testing

  • Unit tests pass (npm test)
  • Ability tests pass (npm run test:abilities -- --file tests/abilities/core-abilities.test.js)
  • JS lint clean (npm run lint:js)
  • PHP lint clean (composer lint)
  • Manually tested in browser (if UI changes)

Notes

  • First use downloads the ~40 MB Whisper ONNX model from HuggingFace Hub and caches it in Cache Storage; subsequent uses load the model from that cache instead of downloading it again
  • Model is pre-warmed on component mount (warmup message to worker) so first recording is fast
  • iOS Safari falls back to audio/mp4 (no WebM support) and WASM backend (no WebGPU)
  • Debug console.log statements remain in whisper-worker.js intentionally for now to aid diagnosing transcription issues in the field
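
The iOS Safari fallback noted above amounts to probing container formats in order of preference. A minimal sketch, assuming a hypothetical helper name and candidate list (the PR only states that iOS Safari ends up on audio/mp4); the support check is injected so the function can run outside a browser:

```javascript
// Candidate container formats in order of preference. The exact list is an
// assumption; the PR only states that iOS Safari falls back to audio/mp4.
const CANDIDATE_MIME_TYPES = ['audio/webm;codecs=opus', 'audio/webm', 'audio/mp4'];

// Returns the first supported type, or '' to let the browser pick its default.
function pickRecordingMimeType(isTypeSupported) {
  for (const type of CANDIDATE_MIME_TYPES) {
    if (isTypeSupported(type)) {
      return type;
    }
  }
  return '';
}

// In a browser this would be called as:
//   pickRecordingMimeType((type) => MediaRecorder.isTypeSupported(type));
```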

moritzbappert and others added 10 commits March 22, 2026 10:54
Adds a self-contained Web Worker that runs Xenova/whisper-tiny (~40 MB
ONNX) entirely in the browser via @huggingface/transformers. No audio
ever leaves the device, keeping the plugin privacy-first.

- WASM backend (not WebGPU) avoids q8 precision issues on some drivers
- Language forced via ISO→name mapping + explicit task: 'transcribe'
  to prevent multilingual hallucinations and accidental translation
- Warmup message pre-loads the model on mount so first use is instant
- Webpack entry configured as self-contained bundle (no code splitting)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
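
The language forcing described in this commit can be sketched as a lookup that turns a site locale into transcription options. This is an illustrative sketch only: the map below is a small hypothetical subset, the helper name is invented, and leaving `language` unset for unknown locales is an assumption about the desired fallback, not something the PR states.

```javascript
// Illustrative subset; the real map in whisper-worker.js is not shown in the PR.
const ISO_TO_WHISPER_LANGUAGE = new Map([
  ['en', 'english'],
  ['de', 'german'],
  ['fr', 'french'],
  ['es', 'spanish'],
]);

// Builds the options object passed alongside the audio.
function buildTranscribeOptions(locale) {
  // Accept 'de', 'de-DE' or 'de_DE' and keep only the language part.
  const iso = String(locale || '').toLowerCase().split(/[-_]/)[0];
  const language = ISO_TO_WHISPER_LANGUAGE.get(iso);

  // Always request transcription so the model never translates to English.
  const options = { task: 'transcribe' };
  if (language !== undefined) {
    options.language = language;
  }
  return options;
}
```

For example, `buildTranscribeOptions('de_DE')` yields `{ task: 'transcribe', language: 'german' }`, while an unmapped locale yields only `{ task: 'transcribe' }`.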
Microphone button for speech-to-text input with three states:
idle → recording → transcribing.

- Stop icon + live countdown badge during recording (flex-column,
  stays within button bounds — no overflow clipping issues)
- Hard auto-stop at 30 s matching Whisper's context window
- Warning state (≤ 10 s remaining): deeper red + faster pulse
- Subtle recording pulse animation to show the mic is active
- Pre-warms the Whisper worker on mount for instant first use
- Renders null (no button) on browsers without MediaRecorder/getUserMedia

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
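
The countdown rules in this commit (30 s hard stop, warning at 10 s or less) reduce to a small pure function of elapsed time. A sketch under the assumption that the component tracks elapsed milliseconds; the function and field names are hypothetical:

```javascript
const MAX_RECORDING_MS = 30000; // hard auto-stop, per the commit message
const WARNING_THRESHOLD_S = 10; // warning state at or below this many seconds

// Derives what the countdown badge should show for a given elapsed time.
function countdownState(elapsedMs) {
  const remainingMs = Math.max(0, MAX_RECORDING_MS - elapsedMs);
  const remainingSeconds = Math.ceil(remainingMs / 1000);
  return {
    remainingSeconds,
    warning: remainingSeconds <= WARNING_THRESHOLD_S,
    shouldStop: remainingMs === 0,
  };
}
```

Keeping this logic separate from the timer that drives it makes the boundaries (exactly 10 s left, exactly 0 s left) easy to check in isolation.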
Hold Space on an empty textarea to record; release to stop and transcribe.
Enter sends the message after transcription (Shift+Enter for newline).

- VoiceButton converted to forwardRef with useImperativeHandle exposing
  start()/stop() so ChatInput can drive recording from keyboard events
- Imperative-handle refs populated after the early-return check to keep
  hooks unconditional while startRecording/stopRecording stay post-guard
- e.repeat guard prevents holding Space from firing multiple starts
- isDisabled guard prevents Space from triggering during model load
- Placeholder updated to hint at both shortcuts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
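
The Space handling described in this commit, combined with the 200 ms tap/hold threshold from the PR summary, can be sketched as a small controller. This is a hypothetical reconstruction rather than the code in ChatInput.jsx: the factory name, the callback names, and the `isEmpty`/`isDisabled` arguments are assumptions, and timers are injectable only so the logic can be tested without waiting.

```javascript
const HOLD_THRESHOLD_MS = 200; // tap vs. hold threshold from the PR summary

function createPushToTalk({
  startRecording,
  stopRecording,
  insertSpace,
  setTimer = setTimeout,
  clearTimer = clearTimeout,
}) {
  let holdTimer = null;
  let recording = false;

  function onKeyDown(event, { isEmpty, isDisabled }) {
    // Only an empty, enabled textarea turns Space into push-to-talk.
    if (event.key !== ' ' || !isEmpty || isDisabled) return;
    event.preventDefault(); // also swallows OS key-repeat "ghost" spaces
    if (event.repeat || holdTimer !== null || recording) return;
    holdTimer = setTimer(() => {
      holdTimer = null;
      recording = true;
      startRecording();
    }, HOLD_THRESHOLD_MS);
  }

  function onKeyUp(event) {
    if (event.key !== ' ') return;
    if (holdTimer !== null) {
      // Released before the threshold: treat as a normal space keystroke.
      clearTimer(holdTimer);
      holdTimer = null;
      insertSpace();
    } else if (recording) {
      recording = false;
      stopRecording();
    }
  }

  return { onKeyDown, onKeyUp };
}
```

In a component, `startRecording` and `stopRecording` would call the `start()`/`stop()` methods that VoiceButton exposes through its imperative handle.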
LLM occasionally returns {"action":"tool_call"} without a "tool" field,
causing a TypeError crash at toolId.includes('/').

- Skip the iteration gracefully when toolName is falsy, pushing a
  synthetic error observation so the loop can summarize and continue.
- Add early-return guard in executeTool() as a belt-and-suspenders
  defence against the same case at the execution layer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
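
The call-site guard described in this commit can be sketched as a validation step that runs before anything calls `.includes('/')` on the tool name. The function name, return shape, and error wording below are hypothetical; only the failure mode (a tool_call action with no tool field) comes from the commit message.

```javascript
// Decides whether a parsed LLM action can be executed as a tool call.
function planToolCall(action) {
  const toolName = action && action.tool;

  if (typeof toolName !== 'string' || toolName.trim() === '') {
    // Synthetic error observation so the agent loop can summarize and continue
    // instead of crashing on toolName.includes('/').
    return {
      skip: true,
      observation: { error: 'tool_call action is missing the "tool" field' },
    };
  }

  return { skip: false, toolName, isNamespaced: toolName.includes('/') };
}
```

Checking the type rather than mere truthiness also covers the case where the model returns a non-string value in the tool field.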
ChatContainer was overriding ChatInput's default placeholder with a
hardcoded string that omitted the hint. Updated to match the agreed
wording: '… (hold Space to speak)'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
While push-to-talk is active the input wrapper gains a --recording
modifier that:
- Shifts the border to a muted red (rgba of #d63638 at 40% opacity)
- Applies a very faint warm tint to the background (2.5% opacity)
- Pulses the box-shadow between 30% and 8% opacity at 1.8s per cycle

The blue :focus-within highlight is suppressed during recording so it
does not compete with the red glow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The transcribing overlay now provides the visual feedback while Whisper
runs, so setting placeholder text in the textarea is redundant.

- Remove handlePartialTranscript and the onPartialTranscript prop from
  ChatInput — the overlay is the sole indicator during transcription.
- Remove onPartialTranscriptRef and all three call sites from VoiceButton
  (auto-stop timeout, stopRecording, and the error catch path).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Whitespace-only changes produced by --fix during earlier lint runs.
No functional changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The --ours conflict resolutions during rebase accidentally kept an
intermediate (pre-voice) version of ChatInput.jsx. This commit restores
the correct final state: VoiceButton integration, voiceState tracking,
Space push-to-talk with 200ms threshold, transcribing overlay, and
recording glow modifier.

Also incorporates the upstream fix/issue-147 change: simplified the
textarea placeholder to use the prop directly (the isDisabled ternary
moved to ChatContainer).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@moritzbappert added the enhancement label (New feature or request) on Mar 22, 2026
@pluginslab merged commit 4e72071 into dev on Mar 22, 2026
1 of 4 checks passed
Owner

@pluginslab left a comment

Privacy-first voice input — Whisper Tiny in a Web Worker, push-to-talk with Space bar, iOS Safari fallback, 30s countdown, visual feedback. Plus ReAct crash fix for missing tool name. No conflicts with dev. LGTM!
