fix(embedding,api): unblock web UI by fixing pipeline() hang and read-path warmup #100
StumHuang wants to merge 3 commits into tickernelz:main
Conversation
Two independent issues caused `pipeline("feature-extraction", ...)` to hang
indefinitely (35s+) on first call, poisoning `initPromise` so every subsequent
`embed()` blocked forever. Symptom: web UI blank, `/api/search` returned
"Empty reply from server".
1. ONNX WASM threading deadlock
`@huggingface/transformers` v4 defaults `wasm.numThreads > 1`, but Node.js
and Bun lack `SharedArrayBuffer` support, so `onnxruntime-web` deadlocks
during pipeline init. Fixed by forcing `numThreads = 1` in
`ensureTransformersLoaded()`. Ref huggingface/transformers.js#488.
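As a minimal sketch of the override, with the nested `env` object mocked as a plain typed object (the real `@huggingface/transformers` export has more fields; this only illustrates the single-thread pin):

```typescript
// Hypothetical mock of the nested env object exported by @huggingface/transformers.
type OnnxEnv = { backends: { onnx: { wasm: { numThreads: number } } } };

const env: OnnxEnv = { backends: { onnx: { wasm: { numThreads: 4 } } } };

// Node and Bun lack SharedArrayBuffer by default, so multi-threaded
// onnxruntime-web WASM deadlocks during pipeline init; pin to one thread.
function forceSingleThread(e: OnnxEnv): void {
  e.backends.onnx.wasm.numThreads = 1;
}

forceSingleThread(env);
console.log(env.backends.onnx.wasm.numThreads); // 1
```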
2. dtype default mismatch
transformers v4's default `dtype` tries to load `model.onnx` (fp32, ~500MB).
The cached model directory only ships `model_quantized.onnx`, so `pipeline`
falls back to a network fetch from huggingface.co. In restricted
networks this fails with "Unable to connect". Fixed by passing
`dtype: "q8"` to the `pipeline()` options so the local quantized model is
used unconditionally.
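For illustration, a toy model of this dtype-to-artifact resolution. The file names come from this PR; `onnxFileFor` is a hypothetical stand-in, not the library's actual resolver:

```typescript
// Hypothetical mapping from dtype to the ONNX artifact the loader looks for.
function onnxFileFor(dtype?: string): string {
  return dtype === "q8" ? "model_quantized.onnx" : "model.onnx";
}

// The shipped cache only contains the quantized artifact.
const cachedFiles = new Set(["model_quantized.onnx"]);

function needsNetworkFetch(dtype?: string): boolean {
  return !cachedFiles.has(onnxFileFor(dtype));
}

console.log(needsNetworkFetch());     // true  -> fp32 fetch from huggingface.co
console.log(needsNetworkFetch("q8")); // false -> served from the local cache
```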
After both fixes, the pipeline is ready in ~2.3s and `/api/search` returns real
results (similarity 0.457, vecLen=768).
`handleListTags`, `handleListMemories`, and `handleStats` each awaited `embeddingService.warmup()` before serving. These handlers only read SQLite/sqlite-vec rows and never compute query embeddings, so the coupling was unnecessary. When `warmup()` stalled (or simply took a few seconds on cold start), the entire web UI went blank because every read endpoint blocked behind the embedding model load.

Removed the `warmup()` calls from the three read paths. `handleSearch` still warms up because it needs the query vector.

Net effect: `/api/stats`, `/api/tags`, and `/api/memories` now respond immediately even when the embedding model has not been loaded yet, so the dashboard can render before the first search query.
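A sketch of the decoupling, with hypothetical handler and service shapes standing in for the real ones in `src/services/api-handlers.ts`:

```typescript
type Row = { id: number; text: string };

// Stand-in for the SQLite/sqlite-vec read layer.
const db = { listMemories: (): Row[] => [{ id: 1, text: "hello world" }] };

let warmupCalls = 0;
const embeddingService = {
  // Stand-in for the (potentially slow) embedding model load.
  warmup: async (): Promise<void> => {
    warmupCalls++;
  },
};

// Read path: returns rows directly, no warmup() await.
async function handleListMemories(): Promise<Row[]> {
  return db.listMemories();
}

// Search path: still warms up, because it needs the query vector.
async function handleSearch(q: string): Promise<Row[]> {
  await embeddingService.warmup();
  return db.listMemories().filter((r) => r.text.includes(q));
}

async function demo(): Promise<void> {
  const rows = await handleListMemories();
  console.log(rows.length, warmupCalls); // 1 0  (no model load on the read path)
  await handleSearch("hello");
  console.log(warmupCalls); // 1
}
demo();
```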
Pull request overview
This PR fixes a startup deadlock in the embedding initialization path that caused the web UI to render blank and /api/search to time out, and it decouples read-only API endpoints from embedding warmup so they remain responsive on cold start.
Changes:
- Force `onnxruntime-web` WASM to single-thread mode to prevent `pipeline()` init hangs in Node/Bun environments lacking `SharedArrayBuffer`.
- Force quantized ONNX model loading via `dtype: "q8"` to avoid unwanted remote fp32 model downloads.
- Remove embedding warmup calls from read-only handlers (`stats`, `tags`, `list`), keeping warmup only where vector similarity is actually needed.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/services/embedding.ts` | Applies runtime safeguards to prevent `pipeline()` initialization hangs and ensures a quantized model is used. |
| `src/services/api-handlers.ts` | Removes unnecessary embedding warmup from read-only handlers to prevent request blocking on embedding init. |
```typescript
    (mod.env as any).backends.onnx.wasm.numThreads = 1;
  } catch (e) {
    log("Failed to set wasm.numThreads", { error: String(e) });
```
The numThreads override is described as critical to avoid an indefinite hang, but failures are currently swallowed (only logged) and init continues. If this assignment fails at runtime, pipeline() can still hang and the root issue returns; consider failing fast (throw) or explicitly validating/creating the nested env.backends.onnx.wasm object and asserting the final value before proceeding.
Suggested change:

```diff
-    (mod.env as any).backends.onnx.wasm.numThreads = 1;
-  } catch (e) {
-    log("Failed to set wasm.numThreads", { error: String(e) });
+    const envWithBackends = mod.env as any;
+    envWithBackends.backends ??= {};
+    envWithBackends.backends.onnx ??= {};
+    envWithBackends.backends.onnx.wasm ??= {};
+    envWithBackends.backends.onnx.wasm.numThreads = 1;
+    if (envWithBackends.backends.onnx.wasm.numThreads !== 1) {
+      throw new Error("Failed to verify wasm.numThreads override");
+    }
+  } catch (e) {
+    log("Failed to set wasm.numThreads", { error: String(e) });
+    throw new Error(`Failed to initialize transformers WASM threading configuration: ${String(e)}`);
```
```diff
   const { pipeline } = await ensureTransformersLoaded();
   this.pipe = await pipeline("feature-extraction", CONFIG.embeddingModel, {
     progress_callback: progressCallback,
-  });
+    // Force quantized ONNX. Default is fp32 model.onnx which transformers v4
+    // tries to download from huggingface.co; cache only ships model_quantized.onnx
+    // and HF is unreachable behind GFW, causing init to fail.
+    dtype: "q8",
+  } as any);
```
The pipeline options are cast to `any`, which removes compile-time verification for the newly added `dtype` knob. Since this option is required to prevent an unwanted remote download, it would be safer to use the proper pipeline options type (or `satisfies` a known options interface) so typos and unsupported keys are caught by typecheck.
Per Copilot review on PR tickernelz#100: the `as any` cast on `pipeline()` options silently dropped compile-time validation of the `dtype` key, which is the exact protection that prevents an unwanted fp32 `model.onnx` download. Use the official `PretrainedModelOptions` type so any future typo in `dtype` or other option keys fails at `tsc` time instead of at runtime.
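One way to get that compile-time check is `satisfies`, sketched here against a hypothetical `PipelineOptions` interface (the actual type exported by `@huggingface/transformers` may be named and shaped differently):

```typescript
// Hypothetical options interface standing in for the library's real type.
interface PipelineOptions {
  progress_callback?: (progress: unknown) => void;
  dtype?: "fp32" | "fp16" | "q8" | "q4";
}

const opts = {
  dtype: "q8",
} satisfies PipelineOptions;

// A misspelled key (e.g. `dtyep`) or an unsupported dtype value now fails
// at tsc time instead of being silently erased by `as any`.
console.log(opts.dtype); // q8
```

Unlike a plain annotation, `satisfies` preserves the narrow literal type of `opts` while still checking every key against the interface.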
Summary
Two bugs together caused the web UI at http://127.0.0.1:4747/ to render blank
and `/api/search` to time out:
- `pipeline("feature-extraction", ...)` hung indefinitely on the first call, poisoning `initPromise` so every subsequent `embed()` blocked.
- Read-only handlers (`stats`, `tags`, `list`) awaited embedding warmup even though they only read SQLite rows, so the hang above propagated to every read endpoint.

Root causes
1. `@huggingface/transformers` v4 defaults `wasm.numThreads > 1`, but Node/Bun lack `SharedArrayBuffer`, deadlocking `onnxruntime-web`.
Fix: `env.backends.onnx.wasm.numThreads = 1`. Ref: "wasm does not work on node right now with multiple threads" huggingface/transformers.js#488
2. transformers v4 defaults to loading `model.onnx` (fp32) when `dtype` is not specified. The shipped cache only has `model_quantized.onnx`, so init falls back to a network fetch from huggingface.co that fails in restricted networks.
Fix: pass `dtype: "q8"` to `pipeline()`.
3. `handleStats`/`handleListTags`/`handleListMemories` called `embeddingService.warmup()` despite never using the embedding model.
Fix: drop `warmup()` from those three handlers; `handleSearch` keeps it.

Verification
- `/api/search?q=hello` returns real results (similarity 0.457, vecLen=768).
- `/api/stats`, `/api/tags`, `/api/memories` respond immediately on cold start.
- `bun run typecheck` and Prettier (via lint-staged) pass on both commits.

Commits
- fix(embedding): prevent pipeline() hang in Node/Bun runtime
- fix(api): remove embedding warmup from read-only handlers