📝 Walkthrough

The change adjusts `Colbert._post_process_onnx_output` in `fastembed/late_interaction/colbert.py` to separate the document and non-document flows. When `is_doc` is False, outputs are yielded immediately without document-specific post-processing. When `is_doc` is True, `input_ids` and `attention_mask` are validated, the attention mask is applied to the embeddings, features are normalized, and the masked embeddings are yielded. No public API signatures were changed.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
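The two flows described above can be sketched as follows (a minimal, self-contained illustration with simplified names and a stand-in context class, not the actual fastembed code):

```python
import numpy as np
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class OnnxOutputContext:
    # Simplified stand-in for fastembed's ONNX output context
    model_output: np.ndarray    # (batch, seq_len, dim)
    input_ids: np.ndarray       # (batch, seq_len)
    attention_mask: np.ndarray  # (batch, seq_len)

def post_process(output: OnnxOutputContext, is_doc: bool) -> Iterator[np.ndarray]:
    # Query flow: yield raw embeddings and stop; no doc post-processing
    if not is_doc:
        yield from output.model_output
        return  # crucial: without this, execution falls through to the doc path

    # Document flow: mask, L2-normalize, then yield only kept tokens
    output.model_output *= np.expand_dims(output.attention_mask, 2)
    norm = np.linalg.norm(output.model_output, ord=2, axis=2, keepdims=True)
    output.model_output /= np.maximum(norm, 1e-12)
    for embedding, mask in zip(output.model_output, output.attention_mask):
        yield embedding[mask == 1]

ctx = OnnxOutputContext(
    model_output=np.random.rand(2, 4, 8),
    input_ids=np.zeros((2, 4), dtype=np.int64),
    attention_mask=np.array([[1, 1, 0, 0], [1, 1, 1, 0]]),
)
queries = list(post_process(ctx, is_doc=False))  # exactly one embedding per query
```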
Actionable comments posted: 1
🧹 Nitpick comments (1)
fastembed/late_interaction/colbert.py (1)
57-61: Vectorize mask construction to remove the nested Python loops.
The current O(B*L) Python loops will be a hotspot on long sequences/batches. Build a boolean mask with numpy instead.

Apply this diff to replace the loop:

```diff
-        for i, token_sequence in enumerate(output.input_ids):
-            for j, token_id in enumerate(token_sequence):  # type: ignore
-                if token_id in self.skip_list or token_id == self.pad_token_id:
-                    output.attention_mask[i, j] = 0
+        # Build boolean mask of tokens to drop (skip-list and pad), shape: (B, L)
+        skip_ids = (
+            np.fromiter(self.skip_list, dtype=output.input_ids.dtype)
+            if self.skip_list
+            else np.array([], dtype=output.input_ids.dtype)
+        )
+        skip_mask = (
+            np.isin(output.input_ids, skip_ids)
+            if skip_ids.size
+            else np.zeros_like(output.input_ids, dtype=bool)
+        )
+        pad_mask = (
+            (output.input_ids == self.pad_token_id)
+            if self.pad_token_id is not None
+            else np.zeros_like(output.input_ids, dtype=bool)
+        )
+        # Keep only tokens where original attention_mask == 1 and not in skip/pad
+        keep_mask = output.attention_mask.astype(bool, copy=False) & ~(skip_mask | pad_mask)
+        # Replace attention_mask with boolean keep mask for downstream steps
+        output.attention_mask = keep_mask
```
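A quick toy check of the vectorized approach (the `skip_list` and `pad_token_id` values here are hypothetical, not taken from the actual model config):

```python
import numpy as np

# Hypothetical values for illustration
skip_list = {101, 102}  # e.g. special-token ids to drop
pad_token_id = 0

input_ids = np.array([[101, 7, 8, 0],
                      [9, 102, 0, 0]], dtype=np.int64)
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]], dtype=np.int64)

# Vectorized replacement for the nested loops
skip_ids = np.fromiter(skip_list, dtype=input_ids.dtype)
skip_mask = np.isin(input_ids, skip_ids)      # True where token is in skip_list
pad_mask = input_ids == pad_token_id          # True where token is padding
keep_mask = attention_mask.astype(bool) & ~(skip_mask | pad_mask)
```

`np.isin` checks each token id against the whole skip set in one C-level pass, which is what removes the O(B*L) Python-loop overhead.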
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
fastembed/late_interaction/colbert.py (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on macos-latest test
- GitHub Check: Python 3.10.x on macos-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on macos-latest test
- GitHub Check: Python 3.12.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.9.x on macos-latest test
- GitHub Check: Python 3.11.x on windows-latest test
- GitHub Check: Python 3.9.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.9.x on windows-latest test
- GitHub Check: Python 3.10.x on windows-latest test
🔇 Additional comments (2)
fastembed/late_interaction/colbert.py (2)
51-56: Early-exit bug is correctly fixed; doc-only validation is now gated.
The control flow now prevents query paths from running doc-specific checks and processing. Looks good.
62-66: Resolve: in-place mutation of `OnnxOutputContext` fields is safe. All uses of `model_output` and `attention_mask` occur immediately within each `_post_process_onnx_output` implementation on freshly created context instances, and no downstream code retains or reuses the original arrays.
```python
output.model_output *= np.expand_dims(output.attention_mask, 2)
norm = np.linalg.norm(output.model_output, ord=2, axis=2, keepdims=True)
norm_clamped = np.maximum(norm, 1e-12)
output.model_output /= norm_clamped

for embedding, attention_mask in zip(output.model_output, output.attention_mask):
    yield embedding[attention_mask == 1]
```
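To see what this mask-then-normalize sequence does, here is the same logic on a tiny toy batch (illustrative values only):

```python
import numpy as np

model_output = np.array([[[3.0, 4.0], [1.0, 0.0], [5.0, 12.0]]])  # (1, 3, 2)
attention_mask = np.array([[1, 1, 0]])  # third token is masked out

model_output *= np.expand_dims(attention_mask, 2)  # zero out masked token
norm = np.linalg.norm(model_output, ord=2, axis=2, keepdims=True)
model_output /= np.maximum(norm, 1e-12)  # clamp avoids divide-by-zero on zeroed rows

# Keep only tokens where the mask is 1
kept = [emb[mask == 1] for emb, mask in zip(model_output, attention_mask)]
```

Each surviving token vector ends up unit-length ([3, 4] becomes [0.6, 0.8]), while the masked token is zeroed first so its zero norm is never divided by thanks to the clamp.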
🛠️ Refactor suggestion
Avoid dtype upcasts and reuse the boolean mask for both zeroing and yielding.
Multiplying float embeddings by an int64 mask risks dtype promotions and extra work. Use a boolean mask cast to the model_output dtype, and use dtype-aware epsilon.
Apply this diff:
```diff
-        output.model_output *= np.expand_dims(output.attention_mask, 2)
-        norm = np.linalg.norm(output.model_output, ord=2, axis=2, keepdims=True)
-        norm_clamped = np.maximum(norm, 1e-12)
-        output.model_output /= norm_clamped
-
-        for embedding, attention_mask in zip(output.model_output, output.attention_mask):
-            yield embedding[attention_mask == 1]
+        # Zero out masked tokens without changing dtype
+        mask_f = output.attention_mask[..., None].astype(output.model_output.dtype, copy=False)
+        output.model_output *= mask_f
+        # Normalize with a dtype-appropriate epsilon
+        norm = np.linalg.norm(output.model_output, ord=2, axis=2, keepdims=True)
+        eps = np.finfo(output.model_output.dtype).eps
+        output.model_output /= np.maximum(norm, eps)
+
+        # Yield only kept tokens
+        for embedding, keep_mask in zip(output.model_output, output.attention_mask):
+            yield embedding[keep_mask]
```

Committable suggestion skipped: line range outside the PR's diff.
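The dtype concern is easy to demonstrate: an out-of-place multiply of a `float32` array by an `int64` mask promotes the result to `float64`, whereas a mask cast to the embedding dtype does not (toy shapes for illustration):

```python
import numpy as np

emb = np.random.rand(2, 4, 8).astype(np.float32)
mask_int = np.ones((2, 4, 1), dtype=np.int64)

promoted = emb * mask_int  # float32 * int64 -> float64 under numpy promotion rules
mask_f = mask_int.astype(emb.dtype, copy=False)
kept_dtype = emb * mask_f  # stays float32
```

The in-place `*=` in the original code keeps the left-hand dtype either way, so the review's suggestion is mainly about avoiding the silent promotion (and the wider intermediate) that appears as soon as the multiply is written out-of-place.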
🤖 Prompt for AI Agents
In fastembed/late_interaction/colbert.py around lines 62 to 68, avoid
integer->float dtype upcasts and redundant mask comparisons: convert
attention_mask once to a boolean array, cast that boolean mask to the
model_output dtype for in-place masking, and use a dtype-matching epsilon when
clamping norms; then reuse the boolean mask for yielding rows instead of
comparing to 1. Concretely: create mask_bool =
output.attention_mask.astype(bool), mask_cast =
mask_bool.astype(output.model_output.dtype), multiply model_output by mask_cast
in-place, compute norm and norm_clamped using eps = np.array(1e-12,
dtype=output.model_output.dtype), divide by norm_clamped, and in the final loop
use for embedding, mask in zip(output.model_output, mask_bool): yield
embedding[mask].
```python
if not is_doc:
    for embedding in output.model_output:
        yield embedding
```
Nit: this could also be fixed by adding an empty return here (outside the loop but inside the if), which would make the code less nested.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test_late_interaction_embeddings.py (1)
222-224: Add a multi-query regression test to prevent future double-yields.

Also assert cardinality for multiple queries so a regression can't slip by when batching queries.
Apply this addition:
```diff
+@pytest.mark.parametrize("model_name", ["answerdotai/answerai-colbert-small-v1"])
+def test_query_embed_emits_one_per_query(model_name: str):
+    is_ci = os.getenv("CI")
+    model = LateInteractionTextEmbedding(model_name=model_name)
+    queries = ["q1", "q2"]
+    results = list(model.query_embed(queries))
+    assert len(results) == len(queries)
+    if is_ci:
+        delete_model_cache(model.model._model_dir)
```
📒 Files selected for processing (1)
tests/test_late_interaction_embeddings.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_late_interaction_embeddings.py (2)
- fastembed/late_interaction/colbert.py (2): `embed` (209-243), `query_embed` (245-255)
- fastembed/late_interaction/late_interaction_text_embedding.py (2): `embed` (117-139), `query_embed` (141-153)
🔇 Additional comments (3)
tests/test_late_interaction_embeddings.py (3)
178-178: LGTM (formatting only). The inline list is fine and keeps the test concise.

199-201: Good regression guard for the single-doc path. Materializing the generator and asserting a single output directly detects accidental double-yields from the embed path.

222-224: Correctly verifies a single result for the query path. This catches the original bug (extra yield after query post-processing).
We've introduced a bug in #547.

Due to the replacement of `return` with `yield`, after yielding embeddings in the `not is_doc` branch we no longer leave the function: execution continues and additionally yields embeddings with document post-processing applied.
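The failure mode is easy to reproduce with a stripped-down generator (illustrative only; the real method is `Colbert._post_process_onnx_output`):

```python
def buggy(items, is_doc):
    # After replacing `return` with `yield`, execution falls through...
    if not is_doc:
        for item in items:
            yield item
    # ...and the "document" post-processing also runs for queries
    for item in items:
        yield ("doc-processed", item)

def fixed(items, is_doc):
    if not is_doc:
        for item in items:
            yield item
        return  # the early return stops the generator here
    for item in items:
        yield ("doc-processed", item)

buggy_out = list(buggy([1, 2], is_doc=False))  # queries yielded twice: raw + doc-processed
fixed_out = list(fixed([1, 2], is_doc=False))  # exactly one result per query
```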