LCORE-1314: conversation compaction spike — design docs, PoC, and 50-… #1328

max-svistunov wants to merge 2 commits into lightspeed-core:main
Walkthrough

Adds conversation-history compaction: design docs and PoC results, a Python compaction module using Llama Stack summarization, integration into response handling to optionally compact conversations, and comprehensive unit tests covering compaction logic and flows.
Sequence Diagram(s)

sequenceDiagram
participant Client
participant ResponseHandler as Response Handler
participant Compaction as Compaction Logic
participant ConvStore as Conversation Storage
participant LlamaStack as Llama Stack Backend
Client->>ResponseHandler: user message + existing conversation_id
ResponseHandler->>Compaction: compact_conversation_if_needed(la_id, model, message_count)
Compaction->>ConvStore: fetch conversation items for la_id
ConvStore-->>Compaction: items list
alt item count > threshold
Compaction->>Compaction: split items -> old_items + recent_items
Compaction->>LlamaStack: create temp conversation for summarization
LlamaStack-->>Compaction: temp_conv_id
Compaction->>LlamaStack: send old_items + summarization prompt
LlamaStack-->>Compaction: summary_text
Compaction->>ConvStore: create new conversation seeded with [summary_text] + recent_items
ConvStore-->>Compaction: new_conv_id
Compaction-->>ResponseHandler: return new_conv_id
else below threshold
Compaction-->>ResponseHandler: return original conversation_id
end
ResponseHandler-->>Client: continue with (possibly new) conversation_id
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 7
🧹 Nitpick comments (2)
tests/unit/utils/test_compaction.py (1)
245-258: Consider adding `assert_not_called` for consistency.

Unlike `test_skips_below_threshold` (Line 243), this test doesn't verify that `conversations.items.list` was not called. Adding this assertion would make the tests consistent and more explicitly verify the early-return behavior.

Suggested addition:

      assert result == "conv_original"
+     mock_client.conversations.items.list.assert_not_called()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/utils/test_compaction.py` around lines 245 - 258, In test_skips_at_threshold, add an assertion that the client list method was not invoked to mirror test_skips_below_threshold: after calling compact_conversation_if_needed (in test_skips_at_threshold) and asserting the return equals "conv_original", call mock_client.conversations.items.list.assert_not_called() (or mock_client.conversations.items.list.assert_not_awaited() if you prefer to specifically assert it wasn't awaited) to explicitly verify the early-return behavior of compact_conversation_if_needed.

src/utils/compaction.py (1)
49-64: Text extraction differs from canonical pattern in `responses.py`.

The canonical `_extract_text_from_content` in `src/utils/responses.py` (lines 1127-1148) checks `part.type` for `input_text`, `output_text`, and `refusal` content types. This simpler implementation works for the PoC but may miss content in production if parts have different structures.

For production, consider aligning with or reusing the canonical extraction logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/utils/compaction.py` around lines 49 - 64, The _extract_message_text implementation is a simplified extractor that can miss parts handled by the canonical _extract_text_from_content; update _extract_message_text to reuse or mirror that canonical logic by checking each part's type (handle 'input_text', 'output_text', 'refusal') as well as existing cases (string, dict with "text", object with .text), concatenating extracted texts into a single string and falling back to str(content) when nothing matches; specifically modify the function named _extract_message_text to call or inline the same extraction rules as _extract_text_from_content so all content shapes are covered.
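As a hedged sketch of what an aligned extractor could look like, based only on the part types named in the comment above (`input_text`, `output_text`, `refusal`); the real `_extract_text_from_content` in `responses.py` may handle more shapes, and `extract_message_text` here is illustrative:

```python
from typing import Any

def extract_message_text(content: Any) -> str:
    """Best-effort text extraction across the part types named in the review."""
    if isinstance(content, str):
        return content
    parts: list[str] = []
    for part in content or []:
        # Support both dict-shaped and attribute-shaped parts.
        if isinstance(part, dict):
            part_type, text = part.get("type"), part.get("text")
        else:
            part_type, text = getattr(part, "type", None), getattr(part, "text", None)
        # Part types taken from the review comment; real code may cover more.
        if part_type in ("input_text", "output_text", "refusal") and text:
            parts.append(text)
    # Fall back to str(content) when nothing matched, as the prompt suggests.
    return "".join(parts) or str(content)
```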
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/conversation-compaction/conversation-compaction-spike.md`:
- Line 11: The reviewer handle in the document is inconsistent: replace all
occurrences of "@ptisnovs" with the requested reviewer "@tisnik" (notably the
mentions currently at the header and the decision entries such as the mention
near the top and the one around the decision block referenced on lines ~11 and
~86) so the requested reviewers match the PR objectives; ensure you update every
instance of the old handle to the new handle to avoid routing confusion.
- Line 25: The intra-document markdown links using fragments like
`#design-alternatives`, `#llama-stack-upstream`, and `#decisions-technical` are
not matching the generated heading IDs; search the document for those exact
fragment strings and replace each with the precise generated anchor created by
your markdown renderer (match capitalization, spacing-to-hyphen rules, and any
punctuation handling) so links point to the actual headings (e.g., update the
link target to the exact slug produced for the "Design alternatives", "Llama
Stack (upstream)", and "Decisions - Technical" headings); also fix the same
fragments used elsewhere in the file to ensure all intra-doc links resolve.
In `@docs/design/conversation-compaction/conversation-compaction.md`:
- Line 8: Update the broken spike-doc link in conversation-compaction.md:
replace the existing reference to docs/design/conversation-compaction-spike.md
with the correct path under the conversation-compaction folder
(docs/design/conversation-compaction/conversation-compaction-spike.md) so the
table row that currently contains the link text "Spike doc:
`docs/design/conversation-compaction-spike.md`" points to the actual file
location.
- Around line 348-349: The guidance contains malformed code/markdown for config
classes: replace the broken fragments so it reads that all config classes should
extend ConfigurationBase which sets extra="forbid", use Field(...) with
defaults, title, and description, and add `@model_validator`(mode="after") for
cross-field validation when needed; update references to ConfigurationBase,
extra, Field, and `@model_validator` to use correct punctuation and valid syntax
so the instructions are unambiguous.
In `@docs/design/conversation-compaction/poc-results/06-probe-responses.txt`:
- Around line 19-20: The "Network Policies" sentence in Turn 11 is truncated
("communicate with each") and leaves the artifact unreliable; edit the Turn 11
probe response to either finish the sentence (e.g., "communicate with each other
or external services, limiting traffic between pods and namespaces") or
explicitly mark the paragraph as intentionally truncated (e.g., add
"[TRUNCATED]" and a brief note), updating the "Network Policies" line in the
Turn 11 section so the document is complete and unambiguous.
In `@src/utils/compaction.py`:
- Around line 109-145: The _summarize_conversation function must catch Llama
Stack API errors around client.conversations.create and client.responses.create;
wrap those calls in a try/except that handles APIConnectionError and
APIStatusError (import them from llama_stack_client), log the error with context
(e.g., "Failed to create conversation" / "Failed to create response for
summarization") and re-raise or return a fallback as appropriate; ensure you
avoid leaving an orphaned conversation by deleting the temporary conversation
(use client.conversations.delete(summary_conv.id)) if response creation fails
(perform delete in the except block or a finally that checks summary_conv and
whether response was created); keep the existing summary extraction from
response.output unchanged.
In `@src/utils/responses.py`:
- Around line 300-307: The call to compact_conversation_if_needed (when
user_conversation is truthy) can raise Llama Stack errors but is not translated
to the API's HTTPException like other Llama Stack operations; wrap the
compact_conversation_if_needed call in a try/except that catches the Llama Stack
error types used elsewhere (e.g., APIConnectionError or the project's
LlamaStackError wrapper) and re-raise an appropriate fastapi.HTTPException with
the same status code/message mapping used in this module; ensure you reference
compact_conversation_if_needed, llama_stack_conv_id, and user_conversation so
the behavior matches other error handling in responses.py.
---
Nitpick comments:
In `@src/utils/compaction.py`:
- Around line 49-64: The _extract_message_text implementation is a simplified
extractor that can miss parts handled by the canonical
_extract_text_from_content; update _extract_message_text to reuse or mirror that
canonical logic by checking each part's type (handle 'input_text',
'output_text', 'refusal') as well as existing cases (string, dict with "text",
object with .text), concatenating extracted texts into a single string and
falling back to str(content) when nothing matches; specifically modify the
function named _extract_message_text to call or inline the same extraction rules
as _extract_text_from_content so all content shapes are covered.
In `@tests/unit/utils/test_compaction.py`:
- Around line 245-258: In test_skips_at_threshold, add an assertion that the
client list method was not invoked to mirror test_skips_below_threshold: after
calling compact_conversation_if_needed (in test_skips_at_threshold) and
asserting the return equals "conv_original", call
mock_client.conversations.items.list.assert_not_called() (or
mock_client.conversations.items.list.assert_not_awaited() if you prefer to
specifically assert it wasn't awaited) to explicitly verify the early-return
behavior of compact_conversation_if_needed.
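The pattern this prompt describes, reduced to a self-contained sketch: `mock_client` and the `conversations.items.list` path mirror the mocks named in the review, while `early_return_path` is a stand-in for the below-threshold branch of `compact_conversation_if_needed`, not the real function.

```python
from unittest.mock import AsyncMock, MagicMock

# Stub client shaped like the test's mock, not the real Llama Stack client.
mock_client = MagicMock()
mock_client.conversations.items.list = AsyncMock()

def early_return_path(conv_id: str) -> str:
    # Stand-in for the early-return (at/below threshold) branch:
    # the original conversation id comes back and no items are listed.
    return conv_id

result = early_return_path("conv_original")
assert result == "conv_original"
# The explicit negative assertion the review asks for:
mock_client.conversations.items.list.assert_not_called()
```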
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5bb2811f-d69d-4f62-951a-a2cbb2af3edd
📒 Files selected for processing (13)
docs/design/conversation-compaction/conversation-compaction-spike.md
docs/design/conversation-compaction/conversation-compaction.md
docs/design/conversation-compaction/poc-results/01-analysis.txt
docs/design/conversation-compaction/poc-results/02-conversation-log.txt
docs/design/conversation-compaction/poc-results/03-token-usage.txt
docs/design/conversation-compaction/poc-results/04-compaction-events.json
docs/design/conversation-compaction/poc-results/05-summaries-extracted.txt
docs/design/conversation-compaction/poc-results/06-probe-responses.txt
docs/design/conversation-compaction/poc-results/07-conversation-items-per-compaction.txt
docs/design/conversation-compaction/poc-results/08-all-responses-raw.json
src/utils/compaction.py
src/utils/responses.py
tests/unit/utils/test_compaction.py
| **PoC validation**: A working proof-of-concept was built and tested with 50 queries across 4 compaction cycles. Results in [PoC results](#poc-results). |

| # Decisions for @ptisnovs and @sbunciak |
Reviewer handle appears inconsistent with requested reviewers.
Line 11 and Line 86 reference @ptisnovs; the PR objectives request confirmations from @tisnik. Please align naming to avoid decision-routing confusion.
Also applies to: 86-86
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/conversation-compaction/conversation-compaction-spike.md` at line
11, The reviewer handle in the document is inconsistent: replace all occurrences
of "@ptisnovs" with the requested reviewer "@tisnik" (notably the mentions
currently at the header and the decision entries such as the mention near the
top and the one around the decision block referenced on lines ~11 and ~86) so
the requested reviewers match the PR objectives; ensure you update every
instance of the old handle to the new handle to avoid routing confusion.
| All config classes extend `ConfigurationBase` which sets `extra`"forbid"`. |
| Use =Field()` with defaults, title, and description. Add `@model_validator(mode`"after")= for cross-field validation if needed. |
Correct malformed config-pattern guidance snippet.
Line 348 and Line 349 contain broken markdown/code (extra"forbid", =Field(), @model_validator(mode"after")=). Please fix these to valid syntax so implementation instructions are unambiguous.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/conversation-compaction/conversation-compaction.md` around lines
348 - 349, The guidance contains malformed code/markdown for config classes:
replace the broken fragments so it reads that all config classes should extend
ConfigurationBase which sets extra="forbid", use Field(...) with defaults,
title, and description, and add `@model_validator`(mode="after") for cross-field
validation when needed; update references to ConfigurationBase, extra, Field,
and `@model_validator` to use correct punctuation and valid syntax so the
instructions are unambiguous.
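For concreteness, here is one runnable rendering of the corrected guidance, assuming Pydantic v2; `ConfigurationBase` matches the name in the doc, while `CompactionConfiguration` and its fields are hypothetical examples, not the project's actual config classes.

```python
from pydantic import BaseModel, ConfigDict, Field, model_validator

class ConfigurationBase(BaseModel):
    # The base class forbids unknown fields, as the guidance intends.
    model_config = ConfigDict(extra="forbid")

class CompactionConfiguration(ConfigurationBase):
    # Field(...) with defaults, title, and description, per the guidance.
    enabled: bool = Field(
        default=False, title="Enabled", description="Turn compaction on or off."
    )
    threshold: int = Field(
        default=20, title="Threshold", description="Item count that triggers compaction."
    )
    keep_recent: int = Field(
        default=5, title="Keep recent", description="Items carried over verbatim."
    )

    @model_validator(mode="after")
    def check_counts(self) -> "CompactionConfiguration":
        # Cross-field validation, as the guidance allows when needed.
        if self.keep_recent >= self.threshold:
            raise ValueError("keep_recent must be smaller than threshold")
        return self
```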
| 5. **Network Policies**: Kubernetes allows the enforcement of network policies to control how pods communicate with each |
Complete the truncated probe response in Turn 11.
Line 19 ends mid-sentence (“communicate with each”), and no continuation follows. Please restore the missing text or explicitly mark this section as intentionally truncated so the artifact remains reliable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/conversation-compaction/poc-results/06-probe-responses.txt`
around lines 19 - 20, The "Network Policies" sentence in Turn 11 is truncated
("communicate with each") and leaves the artifact unreliable; edit the Turn 11
probe response to either finish the sentence (e.g., "communicate with each other
or external services, limiting traffic between pods and namespaces") or
explicitly mark the paragraph as intentionally truncated (e.g., add
"[TRUNCATED]" and a brief note), updating the "Network Policies" line in the
Turn 11 section so the document is complete and unambiguous.
async def _summarize_conversation(
    client: AsyncLlamaStackClient,
    model: str,
    items: list[Any],
) -> str:
    """Use the LLM to produce a concise summary of *items*.

    Parameters:
        client: Llama Stack client.
        model: Full model ID (``provider/model``).
        items: Conversation items to summarize.

    Returns:
        Summary text produced by the LLM.
    """
    conversation_text = _format_conversation_for_summary(items)
    prompt = SUMMARIZATION_PROMPT.format(conversation_text=conversation_text)

    summary_conv = await client.conversations.create(metadata={})
    response = await client.responses.create(
        input=prompt,
        model=model,
        conversation=summary_conv.id,
        store=False,
    )

    summary_parts: list[str] = []
    for output_item in response.output:
        content = getattr(output_item, "content", None)
        if content is None:
            continue
        for content_part in content:
            text = getattr(content_part, "text", None)
            if text:
                summary_parts.append(text)

    return "".join(summary_parts) or "No summary generated."
Missing error handling for Llama Stack API calls.
Per coding guidelines, APIConnectionError from Llama Stack should be handled in API operations. The calls to client.conversations.create (Line 127) and client.responses.create (Lines 128-133) could fail, leaving the caller without graceful error handling.
Additionally, if responses.create fails, the temporary conversation created at Line 127 becomes orphaned.
Suggested error handling pattern (from responses.py):

from llama_stack_client import APIConnectionError, APIStatusError

async def _summarize_conversation(...) -> str:
    # ... format prompt ...
    try:
        summary_conv = await client.conversations.create(metadata={})
        response = await client.responses.create(
            input=prompt,
            model=model,
            conversation=summary_conv.id,
            store=False,
        )
    except APIConnectionError as e:
        logger.error("Failed to connect to Llama Stack for summarization: %s", e)
        raise  # or return a fallback / re-raise wrapped
    except APIStatusError as e:
        logger.error("Llama Stack returned error during summarization: %s", e)
        raise
    # ... extract summary ...

As per coding guidelines: "Handle APIConnectionError from Llama Stack in API operations".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/utils/compaction.py` around lines 109 - 145, The _summarize_conversation
function must catch Llama Stack API errors around client.conversations.create
and client.responses.create; wrap those calls in a try/except that handles
APIConnectionError and APIStatusError (import them from llama_stack_client), log
the error with context (e.g., "Failed to create conversation" / "Failed to
create response for summarization") and re-raise or return a fallback as
appropriate; ensure you avoid leaving an orphaned conversation by deleting the
temporary conversation (use client.conversations.delete(summary_conv.id)) if
response creation fails (perform delete in the except block or a finally that
checks summary_conv and whether response was created); keep the existing summary
extraction from response.output unchanged.
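The orphan-cleanup flow this prompt asks for can be sketched with a stub client; plain `Exception` stands in for `APIConnectionError`/`APIStatusError` so the shape stays runnable without `llama_stack_client` installed, and `summarize_with_cleanup` plus `FakeClient` are illustrative names, not the PR's code.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def summarize_with_cleanup(client, model: str, prompt: str) -> str:
    """Create a temp conversation, summarize into it, delete it on failure."""
    summary_conv = await client.conversations.create(metadata={})
    try:
        response = await client.responses.create(
            input=prompt, model=model, conversation=summary_conv.id, store=False
        )
    except Exception:
        logger.error("Failed to create response for summarization")
        # Avoid orphaning the temporary conversation.
        await client.conversations.delete(summary_conv.id)
        raise
    return response.text

# Minimal in-memory stub standing in for the Llama Stack client.
class FakeClient:
    def __init__(self, fail: bool) -> None:
        self.fail = fail
        self.deleted: list[str] = []
        self.conversations = self  # route conversations.* to self
        self.responses = self      # route responses.* to self

    async def create(self, **kwargs):
        if "input" in kwargs:  # responses.create path
            if self.fail:
                raise RuntimeError("summarization backend down")
            return type("R", (), {"text": "summary"})()
        return type("Conv", (), {"id": "tmp-conv"})()  # conversations.create path

    async def delete(self, conversation_id: str) -> None:
        self.deleted.append(conversation_id)

ok_client = FakeClient(fail=False)
summary = asyncio.run(summarize_with_cleanup(ok_client, "m", "p"))

bad_client = FakeClient(fail=True)
try:
    asyncio.run(summarize_with_cleanup(bad_client, "m", "p"))
except RuntimeError:
    pass
```

On the failure path the temporary conversation id is deleted before the error propagates, which is the behavior the prompt requires.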
# Check if conversation needs compaction (PoC)
if user_conversation:
    llama_stack_conv_id = await compact_conversation_if_needed(
        client=client,
        llama_stack_conv_id=llama_stack_conv_id,
        model=model,
        message_count=user_conversation.message_count,
    )
Wrap compaction call with the same API error translation used elsewhere.
Lines 302-307 can raise Llama Stack errors through compact_conversation_if_needed(), but this branch currently does not map them to your standard HTTPException responses.
💡 Suggested fix:

      # Check if conversation needs compaction (PoC)
      if user_conversation:
-         llama_stack_conv_id = await compact_conversation_if_needed(
-             client=client,
-             llama_stack_conv_id=llama_stack_conv_id,
-             model=model,
-             message_count=user_conversation.message_count,
-         )
+         try:
+             llama_stack_conv_id = await compact_conversation_if_needed(
+                 client=client,
+                 llama_stack_conv_id=llama_stack_conv_id,
+                 model=model,
+                 message_count=user_conversation.message_count,
+             )
+         except APIConnectionError as e:
+             error_response = ServiceUnavailableResponse(
+                 backend_name="Llama Stack",
+                 cause=str(e),
+             )
+             raise HTTPException(**error_response.model_dump()) from e
+         except APIStatusError as e:
+             error_response = handle_known_apistatus_errors(e, model)
+             raise HTTPException(**error_response.model_dump()) from e

As per coding guidelines "Handle APIConnectionError from Llama Stack in API operations".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/utils/responses.py` around lines 300 - 307, The call to
compact_conversation_if_needed (when user_conversation is truthy) can raise
Llama Stack errors but is not translated to the API's HTTPException like other
Llama Stack operations; wrap the compact_conversation_if_needed call in a
try/except that catches the Llama Stack error types used elsewhere (e.g.,
APIConnectionError or the project's LlamaStackError wrapper) and re-raise an
appropriate fastapi.HTTPException with the same status code/message mapping used
in this module; ensure you reference compact_conversation_if_needed,
llama_stack_conv_id, and user_conversation so the behavior matches other error
handling in responses.py.
Actionable comments posted: 3
🧹 Nitpick comments (2)
docs/design/conversation-compaction/conversation-compaction-spike.md (2)
661-661: Significant architectural change warrants emphasis.

Line 661 states "Call Llama Stack with explicit input (NOT =conversation_id=)" and Line 671 notes "lightspeed-stack stops using Llama Stack's `conversation` parameter." This is a major architectural shift from the current design, in which Llama Stack manages conversation history retrieval.

Consider adding a dedicated subsection or callout highlighting this change, its rationale (full control over LLM input), and implications (lightspeed-stack becomes responsible for conversation assembly). This will help reviewers and implementers understand the scope of the change.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/conversation-compaction/conversation-compaction-spike.md` at line 661, Add a dedicated subsection or highlighted callout in the document that explicitly calls out the architectural change described by "Call Llama Stack with explicit input (NOT =conversation_id=)" and "lightspeed-stack stops using Llama Stack's `conversation` parameter"; in that subsection state the rationale (lightspeed-stack takes full control over LLM input/assembly) and list concrete implications (responsibility for conversation assembly, retrieval, ordering, and any API/contract changes) so implementers understand scope and migration steps.
726-731: Consider highlighting cost implications more prominently.

The example shows ~74,000 tokens per compaction call (70K input + 2-4K output). At typical LLM pricing, this represents material cost per compaction event. For a frequently-compacting conversation, costs could accumulate significantly.
Consider adding a concrete cost estimate (e.g., "$0.74 per compaction at $10/1M tokens for GPT-4") and noting that high-usage conversations might trigger compaction multiple times, making cost monitoring important for deployments.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/design/conversation-compaction/conversation-compaction-spike.md` around lines 726 - 731, Update the "Example for a 128K context window at 70% threshold" block to call out cost implications explicitly: compute a concrete cost estimate for the ~74,000-token compaction (e.g., "$0.74 per compaction at $10/1M tokens for GPT-4") and add a short note that repeated compactions can accumulate costs and should be monitored or throttled in high-usage conversations; place this cost estimate immediately after the bullet list for the 128K example and ensure the text references the same example wording ("128K context window at 70% threshold" and the "~74,000 total tokens" summary) so readers can see cost per compaction and the recommendation to monitor compaction frequency.
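The arithmetic behind the suggested estimate, using the review's illustrative $10-per-million-token rate (an assumed figure for the example, not a real price list):

```python
# Cost of one compaction call at the review's example rate.
TOKENS_PER_COMPACTION = 74_000          # ~70K input + 2-4K output, per the review
PRICE_PER_MILLION_TOKENS = 10.00        # USD, assumed flat rate for simplicity

cost_per_compaction = TOKENS_PER_COMPACTION / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"${cost_per_compaction:.2f} per compaction")  # $0.74 per compaction
```

A conversation that compacts several times would multiply this accordingly, which is why the review recommends monitoring compaction frequency.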
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/design/conversation-compaction/conversation-compaction-spike.md`:
- Line 388: The doc claim mismatches the actual change: verify the insertion in
src/utils/responses.py is the same as described in the PR objectives (the hook
in prepare_responses_params()); either update the doc line to say "8-line
insertion" if the code contains 8 new lines, or change the code to include the
full 10-line hook as described — then update the text in
conversation-compaction-spike.md to reflect the accurate line count and ensure
the symbol prepare_responses_params() is referenced consistently.
- Line 153: Change the seven JIRA headings from third-level to second-level
headings so the document follows proper markdown hierarchy: replace each "###
LCORE-????: Add token estimation to lightspeed-stack" style heading (and the
other six JIRA headings in the same pattern) with "##" instead of "###" so they
sit under the top-level "Proposed JIRAs" heading; ensure all instances of the
JIRA headings (the exact heading texts at lines shown in the review) are updated
consistently.
- Line 25: Update the broken intra-document markdown links: replace any link
pointing to "#design-alternatives" with
"#design-alternatives-for-lightspeed-stack", change "#llama-stack-upstream"
occurrences to "#appendix-a-llama-stack-upstream-status" (including the repeated
instance near the appendix), and remove or replace the non-existent
"#decisions-technical" reference with the correct existing heading anchor or add
the missing "Decisions — Technical" heading to the document so the link
resolves.
---
Nitpick comments:
In `@docs/design/conversation-compaction/conversation-compaction-spike.md`:
- Line 661: Add a dedicated subsection or highlighted callout in the document
that explicitly calls out the architectural change described by "Call Llama
Stack with explicit input (NOT =conversation_id=)" and "lightspeed-stack stops
using Llama Stack's `conversation` parameter"; in that subsection state the
rationale (lightspeed-stack takes full control over LLM input/assembly) and list
concrete implications (responsibility for conversation assembly, retrieval,
ordering, and any API/contract changes) so implementers understand scope and
migration steps.
- Around line 726-731: Update the "Example for a 128K context window at 70%
threshold" block to call out cost implications explicitly: compute a concrete
cost estimate for the ~74,000-token compaction (e.g., "$0.74 per compaction at
$10/1M tokens for GPT-4") and add a short note that repeated compactions can
accumulate costs and should be monitored or throttled in high-usage
conversations; place this cost estimate immediately after the bullet list for
the 128K example and ensure the text references the same example wording ("128K
context window at 70% threshold" and the "~74,000 total tokens" summary) so
readers can see cost per compaction and the recommendation to monitor compaction
frequency.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7f012f04-87ce-475a-8eea-260918df70b4
📒 Files selected for processing (1)
docs/design/conversation-compaction/conversation-compaction-spike.md
| B | Tiered memory (MemGPT-style) | High | Excellent |
| C | Delegate to provider-native APIs | Low-Med | Varies |

See [Design alternatives](#design-alternatives) for full pros/cons of each.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that markdown link fragments match the generated heading IDs
# Extract all internal links and headings from the document
cat docs/design/conversation-compaction/conversation-compaction-spike.md | \
gawk '
/^#+ / {
    heading = tolower($0)
    gsub(/^#+ /, "", heading)
    gsub(/[^a-z0-9- ]/, "", heading)
    gsub(/ /, "-", heading)
    print "HEADING: " heading
}
/]\(#[^)]+\)/ {
    match($0, /]\(#([^)]+)\)/, arr)
    if (arr[1]) print "LINK: " arr[1]
}
'

Repository: lightspeed-core/lightspeed-stack
Length of output: 134
🏁 Script executed:
#!/bin/bash
# Verify markdown link fragments match heading IDs
# Read the file and use Python to extract headings and links
python3 << 'EOF'
import re
import os

file_path = "docs/design/conversation-compaction/conversation-compaction-spike.md"
if not os.path.exists(file_path):
    print(f"File not found: {file_path}")
    exit(1)

with open(file_path, 'r') as f:
    content = f.read()

# Extract all headings and generate their fragment IDs
headings = {}
for match in re.finditer(r'^(#{1,6})\s+(.+)$', content, re.MULTILINE):
    heading_text = match.group(2).strip()
    # Standard markdown ID generation: lowercase, spaces to hyphens, remove special chars
    fragment_id = re.sub(r'[^\w\s-]', '', heading_text.lower())
    fragment_id = re.sub(r'\s+', '-', fragment_id)
    line_num = content[:match.start()].count('\n') + 1
    headings[fragment_id] = {'text': heading_text, 'line': line_num}
    print(f"HEADING: Line {line_num} -> #{fragment_id} (text: '{heading_text}')")

print("\n" + "="*80 + "\n")

# Extract all internal links
links = {}
for match in re.finditer(r'\[([^\]]+)\]\(#([^)]+)\)', content):
    link_text = match.group(1)
    link_target = match.group(2)
    line_num = content[:match.start()].count('\n') + 1
    links[line_num] = {'text': link_text, 'target': link_target}
    # Check if link exists in headings
    exists = link_target in headings
    status = "✓ FOUND" if exists else "✗ NOT FOUND"
    print(f"LINK: Line {line_num} -> #{link_target} [{status}]")
    if not exists and link_target:
        # Find similar headings
        similar = [h for h in headings.keys() if link_target.lower() in h or h in link_target.lower()]
        if similar:
            print(f"    Possible matches: {similar}")
EOF

Repository: lightspeed-core/lightspeed-stack
Length of output: 6831
Fix broken intra-document links in markdown.
Line 25: #design-alternatives should be #design-alternatives-for-lightspeed-stack
Line 84: #llama-stack-upstream should be #appendix-a-llama-stack-upstream-status
Additionally, Line 586 has the same issue as Line 84. Line 648 references #decisions-technical which does not exist in the document.
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 25-25: Link fragments should be valid
(MD051, link-fragments)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/conversation-compaction/conversation-compaction-spike.md` at line
25, Update the broken intra-document markdown links: replace any link pointing
to "#design-alternatives" with "#design-alternatives-for-lightspeed-stack",
change "#llama-stack-upstream" occurrences to
"#appendix-a-llama-stack-upstream-status" (including the repeated instance near
the appendix), and remove or replace the non-existent "#decisions-technical"
reference with the correct existing heading anchor or add the missing "Decisions
— Technical" heading to the document so the link resolves.
Each JIRA includes an agentic tool instruction that an assignee can optionally feed to Claude Code or similar.

### LCORE-????: Add token estimation to lightspeed-stack
Fix heading hierarchy.
The heading structure jumps from h1 (# Proposed JIRAs) directly to h3 (### LCORE-????), skipping h2. This violates standard markdown heading hierarchy (MD001). Consider using h2 (##) for individual JIRA entries to maintain proper document structure.
📝 Suggested fix
# Proposed JIRAs
Each JIRA includes an agentic tool instruction that an assignee can optionally feed to Claude Code or similar.
-### LCORE-????: Add token estimation to lightspeed-stack
+## LCORE-????: Add token estimation to lightspeed-stack

Apply the same change to all 7 JIRA headings (lines 153, 179, 207, 234, 262, 286, 313).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 # Proposed JIRAs
 Each JIRA includes an agentic tool instruction that an assignee can optionally feed to Claude Code or similar.
-### LCORE-????: Add token estimation to lightspeed-stack
+## LCORE-????: Add token estimation to lightspeed-stack
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 153-153: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3
(MD001, heading-increment)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/conversation-compaction/conversation-compaction-spike.md` at line
153, Change the seven JIRA headings from third-level to second-level headings so
the document follows proper markdown hierarchy: replace each "### LCORE-????:
Add token estimation to lightspeed-stack" style heading (and the other six JIRA
headings in the same pattern) with "##" instead of "###" so they sit under the
top-level "Proposed JIRAs" heading; ensure all instances of the JIRA headings
(the exact heading texts at lines shown in the review) are updated consistently.
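The MD001 rule flagged here is easy to check mechanically. A minimal, hypothetical checker sketch — not part of the PR, and markdownlint itself handles many more cases:

```python
import re

def heading_jumps(markdown: str) -> list[tuple[int, int, int]]:
    """Return (line_no, prev_level, level) for every ATX heading that
    increments by more than one level at a time (markdownlint MD001)."""
    jumps = []
    prev = 0
    for i, line in enumerate(markdown.splitlines(), start=1):
        m = re.match(r"^(#{1,6})\s", line)
        if not m:
            continue
        level = len(m.group(1))
        if prev and level > prev + 1:
            jumps.append((i, prev, level))
        prev = level
    return jumps

doc = "# Proposed JIRAs\n\n### LCORE-????: Add token estimation\n"
print(heading_jumps(doc))  # [(3, 1, 3)] — h1 followed directly by h3
```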
> ## PoC code
>
> - `src/utils/compaction.py` — compaction logic (trigger, split, summarize, new conversation).
> - `src/utils/responses.py` — 8-line insertion calling `compact_conversation_if_needed()`.
Minor discrepancy in line count.
This line states "8-line insertion", but the PR objectives describe a "10-line hook in prepare_responses_params()". Please verify the correct line count for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/design/conversation-compaction/conversation-compaction-spike.md` at line
388, The doc claim mismatches the actual change: verify the insertion in
src/utils/responses.py is the same as described in the PR objectives (the hook
in prepare_responses_params()); either update the doc line to say "8-line
insertion" if the code contains 8 new lines, or change the code to include the
full 10-line hook as described — then update the text in
conversation-compaction-spike.md to reflect the accurate line count and ensure
the symbol prepare_responses_params() is referenced consistently.
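For context, the compaction flow this hook triggers (compact past a threshold, split items into old and recent, summarize the old ones, seed a new conversation) can be sketched roughly as follows. All names, the threshold, and the buffer size are illustrative assumptions; the PR's actual `compact_conversation_if_needed()` also calls Llama Stack to produce the summary, which is omitted here:

```python
# Hypothetical sketch of the split step only. The threshold and buffer
# values are made up for illustration, not taken from the PR.
COMPACTION_THRESHOLD = 20  # assumed trigger: compact past this many items
KEEP_RECENT = 5            # assumed buffer of recent items kept verbatim

def split_for_compaction(items: list[str]) -> tuple[list[str], list[str]]:
    """Split conversation items into (old_items, recent_items).

    Below the threshold nothing is summarized; above it, everything but
    the most recent KEEP_RECENT items becomes summarization input."""
    if len(items) <= COMPACTION_THRESHOLD:
        return [], items
    return items[:-KEEP_RECENT], items[-KEEP_RECENT:]

items = [f"msg-{i}" for i in range(25)]
old, recent = split_for_compaction(items)
print(len(old), len(recent))  # 20 5
```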
LCORE-1314: Conversation compaction spike
Spike deliverable for LCORE-1311 (Conversation History Summarization).
What's in this PR
Design docs (`docs/design/conversation-compaction/`):

- `conversation-compaction-spike.md` (use the "Outline" button) — the spike: research, design alternatives, PoC results, proposed JIRAs
- `conversation-compaction.md` (use the "Outline" button) — feature spec (requirements, use cases, architecture, implementation suggestions) assuming the recommended options are chosen -- changeable based on final decisions, will be kept in the repo long-term
- `poc-results/` — evidence from a 50-query experiment validating the PoC

PoC code:

- `src/utils/compaction.py` — compaction logic (trigger, split, summarize, new conversation)
- `src/utils/responses.py` — 10-line hook in `prepare_responses_params()`
- `tests/unit/utils/test_compaction.py` — 19 unit tests, all passing

Main findings
For reviewers
- `conversation-compaction-spike.md`: @tisnik Please review and confirm the 4 technical-detail decisions (conversation_id handling, truncated field, storage location, buffer zone calculation) in the "Technical decisions for @ptisnovs" section.
- @tisnik @sbunciak Please review the "Proposed JIRAs" section -- I will file the tickets once decisions are confirmed.

Doc structure note: The first three sections (Decisions for @ptisnovs and @sbunciak, Technical decisions for @ptisnovs, Proposed JIRAs) of the spike doc are where your input is needed. They link to background sections later in the doc (current architecture, API comparison, design alternatives, PoC results) — read those if you need more context on a specific point, but it is optional.

Closes # LCORE-1314
Related Issue # LCORE-1311
Summary by CodeRabbit
New Features
Documentation
Tests