
Feature/mlx kv cache#139

Open
mikedoise wants to merge 7 commits intomattt:mainfrom
Techopolis:feature/mlx-kv-cache

Conversation

@mikedoise

Hello, I've updated the MLX memory cache functions in the library to improve multi-session stability with MLX. I've also updated the package to use the latest mlx-swift version. Here are the changes:

  1. MLX KV cache reuse for incremental prefill — avoids re-encoding the full
    conversation history on each turn
  2. GPU memory management — GPUMemoryConfiguration with automatic RAM-scaled
    Metal buffer cache limits and reference-counted active/idle toggling

This is honestly my first time submitting a pull request to someone else's project, so please let me know if I need to make changes. Thank you for making such an amazing library.

mikedoise and others added 3 commits February 23, 2026 09:02
Persist KV caches across respond()/streamResponse() calls within the
same LanguageModelSession. On subsequent turns only the new tokens are
prefilled instead of re-encoding the entire conversation history,
dramatically reducing time to first token.

- Add maxKVSize, kvBits, kvGroupSize to GenerationOptions
- Add SessionCacheEntry store with NSMapTable weak keys
- Implement incremental prefill in streamResponse() and respond()
- Enhance prewarm() to prefill system prompt into KV cache

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
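The cache-hit bookkeeping described in this commit can be sketched in plain Swift. This is an illustrative stand-in, not the library's actual API: `SessionCacheEntry` here is a simplified struct, and the real code slices `MLXArray` tokens rather than a Swift array.

```swift
// Simplified stand-in for the real cache entry (hypothetical, for illustration).
struct SessionCacheEntry {
    var prefillTokenCount: Int
}

/// Returns the suffix of `fullTokens` that still needs prefilling, or the
/// whole sequence when the cache cannot be reused.
func tokensToPrefill(fullTokens: [Int32], cache: SessionCacheEntry?) -> ArraySlice<Int32> {
    guard let cache,
          cache.prefillTokenCount > 0,
          fullTokens.count > cache.prefillTokenCount
    else {
        return fullTokens[...]  // cache miss: re-encode everything
    }
    return fullTokens[cache.prefillTokenCount...]  // cache hit: only new tokens
}
```

On a cache hit only the delta since the last turn is encoded, which is what keeps time to first token roughly constant per turn instead of growing with conversation length.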
- Add GPUMemoryConfiguration struct with .automatic (RAM-scaled) and
  .unconstrained presets for controlling Metal buffer pool limits
- Add GPUMemoryManager singleton with reference-counted active/idle
  toggling — cache stays high during concurrent generations, drops to
  idle limit only when all sessions complete
- Wrap respond(), streamResponse(), and prewarm() with markActive/markIdle
- Call evict() on removeFromCache/removeAllFromCache to reclaim GPU buffers
- Upgrade mlx-swift from 0.29.1 to 0.30.6 (fast SDPA, cache race fix,
  Memory API, wired memory, iPhone 16 Pro NAX fix)
- Upgrade mlx-swift-lm from 2.29.3 to 2.30.6 (Gemma3n per-layer
  intermediate_size, model loading perf, chat rehydration, tool calling)
- Migrate deprecated GPU.set(cacheLimit:)/GPU.clearCache() to Memory.*

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	Sources/AnyLanguageModel/Models/MLXLanguageModel.swift
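The reference-counted active/idle toggling described above can be modelled as follows. This is an illustrative sketch, not the library's `GPUMemoryManager`; `activeLimit`, `idleLimit`, and `currentLimit` are hypothetical names, and in the real manager the limit would be applied to the Metal buffer cache.

```swift
import Foundation

// Illustrative model of reference-counted active/idle toggling.
final class ActivityCounter {
    private let lock = NSLock()
    private var activeCount = 0
    private(set) var currentLimit: Int

    let activeLimit: Int
    let idleLimit: Int

    init(activeLimit: Int, idleLimit: Int) {
        self.activeLimit = activeLimit
        self.idleLimit = idleLimit
        self.currentLimit = idleLimit
    }

    func markActive() {
        lock.lock(); defer { lock.unlock() }
        activeCount += 1
        currentLimit = activeLimit  // stays high while any generation runs
    }

    func markIdle() {
        lock.lock(); defer { lock.unlock() }
        activeCount = max(0, activeCount - 1)
        if activeCount == 0 {
            currentLimit = idleLimit  // drop only when all sessions complete
        }
    }
}
```

The point of counting rather than toggling a boolean is that overlapping generations from concurrent sessions keep the cache limit high until the last one finishes.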

@pthm pthm left a comment


Thank you for this! KV cache reuse is a blocker for adoption of this lib for me.

I am working on a macOS app doing metadata / extraction with a pipeline that makes 4-6 sequential calls per session with a feedback loop, so KV cache reuse across turns is a big win; without it, each iteration re-encodes the full conversation and latency compounds.

I have dropped a few comments inline to try and help get it across the line. Let me know if there is anything else I can do to help on this.

Comment on lines 496 to 518
if isFirstIteration {
    let existingEntry = getSessionCache(session)
    let fullTokenCount = lmInput.text.tokens.dim(0)

    if let existingEntry,
        existingEntry.prefillTokenCount > 0,
        fullTokenCount > existingEntry.prefillTokenCount,
        lmInput.image == nil
    {
        // Cache HIT: only prefill new tokens
        let cachedCount = existingEntry.prefillTokenCount
        let newTokens = lmInput.text.tokens[cachedCount...]
        let partialText = MLXLMCommon.LMInput.Text(tokens: newTokens)
        inputForGeneration = MLXLMCommon.LMInput(text: partialText)
        cache = existingEntry.kvCache
    } else {
        // Cache MISS: create fresh cache
        if existingEntry != nil {
            removeSessionCache(for: session)
        }
        cache = context.model.newCache(parameters: generateParameters)
        inputForGeneration = lmInput
    }

This appears to validate only the token lengths to determine cache hit / miss, rather than checking whether the prefix tokens are actually the same.

If a caller reuses the same LanguageModelSession object but replaces the conversation (e.g. resetting to just the system prompt for a new task), fullTokenCount could still be greater than prefillTokenCount even though the actual token content has changed. The cache would contain stale key/value states.

This could be mitigated with a checksum / hash of the tokens?
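One possible shape for that mitigation, sketched with the standard library's `Hasher` (note that `Hasher` is randomly seeded per process, so this only works for comparisons within a single run; a stable checksum would be needed if the value ever had to persist):

```swift
// Hash the token prefix that was prefilled into the cache.
func prefixHash(_ tokens: ArraySlice<Int32>) -> Int {
    var hasher = Hasher()
    for token in tokens { hasher.combine(token) }
    return hasher.finalize()
}

// A cache is only reusable if the new conversation still starts with the
// exact tokens that were prefilled, not merely a longer token sequence.
func cacheIsValid(fullTokens: [Int32], cachedCount: Int, cachedHash: Int) -> Bool {
    guard cachedCount > 0, fullTokens.count > cachedCount else { return false }
    return prefixHash(fullTokens[..<cachedCount]) == cachedHash
}
```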

Author

Thank you for the suggestions. I've implemented these changes, so please let me know if we need to make further changes.

Comment on lines 649 to 668
if let existingEntry,
    existingEntry.prefillTokenCount > 0,
    fullTokenCount > existingEntry.prefillTokenCount,
    lmInput.image == nil
{
    // Cache HIT: only prefill new tokens
    let cachedCount = existingEntry.prefillTokenCount
    let newTokens = lmInput.text.tokens[cachedCount...]
    let partialText = MLXLMCommon.LMInput.Text(tokens: newTokens)
    inputForGeneration = MLXLMCommon.LMInput(text: partialText)
    cache = existingEntry.kvCache
} else {
    // Cache MISS: create fresh cache, prefill everything
    if existingEntry != nil {
        removeSessionCache(for: session)
    }
    let newCache = context.model.newCache(parameters: generateParameters)
    cache = newCache
    inputForGeneration = lmInput
}

The same cache logic is repeated here; if the cache validation / invalidation logic needs to change, it might be worth lifting it into a single helper func?

Author

This should also hopefully be resolved. Please let me know if we need to make any changes.

self.hub = hub
self.directory = directory
self.gpuMemory = gpuMemory
GPUMemoryManager.shared.configure(gpuMemory)

If two instances are created with different gpuMemory configs, won't this overwrite the first's configuration, resulting in a race?

Author

I believe this has been fixed as well. Please let me know if we need to make any changes.

Comment on lines 736 to 745
// Prefill the system prompt into a KV cache so the first turn is faster
if let instructions = session.instructions?.description, !instructions.isEmpty {
    let params = MLXLMCommon.GenerateParameters()
    let newCache = context.model.newCache(parameters: params)
    let chat: [MLXLMCommon.Chat.Message] = [.init(role: .system, content: instructions)]
    let userInput = MLXLMCommon.UserInput(
        chat: chat,
        processing: .init(resize: .init(width: 512, height: 512)),
        tools: nil
    )

When a session has tools, some chat templates (Qwen, Llama, etc.) inject the tool definitions into the system message, which changes the tokenization. Since this builds the chat with tools: nil, the cache will have a different token prefix than the first actual respond() call, causing a cache miss.

Maybe skip the KV prefill when session.tools is non-empty, or include the tool specs in the prewarm input?
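The first mitigation (skipping the KV prefill when tools are registered) could look roughly like this; `ToolSpec` is a placeholder type, not the library's real tool representation:

```swift
// Placeholder for whatever type the session uses to describe a tool.
struct ToolSpec {
    let name: String
}

// Only prewarm when the resulting token prefix can match the first real
// respond() call. Chat templates such as Qwen's and Llama's inject tool
// definitions into the system message, so a tools-free prewarm of a
// tool-bearing session would always produce a cache miss.
func shouldPrewarm(instructions: String?, tools: [ToolSpec]) -> Bool {
    guard let instructions, !instructions.isEmpty else { return false }
    return tools.isEmpty
}
```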

Author

I added tools on prewarm, but I think that is all that has changed here.

mikedoise and others added 4 commits February 25, 2026 14:06
…ool-aware prewarm

- Add prefillTokenHash to SessionCacheEntry to detect stale cache from
  replaced conversations (not just token count)
- Extract resolveCache() helper to deduplicate cache hit/miss logic
  between respond() and streamResponse()
- GPUMemoryManager.configure() now uses first-write-wins to prevent
  multiple MLXLanguageModel instances from silently overwriting config
- prewarm() accepts tools via protocol and session automatically
  forwards registered tools so prefill tokenization matches respond()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
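The first-write-wins behaviour mentioned above can be modelled with a small guarded setter. This is an illustrative sketch, not the actual `GPUMemoryManager.configure()` implementation:

```swift
import Foundation

// A shared configuration slot where only the first configure() call sticks.
final class SharedConfig<Value> {
    private let lock = NSLock()
    private var value: Value?

    /// Returns true if this call installed the configuration, false if a
    /// previous caller already did (the new value is silently ignored).
    @discardableResult
    func configure(_ newValue: Value) -> Bool {
        lock.lock(); defer { lock.unlock() }
        guard value == nil else { return false }  // first write wins
        value = newValue
        return true
    }

    var current: Value? {
        lock.lock(); defer { lock.unlock() }
        return value
    }
}
```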
- Detect when the MLX tool loop generates the same tool call signature
  as the previous iteration and break early instead of retrying
- Clear sessionKVCache in removeAllFromCache() so memory warning
  handlers actually free GPU memory from cached KV states

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
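The duplicate-tool-call guard described in this commit amounts to comparing a stable signature across loop iterations; sketched here with placeholder types rather than the library's actual tool-call representation:

```swift
// Placeholder for a parsed tool call (name plus serialized arguments).
struct ToolCall {
    let name: String
    let argumentsJSON: String

    var signature: String { "\(name)|\(argumentsJSON)" }
}

// Break out of the tool loop when the model asks for the exact same call
// it made on the previous iteration, instead of retrying indefinitely.
func shouldBreakToolLoop(previous: ToolCall?, current: ToolCall) -> Bool {
    previous?.signature == current.signature
}
```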