Conversation
Persist KV caches across respond()/streamResponse() calls within the same LanguageModelSession. On subsequent turns only the new tokens are prefilled instead of re-encoding the entire conversation history, dramatically reducing time to first token.

- Add maxKVSize, kvBits, kvGroupSize to GenerationOptions
- Add SessionCacheEntry store with NSMapTable weak keys
- Implement incremental prefill in streamResponse() and respond()
- Enhance prewarm() to prefill system prompt into KV cache

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add GPUMemoryConfiguration struct with .automatic (RAM-scaled) and .unconstrained presets for controlling Metal buffer pool limits
- Add GPUMemoryManager singleton with reference-counted active/idle toggling: cache stays high during concurrent generations, drops to idle limit only when all sessions complete
- Wrap respond(), streamResponse(), and prewarm() with markActive/markIdle
- Call evict() on removeFromCache/removeAllFromCache to reclaim GPU buffers
- Upgrade mlx-swift from 0.29.1 to 0.30.6 (fast SDPA, cache race fix, Memory API, wired memory, iPhone 16 Pro NAX fix)
- Upgrade mlx-swift-lm from 2.29.3 to 2.30.6 (Gemma3n per-layer intermediate_size, model loading perf, chat rehydration, tool calling)
- Migrate deprecated GPU.set(cacheLimit:)/GPU.clearCache() to Memory.*

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#   Sources/AnyLanguageModel/Models/MLXLanguageModel.swift
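The reference-counted active/idle toggling described in the commit message could look roughly like the following. This is a minimal sketch, not the library's implementation: the names `GPUMemoryManager`, `markActive`, and `markIdle` come from the commit message, but the limit values, the lock, and the `cacheLimitBytes` property are illustrative assumptions standing in for the real Metal buffer-pool calls.

```swift
import Foundation

// Hypothetical sketch of reference-counted active/idle GPU cache toggling.
// The byte limits and `cacheLimitBytes` stand in for the real Metal
// buffer-pool API; only the counting logic mirrors the described design.
final class GPUMemoryManager {
    static let shared = GPUMemoryManager()

    private let lock = NSLock()
    private var activeCount = 0

    private let idleLimitBytes = 64 * 1024 * 1024      // low limit when no session runs
    private let activeLimitBytes = 1024 * 1024 * 1024  // high limit while generating

    private(set) var cacheLimitBytes = 64 * 1024 * 1024

    func markActive() {
        lock.lock(); defer { lock.unlock() }
        activeCount += 1
        cacheLimitBytes = activeLimitBytes  // raise limit while any session is active
    }

    func markIdle() {
        lock.lock(); defer { lock.unlock() }
        activeCount = max(0, activeCount - 1)
        if activeCount == 0 {
            // Drop to the idle limit only when ALL sessions have completed,
            // so concurrent generations keep the cache warm.
            cacheLimitBytes = idleLimitBytes
        }
    }
}
```

With two overlapping generations, the first `markIdle()` leaves the limit high; only the last one drops it.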
pthm left a comment
Thank you for this! KV cache reuse is a blocker for adoption of this lib for me.

I am working on a macOS app doing metadata extraction with a pipeline that makes 4-6 sequential calls per session with a feedback loop, so KV cache reuse across turns is a big win. Without it, each iteration re-encodes the full conversation and latency compounds.

I have dropped a few comments inline to try and help get it across the line. Let me know if there is anything else I can do to help on this.
if isFirstIteration {
    let existingEntry = getSessionCache(session)
    let fullTokenCount = lmInput.text.tokens.dim(0)

    if let existingEntry,
        existingEntry.prefillTokenCount > 0,
        fullTokenCount > existingEntry.prefillTokenCount,
        lmInput.image == nil
    {
        // Cache HIT: only prefill new tokens
        let cachedCount = existingEntry.prefillTokenCount
        let newTokens = lmInput.text.tokens[cachedCount...]
        let partialText = MLXLMCommon.LMInput.Text(tokens: newTokens)
        inputForGeneration = MLXLMCommon.LMInput(text: partialText)
        cache = existingEntry.kvCache
    } else {
        // Cache MISS: create fresh cache
        if existingEntry != nil {
            removeSessionCache(for: session)
        }
        cache = context.model.newCache(parameters: generateParameters)
        inputForGeneration = lmInput
    }
}
This appears to only be validating the token lengths to determine cache miss / hit, rather than validating whether the prefix tokens are the same.
If a caller reuses the same LanguageModelSession object but replaces the conversation (e.g. resetting to just the system prompt for a new task), fullTokenCount could still be greater than prefillTokenCount even though the actual token content has changed. The cache would contain stale key/value states.
This could be mitigated with a checksum / hash of the tokens?
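The checksum idea above could be sketched like this: hash the prefix tokens at prefill time and compare on the next turn, so a replaced conversation is detected as a miss even when the new token count happens to be larger. `SessionCacheEntry` here is a simplified stand-in modeled on the diff, not the library's actual type.

```swift
import Foundation

// Hypothetical sketch of token-hash cache validation. The fields of
// SessionCacheEntry are assumptions modeled on the quoted diff.
struct SessionCacheEntry {
    var prefillTokenCount: Int
    var prefillTokenHash: Int
}

// Hash a run of token ids. Hasher is randomly seeded per process, which is
// fine here: both hashes are computed within the same process lifetime.
func tokenHash(_ tokens: ArraySlice<Int32>) -> Int {
    var hasher = Hasher()
    for t in tokens { hasher.combine(t) }
    return hasher.finalize()
}

/// A cache hit requires the cached prefix to be both shorter than the new
/// input AND identical to its leading tokens, not just shorter.
func isCacheValid(_ entry: SessionCacheEntry, fullTokens: [Int32]) -> Bool {
    guard entry.prefillTokenCount > 0,
          fullTokens.count > entry.prefillTokenCount else { return false }
    let prefix = fullTokens[..<entry.prefillTokenCount]
    return tokenHash(prefix) == entry.prefillTokenHash
}
```

A conversation that was reset and rebuilt to a longer token count now fails the hash comparison instead of silently reusing stale key/value states.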
Thank you for the suggestions. I've implemented these changes, so please let me know if we need to make further changes.
if let existingEntry,
    existingEntry.prefillTokenCount > 0,
    fullTokenCount > existingEntry.prefillTokenCount,
    lmInput.image == nil
{
    // Cache HIT: only prefill new tokens
    let cachedCount = existingEntry.prefillTokenCount
    let newTokens = lmInput.text.tokens[cachedCount...]
    let partialText = MLXLMCommon.LMInput.Text(tokens: newTokens)
    inputForGeneration = MLXLMCommon.LMInput(text: partialText)
    cache = existingEntry.kvCache
} else {
    // Cache MISS: create fresh cache, prefill everything
    if existingEntry != nil {
        removeSessionCache(for: session)
    }
    let newCache = context.model.newCache(parameters: generateParameters)
    cache = newCache
    inputForGeneration = lmInput
}
This duplicates the cache logic above. If the validation / invalidation logic needs to change, it might be worth lifting it into a single helper func?
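Lifting the duplicated branch into one helper might look roughly like this. The types are deliberately simplified stand-ins for the MLX types in the diff (no `LMInput` or KV cache objects), just to show the single shared decision path that `respond()` and `streamResponse()` could both call.

```swift
// Hypothetical sketch of a shared cache hit/miss helper. Token arrays stand
// in for the real LMInput/KV-cache types; only the branching logic matters.
struct CacheResolution {
    let reusedCache: Bool                    // true when the KV cache was kept
    let tokensToPrefill: ArraySlice<Int32>   // suffix on a hit, everything on a miss
}

func resolveCache(
    existingPrefillCount: Int?,
    fullTokens: [Int32],
    hasImage: Bool
) -> CacheResolution {
    if let cached = existingPrefillCount,
       cached > 0, fullTokens.count > cached, !hasImage {
        // Cache HIT: only the new suffix needs prefilling.
        return CacheResolution(reusedCache: true, tokensToPrefill: fullTokens[cached...])
    }
    // Cache MISS: discard any stale entry and prefill from scratch.
    return CacheResolution(reusedCache: false, tokensToPrefill: fullTokens[...])
}
```

Both call sites then reduce to one `resolveCache(...)` call, so a future change to the invalidation rule (e.g. adding the token-hash check) lands in exactly one place.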
This should also hopefully be resolved. Please let me know if we need to make any changes.
self.hub = hub
self.directory = directory
self.gpuMemory = gpuMemory
GPUMemoryManager.shared.configure(gpuMemory)
If two instances are created with different gpuMemory configs, will this overwrite the first's configuration, resulting in a race?
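One way to address this, and what the later commit message calls "first-write-wins", could be sketched as follows. This is an illustrative stand-in (`MemoryConfigurator`, a `String` config) rather than the real `GPUMemoryManager.configure(_:)` signature: the first configuration sticks and later calls are ignored, so two model instances cannot race to overwrite each other.

```swift
import Foundation

// Hypothetical first-write-wins sketch. A String stands in for the real
// GPUMemoryConfiguration value; the guard under a lock is the whole idea.
final class MemoryConfigurator {
    private let lock = NSLock()
    private var config: String?

    /// Installs the configuration only if none exists yet.
    /// Returns true when this call won the write.
    @discardableResult
    func configure(_ newConfig: String) -> Bool {
        lock.lock(); defer { lock.unlock() }
        guard config == nil else { return false }  // first write wins
        config = newConfig
        return true
    }

    var current: String? {
        lock.lock(); defer { lock.unlock() }
        return config
    }
}
```

A second instance created with a different config then sees its `configure` call silently (or loudly, if you log it) ignored instead of clobbering the first.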
I believe this has been fixed as well. Please let me know if we need to make any changes.
// Prefill the system prompt into a KV cache so the first turn is faster
if let instructions = session.instructions?.description, !instructions.isEmpty {
    let params = MLXLMCommon.GenerateParameters()
    let newCache = context.model.newCache(parameters: params)
    let chat: [MLXLMCommon.Chat.Message] = [.init(role: .system, content: instructions)]
    let userInput = MLXLMCommon.UserInput(
        chat: chat,
        processing: .init(resize: .init(width: 512, height: 512)),
        tools: nil
    )
When a session has tools, some chat templates (Qwen, Llama, etc.) inject the tool definitions into the system message, which changes the tokenization. Since this builds the chat with tools: nil, the cache will have a different token prefix than the first actual respond() call, causing a cache miss.

Maybe skip the KV prefill when session.tools is non-empty, or include the tool specs in the prewarm input?
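The effect described above can be demonstrated with a toy template expansion. `renderSystemPrefix` is an illustrative stand-in for a real chat template, not any MLX API: it shows that the rendered (and therefore tokenized) prefix differs as soon as tool definitions are spliced into the system message, which is exactly why prewarming with `tools: nil` produces a mismatched cache prefix.

```swift
// Hypothetical stand-in for a chat template that injects tool definitions
// into the system message, the behavior the review comment describes.
func renderSystemPrefix(instructions: String, toolSpecs: [String]) -> String {
    var prefix = "<|system|>\n" + instructions
    if !toolSpecs.isEmpty {
        // Tool definitions become part of the prompt, changing its tokens.
        prefix += "\n# Tools\n" + toolSpecs.joined(separator: "\n")
    }
    return prefix
}
```

Prewarming with the session's actual tool specs makes the cached prefix identical to the one the first `respond()` call will produce; prewarming with none guarantees a miss whenever tools are registered.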
I added tools on pre-warm but I think that is all that has been changed here.
…ool-aware prewarm

- Add prefillTokenHash to SessionCacheEntry to detect stale cache from replaced conversations (not just token count)
- Extract resolveCache() helper to deduplicate cache hit/miss logic between respond() and streamResponse()
- GPUMemoryManager.configure() now uses first-write-wins to prevent multiple MLXLanguageModel instances from silently overwriting config
- prewarm() accepts tools via protocol and session automatically forwards registered tools so prefill tokenization matches respond()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Detect when the MLX tool loop generates the same tool call signature as the previous iteration and break early instead of retrying
- Clear sessionKVCache in removeAllFromCache() so memory warning handlers actually free GPU memory from cached KV states

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…AnyLanguageModel into feature/mlx-kv-cache
Hello, I've updated the MLX memory cache functions in the library to help with multi-session stability with MLX. I've also updated the package to use the latest mlx-swift version. Here are the changes.
- conversation history on each turn
- Metal buffer cache limits and reference-counted active/idle toggling
This is honestly my first time submitting a pull request to someone, so please let me know if there are any changes I should make. Thank you for making such an amazing library.