Conversation
Persist KV caches across respond()/streamResponse() calls within the same LanguageModelSession. On subsequent turns only the new tokens are prefilled instead of re-encoding the entire conversation history, dramatically reducing time to first token.

- Add maxKVSize, kvBits, kvGroupSize to GenerationOptions
- Add SessionCacheEntry store with NSMapTable weak keys
- Implement incremental prefill in streamResponse() and respond()
- Enhance prewarm() to prefill system prompt into KV cache

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add GPUMemoryConfiguration struct with .automatic (RAM-scaled) and .unconstrained presets for controlling Metal buffer pool limits
- Add GPUMemoryManager singleton with reference-counted active/idle toggling: cache stays high during concurrent generations, drops to idle limit only when all sessions complete
- Wrap respond(), streamResponse(), and prewarm() with markActive/markIdle
- Call evict() on removeFromCache/removeAllFromCache to reclaim GPU buffers
- Upgrade mlx-swift from 0.29.1 to 0.30.6 (fast SDPA, cache race fix, Memory API, wired memory, iPhone 16 Pro NAX fix)
- Upgrade mlx-swift-lm from 2.29.3 to 2.30.6 (Gemma3n per-layer intermediate_size, model loading perf, chat rehydration, tool calling)
- Migrate deprecated GPU.set(cacheLimit:)/GPU.clearCache() to Memory.*

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#   Sources/AnyLanguageModel/Models/MLXLanguageModel.swift
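The reference-counted active/idle toggling described in the commit message could look roughly like the following. This is a minimal sketch, not the library's implementation: the names `GPUMemoryManager`, `markActive`, and `markIdle` come from the commit message, but the limit values, the lock, and the `cacheLimitBytes` property are illustrative assumptions standing in for the real Metal buffer-pool calls.

```swift
import Foundation

// Hypothetical sketch of reference-counted active/idle GPU cache toggling.
// The byte limits and `cacheLimitBytes` stand in for the real Metal
// buffer-pool API; only the counting logic mirrors the described design.
final class GPUMemoryManager {
    static let shared = GPUMemoryManager()

    private let lock = NSLock()
    private var activeCount = 0

    private let idleLimitBytes = 64 * 1024 * 1024      // low limit when no session runs
    private let activeLimitBytes = 1024 * 1024 * 1024  // high limit while generating

    private(set) var cacheLimitBytes = 64 * 1024 * 1024

    func markActive() {
        lock.lock(); defer { lock.unlock() }
        activeCount += 1
        cacheLimitBytes = activeLimitBytes  // raise limit while any session is active
    }

    func markIdle() {
        lock.lock(); defer { lock.unlock() }
        activeCount = max(0, activeCount - 1)
        if activeCount == 0 {
            // Drop to the idle limit only when ALL sessions have completed,
            // so concurrent generations keep the cache warm.
            cacheLimitBytes = idleLimitBytes
        }
    }
}
```

With two overlapping generations, the first `markIdle()` leaves the limit high; only the last one drops it.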
pthm left a comment
Thank you for this! KV cache reuse is a blocker for adoption of this lib for me.

I am working on a macOS app doing metadata extraction with a pipeline that makes 4-6 sequential calls per session with a feedback loop, so KV cache reuse across turns is a big win. Without it, each iteration re-encodes the full conversation and latency compounds.

I have dropped a few comments inline to try and help get it across the line. Let me know if there is anything else I can do to help on this.
if isFirstIteration {
    let existingEntry = getSessionCache(session)
    let fullTokenCount = lmInput.text.tokens.dim(0)

    if let existingEntry,
        existingEntry.prefillTokenCount > 0,
        fullTokenCount > existingEntry.prefillTokenCount,
        lmInput.image == nil
    {
        // Cache HIT: only prefill new tokens
        let cachedCount = existingEntry.prefillTokenCount
        let newTokens = lmInput.text.tokens[cachedCount...]
        let partialText = MLXLMCommon.LMInput.Text(tokens: newTokens)
        inputForGeneration = MLXLMCommon.LMInput(text: partialText)
        cache = existingEntry.kvCache
    } else {
        // Cache MISS: create fresh cache
        if existingEntry != nil {
            removeSessionCache(for: session)
        }
        cache = context.model.newCache(parameters: generateParameters)
        inputForGeneration = lmInput
    }
}
This appears to only be validating the token lengths to determine cache miss / hit, rather than validating whether the prefix tokens are the same.
If a caller reuses the same LanguageModelSession object but replaces the conversation (e.g. resetting to just the system prompt for a new task), fullTokenCount could still be greater than prefillTokenCount even though the actual token content has changed. The cache would contain stale key/value states.
This could be mitigated with a checksum / hash of the tokens?
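The checksum idea above could be sketched like this: hash the prefix tokens at prefill time and compare on the next turn, so a replaced conversation is detected as a miss even when the new token count happens to be larger. `SessionCacheEntry` here is a simplified stand-in modeled on the diff, not the library's actual type.

```swift
import Foundation

// Hypothetical sketch of token-hash cache validation. The fields of
// SessionCacheEntry are assumptions modeled on the quoted diff.
struct SessionCacheEntry {
    var prefillTokenCount: Int
    var prefillTokenHash: Int
}

// Hash a run of token ids. Hasher is randomly seeded per process, which is
// fine here: both hashes are computed within the same process lifetime.
func tokenHash(_ tokens: ArraySlice<Int32>) -> Int {
    var hasher = Hasher()
    for t in tokens { hasher.combine(t) }
    return hasher.finalize()
}

/// A cache hit requires the cached prefix to be both shorter than the new
/// input AND identical to its leading tokens, not just shorter.
func isCacheValid(_ entry: SessionCacheEntry, fullTokens: [Int32]) -> Bool {
    guard entry.prefillTokenCount > 0,
          fullTokens.count > entry.prefillTokenCount else { return false }
    let prefix = fullTokens[..<entry.prefillTokenCount]
    return tokenHash(prefix) == entry.prefillTokenHash
}
```

A conversation that was reset and rebuilt to a longer token count now fails the hash comparison instead of silently reusing stale key/value states.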
Thank you for the suggestions. I've implemented these changes, so please let me know if we need to make further changes.
if let existingEntry,
    existingEntry.prefillTokenCount > 0,
    fullTokenCount > existingEntry.prefillTokenCount,
    lmInput.image == nil
{
    // Cache HIT: only prefill new tokens
    let cachedCount = existingEntry.prefillTokenCount
    let newTokens = lmInput.text.tokens[cachedCount...]
    let partialText = MLXLMCommon.LMInput.Text(tokens: newTokens)
    inputForGeneration = MLXLMCommon.LMInput(text: partialText)
    cache = existingEntry.kvCache
} else {
    // Cache MISS: create fresh cache, prefill everything
    if existingEntry != nil {
        removeSessionCache(for: session)
    }
    let newCache = context.model.newCache(parameters: generateParameters)
    cache = newCache
    inputForGeneration = lmInput
}
This duplicates the cache logic above. If the validation / invalidation logic needs to change, it might be worth lifting it into a single helper func?
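Lifting the duplicated branch into one helper might look roughly like this. The types are deliberately simplified stand-ins for the MLX types in the diff (no `LMInput` or KV cache objects), just to show the single shared decision path that `respond()` and `streamResponse()` could both call.

```swift
// Hypothetical sketch of a shared cache hit/miss helper. Token arrays stand
// in for the real LMInput/KV-cache types; only the branching logic matters.
struct CacheResolution {
    let reusedCache: Bool                    // true when the KV cache was kept
    let tokensToPrefill: ArraySlice<Int32>   // suffix on a hit, everything on a miss
}

func resolveCache(
    existingPrefillCount: Int?,
    fullTokens: [Int32],
    hasImage: Bool
) -> CacheResolution {
    if let cached = existingPrefillCount,
       cached > 0, fullTokens.count > cached, !hasImage {
        // Cache HIT: only the new suffix needs prefilling.
        return CacheResolution(reusedCache: true, tokensToPrefill: fullTokens[cached...])
    }
    // Cache MISS: discard any stale entry and prefill from scratch.
    return CacheResolution(reusedCache: false, tokensToPrefill: fullTokens[...])
}
```

Both call sites then reduce to one `resolveCache(...)` call, so a future change to the invalidation rule (e.g. adding the token-hash check) lands in exactly one place.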
This should also hopefully be resolved. Please let me know if we need to make any changes.
self.hub = hub
self.directory = directory
self.gpuMemory = gpuMemory
GPUMemoryManager.shared.configure(gpuMemory)
If two instances are created with different gpuMemory configs, will this overwrite the first's configuration, resulting in a race?
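One way to address this, and what the later commit message calls "first-write-wins", could be sketched as follows. This is an illustrative stand-in (`MemoryConfigurator`, a `String` config) rather than the real `GPUMemoryManager.configure(_:)` signature: the first configuration sticks and later calls are ignored, so two model instances cannot race to overwrite each other.

```swift
import Foundation

// Hypothetical first-write-wins sketch. A String stands in for the real
// GPUMemoryConfiguration value; the guard under a lock is the whole idea.
final class MemoryConfigurator {
    private let lock = NSLock()
    private var config: String?

    /// Installs the configuration only if none exists yet.
    /// Returns true when this call won the write.
    @discardableResult
    func configure(_ newConfig: String) -> Bool {
        lock.lock(); defer { lock.unlock() }
        guard config == nil else { return false }  // first write wins
        config = newConfig
        return true
    }

    var current: String? {
        lock.lock(); defer { lock.unlock() }
        return config
    }
}
```

A second instance created with a different config then sees its `configure` call silently (or loudly, if you log it) ignored instead of clobbering the first.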
I believe this has been fixed as well. Please let me know if we need to make any changes.
// Prefill the system prompt into a KV cache so the first turn is faster
if let instructions = session.instructions?.description, !instructions.isEmpty {
    let params = MLXLMCommon.GenerateParameters()
    let newCache = context.model.newCache(parameters: params)
    let chat: [MLXLMCommon.Chat.Message] = [.init(role: .system, content: instructions)]
    let userInput = MLXLMCommon.UserInput(
        chat: chat,
        processing: .init(resize: .init(width: 512, height: 512)),
        tools: nil
    )
When a session has tools, some chat templates (Qwen, Llama, etc.) inject the tool definitions into the system message, which changes the tokenization. Since this builds the chat with tools: nil, the cache will have a different token prefix than the first actual respond() call, causing a cache miss.

Maybe skip the KV prefill when session.tools is non-empty, or include the tool specs in the prewarm input?
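The effect described above can be demonstrated with a toy template expansion. `renderSystemPrefix` is an illustrative stand-in for a real chat template, not any MLX API: it shows that the rendered (and therefore tokenized) prefix differs as soon as tool definitions are spliced into the system message, which is exactly why prewarming with `tools: nil` produces a mismatched cache prefix.

```swift
// Hypothetical stand-in for a chat template that injects tool definitions
// into the system message, the behavior the review comment describes.
func renderSystemPrefix(instructions: String, toolSpecs: [String]) -> String {
    var prefix = "<|system|>\n" + instructions
    if !toolSpecs.isEmpty {
        // Tool definitions become part of the prompt, changing its tokens.
        prefix += "\n# Tools\n" + toolSpecs.joined(separator: "\n")
    }
    return prefix
}
```

Prewarming with the session's actual tool specs makes the cached prefix identical to the one the first `respond()` call will produce; prewarming with none guarantees a miss whenever tools are registered.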
I added tools on pre-warm but I think that is all that has been changed here.
…ool-aware prewarm

- Add prefillTokenHash to SessionCacheEntry to detect stale cache from replaced conversations (not just token count)
- Extract resolveCache() helper to deduplicate cache hit/miss logic between respond() and streamResponse()
- GPUMemoryManager.configure() now uses first-write-wins to prevent multiple MLXLanguageModel instances from silently overwriting config
- prewarm() accepts tools via protocol and session automatically forwards registered tools so prefill tokenization matches respond()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Detect when the MLX tool loop generates the same tool call signature as the previous iteration and break early instead of retrying
- Clear sessionKVCache in removeAllFromCache() so memory warning handlers actually free GPU memory from cached KV states

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…AnyLanguageModel into feature/mlx-kv-cache
Hello, I've updated the MLX memory cache functions in the library to help with multi-session stability with MLX. I've also updated the package to use the latest mlx-swift version. Here are the changes.
- conversation history on each turn
- Metal buffer cache limits and reference-counted active/idle toggling
This is honestly my first time submitting a pull request to someone, so please let me know if there are any changes I should make. Thank you for making such an amazing library.