llama.cpp submodule update from b6153 to b7868 #1

Open

metaspartan wants to merge 5 commits into main from dev

Conversation

metaspartan commented Jan 7, 2026

Adds support for flash attention type in context params and updates related logic in llama.py. Refactors deprecated sampling methods to improve error messaging. Updates llama_cpp.py with new constants, fields, and function signatures for API consistency and new features. Bumps version to 0.3.17.
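
For reviewers skimming the flash-attention change, a minimal sketch of how the legacy bool could be mapped onto the new enum while staying backward compatible; the enum names and values mirror upstream's AUTO/DISABLED/ENABLED convention and are an assumption here, not an excerpt of the diff:

```python
# Hypothetical sketch: map the legacy flash_attn bool onto the new
# flash_attn_type enum while keeping backward compatibility.
# The enum values below follow upstream llama.cpp's convention (assumed).
LLAMA_FLASH_ATTN_TYPE_AUTO = -1
LLAMA_FLASH_ATTN_TYPE_DISABLED = 0
LLAMA_FLASH_ATTN_TYPE_ENABLED = 1


def resolve_flash_attn_type(flash_attn=None, flash_attn_type=None):
    """Prefer an explicit enum value; fall back to the legacy bool."""
    if flash_attn_type is not None:
        return flash_attn_type
    if flash_attn is None:
        return LLAMA_FLASH_ATTN_TYPE_AUTO
    return (
        LLAMA_FLASH_ATTN_TYPE_ENABLED
        if flash_attn
        else LLAMA_FLASH_ATTN_TYPE_DISABLED
    )
```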


Note

Medium Risk
Moderate risk because it syncs Python bindings with upstream llama.cpp C API changes (struct layouts, removed symbols, and flash-attention configuration), which can break runtime loading or behavior if the packaged native library and bindings get out of sync.

Overview
Bumps the package to 0.3.18 and updates the vendored llama.cpp integration, including a build fix that sets a default LLAMA_INSTALL_VERSION for the mtmd sub-build.

Updates the Python API and ctypes bindings to match upstream C API changes: replaces the context flash_attn bool with a flash_attn_type enum (while keeping Python backward compatibility), adds new llama.cpp constants/struct fields and helper functions (e.g., llama_n_ctx_seq, llama_max_tensor_buft_overrides, adapter metadata + aLoRA helpers, llama_log_get, llama_memory_breakdown_print, llama_model_is_hybrid), and removes deprecated KV-cache (llama_kv_self_*) and llama_sampler_init_softmax bindings.
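
For context, a hedged sketch of what ctypes declarations for two of the new helpers might look like; the signatures shown are assumed from upstream naming conventions, not copied from the PR:

```python
import ctypes

# Hypothetical sketch of ctypes declarations for two of the new helpers.
# Signatures are assumed to follow upstream llama.cpp conventions:
#   size_t   llama_max_tensor_buft_overrides(void);
#   uint32_t llama_n_ctx_seq(const struct llama_context * ctx);
lib = ctypes.CDLL("libllama.so")  # library name/path is platform dependent

llama_context_p = ctypes.c_void_p  # opaque context handle for this sketch

lib.llama_max_tensor_buft_overrides.argtypes = []
lib.llama_max_tensor_buft_overrides.restype = ctypes.c_size_t

lib.llama_n_ctx_seq.argtypes = [llama_context_p]
lib.llama_n_ctx_seq.restype = ctypes.c_uint32
```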

Also switches internal code paths to use the new memory wrappers (e.g., embeddings KV-cache clearing) and includes mostly formatting/typing cleanups in sampler/chat-format code.
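
As an illustration of that migration, clearing a context's KV cache through the memory wrappers could look roughly like this; llama_get_memory / llama_memory_clear follow the upstream API names, but the exact Python-side signatures are assumed:

```python
import llama_cpp  # the package's low-level ctypes module


def clear_context_memory(ctx) -> None:
    """Clear a context's KV cache via the new memory wrappers.

    Replaces the removed llama_kv_self_clear(ctx) call; signatures
    here are assumed, not quoted from the diff.
    """
    mem = llama_cpp.llama_get_memory(ctx)
    llama_cpp.llama_memory_clear(mem, True)  # True: also clear the data buffers
```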

Written by Cursor Bugbot for commit ca37242.


chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 36f7b221ef


Upgrades vendor/llama.cpp to commit 3bcc990, introducing new features such as the adaptive probability-based sampler (`llama_sampler_init_adaptive_p`), direct I/O support (`use_direct_io` in llama_model_params), GLM 4.7 Flash support, Qwen3 Next model support, and self-speculative decoding. Updates Python bindings and chat format handlers to support these features and improve code formatting and clarity.
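
If the new model-params field is surfaced the same way existing ones are, enabling direct I/O from the low-level bindings might look like the sketch below; the field name use_direct_io comes from the commit description, while its exposure on the params struct and the load call shown are assumptions:

```python
import llama_cpp

# Hypothetical sketch: toggle the new direct I/O flag on the raw model
# params struct before loading a model.
params = llama_cpp.llama_model_default_params()
params.use_direct_io = True  # field name per the upstream change; assumed C bool
model = llama_cpp.llama_model_load_from_file(b"/path/to/model.gguf", params)
```
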
metaspartan changed the title from "llama.cpp submodule update from b6153 to b7652" to "llama.cpp submodule update from b6153 to b7868" on Jan 29, 2026
Added Python bindings for new llama.cpp APIs: llama_max_tensor_buft_overrides, llama_model_is_hybrid, adapter metadata functions, aLoRA invocation token functions, llama_memory_breakdown_print, and llama_log_get. Updated changelog to reflect new features, fixes, and deprecations in version 0.3.18. Commented out llama_sampler_init_adaptive_p binding as it requires a library rebuild.
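
One way such a binding can stay in the file without breaking older packaged libraries is to guard it on symbol availability; a hedged sketch follows (the sampler's argument list is unknown here and deliberately left out):

```python
import ctypes

lib = ctypes.CDLL("libllama.so")  # library name/path is platform dependent

# Hypothetical sketch: only bind llama_sampler_init_adaptive_p when the
# packaged native library actually exports it, so the module keeps
# importing against older builds that predate the sampler.
HAS_ADAPTIVE_P_SAMPLER = hasattr(lib, "llama_sampler_init_adaptive_p")

if HAS_ADAPTIVE_P_SAMPLER:
    # Return type is a llama_sampler pointer; the argument list is left
    # unspecified because it depends on the upstream signature.
    lib.llama_sampler_init_adaptive_p.restype = ctypes.c_void_p
```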