stduhpf (Contributor) commented Jan 29, 2026

Tested with https://civitai.green/models/344873/plana-blue-archivelokr

Before:

```
[DEBUG] ggml_extend.hpp:1778 - unet compute buffer size: 3363.80 MB(VRAM)
[INFO ] stable-diffusion.cpp:3578 - sampling completed, taking 20.04s
```

After:

```
[DEBUG] ggml_extend.hpp:1778 - unet compute buffer size: 137.05 MB(VRAM)
[INFO ] stable-diffusion.cpp:3578 - sampling completed, taking 16.43s
```

(Vulkan backend, 512x512, 20 steps with cfg)

Edit: For some reason it gets slower than the previous implementation at higher resolutions (e.g. 1024x1024), but the VRAM savings remain significant:

Before:

```
[DEBUG] ggml_extend.hpp:1762 - unet compute buffer size: 4061.49 MB(VRAM)
[INFO ] stable-diffusion.cpp:3403 - sampling completed, taking 49.17s
```

After:

```
[DEBUG] ggml_extend.hpp:1778 - unet compute buffer size: 830.86 MB(VRAM)
[INFO ] stable-diffusion.cpp:3578 - sampling completed, taking 51.02s
```

Maybe it can be improved further?
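A hypothetical sketch of why that speed/VRAM tradeoff can appear (this is not the PR's actual code, and LoKr uses Kronecker-factored weights rather than the plain low-rank factors shown here): if an adapter is applied on the fly as `W @ x + B @ (A @ x)` instead of materializing the patched weight `W + B @ A` once, the large `d*d` temporary buffer disappears, but two extra matmuls run every step, and their cost grows with the number of tokens, i.e. with resolution. All sizes below are made up for illustration.

```python
import numpy as np

d, r, n_tokens = 1024, 8, 64  # hidden size, adapter rank, sequence length (made-up numbers)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)).astype(np.float32)   # base weight
A = rng.standard_normal((r, d)).astype(np.float32)   # low-rank down-projection
B = rng.standard_normal((d, r)).astype(np.float32)   # low-rank up-projection
x = rng.standard_normal((d, n_tokens)).astype(np.float32)

# Eager: materialize the patched weight -> needs an extra d*d buffer.
y_eager = (W + B @ A) @ x

# On-the-fly: no d*d temporary, but two extra matmuls per application,
# whose cost scales with n_tokens (hence slower at higher resolution).
y_otf = W @ x + B @ (A @ x)

# Same result up to float32 rounding.
assert np.allclose(y_eager, y_otf, atol=1e-2)

# Rough buffer comparison: d*d floats vs r*n_tokens floats.
print("materialized delta:", W.nbytes, "bytes; on-the-fly temp:", (A @ x).nbytes, "bytes")
```

The eager path pays memory once per patched tensor; the on-the-fly path pays compute once per forward pass, which matches VRAM dropping while step time grows with image size.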

stduhpf (Contributor, Author) commented Jan 29, 2026

Hmm, for some as-yet-unknown reason it doesn't seem to work on the ROCm backend at all. VRAM usage is indeed very low, as with Vulkan, but somehow my system memory fills up completely and the first step takes forever to complete. Seems like a memory leak?

Can anyone test with CUDA backend?
