
Conversation

cdutr (Contributor) commented Dec 15, 2025

Implements a new attention processing technique based on the "Selective Attention Improves Transformer" paper, enabling more efficient and flexible attention mechanisms.

What does this PR do?

Key features:

  • Introduces MemoryEfficientSelectiveAttnProcessor2_0 for advanced attention masking
  • Supports configurable token selection and optional pruning
  • Adds methods to enable/disable selective attention across model modules
  • Provides fine-grained control over masking strength and token selection

The goal is more intelligent, computationally efficient attention: selective token interaction and optional pruning can improve model performance while reducing attention overhead. A usage sketch is shown below.
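A minimal usage sketch, for illustration only: the processor class name comes from this PR's description, but the import path, constructor arguments, and the `set_attn_processor` wiring below are assumptions about how the "masking strength" and "optional pruning" knobs might be exposed, not the PR's actual API.

```python
# Hedged sketch, not the PR's actual API. The processor class name is from the
# PR description; the import path and keyword arguments are assumptions.
import torch
from diffusers import StableDiffusionPipeline
# Assumed location of the new processor inside this PR:
from diffusers.models.attention_processor import MemoryEfficientSelectiveAttnProcessor2_0

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical knobs mirroring "masking strength" and "optional pruning"
# from the feature list above.
processor = MemoryEfficientSelectiveAttnProcessor2_0(
    masking_strength=1.0,  # assumed parameter name
    prune_tokens=False,    # assumed parameter name
)
pipe.unet.set_attn_processor(processor)  # existing diffusers API for swapping processors

image = pipe("an astronaut riding a horse").images[0]
```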

The original paper targets causal attention in LLMs where benefits come from KV-cache eviction during autoregressive generation. For bidirectional attention in diffusion models, we observe overhead from computing selection scores without actual computation reduction. However, we expect better results for video generation where sequences are much longer (16K-100K+ tokens) and temporal redundancy provides more pruning opportunities.
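For context, here is a simplified paraphrase of the selection mechanism described in the paper (my reading of it, not code from this PR): earlier tokens accumulate "masking pressure" on later keys, and in a causal LLM keys that are fully masked can be evicted from the KV cache during generation, which is where the savings come from. With bidirectional attention there is no cache to shrink, so computing the selection scores is pure extra work.

```python
# Simplified paraphrase of "Selective Attention Improves Transformer";
# not code from this PR, and it omits details such as using a dedicated
# head for the selection scores.
import torch

def selective_attention(q, k, v):
    # q, k, v: (batch, heads, seq, dim); causal setting as in the paper.
    logits = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    s = torch.relu(logits)                    # selection scores S[i, j]
    # F[i, j] = sum of S[m, j] over earlier queries m < i: accumulated masking
    # pressure on key j. In an autoregressive LLM, keys whose F grows large
    # enough can be evicted from the KV cache.
    f = torch.cumsum(s, dim=-2) - s
    logits = logits - f                       # soft-mask the selected keys
    seq = q.shape[-2]
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), 1)
    logits = logits.masked_fill(causal, float("-inf"))
    return torch.softmax(logits, dim=-1) @ v
```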

As a next step, I am adapting it to video models to check the benefits. If it works, I will add documentation and unit tests.

Fixes #12817

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. @sayakpaul @DN6 may be interested.

sayakpaul (Member) commented

Thanks for the work! The attention backends we have in https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_dispatch.py are generic enough to be applied to most models.

The attention processor mechanism, on the other hand, is for allowing users to hook up their favorite processors (like this), which we may not be able to maintain.

So, let's try to make sure this attention processor is a generic one that can provide benefits for most models. Otherwise, we can always have it under research_projects and show a concrete example with a model where the benefits are totally evident and visible.

> For bidirectional attention in diffusion models, we observe overhead from computing selection scores without actual computation reduction.

Hmm, I haven't looked too deep yet but can't we compute this in another CUDA stream or something? Or, is it always a blocking op?
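For reference, the kind of overlap being asked about would look roughly like the sketch below (illustrative only, not part of this PR): launch the selection-score computation on a side CUDA stream while the main stream runs the regular attention. Function and variable names are made up for the example.

```python
# Illustrative only: overlap the selection-score computation with the regular
# attention on a side CUDA stream.
import torch
import torch.nn.functional as F

side_stream = torch.cuda.Stream()

def attention_with_async_selection(q, k, v):
    # Make q and k produced on the default stream visible to the side stream.
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        selection = torch.relu(q @ k.transpose(-2, -1))   # selection scores
    out = F.scaled_dot_product_attention(q, k, v)         # main-stream attention
    # Join before the selection scores are consumed downstream.
    torch.cuda.current_stream().wait_stream(side_stream)
    return out, selection
```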

cdutr (Contributor, Author) commented Dec 16, 2025

Thanks for the feedback @sayakpaul! After further experimentation (on CogVideoX), I've confirmed that selective attention doesn't provide benefits for diffusion models: the speedup in the original paper comes from KV-cache eviction during autoregressive generation, which doesn't apply here.

Regarding CUDA streams for selection scores: even if computed asynchronously, we'd still need the full attention computation, since there's no cache to evict from. The overhead comes from the selection matrix computation itself, not from blocking.

I've reached out to the paper's author to ask if I'm missing something, but at the moment I'm not confident this path makes sense for diffusion models. Given this, I think this PR is not suitable for the main library in its current form.

I'll review the TODOs in attention_dispatch.py and the open issues to see if there's a more effective way to contribute to the attention backend work.

