
Conversation

cdutr (Contributor) commented Dec 15, 2025

Implements a new attention processing technique based on the "Selective Attention Improves Transformer" paper, enabling more efficient and flexible attention mechanisms.

What does this PR do?

Key features:

  • Introduces MemoryEfficientSelectiveAttnProcessor2_0 for advanced attention masking
  • Supports configurable token selection and optional pruning
  • Adds methods to enable/disable selective attention across model modules
  • Provides fine-grained control over masking strength and token selection

The goal is more intelligent, computationally efficient attention: selective token interaction and optional pruning can improve model performance while reducing attention overhead. A usage sketch is shown below.
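A minimal usage sketch, for illustration only: the processor class name comes from this PR's description, but the import path, constructor arguments, and the `set_attn_processor` wiring below are assumptions about how the "masking strength" and "optional pruning" knobs might be exposed, not the PR's actual API.

```python
# Hedged sketch, not the PR's actual API. The processor class name is from the
# PR description; the import path and keyword arguments are assumptions.
import torch
from diffusers import StableDiffusionPipeline
# Assumed location of the new processor inside this PR:
from diffusers.models.attention_processor import MemoryEfficientSelectiveAttnProcessor2_0

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical knobs mirroring "masking strength" and "optional pruning"
# from the feature list above.
processor = MemoryEfficientSelectiveAttnProcessor2_0(
    masking_strength=1.0,  # assumed parameter name
    prune_tokens=False,    # assumed parameter name
)
pipe.unet.set_attn_processor(processor)  # existing diffusers API for swapping processors

image = pipe("an astronaut riding a horse").images[0]
```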

The original paper targets causal attention in LLMs where benefits come from KV-cache eviction during autoregressive generation. For bidirectional attention in diffusion models, we observe overhead from computing selection scores without actual computation reduction. However, we expect better results for video generation where sequences are much longer (16K-100K+ tokens) and temporal redundancy provides more pruning opportunities.
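For context, here is a simplified paraphrase of the selection mechanism described in the paper (my reading of it, not code from this PR): earlier tokens accumulate "masking pressure" on later keys, and in a causal LLM keys that are fully masked can be evicted from the KV cache during generation, which is where the savings come from. With bidirectional attention there is no cache to shrink, so computing the selection scores is pure extra work.

```python
# Simplified paraphrase of "Selective Attention Improves Transformer";
# not code from this PR, and it omits details such as using a dedicated
# head for the selection scores.
import torch

def selective_attention(q, k, v):
    # q, k, v: (batch, heads, seq, dim); causal setting as in the paper.
    logits = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    s = torch.relu(logits)                    # selection scores S[i, j]
    # F[i, j] = sum of S[m, j] over earlier queries m < i: accumulated masking
    # pressure on key j. In an autoregressive LLM, keys whose F grows large
    # enough can be evicted from the KV cache.
    f = torch.cumsum(s, dim=-2) - s
    logits = logits - f                       # soft-mask the selected keys
    seq = q.shape[-2]
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), 1)
    logits = logits.masked_fill(causal, float("-inf"))
    return torch.softmax(logits, dim=-1) @ v
```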

As a next step, I am adapting it to video models to check the benefits. If it works, I will add documentation and unit tests.

Fixes #12817

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. @sayakpaul @DN6 may be interested.

sayakpaul (Member) commented

Thanks for the work! The attention backends we have in https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_dispatch.py are generic enough to be applied to most models.

The attention processor mechanism, on the other hand, is for allowing users to hook up their favorite processors (like this), which we may not be able to maintain.

So, let's try to make sure this attention processor is a generic one that can provide benefits for most models. Otherwise, we can always have it under research_projects and show a concrete example with a model where the benefits are totally evident and visible.

> For bidirectional attention in diffusion models, we observe overhead from computing selection scores without actual computation reduction.

Hmm, I haven't looked too deep yet but can't we compute this in another CUDA stream or something? Or, is it always a blocking op?
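For reference, the kind of overlap being asked about would look roughly like the sketch below (illustrative only, not part of this PR): launch the selection-score computation on a side CUDA stream while the main stream runs the regular attention. Function and variable names are made up for the example.

```python
# Illustrative only: overlap the selection-score computation with the regular
# attention on a side CUDA stream.
import torch
import torch.nn.functional as F

side_stream = torch.cuda.Stream()

def attention_with_async_selection(q, k, v):
    # Make q and k produced on the default stream visible to the side stream.
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        selection = torch.relu(q @ k.transpose(-2, -1))   # selection scores
    out = F.scaled_dot_product_attention(q, k, v)         # main-stream attention
    # Join before the selection scores are consumed downstream.
    torch.cuda.current_stream().wait_stream(side_stream)
    return out, selection
```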

cdutr (Contributor, Author) commented Dec 16, 2025

Thanks for the feedback @sayakpaul! After further experimentation (on CogVideoX), I've confirmed that selective attention doesn't provide benefits for diffusion models: the speedup in the original paper comes from KV-cache eviction during autoregressive generation, which doesn't apply here.

Regarding CUDA streams for selection scores: even if computed asynchronously, we'd still need the full attention computation, since there's no cache to evict from. The overhead comes from the selection matrix computation itself, not from blocking.

I've reached out to the paper's author to ask if I'm missing something, but at the moment I'm not confident this path makes sense for diffusion models. Given this, I think this PR is not suitable for the main library in its current form.

I'll review the TODOs in attention_dispatch.py and the open issues to see if there's a more effective way to contribute to the attention backend work.

