
Conversation

@frozenleaves
Contributor

What does this PR do?

Add an allreduce distributed test on NPU.
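For reviewers who want to try it locally: the new file can be run on its own with an invocation along the lines of the repository Makefile target, e.g. "ASCEND_RT_VISIBLE_DEVICES=0,1 pytest -vv --import-mode=importlib tests_v1/accelerator/test_allreduce.py" on a machine with at least two visible NPUs (the device indices here are illustrative).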

@frozenleaves frozenleaves marked this pull request as ready for review December 13, 2025 08:54
@gemini-code-assist
Contributor

Summary of Changes

Hello @frozenleaves, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new set of distributed tests to validate the "all_reduce" functionality on NPU devices. The tests ensure that various reduction operations, including sum, mean, and max, perform correctly in a multi-NPU environment, thereby enhancing the robustness and reliability of distributed computations within the "llamafactory" accelerator module.

Highlights

  • New Test File: A new test file, "tests_v1/accelerator/test_allreduce.py", has been added to the project.
  • NPU Allreduce Testing: This pull request introduces comprehensive distributed "all_reduce" tests specifically designed for NPU (Neural Processing Unit) devices.
  • Operation Coverage: The newly added tests cover various reduction operations, including "SUM", "MEAN", and "MAX", ensuring their correct behavior in a distributed NPU environment.
  • Distributed Setup: The tests leverage "torch.distributed" with the "hccl" backend and "torch.multiprocessing" for spawning worker processes, simulating a real-world distributed training setup.
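To make the distributed setup described in the last highlight concrete, here is a minimal sketch of the pattern such a test typically follows. It is not the code from this PR: the MASTER_ADDR/MASTER_PORT values, the tensor values, and the assumption that all_reduce(tensor, op=...) returns the reduced tensor are illustrative, and torch_npu is assumed to be installed and loaded so that torch.npu is available.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

from llamafactory.v1.accelerator.helper import ReduceOp, all_reduce

WORLD_SIZE = 2


def _dist_worker(rank: int, world_size: int) -> None:
    # Each spawned process binds to one NPU and joins the HCCL process group.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")  # illustrative fixed port
    torch.npu.set_device(rank)
    dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)

    # SUM across 2 ranks: rank 0 contributes 1.0 and rank 1 contributes 2.0, so the result is 3.0.
    tensor = torch.tensor([rank + 1.0], device="npu")
    result = all_reduce(tensor, op=ReduceOp.SUM)
    assert torch.allclose(result.cpu(), torch.tensor([3.0]))

    dist.destroy_process_group()


def test_distributed_ops():
    # Spawn one worker per NPU and wait for all of them to finish.
    mp.spawn(_dist_worker, args=(WORLD_SIZE,), nprocs=WORLD_SIZE, join=True)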

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a distributed test for allreduce on NPU. The test implementation is sound, but there are a few areas for improvement to enhance robustness and adhere to testing best practices. My review includes suggestions to:

  • Dynamically allocate network ports to prevent flaky tests.
  • Use pytest fixtures for setting environment variables to improve test isolation.
  • Remove unnecessary print statements from the test code.

These changes will make the new test more reliable and maintainable.

Comment on lines 56 to 62
def test_distributed_ops():
    mp.spawn(
        _dist_worker,
        args=(WORLD_SIZE,),
        nprocs=WORLD_SIZE,
        join=True,
    )
Contributor

high

This function should be updated to improve test reliability and isolation. My suggestion does two things:

  1. It dynamically finds a free network port to prevent test flakiness from port conflicts.
  2. It uses pytest's monkeypatch fixture to set the ASCEND_RT_VISIBLE_DEVICES environment variable, which isolates this configuration to this specific test. This change requires removing the global os.environ setting at the top of the file.
Suggested change
def test_distributed_ops():
    mp.spawn(
        _dist_worker,
        args=(WORLD_SIZE,),
        nprocs=WORLD_SIZE,
        join=True,
    )
def test_distributed_ops(monkeypatch):
    monkeypatch.setenv("ASCEND_RT_VISIBLE_DEVICES", "0,1")
    import socket
    from contextlib import closing
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
        s.bind(("", 0))
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        port = s.getsockname()[1]
    mp.spawn(
        _dist_worker,
        args=(WORLD_SIZE, port),
        nprocs=WORLD_SIZE,
        join=True,
    )

Owner

we can add

def find_available_port() -> int:
to v1 utils
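A minimal sketch of what such a helper could look like (its final home, v1/utils, is the one discussed in the follow-up comment; there is an inherent race between discovering the port and the workers binding it, which is usually acceptable for a single test process):

import socket
from contextlib import closing


def find_available_port() -> int:
    # Ask the OS for any unused TCP port and return its number.
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", 0))  # port 0 means "pick a free port"
        return sock.getsockname()[1]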

Contributor Author

I just added v1/utils/utils.py with this function; please consider whether this location and filename are okay. The other functions in misc.py can be added later once they are actually used.

from llamafactory.v1.accelerator.helper import ReduceOp, all_reduce, is_torch_npu_available


os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "0,1"
Contributor

medium

Setting environment variables at the module level can introduce side effects and make tests less isolated. This should be moved into the test function using pytest's monkeypatch fixture. I've provided a full implementation in my comment on the test_distributed_ops function.

Owner

It will affect other tests. Not recommended

Contributor Author

I will try to set this env var inside the test function and reset it after the test finishes.
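One way to do that manually (a sketch; the test body is elided, and pytest's monkeypatch fixture shown in the suggestion above achieves the same effect with less code):

import os


def test_distributed_ops():
    previous = os.environ.get("ASCEND_RT_VISIBLE_DEVICES")
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "0,1"
    try:
        ...  # spawn the distributed workers here
    finally:
        # Restore the original environment so other tests are unaffected.
        if previous is None:
            os.environ.pop("ASCEND_RT_VISIBLE_DEVICES", None)
        else:
            os.environ["ASCEND_RT_VISIBLE_DEVICES"] = previous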

Comment on lines 34 to 35
if rank == 0:
    print("Case 1 Passed")
Contributor

medium

print statements should generally be avoided in automated tests as they can clutter the output. The success of a test is determined by its assertions passing, not by printed messages.

Owner

it should be removed

Contributor Author

ok
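With the print dropped, each case reduces to its assertion; a sketch, where the helper name, result, and expected are illustrative stand-ins for the tensors the test already compares:

import torch


def _check_case(result: torch.Tensor, expected: torch.Tensor) -> None:
    # A failing assert already reports the mismatch, so no "Case N Passed" print is needed.
    assert torch.allclose(result, expected)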

Makefile Outdated

test:
	CUDA_VISIBLE_DEVICES= ASCEND_RT_VISIBLE_DEVICES=0 WANDB_DISABLED=true pytest -vv --import-mode=importlib tests/ tests_v1/
	CUDA_VISIBLE_DEVICES= WANDB_DISABLED=true pytest -vv --import-mode=importlib tests/ tests_v1/
Owner

multiple spaces

    add_valuehead: bool = False,
) -> Union["PreTrainedModel", "LoraModel"]:
    current_device = get_current_device()
    # current_device = get_current_device()
Owner

we can directly remove this line

def _dist_worker(rank, world_size):
    torch.npu.set_device(rank)
    dist.init_process_group(
        backend="hccl",
Collaborator

It's better to consider covering CUDA first rather than hard-coding the distributed backend.

Owner

sure, can we use backend=None here?
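If relying on backend=None turns out to be version-dependent, an explicit selection helper is a simple alternative. A sketch, assuming torch_npu is installed so that torch.npu exists; the helper names are illustrative, not code from this PR:

import torch
import torch.distributed as dist


def _select_backend() -> str:
    # Prefer NCCL on CUDA, HCCL on Ascend NPU, and fall back to Gloo on CPU.
    if torch.cuda.is_available():
        return "nccl"
    if hasattr(torch, "npu") and torch.npu.is_available():
        return "hccl"
    return "gloo"


def _init_process_group(rank: int, world_size: int) -> None:
    dist.init_process_group(backend=_select_backend(), rank=rank, world_size=world_size)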



@pytest.mark.runs_on(["npu"])
@pytest.mark.require_distributed(2)
Collaborator

@jiaqiw09 jiaqiw09 Dec 15, 2025

There is a conftest.py in tests_v1; I think it's better to copy the slow/runs_on/require_distributed handling from tests/conftest.py, or the markers may not take effect as you expect.

Owner

sure
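The authoritative marker handling lives in tests/conftest.py and should simply be copied from there; purely to illustrate the usual shape of such a hook (the device checks below are assumptions, not the repository's actual code):

# tests_v1/conftest.py -- illustrative sketch only
import pytest
import torch


def pytest_runtest_setup(item):
    # Skip tests whose runs_on(...) marker names a device type that is not present.
    runs_on = item.get_closest_marker("runs_on")
    if runs_on and "npu" in runs_on.args[0]:
        if not (hasattr(torch, "npu") and torch.npu.is_available()):
            pytest.skip("requires an NPU device")

    # Skip tests whose require_distributed(n) marker asks for more devices than are visible.
    require_distributed = item.get_closest_marker("require_distributed")
    if require_distributed:
        needed = require_distributed.args[0]
        available = torch.npu.device_count() if hasattr(torch, "npu") else 0
        if available < needed:
            pytest.skip(f"requires at least {needed} devices")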

)
)

def _parse_visible_devices(value: str | None) -> list[str]:
Collaborator

Combine _parse_visible_devices into _handle_device_visibility to reduce function complexity; I think this function is only used in that specific place.

Owner

this PR needs to be rebased
