[V1] add allreduce test on npu #9612
Conversation
Summary of Changes
Hello @frozenleaves, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new set of distributed tests to validate the `all_reduce` functionality on NPU devices. The tests ensure that various reduction operations, including sum, mean, and max, perform correctly in a multi-NPU environment, thereby enhancing the robustness and reliability of distributed computations within the `llamafactory` accelerator module.
Code Review
This pull request adds a distributed test for allreduce on NPU. The test implementation is sound, but there are a few areas for improvement to enhance robustness and adhere to testing best practices. My review includes suggestions to:
- Dynamically allocate network ports to prevent flaky tests.
- Use pytest fixtures for setting environment variables to improve test isolation.
- Remove unnecessary print statements from the test code.
These changes will make the new test more reliable and maintainable.
```python
def test_distributed_ops():
    mp.spawn(
        _dist_worker,
        args=(WORLD_SIZE,),
        nprocs=WORLD_SIZE,
        join=True,
    )
```
This function should be updated to improve test reliability and isolation. My suggestion does two things:
- It dynamically finds a free network port to prevent test flakiness from port conflicts.
- It uses pytest's `monkeypatch` fixture to set the `ASCEND_RT_VISIBLE_DEVICES` environment variable, which isolates this configuration to this specific test. This change requires removing the global `os.environ` setting at the top of the file.
| def test_distributed_ops(): | |
| mp.spawn( | |
| _dist_worker, | |
| args=(WORLD_SIZE,), | |
| nprocs=WORLD_SIZE, | |
| join=True, | |
| ) | |
| def test_distributed_ops(monkeypatch): | |
| monkeypatch.setenv("ASCEND_RT_VISIBLE_DEVICES", "0,1") | |
| import socket | |
| from contextlib import closing | |
| with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s: | |
| s.bind(("", 0)) | |
| s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) | |
| port = s.getsockname()[1] | |
| mp.spawn( | |
| _dist_worker, | |
| args=(WORLD_SIZE, port), | |
| nprocs=WORLD_SIZE, | |
| join=True, | |
| ) |
we can add:

LLaMA-Factory/src/llamafactory/extras/misc.py (line 320 in fdd2427)

```python
def find_available_port() -> int:
```
I just added `v1/utils/utils.py` to hold this function, please consider whether this location and filename are ok? Other functions in `misc.py` we can add later once they are actually used.
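For reference, such a port-finding helper is usually just a few lines: bind a socket to port 0 so the OS assigns a free port, read the port back, and release the socket. The sketch below only illustrates the idea; the actual code in `misc.py` / `v1/utils/utils.py` may differ.

```python
import socket
from contextlib import closing


def find_available_port() -> int:
    """Ask the OS for a free TCP port by binding to port 0, then release it."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", 0))
        return sock.getsockname()[1]
```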
```python
from llamafactory.v1.accelerator.helper import ReduceOp, all_reduce, is_torch_npu_available


os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "0,1"
```
It will affect other tests. Not recommended
I will try to set this env var inside the test function and reset it after the test finishes.
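For illustration, a save/restore pattern scoped to the test could look like the sketch below; it assumes the module-level `_dist_worker` and `WORLD_SIZE` from the test file and is not the PR's actual code (pytest's `monkeypatch.setenv`, as suggested above, achieves the same scoping with less boilerplate):

```python
import os

import torch.multiprocessing as mp


def test_distributed_ops():
    # Save the previous value so the change stays scoped to this test.
    previous = os.environ.get("ASCEND_RT_VISIBLE_DEVICES")
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "0,1"
    try:
        mp.spawn(_dist_worker, args=(WORLD_SIZE,), nprocs=WORLD_SIZE, join=True)
    finally:
        # Restore or remove the variable once the test finishes.
        if previous is None:
            os.environ.pop("ASCEND_RT_VISIBLE_DEVICES", None)
        else:
            os.environ["ASCEND_RT_VISIBLE_DEVICES"] = previous
```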
```python
if rank == 0:
    print("Case 1 Passed")
```
it should be removed
ok,
Co-authored-by: Yaowei Zheng <[email protected]>
Makefile
Outdated
```diff
 test:
-	CUDA_VISIBLE_DEVICES= ASCEND_RT_VISIBLE_DEVICES=0 WANDB_DISABLED=true pytest -vv --import-mode=importlib tests/ tests_v1/
+	CUDA_VISIBLE_DEVICES=  WANDB_DISABLED=true pytest -vv --import-mode=importlib tests/ tests_v1/
```
multiple spaces
src/llamafactory/train/test_utils.py
Outdated
```diff
     add_valuehead: bool = False,
 ) -> Union["PreTrainedModel", "LoraModel"]:
-    current_device = get_current_device()
+    # current_device = get_current_device()
```
we can directly remove this line
```python
def _dist_worker(rank, world_size):
    torch.npu.set_device(rank)
    dist.init_process_group(
        backend="hccl",
```
it's better to consider covering CUDA as well, rather than hard-coding the dist backend.
sure, can we use `backend=None` here?
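For reference, one device-aware way to avoid hard-coding the backend is sketched below; the `_init_worker` name and explicit `port` argument are illustrative only, and recent PyTorch versions also accept `backend=None` and pick a default for the current device:

```python
import torch
import torch.distributed as dist


def _init_worker(rank: int, world_size: int, port: int) -> None:
    # Choose the backend from the accelerator that is actually present
    # instead of fixing it to "hccl"; assumes torch_npu is installed on NPU hosts.
    if hasattr(torch, "npu") and torch.npu.is_available():
        torch.npu.set_device(rank)
        backend = "hccl"
    elif torch.cuda.is_available():
        torch.cuda.set_device(rank)
        backend = "nccl"
    else:
        backend = "gloo"  # CPU fallback

    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://127.0.0.1:{port}",
        rank=rank,
        world_size=world_size,
    )
```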
```python
@pytest.mark.runs_on(["npu"])
@pytest.mark.require_distributed(2)
```
there is a conftest.py in tests_v1; I think it's better to copy the slow/runs_on/run_distributed handling from tests/conftest.py, or the markers may not take effect as you expect.
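For illustration, marker handling of that kind in a conftest.py usually registers the custom markers and skips tests whose device requirements are not met. The sketch below shows the general pattern only and is not the actual contents of tests/conftest.py:

```python
import pytest
import torch


def pytest_configure(config):
    # Register the custom markers so pytest does not warn about unknown marks.
    config.addinivalue_line("markers", "runs_on(devices): run only on the listed device types")
    config.addinivalue_line("markers", "require_distributed(n): require at least n devices")


def _current_device() -> str:
    try:
        import torch_npu  # noqa: F401

        if torch.npu.is_available():
            return "npu"
    except ImportError:
        pass
    return "cuda" if torch.cuda.is_available() else "cpu"


def pytest_collection_modifyitems(config, items):
    device = _current_device()
    for item in items:
        runs_on = item.get_closest_marker("runs_on")
        if runs_on and device not in runs_on.args[0]:
            item.add_marker(pytest.mark.skip(reason=f"requires one of {runs_on.args[0]}, got {device}"))
        require_dist = item.get_closest_marker("require_distributed")
        if require_dist:
            count = torch.npu.device_count() if device == "npu" else torch.cuda.device_count()
            if count < require_dist.args[0]:
                item.add_marker(pytest.mark.skip(reason=f"requires {require_dist.args[0]} devices"))
```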
sure
tests/conftest.py
Outdated
```python
    )
)


def _parse_visible_devices(value: str | None) -> list[str]:
```
combine `_parse_visible_devices` into `_handle_device_visibility` to reduce function complexity; I think this function is specific to that single use.
this PR needs to be rebased
What does this PR do?
Add an allreduce distributed test on NPU.
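For context, the checks in such a test typically look like the sketch below (written against plain `torch.distributed` to stay self-contained; the PR itself exercises the `all_reduce`/`ReduceOp` helpers from `llamafactory.v1.accelerator.helper`, whose exact API is not reproduced here). It assumes an already-initialized process group:

```python
import torch
import torch.distributed as dist


def _check_all_reduce(rank: int, world_size: int, device: torch.device) -> None:
    # Case 1: sum — every rank contributes (rank + 1), so the result is 1 + 2 + ... + world_size.
    x = torch.tensor([float(rank + 1)], device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    assert x.item() == world_size * (world_size + 1) / 2

    # Case 2: mean — reduce with SUM, then divide by the world size.
    y = torch.tensor([float(rank + 1)], device=device)
    dist.all_reduce(y, op=dist.ReduceOp.SUM)
    y /= world_size
    assert y.item() == (world_size + 1) / 2

    # Case 3: max — the result is the largest contribution across ranks.
    z = torch.tensor([float(rank + 1)], device=device)
    dist.all_reduce(z, op=dist.ReduceOp.MAX)
    assert z.item() == world_size
```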