[V1] add allreduce test on npu #9612
Conversation
Summary of Changes
Hello @frozenleaves, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new set of distributed tests to validate the `all_reduce` functionality on NPU devices. The tests ensure that various reduction operations, including sum, mean, and max, perform correctly in a multi-NPU environment, thereby enhancing the robustness and reliability of distributed computations within the `llamafactory` accelerator module.
Code Review
This pull request adds a distributed test for allreduce on NPU. The test implementation is sound, but there are a few areas for improvement to enhance robustness and adhere to testing best practices. My review includes suggestions to:
- Dynamically allocate network ports to prevent flaky tests.
- Use pytest fixtures for setting environment variables to improve test isolation.
- Remove unnecessary print statements from the test code.
These changes will make the new test more reliable and maintainable.
```python
def test_distributed_ops():
    mp.spawn(
        _dist_worker,
        args=(WORLD_SIZE,),
        nprocs=WORLD_SIZE,
        join=True,
    )
```
This function should be updated to improve test reliability and isolation. My suggestion does two things:
- It dynamically finds a free network port to prevent test flakiness from port conflicts.
- It uses pytest's `monkeypatch` fixture to set the `ASCEND_RT_VISIBLE_DEVICES` environment variable, which isolates this configuration to this specific test. This change requires removing the global `os.environ` setting at the top of the file.
| def test_distributed_ops(): | |
| mp.spawn( | |
| _dist_worker, | |
| args=(WORLD_SIZE,), | |
| nprocs=WORLD_SIZE, | |
| join=True, | |
| ) | |
| def test_distributed_ops(monkeypatch): | |
| monkeypatch.setenv("ASCEND_RT_VISIBLE_DEVICES", "0,1") | |
| import socket | |
| from contextlib import closing | |
| with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s: | |
| s.bind(("", 0)) | |
| s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) | |
| port = s.getsockname()[1] | |
| mp.spawn( | |
| _dist_worker, | |
| args=(WORLD_SIZE, port), | |
| nprocs=WORLD_SIZE, | |
| join=True, | |
| ) |
we can add:

LLaMA-Factory/src/llamafactory/extras/misc.py (line 320 in fdd2427)

```python
def find_available_port() -> int:
```
I just added `v1/utils/utils.py` to hold this function, please consider whether this location and filename are ok? Other functions in `misc.py` we can add later once they are actually used.
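For reference, such a port-finding helper is usually just a few lines: bind a socket to port 0 so the OS assigns a free port, read the port back, and release the socket. The sketch below only illustrates the idea; the actual code in `misc.py` / `v1/utils/utils.py` may differ.

```python
import socket
from contextlib import closing


def find_available_port() -> int:
    """Ask the OS for a free TCP port by binding to port 0, then release it."""
    with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", 0))
        return sock.getsockname()[1]
```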
```python
from llamafactory.v1.accelerator.helper import ReduceOp, all_reduce, is_torch_npu_available


os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "0,1"
```
It will affect other tests. Not recommended
I will try to set this env var inside the test function and reset it after the test finishes.
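For illustration, a save/restore pattern scoped to the test could look like the sketch below; it assumes the module-level `_dist_worker` and `WORLD_SIZE` from the test file and is not the PR's actual code (pytest's `monkeypatch.setenv`, as suggested above, achieves the same scoping with less boilerplate):

```python
import os

import torch.multiprocessing as mp


def test_distributed_ops():
    # Save the previous value so the change stays scoped to this test.
    previous = os.environ.get("ASCEND_RT_VISIBLE_DEVICES")
    os.environ["ASCEND_RT_VISIBLE_DEVICES"] = "0,1"
    try:
        mp.spawn(_dist_worker, args=(WORLD_SIZE,), nprocs=WORLD_SIZE, join=True)
    finally:
        # Restore or remove the variable once the test finishes.
        if previous is None:
            os.environ.pop("ASCEND_RT_VISIBLE_DEVICES", None)
        else:
            os.environ["ASCEND_RT_VISIBLE_DEVICES"] = previous
```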
```python
if rank == 0:
    print("Case 1 Passed")
```
it should be removed
ok,
Co-authored-by: Yaowei Zheng <[email protected]>
Makefile
Outdated
```diff
 test:
-	CUDA_VISIBLE_DEVICES= ASCEND_RT_VISIBLE_DEVICES=0 WANDB_DISABLED=true pytest -vv --import-mode=importlib tests/ tests_v1/
+	CUDA_VISIBLE_DEVICES=  WANDB_DISABLED=true pytest -vv --import-mode=importlib tests/ tests_v1/
```
multiple spaces
src/llamafactory/train/test_utils.py
Outdated
```diff
     add_valuehead: bool = False,
 ) -> Union["PreTrainedModel", "LoraModel"]:
-    current_device = get_current_device()
+    # current_device = get_current_device()
```
we can directly remove this line
```python
def _dist_worker(rank, world_size):
    torch.npu.set_device(rank)
    dist.init_process_group(
        backend="hccl",
```
it's better to consider covering CUDA as well, rather than hard-coding the dist backend.
sure, can we use `backend=None` here?
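For reference, one device-aware way to avoid hard-coding the backend is sketched below; the `_init_worker` name and explicit `port` argument are illustrative only, and recent PyTorch versions also accept `backend=None` and pick a default for the current device:

```python
import torch
import torch.distributed as dist


def _init_worker(rank: int, world_size: int, port: int) -> None:
    # Choose the backend from the accelerator that is actually present
    # instead of fixing it to "hccl"; assumes torch_npu is installed on NPU hosts.
    if hasattr(torch, "npu") and torch.npu.is_available():
        torch.npu.set_device(rank)
        backend = "hccl"
    elif torch.cuda.is_available():
        torch.cuda.set_device(rank)
        backend = "nccl"
    else:
        backend = "gloo"  # CPU fallback

    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://127.0.0.1:{port}",
        rank=rank,
        world_size=world_size,
    )
```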
```python
@pytest.mark.runs_on(["npu"])
@pytest.mark.require_distributed(2)
```
there is a conftest.py in tests_v1; I think it's better to copy the slow/runs_on/run_distributed handling from tests/conftest.py, or the markers may not take effect as you expect.
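For illustration, marker handling of that kind in a conftest.py usually registers the custom markers and skips tests whose device requirements are not met. The sketch below shows the general pattern only and is not the actual contents of tests/conftest.py:

```python
import pytest
import torch


def pytest_configure(config):
    # Register the custom markers so pytest does not warn about unknown marks.
    config.addinivalue_line("markers", "runs_on(devices): run only on the listed device types")
    config.addinivalue_line("markers", "require_distributed(n): require at least n devices")


def _current_device() -> str:
    try:
        import torch_npu  # noqa: F401

        if torch.npu.is_available():
            return "npu"
    except ImportError:
        pass
    return "cuda" if torch.cuda.is_available() else "cpu"


def pytest_collection_modifyitems(config, items):
    device = _current_device()
    for item in items:
        runs_on = item.get_closest_marker("runs_on")
        if runs_on and device not in runs_on.args[0]:
            item.add_marker(pytest.mark.skip(reason=f"requires one of {runs_on.args[0]}, got {device}"))
        require_dist = item.get_closest_marker("require_distributed")
        if require_dist:
            count = torch.npu.device_count() if device == "npu" else torch.cuda.device_count()
            if count < require_dist.args[0]:
                item.add_marker(pytest.mark.skip(reason=f"requires {require_dist.args[0]} devices"))
```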
sure
tests/conftest.py
Outdated
```python
    )
)


def _parse_visible_devices(value: str | None) -> list[str]:
```
combine `_parse_visible_devices` into `_handle_device_visibility` to reduce function complexity; I think this function is specific to that single use.
this PR needs to be rebased
What does this PR do?
Add an allreduce distributed test on NPU.
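For context, the checks in such a test typically look like the sketch below (written against plain `torch.distributed` to stay self-contained; the PR itself exercises the `all_reduce`/`ReduceOp` helpers from `llamafactory.v1.accelerator.helper`, whose exact API is not reproduced here). It assumes an already-initialized process group:

```python
import torch
import torch.distributed as dist


def _check_all_reduce(rank: int, world_size: int, device: torch.device) -> None:
    # Case 1: sum — every rank contributes (rank + 1), so the result is 1 + 2 + ... + world_size.
    x = torch.tensor([float(rank + 1)], device=device)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    assert x.item() == world_size * (world_size + 1) / 2

    # Case 2: mean — reduce with SUM, then divide by the world size.
    y = torch.tensor([float(rank + 1)], device=device)
    dist.all_reduce(y, op=dist.ReduceOp.SUM)
    y /= world_size
    assert y.item() == (world_size + 1) / 2

    # Case 3: max — the result is the largest contribution across ranks.
    z = torch.tensor([float(rank + 1)], device=device)
    dist.all_reduce(z, op=dist.ReduceOp.MAX)
    assert z.item() == world_size
```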