Closed
Labels
npu: This problem is related to NPU devices
solved: This problem has already been solved
Description
Reminder
- I have read the README and searched the existing issues.
System Info
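While fine-tuning Qwen2-VL, I ran into the following:
(1) Single-node multi-card training on image-text pairs or on text-only data, with either LoRA or full fine-tuning: succeeds
(2) Multi-node multi-card training on image-text pairs or on text-only data, with either LoRA or full fine-tuning: succeeds
(3) Single-node multi-card training on mixed data (image-text pairs together with text-only samples), LoRA on the 7B model: succeeds
(4) Single-node multi-card training on mixed data, full fine-tuning of the 7B model with ZeRO-3 + offload: fails
(5) Multi-node multi-card training on mixed data, LoRA: fails
(6) Multi-node multi-card training on mixed data, full fine-tuning with ZeRO-3 + offload: fails
In every failing case, training hangs right at the start.
Also, each card has only 32 GB of memory, so ZeRO-2 does not fit and ZeRO-3 is the only option.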
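For readers unfamiliar with the "ZeRO-3 + offload" setup mentioned above, here is a minimal sketch of what such a DeepSpeed configuration usually looks like. It is not the configuration used in this report, which was not included. The values are assumptions, and the "auto" entries assume the Hugging Face Trainer integration fills them in.

```python
import json

# Illustrative ZeRO-3 + CPU offload settings (assumed values, not the
# reporter's actual config, which does not appear in this issue).
zero3_offload_config = {
    # "auto" assumes the Hugging Face Trainer integration resolves these.
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        # Stage 3 partitions parameters, gradients and optimizer state.
        "stage": 3,
        # Move optimizer state and parameters to host memory to fit 32 GB cards.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

# Serialize to the JSON text that would be saved as a DeepSpeed config file.
config_json = json.dumps(zero3_offload_config, indent=2)
```

Writing `config_json` to a file and pointing the training launcher at it is the usual way to apply it. The key the launcher expects depends on the framework and is not shown here.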
Reproduction
...
Expected behavior
No response
Others
No response