
Training Qwen2-VL on mixed data hangs with multi-node multi-card NPU deployment #5714

@lizhishan1997

Description


Reminder

  • I have read the README and searched the existing issues.

System Info

While fine-tuning Qwen2-VL, I ran into the following:
(1) Single-node multi-card training on image-text pairs or pure text, whether LoRA or full fine-tuning: succeeds
(2) Multi-node multi-card training on image-text pairs or pure text, whether LoRA or full fine-tuning: succeeds
(3) Single-node multi-card training on mixed data, LoRA, 7B: succeeds
(4) Single-node multi-card training on mixed data, full fine-tuning, 7B, ZeRO-3 + offload: fails
(5) Multi-node multi-card training on mixed data, LoRA: fails
(6) Multi-node multi-card training on mixed data, full fine-tuning, ZeRO-3 + offload: fails

In the failing cases, training hangs right at the start.

Also, since each card only has 32 GB of memory, ZeRO-2 cannot fit the model, so ZeRO-3 is the only option.
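A likely cause (my assumption, not confirmed in the report): under ZeRO-3, every rank must execute the same submodules in the same order so that their parameter all-gathers match. With mixed data, a rank whose batch happens to be text-only skips the vision tower, and the other ranks block forever inside the collective, which fits the "hangs right at the start" symptom. A minimal sketch of the usual workaround, padding text-only samples with a dummy image in the collator so every rank runs the vision encoder, using plain lists in place of tensors; `pad_text_only_samples` and the dummy shape are hypothetical names, not LLaMA-Factory API:

```python
import math

def pad_text_only_samples(batch, dummy_shape=(3, 28, 28)):
    """Give every sample a pixel_values entry (zeros if it had none).

    A zero dummy image contributes nothing meaningful to the loss but
    keeps the vision encoder's forward/backward aligned across ranks,
    so ZeRO-3's parameter all-gathers no longer desynchronize.
    """
    numel = math.prod(dummy_shape)
    for sample in batch:
        if sample.get("pixel_values") is None:
            sample["pixel_values"] = [0.0] * numel  # dummy image
            sample["has_image"] = False  # mask its vision loss downstream
        else:
            sample["has_image"] = True
    return batch
```

In a real collator the dummy would be a tensor matching the processor's output shape, and the `has_image` flag would be used to mask the dummy image's contribution out of the loss.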

Reproduction

(screenshot failed to upload)

...

Expected behavior

No response

Others

No response

Metadata



    Labels

    npu: This problem is related to NPU devices
    solved: This problem has been already solved
