[total_loss = (1 - self.kd_ratio) * lm_loss + self.kd_ratio * distil_loss](https://github.com/modelscope/easydistill/blob/main/easydistill/kd/train.py#L131)

Here `lm_loss` and `distil_loss` differ in magnitude by roughly a hundred to a thousand times, so does combining them directly like this make sense? In my actual runs, `lm_loss` starts at a few tens and converges to around 0.1, while `distil_loss` starts at around 0.001 and converges to around 0.0001. With this weighting, `distil_loss` contributes essentially nothing to the total.
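For context, here is a minimal sketch of one common workaround, not something easydistill does; the function name `combined_loss` and the rescaling scheme are hypothetical. It normalizes `distil_loss` to the detached magnitude of `lm_loss` before weighting, so `kd_ratio` actually controls the mix rather than being swamped by the scale gap:

```python
import torch

def combined_loss(lm_loss: torch.Tensor,
                  distil_loss: torch.Tensor,
                  kd_ratio: float = 0.5,
                  eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical scale-balanced KD loss (not the easydistill implementation).

    Rescales distil_loss onto lm_loss's magnitude before weighting.
    The scale factor is computed from detached values, so it only
    balances magnitudes; gradients of each term are unchanged apart
    from the constant factor.
    """
    scale = lm_loss.detach() / (distil_loss.detach() + eps)
    return (1 - kd_ratio) * lm_loss + kd_ratio * scale * distil_loss
```

With this rescaling, both terms sit on the same order of magnitude, so a `kd_ratio` of 0.5 genuinely mixes the two signals instead of reducing to `lm_loss` alone.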