[doc] misc: fix doc so that the penalty starts when the length exceeds `max_response_length - overlong_buffer.len`
#3856
Signed-off-by: bzantium <[email protected]>
Code Review
This pull request correctly fixes a typo in the documentation for the DAPO algorithm's overlong_buffer penalty, aligning the description with the implementation. I have added a couple of suggestions to rephrase the updated documentation in both files to further improve clarity for users.
  Setting `overlong_buffer.enable` to `True` will penalize the outputs whose lengths are overlong but still within the hard context limit.

- Specifically, the penalty increases linearly from `0` to `overlong_buffer.penalty_factor` when the length of the output exceeds the `max_response_length` by `0` to `overlong_buffer.len` tokens.
+ Specifically, the penalty increases linearly from `0` to `overlong_buffer.penalty_factor` when the length of the output exceeds the `max_response_length - overlong_buffer.len` by `0` to `overlong_buffer.len` tokens.
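Read as a formula, the corrected sentence describes the piecewise penalty below. The notation is ours, not the docs': $\ell$ is the output length, $L_{\max}$ is `max_response_length`, $B$ is `overlong_buffer.len`, and $\alpha$ is `overlong_buffer.penalty_factor`. Only lengths within the hard limit ($\ell \le L_{\max}$) are covered, and how the penalty is combined with the reward is not stated here.

```latex
p(\ell) =
\begin{cases}
  0, & \ell \le L_{\max} - B, \\[4pt]
  \alpha \cdot \dfrac{\ell - (L_{\max} - B)}{B}, & L_{\max} - B < \ell \le L_{\max}.
\end{cases}
```

At $\ell = L_{\max} - B$ the penalty is $0$, and at $\ell = L_{\max}$ it reaches $\alpha$, which matches "from `0` to `overlong_buffer.penalty_factor`".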
While this change correctly adjusts the threshold, the phrasing 'exceeds... by...' can be slightly confusing. For improved clarity, I suggest rephrasing to describe the range of the penalty application more directly.
- Specifically, the penalty increases linearly from `0` to `overlong_buffer.penalty_factor` when the length of the output exceeds the `max_response_length - overlong_buffer.len` by `0` to `overlong_buffer.len` tokens.
+ Specifically, the penalty increases linearly from `0` to `overlong_buffer.penalty_factor` as the output length increases from `max_response_length - overlong_buffer.len` to `max_response_length`.
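To make the suggested wording concrete, here is a minimal Python sketch of the documented rule. It is not verl's reward-manager code: the function name `overlong_penalty` and its parameters are hypothetical stand-ins for `max_response_length`, `overlong_buffer.len` and `overlong_buffer.penalty_factor`. It returns a non-negative penalty magnitude, and clamping lengths beyond `max_response_length` to the full factor is our assumption.

```python
def overlong_penalty(
    response_length: int,
    max_response_length: int,
    buffer_len: int,
    penalty_factor: float,
) -> float:
    """Hypothetical sketch: penalty magnitude for a response of the given length.

    0 up to (max_response_length - buffer_len), then rising linearly to
    penalty_factor at max_response_length.
    """
    if buffer_len <= 0:
        raise ValueError("buffer_len must be positive")
    # The penalty window starts buffer_len tokens before the hard limit.
    threshold = max_response_length - buffer_len
    exceed = response_length - threshold
    if exceed <= 0:
        return 0.0
    # Assumption: anything at or past the limit receives the full factor.
    return penalty_factor * min(exceed, buffer_len) / buffer_len


# Example with illustrative values: limit 1000 tokens, buffer 200 tokens.
for length in (700, 800, 900, 1000):
    print(length, overlong_penalty(length, 1000, 200, 1.0))
```

With these values the penalty stays at 0 up to 800 tokens, is halfway (0.5) at 900 tokens, and reaches the full factor at 1000 tokens.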
  Setting `overlong_buffer.enable` to `True` will penalize the outputs whose lengths are overlong but still within the hard context limit.

- Specifically, the penalty increases linearly from `0` to `overlong_buffer.penalty_factor` when the length of the output exceeds the `max_response_length` by `0` to `overlong_buffer.len` tokens.
+ Specifically, the penalty increases linearly from `0` to `overlong_buffer.penalty_factor` when the length of the output exceeds the `max_response_length - overlong_buffer.len` by `0` to `overlong_buffer.len` tokens.
While this change correctly adjusts the threshold, the phrasing 'exceeds... by...' can be slightly confusing. For improved clarity, I suggest rephrasing to describe the range of the penalty application more directly.
- Specifically, the penalty increases linearly from `0` to `overlong_buffer.penalty_factor` when the length of the output exceeds the `max_response_length - overlong_buffer.len` by `0` to `overlong_buffer.len` tokens.
+ Specifically, the penalty increases linearly from `0` to `overlong_buffer.penalty_factor` as the output length increases from `max_response_length - overlong_buffer.len` to `max_response_length`.
…e_length - overlong_buffer.len` (volcengine#3856) ### What does this PR do? This PR corrects a minor typo in the documentation for the DAPO algorithm. It changes the threshold for the `overlong_buffer` penalty from starting at `max_response_length` to `max_response_length - overlong_buffer.len`. This ensures the documentation accurately reflects that the penalty is applied as the response length approaches the maximum limit. Fixes volcengine#3855 Signed-off-by: bzantium <[email protected]>
What does this PR do?
This PR corrects a minor typo in the documentation for the DAPO algorithm.
It changes the threshold for the `overlong_buffer` penalty from starting at `max_response_length` to `max_response_length - overlong_buffer.len`. This ensures the documentation accurately reflects that the penalty is applied as the response length approaches the maximum limit.

Fixes #3855
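The effect of the threshold change is easiest to see with numbers. The following small Python sketch compares the length range over which the penalty ramps under the old and the corrected wording. The values (`MAX_RESPONSE_LENGTH = 1000`, `OVERLONG_BUFFER_LEN = 200`) and helper names are hypothetical, chosen only for illustration.

```python
# Hypothetical example values; not defaults taken from any config.
MAX_RESPONSE_LENGTH = 1000
OVERLONG_BUFFER_LEN = 200


def window_old_wording(max_len: int, buffer_len: int) -> tuple[int, int]:
    """Length range over which the penalty ramps, per the old sentence."""
    # "exceeds the max_response_length by 0 to overlong_buffer.len tokens"
    return (max_len, max_len + buffer_len)


def window_corrected(max_len: int, buffer_len: int) -> tuple[int, int]:
    """Length range over which the penalty ramps, per the corrected sentence."""
    # "exceeds max_response_length - overlong_buffer.len by 0 to overlong_buffer.len tokens"
    return (max_len - buffer_len, max_len)


old = window_old_wording(MAX_RESPONSE_LENGTH, OVERLONG_BUFFER_LEN)
new = window_corrected(MAX_RESPONSE_LENGTH, OVERLONG_BUFFER_LEN)
print(f"old wording: {old}, corrected: {new}")
```

Under the old wording the whole ramp (1000 to 1200 tokens) lies above `max_response_length`, outside the "still within the hard context limit" region the docs describe. Assuming responses are capped at `max_response_length`, that ramp would never apply. The corrected window (800 to 1000 tokens) ends exactly at the limit.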