Enhance model engine with max concurrency support for sync endpoints #701
Open

saeidbarati-scale wants to merge 1 commit into main from
Conversation
- Added `--max-concurrency` argument to service templates for configuring maximum concurrent requests per worker.
- Updated validation logic in `validate_concurrent_requests_per_worker` to handle sync and streaming endpoints.
- Modified `get_concurrency_limiter` to check for concurrency settings from environment variables (see the sketch after this list).
- Implemented extraction of `concurrent_requests_per_worker` from deployment configurations for autoscaling.
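The commit summary above says `get_concurrency_limiter` now reads the concurrency setting from the environment. Below is a minimal sketch of that pattern; the `ConcurrencyLimiter` class, the fallback value of 1, and the error handling are illustrative assumptions, not the repository's actual implementation.

```python
import os
from multiprocessing import BoundedSemaphore


class ConcurrencyLimiter:
    """Caps how many requests a single worker handles at once (illustrative)."""

    def __init__(self, max_concurrency: int):
        self.max_concurrency = max_concurrency
        self._semaphore = BoundedSemaphore(max_concurrency)

    def __enter__(self):
        self._semaphore.acquire()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._semaphore.release()


def get_concurrency_limiter() -> ConcurrencyLimiter:
    # Read the per-worker cap set by the service template / forwarder arguments;
    # fall back to a conservative default when the variable is absent or invalid.
    raw = os.environ.get("CONCURRENT_REQUESTS_PER_WORKER", "")
    try:
        max_concurrency = max(int(raw), 1)
    except ValueError:
        max_concurrency = 1
    return ConcurrencyLimiter(max_concurrency)
```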
This pull request introduces improvements to how the maximum concurrency per worker is configured, validated, and propagated across the model engine deployment and autoscaling logic. The main changes ensure that the concurrency setting is passed from configuration templates through to runtime, validated for autoscaling safety, and surfaced in resource reporting.
Configuration and Runtime Propagation:
- Added the `--max-concurrency` argument to the forwarder container command in `service_template_config_map.yaml`, allowing per-worker concurrency to be set via the `CONCURRENT_REQUESTS_PER_WORKER` environment variable. [1] [2] [3] [4]
- The forwarder now accepts `--max-concurrency` as a command-line argument, storing it in the environment for use in concurrency limiter setup (see the sketch after this list).
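A rough sketch of that propagation path, assuming a hypothetical forwarder entrypoint (the argument wiring and server startup here are placeholders, not the PR's code):

```python
import argparse
import os


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Model engine forwarder (illustrative)")
    parser.add_argument(
        "--max-concurrency",
        type=int,
        default=None,
        help="Maximum number of concurrent requests handled per worker.",
    )
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    if args.max_concurrency is not None:
        # Downstream code (e.g. get_concurrency_limiter) reads this variable.
        os.environ["CONCURRENT_REQUESTS_PER_WORKER"] = str(args.max_concurrency)
    # ... start the forwarder server here ...


if __name__ == "__main__":
    main()
```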
Validation Enhancements:

- Updated the `validate_concurrent_requests_per_worker` function to check that, for sync/streaming endpoints, `per_worker` is less than half of `concurrent_requests_per_worker` to prevent autoscaling issues. This validation is now called with the additional `per_worker` parameter during endpoint creation and update. [1] [2] [3] [4] A sketch of the check follows below.
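A minimal sketch of that half-of-the-cap rule, assuming a plain string `endpoint_type` and a generic `ValueError` in place of whatever domain exception the codebase actually uses:

```python
from typing import Optional


def validate_concurrent_requests_per_worker(
    concurrent_requests_per_worker: Optional[int],
    endpoint_type: str,
    per_worker: int,
) -> None:
    """Reject sync/streaming configs whose autoscaling target sits too close to the cap."""
    if concurrent_requests_per_worker is None:
        return
    if endpoint_type in ("sync", "streaming"):
        # Keep the autoscaling target (per_worker) under half of the hard per-worker
        # cap so the autoscaler can add capacity before requests start being rejected.
        if per_worker >= concurrent_requests_per_worker / 2:
            raise ValueError(
                "per_worker must be less than half of concurrent_requests_per_worker "
                "for sync/streaming endpoints to avoid autoscaling issues."
            )
```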
Resource Reporting and Autoscaling:

- Extracted the `concurrent_requests_per_worker` value from the forwarder container's command in deployment configs, making it available for autoscaling parameter calculations and resource reporting. [1] [2] [3] An illustrative extraction helper is sketched after this summary.

These changes collectively improve the robustness and transparency of concurrency configuration and autoscaling for model endpoints.
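For that extraction step, a helper along these lines could parse the value back out of a rendered forwarder container command; the command shape and the helper's name are assumptions for illustration:

```python
from typing import List, Optional


def extract_concurrent_requests_per_worker(command: List[str]) -> Optional[int]:
    """Return the value passed via --max-concurrency in a container command, if any."""
    for i, token in enumerate(command):
        if token == "--max-concurrency" and i + 1 < len(command):
            try:
                return int(command[i + 1])
            except ValueError:
                return None
        if token.startswith("--max-concurrency="):
            try:
                return int(token.split("=", 1)[1])
            except ValueError:
                return None
    return None


# Example: a forwarder command as it might appear in a rendered deployment config.
example_command = ["python", "-m", "forwarder", "--max-concurrency", "8"]
assert extract_concurrent_requests_per_worker(example_command) == 8
```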
Pull Request Summary
Addressing MLI-3412
Test Plan and Usage Guide
How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.