Enhance model engine with max concurrency support for sync endpoints #701
Open

saeidbarati-scale wants to merge 1 commit into main from
Conversation
- Added `--max-concurrency` argument to service templates for configuring maximum concurrent requests per worker.
- Updated validation logic in `validate_concurrent_requests_per_worker` to handle sync and streaming endpoints.
- Modified `get_concurrency_limiter` to check for concurrency settings from environment variables (see the sketch after this list).
- Implemented extraction of `concurrent_requests_per_worker` from deployment configurations for autoscaling.
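The commit summary above says `get_concurrency_limiter` now reads the concurrency setting from the environment. Below is a minimal sketch of that pattern; the `ConcurrencyLimiter` class, the fallback value of 1, and the error handling are illustrative assumptions, not the repository's actual implementation.

```python
import os
from multiprocessing import BoundedSemaphore


class ConcurrencyLimiter:
    """Caps how many requests a single worker handles at once (illustrative)."""

    def __init__(self, max_concurrency: int):
        self.max_concurrency = max_concurrency
        self._semaphore = BoundedSemaphore(max_concurrency)

    def __enter__(self):
        self._semaphore.acquire()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._semaphore.release()


def get_concurrency_limiter() -> ConcurrencyLimiter:
    # Read the per-worker cap set by the service template / forwarder arguments;
    # fall back to a conservative default when the variable is absent or invalid.
    raw = os.environ.get("CONCURRENT_REQUESTS_PER_WORKER", "")
    try:
        max_concurrency = max(int(raw), 1)
    except ValueError:
        max_concurrency = 1
    return ConcurrencyLimiter(max_concurrency)
```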
This pull request introduces improvements to how the maximum concurrency per worker is configured, validated, and propagated across the model engine deployment and autoscaling logic. The main changes ensure that the concurrency setting is passed from configuration templates through to runtime, validated for autoscaling safety, and surfaced in resource reporting.
Configuration and Runtime Propagation:
- Added the `--max-concurrency` argument to the forwarder container command in `service_template_config_map.yaml`, allowing per-worker concurrency to be set via the `CONCURRENT_REQUESTS_PER_WORKER` environment variable. [1] [2] [3] [4]
- The forwarder now accepts `--max-concurrency` as a command-line argument, storing it in the environment for use in concurrency limiter setup (see the sketch after this list).
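A rough sketch of that propagation path, assuming a hypothetical forwarder entrypoint (the argument wiring and server startup here are placeholders, not the PR's code):

```python
import argparse
import os


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Model engine forwarder (illustrative)")
    parser.add_argument(
        "--max-concurrency",
        type=int,
        default=None,
        help="Maximum number of concurrent requests handled per worker.",
    )
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    if args.max_concurrency is not None:
        # Downstream code (e.g. get_concurrency_limiter) reads this variable.
        os.environ["CONCURRENT_REQUESTS_PER_WORKER"] = str(args.max_concurrency)
    # ... start the forwarder server here ...


if __name__ == "__main__":
    main()
```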
Validation Enhancements:

- Updated the `validate_concurrent_requests_per_worker` function to check that, for sync/streaming endpoints, `per_worker` is less than half of `concurrent_requests_per_worker` to prevent autoscaling issues. This validation is now called with the additional `per_worker` parameter during endpoint creation and update. [1] [2] [3] [4] A sketch of the check follows below.
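A minimal sketch of that half-of-the-cap rule, assuming a plain string `endpoint_type` and a generic `ValueError` in place of whatever domain exception the codebase actually uses:

```python
from typing import Optional


def validate_concurrent_requests_per_worker(
    concurrent_requests_per_worker: Optional[int],
    endpoint_type: str,
    per_worker: int,
) -> None:
    """Reject sync/streaming configs whose autoscaling target sits too close to the cap."""
    if concurrent_requests_per_worker is None:
        return
    if endpoint_type in ("sync", "streaming"):
        # Keep the autoscaling target (per_worker) under half of the hard per-worker
        # cap so the autoscaler can add capacity before requests start being rejected.
        if per_worker >= concurrent_requests_per_worker / 2:
            raise ValueError(
                "per_worker must be less than half of concurrent_requests_per_worker "
                "for sync/streaming endpoints to avoid autoscaling issues."
            )
```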
Resource Reporting and Autoscaling:

- Extracted the `concurrent_requests_per_worker` value from the forwarder container's command in deployment configs, making it available for autoscaling parameter calculations and resource reporting. [1] [2] [3] An illustrative extraction helper is sketched after this summary.

These changes collectively improve the robustness and transparency of concurrency configuration and autoscaling for model endpoints.
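For that extraction step, a helper along these lines could parse the value back out of a rendered forwarder container command; the command shape and the helper's name are assumptions for illustration:

```python
from typing import List, Optional


def extract_concurrent_requests_per_worker(command: List[str]) -> Optional[int]:
    """Return the value passed via --max-concurrency in a container command, if any."""
    for i, token in enumerate(command):
        if token == "--max-concurrency" and i + 1 < len(command):
            try:
                return int(command[i + 1])
            except ValueError:
                return None
        if token.startswith("--max-concurrency="):
            try:
                return int(token.split("=", 1)[1])
            except ValueError:
                return None
    return None


# Example: a forwarder command as it might appear in a rendered deployment config.
example_command = ["python", "-m", "forwarder", "--max-concurrency", "8"]
assert extract_concurrent_requests_per_worker(example_command) == 8
```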
Pull Request Summary
Addressing MLI-3412
Test Plan and Usage Guide
How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.