Add Qwen3-VL-2B and Qwen2.5-VL-3B builtin optimization recipes #254
Open
hanbitmyths wants to merge 16 commits into microsoft:main from
Conversation
- Restructure from olive/ subdir to recipe root
- Export pipeline with Olive: embedding + text (ModelBuilder INT4) + vision (Dynamo → INT4)
- Add CUDA configs alongside cpu_and_mobile configs
- eval.py: AI2D benchmark evaluation vs PyTorch baseline
- Multi-image support via QwenVisionState in ort-genai runtime
- Requirements aligned: transformers>=4.57.0,<6.0, torch>=2.10.0

Results (AI2D, 100 samples, CPU):
- Qwen3-VL-2B: ONNX INT4 69% vs PyTorch FP32 74% (-5 pp), 1.45x faster
- Qwen2.5-VL-3B: ONNX INT4 83% vs PyTorch FP32 81% (+2 pp), 1.24x faster
…ripts
- Move all files (except LICENSE) under builtin/ for Qwen3-VL and Qwen2.5-VL
- Consolidate optimize.py and user_script.py to parent level with --config-dir flag
- Remove duplicate export.py, optimize.py, user_script.py from cpu_and_mobile/ and cuda/
- Update JSON configs to reference ../user_script.py
- Update READMEs with new structure and commands
…gs and eval
- Qwen3-VL-2B-Instruct: cpu_and_mobile + cuda configs, eval.py, user_script.py, modeling code
- Qwen2.5-VL-3B-Instruct: cpu_and_mobile + cuda configs, eval.py, user_script.py, modeling code
- Vision/text/embedding JSON configs for both models and both targets
Contributor
Pull request overview
This PR adds two new “builtin” optimization recipes to the repo for Qwen3-VL-2B-Instruct and Qwen2.5-VL-3B-Instruct, including Olive export/optimization/quantization pipelines (CPU/mobile + CUDA), ONNX Runtime GenAI config generation, and runnable eval/inference scripts.
Changes:
- Added a new builtin recipe for Qwen3-VL-2B-Instruct (custom exportable modeling code, Olive JSON pipelines, optimize/eval/inference scripts, docs).
- Restructured Qwen2.5-VL-3B-Instruct from the old olive/ layout into a matching builtin/ layout and added CPU/mobile + CUDA pipelines, eval, and updated modeling code for export.
- Added runtime config generation (genai_config.json patch + processor_config.json creation) for both models.
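The runtime-config generation mentioned above (patching genai_config.json and creating processor_config.json) could look roughly like the sketch below. The key names and filenames are illustrative assumptions, not taken from this PR's actual scripts:

```python
import json
from pathlib import Path


def patch_genai_config(model_dir: str) -> None:
    """Sketch: merge exported-submodel entries into genai_config.json.

    The "vision"/"embedding" keys and .onnx filenames are hypothetical.
    """
    cfg_path = Path(model_dir) / "genai_config.json"
    cfg = json.loads(cfg_path.read_text()) if cfg_path.exists() else {"model": {}}
    cfg["model"].setdefault("vision", {})["filename"] = "vision.onnx"
    cfg["model"].setdefault("embedding", {})["filename"] = "embedding.onnx"
    cfg_path.write_text(json.dumps(cfg, indent=2))


def write_processor_config(model_dir: str) -> None:
    """Sketch: emit a minimal processor_config.json next to the model."""
    out = Path(model_dir) / "processor_config.json"
    out.write_text(json.dumps({"processor": {"name": "image_processor"}}, indent=2))
```

The real optimize.py in this PR presumably derives these values from the exported Olive artifacts rather than hard-coding them.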
Reviewed changes
Copilot reviewed 32 out of 38 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Qwen-Qwen3-VL-2B-Instruct/builtin/user_script.py | Olive callbacks for loading/exporting embedding + vision submodels. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/requirements.txt | Recipe Python dependencies for Qwen3-VL builtin flow. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/optimize.py | Orchestrates Olive runs and generates ORT GenAI configs. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/info.yml | AI Toolkit metadata (keywords/EP/device/name). |
| Qwen-Qwen3-VL-2B-Instruct/builtin/inference.py | ORT GenAI inference entrypoint (text + image, interactive mode). |
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval.py | AI2D evaluation comparing ONNX vs PyTorch baseline. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cuda/vision.json | CUDA vision export + graph surgeries + FP16 flow. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cuda/text.json | CUDA text export via ModelBuilder (INT4). |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cuda/embedding.json | CUDA embedding export + ORT opt + FP16 flow. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cpu_and_mobile/vision.json | CPU/mobile vision export + graph surgeries + INT4 quantization. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cpu_and_mobile/text.json | CPU/mobile text export via ModelBuilder (INT4). |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cpu_and_mobile/embedding.json | CPU/mobile embedding export + ORT opt + INT4 quantization. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/codes/modeling_qwen3_vl.py | Custom Qwen3-VL modeling code adapted for ONNX export + Olive surgeries. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/codes/__init__.py | Package marker for custom modeling code. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/README.md | End-to-end usage docs for export, inference, and evaluation. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/.gitignore | Ignores generated models/logs/cache artifacts. |
| Qwen-Qwen2.5-VL-3B-Instruct/olive/optimize.py | Removed old Olive optimize script (superseded by builtin layout). |
| Qwen-Qwen2.5-VL-3B-Instruct/olive/embedding.json | Removed old embedding config (superseded by builtin configs). |
| Qwen-Qwen2.5-VL-3B-Instruct/olive/README.md | Removed old docs (replaced by builtin README). |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/user_script.py | Updated Olive callbacks; fixes export issues (inv_freq, IO config, dtype). |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/requirements.txt | Updated recipe Python dependencies. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/optimize.py | New builtin optimize orchestration + GenAI config generation. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/info.yml | Normalized formatting (same metadata fields). |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/inference.py | Adjusted default model output path to cpu_and_mobile layout. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/eval.py | Added AI2D evaluation comparing ONNX vs PyTorch baseline. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cuda/vision.json | CUDA vision export + surgeries + FP16 flow. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cuda/text.json | CUDA text export via ModelBuilder (INT4). |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cuda/embedding.json | CUDA embedding export + ORT opt + FP16 flow. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cpu_and_mobile/vision.json | CPU/mobile vision export + ORT opt + INT4 quantization. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cpu_and_mobile/text.json | CPU/mobile text export switched to INT4 + new output path. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cpu_and_mobile/embedding.json | CPU/mobile embedding export + ORT opt + INT4 quantization. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/codes/modeling_qwen2_5_vl.py | Export-compat fixes (rope param compatibility, buffers persistent, window index rewrite). |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/codes/__init__.py | Package marker for custom modeling code. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/README.md | End-to-end usage docs for export, inference, and evaluation. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/.gitignore | Ignores generated models/logs/cache artifacts. |
Comments suppressed due to low confidence (1)
Qwen-Qwen2.5-VL-3B-Instruct/builtin/requirements.txt:3
`torch>=2.10.0` is very likely a typo (PyTorch versions in this repo are pinned around 2.6–2.7). As written, this requirement can make the recipe environment impossible to resolve. Consider aligning to the repo's supported torch range (e.g., `torch>=2.6,<2.9`), or pin to the minimum version you actually validated for torch.export/torch.cond support.
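For context on why `2.10` is suspicious: version components compare numerically, not lexically, so 2.10 would actually be *newer* than 2.9 even though it sorts before it as a string. A small illustrative check (not part of the recipe):

```python
def parse(version: str) -> tuple:
    """Split a dotted version string into an integer tuple for comparison."""
    return tuple(int(part) for part in version.split("."))

# Numerically, 2.10.0 is newer than 2.9 ...
print(parse("2.10.0") > parse("2.9"))     # True
# ... even though it sorts before it as a plain string.
print("2.10.0" < "2.9")                   # True

# So torch>=2.10.0 excludes every currently released torch 2.6-2.8.
print(parse("2.7.1") >= parse("2.10.0"))  # False
```

If the intent was "2.1.0 or newer", the constraint should be written accordingly; real resolvers (pip/PEP 440) use the numeric semantics shown here.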
- ONNX INT4 CPU: 83.00% accuracy, 7.13 s/sample; PyTorch FP32: 83.00% accuracy, 10.09 s/sample; Delta: 0 pp, Speedup: 1.41x
- ONNX INT4 CPU: 85.00% accuracy, 11.85 s/sample; PyTorch FP32: 90.00% accuracy, 16.52 s/sample; Delta: -5 pp, Speedup: 1.39x
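The deltas and speedups quoted above follow directly from the per-run numbers; a quick sanity check (values copied from the results, helper names are mine):

```python
def delta_pp(onnx_acc: float, torch_acc: float) -> float:
    """Accuracy difference in percentage points (ONNX minus PyTorch)."""
    return onnx_acc - torch_acc


def speedup(torch_s_per_sample: float, onnx_s_per_sample: float) -> float:
    """Wall-clock speedup of ONNX over the PyTorch baseline."""
    return torch_s_per_sample / onnx_s_per_sample


# First run: 83.00% vs 83.00%, 7.13 s vs 10.09 s per sample
print(delta_pp(83.0, 83.0))        # 0.0 pp
print(speedup(10.09, 7.13))        # ~1.41x

# Second run: 85.00% vs 90.00%, 11.85 s vs 16.52 s per sample
print(delta_pp(85.0, 90.0))        # -5.0 pp
print(speedup(16.52, 11.85))       # ~1.39x
```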
This PR adds end-to-end optimization recipes for Qwen3-VL-2B-Instruct and Qwen2.5-VL-3B-Instruct vision-language models, with CPU/mobile and CUDA configurations, evaluation scripts, and inference examples.
This PR depends on an Olive PR.