Add Qwen3-VL-2B and Qwen2.5-VL-3B builtin optimization recipes #254
Open
hanbitmyths wants to merge 16 commits into microsoft:main from
Conversation
- Restructure from olive/ subdir to recipe root
- Export pipeline with Olive: embedding + text (ModelBuilder INT4) + vision (Dynamo → INT4)
- Add CUDA configs alongside cpu_and_mobile configs
- eval.py: AI2D benchmark evaluation vs PyTorch baseline
- Multi-image support via QwenVisionState in ort-genai runtime
- Requirements aligned: transformers>=4.57.0,<6.0, torch>=2.10.0

Results (AI2D, 100 samples, CPU):
- Qwen3-VL-2B: ONNX INT4 69% vs PyTorch FP32 74% (-5 pp), 1.45x faster
- Qwen2.5-VL-3B: ONNX INT4 83% vs PyTorch FP32 81% (+2 pp), 1.24x faster
…ripts
- Move all files (except LICENSE) under builtin/ for Qwen3-VL and Qwen2.5-VL
- Consolidate optimize.py and user_script.py to parent level with --config-dir flag
- Remove duplicate export.py, optimize.py, user_script.py from cpu_and_mobile/ and cuda/
- Update JSON configs to reference ../user_script.py
- Update READMEs with new structure and commands
…gs and eval
- Qwen3-VL-2B-Instruct: cpu_and_mobile + cuda configs, eval.py, user_script.py, modeling code
- Qwen2.5-VL-3B-Instruct: cpu_and_mobile + cuda configs, eval.py, user_script.py, modeling code
- Vision/text/embedding JSON configs for both models and both targets
Contributor
Pull request overview
This PR adds two new “builtin” optimization recipes to the repo for Qwen3-VL-2B-Instruct and Qwen2.5-VL-3B-Instruct, including Olive export/optimization/quantization pipelines (CPU/mobile + CUDA), ONNX Runtime GenAI config generation, and runnable eval/inference scripts.
Changes:
- Added a new builtin recipe for Qwen3-VL-2B-Instruct (custom exportable modeling code, Olive JSON pipelines, optimize/eval/inference scripts, docs).
- Restructured Qwen2.5-VL-3B-Instruct from the old olive/ layout into a matching builtin/ layout and added CPU/mobile + CUDA pipelines, eval, and updated modeling code for export.
- Added runtime config generation (genai_config.json patch + processor_config.json creation) for both models.
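The runtime-config generation mentioned above (patching genai_config.json and creating processor_config.json) could look roughly like the sketch below. The key names and filenames are illustrative assumptions, not taken from this PR's actual scripts:

```python
import json
from pathlib import Path


def patch_genai_config(model_dir: str) -> None:
    """Sketch: merge exported-submodel entries into genai_config.json.

    The "vision"/"embedding" keys and .onnx filenames are hypothetical.
    """
    cfg_path = Path(model_dir) / "genai_config.json"
    cfg = json.loads(cfg_path.read_text()) if cfg_path.exists() else {"model": {}}
    cfg["model"].setdefault("vision", {})["filename"] = "vision.onnx"
    cfg["model"].setdefault("embedding", {})["filename"] = "embedding.onnx"
    cfg_path.write_text(json.dumps(cfg, indent=2))


def write_processor_config(model_dir: str) -> None:
    """Sketch: emit a minimal processor_config.json next to the model."""
    out = Path(model_dir) / "processor_config.json"
    out.write_text(json.dumps({"processor": {"name": "image_processor"}}, indent=2))
```

The real optimize.py in this PR presumably derives these values from the exported Olive artifacts rather than hard-coding them.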
Reviewed changes
Copilot reviewed 32 out of 38 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Qwen-Qwen3-VL-2B-Instruct/builtin/user_script.py | Olive callbacks for loading/exporting embedding + vision submodels. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/requirements.txt | Recipe Python dependencies for Qwen3-VL builtin flow. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/optimize.py | Orchestrates Olive runs and generates ORT GenAI configs. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/info.yml | AI Toolkit metadata (keywords/EP/device/name). |
| Qwen-Qwen3-VL-2B-Instruct/builtin/inference.py | ORT GenAI inference entrypoint (text + image, interactive mode). |
| Qwen-Qwen3-VL-2B-Instruct/builtin/eval.py | AI2D evaluation comparing ONNX vs PyTorch baseline. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cuda/vision.json | CUDA vision export + graph surgeries + FP16 flow. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cuda/text.json | CUDA text export via ModelBuilder (INT4). |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cuda/embedding.json | CUDA embedding export + ORT opt + FP16 flow. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cpu_and_mobile/vision.json | CPU/mobile vision export + graph surgeries + INT4 quantization. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cpu_and_mobile/text.json | CPU/mobile text export via ModelBuilder (INT4). |
| Qwen-Qwen3-VL-2B-Instruct/builtin/cpu_and_mobile/embedding.json | CPU/mobile embedding export + ORT opt + INT4 quantization. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/codes/modeling_qwen3_vl.py | Custom Qwen3-VL modeling code adapted for ONNX export + Olive surgeries. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/codes/__init__.py | Package marker for custom modeling code. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/README.md | End-to-end usage docs for export, inference, and evaluation. |
| Qwen-Qwen3-VL-2B-Instruct/builtin/.gitignore | Ignores generated models/logs/cache artifacts. |
| Qwen-Qwen2.5-VL-3B-Instruct/olive/optimize.py | Removed old Olive optimize script (superseded by builtin layout). |
| Qwen-Qwen2.5-VL-3B-Instruct/olive/embedding.json | Removed old embedding config (superseded by builtin configs). |
| Qwen-Qwen2.5-VL-3B-Instruct/olive/README.md | Removed old docs (replaced by builtin README). |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/user_script.py | Updated Olive callbacks; fixes export issues (inv_freq, IO config, dtype). |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/requirements.txt | Updated recipe Python dependencies. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/optimize.py | New builtin optimize orchestration + GenAI config generation. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/info.yml | Normalized formatting (same metadata fields). |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/inference.py | Adjusted default model output path to cpu_and_mobile layout. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/eval.py | Added AI2D evaluation comparing ONNX vs PyTorch baseline. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cuda/vision.json | CUDA vision export + surgeries + FP16 flow. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cuda/text.json | CUDA text export via ModelBuilder (INT4). |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cuda/embedding.json | CUDA embedding export + ORT opt + FP16 flow. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cpu_and_mobile/vision.json | CPU/mobile vision export + ORT opt + INT4 quantization. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cpu_and_mobile/text.json | CPU/mobile text export switched to INT4 + new output path. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/cpu_and_mobile/embedding.json | CPU/mobile embedding export + ORT opt + INT4 quantization. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/codes/modeling_qwen2_5_vl.py | Export-compat fixes (rope param compatibility, buffers persistent, window index rewrite). |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/codes/__init__.py | Package marker for custom modeling code. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/README.md | End-to-end usage docs for export, inference, and evaluation. |
| Qwen-Qwen2.5-VL-3B-Instruct/builtin/.gitignore | Ignores generated models/logs/cache artifacts. |
Comments suppressed due to low confidence (1)
Qwen-Qwen2.5-VL-3B-Instruct/builtin/requirements.txt:3
`torch>=2.10.0` is very likely a typo (PyTorch versions in this repo are pinned around 2.6–2.7). As written, this requirement can make the recipe environment impossible to resolve. Consider aligning to the repo's supported torch range (e.g., `torch>=2.6,<2.9`), or pin to the minimum version you actually validated for torch.export/torch.cond support.
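For context on why `2.10` is suspicious: version components compare numerically, not lexically, so 2.10 would actually be *newer* than 2.9 even though it sorts before it as a string. A small illustrative check (not part of the recipe):

```python
def parse(version: str) -> tuple:
    """Split a dotted version string into an integer tuple for comparison."""
    return tuple(int(part) for part in version.split("."))

# Numerically, 2.10.0 is newer than 2.9 ...
print(parse("2.10.0") > parse("2.9"))     # True
# ... even though it sorts before it as a plain string.
print("2.10.0" < "2.9")                   # True

# So torch>=2.10.0 excludes every currently released torch 2.6-2.8.
print(parse("2.7.1") >= parse("2.10.0"))  # False
```

If the intent was "2.1.0 or newer", the constraint should be written accordingly; real resolvers (pip/PEP 440) use the numeric semantics shown here.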
- ONNX INT4 CPU: 83.00% accuracy, 7.13 s/sample; PyTorch FP32: 83.00% accuracy, 10.09 s/sample; Delta: 0 pp, Speedup: 1.41x
- ONNX INT4 CPU: 85.00% accuracy, 11.85 s/sample; PyTorch FP32: 90.00% accuracy, 16.52 s/sample; Delta: -5 pp, Speedup: 1.39x
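The deltas and speedups quoted above follow directly from the per-run numbers; a quick sanity check (values copied from the results, helper names are mine):

```python
def delta_pp(onnx_acc: float, torch_acc: float) -> float:
    """Accuracy difference in percentage points (ONNX minus PyTorch)."""
    return onnx_acc - torch_acc


def speedup(torch_s_per_sample: float, onnx_s_per_sample: float) -> float:
    """Wall-clock speedup of ONNX over the PyTorch baseline."""
    return torch_s_per_sample / onnx_s_per_sample


# First run: 83.00% vs 83.00%, 7.13 s vs 10.09 s per sample
print(delta_pp(83.0, 83.0))        # 0.0 pp
print(speedup(10.09, 7.13))        # ~1.41x

# Second run: 85.00% vs 90.00%, 11.85 s vs 16.52 s per sample
print(delta_pp(85.0, 90.0))        # -5.0 pp
print(speedup(16.52, 11.85))       # ~1.39x
```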
This PR adds end-to-end optimization recipes for Qwen3-VL-2B-Instruct and Qwen2.5-VL-3B-Instruct vision-language models, with CPU/mobile and CUDA configurations, evaluation scripts, and inference examples.
This PR depends on an Olive PR.