Add static per block MSE for NVFP4 weight #613
Merged
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main     #613      +/-   ##
==========================================
+ Coverage   74.57%   74.64%   +0.07%
==========================================
  Files         183      192       +9
  Lines       18412    19027     +615
==========================================
+ Hits        13730    14202     +472
- Misses       4682     4825     +143
```
realAsma
reviewed
Dec 3, 2025
…nce; quant scale to FP8; rename static kernel Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
cjluo-nv
reviewed
Jan 2, 2026
realAsma
reviewed
Jan 8, 2026
…nel launch func Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
realAsma
reviewed
Jan 10, 2026
mxinO
reviewed
Jan 10, 2026
realAsma
reviewed
Jan 12, 2026
…calibrate Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
danielkorzekwa
pushed a commit
that referenced
this pull request
Feb 17, 2026
## What does this PR do?
**Type of change:** new feature

**Overview:**
Support static block-wise MSE calibration for NVFP4 weight quantization.
Add an FP4 Triton kernel that takes per-block scales as input and also quantizes those scales to FP8.
This PR does the following:
1. Enable a static NVFP4 implementation, i.e. block scales for weights are calculated during calibration and fed into the fake-quant kernels.
2. Extend `mse_calibrate` to support static NVFP4, searching block scales by MSE with the global scale set to MAX (a rough sketch of this search follows below).
3. Refinement: calibrate weight quantizers only once during MSE calibration.
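For illustration only, here is a minimal sketch of the idea behind items 1 and 2, assuming an E2M1 level table, an E4M3 encoding for the block scales, and hypothetical helper names; the actual Triton kernel and `mse_calibrate` implementation differ in detail:

```python
# Hypothetical sketch of the static per-block MSE scale search (not the real kernel).
import torch

FP4_LEVELS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # |E2M1| magnitudes

def fake_quant_fp4(block: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Fake-quantize one 16-element block onto the FP4 (E2M1) grid with a per-block scale."""
    scaled = block / scale
    idx = (scaled.abs().unsqueeze(-1) - FP4_LEVELS).abs().argmin(dim=-1)
    return torch.sign(scaled) * FP4_LEVELS[idx] * scale

def search_block_scale(block, global_amax, start=0.25, stop=2.0, step=0.25):
    """Sweep candidate scales (multiplier * block amax) and keep the one with minimum MSE.

    The chosen per-block scale is stored in FP8 (E4M3) relative to a global scale
    derived from the tensor-wide MAX ("global scale set as MAX"); the exact encoding
    below is an assumption for illustration.
    """
    global_scale = global_amax / (6.0 * 448.0)  # 6.0 = FP4 max magnitude, 448 = E4M3 max
    amax = block.abs().max().clamp_min(1e-12)
    best_scale, best_err = None, float("inf")
    for m in torch.arange(start, stop + 1e-6, step):
        cand = ((m * amax / 6.0) / global_scale).to(torch.float8_e4m3fn)  # FP8-quantized block scale
        scale = (cand.float() * global_scale).clamp_min(1e-12)            # decode back for fake quant
        err = ((fake_quant_fp4(block, scale) - block) ** 2).mean()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```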
## Usage
Example config:
```python
NVFP4_WEIGHT_MSE_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "static", "scale_bits": (4, 3)},
            "axis": None,
            "enable": True,
        },
        "*input_quantizer": {
            "enable": False,
        },
        **_default_disabled_quantizer_cfg,
    },
    "algorithm": {
        "method": "mse",
        "step_size": 0.25,
        "start_multiplier": 0.25,
        "stop_multiplier": 2.0,
    },
}

NVFP4_WEIGHT_ACT_MSE_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "static", "scale_bits": (4, 3)},
            "axis": None,
            "enable": True,
        },
        "*input_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "axis": None,
            "enable": True,
        },
        **_default_disabled_quantizer_cfg,
    },
    "algorithm": {
        "method": "mse",
        "step_size": 0.25,
        "start_multiplier": 0.25,
        "stop_multiplier": 2.0,
    },
}
```
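A minimal usage sketch, assuming the standard ModelOpt quantize entry point (`modelopt.torch.quantization.quantize`); the model and `calib_dataloader` are placeholders:

```python
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Feed a few calibration batches; MSE calibration searches the static
    # per-block weight scales during these forward passes.
    for batch in calib_dataloader:  # placeholder dataloader
        model(**batch)

model = mtq.quantize(model, NVFP4_WEIGHT_MSE_CFG, forward_loop)
```

With `start_multiplier=0.25`, `stop_multiplier=2.0`, and `step_size=0.25`, each weight block's scale is picked from candidates between 0.25x and 2.0x of its amax, keeping the candidate with the lowest quantization MSE.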
## Testing
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No
## Additional Information
---------
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
danielkorzekwa
pushed a commit
that referenced
this pull request
Mar 4, 2026