
Merge mbridge distillation for any_model#1036

Merged
danielkorzekwa merged 101 commits into feature/puzzletron from dkorzekwa/anymodel_mbridgedist
Mar 20, 2026

Conversation

@danielkorzekwa
Contributor

@danielkorzekwa danielkorzekwa commented Mar 13, 2026

What does this PR do?

Merge anymodel mbridge distillation

Summary by CodeRabbit

  • New Features

    • Heterogeneous-layer support for Puzzletron AnyModel → Megatron-Core conversions
    • Knowledge-distillation workflow support and tooling
    • Export of trained models to HuggingFace format (new export helper)
    • Bridge support extended for Llama and Qwen3 architectures
    • New CLI options to control HuggingFace export destination and template model
  • Tests

    • Added integration test validating the distillation + export flow
    • Removed an obsolete trivial test
  • Documentation

    • Updated README to document HF export flags
  • Chores

    • CI: switched example-tests workflow to use NeMo and updated pip invocation

- Add converter, model_descriptor, puzzformer, and llama model support
- Selective merge of anymodel functionality

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
…s merged)

@danielkorzekwa danielkorzekwa requested a review from a team as a code owner March 18, 2026 13:16
@danielkorzekwa danielkorzekwa requested review from Edwardf0t1 and removed request for a team March 18, 2026 13:16
@kevalmorabia97 kevalmorabia97 requested review from kevalmorabia97 and removed request for Edwardf0t1 March 19, 2026 12:01
Base automatically changed from dkorzekwa/anymodel_tutorial to feature/puzzletron March 19, 2026 12:56
@codecov

codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.10%. Comparing base (1b42f0b) to head (a5b715f).
⚠️ Report is 3 commits behind head on feature/puzzletron.

Additional details and impacted files
@@                  Coverage Diff                   @@
##           feature/puzzletron    #1036      +/-   ##
======================================================
- Coverage               72.12%   72.10%   -0.03%     
======================================================
  Files                     209      209              
  Lines                   23628    23628              
======================================================
- Hits                    17042    17036       -6     
- Misses                   6586     6592       +6     

@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner March 19, 2026 20:38
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (1)
.github/workflows/example_tests.yml (1)

70-70: Pin the NeMo container image by digest for reproducibility. NVIDIA's version tags (such as 26.02) are mutable and may be updated to point to newer patches, introducing drift in CI. Use an immutable digest instead: nvcr.io/nvidia/nemo:26.02@sha256:<digest>. To obtain the digest, run docker manifest inspect nvcr.io/nvidia/nemo:26.02 (or use ORAS; see NVIDIA NGC docs) and pin the returned sha256 hash.

Suggested change
-      docker_image: "nvcr.io/nvidia/nemo:26.02"
+      docker_image: "nvcr.io/nvidia/nemo:26.02@sha256:<digest>"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/example_tests.yml at line 70, The docker_image value
currently uses a mutable tag ("nvcr.io/nvidia/nemo:26.02"); replace it with an
immutable image reference pinned by digest (format:
"nvcr.io/nvidia/nemo:26.02@sha256:<digest>"). Run docker manifest inspect
nvcr.io/nvidia/nemo:26.02 (or ORAS/ngc) to obtain the sha256 digest and update
the docker_image entry to include that digest so CI uses a reproducible,
immutable container.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/puzzletron/export/mbridge/base.py`:
- Around line 47-50: The parent provider's transformer_layer_spec is being
copied back into provider_kwargs and overwrites the heterogeneous_layer_spec set
on GenericHeterogeneousProvider; update the provider construction logic (the
code that builds provider_kwargs around provider_kwargs / provider_kwargs update
in the factory method that handles parent providers) so it does not copy or
override transformer_layer_spec from the parent into provider_kwargs if
GenericHeterogeneousProvider.transformer_layer_spec was explicitly set to
heterogeneous_layer_spec (i.e., only inherit transformer_layer_spec when the
child has none), and ensure heterogeneous_layers_config_path /
heterogeneous_layers_config_encoded_json are likewise preserved on the returned
provider instead of being clobbered by the parent.

In `@modelopt/torch/puzzletron/export/mbridge/distillation_provider.py`:
- Around line 162-166: The __setattr__ currently mirrors every matching
attribute to self.teacher which can accidentally overwrite static config fields;
restrict mirroring to a small explicit whitelist of runtime-only fields (e.g.,
define RUNTIME_MIRRORED_FIELDS = {...} at class scope) and change __setattr__ to
only call setattr(self.teacher, name, value) when name is in that whitelist and
getattr(self, "teacher", None) is not None and hasattr(self.teacher, name);
update the whitelist to include only the fields that must stay aligned at
runtime (and exclude config keys like num_layers, hidden_size,
heterogeneous_layers_config_encoded_json).
- Around line 183-188: Don't mutate DistillationProvider.__bases__ globally;
instead create a per-instance dynamic subclass combining the existing
student_provider class and DistillationProvider and assign that new type to
student_provider.__class__ so only this instance's MRO changes. Use the existing
per-instance hook (_super_class) pattern: build a subclass whose bases are
(student_provider.__class__, DistillationProvider) (or vice versa as required
for correct MRO), set student_provider.__class__ to that new subclass, then set
student_provider.teacher, student_provider.kd_config and call
student_provider.__post_init__ as before.

In `@modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py`:
- Around line 63-74: The export currently hardcodes strict=False when calling
bridge.export_ckpt in export_mbridge_to_hf.py which disables missing-tensor
checks for all exports; change the export_mbridge_to_hf entrypoint (e.g., the
function that wraps this logic) to accept a strict: bool parameter defaulting to
True and use that parameter in the bridge.export_ckpt call (replace strict=False
with strict=strict); update the test path (test_distill_hf.py or its caller) to
call this function with strict=False while leaving production callers to use the
default True.
- Around line 25-31: The export_to_hf_and_copy_config function currently
hardcodes trust_remote_code=True when loading the HF model; add a
caller-configurable parameter trust_remote_code: bool = False to
export_to_hf_and_copy_config and use that parameter in place of the hardcoded
True when invoking the HF model loader (the call that currently passes
trust_remote_code=True for hf_model). Update the function signature and any
internal call sites inside export_to_hf_and_copy_config (and propagate the
parameter to any helper functions if they perform the load) so the default
behavior is secure (False) but callers can opt in by passing True.

In `@tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py`:
- Around line 17-23: Add a skip when no CUDA is available by importing pytest
and setting a module-level pytestmark =
pytest.mark.skipif(torch.cuda.device_count() == 0, reason="requires CUDA") in
tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py (so the test
won't call torchrun with nproc_per_node=0); add the import for pytest near the
other imports and apply the same guard to the other test block referenced (lines
44-50) or keep it module-wide so all tests in this file are skipped when
torch.cuda.device_count() == 0.
- Around line 100-111: The function _prepare_student_and_teacher_models
currently annotates its return type as tuple[str, str] but actually returns Path
objects (e.g., tmp_path / "student_anymodel" and tmp_path / "teacher_anymodel");
change the signature to return tuple[Path, Path], update the docstring return
description to say it returns Tuple of (student_hf_path, teacher_hf_path) as
Path objects, and ensure any type hints/imports reference pathlib.Path if
needed.
- Around line 87-91: The AutoModelForCausalLM.from_pretrained call is passing
trust_remote_code=True; remove that argument so the call uses the safe default
(trust_remote_code=False) when loading the exported Llama model locally (keep
local_files_only=True and the hf_export_dir argument intact); update the
invocation of AutoModelForCausalLM.from_pretrained (the variable "model" and the
"hf_export_dir" usage) to omit trust_remote_code.

---

Nitpick comments:
In @.github/workflows/example_tests.yml:
- Line 70: The docker_image value currently uses a mutable tag
("nvcr.io/nvidia/nemo:26.02"); replace it with an immutable image reference
pinned by digest (format: "nvcr.io/nvidia/nemo:26.02@sha256:<digest>"). Run
docker manifest inspect nvcr.io/nvidia/nemo:26.02 (or ORAS/ngc) to obtain the
sha256 digest and update the docker_image entry to include that digest so CI
uses a reproducible, immutable container.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7e2d0840-b540-4de4-8cd7-f4504db455ed

📥 Commits

Reviewing files that changed from the base of the PR and between 660dc17 and 3a15188ef7376972734363954ec3bcf9ecb64348.

📒 Files selected for processing (9)
  • .github/workflows/example_tests.yml
  • modelopt/torch/puzzletron/export/mbridge/__init__.py
  • modelopt/torch/puzzletron/export/mbridge/base.py
  • modelopt/torch/puzzletron/export/mbridge/distillation_provider.py
  • modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py
  • modelopt/torch/puzzletron/export/mbridge/llama.py
  • modelopt/torch/puzzletron/export/mbridge/qwen3.py
  • tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py
  • tests/examples/puzzletron/test_dummy.py
💤 Files with no reviewable changes (1)
  • tests/examples/puzzletron/test_dummy.py

Comment on lines +47 to +50
# Heterogeneous configuration fields
heterogeneous_layers_config_path: str | None = None
heterogeneous_layers_config_encoded_json: str = ""
transformer_layer_spec: ModuleSpec | Callable = heterogeneous_layer_spec
Contributor


⚠️ Potential issue | 🟠 Major

Don't let the parent provider overwrite the heterogeneous layer spec.

Line 50 sets GenericHeterogeneousProvider.transformer_layer_spec to heterogeneous_layer_spec, but Lines 93-113 copy the parent's transformer_layer_spec straight back into provider_kwargs. That means the returned provider can still build the vanilla layer layout instead of the heterogeneous one.

Proposed fix
-        provider_kwargs = dataclasses.asdict(parent_provider)
+        provider_kwargs = {
+            field.name: getattr(parent_provider, field.name)
+            for field in fields(parent_provider)
+            if field.init
+        }
@@
         # Only keep kwargs that are valid fields
         provider_kwargs = {k: v for k, v in provider_kwargs.items() if k in valid_fields}
+        provider_kwargs["transformer_layer_spec"] = heterogeneous_layer_spec
 
         provider_kwargs["heterogeneous_layers_config_encoded_json"] = (
             self._build_heterogeneous_config_json(hf_pretrained.config)
         )

Also applies to: 91-113

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/export/mbridge/base.py` around lines 47 - 50, The
parent provider's transformer_layer_spec is being copied back into
provider_kwargs and overwrites the heterogeneous_layer_spec set on
GenericHeterogeneousProvider; update the provider construction logic (the code
that builds provider_kwargs around provider_kwargs / provider_kwargs update in
the factory method that handles parent providers) so it does not copy or
override transformer_layer_spec from the parent into provider_kwargs if
GenericHeterogeneousProvider.transformer_layer_spec was explicitly set to
heterogeneous_layer_spec (i.e., only inherit transformer_layer_spec when the
child has none), and ensure heterogeneous_layers_config_path /
heterogeneous_layers_config_encoded_json are likewise preserved on the returned
provider instead of being clobbered by the parent.
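The non-clobbering inheritance the prompt describes can be sketched with plain dataclasses; the provider classes and field names below are simplified stand-ins for the real Megatron providers, not the actual API:

```python
from dataclasses import dataclass, fields

@dataclass
class ParentProvider:  # stand-in for the HF-derived parent provider
    num_layers: int = 32
    transformer_layer_spec: str = "vanilla_spec"

@dataclass
class HeterogeneousProvider:  # stand-in for GenericHeterogeneousProvider
    num_layers: int = 0
    transformer_layer_spec: str = "heterogeneous_spec"
    heterogeneous_layers_config_encoded_json: str = ""

def build_child(parent: ParentProvider) -> HeterogeneousProvider:
    # Copy only init-able parent fields that the child also declares...
    kwargs = {f.name: getattr(parent, f.name) for f in fields(parent) if f.init}
    valid = {f.name for f in fields(HeterogeneousProvider)}
    kwargs = {k: v for k, v in kwargs.items() if k in valid}
    # ...then re-assert the child's explicit spec so the parent cannot clobber it.
    kwargs["transformer_layer_spec"] = "heterogeneous_spec"
    return HeterogeneousProvider(**kwargs)

child = build_child(ParentProvider())
assert child.transformer_layer_spec == "heterogeneous_spec"
assert child.num_layers == 32  # other parent fields are still inherited
```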

Comment on lines +162 to +166
def __setattr__(self, name, value):
    super().__setattr__(name, value)
    # Mirror to teacher if it has that attribute
    if hasattr(self.teacher, name):
        setattr(self.teacher, name, value)
Contributor


⚠️ Potential issue | 🟠 Major

Mirror only the runtime fields that truly must stay aligned.

Line 165 currently forwards every matching student attribute to the teacher. In an asymmetric distillation setup, any later write to a common field name like num_layers, hidden_size, or heterogeneous_layers_config_encoded_json will silently rewrite the teacher config too.

Proposed fix
+_MIRRORED_TEACHER_ATTRS = {
+    "tensor_model_parallel_size",
+    "pipeline_model_parallel_size",
+    "context_parallel_size",
+    "seq_length",
+    "pipeline_dtype",
+    "cross_entropy_fusion_impl",
+}
+
@@
     def __setattr__(self, name, value):
         super().__setattr__(name, value)
-        # Mirror to teacher if it has that attribute
-        if hasattr(self.teacher, name):
-            setattr(self.teacher, name, value)
+        teacher = getattr(self, "teacher", None)
+        if (
+            name in _MIRRORED_TEACHER_ATTRS
+            and teacher is not None
+            and teacher is not self
+            and hasattr(teacher, name)
+        ):
+            setattr(teacher, name, value)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/export/mbridge/distillation_provider.py` around
lines 162 - 166, The __setattr__ currently mirrors every matching attribute to
self.teacher which can accidentally overwrite static config fields; restrict
mirroring to a small explicit whitelist of runtime-only fields (e.g., define
RUNTIME_MIRRORED_FIELDS = {...} at class scope) and change __setattr__ to only
call setattr(self.teacher, name, value) when name is in that whitelist and
getattr(self, "teacher", None) is not None and hasattr(self.teacher, name);
update the whitelist to include only the fields that must stay aligned at
runtime (and exclude config keys like num_layers, hidden_size,
heterogeneous_layers_config_encoded_json).
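The whitelist approach in the prompt can be sketched independently of the real provider classes; the field names here are illustrative examples, not the actual Megatron config keys:

```python
# Runtime fields that must stay aligned between student and teacher.
_MIRRORED = {"seq_length"}

class Teacher:
    def __init__(self):
        self.seq_length = 2048
        self.num_layers = 32  # static config: must NOT be mirrored

class Student:
    def __init__(self, teacher):
        # Bypass our own hook while wiring up the teacher reference.
        object.__setattr__(self, "teacher", teacher)
        self.seq_length = 1024
        self.num_layers = 16

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        teacher = getattr(self, "teacher", None)
        if name in _MIRRORED and teacher is not None and hasattr(teacher, name):
            setattr(teacher, name, value)

t = Teacher()
s = Student(t)
s.seq_length = 4096   # whitelisted: teacher follows
s.num_layers = 8      # not whitelisted: teacher config untouched
assert t.seq_length == 4096
assert t.num_layers == 32
```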

Comment on lines +183 to +188
DistillationProvider.__bases__ = (type(student_provider),)
student_provider.__class__ = DistillationProvider

student_provider.teacher = teacher_provider
student_provider.kd_config = kd_config
student_provider.__post_init__()
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

The analysis traced distillation_provider.py, its callers in examples/puzzletron/mbridge_distillation/distill_hf.py, and the test suite, then confirmed the hazard with a small script: reassigning a class's __bases__ recalculates the MRO for every existing instance, not just future ones.

class Base1:
    def method(self): return "Base1"

class Base2:
    def method(self): return "Base2"

class Dynamic(Base1):
    pass

obj1 = Dynamic()
print(obj1.method())  # prints "Base1"

Dynamic.__bases__ = (Base2,)
print(obj1.method())  # prints "Base2": existing instances are affected

Avoid mutating DistillationProvider.__bases__ globally.

Line 183 changes the class hierarchy for every existing and future DistillationProvider instance. A second call with a different provider type (e.g., Mamba after GPT) would overwrite __bases__, breaking MRO and super() behavior for objects already in use. The design already stores _super_class per instance (line 70), making this global mutation unnecessary. Replace with a per-provider dynamic subclass or wrapper instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/export/mbridge/distillation_provider.py` around
lines 183 - 188, Don't mutate DistillationProvider.__bases__ globally; instead
create a per-instance dynamic subclass combining the existing student_provider
class and DistillationProvider and assign that new type to
student_provider.__class__ so only this instance's MRO changes. Use the existing
per-instance hook (_super_class) pattern: build a subclass whose bases are
(student_provider.__class__, DistillationProvider) (or vice versa as required
for correct MRO), set student_provider.__class__ to that new subclass, then set
student_provider.teacher, student_provider.kd_config and call
student_provider.__post_init__ as before.
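A minimal sketch of the per-instance dynamic subclass the prompt suggests; these are toy classes standing in for DistillationProvider and the real Megatron providers:

```python
class GPTProvider:
    def describe(self):
        return "gpt"

class MambaProvider:
    def describe(self):
        return "mamba"

class DistillationMixin:  # stand-in for DistillationProvider
    def describe(self):
        return "distill+" + super().describe()

def convert(provider):
    # Build a one-off subclass so only this instance's MRO changes;
    # no shared class object is mutated.
    cls = provider.__class__
    dynamic = type(f"Distillation{cls.__name__}", (DistillationMixin, cls), {})
    provider.__class__ = dynamic
    return provider

a = convert(GPTProvider())
b = convert(MambaProvider())  # a second call with a different type is safe
assert a.describe() == "distill+gpt"
assert b.describe() == "distill+mamba"
assert GPTProvider().describe() == "gpt"  # untouched instances keep their MRO
```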

Comment on lines +63 to +74
print_rank_0("📤 Exporting to HuggingFace format...")
# Use strict=False for test_distill_hf.py which uses small models (2 layers) with fewer layers
# than the template model (32 layers). This allows partial exports when some tensors are missing.
# Note: This is NOT needed when running on real compressed puzzletron student models,
# which have the same number of layers as the template model (some may be skipped via no_op
# in block_configs, but all layer tensors are still present in the checkpoint).
bridge.export_ckpt(
    megatron_path=megatron_path,
    hf_path=hf_export_path,
    show_progress=True,
    strict=False,  # Needed for test_distill_hf.py small models; not needed for real compressed models
)
Contributor


⚠️ Potential issue | 🟠 Major

Don't bake the test-only strict=False into every export.

The inline comment says this relaxation is only needed for test_distill_hf.py, but the helper now suppresses missing/mismatched tensor errors for production exports too. Please make strict a caller-controlled argument defaulting to True, and pass False only from the small-model test path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py` around
lines 63 - 74, The export currently hardcodes strict=False when calling
bridge.export_ckpt in export_mbridge_to_hf.py which disables missing-tensor
checks for all exports; change the export_mbridge_to_hf entrypoint (e.g., the
function that wraps this logic) to accept a strict: bool parameter defaulting to
True and use that parameter in the bridge.export_ckpt call (replace strict=False
with strict=strict); update the test path (test_distill_hf.py or its caller) to
call this function with strict=False while leaving production callers to use the
default True.
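The caller-controlled flag can be sketched with a thin wrapper; export_checkpoint and fake_export_ckpt are hypothetical names standing in for the real entrypoint around bridge.export_ckpt:

```python
def export_checkpoint(export_fn, megatron_path, hf_path, strict=True):
    # strict defaults to True so production exports keep missing-tensor checks;
    # only the small-model test path should pass strict=False.
    export_fn(megatron_path=megatron_path, hf_path=hf_path, strict=strict)

calls = []
def fake_export_ckpt(**kwargs):
    calls.append(kwargs)

export_checkpoint(fake_export_ckpt, "mcore_ckpt/", "hf_out/")                # production default
export_checkpoint(fake_export_ckpt, "mcore_ckpt/", "hf_out/", strict=False)  # test-path opt-out
assert calls[0]["strict"] is True
assert calls[1]["strict"] is False
```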

Comment on lines +87 to +91
model = AutoModelForCausalLM.from_pretrained(
    str(hf_export_dir),
    local_files_only=True,
    trust_remote_code=True,
)
Contributor


⚠️ Potential issue | 🔴 Critical


Remove trust_remote_code=True from the model load.

The exported model is a standard Llama architecture loaded locally with local_files_only=True. This hardcoded parameter violates coding guidelines requiring that trust_remote_code default to False with caller opt-in. Removing it enforces the safe default for HuggingFace model loading.

Proposed fix
     model = AutoModelForCausalLM.from_pretrained(
         str(hf_export_dir),
         local_files_only=True,
-        trust_remote_code=True,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py` around
lines 87 - 91, The AutoModelForCausalLM.from_pretrained call is passing
trust_remote_code=True; remove that argument so the call uses the safe default
(trust_remote_code=False) when loading the exported Llama model locally (keep
local_files_only=True and the hf_export_dir argument intact); update the
invocation of AutoModelForCausalLM.from_pretrained (the variable "model" and the
"hf_export_dir" usage) to omit trust_remote_code.

Comment on lines +100 to +111
def _prepare_student_and_teacher_models(project_root_path: Path, tmp_path: Path) -> tuple[str, str]:
    """Prepare student and teacher models for distillation.

    Creates Llama models programmatically, converts them to heterogeneous format (AnyModel),
    and returns the paths to the converted checkpoints.

    Args:
        project_root_path: Path to the project root directory
        tmp_path: Temporary directory for test artifacts

    Returns:
        Tuple of (student_hf_path, teacher_hf_path) as strings

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the file
find . -name "test_distill_hf.py" -path "*/puzzletron/mbridge_distillation/*" | head -5

Repository: NVIDIA/Model-Optimizer

Length of output: 133


🏁 Script executed:

# Read the relevant lines from the file to verify the issue
sed -n '100,165p' tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2316


Fix the helper's return type annotation.

The function signature declares tuple[str, str] but line 160 returns Path objects (created via tmp_path / "student_anymodel" and tmp_path / "teacher_anymodel"). Change the return type to tuple[Path, Path] and update the docstring accordingly.

Proposed fix
-def _prepare_student_and_teacher_models(project_root_path: Path, tmp_path: Path) -> tuple[str, str]:
+def _prepare_student_and_teacher_models(
+    project_root_path: Path, tmp_path: Path
+) -> tuple[Path, Path]:
     """Prepare student and teacher models for distillation.
 
     Creates Llama models programmatically, converts them to heterogeneous format (AnyModel),
     and returns the paths to the converted checkpoints.
 
     Args:
         project_root_path: Path to the project root directory
         tmp_path: Temporary directory for test artifacts
 
     Returns:
-        Tuple of (student_hf_path, teacher_hf_path) as strings
+        Tuple of (student_hf_path, teacher_hf_path) as paths
📝 Committable suggestion


Suggested change
def _prepare_student_and_teacher_models(project_root_path: Path, tmp_path: Path) -> tuple[str, str]:
"""Prepare student and teacher models for distillation.
Creates Llama models programmatically, converts them to heterogeneous format (AnyModel),
and returns the paths to the converted checkpoints.
Args:
project_root_path: Path to the project root directory
tmp_path: Temporary directory for test artifacts
Returns:
Tuple of (student_hf_path, teacher_hf_path) as strings
def _prepare_student_and_teacher_models(
project_root_path: Path, tmp_path: Path
) -> tuple[Path, Path]:
"""Prepare student and teacher models for distillation.
Creates Llama models programmatically, converts them to heterogeneous format (AnyModel),
and returns the paths to the converted checkpoints.
Args:
project_root_path: Path to the project root directory
tmp_path: Temporary directory for test artifacts
Returns:
Tuple of (student_hf_path, teacher_hf_path) as paths
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py` around
lines 100 - 111, The function _prepare_student_and_teacher_models currently
annotates its return type as tuple[str, str] but actually returns Path objects
(e.g., tmp_path / "student_anymodel" and tmp_path / "teacher_anymodel"); change
the signature to return tuple[Path, Path], update the docstring return
description to say it returns Tuple of (student_hf_path, teacher_hf_path) as
Path objects, and ensure any type hints/imports reference pathlib.Path if
needed.
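A minimal sketch of the corrected shape — the directory names mirror the test, but this standalone helper (`prepare_model_dirs`) is illustrative, not the repository code:

```python
from pathlib import Path


def prepare_model_dirs(tmp_path: Path) -> tuple[Path, Path]:
    """Create and return the student/teacher checkpoint directories.

    Returns Path objects, matching the tuple[Path, Path] annotation,
    so callers can use Path methods (e.g. .exists()) without casting.
    """
    student_path = tmp_path / "student_anymodel"
    teacher_path = tmp_path / "teacher_anymodel"
    student_path.mkdir(parents=True, exist_ok=True)
    teacher_path.mkdir(parents=True, exist_ok=True)
    return student_path, teacher_path
```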

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the dkorzekwa/anymodel_mbridgedist branch from 3a15188 to 24ba700 Compare March 19, 2026 21:07
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
examples/puzzletron/mbridge_distillation/distill_hf.py (1)

156-163: Consider making --hf_model conditionally required based on export usage.

The --hf_model argument is marked as required=True, but it's only used when --hf_export_path is provided (lines 290-311). Users who don't want to export to HuggingFace format are still forced to supply this argument.

Consider either:

  1. Making it optional, with a validation check when --hf_export_path is provided
  2. Documenting that it is intentionally required for future use cases
♻️ Proposed fix to make --hf_model conditionally required
     parser.add_argument(
         "--hf_model",
         "--hf-model",
         type=str,
-        required=True,
+        default=None,
         help="HuggingFace model ID to use as template for export (e.g., meta-llama/Llama-3.1-8B-Instruct). "
         "Should match the base architecture of the student model.",
     )

Then add validation after argument parsing:

    # Sanity checks
    if not args.use_mock_data and not args.data_paths:
        raise ValueError("Must provide either --data_paths or set --use_mock_data.")
+   if args.hf_export_path and not args.hf_model:
+       raise ValueError("--hf_model is required when --hf_export_path is provided.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/puzzletron/mbridge_distillation/distill_hf.py` around lines 156 -
163, The hf_model argument is currently required unconditionally by
parser.add_argument("--hf_model") but is only needed when exporting to
HuggingFace; change parser.add_argument to make hf_model optional (remove
required=True) and add a post-parse validation in the main function where args
are checked (the same place that validates use_mock_data / data_paths) to raise
a clear error if args.hf_export_path is set but args.hf_model is empty (e.g., if
args.hf_export_path and not args.hf_model: raise ValueError("--hf_model is
required when --hf_export_path is provided.")).
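The conditional-requirement pattern can be sketched with plain argparse; `parse_args` here is a standalone illustration, not the script's actual CLI:

```python
import argparse


def parse_args(argv=None):
    """Parse CLI flags, making --hf_model optional but enforced when exporting."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--hf_export_path", type=str, default=None,
                        help="Optional directory for the HuggingFace export.")
    parser.add_argument("--hf_model", type=str, default=None,
                        help="Template HF model ID; required only when exporting.")
    args = parser.parse_args(argv)
    # Enforce the dependency after parsing instead of required=True.
    if args.hf_export_path and not args.hf_model:
        parser.error("--hf_model is required when --hf_export_path is provided.")
    return args
```

parser.error() prints the usage string and exits with status 2, which gives the same user-facing behavior as a missing required argument while keeping the flag optional for non-export runs.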
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@examples/puzzletron/mbridge_distillation/distill_hf.py`:
- Around line 156-163: The hf_model argument is currently required
unconditionally by parser.add_argument("--hf_model") but is only needed when
exporting to HuggingFace; change parser.add_argument to make hf_model optional
(remove required=True) and add a post-parse validation in the main function
where args are checked (the same place that validates use_mock_data /
data_paths) to raise a clear error if args.hf_export_path is set but
args.hf_model is empty (e.g., if args.hf_export_path and not args.hf_model:
raise ValueError("--hf_model is required when --hf_export_path is provided.")).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4635cdc3-c666-4611-9336-b0a3ea4fe62b

📥 Commits

Reviewing files that changed from the base of the PR and between 24ba700 and 44186c7.

📒 Files selected for processing (1)
  • examples/puzzletron/mbridge_distillation/distill_hf.py

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
examples/puzzletron/mbridge_distillation/distill_hf.py (1)

156-163: Consider making --hf_model conditionally required.

--hf_model is marked required=True, but HuggingFace export only runs when --hf_export_path is provided (line 290). This forces users to supply --hf_model even when they don't intend to export.

♻️ Suggested fix: validate conditionally in argument sanity checks
     parser.add_argument(
         "--hf_model",
         "--hf-model",
         type=str,
-        required=True,
+        default=None,
         help="HuggingFace model ID to use as template for export (e.g., meta-llama/Llama-3.1-8B-Instruct). "
         "Should match the base architecture of the student model.",
     )

Then add validation in the sanity checks section around line 167:

    if args.hf_export_path and not args.hf_model:
        raise ValueError("--hf_model is required when --hf_export_path is provided.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/puzzletron/mbridge_distillation/distill_hf.py` around lines 156 -
163, Remove required=True from the parser.add_argument call for "--hf_model" so
the flag is optional, and add a conditional validation in the argument
sanity-checks (where args is validated) to raise an error when hf_export_path is
provided but args.hf_model is missing (e.g., check if args.hf_export_path and
not args.hf_model then raise ValueError indicating "--hf_model is required when
--hf_export_path is provided"); this ensures parser.add_argument("--hf_model",
...) is optional while the export pathway still enforces the requirement.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@examples/puzzletron/mbridge_distillation/distill_hf.py`:
- Around line 156-163: Remove required=True from the parser.add_argument call
for "--hf_model" so the flag is optional, and add a conditional validation in
the argument sanity-checks (where args is validated) to raise an error when
hf_export_path is provided but args.hf_model is missing (e.g., check if
args.hf_export_path and not args.hf_model then raise ValueError indicating
"--hf_model is required when --hf_export_path is provided"); this ensures
parser.add_argument("--hf_model", ...) is optional while the export pathway
still enforces the requirement.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8a205585-a2b7-42fc-813a-e6cffaa9c187

📥 Commits

Reviewing files that changed from the base of the PR and between 81f6d4e and a5b715f.

📒 Files selected for processing (2)
  • examples/puzzletron/mbridge_distillation/distill_hf.py
  • modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py

@danielkorzekwa danielkorzekwa merged commit 01cba6a into feature/puzzletron Mar 20, 2026
39 of 41 checks passed
@danielkorzekwa danielkorzekwa deleted the dkorzekwa/anymodel_mbridgedist branch March 20, 2026 11:43
