
Merge mbridge distillation for any_model#1036

Merged
danielkorzekwa merged 101 commits into feature/puzzletron from dkorzekwa/anymodel_mbridgedist
Mar 20, 2026

Conversation

@danielkorzekwa
Contributor

@danielkorzekwa danielkorzekwa commented Mar 13, 2026

What does this PR do?

Merge anymodel mbridge distillation

Summary by CodeRabbit

  • New Features

    • Heterogeneous-layer support for Puzzletron AnyModel → Megatron-Core conversions
    • Knowledge-distillation workflow support and tooling
    • Export of trained models to HuggingFace format (new export helper)
    • Bridge support extended for Llama and Qwen3 architectures
    • New CLI options to control HuggingFace export destination and template model
  • Tests

    • Added integration test validating the distillation + export flow
    • Removed an obsolete trivial test
  • Documentation

    • Updated README to document HF export flags
  • Chores

    • CI: switched example-tests workflow to use NeMo and updated pip invocation

- Add converter, model_descriptor, puzzformer, and llama model support
- Selective merge of anymodel functionality

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
…s merged)

@danielkorzekwa danielkorzekwa requested a review from a team as a code owner March 18, 2026 13:16
@danielkorzekwa danielkorzekwa requested review from Edwardf0t1 and removed request for a team March 18, 2026 13:16
@kevalmorabia97 kevalmorabia97 requested review from kevalmorabia97 and removed request for Edwardf0t1 March 19, 2026 12:01
Base automatically changed from dkorzekwa/anymodel_tutorial to feature/puzzletron March 19, 2026 12:56
@codecov

codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.10%. Comparing base (1b42f0b) to head (a5b715f).
⚠️ Report is 3 commits behind head on feature/puzzletron.

Additional details and impacted files
@@                  Coverage Diff                   @@
##           feature/puzzletron    #1036      +/-   ##
======================================================
- Coverage               72.12%   72.10%   -0.03%     
======================================================
  Files                     209      209              
  Lines                   23628    23628              
======================================================
- Hits                    17042    17036       -6     
- Misses                   6586     6592       +6     

@kevalmorabia97 kevalmorabia97 requested a review from a team as a code owner March 19, 2026 20:38
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (1)
.github/workflows/example_tests.yml (1)

70-70: Pin the NeMo container image by digest for reproducibility. NVIDIA's version tags (such as 26.02) are mutable and may be updated to point to newer patches, introducing drift in CI. Use an immutable digest instead: nvcr.io/nvidia/nemo:26.02@sha256:<digest>. To obtain the digest, run docker manifest inspect nvcr.io/nvidia/nemo:26.02 (or use ORAS; see NVIDIA NGC docs) and pin the returned sha256 hash.

Suggested change
-      docker_image: "nvcr.io/nvidia/nemo:26.02"
+      docker_image: "nvcr.io/nvidia/nemo:26.02@sha256:<digest>"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/example_tests.yml at line 70, The docker_image value
currently uses a mutable tag ("nvcr.io/nvidia/nemo:26.02"); replace it with an
immutable image reference pinned by digest (format:
"nvcr.io/nvidia/nemo:26.02@sha256:<digest>"). Run docker manifest inspect
nvcr.io/nvidia/nemo:26.02 (or ORAS/ngc) to obtain the sha256 digest and update
the docker_image entry to include that digest so CI uses a reproducible,
immutable container.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/puzzletron/export/mbridge/base.py`:
- Around line 47-50: The parent provider's transformer_layer_spec is being
copied back into provider_kwargs and overwrites the heterogeneous_layer_spec set
on GenericHeterogeneousProvider; update the provider construction logic (the
code that builds provider_kwargs around provider_kwargs / provider_kwargs update
in the factory method that handles parent providers) so it does not copy or
override transformer_layer_spec from the parent into provider_kwargs if
GenericHeterogeneousProvider.transformer_layer_spec was explicitly set to
heterogeneous_layer_spec (i.e., only inherit transformer_layer_spec when the
child has none), and ensure heterogeneous_layers_config_path /
heterogeneous_layers_config_encoded_json are likewise preserved on the returned
provider instead of being clobbered by the parent.

In `@modelopt/torch/puzzletron/export/mbridge/distillation_provider.py`:
- Around line 162-166: The __setattr__ currently mirrors every matching
attribute to self.teacher which can accidentally overwrite static config fields;
restrict mirroring to a small explicit whitelist of runtime-only fields (e.g.,
define RUNTIME_MIRRORED_FIELDS = {...} at class scope) and change __setattr__ to
only call setattr(self.teacher, name, value) when name is in that whitelist and
getattr(self, "teacher", None) is not None and hasattr(self.teacher, name);
update the whitelist to include only the fields that must stay aligned at
runtime (and exclude config keys like num_layers, hidden_size,
heterogeneous_layers_config_encoded_json).
- Around line 183-188: Don't mutate DistillationProvider.__bases__ globally;
instead create a per-instance dynamic subclass combining the existing
student_provider class and DistillationProvider and assign that new type to
student_provider.__class__ so only this instance's MRO changes. Use the existing
per-instance hook (_super_class) pattern: build a subclass whose bases are
(student_provider.__class__, DistillationProvider) (or vice versa as required
for correct MRO), set student_provider.__class__ to that new subclass, then set
student_provider.teacher, student_provider.kd_config and call
student_provider.__post_init__ as before.

In `@modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py`:
- Around line 63-74: The export currently hardcodes strict=False when calling
bridge.export_ckpt in export_mbridge_to_hf.py which disables missing-tensor
checks for all exports; change the export_mbridge_to_hf entrypoint (e.g., the
function that wraps this logic) to accept a strict: bool parameter defaulting to
True and use that parameter in the bridge.export_ckpt call (replace strict=False
with strict=strict); update the test path (test_distill_hf.py or its caller) to
call this function with strict=False while leaving production callers to use the
default True.
- Around line 25-31: The export_to_hf_and_copy_config function currently
hardcodes trust_remote_code=True when loading the HF model; add a
caller-configurable parameter trust_remote_code: bool = False to
export_to_hf_and_copy_config and use that parameter in place of the hardcoded
True when invoking the HF model loader (the call that currently passes
trust_remote_code=True for hf_model). Update the function signature and any
internal call sites inside export_to_hf_and_copy_config (and propagate the
parameter to any helper functions if they perform the load) so the default
behavior is secure (False) but callers can opt in by passing True.

In `@tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py`:
- Around line 17-23: Add a skip when no CUDA is available by importing pytest
and setting a module-level pytestmark =
pytest.mark.skipif(torch.cuda.device_count() == 0, reason="requires CUDA") in
tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py (so the test
won't call torchrun with nproc_per_node=0); add the import for pytest near the
other imports and apply the same guard to the other test block referenced (lines
44-50) or keep it module-wide so all tests in this file are skipped when
torch.cuda.device_count() == 0.
- Around line 100-111: The function _prepare_student_and_teacher_models
currently annotates its return type as tuple[str, str] but actually returns Path
objects (e.g., tmp_path / "student_anymodel" and tmp_path / "teacher_anymodel");
change the signature to return tuple[Path, Path], update the docstring return
description to say it returns Tuple of (student_hf_path, teacher_hf_path) as
Path objects, and ensure any type hints/imports reference pathlib.Path if
needed.
- Around line 87-91: The AutoModelForCausalLM.from_pretrained call is passing
trust_remote_code=True; remove that argument so the call uses the safe default
(trust_remote_code=False) when loading the exported Llama model locally (keep
local_files_only=True and the hf_export_dir argument intact); update the
invocation of AutoModelForCausalLM.from_pretrained (the variable "model" and the
"hf_export_dir" usage) to omit trust_remote_code.

---

Nitpick comments:
In @.github/workflows/example_tests.yml:
- Line 70: The docker_image value currently uses a mutable tag
("nvcr.io/nvidia/nemo:26.02"); replace it with an immutable image reference
pinned by digest (format: "nvcr.io/nvidia/nemo:26.02@sha256:<digest>"). Run
docker manifest inspect nvcr.io/nvidia/nemo:26.02 (or ORAS/ngc) to obtain the
sha256 digest and update the docker_image entry to include that digest so CI
uses a reproducible, immutable container.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7e2d0840-b540-4de4-8cd7-f4504db455ed

📥 Commits

Reviewing files that changed from the base of the PR and between 660dc17 and 3a15188ef7376972734363954ec3bcf9ecb64348.

📒 Files selected for processing (9)
  • .github/workflows/example_tests.yml
  • modelopt/torch/puzzletron/export/mbridge/__init__.py
  • modelopt/torch/puzzletron/export/mbridge/base.py
  • modelopt/torch/puzzletron/export/mbridge/distillation_provider.py
  • modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py
  • modelopt/torch/puzzletron/export/mbridge/llama.py
  • modelopt/torch/puzzletron/export/mbridge/qwen3.py
  • tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py
  • tests/examples/puzzletron/test_dummy.py
💤 Files with no reviewable changes (1)
  • tests/examples/puzzletron/test_dummy.py

Comment on lines +47 to +50
# Heterogeneous configuration fields
heterogeneous_layers_config_path: str | None = None
heterogeneous_layers_config_encoded_json: str = ""
transformer_layer_spec: ModuleSpec | Callable = heterogeneous_layer_spec
Contributor


⚠️ Potential issue | 🟠 Major

Don't let the parent provider overwrite the heterogeneous layer spec.

Line 50 sets GenericHeterogeneousProvider.transformer_layer_spec to heterogeneous_layer_spec, but Lines 93-113 copy the parent's transformer_layer_spec straight back into provider_kwargs. That means the returned provider can still build the vanilla layer layout instead of the heterogeneous one.

Proposed fix
-        provider_kwargs = dataclasses.asdict(parent_provider)
+        provider_kwargs = {
+            field.name: getattr(parent_provider, field.name)
+            for field in fields(parent_provider)
+            if field.init
+        }
@@
         # Only keep kwargs that are valid fields
         provider_kwargs = {k: v for k, v in provider_kwargs.items() if k in valid_fields}
+        provider_kwargs["transformer_layer_spec"] = heterogeneous_layer_spec
 
         provider_kwargs["heterogeneous_layers_config_encoded_json"] = (
             self._build_heterogeneous_config_json(hf_pretrained.config)
         )

Also applies to: 91-113

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/export/mbridge/base.py` around lines 47 - 50, The
parent provider's transformer_layer_spec is being copied back into
provider_kwargs and overwrites the heterogeneous_layer_spec set on
GenericHeterogeneousProvider; update the provider construction logic (the code
that builds provider_kwargs around provider_kwargs / provider_kwargs update in
the factory method that handles parent providers) so it does not copy or
override transformer_layer_spec from the parent into provider_kwargs if
GenericHeterogeneousProvider.transformer_layer_spec was explicitly set to
heterogeneous_layer_spec (i.e., only inherit transformer_layer_spec when the
child has none), and ensure heterogeneous_layers_config_path /
heterogeneous_layers_config_encoded_json are likewise preserved on the returned
provider instead of being clobbered by the parent.
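The non-clobbering inheritance the prompt describes can be sketched with plain dataclasses; the provider classes and field names below are simplified stand-ins for the real Megatron providers, not the actual API:

```python
from dataclasses import dataclass, fields

@dataclass
class ParentProvider:  # stand-in for the HF-derived parent provider
    num_layers: int = 32
    transformer_layer_spec: str = "vanilla_spec"

@dataclass
class HeterogeneousProvider:  # stand-in for GenericHeterogeneousProvider
    num_layers: int = 0
    transformer_layer_spec: str = "heterogeneous_spec"
    heterogeneous_layers_config_encoded_json: str = ""

def build_child(parent: ParentProvider) -> HeterogeneousProvider:
    # Copy only init-able parent fields that the child also declares...
    kwargs = {f.name: getattr(parent, f.name) for f in fields(parent) if f.init}
    valid = {f.name for f in fields(HeterogeneousProvider)}
    kwargs = {k: v for k, v in kwargs.items() if k in valid}
    # ...then re-assert the child's explicit spec so the parent cannot clobber it.
    kwargs["transformer_layer_spec"] = "heterogeneous_spec"
    return HeterogeneousProvider(**kwargs)

child = build_child(ParentProvider())
assert child.transformer_layer_spec == "heterogeneous_spec"
assert child.num_layers == 32  # other parent fields are still inherited
```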

Comment on lines +162 to +166
def __setattr__(self, name, value):
    super().__setattr__(name, value)
    # Mirror to teacher if it has that attribute
    if hasattr(self.teacher, name):
        setattr(self.teacher, name, value)
Contributor


⚠️ Potential issue | 🟠 Major

Mirror only the runtime fields that truly must stay aligned.

Line 165 currently forwards every matching student attribute to the teacher. In an asymmetric distillation setup, any later write to a common field name like num_layers, hidden_size, or heterogeneous_layers_config_encoded_json will silently rewrite the teacher config too.

Proposed fix
+_MIRRORED_TEACHER_ATTRS = {
+    "tensor_model_parallel_size",
+    "pipeline_model_parallel_size",
+    "context_parallel_size",
+    "seq_length",
+    "pipeline_dtype",
+    "cross_entropy_fusion_impl",
+}
+
@@
     def __setattr__(self, name, value):
         super().__setattr__(name, value)
-        # Mirror to teacher if it has that attribute
-        if hasattr(self.teacher, name):
-            setattr(self.teacher, name, value)
+        teacher = getattr(self, "teacher", None)
+        if (
+            name in _MIRRORED_TEACHER_ATTRS
+            and teacher is not None
+            and teacher is not self
+            and hasattr(teacher, name)
+        ):
+            setattr(teacher, name, value)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/export/mbridge/distillation_provider.py` around
lines 162 - 166, The __setattr__ currently mirrors every matching attribute to
self.teacher which can accidentally overwrite static config fields; restrict
mirroring to a small explicit whitelist of runtime-only fields (e.g., define
RUNTIME_MIRRORED_FIELDS = {...} at class scope) and change __setattr__ to only
call setattr(self.teacher, name, value) when name is in that whitelist and
getattr(self, "teacher", None) is not None and hasattr(self.teacher, name);
update the whitelist to include only the fields that must stay aligned at
runtime (and exclude config keys like num_layers, hidden_size,
heterogeneous_layers_config_encoded_json).
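The whitelist approach in the prompt can be sketched independently of the real provider classes; the field names here are illustrative examples, not the actual Megatron config keys:

```python
# Runtime fields that must stay aligned between student and teacher.
_MIRRORED = {"seq_length"}

class Teacher:
    def __init__(self):
        self.seq_length = 2048
        self.num_layers = 32  # static config: must NOT be mirrored

class Student:
    def __init__(self, teacher):
        # Bypass our own hook while wiring up the teacher reference.
        object.__setattr__(self, "teacher", teacher)
        self.seq_length = 1024
        self.num_layers = 16

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        teacher = getattr(self, "teacher", None)
        if name in _MIRRORED and teacher is not None and hasattr(teacher, name):
            setattr(teacher, name, value)

t = Teacher()
s = Student(t)
s.seq_length = 4096   # whitelisted: teacher follows
s.num_layers = 8      # not whitelisted: teacher config untouched
assert t.seq_length == 4096
assert t.num_layers == 32
```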

Comment on lines +183 to +188
DistillationProvider.__bases__ = (type(student_provider),)
student_provider.__class__ = DistillationProvider

student_provider.teacher = teacher_provider
student_provider.kd_config = kd_config
student_provider.__post_init__()
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

The analysis traced distillation_provider.py, its callers in examples/puzzletron/mbridge_distillation/distill_hf.py, and the test suite, then confirmed the hazard with a small script: reassigning a class's __bases__ recalculates the MRO for every existing instance, not just future ones.

class Base1:
    def method(self): return "Base1"

class Base2:
    def method(self): return "Base2"

class Dynamic(Base1):
    pass

obj1 = Dynamic()
print(obj1.method())  # prints "Base1"

Dynamic.__bases__ = (Base2,)
print(obj1.method())  # prints "Base2": existing instances are affected

Avoid mutating DistillationProvider.__bases__ globally.

Line 183 changes the class hierarchy for every existing and future DistillationProvider instance. A second call with a different provider type (e.g., Mamba after GPT) would overwrite __bases__, breaking MRO and super() behavior for objects already in use. The design already stores _super_class per instance (line 70), making this global mutation unnecessary. Replace with a per-provider dynamic subclass or wrapper instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/export/mbridge/distillation_provider.py` around
lines 183 - 188, Don't mutate DistillationProvider.__bases__ globally; instead
create a per-instance dynamic subclass combining the existing student_provider
class and DistillationProvider and assign that new type to
student_provider.__class__ so only this instance's MRO changes. Use the existing
per-instance hook (_super_class) pattern: build a subclass whose bases are
(student_provider.__class__, DistillationProvider) (or vice versa as required
for correct MRO), set student_provider.__class__ to that new subclass, then set
student_provider.teacher, student_provider.kd_config and call
student_provider.__post_init__ as before.
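A minimal sketch of the per-instance dynamic subclass the prompt suggests; these are toy classes standing in for DistillationProvider and the real Megatron providers:

```python
class GPTProvider:
    def describe(self):
        return "gpt"

class MambaProvider:
    def describe(self):
        return "mamba"

class DistillationMixin:  # stand-in for DistillationProvider
    def describe(self):
        return "distill+" + super().describe()

def convert(provider):
    # Build a one-off subclass so only this instance's MRO changes;
    # no shared class object is mutated.
    cls = provider.__class__
    dynamic = type(f"Distillation{cls.__name__}", (DistillationMixin, cls), {})
    provider.__class__ = dynamic
    return provider

a = convert(GPTProvider())
b = convert(MambaProvider())  # a second call with a different type is safe
assert a.describe() == "distill+gpt"
assert b.describe() == "distill+mamba"
assert GPTProvider().describe() == "gpt"  # untouched instances keep their MRO
```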

Comment on lines +63 to +74
print_rank_0("📤 Exporting to HuggingFace format...")
# Use strict=False for test_distill_hf.py which uses small models (2 layers) with fewer layers
# than the template model (32 layers). This allows partial exports when some tensors are missing.
# Note: This is NOT needed when running on real compressed puzzletron student models,
# which have the same number of layers as the template model (some may be skipped via no_op
# in block_configs, but all layer tensors are still present in the checkpoint).
bridge.export_ckpt(
    megatron_path=megatron_path,
    hf_path=hf_export_path,
    show_progress=True,
    strict=False,  # Needed for test_distill_hf.py small models; not needed for real compressed models
)
Contributor


⚠️ Potential issue | 🟠 Major

Don't bake the test-only strict=False into every export.

The inline comment says this relaxation is only needed for test_distill_hf.py, but the helper now suppresses missing/mismatched tensor errors for production exports too. Please make strict a caller-controlled argument defaulting to True, and pass False only from the small-model test path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py` around
lines 63 - 74, The export currently hardcodes strict=False when calling
bridge.export_ckpt in export_mbridge_to_hf.py which disables missing-tensor
checks for all exports; change the export_mbridge_to_hf entrypoint (e.g., the
function that wraps this logic) to accept a strict: bool parameter defaulting to
True and use that parameter in the bridge.export_ckpt call (replace strict=False
with strict=strict); update the test path (test_distill_hf.py or its caller) to
call this function with strict=False while leaving production callers to use the
default True.
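The caller-controlled flag can be sketched with a thin wrapper; export_checkpoint and fake_export_ckpt are hypothetical names standing in for the real entrypoint around bridge.export_ckpt:

```python
def export_checkpoint(export_fn, megatron_path, hf_path, strict=True):
    # strict defaults to True so production exports keep missing-tensor checks;
    # only the small-model test path should pass strict=False.
    export_fn(megatron_path=megatron_path, hf_path=hf_path, strict=strict)

calls = []
def fake_export_ckpt(**kwargs):
    calls.append(kwargs)

export_checkpoint(fake_export_ckpt, "mcore_ckpt/", "hf_out/")                # production default
export_checkpoint(fake_export_ckpt, "mcore_ckpt/", "hf_out/", strict=False)  # test-path opt-out
assert calls[0]["strict"] is True
assert calls[1]["strict"] is False
```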

Comment on lines +87 to +91
model = AutoModelForCausalLM.from_pretrained(
    str(hf_export_dir),
    local_files_only=True,
    trust_remote_code=True,
)
Contributor


⚠️ Potential issue | 🔴 Critical


Remove trust_remote_code=True from the model load.

The exported model is a standard Llama architecture loaded locally with local_files_only=True. This hardcoded parameter violates coding guidelines requiring that trust_remote_code default to False with caller opt-in. Removing it enforces the safe default for HuggingFace model loading.

Proposed fix
     model = AutoModelForCausalLM.from_pretrained(
         str(hf_export_dir),
         local_files_only=True,
-        trust_remote_code=True,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py` around
lines 87 - 91, The AutoModelForCausalLM.from_pretrained call is passing
trust_remote_code=True; remove that argument so the call uses the safe default
(trust_remote_code=False) when loading the exported Llama model locally (keep
local_files_only=True and the hf_export_dir argument intact); update the
invocation of AutoModelForCausalLM.from_pretrained (the variable "model" and the
"hf_export_dir" usage) to omit trust_remote_code.

Comment on lines +100 to +111
def _prepare_student_and_teacher_models(project_root_path: Path, tmp_path: Path) -> tuple[str, str]:
    """Prepare student and teacher models for distillation.

    Creates Llama models programmatically, converts them to heterogeneous format (AnyModel),
    and returns the paths to the converted checkpoints.

    Args:
        project_root_path: Path to the project root directory
        tmp_path: Temporary directory for test artifacts

    Returns:
        Tuple of (student_hf_path, teacher_hf_path) as strings

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the file
find . -name "test_distill_hf.py" -path "*/puzzletron/mbridge_distillation/*" | head -5

Repository: NVIDIA/Model-Optimizer

Length of output: 133


🏁 Script executed:

# Read the relevant lines from the file to verify the issue
sed -n '100,165p' tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2316


Fix the helper's return type annotation.

The function signature declares tuple[str, str] but line 160 returns Path objects (created via tmp_path / "student_anymodel" and tmp_path / "teacher_anymodel"). Change the return type to tuple[Path, Path] and update the docstring accordingly.

Proposed fix
-def _prepare_student_and_teacher_models(project_root_path: Path, tmp_path: Path) -> tuple[str, str]:
+def _prepare_student_and_teacher_models(
+    project_root_path: Path, tmp_path: Path
+) -> tuple[Path, Path]:
     """Prepare student and teacher models for distillation.
 
     Creates Llama models programmatically, converts them to heterogeneous format (AnyModel),
     and returns the paths to the converted checkpoints.
 
     Args:
         project_root_path: Path to the project root directory
         tmp_path: Temporary directory for test artifacts
 
     Returns:
-        Tuple of (student_hf_path, teacher_hf_path) as strings
+        Tuple of (student_hf_path, teacher_hf_path) as paths
📝 Committable suggestion


Suggested change
def _prepare_student_and_teacher_models(project_root_path: Path, tmp_path: Path) -> tuple[str, str]:
"""Prepare student and teacher models for distillation.
Creates Llama models programmatically, converts them to heterogeneous format (AnyModel),
and returns the paths to the converted checkpoints.
Args:
project_root_path: Path to the project root directory
tmp_path: Temporary directory for test artifacts
Returns:
Tuple of (student_hf_path, teacher_hf_path) as strings
def _prepare_student_and_teacher_models(
project_root_path: Path, tmp_path: Path
) -> tuple[Path, Path]:
"""Prepare student and teacher models for distillation.
Creates Llama models programmatically, converts them to heterogeneous format (AnyModel),
and returns the paths to the converted checkpoints.
Args:
project_root_path: Path to the project root directory
tmp_path: Temporary directory for test artifacts
Returns:
Tuple of (student_hf_path, teacher_hf_path) as paths
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py` around
lines 100 - 111, The function _prepare_student_and_teacher_models currently
annotates its return type as tuple[str, str] but actually returns Path objects
(e.g., tmp_path / "student_anymodel" and tmp_path / "teacher_anymodel"); change
the signature to return tuple[Path, Path], update the docstring return
description to say it returns Tuple of (student_hf_path, teacher_hf_path) as
Path objects, and ensure any type hints/imports reference pathlib.Path if
needed.
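A minimal sketch of the corrected shape — the directory names mirror the test, but this standalone helper (`prepare_model_dirs`) is illustrative, not the repository code:

```python
from pathlib import Path


def prepare_model_dirs(tmp_path: Path) -> tuple[Path, Path]:
    """Create and return the student/teacher checkpoint directories.

    Returns Path objects, matching the tuple[Path, Path] annotation,
    so callers can use Path methods (e.g. .exists()) without casting.
    """
    student_path = tmp_path / "student_anymodel"
    teacher_path = tmp_path / "teacher_anymodel"
    student_path.mkdir(parents=True, exist_ok=True)
    teacher_path.mkdir(parents=True, exist_ok=True)
    return student_path, teacher_path
```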

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the dkorzekwa/anymodel_mbridgedist branch from 3a15188 to 24ba700 Compare March 19, 2026 21:07
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
examples/puzzletron/mbridge_distillation/distill_hf.py (1)

156-163: Consider making --hf_model conditionally required based on export usage.

The --hf_model argument is marked as required=True, but it's only used when --hf_export_path is provided (lines 290-311). Users who don't want to export to HuggingFace format are still forced to supply this argument.

Consider either:

  1. Making it optional, with a validation check when --hf_export_path is provided
  2. Documenting that it is intentionally required for future use cases
♻️ Proposed fix to make --hf_model conditionally required
     parser.add_argument(
         "--hf_model",
         "--hf-model",
         type=str,
-        required=True,
+        default=None,
         help="HuggingFace model ID to use as template for export (e.g., meta-llama/Llama-3.1-8B-Instruct). "
         "Should match the base architecture of the student model.",
     )

Then add validation after argument parsing:

    # Sanity checks
    if not args.use_mock_data and not args.data_paths:
        raise ValueError("Must provide either --data_paths or set --use_mock_data.")
+   if args.hf_export_path and not args.hf_model:
+       raise ValueError("--hf_model is required when --hf_export_path is provided.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/puzzletron/mbridge_distillation/distill_hf.py` around lines 156 -
163, The hf_model argument is currently required unconditionally by
parser.add_argument("--hf_model") but is only needed when exporting to
HuggingFace; change parser.add_argument to make hf_model optional (remove
required=True) and add a post-parse validation in the main function where args
are checked (the same place that validates use_mock_data / data_paths) to raise
a clear error if args.hf_export_path is set but args.hf_model is empty (e.g., if
args.hf_export_path and not args.hf_model: raise ValueError("--hf_model is
required when --hf_export_path is provided.")).
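The conditional-requirement pattern can be sketched with plain argparse; `parse_args` here is a standalone illustration, not the script's actual CLI:

```python
import argparse


def parse_args(argv=None):
    """Parse CLI flags, making --hf_model optional but enforced when exporting."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--hf_export_path", type=str, default=None,
                        help="Optional directory for the HuggingFace export.")
    parser.add_argument("--hf_model", type=str, default=None,
                        help="Template HF model ID; required only when exporting.")
    args = parser.parse_args(argv)
    # Enforce the dependency after parsing instead of required=True.
    if args.hf_export_path and not args.hf_model:
        parser.error("--hf_model is required when --hf_export_path is provided.")
    return args
```

parser.error() prints the usage string and exits with status 2, which gives the same user-facing behavior as a missing required argument while keeping the flag optional for non-export runs.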
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@examples/puzzletron/mbridge_distillation/distill_hf.py`:
- Around line 156-163: The hf_model argument is currently required
unconditionally by parser.add_argument("--hf_model") but is only needed when
exporting to HuggingFace; change parser.add_argument to make hf_model optional
(remove required=True) and add a post-parse validation in the main function
where args are checked (the same place that validates use_mock_data /
data_paths) to raise a clear error if args.hf_export_path is set but
args.hf_model is empty (e.g., if args.hf_export_path and not args.hf_model:
raise ValueError("--hf_model is required when --hf_export_path is provided.")).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4635cdc3-c666-4611-9336-b0a3ea4fe62b

📥 Commits

Reviewing files that changed from the base of the PR and between 24ba700 and 44186c7.

📒 Files selected for processing (1)
  • examples/puzzletron/mbridge_distillation/distill_hf.py

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
examples/puzzletron/mbridge_distillation/distill_hf.py (1)

156-163: Consider making --hf_model conditionally required.

--hf_model is marked required=True, but HuggingFace export only runs when --hf_export_path is provided (line 290). This forces users to supply --hf_model even when they don't intend to export.

♻️ Suggested fix: validate conditionally in argument sanity checks
     parser.add_argument(
         "--hf_model",
         "--hf-model",
         type=str,
-        required=True,
+        default=None,
         help="HuggingFace model ID to use as template for export (e.g., meta-llama/Llama-3.1-8B-Instruct). "
         "Should match the base architecture of the student model.",
     )

Then add validation in the sanity checks section around line 167:

    if args.hf_export_path and not args.hf_model:
        raise ValueError("--hf_model is required when --hf_export_path is provided.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/puzzletron/mbridge_distillation/distill_hf.py` around lines 156 -
163, Remove required=True from the parser.add_argument call for "--hf_model" so
the flag is optional, and add a conditional validation in the argument
sanity-checks (where args is validated) to raise an error when hf_export_path is
provided but args.hf_model is missing (e.g., check if args.hf_export_path and
not args.hf_model then raise ValueError indicating "--hf_model is required when
--hf_export_path is provided"); this ensures parser.add_argument("--hf_model",
...) is optional while the export pathway still enforces the requirement.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@examples/puzzletron/mbridge_distillation/distill_hf.py`:
- Around line 156-163: Remove required=True from the parser.add_argument call
for "--hf_model" so the flag is optional, and add a conditional validation in
the argument sanity-checks (where args is validated) to raise an error when
hf_export_path is provided but args.hf_model is missing (e.g., check if
args.hf_export_path and not args.hf_model then raise ValueError indicating
"--hf_model is required when --hf_export_path is provided"); this ensures
parser.add_argument("--hf_model", ...) is optional while the export pathway
still enforces the requirement.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8a205585-a2b7-42fc-813a-e6cffaa9c187

📥 Commits

Reviewing files that changed from the base of the PR and between 81f6d4e and a5b715f.

📒 Files selected for processing (2)
  • examples/puzzletron/mbridge_distillation/distill_hf.py
  • modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py

@danielkorzekwa danielkorzekwa merged commit 01cba6a into feature/puzzletron Mar 20, 2026
39 of 41 checks passed
@danielkorzekwa danielkorzekwa deleted the dkorzekwa/anymodel_mbridgedist branch March 20, 2026 11:43
