Merge mbridge distillation for any_model #1036

danielkorzekwa merged 101 commits into feature/puzzletron
Conversation
- Add converter, model_descriptor, puzzformer, and llama model support
- Selective merge of anymodel functionality

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py
Codecov Report ✅ All modified and coverable lines are covered by tests.

Coverage Diff:

    ## feature/puzzletron    #1036      +/- ##
    ==========================================
    - Coverage       72.12%   72.10%   -0.03%
    ==========================================
      Files             209      209
      Lines           23628    23628
    ==========================================
    - Hits            17042    17036       -6
    - Misses           6586     6592       +6

☔ View full report in Codecov by Sentry.
Actionable comments posted: 8
🧹 Nitpick comments (1)
.github/workflows/example_tests.yml (1)
70-70: Pin the NeMo container image by digest for reproducibility. NVIDIA's version tags (such as 26.02) are mutable and may be updated to point to newer patches, introducing drift in CI. Use an immutable digest instead: nvcr.io/nvidia/nemo:26.02@sha256:<digest>. To obtain the digest, run `docker manifest inspect nvcr.io/nvidia/nemo:26.02` (or use ORAS; see NVIDIA NGC docs) and pin the returned sha256 hash.

Suggested change:

    - docker_image: "nvcr.io/nvidia/nemo:26.02"
    + docker_image: "nvcr.io/nvidia/nemo:26.02@sha256:<digest>"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In .github/workflows/example_tests.yml at line 70, the docker_image value currently uses a mutable tag ("nvcr.io/nvidia/nemo:26.02"); replace it with an immutable image reference pinned by digest (format: "nvcr.io/nvidia/nemo:26.02@sha256:<digest>"). Run `docker manifest inspect nvcr.io/nvidia/nemo:26.02` (or ORAS/ngc) to obtain the sha256 digest and update the docker_image entry to include that digest so CI uses a reproducible, immutable container.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/puzzletron/export/mbridge/base.py`:
- Around line 47-50: The parent provider's transformer_layer_spec is being
copied back into provider_kwargs and overwrites the heterogeneous_layer_spec set
on GenericHeterogeneousProvider; update the provider construction logic (the
code that builds provider_kwargs around provider_kwargs / provider_kwargs update
in the factory method that handles parent providers) so it does not copy or
override transformer_layer_spec from the parent into provider_kwargs if
GenericHeterogeneousProvider.transformer_layer_spec was explicitly set to
heterogeneous_layer_spec (i.e., only inherit transformer_layer_spec when the
child has none), and ensure heterogeneous_layers_config_path /
heterogeneous_layers_config_encoded_json are likewise preserved on the returned
provider instead of being clobbered by the parent.
In `@modelopt/torch/puzzletron/export/mbridge/distillation_provider.py`:
- Around line 162-166: The __setattr__ currently mirrors every matching
attribute to self.teacher which can accidentally overwrite static config fields;
restrict mirroring to a small explicit whitelist of runtime-only fields (e.g.,
define RUNTIME_MIRRORED_FIELDS = {...} at class scope) and change __setattr__ to
only call setattr(self.teacher, name, value) when name is in that whitelist and
getattr(self, "teacher", None) is not None and hasattr(self.teacher, name);
update the whitelist to include only the fields that must stay aligned at
runtime (and exclude config keys like num_layers, hidden_size,
heterogeneous_layers_config_encoded_json).
- Around line 183-188: Don't mutate DistillationProvider.__bases__ globally;
instead create a per-instance dynamic subclass combining the existing
student_provider class and DistillationProvider and assign that new type to
student_provider.__class__ so only this instance's MRO changes. Use the existing
per-instance hook (_super_class) pattern: build a subclass whose bases are
(student_provider.__class__, DistillationProvider) (or vice versa as required
for correct MRO), set student_provider.__class__ to that new subclass, then set
student_provider.teacher, student_provider.kd_config and call
student_provider.__post_init__ as before.
In `@modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py`:
- Around line 63-74: The export currently hardcodes strict=False when calling
bridge.export_ckpt in export_mbridge_to_hf.py which disables missing-tensor
checks for all exports; change the export_mbridge_to_hf entrypoint (e.g., the
function that wraps this logic) to accept a strict: bool parameter defaulting to
True and use that parameter in the bridge.export_ckpt call (replace strict=False
with strict=strict); update the test path (test_distill_hf.py or its caller) to
call this function with strict=False while leaving production callers to use the
default True.
- Around line 25-31: The export_to_hf_and_copy_config function currently
hardcodes trust_remote_code=True when loading the HF model; add a
caller-configurable parameter trust_remote_code: bool = False to
export_to_hf_and_copy_config and use that parameter in place of the hardcoded
True when invoking the HF model loader (the call that currently passes
trust_remote_code=True for hf_model). Update the function signature and any
internal call sites inside export_to_hf_and_copy_config (and propagate the
parameter to any helper functions if they perform the load) so the default
behavior is secure (False) but callers can opt in by passing True.
In `@tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py`:
- Around line 17-23: Add a skip when no CUDA is available by importing pytest
and setting a module-level pytestmark =
pytest.mark.skipif(torch.cuda.device_count() == 0, reason="requires CUDA") in
tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py (so the test
won't call torchrun with nproc_per_node=0); add the import for pytest near the
other imports and apply the same guard to the other test block referenced (lines
44-50) or keep it module-wide so all tests in this file are skipped when
torch.cuda.device_count() == 0.
- Around line 100-111: The function _prepare_student_and_teacher_models
currently annotates its return type as tuple[str, str] but actually returns Path
objects (e.g., tmp_path / "student_anymodel" and tmp_path / "teacher_anymodel");
change the signature to return tuple[Path, Path], update the docstring return
description to say it returns Tuple of (student_hf_path, teacher_hf_path) as
Path objects, and ensure any type hints/imports reference pathlib.Path if
needed.
- Around line 87-91: The AutoModelForCausalLM.from_pretrained call is passing
trust_remote_code=True; remove that argument so the call uses the safe default
(trust_remote_code=False) when loading the exported Llama model locally (keep
local_files_only=True and the hf_export_dir argument intact); update the
invocation of AutoModelForCausalLM.from_pretrained (the variable "model" and the
"hf_export_dir" usage) to omit trust_remote_code.
---
Nitpick comments:
In @.github/workflows/example_tests.yml:
- Line 70: The docker_image value currently uses a mutable tag
("nvcr.io/nvidia/nemo:26.02"); replace it with an immutable image reference
pinned by digest (format: "nvcr.io/nvidia/nemo:26.02@sha256:<digest>"). Run
docker manifest inspect nvcr.io/nvidia/nemo:26.02 (or ORAS/ngc) to obtain the
sha256 digest and update the docker_image entry to include that digest so CI
uses a reproducible, immutable container.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7e2d0840-b540-4de4-8cd7-f4504db455ed
📥 Commits
Reviewing files that changed from the base of the PR and between 660dc17 and 3a15188ef7376972734363954ec3bcf9ecb64348.
📒 Files selected for processing (9)
- .github/workflows/example_tests.yml
- modelopt/torch/puzzletron/export/mbridge/__init__.py
- modelopt/torch/puzzletron/export/mbridge/base.py
- modelopt/torch/puzzletron/export/mbridge/distillation_provider.py
- modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py
- modelopt/torch/puzzletron/export/mbridge/llama.py
- modelopt/torch/puzzletron/export/mbridge/qwen3.py
- tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py
- tests/examples/puzzletron/test_dummy.py
💤 Files with no reviewable changes (1)
- tests/examples/puzzletron/test_dummy.py
    # Heterogeneous configuration fields
    heterogeneous_layers_config_path: str | None = None
    heterogeneous_layers_config_encoded_json: str = ""
    transformer_layer_spec: ModuleSpec | Callable = heterogeneous_layer_spec
Don't let the parent provider overwrite the heterogeneous layer spec.
Line 50 sets GenericHeterogeneousProvider.transformer_layer_spec to heterogeneous_layer_spec, but Lines 93-113 copy the parent's transformer_layer_spec straight back into provider_kwargs. That means the returned provider can still build the vanilla layer layout instead of the heterogeneous one.
Proposed fix
- provider_kwargs = dataclasses.asdict(parent_provider)
+ provider_kwargs = {
+ field.name: getattr(parent_provider, field.name)
+ for field in fields(parent_provider)
+ if field.init
+ }
@@
# Only keep kwargs that are valid fields
provider_kwargs = {k: v for k, v in provider_kwargs.items() if k in valid_fields}
+ provider_kwargs["transformer_layer_spec"] = heterogeneous_layer_spec
     provider_kwargs["heterogeneous_layers_config_encoded_json"] = (
         self._build_heterogeneous_config_json(hf_pretrained.config)
     )

Also applies to: 91-113
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/export/mbridge/base.py` around lines 47 - 50, The
parent provider's transformer_layer_spec is being copied back into
provider_kwargs and overwrites the heterogeneous_layer_spec set on
GenericHeterogeneousProvider; update the provider construction logic (the code
that builds provider_kwargs around provider_kwargs / provider_kwargs update in
the factory method that handles parent providers) so it does not copy or
override transformer_layer_spec from the parent into provider_kwargs if
GenericHeterogeneousProvider.transformer_layer_spec was explicitly set to
heterogeneous_layer_spec (i.e., only inherit transformer_layer_spec when the
child has none), and ensure heterogeneous_layers_config_path /
heterogeneous_layers_config_encoded_json are likewise preserved on the returned
provider instead of being clobbered by the parent.
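As an illustration of the suggested fix, here is a minimal, self-contained sketch; the provider classes and `heterogeneous_layer_spec` below are stand-ins, not the real mbridge types. The idea is to copy only init-able parent fields and then pin the heterogeneous spec last, so the parent cannot clobber it:

```python
from dataclasses import dataclass, fields


def heterogeneous_layer_spec():  # placeholder for the real spec factory
    return "heterogeneous"


@dataclass
class ParentProvider:  # stand-in for the homogeneous parent provider
    num_layers: int = 2
    transformer_layer_spec: object = None  # parent's vanilla spec


@dataclass
class GenericHeterogeneousProvider(ParentProvider):
    heterogeneous_layers_config_encoded_json: str = ""


parent = ParentProvider(num_layers=4, transformer_layer_spec="vanilla")

# Copy only fields the child constructor accepts...
valid_fields = {f.name for f in fields(GenericHeterogeneousProvider)}
provider_kwargs = {f.name: getattr(parent, f.name) for f in fields(parent) if f.init}
provider_kwargs = {k: v for k, v in provider_kwargs.items() if k in valid_fields}
# ...and set the heterogeneous spec AFTER the copy so it wins over the parent's.
provider_kwargs["transformer_layer_spec"] = heterogeneous_layer_spec

child = GenericHeterogeneousProvider(**provider_kwargs)
```

Ordering is the whole fix: assigning `transformer_layer_spec` after the bulk copy guarantees the vanilla spec copied from the parent never survives into the returned provider.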
    def __setattr__(self, name, value):
        super().__setattr__(name, value)
        # Mirror to teacher if it has that attribute
        if hasattr(self.teacher, name):
            setattr(self.teacher, name, value)
Mirror only the runtime fields that truly must stay aligned.
Line 165 currently forwards every matching student attribute to the teacher. In an asymmetric distillation setup, any later write to a common field name like num_layers, hidden_size, or heterogeneous_layers_config_encoded_json will silently rewrite the teacher config too.
Proposed fix
+_MIRRORED_TEACHER_ATTRS = {
+ "tensor_model_parallel_size",
+ "pipeline_model_parallel_size",
+ "context_parallel_size",
+ "seq_length",
+ "pipeline_dtype",
+ "cross_entropy_fusion_impl",
+}
+
@@
def __setattr__(self, name, value):
super().__setattr__(name, value)
- # Mirror to teacher if it has that attribute
- if hasattr(self.teacher, name):
- setattr(self.teacher, name, value)
+ teacher = getattr(self, "teacher", None)
+ if (
+ name in _MIRRORED_TEACHER_ATTRS
+ and teacher is not None
+ and teacher is not self
+ and hasattr(teacher, name)
+ ):
+ setattr(teacher, name, value)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/export/mbridge/distillation_provider.py` around
lines 162 - 166, The __setattr__ currently mirrors every matching attribute to
self.teacher which can accidentally overwrite static config fields; restrict
mirroring to a small explicit whitelist of runtime-only fields (e.g., define
RUNTIME_MIRRORED_FIELDS = {...} at class scope) and change __setattr__ to only
call setattr(self.teacher, name, value) when name is in that whitelist and
getattr(self, "teacher", None) is not None and hasattr(self.teacher, name);
update the whitelist to include only the fields that must stay aligned at
runtime (and exclude config keys like num_layers, hidden_size,
heterogeneous_layers_config_encoded_json).
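A minimal sketch of the whitelist approach, with placeholder classes and an assumed set of mirrored field names (the real whitelist must be chosen from the actual provider fields):

```python
# Assumed runtime-only fields; static config keys like num_layers stay excluded.
RUNTIME_MIRRORED_FIELDS = {"seq_length", "tensor_model_parallel_size"}


class TeacherCfg:  # stand-in for the teacher provider
    def __init__(self):
        self.seq_length = 1024
        self.num_layers = 32  # static config: must NOT be mirrored


class StudentCfg:  # stand-in for the student provider
    def __init__(self, teacher):
        self.teacher = teacher
        self.seq_length = 1024
        self.num_layers = 8

    def __setattr__(self, name, value):
        super().__setattr__(name, value)
        teacher = getattr(self, "teacher", None)
        # Mirror only whitelisted runtime fields, and only once teacher exists.
        if (
            name in RUNTIME_MIRRORED_FIELDS
            and teacher is not None
            and hasattr(teacher, name)
        ):
            setattr(teacher, name, value)


teacher = TeacherCfg()
student = StudentCfg(teacher)
student.seq_length = 4096  # runtime field: mirrored to the teacher
student.num_layers = 4     # static field: teacher config untouched
```

With the whitelist in place, asymmetric student/teacher architectures can no longer be silently homogenized by a stray attribute write.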
    DistillationProvider.__bases__ = (type(student_provider),)
    student_provider.__class__ = DistillationProvider

    student_provider.teacher = teacher_provider
    student_provider.kd_config = kd_config
    student_provider.__post_init__()
🧩 Analysis chain
🏁 Scripts executed during verification (repository: NVIDIA/Model-Optimizer; output lengths omitted):

    # Locate the file and its callers
    find . -name "distillation_provider.py"
    git ls-files | grep -i distillation_provider
    sed -n '1,200p' modelopt/torch/puzzletron/export/mbridge/distillation_provider.py
    rg "convert_to_distillation_provider" --type py -B 2 -A 2
    rg "GPTModelProvider|MambaModelProvider" --type py -l

    # Inspect expected usage in tests and examples
    cat -n tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py
    cat -n examples/puzzletron/mbridge_distillation/distill_hf.py
    grep -r "convert_to_distillation_provider" tests/ --include="*.py" -B 5 -A 5

    # Check how __bases__ and _super_class are used
    rg "DistillationProvider.__bases__|__class__ = DistillationProvider" --type py
    rg "_super_class" modelopt/torch/puzzletron/export/mbridge/distillation_provider.py -B 2 -A 2

    # Conceptual demonstration that mutating __bases__ affects existing instances
    python3 << 'EOF'
    class Base1:
        def method(self): return "Base1"
    class Base2:
        def method(self): return "Base2"
    class Dynamic(Base1):
        pass
    obj1 = Dynamic()
    print(f"Before change - obj1.method(): {obj1.method()}")
    print(f"Before change - Dynamic.__bases__: {Dynamic.__bases__}")
    Dynamic.__bases__ = (Base2,)
    print(f"After change - obj1.method(): {obj1.method()}")  # MRO is recalculated
    print(f"After change - Dynamic.__bases__: {Dynamic.__bases__}")
    # This shows that existing instances ARE affected
    EOF
Avoid mutating DistillationProvider.__bases__ globally.
Line 183 changes the class hierarchy for every existing and future DistillationProvider instance. A second call with a different provider type (e.g., Mamba after GPT) would overwrite __bases__, breaking MRO and super() behavior for objects already in use. The design already stores _super_class per instance (line 70), making this global mutation unnecessary. Replace with a per-provider dynamic subclass or wrapper instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/export/mbridge/distillation_provider.py` around
lines 183 - 188, Don't mutate DistillationProvider.__bases__ globally; instead
create a per-instance dynamic subclass combining the existing student_provider
class and DistillationProvider and assign that new type to
student_provider.__class__ so only this instance's MRO changes. Use the existing
per-instance hook (_super_class) pattern: build a subclass whose bases are
(student_provider.__class__, DistillationProvider) (or vice versa as required
for correct MRO), set student_provider.__class__ to that new subclass, then set
student_provider.teacher, student_provider.kd_config and call
student_provider.__post_init__ as before.
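The per-instance dynamic subclass the prompt describes can be sketched as follows; the class and function names are illustrative, not the real mbridge API:

```python
class GPTModelProvider:  # stand-in for any concrete student provider class
    def provide(self):
        return "gpt-model"


class DistillationMixin:  # stand-in for DistillationProvider's added behavior
    def provide(self):
        base = super().provide()  # resolves via the per-instance MRO
        return f"distilled({base})"


def convert_to_distillation_provider(student_provider, teacher, kd_config):
    # Build a subclass combining only THIS instance's class with the mixin,
    # instead of mutating a shared class's __bases__ globally.
    cls = student_provider.__class__
    dynamic_cls = type(f"Distillation{cls.__name__}", (DistillationMixin, cls), {})
    student_provider.__class__ = dynamic_cls
    student_provider.teacher = teacher
    student_provider.kd_config = kd_config
    return student_provider


p = convert_to_distillation_provider(GPTModelProvider(), teacher="t", kd_config={})
```

Because only `p.__class__` changes, other `GPTModelProvider` instances (and any later conversion of a different provider type, e.g. Mamba) keep their original MRO.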
    print_rank_0("📤 Exporting to HuggingFace format...")
    # Use strict=False for test_distill_hf.py which uses small models (2 layers) with fewer layers
    # than the template model (32 layers). This allows partial exports when some tensors are missing.
    # Note: This is NOT needed when running on real compressed puzzletron student models,
    # which have the same number of layers as the template model (some may be skipped via no_op
    # in block_configs, but all layer tensors are still present in the checkpoint).
    bridge.export_ckpt(
        megatron_path=megatron_path,
        hf_path=hf_export_path,
        show_progress=True,
        strict=False,  # Needed for test_distill_hf.py small models; not needed for real compressed models
    )
Don't bake the test-only strict=False into every export.
The inline comment says this relaxation is only needed for test_distill_hf.py, but the helper now suppresses missing/mismatched tensor errors for production exports too. Please make strict a caller-controlled argument defaulting to True, and pass False only from the small-model test path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py` around
lines 63 - 74, The export currently hardcodes strict=False when calling
bridge.export_ckpt in export_mbridge_to_hf.py which disables missing-tensor
checks for all exports; change the export_mbridge_to_hf entrypoint (e.g., the
function that wraps this logic) to accept a strict: bool parameter defaulting to
True and use that parameter in the bridge.export_ckpt call (replace strict=False
with strict=strict); update the test path (test_distill_hf.py or its caller) to
call this function with strict=False while leaving production callers to use the
default True.
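A minimal sketch of threading the flag through; the bridge object here is a stub, not the real mbridge API:

```python
def export_mbridge_to_hf(bridge, megatron_path, hf_path, strict: bool = True):
    """Export a Megatron checkpoint to HF format; strict=True by default."""
    return bridge.export_ckpt(
        megatron_path=megatron_path,
        hf_path=hf_path,
        show_progress=True,
        strict=strict,  # production keeps the safe default; tests pass strict=False
    )


class FakeBridge:  # stub that just records the kwargs it receives
    def export_ckpt(self, **kwargs):
        return kwargs


seen = export_mbridge_to_hf(FakeBridge(), "mcore/", "hf/")
```

Production callers get missing-tensor checks by default, while the small-model test path opts out explicitly with `strict=False`.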
    model = AutoModelForCausalLM.from_pretrained(
        str(hf_export_dir),
        local_files_only=True,
        trust_remote_code=True,
    )
🧩 Analysis chain
🏁 Scripts executed during verification (repository: NVIDIA/Model-Optimizer; output lengths omitted):

    # Inspect the test and the example export path
    sed -n '1,200p' tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py
    find tests/examples/puzzletron/mbridge_distillation -name "*.py" -type f
    find . -path ./tests -prune -o -name "distill_hf.py" -type f -print
    grep -n "export_to_hf_and_copy_config\|def export" ./examples/puzzletron/mbridge_distillation/distill_hf.py
    find . -name "export_mbridge_to_hf.py" -type f
    head -100 ./modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py
Remove trust_remote_code=True from the model load.
The exported model is a standard Llama architecture loaded locally with local_files_only=True. This hardcoded parameter violates coding guidelines requiring that trust_remote_code default to False with caller opt-in. Removing it enforces the safe default for HuggingFace model loading.
Proposed fix:

      model = AutoModelForCausalLM.from_pretrained(
          str(hf_export_dir),
          local_files_only=True,
    -     trust_remote_code=True,
      )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py` around
lines 87 - 91, The AutoModelForCausalLM.from_pretrained call is passing
trust_remote_code=True; remove that argument so the call uses the safe default
(trust_remote_code=False) when loading the exported Llama model locally (keep
local_files_only=True and the hf_export_dir argument intact); update the
invocation of AutoModelForCausalLM.from_pretrained (the variable "model" and the
"hf_export_dir" usage) to omit trust_remote_code.
    def _prepare_student_and_teacher_models(project_root_path: Path, tmp_path: Path) -> tuple[str, str]:
        """Prepare student and teacher models for distillation.

        Creates Llama models programmatically, converts them to heterogeneous format (AnyModel),
        and returns the paths to the converted checkpoints.

        Args:
            project_root_path: Path to the project root directory
            tmp_path: Temporary directory for test artifacts

        Returns:
            Tuple of (student_hf_path, teacher_hf_path) as strings
🧩 Analysis chain
🏁 Scripts executed during verification (repository: NVIDIA/Model-Optimizer; output lengths omitted):

    # Find and examine the helper around the reported lines
    find . -name "test_distill_hf.py" -path "*/puzzletron/mbridge_distillation/*"
    sed -n '100,165p' tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py
Fix the helper's return type annotation.
The function signature declares tuple[str, str] but line 160 returns Path objects (created via tmp_path / "student_anymodel" and tmp_path / "teacher_anymodel"). Change the return type to tuple[Path, Path] and update the docstring accordingly.
Proposed fix
-def _prepare_student_and_teacher_models(project_root_path: Path, tmp_path: Path) -> tuple[str, str]:
+def _prepare_student_and_teacher_models(
+ project_root_path: Path, tmp_path: Path
+) -> tuple[Path, Path]:
"""Prepare student and teacher models for distillation.
Creates Llama models programmatically, converts them to heterogeneous format (AnyModel),
and returns the paths to the converted checkpoints.
Args:
project_root_path: Path to the project root directory
tmp_path: Temporary directory for test artifacts
Returns:
- Tuple of (student_hf_path, teacher_hf_path) as strings
+ Tuple of (student_hf_path, teacher_hf_path) as paths
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/examples/puzzletron/mbridge_distillation/test_distill_hf.py` around
lines 100 - 111, The function _prepare_student_and_teacher_models currently
annotates its return type as tuple[str, str] but actually returns Path objects
(e.g., tmp_path / "student_anymodel" and tmp_path / "teacher_anymodel"); change
the signature to return tuple[Path, Path], update the docstring return
description to say it returns Tuple of (student_hf_path, teacher_hf_path) as
Path objects, and ensure any type hints/imports reference pathlib.Path if
needed.
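The underlying point is that `pathlib.Path.__truediv__` returns a `Path`, not a `str`, so the annotation should follow suit. A minimal sketch of the corrected shape (the directory names mirror the review, but the helper body here is illustrative, not the test's actual code):

```python
from pathlib import Path


def prepare_paths(tmp_path: Path) -> tuple[Path, Path]:
    """Illustrative helper: the / operator on a Path yields a Path, not a str."""
    student_hf_path = tmp_path / "student_anymodel"
    teacher_hf_path = tmp_path / "teacher_anymodel"
    return student_hf_path, teacher_hf_path


student, teacher = prepare_paths(Path("/tmp/pytest-run"))
# Both elements are pathlib.Path objects, so tuple[Path, Path] is the accurate annotation.
assert isinstance(student, Path) and isinstance(teacher, Path)
```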
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
3a15188 to
24ba700
Compare
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
🧹 Nitpick comments (1)
examples/puzzletron/mbridge_distillation/distill_hf.py (1)
156-163: Consider making `--hf_model` conditionally required based on export usage.

The `--hf_model` argument is marked as `required=True`, but it's only used when `--hf_export_path` is provided (lines 290-311). Users who don't want to export to HuggingFace format are still forced to provide this argument. Consider either:
- Making it optional with a validation check when `--hf_export_path` is provided
- Or documenting that it's required for future use cases
♻️ Proposed fix to make --hf_model conditionally required
```diff
 parser.add_argument(
     "--hf_model",
     "--hf-model",
     type=str,
-    required=True,
+    default=None,
     help="HuggingFace model ID to use as template for export (e.g., meta-llama/Llama-3.1-8B-Instruct). "
     "Should match the base architecture of the student model.",
 )
```

Then add validation after argument parsing:

```diff
 # Sanity checks
 if not args.use_mock_data and not args.data_paths:
     raise ValueError("Must provide either --data_paths or set --use_mock_data.")
+if args.hf_export_path and not args.hf_model:
+    raise ValueError("--hf_model is required when --hf_export_path is provided.")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/puzzletron/mbridge_distillation/distill_hf.py` around lines 156 - 163, The hf_model argument is currently required unconditionally by parser.add_argument("--hf_model") but is only needed when exporting to HuggingFace; change parser.add_argument to make hf_model optional (remove required=True) and add a post-parse validation in the main function where args are checked (the same place that validates use_mock_data / data_paths) to raise a clear error if args.hf_export_path is set but args.hf_model is empty (e.g., if args.hf_export_path and not args.hf_model: raise ValueError("--hf_model is required when --hf_export_path is provided.")).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/puzzletron/mbridge_distillation/distill_hf.py`:
- Around line 156-163: The hf_model argument is currently required
unconditionally by parser.add_argument("--hf_model") but is only needed when
exporting to HuggingFace; change parser.add_argument to make hf_model optional
(remove required=True) and add a post-parse validation in the main function
where args are checked (the same place that validates use_mock_data /
data_paths) to raise a clear error if args.hf_export_path is set but
args.hf_model is empty (e.g., if args.hf_export_path and not args.hf_model:
raise ValueError("--hf_model is required when --hf_export_path is provided.")).
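The conditional-requirement pattern suggested above can be sketched standalone. The option names and error message mirror the review; the parser setup below is a minimal illustration, not the PR's actual `distill_hf.py`:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Optional by default; only needed when exporting to HuggingFace format.
    parser.add_argument("--hf_model", "--hf-model", type=str, default=None)
    parser.add_argument("--hf_export_path", type=str, default=None)
    return parser


def validate(args: argparse.Namespace) -> None:
    # Enforce the cross-argument dependency after parsing,
    # alongside the script's other sanity checks.
    if args.hf_export_path and not args.hf_model:
        raise ValueError("--hf_model is required when --hf_export_path is provided.")


parser = build_parser()

# No export requested: --hf_model may be omitted.
validate(parser.parse_args([]))

# Export requested without a template model: rejected.
try:
    validate(parser.parse_args(["--hf_export_path", "/tmp/export"]))
except ValueError as e:
    print(e)  # --hf_model is required when --hf_export_path is provided.
```

argparse has no built-in "required if another flag is set" mechanism, so a post-parse check like this is the idiomatic way to express the dependency.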
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4635cdc3-c666-4611-9336-b0a3ea4fe62b
📒 Files selected for processing (1)
examples/puzzletron/mbridge_distillation/distill_hf.py
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
🧹 Nitpick comments (1)
examples/puzzletron/mbridge_distillation/distill_hf.py (1)
156-163: Consider making `--hf_model` conditionally required.

`--hf_model` is marked `required=True`, but HuggingFace export only runs when `--hf_export_path` is provided (line 290). This forces users to supply `--hf_model` even when they don't intend to export.

♻️ Suggested fix: validate conditionally in argument sanity checks

```diff
 parser.add_argument(
     "--hf_model",
     "--hf-model",
     type=str,
-    required=True,
+    default=None,
     help="HuggingFace model ID to use as template for export (e.g., meta-llama/Llama-3.1-8B-Instruct). "
     "Should match the base architecture of the student model.",
 )
```

Then add validation in the sanity checks section around line 167:

```python
if args.hf_export_path and not args.hf_model:
    raise ValueError("--hf_model is required when --hf_export_path is provided.")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/puzzletron/mbridge_distillation/distill_hf.py` around lines 156 - 163, Remove required=True from the parser.add_argument call for "--hf_model" so the flag is optional, and add a conditional validation in the argument sanity-checks (where args is validated) to raise an error when hf_export_path is provided but args.hf_model is missing (e.g., check if args.hf_export_path and not args.hf_model then raise ValueError indicating "--hf_model is required when --hf_export_path is provided"); this ensures parser.add_argument("--hf_model", ...) is optional while the export pathway still enforces the requirement.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/puzzletron/mbridge_distillation/distill_hf.py`:
- Around line 156-163: Remove required=True from the parser.add_argument call
for "--hf_model" so the flag is optional, and add a conditional validation in
the argument sanity-checks (where args is validated) to raise an error when
hf_export_path is provided but args.hf_model is missing (e.g., check if
args.hf_export_path and not args.hf_model then raise ValueError indicating
"--hf_model is required when --hf_export_path is provided"); this ensures
parser.add_argument("--hf_model", ...) is optional while the export pathway
still enforces the requirement.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8a205585-a2b7-42fc-813a-e6cffaa9c187
📒 Files selected for processing (2)
examples/puzzletron/mbridge_distillation/distill_hf.py
modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/puzzletron/export/mbridge/export_mbridge_to_hf.py
What does this PR do?
Merge anymodel mbridge distillation
Summary by CodeRabbit
New Features
Tests
Documentation
Chores