Merged

36 commits
e82164f
Add anymodel directories to feature/puzzletron
danielkorzekwa Mar 4, 2026
2099df3
Make any_model conversion working.
danielkorzekwa Mar 5, 2026
eb5cf8a
Update child_init.py with anymodel version
danielkorzekwa Mar 5, 2026
c9de41c
fix attention pruning
danielkorzekwa Mar 5, 2026
3c1bc1f
Add trust_remote_code to load_model_config (default to false)
danielkorzekwa Mar 5, 2026
8357136
Make activation scoring working
danielkorzekwa Mar 5, 2026
6cc2194
Comment out all tested models aside from llama_3_1_8b_instruct
danielkorzekwa Mar 5, 2026
ee4e1e3
Delete not needed decilm test
danielkorzekwa Mar 5, 2026
449b523
Fix broken tests
danielkorzekwa Mar 5, 2026
fb27bba
Update puzzletron_nas_pluging to any_model version
danielkorzekwa Mar 5, 2026
b350f82
Correct test resources used by tests.
danielkorzekwa Mar 5, 2026
fafe5a3
Disable puzzletron tests (will be enabled after all any_model logic i…
danielkorzekwa Mar 5, 2026
e988248
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
c717852
Comment out not implemented models.
danielkorzekwa Mar 6, 2026
030f126
format python docs
danielkorzekwa Mar 6, 2026
8dcdfbf
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
70df0df
Use trust_remote_code in force_cache_dynamic_modules()
danielkorzekwa Mar 6, 2026
bb56662
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
ecd953e
Fix anymodel pruning
danielkorzekwa Mar 6, 2026
ee8f538
Fix build docs issue.
danielkorzekwa Mar 6, 2026
c9b76a1
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 6, 2026
6e3af61
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 6, 2026
47414d5
Clarify readme and avoid reusing the same reference in llama_converter.
danielkorzekwa Mar 9, 2026
a8305d8
Fix tied-embedding handling before writing the safetensors index.
danielkorzekwa Mar 9, 2026
68421a5
Fix NaN ranking: NaNs were selected as “best” experts by default.
danielkorzekwa Mar 9, 2026
d6b8028
Code clean up.
danielkorzekwa Mar 9, 2026
ecd2341
Code clean up.
danielkorzekwa Mar 10, 2026
f9d845d
code clean up
danielkorzekwa Mar 10, 2026
d171b01
Merge branch 'dkorzekwa/anymodel_core' into dkorzekwa/anymodel_activa…
danielkorzekwa Mar 10, 2026
722da90
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 10, 2026
934ab2f
code clean up
danielkorzekwa Mar 10, 2026
176a435
Fix a broken test_puzzletron test on 2 gpus.
danielkorzekwa Mar 10, 2026
02e2c9b
Merge branch 'dkorzekwa/anymodel_activation_scoring' into dkorzekwa/a…
danielkorzekwa Mar 10, 2026
0fc10a1
Merge branch 'feature/puzzletron' into dkorzekwa/anymodel_pruning
danielkorzekwa Mar 12, 2026
d9a8647
Uncomment pruning step.
danielkorzekwa Mar 12, 2026
8398294
Fix docs building issue ( tox -e build-docs)
danielkorzekwa Mar 12, 2026
8 changes: 4 additions & 4 deletions modelopt/torch/puzzletron/puzzletron.py
@@ -57,10 +57,10 @@ def puzzletron(
# Step 1: score_pruning_activations (distributed processing)
score_pruning_activations.launch_score_activations(hydra_cfg)

# # Step 2: pruning_ckpts (single process)
# if dist.is_master():
# pruning_ckpts.launch_prune_ckpt(hydra_cfg)
# dist.barrier()
# Step 2: pruning_ckpts (single process)
if dist.is_master():
pruning_ckpts.launch_prune_ckpt(hydra_cfg)
dist.barrier()

# # Step 4: build_library_and_stats (single process)
# if dist.is_master():
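The Step 2 lines re-enabled above follow a common master-only + barrier idiom: one rank does the single-process work while the others wait. A minimal sketch of that idiom using raw torch.distributed (run_single_process_step is a hypothetical helper, not part of this PR; the diff itself uses puzzletron's own dist wrapper):

```python
import torch.distributed as dist

def run_single_process_step(step_fn, *args):
    """Run step_fn on the master rank only, then synchronize all ranks."""
    # Treat a non-initialized process group as single-process execution.
    is_master = (not dist.is_initialized()) or dist.get_rank() == 0
    if is_master:
        step_fn(*args)
    if dist.is_initialized():
        dist.barrier()  # other ranks block here until the master finishes
```

Without the barrier, non-master ranks could race ahead and read checkpoints the master has not finished writing.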
modelopt/torch/puzzletron/tools/bypassed_training/init_child_from_parent.py
@@ -14,15 +14,22 @@
# limitations under the License.
# mypy: ignore-errors

"""TODO Add description"""
"""Initialize child models from parent models using AnyModel approach with deci_x_patcher."""

import json
import time
from pathlib import Path
from typing import Optional

import torch
import yaml
from transformers import AutoModelForCausalLM

from modelopt.torch.puzzletron.decilm.deci_lm_hf_code.modeling_decilm import DeciLMForCausalLM
from modelopt.torch.puzzletron.anymodel.model_descriptor import (
ModelDescriptor,
ModelDescriptorFactory,
)
from modelopt.torch.puzzletron.anymodel.puzzformer import deci_x_patcher
from modelopt.torch.puzzletron.tools.bypassed_training.child_init import (
GQAInitMode,
HiddenSizeInitMode,
@@ -31,85 +38,37 @@
create_child_state_dict,
update_model_config,
)
from modelopt.torch.puzzletron.tools.checkpoint_utils import (
copy_tokenizer,
load_model_config,
load_state_dict,
)
from modelopt.torch.puzzletron.tools.checkpoint_utils import copy_tokenizer, load_state_dict
from modelopt.torch.puzzletron.tools.checkpoint_utils_hf import (
_save_checkpoint,
copy_deci_lm_hf_code,
load_model_config,
)
from modelopt.torch.puzzletron.tools.logger import mprint

"""

Usage example - remove all/some routed experts:
===============================================

PARENT_DIR=".../meta-llama/Llama-4-Scout-17B-16E-Instruct--deci-hf"

MLP_INIT_MODE="ConcatExpertsIntoDenseFFN"

## remove all routed experts, turn the shared expert into a dense FFN
# OUTPUT_DIR="/.../micro_scout/Scout-remove-routed-experts"
# MODEL_CONFIG_OVERRIDES_JSON='
# {
# "ffn": [
# {
# "moe": null,
# "intermediate_size": 14336,
# "gated": true,
# "hidden_act": "silu"
# }
# ]
# }
# '

## concat the shared expert with one routed expert into a dense FFN
OUTPUT_DIR=".../scratch/micro_scout/Scout-ConcatExpertsIntoDenseFFN-concat-shared-and-3-routed"
MODEL_CONFIG_OVERRIDES_JSON='
{
"ffn": [
{
"moe": null,
"intermediate_size": 14336,
"gated": true,
"hidden_act": "silu"
}
]
}
'

echo ""
echo "MODEL_CONFIG_OVERRIDES_JSON:"
echo "${MODEL_CONFIG_OVERRIDES_JSON}"

python -m modelopt.torch.puzzletron.tools.bypassed_training.init_child_from_parent \
--parent_checkpoint_dir="$PARENT_DIR" \
--model_config_overrides_json="$MODEL_CONFIG_OVERRIDES_JSON" \
--output_checkpoint_dir="$OUTPUT_DIR" \
--mlp_init_mode="$MLP_INIT_MODE" \
--mlp_init_config_yaml="$MLP_INIT_CONFIG_YAML"
"""
from modelopt.torch.puzzletron.tools.sharded_checkpoint_utils import _get_model_class_from_config


def init_child_from_parent(
descriptor: ModelDescriptor,
pruning_mixin,
parent_checkpoint_dir: str,
model_config_overrides_json: str,
model_config_overrides_dict: dict | str,
output_checkpoint_dir: str,
gqa_init_mode: GQAInitMode,
mlp_init_mode: MlpInitMode,
mlp_init_config_yaml: str | None,
mlp_init_config_yaml: Optional[str],
linear_init_mode: LinearInitMode,
hidden_size_init_mode: HiddenSizeInitMode | None = None,
channel_importance_path: str | None = None,
max_workers: int | None = None, # Auto-calculate optimal workers if None
max_layer_workers: int | None = None, # Auto-calculate optimal workers if None
hidden_size_init_mode: Optional[HiddenSizeInitMode] = None,
channel_importance_path: Optional[str] = None,
max_workers: Optional[int] = None, # Auto-calculate optimal workers if None
max_layer_workers: Optional[int] = None, # Auto-calculate optimal workers if None
) -> None:
"""Init child models from parent models in the style of bypass training,
"""
Init child models from parent models in the style of bypass training,
but without having to run the entire bypass pipeline.

Uses AnyModel approach with deci_x_patcher for heterogeneous layer configurations.

I/O Optimization Parameters:
- max_workers: Number of threads for parallel file I/O (default: auto-calculate min(CPU count, num files))
- max_layer_workers: Number of threads for parallel layer processing (default: auto-calculate min(CPU count, num layers))
@@ -123,16 +82,16 @@ def init_child_from_parent(
"We do not support random init of any subblock in this script to avoid initializing the student model"
)

descriptor = ModelDescriptorFactory.get(descriptor)

copy_tokenizer(parent_checkpoint_dir, output_checkpoint_dir)

parent_model_config = load_model_config(parent_checkpoint_dir)
parent_state_dict = load_state_dict(parent_checkpoint_dir)

# Parse the model config overrides
if isinstance(model_config_overrides_json, str):
model_config_overrides_dict = json.loads(model_config_overrides_json)
else:
model_config_overrides_dict = model_config_overrides_json
# Parse JSON if string
if isinstance(model_config_overrides_dict, str):
model_config_overrides_dict = json.loads(model_config_overrides_dict)

Comment on lines +92 to 95
⚠️ Potential issue | 🟡 Minor

Reject non-object JSON overrides early.

If the string decodes to anything other than a mapping, Line 100 fails later with a generic '... has no attribute items'. Validate the parsed type here and raise a clearer error at the boundary.

Proposed fix
     if isinstance(model_config_overrides_dict, str):
         model_config_overrides_dict = json.loads(model_config_overrides_dict)
+    if not isinstance(model_config_overrides_dict, dict):
+        raise TypeError(
+            "model_config_overrides_dict must be a dict or a JSON object string"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/bypassed_training/init_child_from_parent.py`
around lines 92 - 95, When parsing model_config_overrides_dict (in
init_child_from_parent), ensure the JSON string decodes to a mapping before
proceeding: after json.loads(model_config_overrides_dict) check that the result
is a dict (or collections.abc.Mapping) and if not raise a clear TypeError or
ValueError (e.g. "model_config_overrides_dict must decode to an object/dict, got
<type>") so downstream use of .items() won't raise an opaque attribute error.

# Separate global config overrides from block-level overrides
global_config_overrides = {}
@@ -146,7 +105,7 @@ def init_child_from_parent(

# Load child model config with global overrides
child_model_config = load_model_config(
checkpoint_dir=parent_checkpoint_dir,
parent_checkpoint_dir,
model_config_overrides=global_config_overrides,
ignore_unexpected_config_keys=True,
)
@@ -159,19 +118,32 @@
)

with torch.device("meta"):
child_model = DeciLMForCausalLM(child_model_config)
# Pass block_configs explicitly so patcher works for VL models where
# decoder layers receive nested config (e.g., text_config) without block_configs
with deci_x_patcher(
model_descriptor=descriptor, block_configs=child_model_config.block_configs
):
model_class = _get_model_class_from_config(child_model_config)
# AutoModelForCausalLM uses from_config(); concrete model classes use _from_config()
if model_class is AutoModelForCausalLM:
child_model = model_class.from_config(child_model_config, trust_remote_code=True)
else:
child_model = model_class._from_config(child_model_config)
Comment on lines +123 to +131
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verification scripts run against NVIDIA/Model-Optimizer:

    # Locate the file and read the function signature and model-construction code
    find . -type f -name "init_child_from_parent.py"
    cat -n modelopt/torch/puzzletron/tools/bypassed_training/init_child_from_parent.py | head -150
    sed -n '126,132p' modelopt/torch/puzzletron/tools/bypassed_training/init_child_from_parent.py

    # Find all callers and existing trust_remote_code usage patterns
    rg "init_child_from_parent\(" --type py -B 2 -A 2
    rg "trust_remote_code" --type py
    cat -n modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py | grep -B 5 -A 5 "trust_remote_code"

The scripts confirm there is no inline comment justifying the hardcoded True, and that the rest of the codebase threads trust_remote_code through as a parameter.

Don't hardcode trust_remote_code=True.

Line 129 turns child-model construction into an RCE boundary for any checkpoint/config that carries custom modeling code. Thread this through as a caller-controlled flag with a safe default of False, and only enable it under a documented exception.

Proposed fix
 def init_child_from_parent(
     descriptor: ModelDescriptor,
     pruning_mixin,
     parent_checkpoint_dir: str,
     model_config_overrides_dict: dict | str,
     output_checkpoint_dir: str,
     gqa_init_mode: GQAInitMode,
     mlp_init_mode: MlpInitMode,
     mlp_init_config_yaml: Optional[str],
     linear_init_mode: LinearInitMode,
     hidden_size_init_mode: Optional[HiddenSizeInitMode] = None,
     channel_importance_path: Optional[str] = None,
     max_workers: Optional[int] = None,  # Auto-calculate optimal workers if None
     max_layer_workers: Optional[int] = None,  # Auto-calculate optimal workers if None
+    trust_remote_code: bool = False,
 ) -> None:
@@
             if model_class is AutoModelForCausalLM:
-                child_model = model_class.from_config(child_model_config, trust_remote_code=True)
+                child_model = model_class.from_config(
+                    child_model_config, trust_remote_code=trust_remote_code
+                )
             else:
                 child_model = model_class._from_config(child_model_config)

Per coding guidelines: "Do not hardcode trust_remote_code=True when loading transformers models. Let the caller decide via a parameter with default value False."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/bypassed_training/init_child_from_parent.py`
around lines 123 - 131, The code currently hardcodes trust_remote_code=True when
constructing child models (see AutoModelForCausalLM branch and child_model
creation), which creates an RCE risk; add a caller-controlled boolean parameter
(e.g., trust_remote_code: bool = False) to the function that contains this logic
(init_child_from_parent or the enclosing function), thread that flag through to
the model creation call, and use it instead of the hardcoded True for
model_class.from_config(..., trust_remote_code=trust_remote_code); leave the
default False and document that callers must explicitly opt in for remote code.
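The suggested threading can be sketched as a small construction helper. A hedged sketch, assuming transformers is installed (build_child_model is a hypothetical name; model_class would come from _get_model_class_from_config as in the diff):

```python
from transformers import AutoModelForCausalLM

def build_child_model(model_class, config, trust_remote_code: bool = False):
    """Instantiate a model from config, opting into remote code only on request."""
    if model_class is AutoModelForCausalLM:
        # The auto class dispatches via from_config() and accepts trust_remote_code
        return model_class.from_config(config, trust_remote_code=trust_remote_code)
    # A concrete model class instantiates directly from its config
    return model_class._from_config(config)
```

Callers that genuinely need custom modeling code pass trust_remote_code=True explicitly; everything else stays on the safe default.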


child_state_dict_with_meta_tensors = child_model.state_dict()

mlp_init_config = (
yaml.safe_load(mlp_init_config_yaml)
if isinstance(mlp_init_config_yaml, str) is None
if isinstance(mlp_init_config_yaml, str)
else mlp_init_config_yaml
)
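The corrected conditional above (the old `isinstance(...) is None` always evaluated to False) implements a parse-if-string, pass-through-otherwise pattern. A standalone sketch, assuming PyYAML (load_yaml_maybe is a hypothetical name):

```python
import yaml

def load_yaml_maybe(value):
    """Parse YAML if given a string; pass dicts (or None) through unchanged."""
    return yaml.safe_load(value) if isinstance(value, str) else value
```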

# Profile create_child_state_dict with automatic layer parallelization
mprint("Starting create_child_state_dict...")
start_time = time.time()
child_state_dict = create_child_state_dict(
pruning_mixin=pruning_mixin,
descriptor=descriptor,
original_state_dict=parent_state_dict,
new_state_dict=child_state_dict_with_meta_tensors,
original_config=parent_model_config,
@@ -182,7 +154,7 @@ def init_child_from_parent(
linear_init_mode=linear_init_mode,
hidden_size_init_mode=hidden_size_init_mode or HiddenSizeInitMode.CopyAsIs,
channel_importance_path=channel_importance_path,
max_layer_workers=max_layer_workers, # Will auto-calculate if None
max_layer_workers=max_layer_workers,
)
create_child_state_dict_time = time.time() - start_time
mprint(f"create_child_state_dict completed in {create_child_state_dict_time:.2f} seconds")
@@ -196,7 +168,8 @@ def init_child_from_parent(
child_model_config,
child_state_dict,
output_checkpoint_dir,
max_workers=max_workers, # Will auto-calculate if None
descriptor,
max_workers=max_workers,
)
save_checkpoint_time = time.time() - start_time
mprint(f"_save_checkpoint completed in {save_checkpoint_time:.2f} seconds")
@@ -207,7 +180,7 @@ def init_child_from_parent(
total_core_time = create_child_state_dict_time + save_checkpoint_time
actual_layer_workers = max_layer_workers if max_layer_workers else "auto"
actual_io_workers = max_workers if max_workers else "auto"
mprint("\n=== PROFILING SUMMARY ===")
mprint(f"\n=== PROFILING SUMMARY ===")
mprint(
f"create_child_state_dict: {create_child_state_dict_time:.2f}s ({create_child_state_dict_time / total_core_time * 100:.1f}%)"
)
@@ -216,4 +189,4 @@
)
mprint(f"Total core processing: {total_core_time:.2f}s")
mprint(f"Optimizations: I/O workers={actual_io_workers}, Layer workers={actual_layer_workers}")
mprint("=========================\n")
mprint(f"=========================\n")
10 changes: 5 additions & 5 deletions modelopt/torch/puzzletron/tools/validate_model.py
@@ -85,12 +85,12 @@ def validate_model(
Args:
args: Configuration object containing the following attributes:

**Model Configuration:**
Model Configuration:
- model_name_or_path (str): Path to model checkpoint or HuggingFace model name. Required unless model is passed directly.
- model_dtype (str or torch.dtype): Model data type (e.g., "torch.bfloat16", torch.float16).
- autocast_dtype (str or torch.dtype): Autocast data type for mixed precision.

**Dataset Configuration:**
Dataset Configuration:
- dataset_path (str): Path to the validation dataset.
- tokenizer_name (str, optional): Tokenizer name/path. Uses model_name_or_path if not specified.
- data_column (str): Column name in dataset containing text data.
@@ -100,7 +100,7 @@ def validate_model(
- source_datasets_to_discard (list[str], optional): List of source datasets to exclude.
- load_dataset_fn (callable, optional): Custom function to load the dataset.

**Data Processing:**
Data Processing:
- micro_batch_size (int): Batch size for evaluation.
- seed (int): Random seed for reproducibility.
- shuffle_seed (int, optional): Seed for shuffling data. Uses seed if None.
@@ -109,11 +109,11 @@ def validate_model(
- fim_rate (float): Fill-in-the-middle rate for code completion tasks.
- fim_spm_rate (float): SPM-based fill-in-the-middle rate.

**Activation Hooks:**
Activation Hooks:
- activations_log_dir (str, optional): Directory to log activation scores. If provided, hooks will be registered to capture activations.
- activation_hooks_kwargs (str or dict, optional): Arguments for activation hooks. If string, comma-separated format: "arg1=val1,arg2=val2".

**Execution Options:**
Execution Options:
- calc_losses_on_cpu (bool): Calculate losses on CPU to avoid OOM. Very slow, not recommended.
- write_results (bool): Write validation results to file.

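The docstring above says activation_hooks_kwargs accepts either a dict or a comma-separated "arg1=val1,arg2=val2" string. A hypothetical parser for that format (parse_hooks_kwargs is an illustrative name, not part of validate_model.py; values stay strings):

```python
def parse_hooks_kwargs(value):
    """Parse 'arg1=val1,arg2=val2' into a dict; pass dicts (or None) through."""
    if value is None or isinstance(value, dict):
        return value
    # split on the first '=' only, so values may themselves contain '='
    return dict(item.split("=", 1) for item in value.split(","))
```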