
Dkorzekwa/anymodel subblock stats nodecilm #1102

Merged
danielkorzekwa merged 125 commits into feature/puzzletron from dkorzekwa/anymodel_subblock_stats_nodecilm on Mar 24, 2026

Conversation

@danielkorzekwa
Contributor

@danielkorzekwa danielkorzekwa commented Mar 23, 2026

What does this PR do?

Refactor subblock stats to stop using DeciLM-specific code and use anymodel instead.

Summary by CodeRabbit

  • Refactor

    • Restructured internal estimation logic for model memory and parameters to support broader model architectures.
    • Updated model initialization utilities.
  • Tests

    • Updated baseline metrics for validation tests.

- Add converter, model_descriptor, puzzformer, and llama model support
- Selective merge of anymodel functionality

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
…s merged)

@danielkorzekwa danielkorzekwa requested a review from a team as a code owner March 23, 2026 17:22
@coderabbitai
Contributor

coderabbitai bot commented Mar 23, 2026

📝 Walkthrough


The changes remove the DeciLMMoe mixture-of-experts implementation and refactor the parameter and memory calculation system from closed-form formulas to a descriptor-based meta-model construction approach. Supporting utilities are added for model class resolution and initialization. Test expectations are updated accordingly.
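The descriptor-based approach builds the model on PyTorch's meta device, so parameters can be counted without allocating real weights. A generic sketch of the idea in plain PyTorch — not the PR's actual EmptyInitOnDevice implementation, and with illustrative layer widths:

```python
import torch
import torch.nn as nn

def count_params_on_meta(build_fn) -> int:
    # Construct the model under the meta device: tensors carry shape and
    # dtype but no storage, so even very large models are "built" instantly.
    with torch.device("meta"):
        model = build_fn()
    return sum(p.numel() for p in model.parameters())

# Illustrative FFN subblock with Llama-8B-like widths (assumed sizes).
ffn = lambda: nn.Sequential(
    nn.Linear(4096, 11008, bias=False),   # up projection
    nn.Linear(11008, 4096, bias=False),   # down projection
)
print(count_params_on_meta(ffn))  # 2 * 4096 * 11008 = 90177536
```

Counting via a meta-instantiated module is what lets this replace the closed-form formulas: the count automatically tracks whatever layers the architecture actually instantiates.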

Changes

Cohort / File(s) Summary
MoE Removal
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/modeling_decilm.py
Removed DeciLMMoe class implementation, including router initialization, top-k selection logic, and forward pass; updated module docstring and imports.
Subblock Stats Refactoring
modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py
Replaced closed-form parameter/memory calculation functions (calculate_attention_params, calculate_mamba_params, calculate_ffn_params, etc.) with descriptor-based meta-model construction. calculate_subblock_params now uses EmptyInitOnDevice and descriptor post-init hooks to compute parameters; memory functions updated to call new calculate_subblock_params.
Subblock Stats Threading
modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py
Added model_config: PretrainedConfig and descriptor: Type[ModelDescriptor] parameters to calculate_subblock_stats and calculate_subblock_stats_for_puzzle_dir; threaded these arguments through to downstream calculate_subblock_memory and parameter calculation calls.
Model Initialization Utilities
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
Added init_model_from_config function for creating meta models via descriptor approach, and _get_model_class_from_config to resolve model classes from config.architectures[0] with fallback to AutoModelForCausalLM. Expanded imports to include AutoModelForCausalLM and transformers.
Test Expectations
tests/gpu/torch/puzzletron/test_puzzletron.py
Updated EXPECTED_TEACHER_MEMORY_MIB and EXPECTED_TEACHER_NUM_PARAMS hardcoded values for multiple models (Llama-3.1-8B, Llama-3.2-3B, Mistral-Small-24B, Nemotron-3, Qwen variants) to reflect refactored calculation results.
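The class-resolution fallback described for _get_model_class_from_config can be sketched as follows. This is an illustrative reconstruction from the summary above, not the PR's exact code, and the public helper name is hypothetical:

```python
import warnings

import transformers
from transformers import AutoModelForCausalLM, PretrainedConfig

def get_model_class_from_config(config: PretrainedConfig):
    # config.architectures typically holds class names such as
    # ["LlamaForCausalLM"]; resolve the first entry from the transformers
    # namespace, falling back to AutoModelForCausalLM if it is not exported.
    architectures = getattr(config, "architectures", None) or []
    if architectures:
        model_class = getattr(transformers, architectures[0], None)
        if model_class is not None:
            return model_class
        warnings.warn(
            f"{architectures[0]} not found in transformers, "
            "falling back to AutoModelForCausalLM"
        )
    return AutoModelForCausalLM
```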

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 38.89%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check — ❓ Inconclusive: the title 'Dkorzekwa/anymodel subblock stats nodecilm' is vague, using branch names and acronyms without conveying the purpose of the changes. Resolution: revise the title to describe the main objective, e.g. 'Refactor subblock stats to use anymodel instead of DeciLM-specific code'.

✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Security Anti-Patterns — ✅ Passed: security review confirms no anti-patterns; trust_remote_code is a configurable parameter and no hardcoded unsafe patterns were found.


Comment on lines +132 to +133
mprint(
    f"Warning: {model_class_name} not found in transformers, "
Collaborator


Suggested change:
-mprint(
-    f"Warning: {model_class_name} not found in transformers, "
+warnings.warn(
+    f"{model_class_name} not found in transformers, "

Contributor Author


Let's not do this here; instead, let's unify logging across the board using modelopt.torch.utils.logging.

Collaborator


Is sub-block stats generic enough to move to modelopt.torch.utils, just like runtime stats, so we could reuse it in other places?

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py (1)

348-359: ⚠️ Potential issue | 🟠 Major

model_hidden_sizes sweeps are now internally inconsistent.

Lines 351-352 always forward the teacher model_config, but subblock params now come from calculate_subblock_params(model_config, ...) rather than from n_embd. Combined with Line 359’s n_embd=model_hidden_size, any non-teacher sweep gets cache/non-block numbers for the override and subblock params/memory for the teacher width.

🛠️ Minimal safe guard
     for batch_size, (
         weights_dtype,
         activations_dtype,
         kv_cache_dtype,
     ), model_hidden_size in product(batch_sizes, data_types, model_hidden_sizes):
+        if model_hidden_size != lm_config.hidden_size:
+            raise NotImplementedError(
+                "model_hidden_sizes overrides are not supported with config-based subblock param counting yet."
+            )
         if num_active_tokens_override is not None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py` around lines
348 - 359, The call to calculate_subblock_stats is passing model_hidden_size via
n_embd while the code derives subblock widths from
calculate_subblock_params(model_config, ...), causing mismatched widths for
teacher vs swept models; compute the effective embedding width from the subblock
params returned by calculate_subblock_params (e.g., use the subblock params'
n_embd/hidden size) and pass that value as n_embd to calculate_subblock_stats
instead of model_hidden_size, keeping the teacher model_config/teacher_dir usage
unchanged so the subblock stats use the same width source as
calculate_subblock_params.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0b698ceb-94cc-4adf-bf77-79a62e79882a

📥 Commits

Reviewing files that changed from the base of the PR and between 837e14f and 4458fb9.

📒 Files selected for processing (5)
  • modelopt/torch/puzzletron/decilm/deci_lm_hf_code/modeling_decilm.py
  • modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py
  • modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py
  • modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
  • tests/gpu/torch/puzzletron/test_puzzletron.py

Comment on lines +114 to +119
if isinstance(layer_config, FFNConfig):
    block_config = layer_config.to_blockconfig()
elif isinstance(layer_config, AttentionConfig):
    block_config = layer_config.to_blockconfig()
else:
    block_config = layer_config
Contributor


⚠️ Potential issue | 🟠 Major

to_blockconfig() drops parallel_blocks context.

Line 114 and Line 116 collapse the measured sublayer into a standalone BlockConfig. That cannot represent entries coming from BlockConfig.parallel_blocks: the replacement-library loader already preserves an inner_block_idx for those cases, but this API has nowhere to carry it, so meta-init instantiates the wrong layer topology and miscounts params/memory for parallel-block models.

If we want to keep the meta-init approach here, this helper needs the parent BlockConfig plus the selected inner_block_idx, not just a bare FFNConfig/AttentionConfig.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py`
around lines 114 - 119, The current branch replaces measured sublayers by
calling to_blockconfig() on FFNConfig/AttentionConfig which loses
parallel_blocks context and the inner_block_idx, causing meta-init to
instantiate incorrect topology; change the logic in the block that handles
layer_config so that when layer_config originates from a
BlockConfig.parallel_blocks entry you retain or reconstruct the parent
BlockConfig plus the selected inner_block_idx (e.g. accept/propagate an
inner_block_idx parameter or use the original BlockConfig object instead of
calling to_blockconfig()), or modify
FFNConfig.to_blockconfig()/AttentionConfig.to_blockconfig() to accept and
preserve inner_block_idx/parent-parallel context so the instantiated
block_config still reflects the parallel_blocks entry. Ensure refs to
layer_config, FFNConfig, AttentionConfig, to_blockconfig(),
BlockConfig.parallel_blocks, and inner_block_idx are used to locate and update
the code paths.

Base automatically changed from dkorzekwa/anymodel_subblock_stats to feature/puzzletron March 24, 2026 07:41
@danielkorzekwa danielkorzekwa requested review from a team as code owners March 24, 2026 07:41
…ats_nodecilm

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py (1)

138-150: ⚠️ Potential issue | 🔴 Critical

CRITICAL: Default trust_remote_code to False.

The function defaults trust_remote_code=True (line 141), which violates repository security guidelines. Per SECURITY.md and coding guidelines, this flag must default to False to prevent callers from silently opting into executing checkpoint-supplied Python during model construction. Callers should explicitly opt-in at audited call sites.

The call site in calc_subblock_params_and_memory.py (line 167) already passes trust_remote_code=descriptor.requires_trust_remote_code(), so changing the default here should not break existing functionality.

🔐 Proposed fix
 def init_model_from_config(
     config: PretrainedConfig,
     *,
-    trust_remote_code: bool = True,
+    trust_remote_code: bool = False,
     **kwargs,
 ) -> PreTrainedModel:

As per coding guidelines: "Do not hardcode trust_remote_code=True when loading Hugging Face Transformers models. Let the caller decide via a parameter; default to False."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 138 -
150, The init_model_from_config function currently defaults trust_remote_code to
True which violates security guidelines; change the default to False in the
function signature (init_model_from_config) and ensure callers that need remote
code explicitly pass trust_remote_code=True (e.g.,
calc_subblock_params_and_memory uses descriptor.requires_trust_remote_code()).
Update only the default value for the trust_remote_code parameter and keep the
existing conditional logic using model_class.from_config and
model_class._from_config unchanged.
🧹 Nitpick comments (2)
modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py (1)

322-331: Unused parameter experts_dtype.

The experts_dtype parameter (line 327) is added to the function signature but is never used in the function body. The TODO comment on line 329 suggests this was intended for separating expert weights from the rest, but the current implementation doesn't differentiate.

Consider either:

  1. Removing the unused parameter until the TODO is addressed
  2. Implementing the expert weight separation logic
Option 1: Remove unused parameter
 def calculate_ffn_memory(
     ffn_config: FFNConfig,
     model_config: PretrainedConfig,
     descriptor: Type[ModelDescriptor],
     weights_dtype: torch.dtype | str,
-    experts_dtype: torch.dtype | str | None = None,
 ) -> float:
-    # TODO: How to separate between expert weights and the rest for any model (same as puzzletron).
     num_params = calculate_subblock_params(model_config, ffn_config, descriptor)
     return num_params * sizeof_dtype(weights_dtype) / 2**20
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py`
around lines 322 - 331, The function calculate_ffn_memory has an unused
parameter experts_dtype; either remove it from the signature and from all
callers (and update any type hints referencing FFNConfig/descriptor usage) or
implement expert vs non-expert separation: call calculate_subblock_params twice
(or modify it) to get expert_params and non_expert_params, compute memory as
expert_params * sizeof_dtype(experts_dtype) + non_expert_params *
sizeof_dtype(weights_dtype) and return the total in MiB (divide by 2**20); use
the existing symbols calculate_subblock_params, sizeof_dtype,
calculate_ffn_memory, FFNConfig and descriptor to locate where to change.
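The MiB conversion in calculate_ffn_memory is simply parameter count × bytes per element ÷ 2^20. A self-contained sketch of that arithmetic — the sizeof_dtype table here is an assumed stand-in for the repo's helper of the same name:

```python
def sizeof_dtype(dtype: str) -> int:
    # Bytes per element for common dtypes (illustrative subset).
    return {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}[dtype]

def weights_memory_mib(num_params: int, weights_dtype: str) -> float:
    # Weight memory in MiB: element count times element size, over 2**20.
    return num_params * sizeof_dtype(weights_dtype) / 2**20

# A 90,177,536-parameter FFN subblock stored in bfloat16:
print(weights_memory_mib(90_177_536, "bfloat16"))  # 172.0
```

The expert-dtype variant the reviewer suggests would just compute this twice, once per parameter group, and sum the results.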
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py (1)

125-135: Consider extracting shared helper to avoid duplication.

This _get_model_class_from_config implementation is nearly identical to the one in modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py (lines 99-112). The only difference is using warnings.warn here vs mprint there.

Consider extracting this helper to a shared location to avoid maintaining duplicate logic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 125 -
135, The _get_model_class_from_config function duplicates logic present in
sharded_checkpoint_utils (same name/behavior but different warning call);
extract this logic into a single shared helper (e.g.,
get_model_class_from_config) placed in a utilities module and have both
checkpoint_utils_hf._get_model_class_from_config and sharded_checkpoint_utils
call that helper; unify the warning/logging behavior by accepting an optional
warn_func parameter (or default to warnings.warn) so callers can pass mprint
when needed, and update both callers to use the new helper to remove
duplication.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4fd8f067-0fd7-4794-af5d-e04e07da9d54

📥 Commits

Reviewing files that changed from the base of the PR and between 4458fb9 and 684ed36.

📒 Files selected for processing (2)
  • modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py
  • modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py

@codecov

codecov bot commented Mar 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.10%. Comparing base (0708ca2) to head (684ed36).
⚠️ Report is 1 commit behind head on feature/puzzletron.

Additional details and impacted files
@@                  Coverage Diff                   @@
##           feature/puzzletron    #1102      +/-   ##
======================================================
- Coverage               72.12%   72.10%   -0.03%     
======================================================
  Files                     209      209              
  Lines                   23628    23628              
======================================================
- Hits                    17042    17036       -6     
- Misses                   6586     6592       +6     


@danielkorzekwa danielkorzekwa merged commit 3193f30 into feature/puzzletron Mar 24, 2026
28 checks passed
@danielkorzekwa danielkorzekwa deleted the dkorzekwa/anymodel_subblock_stats_nodecilm branch March 24, 2026 11:17
