Dkorzekwa/anymodel subblock stats nodecilm #1102
danielkorzekwa merged 125 commits into feature/puzzletron from
Conversation
- Add converter, model_descriptor, puzzformer, and llama model support
- Selective merge of anymodel functionality
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
…s merged)
📝 Walkthrough
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 2 failed (1 warning, 1 inconclusive)
mprint(
    f"Warning: {model_class_name} not found in transformers, "
Suggested change:
-mprint(
-    f"Warning: {model_class_name} not found in transformers, "
+warnings.warn(
+    f"{model_class_name} not found in transformers, "
Let's not do it here; instead, let's approach unifying logging across the board, using modelopt.torch.utils.logging.py.
Is sub-block stats generic enough to move to modelopt.torch.utils just like runtime stats so we could reuse in other places?
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py (1)
348-359: ⚠️ Potential issue | 🟠 Major
`model_hidden_sizes` sweeps are now internally inconsistent. Lines 351-352 always forward the teacher `model_config`, but subblock params now come from `calculate_subblock_params(model_config, ...)` rather than from `n_embd`. Combined with line 359's `n_embd=model_hidden_size`, any non-teacher sweep gets cache/non-block numbers for the override and subblock params/memory for the teacher width.
🛠️ Minimal safe guard
 for batch_size, (
     weights_dtype,
     activations_dtype,
     kv_cache_dtype,
 ), model_hidden_size in product(batch_sizes, data_types, model_hidden_sizes):
+    if model_hidden_size != lm_config.hidden_size:
+        raise NotImplementedError(
+            "model_hidden_sizes overrides are not supported with config-based subblock param counting yet."
+        )
     if num_active_tokens_override is not None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py` around lines 348 - 359, The call to calculate_subblock_stats is passing model_hidden_size via n_embd while the code derives subblock widths from calculate_subblock_params(model_config, ...), causing mismatched widths for teacher vs swept models; compute the effective embedding width from the subblock params returned by calculate_subblock_params (e.g., use the subblock params' n_embd/hidden size) and pass that value as n_embd to calculate_subblock_stats instead of model_hidden_size, keeping the teacher model_config/teacher_dir usage unchanged so the subblock stats use the same width source as calculate_subblock_params.
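Stripped of the surrounding loop body, the suggested guard behaves like this runnable sketch; `teacher_hidden_size` stands in for `lm_config.hidden_size`, and the loop body is reduced to collecting the grid:

```python
from itertools import product

def sweep_stats(batch_sizes, data_types, model_hidden_sizes, teacher_hidden_size):
    """Iterate the sweep grid, rejecting hidden-size overrides up front."""
    results = []
    for batch_size, dtypes, model_hidden_size in product(
        batch_sizes, data_types, model_hidden_sizes
    ):
        # Config-based subblock param counting only knows the teacher width,
        # so a differing override would silently mix two widths in one row.
        if model_hidden_size != teacher_hidden_size:
            raise NotImplementedError(
                "model_hidden_sizes overrides are not supported with "
                "config-based subblock param counting yet."
            )
        results.append((batch_size, dtypes, model_hidden_size))
    return results
```

This fails fast instead of producing stats whose cache numbers and param counts come from different hidden sizes.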
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py`:
- Around line 114-119: The current branch replaces measured sublayers by calling
to_blockconfig() on FFNConfig/AttentionConfig which loses parallel_blocks
context and the inner_block_idx, causing meta-init to instantiate incorrect
topology; change the logic in the block that handles layer_config so that when
layer_config originates from a BlockConfig.parallel_blocks entry you retain or
reconstruct the parent BlockConfig plus the selected inner_block_idx (e.g.
accept/propagate an inner_block_idx parameter or use the original BlockConfig
object instead of calling to_blockconfig()), or modify
FFNConfig.to_blockconfig()/AttentionConfig.to_blockconfig() to accept and
preserve inner_block_idx/parent-parallel context so the instantiated
block_config still reflects the parallel_blocks entry. Ensure refs to
layer_config, FFNConfig, AttentionConfig, to_blockconfig(),
BlockConfig.parallel_blocks, and inner_block_idx are used to locate and update
the code paths.
In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py`:
- Around line 156-160: The function init_model_from_config currently defaults
trust_remote_code=True which is a security risk; change the default to
trust_remote_code: bool = False in the function signature and any
overloaded/duplicate signatures (e.g., init_model_from_config) so callers must
opt-in explicitly, and update any internal calls or tests that relied on the old
default to pass trust_remote_code=True where appropriate; ensure any docstring
or usage comment near init_model_from_config reflects the new safe default.
---
Outside diff comments:
In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py`:
- Around line 348-359: The call to calculate_subblock_stats is passing
model_hidden_size via n_embd while the code derives subblock widths from
calculate_subblock_params(model_config, ...), causing mismatched widths for
teacher vs swept models; compute the effective embedding width from the subblock
params returned by calculate_subblock_params (e.g., use the subblock params'
n_embd/hidden size) and pass that value as n_embd to calculate_subblock_stats
instead of model_hidden_size, keeping the teacher model_config/teacher_dir usage
unchanged so the subblock stats use the same width source as
calculate_subblock_params.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0b698ceb-94cc-4adf-bf77-79a62e79882a
📒 Files selected for processing (5)
modelopt/torch/puzzletron/decilm/deci_lm_hf_code/modeling_decilm.py
modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py
modelopt/torch/puzzletron/subblock_stats/calc_subblock_stats.py
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
tests/gpu/torch/puzzletron/test_puzzletron.py
if isinstance(layer_config, FFNConfig):
    block_config = layer_config.to_blockconfig()
elif isinstance(layer_config, AttentionConfig):
    block_config = layer_config.to_blockconfig()
else:
    block_config = layer_config
to_blockconfig() drops parallel_blocks context.
Line 114 and Line 116 collapse the measured sublayer into a standalone BlockConfig. That cannot represent entries coming from BlockConfig.parallel_blocks: the replacement-library loader already preserves an inner_block_idx for those cases, but this API has nowhere to carry it, so meta-init instantiates the wrong layer topology and miscounts params/memory for parallel-block models.
If we want to keep the meta-init approach here, this helper needs the parent BlockConfig plus the selected inner_block_idx, not just a bare FFNConfig/AttentionConfig.
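One possible shape for such an API, using minimal hypothetical stand-ins (the real FFNConfig/BlockConfig carry many more fields, and `resolve_block_config` is an invented name, not the repo's helper):

```python
from dataclasses import dataclass, field

# Stand-ins for the repo's config classes; only the fields needed to
# illustrate the parallel_blocks problem are modeled here.
@dataclass
class FFNConfig:
    intermediate_size: int

@dataclass
class BlockConfig:
    parallel_blocks: list = field(default_factory=list)

def resolve_block_config(layer_config, parent=None, inner_block_idx=None):
    """Return (block_config, inner_block_idx) without dropping parallel context.

    If the measured sublayer came from parent.parallel_blocks[inner_block_idx],
    hand back the parent so meta-init sees the full topology; otherwise wrap
    the bare sublayer as a single-entry BlockConfig, mimicking to_blockconfig().
    """
    if parent is not None and inner_block_idx is not None:
        return parent, inner_block_idx
    if isinstance(layer_config, FFNConfig):
        return BlockConfig(parallel_blocks=[layer_config]), 0
    return layer_config, None
```

The key point is that the caller keeps both the parent config and the index, so a parallel-block entry is never flattened into a standalone block.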
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py`
around lines 114 - 119, The current branch replaces measured sublayers by
calling to_blockconfig() on FFNConfig/AttentionConfig which loses
parallel_blocks context and the inner_block_idx, causing meta-init to
instantiate incorrect topology; change the logic in the block that handles
layer_config so that when layer_config originates from a
BlockConfig.parallel_blocks entry you retain or reconstruct the parent
BlockConfig plus the selected inner_block_idx (e.g. accept/propagate an
inner_block_idx parameter or use the original BlockConfig object instead of
calling to_blockconfig()), or modify
FFNConfig.to_blockconfig()/AttentionConfig.to_blockconfig() to accept and
preserve inner_block_idx/parent-parallel context so the instantiated
block_config still reflects the parallel_blocks entry. Ensure refs to
layer_config, FFNConfig, AttentionConfig, to_blockconfig(),
BlockConfig.parallel_blocks, and inner_block_idx are used to locate and update
the code paths.
…ats_nodecilm
♻️ Duplicate comments (1)
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py (1)
138-150: ⚠️ Potential issue | 🔴 Critical
CRITICAL: Default `trust_remote_code` to `False`. The function defaults `trust_remote_code=True` (line 141), which violates repository security guidelines. Per SECURITY.md and the coding guidelines, this flag must default to `False` to prevent callers from silently opting into executing checkpoint-supplied Python during model construction. Callers should explicitly opt in at audited call sites.
The call site in calc_subblock_params_and_memory.py (line 167) already passes `trust_remote_code=descriptor.requires_trust_remote_code()`, so changing the default here should not break existing functionality.
🔐 Proposed fix
 def init_model_from_config(
     config: PretrainedConfig,
     *,
-    trust_remote_code: bool = True,
+    trust_remote_code: bool = False,
     **kwargs,
 ) -> PreTrainedModel:
As per coding guidelines: "Do not hardcode `trust_remote_code=True` when loading Hugging Face Transformers models. Let the caller decide via a parameter; default to `False`."
🤖 Prompt for AI Agents
trust_remote_code=Truewhen loading Hugging Face Transformers models. Let the caller decide via a parameter; default toFalse."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 138 - 150, The init_model_from_config function currently defaults trust_remote_code to True which violates security guidelines; change the default to False in the function signature (init_model_from_config) and ensure callers that need remote code explicitly pass trust_remote_code=True (e.g., calc_subblock_params_and_memory uses descriptor.requires_trust_remote_code()). Update only the default value for the trust_remote_code parameter and keep the existing conditional logic using model_class.from_config and model_class._from_config unchanged.
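The safe-default pattern the review asks for can be sketched in isolation; `Descriptor` and the dict return value are placeholders for the real PretrainedConfig/PreTrainedModel machinery:

```python
# Hedged sketch: a safe-by-default flag that callers must opt into explicitly.
def init_model_from_config(config, *, trust_remote_code: bool = False, **kwargs):
    # Remote-code execution must be an explicit, audited caller decision,
    # so the default is False.
    return {"config": config, "trust_remote_code": trust_remote_code, **kwargs}

class Descriptor:
    """Stand-in for a model descriptor that knows its checkpoint's needs."""
    def requires_trust_remote_code(self) -> bool:
        return True  # e.g. a custom-architecture checkpoint

descriptor = Descriptor()
default_model = init_model_from_config({"hidden_size": 128})
optin_model = init_model_from_config(
    {"hidden_size": 128},
    trust_remote_code=descriptor.requires_trust_remote_code(),  # audited opt-in
)
```

With this shape, a caller that forgets the flag gets the safe behavior, and the opt-in is visible at the call site for review.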
🧹 Nitpick comments (2)
modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py (1)
322-331: Unused parameter `experts_dtype`.
The `experts_dtype` parameter (line 327) is added to the function signature but never used in the function body. The TODO comment on line 329 suggests this was intended for separating expert weights from the rest, but the current implementation doesn't differentiate. Consider either:
- Removing the unused parameter until the TODO is addressed
- Implementing the expert weight separation logic
Option 1: Remove unused parameter
 def calculate_ffn_memory(
     ffn_config: FFNConfig,
     model_config: PretrainedConfig,
     descriptor: Type[ModelDescriptor],
     weights_dtype: torch.dtype | str,
-    experts_dtype: torch.dtype | str | None = None,
 ) -> float:
-    # TODO: How to separate between expert weights and the rest for any model (same as puzzletron).
     num_params = calculate_subblock_params(model_config, ffn_config, descriptor)
     return num_params * sizeof_dtype(weights_dtype) / 2**20
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py` around lines 322 - 331, The function calculate_ffn_memory has an unused parameter experts_dtype; either remove it from the signature and from all callers (and update any type hints referencing FFNConfig/descriptor usage) or implement expert vs non-expert separation: call calculate_subblock_params twice (or modify it) to get expert_params and non_expert_params, compute memory as expert_params * sizeof_dtype(experts_dtype) + non_expert_params * sizeof_dtype(weights_dtype) and return the total in MiB (divide by 2**20); use the existing symbols calculate_subblock_params, sizeof_dtype, calculate_ffn_memory, FFNConfig and descriptor to locate where to change.
modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py (1)
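The second option the bot describes can be sketched as follows; the byte table, `sizeof_dtype`, and `calculate_ffn_memory_mib` are simplified stand-ins rather than the repo's real helpers:

```python
# Hypothetical sketch: price expert weights at experts_dtype and the
# remaining weights at weights_dtype, returning MiB.
_DTYPE_BYTES = {"float8": 1, "float16": 2, "bfloat16": 2, "float32": 4}

def sizeof_dtype(dtype: str) -> int:
    """Bytes per element for a dtype name (simplified stand-in)."""
    return _DTYPE_BYTES[dtype]

def calculate_ffn_memory_mib(expert_params, non_expert_params,
                             weights_dtype, experts_dtype=None):
    # If no separate expert dtype is given, fall back to the weights dtype,
    # which reproduces the current undifferentiated behavior.
    experts_dtype = experts_dtype or weights_dtype
    total_bytes = (expert_params * sizeof_dtype(experts_dtype)
                   + non_expert_params * sizeof_dtype(weights_dtype))
    return total_bytes / 2**20  # MiB
```

For example, 2^20 expert params in float8 plus 2^20 dense params in float16 price out to 1 MiB + 2 MiB.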
125-135: Consider extracting a shared helper to avoid duplication.
This `_get_model_class_from_config` implementation is nearly identical to the one in modelopt/torch/puzzletron/tools/sharded_checkpoint_utils.py (lines 99-112). The only difference is using `warnings.warn` here vs `mprint` there. Consider extracting this helper to a shared location to avoid maintaining duplicate logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py` around lines 125 - 135, The _get_model_class_from_config function duplicates logic present in sharded_checkpoint_utils (same name/behavior but different warning call); extract this logic into a single shared helper (e.g., get_model_class_from_config) placed in a utilities module and have both checkpoint_utils_hf._get_model_class_from_config and sharded_checkpoint_utils call that helper; unify the warning/logging behavior by accepting an optional warn_func parameter (or default to warnings.warn) so callers can pass mprint when needed, and update both callers to use the new helper to remove duplication.
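The proposed shared helper could look roughly like this; the registry dict stands in for `getattr(transformers, name)`, and `warn_func` is the injection point the prompt describes:

```python
import warnings

def get_model_class_from_config(model_class_name, registry, fallback,
                                warn_func=warnings.warn):
    """Shared lookup with an injectable warning backend.

    checkpoint_utils_hf can use the warnings.warn default, while
    sharded_checkpoint_utils passes mprint, so both call sites share
    one implementation.
    """
    model_class = registry.get(model_class_name)
    if model_class is None:
        warn_func(f"{model_class_name} not found in transformers, using fallback.")
        model_class = fallback
    return model_class
```

Parameterizing the warning function keeps the behavioral difference between the two callers explicit instead of duplicating the lookup logic.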
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py`:
- Around line 138-150: The init_model_from_config function currently defaults
trust_remote_code to True which violates security guidelines; change the default
to False in the function signature (init_model_from_config) and ensure callers
that need remote code explicitly pass trust_remote_code=True (e.g.,
calc_subblock_params_and_memory uses descriptor.requires_trust_remote_code()).
Update only the default value for the trust_remote_code parameter and keep the
existing conditional logic using model_class.from_config and
model_class._from_config unchanged.
---
Nitpick comments:
In `@modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.py`:
- Around line 322-331: The function calculate_ffn_memory has an unused parameter
experts_dtype; either remove it from the signature and from all callers (and
update any type hints referencing FFNConfig/descriptor usage) or implement
expert vs non-expert separation: call calculate_subblock_params twice (or modify
it) to get expert_params and non_expert_params, compute memory as expert_params
* sizeof_dtype(experts_dtype) + non_expert_params * sizeof_dtype(weights_dtype)
and return the total in MiB (divide by 2**20); use the existing symbols
calculate_subblock_params, sizeof_dtype, calculate_ffn_memory, FFNConfig and
descriptor to locate where to change.
In `@modelopt/torch/puzzletron/tools/checkpoint_utils_hf.py`:
- Around line 125-135: The _get_model_class_from_config function duplicates
logic present in sharded_checkpoint_utils (same name/behavior but different
warning call); extract this logic into a single shared helper (e.g.,
get_model_class_from_config) placed in a utilities module and have both
checkpoint_utils_hf._get_model_class_from_config and sharded_checkpoint_utils
call that helper; unify the warning/logging behavior by accepting an optional
warn_func parameter (or default to warnings.warn) so callers can pass mprint
when needed, and update both callers to use the new helper to remove
duplication.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4fd8f067-0fd7-4794-af5d-e04e07da9d54
📒 Files selected for processing (2)
modelopt/torch/puzzletron/subblock_stats/calc_subblock_params_and_memory.pymodelopt/torch/puzzletron/tools/checkpoint_utils_hf.py
Codecov Report
✅ All modified and coverable lines are covered by tests. Additional details and impacted files:
@@            Coverage Diff             @@
##    feature/puzzletron    #1102      +/-   ##
======================================================
- Coverage       72.12%   72.10%   -0.03%
======================================================
  Files             209      209
  Lines           23628    23628
======================================================
- Hits            17042    17036       -6
- Misses           6586     6592       +6
☔ View full report in Codecov by Sentry.
What does this PR do?
Refactoring of subblock stats to stop using DeciLM code and use anymodel instead.
Summary by CodeRabbit
Refactor
Tests