Add subblock stats to the compress algorithm#623
danielkorzekwa merged 60 commits into feature/compress
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## feature/compress #623 +/- ##
=================================================
Coverage 74.37% 74.37%
=================================================
Files 182 182
Lines 18219 18219
=================================================
Hits 13550 13550
Misses 4669 4669
modelopt/torch/_compress/subblock_stats/calc_subblock_params_and_memory.py
raise_unknown_subblock_config_error(subblock_config)


def calculate_subblock_params(
I think the param count is a bit hacky and could be simplified by running a single forward pass on a sample input and counting the params. We already have a simple utility for this in modelopt (param_num_from_forward): https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/utils/network.py#L129 which is generic and works for any HF model (MoE or dense).
We could also run both functions and compare the numbers.
Good candidate for a shared component. Added as a high-priority internal NVIDIA issue: issues/74.
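To illustrate why a forward-pass-based count can differ from a plain sum over all parameters, here is a minimal toy sketch in pure Python. It does not use PyTorch or modelopt's param_num_from_forward; the ToyModule and ToyMoE classes and the params_from_forward helper are hypothetical stand-ins that only mimic the idea of counting parameters of modules that actually run on a sample input.

```python
class ToyModule:
    """Hypothetical stand-in for a layer: owns parameters and may call children."""

    def __init__(self, name, num_params, children=()):
        self.name = name
        self.num_params = num_params
        self.children = list(children)

    def forward(self, x, called):
        called.add(self)  # record that this module ran
        for child in self.children:
            x = child.forward(x, called)
        return x


class ToyMoE(ToyModule):
    """Routes the input to exactly one expert, so the other experts never run."""

    def forward(self, x, called):
        called.add(self)
        expert = self.children[x % len(self.children)]
        return expert.forward(x, called)


def total_params(module):
    """Sum parameters over the whole module tree, used or not."""
    return module.num_params + sum(total_params(c) for c in module.children)


def params_from_forward(model, sample_input):
    """Count only the parameters of modules reached during one forward pass."""
    called = set()
    model.forward(sample_input, called)
    return sum(m.num_params for m in called)


experts = [ToyModule(f"expert{i}", 1000) for i in range(4)]
model = ToyModule("block", 0, [ToyModule("attn", 400), ToyMoE("moe", 10, experts)])

all_params = total_params(model)
active_params = params_from_forward(model, sample_input=1)
```

In this toy setup the two counts agree for a dense block and diverge for the MoE block, which is the kind of discrepancy that comparing both functions would surface.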
)


def calculate_subblock_memory(
Quick question: is memory = (active params * param dtype memory) + (kv cache params * kv cache dtype memory)? Or is it a bit more complicated than that? I see a lot of logic for different layer types, but there isn't any docstring, so I'm not sure what the main reason is for the custom per-layer logic.
Sepehr raised similar questions and concerns. Added to: issues/74
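For reference, the simple formula from the question above can be sketched as follows. This is only the baseline the reviewer describes, not what calculate_subblock_memory does in this PR (which has additional per-layer-type logic); the function names, the DTYPE_BYTES table, and the KV-cache shape are assumptions made for illustration.

```python
# Bytes per element for a few common dtypes (hypothetical lookup table).
DTYPE_BYTES = {"float32": 4, "bfloat16": 2, "float16": 2, "fp8": 1}


def kv_cache_elements(batch_size, seq_len, n_kv_heads, head_dim):
    """Elements in one attention layer's KV cache, assuming one key tensor and
    one value tensor of shape [batch_size, seq_len, n_kv_heads, head_dim]."""
    return 2 * batch_size * seq_len * n_kv_heads * head_dim


def estimate_subblock_memory_bytes(active_params, kv_elements, param_dtype, kv_dtype):
    """memory = active params * param dtype size + KV cache elements * KV dtype size."""
    return active_params * DTYPE_BYTES[param_dtype] + kv_elements * DTYPE_BYTES[kv_dtype]


kv_elems = kv_cache_elements(batch_size=1, seq_len=128, n_kv_heads=8, head_dim=64)
memory_bytes = estimate_subblock_memory_bytes(
    active_params=1_000_000,
    kv_elements=kv_elems,
    param_dtype="bfloat16",
    kv_dtype="fp8",
)
```

A subblock without attention (for example an FFN-only subblock) would pass kv_elements=0 under this simplified model.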
from puzzle_tools.subblock_stats.runtime_stats.calc_runtime_stats import (
    calc_runtime_ms_for_subblocks,
)
Is this for TRT-LLM stats?
Yes, this is used only if runtime_stats is enabled via a config param.
# TODO: fix
# from puzzle_tools.calc_subblock_runtime import measure_non_block_runtime_ms
# non_block_runtime_ms, embedding_runtime_ms, lm_head_runtime_ms = \
#     measure_non_block_runtime_ms(batch_size, prefill_seq_len, generation_seq_len, n_embd, vocab_size,
#                                  benchmark_iterations, use_cuda_graph)
Will this be added in a follow-up PR?
Once scoring/mip are in, we can prioritize what to do next. I added an internal issue for subblock runtime stats: issues/75
# ==== START === Setup for attach-helper ====
# import sys
# import os
# sys.path.insert(0, os.environ["ATTACH_HELPER_INSTALLATION_PATH"])
# from attach_helper import debugging_setup
# debugging_setup()  # You can optionally pass a name to identify the job (e.g. `debugging_setup(name="my_script")`)
# ==== END === Setup for attach-helper ====
Likely for debugging; I removed it.
What does this PR do?
Add subblock stats to the compress algorithm.