Compress tutorial (PoC) #492

# Compress Algorithm Tutorial

This tutorial demonstrates how to compress large language models using the Compress algorithm, based on the [Puzzle paper](https://arxiv.org/abs/2411.19146).
The goal of the algorithm is to find optimal modifications to the MLP and attention layers of the model, resulting in a heterogeneous model architecture.
The supported modifications are:

- `ffn_intermediate_size`: different FFN intermediate sizes
- `attention op/noop`: complete removal of attention layers

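For reference, these two modification types surface as the following knobs in the configuration files included in this PR (a minimal sketch; the values are copied from those files, and `add_ffn_no_ops` is shown for completeness):

```yaml
pruning:
  intermediate_size_list: [3072, 5888, 8704, 11520] # candidate FFN intermediate sizes
build_replacement_library:
  add_attention_no_ops: true # allow replacing attention layers with no-ops
  add_ffn_no_ops: true
```
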
To use the Puzzle algorithm effectively, you need to specify a target number of parameters and/or a target memory budget. The final stage uses a Mixed-Integer Programming (MIP) algorithm to find the combination of layer modifications that best satisfies the target requirements.

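In the configs shown later in this PR, the MIP stage is driven by the `mip` section; a minimal sketch of the relevant keys (copied from the base configuration below; the exact nesting follows that file):

```yaml
mip:
  objective: metrics.cosine_embedding_loss_hidden_states # accuracy proxy to optimize
  bigger_is_better: false
  human_constraints:
    target_memory: 78_000 # MiB
```
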
In this example, we compress the [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model, reducing GPU memory usage from 113 GiB to 96 GiB (a 15% reduction) with less than 1% regression in the `token_accuracy_top_10` metric.

> **Collaborator:** Do you know what parameter reduction we see as well? That would be useful info to add here.

## Environment

- Install TensorRT-Model-Optimizer in editable mode with the corresponding dependencies (a quick install check follows this list):

  ```bash
  pip install -e .[hf,compress]
  ```

- For this example we use 2x NVIDIA H100 80GB HBM3 GPUs to show the multi-GPU steps. You can also use a single GPU.

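To sanity-check the installation before moving on, here is a hedged one-liner (it assumes the installed `modelopt` package exposes `__version__`, which may vary across releases):

```bash
python -c "import modelopt; print(modelopt.__version__)"  # assumption: __version__ is exported
```
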
## Compress the Model

1. Specify the `puzzle_dir`, `input_hf_model_path`, `dataset_path`, `intermediate_size_list`, and `target_memory` arguments in the [llama-3_1-8B_pruneffn_memory.yaml](./configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml) configuration file; a sketch of the relevant fields follows below.

   **_NOTE:_** How to choose `intermediate_size_list`? The list specifies the candidate FFN sizes to search over. It is recommended to choose several pruning sizes (e.g., 15%, 20%, 30% of the original size). Note that the values must be hardware-friendly (divisible by 256) to avoid issues with tensor operations in subsequent steps; the sizes used in this example (3072, 5888, 8704, 11520) are all multiples of 256.

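   A minimal sketch of the fields to set (the paths are placeholders; the key layout mirrors the run configuration file included at the end of this PR):

   ```yaml
   puzzle_dir: /workspace/puzzle_dir # working directory for compression outputs
   input_hf_model_path: /workspace/hf_models/meta-llama/Llama-3.1-8B-Instruct
   dataset_path: /workspace/datasets/Nemotron-Post-Training-Dataset-v2

   pruning:
     intermediate_size_list: [3072, 5888, 8704, 11520] # teacher intermediate size is 14336

   mip:
     human_constraints:
       target_memory: 78_000 # MiB
   ```
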
   Let's first aim for a 32% GPU memory reduction by setting `target_memory: 78_000` (MiB). The algorithm will then choose the candidate modifications with the highest accuracy that also meet the specified requirements.

2. Download and prepare the [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2).

   Dataset splits: "code", "math", "stem", "chat", excluding reasoning samples (2.62 GB).

   ```bash
   python -m modelopt.torch._compress.dataset.prepare_dataset --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 --output_dir path/to/Nemotron-Post-Training-Dataset-v2
   ```

3. Run the compression script.

   ```bash
   torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress"
   ```

   This will save the full output to `log.txt` and display the following progress on screen:

   ```bash
   [2025-11-02 12:06:34][rank-0][main.py:71] Compress Progress 1/8: starting compression pipeline
   [2025-11-02 12:06:45][rank-0][compress_nas_plugin.py:123] Compress Progress 2/8: converting model from HF to DeciLM (single-gpu)
   [2025-11-02 12:07:07][rank-0][compress_nas_plugin.py:132] Compress Progress 3/8: scoring pruning activations (multi-gpu)
   [2025-11-02 12:11:36][rank-0][compress_nas_plugin.py:137] Compress Progress 4/8: pruning the model and saving pruned checkpoints (single-gpu)
   [2025-11-02 12:12:20][rank-0][compress_nas_plugin.py:217] Compress Progress 5/8: building replacement library and subblock statistics (single-gpu)
   [2025-11-02 12:12:21][rank-0][compress_nas_plugin.py:222] Compress Progress 6/8: calculating one block scores (multi-gpu)
   [2025-11-02 12:50:41][rank-0][compress_nas_plugin.py:226] Compress Progress 7/8: running MIP and realizing models (multi-gpu)
   [2025-11-02 12:52:34][rank-0][main.py:115] Compress Progress 8/8: compression pipeline completed (multi-gpu)
   ```

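As noted in the Environment section, a single GPU also works; the same entry point can then be launched with one process (a hedged variant of the command above, where only the process count changes):

```bash
torchrun --nproc_per_node 1 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml 2>&1 | tee ./log.txt | grep "Compress Progress"
```
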
Once the process is complete, the resulting network architecture will be recorded in `log.txt` for your review:

```bash
...
block_0: attention gqa_4 ffn intermediate_14336
block_1: attention gqa_4 ffn intermediate_14336
block_2: attention gqa_4 ffn intermediate_14336
block_3: attention gqa_4 ffn intermediate_14336
block_4: attention gqa_4 ffn intermediate_14336
block_5: attention gqa_4 ffn intermediate_14336
block_6: attention gqa_4 ffn intermediate_14336
block_7: attention gqa_4 ffn intermediate_14336
block_8: attention gqa_4 ffn intermediate_14336
block_9: attention gqa_4 ffn intermediate_14336
block_10: attention gqa_4 ffn intermediate_14336
block_11: attention gqa_4 ffn intermediate_14336
block_12: attention gqa_4 ffn intermediate_14336
block_13: attention gqa_4 ffn intermediate_14336
block_14: attention gqa_4 ffn intermediate_14336
block_15: attention gqa_4 ffn intermediate_14336
block_16: attention gqa_4 ffn intermediate_14336
block_17: attention no_op ffn intermediate_14336
block_18: attention no_op ffn intermediate_14336
block_19: attention no_op ffn intermediate_14336
block_20: attention no_op ffn intermediate_14336
block_21: attention no_op ffn intermediate_14336
block_22: attention no_op ffn intermediate_14336
block_23: attention no_op ffn intermediate_14336
block_24: attention no_op ffn intermediate_14336
block_25: attention no_op ffn intermediate_14336
block_26: attention no_op ffn intermediate_14336
block_27: attention no_op ffn intermediate_14336
block_28: attention no_op ffn intermediate_14336
block_29: attention gqa_4 ffn intermediate_14336
block_30: attention gqa_4 ffn intermediate_14336
block_31: attention gqa_4 ffn intermediate_14336

[2025-11-02 04:53:11,332][rank-0][run_puzzle.py:295] Total costs: {'stats.memory_mib': 75796.4140625, 'stats.ffn_num_params': 5637275648, 'stats.num_kv_heads': 160, 'stats.kv_cache_memory_mib': 61440.0, 'stats.ffn_memory_mib': 10752.25, 'stats.attention_memory_mib': 63040.15625, 'stats.attention_num_params': 838942720, 'stats.num_params': 7526895616, 'stats.has_attention': 20, 'stats.has_ffn': 32}
...
################################################################
validate_model_and_extract_token_probs(model_name='teacher')
################################################################
...
Average losses = {'lm_loss': 1.118250765837729, 'token_accuracy_top_1': 0.7331905364990234, 'token_accuracy_top_5': 0.9094219207763672, 'token_accuracy_top_10': 0.9423646926879883}
...
################################################################
validate_model_with_kl_div(model_name='solution_0', is_calc_kl_div=True)
################################################################
...
Average losses = {'lm_loss': 1.7577573340386152, 'token_accuracy_top_1': 0.6225490570068359, 'token_accuracy_top_5': 0.846257209777832, 'token_accuracy_top_10': 0.8987817764282227}
```

A review thread on the `gqa_4` entries above:

> **Collaborator:** GQA4 will only work with TP4 if training in Megatron-fw. Maybe deployment also, but I don't know for sure. Should we remove GQA pruning from the search space?
>
> **Author:** GQA is not in the search space, only attention op/noop. GQA4 means there are 8 groups, each with 4 KV heads, not 4 groups. Added internal NV issues/60 to clarify it.

*(Thread marked as resolved by kevalmorabia97.)*

A 30% GPU memory reduction leads to nearly a 5% regression in the `token_accuracy_top_10` metric (0.8988 vs. 0.9424 for the teacher, a drop of about 4.6%). Let's re-run the MIP search aiming for a 15% memory reduction instead.

## Re-run MIP Search with different constraints

If you want to try different constraints without re-running the expensive pruning and scoring steps, use the `--mip-only` flag.
This assumes that pruning, replacement library building, NAS scoring, and subblock stats calculation have already been completed.

For example, let's set `target_memory: 96_000` in `llama-3_1-8B_pruneffn_memory.yaml`.

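In the run configuration, this constraint sits under `mip.human_constraints` (see the full file at the end of this PR):

```yaml
mip:
  human_constraints:
    target_memory: 96_000 # MiB
```

Then re-run with the `--mip-only` flag:
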
```bash
torchrun --nproc_per_node 2 examples/compress/main.py --config path/to/llama-3_1-8B_pruneffn_memory.yaml --mip-only 2>&1 | tee ./log.txt | grep "Compress Progress"
```

This will generate the following network architecture (see `log.txt`):

```bash
block_0: attention gqa_4 ffn intermediate_14336
block_1: attention gqa_4 ffn intermediate_14336
block_2: attention gqa_4 ffn intermediate_14336
block_3: attention gqa_4 ffn intermediate_14336
block_4: attention gqa_4 ffn intermediate_14336
block_5: attention gqa_4 ffn intermediate_14336
block_6: attention gqa_4 ffn intermediate_14336
block_7: attention gqa_4 ffn intermediate_14336
block_8: attention gqa_4 ffn intermediate_14336
block_9: attention gqa_4 ffn intermediate_14336
block_10: attention gqa_4 ffn intermediate_14336
block_11: attention gqa_4 ffn intermediate_14336
block_12: attention gqa_4 ffn intermediate_14336
block_13: attention gqa_4 ffn intermediate_14336
block_14: attention gqa_4 ffn intermediate_14336
block_15: attention gqa_4 ffn intermediate_14336
block_16: attention gqa_4 ffn intermediate_14336
block_17: attention gqa_4 ffn intermediate_14336
block_18: attention no_op ffn intermediate_14336
block_19: attention no_op ffn intermediate_14336
block_20: attention no_op ffn intermediate_14336
block_21: attention gqa_4 ffn intermediate_14336
block_22: attention no_op ffn intermediate_14336
block_23: attention no_op ffn intermediate_14336
block_24: attention no_op ffn intermediate_14336
block_25: attention gqa_4 ffn intermediate_14336
block_26: attention gqa_4 ffn intermediate_14336
block_27: attention gqa_4 ffn intermediate_14336
block_28: attention gqa_4 ffn intermediate_14336
block_29: attention gqa_4 ffn intermediate_14336
block_30: attention gqa_4 ffn intermediate_14336
block_31: attention gqa_4 ffn intermediate_14336

[2025-11-02 12:50:42,024][rank-0][run_puzzle.py:295] Total costs: {'stats.memory_mib': 94708.4609375, 'stats.has_ffn': 32, 'stats.ffn_memory_mib': 10752.25, 'stats.kv_cache_memory_mib': 79872.0, 'stats.attention_num_params': 1090625536, 'stats.ffn_num_params': 5637275648, 'stats.has_attention': 26, 'stats.num_params': 7778578432, 'stats.attention_memory_mib': 81952.203125, 'stats.num_kv_heads': 208}
...
################################################################
validate_model_with_kl_div(model_name='solution_0', is_calc_kl_div=True)
################################################################
Average losses = {'lm_loss': 1.2425934937782586, 'token_accuracy_top_1': 0.703862190246582, 'token_accuracy_top_5': 0.8954982757568359, 'token_accuracy_top_10': 0.9336576461791992}
```

This meets the goal stated in the introduction: 0.9337 vs. 0.9424 on `token_accuracy_top_10` is a regression of less than 1%.

On the other hand, if you set `target_memory: 28_000`, you'll observe that the intermediate FFN sizes are significantly reduced in certain layers (see `log.txt` for details):

```bash
block_5: attention no_op ffn intermediate_11520
block_6: attention no_op ffn intermediate_14336
block_7: attention no_op ffn intermediate_8704
block_8: attention no_op ffn intermediate_14336
block_9: attention no_op ffn intermediate_3072
block_10: attention no_op ffn intermediate_11520
block_11: attention no_op ffn intermediate_11520
block_12: attention no_op ffn intermediate_11520
block_13: attention no_op ffn intermediate_11520
block_14: attention no_op ffn intermediate_3072
```

## Evaluation

Once the model is ready, you can evaluate it using the [Language Model Evaluation Harness](https://pypi.org/project/lm-eval/). For example, run the following to evaluate the model on the [Massive Multitask Language Understanding](https://huggingface.co/datasets/cais/mmlu) benchmark:

```bash
lm_eval --model hf \
    --model_args pretrained=path/to/model,dtype=bfloat16,trust_remote_code=true,parallelize=True \
    --tasks mmlu \
    --num_fewshot 5 \
    --batch_size 4
```

## Advanced usage

Modify the `path/to/Llama-3_1-8B.yaml` file (shown below) for advanced compression scenarios.

---

The configuration files included in this PR follow. First, the base Puzzle configuration (`Llama-3_1-8B.yaml`, as referenced above):

```yaml
defaults:
  - pruning: ffn_pruning
  - scoring: ../validate_solutions_defaults
  - realize_model: ../validate_solutions_defaults
  - bypass:
  - override hydra/hydra_logging: disabled
  - _self_

puzzle_dir: ???
teacher_dir: ${puzzle_dir}/ckpts/teacher/
replacement_library_path: ${puzzle_dir}/replacement_library.json
dataset_path: ??? # path to v0.4_mini

skip_realize_model: false

build_replacement_library:
  add_ffn_no_ops: true
  add_attention_no_ops: true

calc_subblock_stats:
  batch_sizes: [64, 96, 128]
  prefill_seq_len: 4096
  generation_seq_len: 4096
  num_active_tokens_override: # Optional override for sequence lengths
  prefill_queue_size: 0
  allocate_prefill_query: false
  benchmark_iterations: # Set to a number (e.g., 1000) to enable runtime benchmarking
  merge_with_existing_stats: false
  subblock_stats_filename: "subblock_stats.json"
  moe_stats_filename: "moe_stats.json"
  runtime_stats:
    backend: trt_torch

scoring:
  solutions_to_validate:
  skip_existing_solutions: true

  replacement_library_path: ${replacement_library_path}
  solutions_path: ${to_path:${puzzle_dir}/single_sequence_replacement_solutions.json}
  teacher_dir: ${to_path:${teacher_dir}}
  output_dir: ${puzzle_dir}/single_sequence_replacement_solutions--validation

  eval_samples: 10 # default is 128
  micro_batch_size: 1
  seed: 42
  shuffle_seed: 444
  dataset_path: ${dataset_path}

mip:
  single_block_replacement_validation_dir: ${to_path:${scoring.output_dir}}
  subblock_stats_path: ${to_path:${puzzle_dir}/${calc_subblock_stats.subblock_stats_filename}}
  output_path: ${to_path:${puzzle_dir}/mip/puzzle_solutions}
  gathered_metrics_path:
  puzzle_profile:

  # puzzle_profile:
  objective: metrics.cosine_embedding_loss_hidden_states
  bigger_is_better: false
  num_solutions: 1
  minimal_diversity: 2

  subblock_stats_args:
    - batch_size: 96
      weights_dtype: torch.bfloat16
      activations_dtype: torch.bfloat16
      kv_cache_dtype: torch.bfloat16

  report_additional_costs:
    - stats.memory_mib
    - stats.num_params
    - stats.num_kv_heads
    - stats.has_attention
    - stats.has_ffn
    - stats.kv_cache_memory_mib
    - stats.attention_memory_mib
    - stats.ffn_memory_mib
    - stats.ffn_num_params
    - stats.attention_num_params

  human_constraints:
    target_memory: 78_000

  mip_constraints:
    use_greedy_search: false
    is_multi_layer_puzzle: true
    metric_overrides:
    constrain_search_func:
    max_seconds_per_solution: 60

realize_model:
  teacher_dir: ${to_path:${teacher_dir}}
  tokenizer_name: ${to_path:${teacher_dir}}
  replacement_library_path: ${replacement_library_path}
  save_models: true
  solutions_path: # Filled dynamically

  # Validate params
  skip_validation: false # To enable validation of the model solution, set `skip_validation` to false
  eval_samples: 128
  micro_batch_size: 1
  seed: 42
  shuffle_seed: 444
  dataset_path: ${dataset_path}

nccl_timeout_minutes: ${timedelta_minutes:10}

# This section redirects Hydra outputs
hydra:
  run:
    dir: ${puzzle_dir}/hydra_logs/${now:%Y-%m-%d}/${now:%H-%M-%S}
```

Next, the run configuration (`llama-3_1-8B_pruneffn_memory.yaml`), which composes the base config via its Hydra `defaults` list and overrides only run-specific fields:

```yaml
defaults:
  - Llama-3_1-8B
  - _self_

# Input Hugging Face model to compress
input_hf_model_path: /workspace/hf_models/meta-llama/Llama-3.1-8B-Instruct

# Dataset path for pruning and NAS scoring
dataset_path: /workspace/datasets/Nemotron-Post-Training-Dataset-v2

# Working directory for compression outputs
puzzle_dir: /workspace/puzzle_dir

# MIP memory constraint (in MiB)
mip:
  human_constraints:
    target_memory: 96_000 # 96 GiB

# FFN intermediate sizes to search over (heterogeneous architecture)
pruning:
  intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336
```

Next, the attention (KV-head) pruning configuration (filename not shown in the diff):

```yaml
defaults:
  - pruning_defaults

activations_log_dir: ${puzzle_dir}/pruning/pruning_scores/attn_${pruning.activation_hooks_kwargs.method}/${pruning.experiment_id}

activation_hooks_kwargs:
  method: independent_kv_head_contribution
  optimize_for: memory # IndependentKvHeadContributionHook implementation that consumes less memory
  target_layer: "self_attn.o_proj"
  layer_input_descriptors_path:

# n_heads_in_group: 4
# num_attention_heads: 32 # num query heads
# num_kv_heads: 32 / 4 = 8 # num_query_heads // n_heads_in_group
n_heads_in_group_list: [8, 16, 32] # num_kv_heads = [4, 2, 1]
gqa_init_mode: "PruneKVHeads"
```

Next, the FFN pruning configuration (loaded via the `pruning: ffn_pruning` entry in the base config's defaults):

```yaml
defaults:
  - pruning_defaults

activations_log_dir: ${puzzle_dir}/pruning/pruning_scores/ffn_${pruning.activation_hooks_kwargs.method}/${pruning.experiment_id}

activation_hooks_kwargs:
  method: iterative
  target_layer: "mlp.down_proj"
  layer_input_descriptors_path:

intermediate_size_list: [3072, 5888, 8704, 11520] # teacher_intermediate_size is 14336
mlp_init_mode: "PruneByActivationsLog"
```

Finally, the hidden-dimension pruning configuration (filename not shown in the diff):

```yaml
defaults:
  - pruning_defaults

activations_log_dir: ${puzzle_dir}/pruning/pruning_scores/hidden_dim_${pruning.activation_hooks_kwargs.method}/${pruning.experiment_id}

activation_hooks_kwargs:
  method: layer_norm_contribution
  target_layer: "layernorm"

# Hidden dimension pruning specific settings
hidden_size_list: [3072, 2048] # Target hidden sizes to prune to
hidden_size_init_mode: "PruneByChannelRanking"
mlp_init_mode: "Truncate" # TODO, make it work with CopyAsIs/FromTeacher
gqa_init_mode: "AverageKV" # TODO, make it work with CopyAsIs/FromTeacher
linear_init_mode: "FromTeacher"
```

---

A review thread on the scope of this PoC:

> Didn't we decide to keep the PoC to just FFN pruning and no attention module replacement?

> We also use attention op/noop, as this is part of the solid compression example we did internally at NVIDIA.