> 4GB tensors don't seem to work with torch_compat #198

@stuart-mv

Description

I hit this while saving a DeepSpeed checkpoint. As far as I can tell, any model with more than 4 GB of weights would hit it.

```
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3194, in save_checkpoint
[rank0]:     self._save_checkpoint(save_dir,
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3416, in _save_checkpoint
[rank0]:     self.checkpoint_engine.save(state, save_path)
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
[rank0]:     torch.save(state_dict, path)
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/torch_compat.py", line 480, in _save_wrapper
[rank0]:     return _ORIG_TORCH_SAVE(
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/serialization.py", line 628, in save
[rank0]:     _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/serialization.py", line 840, in _save
[rank0]:     pickler.dump(obj)
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/torch_compat.py", line 160, in dump
[rank0]:     serializer.write_state_dict(self.__tensors)
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/serialization.py", line 4568, in write_state_dict
[rank0]:     self._bulk_write(
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/serialization.py", line 4324, in _bulk_write
[rank0]:     next_pos = self._write_tensor(
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/serialization.py", line 3958, in _write_tensor
[rank0]:     header = _TensorHeaderSerializer(
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/serialization.py", line 465, in __init__
[rank0]:     self.variable_length_segment.pack_into(
[rank0]: struct.error: 'I' format requires 0 <= number <= 4294967295
```
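For context on the error itself (not a claim about tensorizer's internal header layout, which I haven't traced): `struct`'s `'I'` format is a 32-bit unsigned integer, so `pack_into` raises exactly this error for any value above 2³² − 1 = 4294967295. A field packed with `'I'` can therefore never describe a length past 4 GiB, which is consistent with only >4 GB tensors triggering it:

```python
import struct

buf = bytearray(4)

# 2**32 - 1 is the largest value an 'I' (uint32) field can hold.
struct.pack_into("I", buf, 0, 2**32 - 1)

# One past the 32-bit limit reproduces the error from the traceback.
try:
    struct.pack_into("I", buf, 0, 2**32)
except struct.error as e:
    print(e)  # 'I' format requires 0 <= number <= 4294967295
```

A format that needs to express tensor byte sizes beyond 4 GiB would have to use a 64-bit field such as `'Q'` instead.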
