> 4GB tensors don't seem to work with torch_compat #198
Closed
Description
I hit the error below while saving a DeepSpeed checkpoint through tensorizer's torch_compat. As far as I can see, any model with more than 4 GB of weights would hit it.
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3194, in save_checkpoint
[rank0]:     self._save_checkpoint(save_dir,
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3416, in _save_checkpoint
[rank0]:     self.checkpoint_engine.save(state, save_path)
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
[rank0]:     torch.save(state_dict, path)
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/torch_compat.py", line 480, in _save_wrapper
[rank0]:     return _ORIG_TORCH_SAVE(
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/serialization.py", line 628, in save
[rank0]:     _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/torch/serialization.py", line 840, in _save
[rank0]:     pickler.dump(obj)
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/torch_compat.py", line 160, in dump
[rank0]:     serializer.write_state_dict(self.__tensors)
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/serialization.py", line 4568, in write_state_dict
[rank0]:     self._bulk_write(
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/serialization.py", line 4324, in _bulk_write
[rank0]:     next_pos = self._write_tensor(
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/serialization.py", line 3958, in _write_tensor
[rank0]:     header = _TensorHeaderSerializer(
[rank0]:   File "/opt/conda/envs/pytorch/lib/python3.10/site-packages/tensorizer/serialization.py", line 465, in __init__
[rank0]:     self.variable_length_segment.pack_into(
[rank0]: struct.error: 'I' format requires 0 <= number <= 4294967295
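
For reference, the final `struct.error` can be reproduced with the standard library alone. The sketch below is not tensorizer code: `pack_length`, `fits_in_u32`, and the 5 GiB size are hypothetical stand-ins, and the traceback does not show which header field actually overflowed. It only illustrates that the `I` format is an unsigned 32-bit integer, so any value above 4294967295 (2**32 - 1) fails to pack, while the 64-bit `Q` format accepts it.

```python
import struct

# Largest value that struct's 'I' (unsigned 32-bit) format can hold;
# this is the 4294967295 bound quoted in the error message.
U32_MAX = 2**32 - 1


def fits_in_u32(value: int) -> bool:
    """Return True if `value` can be packed with the 'I' format."""
    return 0 <= value <= U32_MAX


def pack_length(value: int) -> bytes:
    """Hypothetical helper: pack `value` as a little-endian uint32."""
    return struct.pack("<I", value)


# Hypothetical size for illustration: 5 GiB, i.e. above the uint32 limit.
nbytes = 5 * 1024**3

try:
    pack_length(nbytes)
    overflowed = False
except struct.error as exc:
    # Same exception type as the last line of the traceback above.
    overflowed = True
    print(f"struct.error: {exc}")

# An unsigned 64-bit field ('Q') holds the same value without error.
packed_u64 = struct.pack("<Q", nbytes)
print(overflowed, len(packed_u64))
```

If the overflowing field is indeed a size or length derived from the tensor, this would be consistent with the failure appearing only once a tensor passes 4 GB, but that is an assumption rather than something the traceback confirms.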