Fix DMA-BUF Export/Import with PyTorch Caching Allocator Offsets#348
Merged
Fix DMA-BUF Export/Import with PyTorch Caching Allocator Offsets#348
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This pull request fixes a critical bug in DMA-BUF export/import functionality when working with PyTorch's caching allocator. The issue arose because hipMemGetHandleForAddressRange exports handles to entire allocation buffers rather than specific sub-allocations, causing data corruption when tensors had non-zero offsets within those buffers.
Changes:
- Modified
export_dmabuf_handle()to return a tuple(fd, base_ptr, base_size)instead of justfd, usinghipMemGetAddressRangeto query the base allocation information - Updated
import_dmabuf_handle()to accept offset parameters and correctly map to the intended memory location - Updated
TorchAllocatorto pack/unpack the additional metadata when exchanging handles between ranks - Added comprehensive tests including a dedicated offset test to validate the fix
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| iris/hip.py | Added hipMemGetAddressRange call to query base allocation; modified return type to tuple; added offset calculation in import function |
| iris/allocators/torch_allocator.py | Updated handle exchange to pack/unpack metadata (base_ptr, base_size, heap_ptr); updated import call to use offset parameters |
| tests/unittests/test_dmabuf_apis.py | Updated all existing tests for new API; added test_dmabuf_with_offset() to explicitly test offset handling |
Increase bfloat16 tolerance from 1.5 to 2.5 to handle worst-case numerical precision with large matrices (1024x256x512) and 8-rank all_reduce operations. CI was seeing max difference of 2.0. Co-authored-by: Cursor <cursoragent@cursor.com>
- run_core_tests.sh: Quick validation (examples + unittests, 1-8 ranks) - run_all_tests.sh: Full CI-style testing (all 5 dirs, 1-8 ranks) Both scripts: - Create timestamped log subdirectories (logs/TIMESTAMP/) - Generate _all.log mega log capturing everything - Create individual test logs for debugging specific failures Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI
added a commit
that referenced
this pull request
Feb 4, 2026
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
DMA-BUF export/import was failing when tensors were suballocated from PyTorch's caching allocator.
hipMemGetHandleForAddressRangeexports a handle to the entire allocation buffer, not just the specific tensor's memory. Without offset correction, imports would map to the wrong location, causing data corruption.This became apparent after importing
iris.ops(which loads tritonBLAS), causing subsequent tensors to be suballocated with non-zero offsets.Technical Details
Changes to
iris/hip.py:export_dmabuf_handle()- Now returns(fd, base_ptr, base_size)tuple. UseshipMemGetAddressRange()to query allocation base and size.import_dmabuf_handle()- Now acceptsoriginal_ptrandbase_ptrparameters. Calculates offset and returnsmapped_base + offset.Changes to
iris/allocators/torch_allocator.py:(base_ptr, base_size, heap_ptr)along with FD usingstruct.pack('QQQ', ...).Changes to
tests/unittests/test_dmabuf_apis.py:test_dmabuf_with_offset()to explicitly validate offset handling.Breaking Change: API changed from returning single
fdto tuple(fd, base_ptr, base_size).Test Plan
test_dmabuf_with_offset()explicitly tests the fix by allocating two tensors (forcing offset > 0) and verifying correct data importTest Result
✅ All 6 DMA-BUF tests pass:
Submission Checklist