Symmetric memory pytorch backends by saivishal1999 · Pull Request #6023 · NVIDIA/Fuser

saivishal1999 · 2026-03-02T22:55:09Z

Adds support for PyTorch-backed symmetric memory in nvFuser alongside the existing native implementation. This enables using c10d::symmetric_memory (NCCL/NVSHMEM/CUDA) as an alternative backend while keeping the current SymmetricTensor API unchanged.

Introduced SymmetricMemoryBackend (Native, PyTorchNccl, PyTorchNvshmem, PyTorchCuda) with runtime selection.
-- Default behavior unchanged (native backend).
-- PyTorch backend is opt-in via NVFUSER_ENABLE=symmetric_memory_backend(...).
Integrated PyTorch symmetric memory into SymmetricTensor:
-- allocate (empty_strided_p2p)
-- setupRemoteHandles (c10d::symmetric_memory::rendezvous)
-- remoteTensor (get_remote_tensor)
Added process group tracking + registration in Communicator to support rendezvous and ensure proper cleanup.
Extended c10d mocks to allow builds without distributed support.

github-actions · 2026-03-02T22:56:04Z

Review updated until commit 6996d05

Description

Add PyTorch symmetric memory backends (NCCL, NVSHMEM, CUDA) as alternatives to native VMM
Implement getSymmetricMemoryBackend() to select backend via NVFUSER_ENABLE=symmetric_memory_backend option
Integrate PyTorch's c10d::symmetric_memory for allocation, rendezvous, and remote tensor access
Add Communicator methods to expose Store and Backend for PyTorch symmetric memory integration

Changes walkthrough

Relevant files

Enhancement

6 files

ipc_utils.h `Add SymmetricMemoryBackend enum and getter`	+13/-0
ipc_utils.cpp `Implement getSymmetricMemoryBackend option parsing`	+18/-0
symmetric_tensor.h `Add PyTorch symmetric memory handle member`	+15/-6
symmetric_tensor.cpp `Implement PyTorch backend allocation and remote access`	+162/-1
communicator.h `Declare getStore and getWorldBackendIntrusivePtr`	+13/-0
communicator.cpp `Implement getStore and getWorldBackendIntrusivePtr`	+16/-0

Configuration changes

2 files

options.h `Add SymmetricMemoryBackend to EnableOption enum`	+2/-0
options.cpp `Register symmetric_memory_backend enable option`	+1/-0

Tests

1 file

test_multidevice_symmetric_tensor.cpp `Add tests for symmetric memory backend selection`	+108/-0

Miscellaneous

1 file

fbuild.sh `Add build script for development`	+24/-0

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Silent fallback to Native backend

When an invalid argument is passed to symmetric_memory_backend option (e.g., "pytorch_invalid"),
getSymmetricMemoryBackend() silently falls back to Native instead of reporting an error.
This could mask user configuration mistakes. Consider adding validation to warn or error
on unknown backend arguments.

SymmetricMemoryBackend getSymmetricMemoryBackend() {
  if (isOptionEnabled(EnableOption::SymmetricMemoryBackend)) {
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_nccl")) {
      return SymmetricMemoryBackend::PyTorchNccl;
    }
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_nvshmem")) {
      return SymmetricMemoryBackend::PyTorchNvshmem;
    }
    if (hasEnableOptionArgument(
            EnableOption::SymmetricMemoryBackend, "pytorch_cuda")) {
      return SymmetricMemoryBackend::PyTorchCuda;
    }
  }
  return SymmetricMemoryBackend::Native;
}
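One way to act on this observation is to reject unrecognized arguments outright. The following is a minimal standalone sketch, not the PR's code: `parseSymmetricMemoryBackend` and its `(option_enabled, arg)` signature are hypothetical stand-ins for nvFuser's option API, and `std::invalid_argument` stands in for whatever error macro the codebase would use. It assumes that any argument other than the three `pytorch_*` names, including an empty one, should be an error.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Mirrors the enum added by this PR.
enum class SymmetricMemoryBackend {
  Native,
  PyTorchNccl,
  PyTorchNvshmem,
  PyTorchCuda
};

// Hypothetical helper (not nvFuser's option API). `option_enabled` says
// whether symmetric_memory_backend was passed at all; `arg` is its argument.
inline SymmetricMemoryBackend parseSymmetricMemoryBackend(
    bool option_enabled,
    const std::string& arg) {
  if (!option_enabled) {
    return SymmetricMemoryBackend::Native; // default path stays unchanged
  }
  if (arg == "pytorch_nccl") {
    return SymmetricMemoryBackend::PyTorchNccl;
  }
  if (arg == "pytorch_nvshmem") {
    return SymmetricMemoryBackend::PyTorchNvshmem;
  }
  if (arg == "pytorch_cuda") {
    return SymmetricMemoryBackend::PyTorchCuda;
  }
  // Unknown argument: fail loudly instead of falling back to Native.
  throw std::invalid_argument(
      "Unknown symmetric_memory_backend argument: '" + arg + "'");
}
```

With this shape, a typo such as "pytorch_invalid" surfaces at option-parsing time rather than as an unexpected native allocation later.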

PyTorch backend tests commented out

The test PyTorchBackend_RemoteAccessCorrectness (lines 125-163) is commented out. Since this
PR introduces PyTorch symmetric memory backends, having at least one active test for the
non-native paths would be valuable to ensure correctness. Consider enabling or adding an
alternative test for the PyTorch backend path.

// TEST_F(SymmetricTensorTest, PyTorchBackend_RemoteAccessCorrectness) {
//   if (communicator_->size() == 1) {
//     GTEST_SKIP() << "Skipping test for single device";
//   }
//   SymmetricMemoryBackend backend = getSymmetricMemoryBackend();
//   if (backend == SymmetricMemoryBackend::Native) {
//     GTEST_SKIP()
//         << "PyTorch backend not selected; set NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nccl) to run";
//   }

//   const int64_t rank = communicator_->deviceId();
//   const int64_t world_size = communicator_->size();

//   at::Tensor local_tensor = SymmetricTensor::allocate(
//       {256, 512}, at::ScalarType::Float, communicator_->device());
//   SymmetricTensor sym_tensor(local_tensor);

//   EXPECT_TRUE(local_tensor.is_cuda());
//   EXPECT_EQ(local_tensor.numel(), 256 * 512);

//   float local_value = static_cast<float>(rank + 200);
//   local_tensor.fill_(local_value);

//   sym_tensor.setupRemoteHandles();

//   for (int64_t peer_rank = 0; peer_rank < world_size; ++peer_rank) {
//     void* peer_ptr = sym_tensor.remoteTensor(peer_rank).data_ptr();
//     EXPECT_NE(peer_ptr, nullptr);

//     float peer_value;
//     NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpy(
//         &peer_value, peer_ptr, sizeof(float), cudaMemcpyDeviceToHost));

//     float expected_value = static_cast<float>(peer_rank + 200);
//     EXPECT_FLOAT_EQ(peer_value, expected_value)
//         << "Rank " << rank << " reading from rank " << peer_rank
//         << " (PyTorch backend)";
//   }
// }

Unnecessary build script added

A new file fbuild.sh was added which appears to be a local development/build script with
hardcoded paths (e.g., /opt/hpcx/ucc). This should likely be removed from the PR as it's
not part of the feature implementation and contains machine-specific configuration.

#!/bin/bash

export CC=clang-20
export CXX=clang++-20
export LDFLAGS="-fuse-ld=mold"

export NVFUSER_BUILD_ENABLE_PCH

export UCC_HOME="/opt/hpcx/ucc"
export UCC_DIR="/opt/hpcx/ucc/lib/cmake/ucc"
export UCX_HOME="/opt/hpcx/ucx"
export UCX_DIR="/opt/hpcx/ucx/lib/cmake/ucx"

# export TORCH_CUDA_ARCH_LIST="9.0"

export NVFUSER_BUILD_WITH_UCC=1
export NVFUSER_BUILD_INSTALL_DIR=$BUILD_DIRECTORY/nvfuser
export NVFUSER_BUILD_DIR=$BUILD_DIRECTORY

# Enable debug mode, leave empty for non-debug compilation
export NVFUSER_BUILD_BUILD_TYPE=Debug
export RUN_CMAKE=""

pip install -v -e ./python --no-build-isolation

greptile-apps · 2026-03-02T23:01:28Z

Greptile Summary

This PR introduces three new PyTorch-backed symmetric memory allocator/rendezvous options (pytorch_nccl, pytorch_nvshmem, pytorch_cuda) alongside the existing native CUDA VMM path, selectable via NVFUSER_ENABLE=symmetric_memory_backend(...). It also adds c10d::ProcessGroup wrappers in Communicator for rendezvous support, expands c10d_mock.h with ProcessGroup and SymmetricMemory stubs, and adds getSymmetricMemoryBackend plumbing through options.cpp.

Confidence Score: 4/5

PR is close to mergeable; the USE_DISTRIBUTED ProcessGroup guard mismatch is the one remaining tracked issue that can break the PyTorch symmetric memory path in ARM iGPU builds.

Most critical issues from prior review rounds have been addressed (debug print removed, NVF_THROW(false,...) fixed, process_groups_ cleanup tracked, '0' alias tracked in process_groups_). Remaining findings are P2: a silent setupMulticast return and spurious int64_t casts. The USE_DISTRIBUTED/NVFUSER_DISTRIBUTED guard mismatch in getBackendForTeam is acknowledged and in progress but still present.

csrc/multidevice/communicator.cpp (line 422 USE_DISTRIBUTED guard), csrc/multidevice/symmetric_tensor.cpp (setupMulticast silent return)

Important Files Changed

Filename	Overview
csrc/multidevice/symmetric_tensor.cpp	Core PyTorch backend integration: allocate, rendezvous, remote access, and multicast paths. Contains a subtle `setupMulticast` silent-return bug when `torch_symm_handle_` is already set but multicast is unsupported.
csrc/multidevice/communicator.cpp	Adds ProcessGroup wrapper creation/registration in `getBackendForTeam` and `registerProcessGroup`; cleanup loop added. ProcessGroup block is still gated on `USE_DISTRIBUTED` which may diverge from `NVFUSER_DISTRIBUTED` in some builds (acknowledged by developer, in progress).
csrc/multidevice/communicator.h	Adds `process_groups_` map and `registerProcessGroup` declaration; `c10d::ProcessGroup` now available in all builds via mock so no guard needed on the field.
csrc/multidevice/c10d_mock.h	Adds `ProcessGroup`, `SymmetricMemory`, `set_backend`, `empty_strided_p2p`, and `rendezvous` stubs; allows non-distributed builds to compile without guards.
csrc/multidevice/ipc_utils.cpp	Adds `getSymmetricMemoryBackend()` parsing logic; minor cleanups (zero-initialization of C structs, braces around single-line if).
csrc/multidevice/ipc_utils.h	Adds `SymmetricMemoryBackend` enum and `getSymmetricMemoryBackend()` declaration; clean addition.
csrc/multidevice/symmetric_tensor.h	Renames `mc_ptr_` to `multicast_ptr_`, adds `torch_symm_handle_` member, and adds mock include for non-distributed builds.
tests/cpp/test_multidevice_symmetric_tensor.cpp	Adds `ContiguousView` skip guard for non-native backends; debug print removed; `SmallAllocation` test present but PyTorch-backend end-to-end test remains absent.
csrc/options.cpp	Registers `symmetric_memory_backend` as a valid `EnableOption`; straightforward single-line addition.
csrc/options.h	Adds `SymmetricMemoryBackend` enum value with comment; clean addition.

Sequence Diagram

sequenceDiagram
    participant U as User
    participant ST as SymmetricTensor
    participant IS as initSymmMemBackendAndGetGroup
    participant C as Communicator
    participant PT as c10d::symmetric_memory

    U->>ST: allocate(sizes, dtype, device)
    ST->>IS: initSymmMemBackendAndGetGroup(backend)
    IS->>PT: set_backend(NCCL|NVSHMEM)
    IS->>C: getBackendForTeam(all_ranks, kNccl)
    C-->>C: createBackend + registerProcessGroup(group_name, pg)
    IS-->>ST: group_name
    ST->>PT: empty_strided_p2p(sizes, strides, dtype, device, alloc_group_name)
    PT-->>ST: p2p tensor

    U->>ST: SymmetricTensor(p2p_tensor)
    ST-->>ST: world_size_, my_device_id_, requested_size_

    U->>ST: setupRemoteHandles()
    ST->>IS: initSymmMemBackendAndGetGroup(backend)
    IS-->>ST: group_name
    ST->>C: barrier(kNccl)
    ST->>PT: rendezvous(local_tensor_, group_name)
    PT-->>ST: SymmetricMemory handle
    ST-->>ST: are_remote_tensors_setup_=true

    U->>ST: remoteTensor(rank)
    ST->>PT: handle.get_remote_tensor(rank, sizes, dtype)
    PT-->>U: remote tensor

    U->>ST: multicastPtr()
    ST-->>U: multicast_ptr_ (if is_multicast_setup_)

Reviews (18): Last reviewed commit: "Remove guard comments"

greptile-apps

10 files reviewed, 4 comments


fbuild.sh

csrc/multidevice/symmetric_tensor.cpp

tests/cpp/test_multidevice_symmetric_tensor.cpp

greptile-apps · 2026-03-02T23:01:35Z

csrc/multidevice/symmetric_tensor.cpp

+    std::vector<int64_t> strides(sizes.size());
+    strides.back() = 1;
+    for (int64_t i = (int64_t)strides.size() - 2; i >= 0; --i) {


Undefined behavior when sizes is empty (0-dim tensor)

std::vector<int64_t> strides(sizes.size());
strides.back() = 1; // UB if sizes is empty

std::vector::back() on an empty vector is undefined behaviour. The same guard-free pattern also exists in the native path further down in the same function (~line 225). While allocating a 0-dimensional symmetric tensor is unusual, the PyTorch path that was just added adds a new callsite where callers may pass {} as sizes. A simple check is sufficient:

NVF_CHECK(!sizes.empty(), "Cannot allocate a 0-dim symmetric tensor");

or initialise strides defensively (matching the standard row-major convention for 0-dim tensors, which is an empty strides vector) and skip the loop entirely when sizes is empty.
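The defensive variant can be sketched as follows. This is a standalone illustration, and `contiguousStrides` is a hypothetical helper name rather than a function from the PR. It computes row-major strides back to front and never touches `back()`, so an empty `sizes` simply yields an empty strides vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Row-major (contiguous) strides. For a 0-dim shape the loop body never
// runs and the result is an empty vector, so there is no call to back().
inline std::vector<int64_t> contiguousStrides(
    const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size());
  int64_t running = 1;
  // Walk dimensions from innermost to outermost.
  for (std::size_t i = sizes.size(); i-- > 0;) {
    strides[i] = running;
    running *= sizes[i];
  }
  return strides;
}
```

For the `{256, 512}` shape used in the tests, this gives strides `{512, 1}`.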

nsarka · 2026-03-03T21:22:05Z

Sorry! I accidentally hit the button to merge main into the branch. Hopefully it's ok.

greptile-apps · 2026-03-03T21:25:13Z

csrc/multidevice/symmetric_tensor.cpp

+void ensurePyTorchSymmMemBackend(SymmetricMemoryBackend backend) {
+  static std::once_flag once;
+  std::call_once(once, [backend]() {
+    const char* name = nullptr;
+    switch (backend) {
+      case SymmetricMemoryBackend::PyTorchNccl:
+        name = "NCCL";
+        break;
+      case SymmetricMemoryBackend::PyTorchNvshmem:
+        name = "NVSHMEM";
+        break;
+      case SymmetricMemoryBackend::PyTorchCuda:
+        name = "CUDA";
+        break;
+      default:
+        NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
+    }
+    c10d::symmetric_memory::set_backend(name);
+    Communicator& comm = Communicator::getInstance();
+    NVF_CHECK(comm.is_available(), "Communicator not available for symmetric memory");
+    c10d::symmetric_memory::set_group_info(
+        kPyTorchSymmMemGroupName,
+        static_cast<int>(comm.deviceId()),
+        static_cast<int>(comm.size()),
+        comm.getStore());
+  });
+}


std::call_once exception-safety leaves set_backend in a permanently broken state on retry

std::call_once resets its once_flag if the callable exits via an exception, allowing a subsequent call to retry. However, the callable here calls set_backend(name) before set_group_info(...). If set_backend succeeds but set_group_info subsequently throws (e.g., because the store is unavailable), once_flag is reset and the next allocate() call will attempt set_backend(name) a second time. PyTorch's symmetric memory layer is likely to throw on that second set_backend call (backend already configured), making it impossible to recover without restarting the process.

A straightforward mitigation is to separate the two calls into distinct phases or to wrap set_backend in its own protection:

// Separate once-flags for each idempotent step, or catch and suppress
// the "already set" error from set_backend on retry:
try {
  c10d::symmetric_memory::set_backend(name);
} catch (const std::exception& e) {
  // If the backend is already set to the correct name, treat as success.
  // Re-throw otherwise.
}
c10d::symmetric_memory::set_group_info(
    kPyTorchSymmMemGroupName, ...);

Alternatively, split the once_flag so set_backend has its own dedicated guard that truly runs at most once, while set_group_info can retry on failure.
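The "separate phases" idea can be sketched without any PyTorch dependency. In the illustration below, `TwoPhaseInit` is a hypothetical class, the two callbacks stand in for `set_backend` and `set_group_info`, and a mutex with two flags is used in place of `std::call_once` so that each phase records its own completion. If the second phase throws, the first is not repeated on retry.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <stdexcept>
#include <utility>

// Hypothetical helper, not part of nvFuser or PyTorch.
class TwoPhaseInit {
 public:
  TwoPhaseInit(
      std::function<void()> set_backend,
      std::function<void()> set_group_info)
      : set_backend_(std::move(set_backend)),
        set_group_info_(std::move(set_group_info)) {}

  // Safe to call repeatedly; each phase runs successfully at most once.
  void ensure() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!backend_done_) {
      set_backend_();
      backend_done_ = true; // recorded before the second phase can throw
    }
    if (!group_info_done_) {
      set_group_info_(); // may throw; a later ensure() retries only this
      group_info_done_ = true;
    }
  }

 private:
  std::function<void()> set_backend_;
  std::function<void()> set_group_info_;
  std::mutex mutex_;
  bool backend_done_ = false;
  bool group_info_done_ = false;
};
```

The trade-off is a little extra state compared with a single `once_flag`, in exchange for a retry path that cannot re-run the non-idempotent step.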

csrc/multidevice/symmetric_tensor.cpp

tests/cpp/test_multidevice_symmetric_tensor.cpp

samnordmann

Thank you! Some minor comments
Please add test, fix linter, and run the CI with !test command (comment directly on the PR)

Fuser/.github/workflows/lint.yml

Line 83 in 5b210dd

- name: Run lintrunner

tests/cpp/test_multidevice_symmetric_tensor.cpp

csrc/multidevice/symmetric_tensor.h

csrc/multidevice/symmetric_tensor.cpp

samnordmann · 2026-03-10T11:14:51Z

csrc/multidevice/communicator.h

    return store_.get();
  }

+#ifdef NVFUSER_DISTRIBUTED


why do we need guard here?

samnordmann · 2026-03-10T11:14:58Z

csrc/multidevice/communicator.h


 #ifdef NVFUSER_DISTRIBUTED
 #include <torch/csrc/distributed/c10d/Backend.hpp>
+#include <torch/csrc/distributed/c10d/Store.hpp>


samnordmann · 2026-03-10T11:15:24Z

csrc/multidevice/communicator.h

+  // Returns the store as an intrusive_ptr for use with PyTorch symmetric
+  // memory (c10d::symmetric_memory::set_group_info).
+  c10::intrusive_ptr<c10d::Store> getStore() const;
+
+  // Returns the world backend as an intrusive_ptr so it can be registered with
+  // c10d::register_process_group (e.g. for PyTorch symmetric memory NCCL
+  // rendezvous, which resolves the group by name).
+  c10::intrusive_ptr<c10d::Backend> getWorldBackendIntrusivePtr(
+      std::optional<CommunicatorBackend> backend = std::nullopt);


rather, change the signature of the existing getter method to return intrusive_ptr instead of raw pointer

greptile-apps · 2026-03-16T12:02:07Z

csrc/multidevice/communicator.cpp

+std::string Communicator::getSymmMemGroupKey(
+  std::optional<CommunicatorBackend> backend) {
+std::vector<RankType> all_ranks(size_);
+std::iota(all_ranks.begin(), all_ranks.end(), 0);
+CommunicatorBackend b = backend.value_or(default_backend_);
+(void)getBackendForTeam(all_ranks, b, "symm_mem_");
+return getTeamKey(all_ranks, b);
+}


getSymmMemGroupKey returns key without "symm_mem_" prefix — mismatch with registered process group

getBackendForTeam(all_ranks, b, "symm_mem_") registers the process group under the key "symm_mem_" + getTeamKey(all_ranks, b) (see the register_process_group call in that function). However, getSymmMemGroupKey then returns just getTeamKey(all_ranks, b) — without the "symm_mem_" prefix.

The returned key is subsequently used in ensurePyTorchSymmMemBackend as the group_name passed to both set_group_info and rendezvous. Newer NCCL builds resolve the process group by name at rendezvous time; they will look for a process group registered as "nccl0,1,..." but only "symm_mem_nccl0,1,..." exists, causing rendezvous to fail.

The current workaround that registers under "0" papers over this for older NCCL, but the mismatch will surface as soon as the TODO comment is resolved and older-NCCL special-casing is removed.

The return statement should return the full team_key including the prefix:

Suggested change

-std::string Communicator::getSymmMemGroupKey(
-    std::optional<CommunicatorBackend> backend) {
-  std::vector<RankType> all_ranks(size_);
-  std::iota(all_ranks.begin(), all_ranks.end(), 0);
-  CommunicatorBackend b = backend.value_or(default_backend_);
-  (void)getBackendForTeam(all_ranks, b, "symm_mem_");
-  return getTeamKey(all_ranks, b);
-}
+std::string Communicator::getSymmMemGroupKey(
+    std::optional<CommunicatorBackend> backend) {
+  std::vector<RankType> all_ranks(size_);
+  std::iota(all_ranks.begin(), all_ranks.end(), 0);
+  CommunicatorBackend b = backend.value_or(default_backend_);
+  const std::string prefix = "symm_mem_";
+  (void)getBackendForTeam(all_ranks, b, prefix);
+  return prefix + getTeamKey(all_ranks, b);
+}

seems relevant, wdyt ?

I reverted and removed "symm_mem_" prefix from the group name everywhere in my latest patch. Would you say it is important to have a prefix?

Removing the prefix everywhere does fix the mismatch issue, so your code will work correctly now.

However, having the prefix is better practice. Here's why:

Without the prefix, a symmetric memory process group and a regular process group with the same ranks will have identical keys. For example, if you create a regular NCCL group with ranks [0,1,2,3] and then create a symmetric memory group with the same ranks, they'll collide on key "nccl0,1,2,3".

The prefix provides namespace separation — it ensures symmetric memory groups ("symm_mem_nccl0,1,2,3") can coexist with regular groups ("nccl0,1,2,3"). This matters if your code path might create both types of groups for the same rank configuration.

If you're certain that symmetric memory groups will never coexist with regular groups of the same configuration, removing the prefix is fine. But if there's any chance of that happening, keep the prefix for safety.
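The collision argument can be made concrete with a small standalone sketch. `makeTeamKey` below is a hypothetical key builder whose output format is assumed from the example keys quoted in this thread ("nccl0,1,2,3"); it is not the actual `getTeamKey` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical key builder: "<prefix><backend><r0>,<r1>,...".
inline std::string makeTeamKey(
    const std::string& prefix,
    const std::string& backend,
    const std::vector<int64_t>& ranks) {
  std::string key = prefix + backend;
  for (std::size_t i = 0; i < ranks.size(); ++i) {
    if (i > 0) {
      key += ",";
    }
    key += std::to_string(ranks[i]);
  }
  return key;
}
```

With an empty prefix, a regular group and a symmetric memory group over the same ranks map to the same key; with "symm_mem_" they stay distinct, as long as the same prefixed key is used both when registering and when looking up the group.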

> I reverted and removed "symm_mem_" prefix from the group name everywhere in my latest patch. Would you say it is important to have a prefix?

that's up to you. Also, cf #6023 (comment)

greptile-apps · 2026-03-16T12:02:08Z

csrc/multidevice/communicator.h

+  c10::intrusive_ptr<c10d::Store> getStore() const {
+    return c10::intrusive_ptr<c10d::Store>(store_);
+  }


getStore() uses non-idiomatic intrusive_ptr construction

c10::intrusive_ptr<c10d::Store>(store_) passes the raw TCPStore* obtained from store_ (via the implicit operator T* of intrusive_ptr) to a new intrusive_ptr<Store>. This calls the unsafe intrusive_ptr<T>(T*, bool) constructor that takes an already-retained raw pointer — but store_ is managed and this path risks a ref-count imbalance.

The idiomatic way is to let the intrusive_ptr copy-conversion handle it:

Suggested change

-  c10::intrusive_ptr<c10d::Store> getStore() const {
-    return c10::intrusive_ptr<c10d::Store>(store_);
-  }
+  c10::intrusive_ptr<c10d::Store> getStore() const {
+    return store_;
+  }

csrc/multidevice/symmetric_tensor.cpp

greptile-apps · 2026-03-20T00:17:02Z

csrc/multidevice/symmetric_tensor.cpp

+        case SymmetricMemoryBackend::PyTorchCuda:
+          name = "CUDA";
+          break;


set_backend is never called for the PyTorchCuda backend

For PyTorchNccl and PyTorchNvshmem, c10d::symmetric_memory::set_backend(name) is called inside the call_once lambda. For PyTorchCuda, name is assigned "CUDA" but set_backend is never invoked. If PyTorch's symmetric-memory layer requires an explicit set_backend call before allocating with a CUDA transport, every empty_strided_p2p call on the CUDA path will either use whatever backend was previously configured (potentially NCCL or NVSHMEM) or fail silently at rendezvous time.

If PyTorchCuda truly requires no set_backend call (e.g., because "CUDA" is the implicit default), please add a comment explaining this so future maintainers don't perceive it as an oversight. Otherwise, add the missing call:

case SymmetricMemoryBackend::PyTorchCuda:
  name = "CUDA";
  c10d::symmetric_memory::set_backend(name);
  break;

tests/cpp/test_multidevice_symmetric_tensor.cpp

greptile-apps · 2026-03-20T20:35:36Z

csrc/multidevice/symmetric_tensor.cpp

+    if (backend != SymmetricMemoryBackend::Native) {
+      NVF_CHECK(
+          comm.isBackendAvailable(CommunicatorBackend::kNccl),
+          "NCCL backend is required for symmetric_memory_backend(nccl)");


NCCL availability check incorrectly required for all PyTorch backends

isBackendAvailable(CommunicatorBackend::kNccl) is checked unconditionally for every non-Native backend — including PyTorchNvshmem and PyTorchCuda. If those backends don't actually require an NCCL process group (e.g., NVSHMEM uses its own transport), this check will spuriously reject them on systems where NCCL is unavailable.

Additionally, the error message hardcodes "(nccl)" even when the active backend is NVSHMEM or CUDA, which will confuse users:

"NCCL backend is required for symmetric_memory_backend(nccl)"
// fired even when NVFUSER_ENABLE=symmetric_memory_backend(pytorch_nvshmem)

Consider guarding the NCCL check only for PyTorchNccl, and adjusting the error message dynamically:

if (backend == SymmetricMemoryBackend::PyTorchNccl) {
  NVF_CHECK(
      comm.isBackendAvailable(CommunicatorBackend::kNccl),
      "NCCL backend is required for symmetric_memory_backend(pytorch_nccl)");
}
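A message that names the active backend needs some way to turn the enum into text. The sketch below is standalone and illustrative: `toOptionString`, `requiresNccl`, and `ncclRequiredMessage` are hypothetical helpers, "native" is an assumed label (the PR only names the `pytorch_*` arguments), and whether NVSHMEM or CUDA really work without an NCCL process group is left open in this thread, so `requiresNccl` encodes an assumption rather than a confirmed fact.

```cpp
#include <cassert>
#include <string>

// Mirrors the enum added by this PR.
enum class SymmetricMemoryBackend {
  Native,
  PyTorchNccl,
  PyTorchNvshmem,
  PyTorchCuda
};

// Hypothetical helper: maps each backend to its option string.
inline const char* toOptionString(SymmetricMemoryBackend backend) {
  switch (backend) {
    case SymmetricMemoryBackend::Native:
      return "native"; // assumed label, not an option named in the PR
    case SymmetricMemoryBackend::PyTorchNccl:
      return "pytorch_nccl";
    case SymmetricMemoryBackend::PyTorchNvshmem:
      return "pytorch_nvshmem";
    case SymmetricMemoryBackend::PyTorchCuda:
      return "pytorch_cuda";
  }
  return "unknown";
}

// Assumption for this sketch: only the NCCL-based backend needs NCCL.
inline bool requiresNccl(SymmetricMemoryBackend backend) {
  return backend == SymmetricMemoryBackend::PyTorchNccl;
}

// Builds an error message that reflects the backend actually selected.
inline std::string ncclRequiredMessage(SymmetricMemoryBackend backend) {
  return std::string("NCCL backend is required for symmetric_memory_backend(") +
      toOptionString(backend) + ")";
}
```

If it turns out that every PyTorch backend does need NCCL, `requiresNccl` can return true for all non-native values while the message helper stays useful as is.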

greptile-apps · 2026-03-20T20:35:37Z

csrc/multidevice/symmetric_tensor.cpp

+    static std::once_flag once;
+    std::call_once(once, [backend]() {
+      const char* name = nullptr;
+      switch (backend) {
+        case SymmetricMemoryBackend::PyTorchNccl:
+          name = "NCCL";
+          c10d::symmetric_memory::set_backend(name);
+          break;
+        case SymmetricMemoryBackend::PyTorchNvshmem:
+          name = "NVSHMEM";
+          c10d::symmetric_memory::set_backend(name);
+          break;
+        case SymmetricMemoryBackend::PyTorchCuda:
+          name = "CUDA";
+          break;
+        default:
+          NVF_ERROR(false, "Unexpected PyTorch symmetric memory backend");
+      }
+    });


Static once_flag binds to whichever backend is passed first — silently ignores later backends

once is a static std::once_flag, so set_backend(name) is called exactly once for the lifetime of the process. If the flag fires on the first call (e.g., PyTorchCuda), a later call with PyTorchNccl won't call set_backend("NCCL") at all — the wrong (or absent) backend will silently remain active.

In practice a single process shouldn't mix backends, but the current structure provides no error if it does. The typical guard is to also capture the name into a static and assert consistency on subsequent calls:

static std::string configured_name;
std::call_once(once, [backend, &configured_name]() {
  // ... set backend and populate configured_name
});
NVF_CHECK(
    configured_name == expected_name,
    "symmetric memory backend already configured as '", configured_name,
    "', cannot reconfigure to '", expected_name, "'");

Or, at minimum, document that mixing backends within a process is undefined behaviour.
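A standalone version of the consistency check might look like the sketch below. `BackendConfigurator` is a hypothetical class; it uses a mutex-guarded string rather than `std::call_once`, and the place where real code would call `set_backend` is only marked with a comment. The first requested name wins, repeated requests for the same name are no-ops, and a different name is reported as an error instead of being silently ignored.

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of nvFuser or PyTorch.
class BackendConfigurator {
 public:
  // Records the first requested backend name. Throws std::logic_error if a
  // later call asks for a different name.
  void configure(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (configured_name_.empty()) {
      // Real code would call the one-time backend setup here.
      configured_name_ = name;
      return;
    }
    if (configured_name_ != name) {
      throw std::logic_error(
          "symmetric memory backend already configured as '" +
          configured_name_ + "', cannot reconfigure to '" + name + "'");
    }
  }

  std::string configuredName() {
    std::lock_guard<std::mutex> lock(mutex_);
    return configured_name_;
  }

 private:
  std::mutex mutex_;
  std::string configured_name_;
};
```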

csrc/multidevice/communicator.cpp

samnordmann

LGTM overall!
Please cleanup, fix the CI and all the minor issues

samnordmann · 2026-03-23T15:19:47Z

csrc/multidevice/communicator.h

 #include <c10/util/intrusive_ptr.h>

+#if defined(NVFUSER_DISTRIBUTED) && \
+    __has_include(<torch/csrc/distributed/c10d/GroupRegistry.hpp>) && \


what is the rationale behind defining NVFUSER_CAN_REGISTER_C10D_PROCESS_GROUP? In what scenario can the header be missing?

samnordmann · 2026-03-23T15:20:15Z

csrc/multidevice/communicator.h

 #ifdef NVFUSER_DISTRIBUTED
 #include <torch/csrc/distributed/c10d/Backend.hpp>
+#if NVFUSER_CAN_REGISTER_C10D_PROCESS_GROUP
+#include <torch/csrc/distributed/c10d/ProcessGroup.hpp>


this header should always be present, no?

I added this header when I added the process_groups_ variable, so the same guard is used. It wasn't needed before my changes

csrc/multidevice/symmetric_tensor.cpp

samnordmann · 2026-03-23T15:25:26Z

csrc/multidevice/symmetric_tensor.cpp

+    if (backend != SymmetricMemoryBackend::Native) {
+      NVF_CHECK(
+          comm.isBackendAvailable(CommunicatorBackend::kNccl),
+          "NCCL backend is required for symmetric_memory_backend(nccl)");


Suggested change

-          "NCCL backend is required for symmetric_memory_backend(nccl)");
+          "NCCL backend is required for non-native symmetric memory backend: ", backend);

csrc/multidevice/symmetric_tensor.cpp

samnordmann · 2026-03-24T10:50:27Z

Also, please write a PR description

greptile-apps · 2026-03-24T10:54:53Z

csrc/multidevice/symmetric_tensor.cpp

+    static std::once_flag pg0_once;
+    std::call_once(pg0_once, [&]() {
+      try {
+        (void)c10d::resolve_process_group("0");
+      } catch (const c10::Error&) {
+        auto pg = c10d::resolve_process_group(group_name);
+        c10d::register_process_group("0", pg);
+      }
+    });


"0" alias registered but never unregistered on cleanup

c10d::register_process_group("0", pg) is called inside a static std::once_flag lambda that lives in ensurePyTorchSymmMemBackend. The "0" key is never added to process_groups_ in Communicator, so Communicator::cleanup() will not unregister it:

for (const auto& entry : process_groups_) {
  c10d::unregister_process_group(entry.first); // only unregisters team_key, never "0"
}

In test environments that tear down and re-create a Communicator, the stale "0" registration persists across test cases. On the next call to ensurePyTorchSymmMemBackend, pg0_once is permanently fired, so c10d::resolve_process_group("0") succeeds with the old, destroyed process group — and symm-mem rendezvous will silently use it.

The fix is to track the "0" alias and unregister it during cleanup(), or unconditionally overwrite the "0" registration rather than checking first.
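The "track the alias" option can be sketched in isolation. Everything below is hypothetical: `FakeRegistry` stands in for the global c10d group registry, and `TrackedRegistrations` plays the role that `process_groups_` plays in `Communicator`. The point is only that every name this component registers, including the "0" alias, is remembered and unregistered in `cleanup()`, while names registered by someone else are left alone.

```cpp
#include <cassert>
#include <set>
#include <string>

// Stand-in for a global, name-keyed group registry (hypothetical).
struct FakeRegistry {
  std::set<std::string> names;
  void registerGroup(const std::string& name) { names.insert(name); }
  void unregisterGroup(const std::string& name) { names.erase(name); }
  bool has(const std::string& name) const { return names.count(name) != 0; }
};

// Remembers which names it registered so cleanup() can undo exactly those.
class TrackedRegistrations {
 public:
  explicit TrackedRegistrations(FakeRegistry& registry)
      : registry_(registry) {}

  // Registers `name` unless it already exists; only owned names are tracked.
  void add(const std::string& name) {
    if (!registry_.has(name)) {
      registry_.registerGroup(name);
      owned_.insert(name);
    }
  }

  void cleanup() {
    for (const std::string& name : owned_) {
      registry_.unregisterGroup(name);
    }
    owned_.clear();
  }

 private:
  FakeRegistry& registry_;
  std::set<std::string> owned_;
};
```

Note that tracking alone does not reset a function-local `static std::once_flag`; a torn-down and re-created communicator would still need the alias setup to run again.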

what about this ?

taken care of in next commit, process_groups_ variable keeps track if "0" was registered by fuser's symmem

csrc/multidevice/symmetric_tensor.cpp

greptile-apps · 2026-03-25T16:50:50Z

csrc/multidevice/communicator.cpp

+  for (const auto& entry : process_groups_) {
+    c10d::unregister_process_group(entry.first);
+  }
+  process_groups_.clear();


process_groups_ cleanup guard mismatch — compile error when NVFUSER_DISTRIBUTED is set without USE_DISTRIBUTED

process_groups_ is declared in communicator.h under #if defined(NVFUSER_DISTRIBUTED) && defined(USE_DISTRIBUTED), but the cleanup loop here lives under the broader #if defined(NVFUSER_DISTRIBUTED) (without the USE_DISTRIBUTED guard). When a build defines NVFUSER_DISTRIBUTED but not USE_DISTRIBUTED, process_groups_ does not exist as a member, yet this code tries to iterate over it — a hard compile error.

c10d::unregister_process_group (from GroupRegistry.hpp) is already included under #ifdef NVFUSER_DISTRIBUTED, so fixing just the guard on these lines is sufficient:

Suggested change

-  for (const auto& entry : process_groups_) {
-    c10d::unregister_process_group(entry.first);
-  }
-  process_groups_.clear();
+#if defined(USE_DISTRIBUTED)
+  for (const auto& entry : process_groups_) {
+    c10d::unregister_process_group(entry.first);
+  }
+  process_groups_.clear();
+#endif

(The surrounding #if defined(NVFUSER_DISTRIBUTED) / #endif already provides the outer distributed guard.)

this comment is relevant -- but the real fix is to replace USE_DISTRIBUTED by NVFUSER_DISTRIBUTED wherever it appears (we don't use USE_DISTRIBUTED in the codebase)

Got it. You're right—USE_DISTRIBUTED should be replaced with NVFUSER_DISTRIBUTED everywhere. The codebase currently has process_groups_ guarded by both macros, but the cleanup code only checks one, which causes the mismatch.

USE_DISTRIBUTED appears in 4 files:

csrc/multidevice/communicator.h (line 14, line 162) — the header where process_groups_ is declared

csrc/multidevice/communicator.cpp (line 414) — where process_groups_ is populated

csrc/multidevice/symmetric_tensor.cpp (multiple locations) — where symmetric memory uses USE_DISTRIBUTED

csrc/multidevice/symmetric_tensor.h (line 88) — where torch_symm_handle_ is declared

All of these #if defined(NVFUSER_DISTRIBUTED) && defined(USE_DISTRIBUTED) guards should become #if defined(NVFUSER_DISTRIBUTED).

The Python files (python/utils.py, python/tools/gen_nvfuser_version.py) and CMakeLists.txt also reference USE_DISTRIBUTED for build configuration, so those would need updating too depending on your build system changes.

There was a build "build_image_ARM_igpu_20_TNVF" which was failing - it was setting NVFUSER_DISTRIBUTED to ON but USE_DISTRIBUTED to False during cmake configuration. So I had to explicitly add USE_DISTRIBUTED checks. I'll add the missing guard here.

ok, I didn't know
cc @wujingyue to make sure you are aware of this

Thanks for tagging me.

> the real fix is to replace USE_DISTRIBUTED by NVFUSER_DISTRIBUTED wherever it appears (we don't use USE_DISTRIBUTED in the codebase)

That sounds right.

I'm not sure why NVFUSER_DISTRIBUTED was created in the first place -- it was there before I joined the project. We could use USE_DISTRIBUTED throughout the codebase. I guess NVFUSER_DISTRIBUTED gives an extra control of turning nvFuser distributed off even when pytorch distributed is on. But I'm not sure that use case is relevant.

> it was setting NVFUSER_DISTRIBUTED to ON but USE_DISTRIBUTED to False

Are you sure about this given

Fuser/CMakeLists.txt

Line 47 in 771eddd

cmake_dependent_option(NVFUSER_DISTRIBUTED "" ON "USE_DISTRIBUTED" OFF)

?

Also note

Fuser/csrc/multidevice/c10d_mock.h

Lines 12 to 19 in 771eddd

// nvFuser is sometimes built on a pytorch without c10d. When that
// happens, c10d isn't linked, NVFUSER_DISTRIBUTED is undefined and the
// multi-GPU component of nvFuser is expected to be disabled.
//
// Instead of adding `#ifdef NVFUSER_DISTRIBUTED` in too many places, this file
// provides a buildable mock implementation of c10d to keep nvFuser code less
// divergent. This implementation won't run because tests and user code are
// guarded by Communicator::is_available.

I think we can avoid the #if by defining a mock.

@wujingyue I saw that USE_DISTRIBUTED was false and NVFUSER_DISTRIBUTED was on in logs here https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/jobs/287606349/raw.

@saivishal1999 The other guard still looks missing here? Did you attempt the mock approach -- that should ideally work

I'm adding mocks and removing these guards in future commits, builds/tests are passing.

greptile-apps · 2026-03-25T17:29:13Z

csrc/multidevice/symmetric_tensor.cpp

+    static std::once_flag pg0_once;
+    std::call_once(pg0_once, [&]() {
+      try {
+        (void)c10d::resolve_process_group("0");
+      } catch (const std::exception&) {
+        // resolve_process_group throws c10d Error
+        // (derives from std::exception)
+        auto pg = c10d::resolve_process_group(group_name);
+        c10d::register_process_group("0", pg);
+      }
+    });
+
+    comm.barrier(CommunicatorBackend::kNccl);
+    return group_name;


Barrier fires on every call, not just during initial setup

comm.barrier(CommunicatorBackend::kNccl) is placed outside the pg0_once lambda, so it executes on every invocation of ensurePyTorchSymmMemBackend — including every subsequent call to allocate() and setupRemoteHandles() after initialization has already completed. This is both a performance concern (unnecessary NCCL barrier per allocation/rendezvous) and a correctness risk: if any caller ever invokes allocate() or setupRemoteHandles() asymmetrically across ranks (different call counts), these extra barriers will hang.

The barrier is only needed once — after pg0_once — to ensure all ranks have completed the group alias registration before any rank proceeds to use it. Moving it inside the pg0_once lambda would fix both concerns:

    static std::once_flag pg0_once;
    std::call_once(pg0_once, [&]() {
      try {
        (void)c10d::resolve_process_group("0");
      } catch (const std::exception&) {
        auto pg = c10d::resolve_process_group(group_name);
        c10d::register_process_group("0", pg);
      }
      comm.barrier(CommunicatorBackend::kNccl); // moved inside once-flag
    });

seems relevant too, wdyt?
Why do we need a barrier here? Please add a short comment in the code to explain.

The barrier ensures all ranks have completed the process group "0" alias registration before proceeding. Without it, there's a race: one rank could try to resolve "0" before another rank has finished registering it.

As mentioned in the earlier comment, the barrier should be moved inside the pg0_once lambda—it only needs to run once after registration, not on every call to this function:

Suggested change

-    static std::once_flag pg0_once;
-    std::call_once(pg0_once, [&]() {
-      try {
-        (void)c10d::resolve_process_group("0");
-      } catch (const std::exception&) {
-        // resolve_process_group throws c10d Error
-        // (derives from std::exception)
-        auto pg = c10d::resolve_process_group(group_name);
-        c10d::register_process_group("0", pg);
-      }
-    });
-
-    comm.barrier(CommunicatorBackend::kNccl);
-    return group_name;
+    static std::once_flag pg0_once;
+    std::call_once(pg0_once, [&]() {
+      try {
+        (void)c10d::resolve_process_group("0");
+      } catch (const std::exception&) {
+        // resolve_process_group throws c10d Error
+        // (derives from std::exception)
+        auto pg = c10d::resolve_process_group(group_name);
+        c10d::register_process_group("0", pg);
+      }
+      // Barrier ensures all ranks have registered the "0" alias before any
+      // rank proceeds to use it. Without this, races can occur where one rank
+      // tries to resolve "0" before another rank has finished registering it.
+      comm.barrier(CommunicatorBackend::kNccl);
+    });

@samnordmann I moved the barrier call inside the setupRemoteHandles function.
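To make the difference between the two placements concrete, here is a small single-process sketch. `DemoComm` and both helper functions are hypothetical; `barrier()` only counts calls and stands in for comm.barrier(CommunicatorBackend::kNccl). The diff above uses a function-local static once_flag; this sketch takes the flag as a parameter only so both variants can be exercised side by side.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the communicator: barrier() just counts calls.
struct DemoComm {
  int barrier_calls = 0;
  void barrier() { ++barrier_calls; }
};

// Variant A: barrier outside the once-flag, so it runs on every call.
inline void ensureSetupBarrierOutside(
    DemoComm& comm, std::once_flag& once, int& registrations) {
  std::call_once(once, [&]() { ++registrations; });
  comm.barrier();
}

// Variant B: barrier inside the once-flag, so it runs only with the
// one-time registration.
inline void ensureSetupBarrierInside(
    DemoComm& comm, std::once_flag& once, int& registrations) {
  std::call_once(once, [&]() {
    ++registrations;
    comm.barrier();
  });
}
```

Calling each variant three times registers once in both cases, but variant A issues three barriers and variant B issues one. With a real collective barrier, the extra calls in variant A only complete if every rank makes the same number of calls.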

wujingyue · 2026-03-25T17:42:56Z

csrc/multidevice/symmetric_tensor.h

  bool is_contiguous_view_setup_ = false;
  at::Tensor contiguous_view_;
+#if defined(NVFUSER_DISTRIBUTED) && defined(USE_DISTRIBUTED)
+  c10::intrusive_ptr<c10d::symmetric_memory::SymmetricMemory>


Do you have a bare minimum example of using c10d::symmetric_memory::SymmetricMemory and at::Tensor without any nvFuser? I ask this because I feel this class has lots of fields that are irrelevant for c10d SymmetricMemory, but I could be terribly wrong.

Here are some tests using c10d::symmetric_memory. https://github.com/pytorch/pytorch/blob/main/test/distributed/test_symmetric_memory.py
Does this help?
Yes, except for one or two fields like local_tensor_ and mc_ptr_ no other fields are used by c10d::symmetric_memory.

Does this help?

Yes -- thanks! I understood the Python API. When I say c10d, I was mostly trying to figure out the C++ side of things. Looks like it's a roughly one-to-one correspondence.

no other fields are used by c10d::symmetric_memory.

Given that, would you consider a separate data structure encapsulating just what you need? Inheritance comes with coupling: https://media.pragprog.com/titles/tpp20/inheritance-tax.pdf

In addition, I'm wondering whether the multicast pointer needs to be packed in SymmetricTensor. Can you remind me how you plan to use it so I can also think about it? I haven't seen this PR try to use it, but I could be missing something.

Sorry, I don't fully understand the reason for encapsulating them in a separate data structure. Just to be sure: local_tensor_ and mc_ptr_ are used in both the nvFuser native and c10d::symmem paths; the c10d methods only take these fields as arguments, and the fields aren't typecast anywhere.

It's mainly to decouple the c10d::symmem path from (many) fields that are used only in native. Am I missing anything?

Methods like allocate, setupRemoteHandles, and remoteTensor are all methods of the SymmetricTensor class. We wanted to add the torch backends without changing that basic interface, so only a minimal change was made to SymmetricTensor to accommodate them: just torch_symm_handle_ was added. Regarding the other design choices, @samnordmann, can you please clarify?

I'm not sure whether the multicast pointer should be packed together with the local tensor.

I think I figured this out from the documentation. symm_mem.rendezvous returns a handle containing remote buffers (one per GPU) and optionally a multicasting buffer. Writing to the local tensor doesn't change any remote data; writing to a remote buffer only changes the local tensor on that particular GPU; writing to the multicasting buffer changes the local tensors on all GPUs. That's why SymmetricTensor contains local_tensor_, remote_buffers_ and mc_ptr_.

Am I understanding this correctly?
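The three write semantics described above can be illustrated with a single-process toy model. This is not the c10d API: `DemoSymmetricMemory`, `local`, `remote`, and `multicastWrite` are hypothetical names, and the "GPUs" are just vectors in one process.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: one buffer per rank, all living in the same process.
class DemoSymmetricMemory {
 public:
  DemoSymmetricMemory(std::size_t world_size, std::size_t numel)
      : buffers_(world_size, std::vector<int>(numel, 0)) {}

  // Local tensor of `rank`: a write here is visible on that rank only.
  std::vector<int>& local(std::size_t rank) { return buffers_.at(rank); }

  // Remote buffer for `peer`: a view of that peer's local tensor, so a
  // write through it changes only that peer's data.
  std::vector<int>& remote(std::size_t peer) { return buffers_.at(peer); }

  // Multicast write: a single store lands in every rank's local tensor.
  void multicastWrite(std::size_t index, int value) {
    for (auto& buffer : buffers_) {
      buffer.at(index) = value;
    }
  }

 private:
  std::vector<std::vector<int>> buffers_;
};
```

With three ranks, writing `local(0)[0]` leaves ranks 1 and 2 untouched, writing `remote(2)[1]` changes only rank 2, and `multicastWrite(0, 9)` sets element 0 on all three ranks.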

Looks like c10d::symmetric_memory::SymmetricMemory (the type of torch_symm_handle_) already holds remote buffers and the multicast pointer. So for the c10d path we don't need to store another version of remote_buffers_ and mc_ptr_ separately.

So, if we make a new class NewSymmetricTensor (need a better name...) for c10d, that class will only need to contain the local tensor and a c10d::symmetric_memory::SymmetricMemory. Is that right?

Yes, that's correct: if we make a new class NewSymmetricTensor for c10d, that class will only need to contain the local tensor and a c10d::symmetric_memory::SymmetricMemory.

if we make a new class NewSymmetricTensor for c10d

That's what I would do in this PR. But given nv/jingyue-handoff, I'll defer the review to @Priya2698
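The narrower class discussed in this thread could look roughly like the sketch below. All types are hypothetical stand-ins: `DemoTensor` replaces at::Tensor, `DemoSymmHandle` replaces c10d::symmetric_memory::SymmetricMemory, and `DemoC10dSymmetricTensor` is a placeholder for the class that still needs a better name. The point is only that the wrapper stores two members, because the handle already owns the remote buffers and the multicast pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for at::Tensor.
using DemoTensor = std::vector<float>;

// Stand-in for the symmetric-memory handle: it already holds one remote
// buffer per rank and an optional multicast pointer.
struct DemoSymmHandle {
  std::vector<DemoTensor*> remote_buffers;
  float* multicast_ptr = nullptr;  // null when multicast is unavailable
};

// Wrapper that keeps only the local tensor and the handle, and forwards
// remote and multicast queries to the handle.
class DemoC10dSymmetricTensor {
 public:
  DemoC10dSymmetricTensor(
      DemoTensor local, std::shared_ptr<DemoSymmHandle> handle)
      : local_(std::move(local)), handle_(std::move(handle)) {}

  DemoTensor& localTensor() { return local_; }

  DemoTensor& remoteTensor(std::size_t rank) const {
    return *handle_->remote_buffers.at(rank);
  }

  bool hasMulticast() const { return handle_->multicast_ptr != nullptr; }

 private:
  DemoTensor local_;
  std::shared_ptr<DemoSymmHandle> handle_;
};
```

None of the native-only fields appear here, which is the decoupling argued for above. Whether that is worth a second class next to SymmetricTensor is the open design question.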

saivishal1999 · 2026-03-25T17:50:40Z

!test

saivishal1999 · 2026-03-30T13:49:26Z

!test

saivishal1999 · 2026-03-30T15:49:23Z

!test

samnordmann · 2026-03-30T16:10:44Z

!test

wujingyue · 2026-03-25T21:41:25Z

wujingyue

Sorry, some of my comments went pending and didn't land. Should be fixed now.

samnordmann · 2026-03-31T09:30:00Z

csrc/multidevice/communicator.cpp

+  for (const auto& entry : process_groups_) {
+    c10d::unregister_process_group(entry.first);
+  }
+  process_groups_.clear();


this comment is relevant -- but the real fix is to replace USE_DISTRIBUTED by NVFUSER_DISTRIBUTED wherever it appears (we don't use USE_DISTRIBUTED in the codebase)

samnordmann · 2026-03-31T09:41:35Z

csrc/multidevice/communicator.cpp

+std::string Communicator::getSymmMemGroupKey(
+    std::optional<CommunicatorBackend> backend) {
+  std::vector<RankType> all_ranks(size_);
+  std::iota(all_ranks.begin(), all_ranks.end(), 0);
+  CommunicatorBackend b = backend.value_or(default_backend_);
+  (void)getBackendForTeam(all_ranks, b, "symm_mem_");
+  return getTeamKey(all_ranks, b);
+}
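A standalone sketch of the pattern in this diff: build the list of all ranks with std::iota, then derive a string key for that team. The key format, `demoTeamKey`, `demoWorldGroupKey`, and the `RankType` alias are assumptions made for illustration; the real getTeamKey may format keys differently.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

using RankType = int;  // assumption: the real alias may be a different type

// Hypothetical key format: "<backend>:<r0>,<r1>,...".
inline std::string demoTeamKey(
    const std::vector<RankType>& team, const std::string& backend) {
  std::string key = backend + ":";
  for (std::size_t i = 0; i < team.size(); ++i) {
    if (i != 0) {
      key += ",";
    }
    key += std::to_string(team[i]);
  }
  return key;
}

// World team = ranks 0..world_size-1, filled with std::iota as in the diff.
inline std::string demoWorldGroupKey(
    std::size_t world_size, const std::string& backend) {
  std::vector<RankType> all_ranks(world_size);
  std::iota(all_ranks.begin(), all_ranks.end(), 0);
  return demoTeamKey(all_ranks, backend);
}
```

Every rank computes the same key from the same inputs, which is what lets the key double as a group name during rendezvous.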


samnordmann · 2026-03-31T10:08:09Z

csrc/multidevice/communicator.cpp

+#if defined(NVFUSER_DISTRIBUTED) && defined(USE_DISTRIBUTED)
+    std::optional<c10d::ProcessGroup::BackendType> pg_backend =
+        (b == CommunicatorBackend::kNccl)
+        ? std::optional<c10d::ProcessGroup::BackendType>(
+              c10d::ProcessGroup::BackendType::NCCL)
+        : std::nullopt;
+    if (backends_[team_key] != nullptr && pg_backend.has_value()) {
+      auto rank_it = std::ranges::find(team.begin(), team.end(), deviceId());
+      RankType team_rank = std::distance(team.begin(), rank_it);
+
+      auto pg = c10::make_intrusive<c10d::ProcessGroup>(
+          c10::make_intrusive<c10d::PrefixStore>(team_key, store_),
+          team_rank,
+          static_cast<int>(team.size()));
+      pg->setBackend(c10::DeviceType::CUDA, *pg_backend, backends_[team_key]);
+      pg->setDefaultBackend(*pg_backend);
+      pg->setGroupName(team_key);
+
+      c10d::register_process_group(team_key, pg);
+      process_groups_[team_key] = std::move(pg);
+    }


Can you explain why we need this change? I'm not sure I understand the logic and motivation. It seems like an old artifact: process_groups_ doesn't seem to be read anywhere. Please clarify.

I added this to keep track of the process groups registered by nvFuser's symmetric-memory path so that they can be unregistered during cleanup, and to check whether a group is already registered. In the next commit you'll see that I read this variable's keys and use it during cleanup.
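The bookkeeping described here can be sketched without c10d. In this hypothetical model, `DemoGlobalRegistry` stands in for the process-wide registry behind register_process_group and unregister_process_group, and `DemoGroupTracker` stands in for the Communicator's process_groups_ map. The map serves both purposes mentioned above: it answers "already registered?" and it drives cleanup.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Stand-in for the process-wide registry of group names.
struct DemoGlobalRegistry {
  std::unordered_set<std::string> names;
};

struct DemoProcessGroup {
  std::string name;
};

class DemoGroupTracker {
 public:
  explicit DemoGroupTracker(DemoGlobalRegistry& registry)
      : registry_(registry) {}
  ~DemoGroupTracker() { cleanup(); }

  // Registers `key` at most once; returns false if it was already tracked.
  bool registerGroup(const std::string& key) {
    if (process_groups_.count(key) != 0) {
      return false;
    }
    registry_.names.insert(key);
    process_groups_[key] =
        std::make_shared<DemoProcessGroup>(DemoProcessGroup{key});
    return true;
  }

  // Unregisters exactly the groups this object registered.
  void cleanup() {
    for (const auto& entry : process_groups_) {
      registry_.names.erase(entry.first);
    }
    process_groups_.clear();
  }

 private:
  DemoGlobalRegistry& registry_;
  std::unordered_map<std::string, std::shared_ptr<DemoProcessGroup>>
      process_groups_;
};
```

A second `registerGroup` call with the same key is a no-op, and destroying the tracker leaves the global registry without any of its entries.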

samnordmann · 2026-03-31T10:09:43Z

csrc/multidevice/communicator.h

  std::unordered_map<std::string, c10::intrusive_ptr<c10d::Backend>> backends_;
+  // c10d process-group wrappers registered for symmetric-memory rendezvous.
+#if defined(NVFUSER_DISTRIBUTED) && defined(USE_DISTRIBUTED)
+  std::unordered_map<std::string, c10::intrusive_ptr<c10d::ProcessGroup>>


Please make sure c10d_mock.h is up to date to avoid compilation issues in non-distributed mode.

Also, can you explain (and add a comment in the code) why we need ProcessGroup here?

samnordmann · 2026-03-31T10:10:22Z

csrc/multidevice/symmetric_tensor.cpp

+    static std::once_flag pg0_once;
+    std::call_once(pg0_once, [&]() {
+      try {
+        (void)c10d::resolve_process_group("0");
+      } catch (const c10::Error&) {
+        auto pg = c10d::resolve_process_group(group_name);
+        c10d::register_process_group("0", pg);
+      }
+    });


what about this ?

samnordmann · 2026-03-31T10:11:30Z

csrc/multidevice/symmetric_tensor.cpp

+    static std::once_flag pg0_once;
+    std::call_once(pg0_once, [&]() {
+      try {
+        (void)c10d::resolve_process_group("0");
+      } catch (const std::exception&) {
+        // resolve_process_group throws c10d Error
+        // (derives from std::exception)
+        auto pg = c10d::resolve_process_group(group_name);
+        c10d::register_process_group("0", pg);
+      }
+    });
+
+    comm.barrier(CommunicatorBackend::kNccl);
+    return group_name;


seems relevant too, wdyt?
Why do we need a barrier here? Please add a short comment in the code to explain.

Priya2698 · 2026-04-09T22:40:48Z

csrc/multidevice/communicator.cpp

+  for (const auto& entry : process_groups_) {
+    c10d::unregister_process_group(entry.first);
+  }
+  process_groups_.clear();


@saivishal1999 The other guard still looks missing here? Did you attempt the mock approach -- that should ideally work

saivishal1999 · 2026-04-10T12:38:49Z

!test

saivishal1999 · 2026-04-10T13:12:44Z

!test

saivishal1999 · 2026-04-10T15:13:10Z

!test

Priya2698

LGTM.

saivishal1999 added 3 commits February 27, 2026 04:40

Initial implementation of symmetric memory backend for PyTorch

14fd212

Initital changes to add pytorch symmetric memory backend

5646c03

Initial pytorch symmetric memory backend changes

14816aa

saivishal1999 requested a review from samnordmann March 2, 2026 22:55

greptile-apps bot reviewed Mar 2, 2026

Merge branch 'main' into symmetric-memory-pytorch-backends

6996d05

greptile-apps bot reviewed Mar 3, 2026

Initial review comments

49d669c

greptile-apps bot reviewed Mar 9, 2026

samnordmann reviewed Mar 10, 2026

saivishal1999 added 2 commits March 16, 2026 13:55

Alloc, rendezvous passing

8962475

Merge branch 'main' into symmetric-memory-pytorch-backends

62c6945

greptile-apps bot reviewed Mar 16, 2026

multicast pending

67181c8

greptile-apps bot reviewed Mar 17, 2026

all backends passing

eea57d8

greptile-apps bot reviewed Mar 20, 2026

saivishal1999 added 2 commits March 20, 2026 22:24

delete build file

a9ddffd

Merge branch 'main' into symmetric-memory-pytorch-backends

f9cac71

greptile-apps bot reviewed Mar 20, 2026

samnordmann reviewed Mar 23, 2026

Lint errors and review comments

8e62ccc

greptile-apps bot reviewed Mar 24, 2026

fix 3 lint errors

1be0134

greptile-apps bot reviewed Mar 24, 2026

Fix clang-tidy errors

3596301

Add torch distributed gaurd

6147139

saivishal1999 requested a review from wujingyue March 25, 2026 16:47

greptile-apps bot reviewed Mar 25, 2026

wujingyue requested a review from Priya2698 March 25, 2026 17:16

Merge branch 'main' into symmetric-memory-pytorch-backends

b5a2418

greptile-apps bot reviewed Mar 25, 2026

wujingyue reviewed Mar 25, 2026

wujingyue reviewed Mar 30, 2026

samnordmann reviewed Mar 31, 2026

Address pending review cmnts

af128e4

greptile-apps bot reviewed Apr 3, 2026

Priya2698 reviewed Apr 10, 2026

saivishal1999 added 2 commits April 10, 2026 14:33

Add mocks for c10d

828573d

Merge branch 'main' into symmetric-memory-pytorch-backends

294a867

greptile-apps bot reviewed Apr 10, 2026

Fix missing guard for process_groups

6a5d3c3

greptile-apps bot reviewed Apr 10, 2026

include mock header for non distributed build

2908e70

greptile-apps bot reviewed Apr 10, 2026

Remove guard comments

aa4f9c1

Priya2698 approved these changes Apr 10, 2026

	"NCCL backend is required for symmetric_memory_backend(nccl)");
	"NCCL backend is required for non-native symmetric memory backend: , backend");
