
Segmentation fault on systems with CUDA 13.x drivers due to UCX/RAPIDS dependency chain #30

@marcovarrone

Description

Summary

segger segment crashes with a segmentation fault on systems running NVIDIA driver 590.x (CUDA 13.1). The crash is caused by the UCX communication library, which is loaded transitively through cugraph at import time. UCX calls cuCtxGetDevice_v2 in the system's libcuda.so.1 (CUDA 13.1 driver) before a CUDA context is initialized, causing a segfault before any segger code actually runs.

This issue is not fixable by adjusting the CUDA toolkit or conda environment: libcuda.so.1 is always the system-global driver library. It is also not a transient problem: cuspatial has been archived (July 2025, read-only) and will never receive CUDA 13 builds, meaning segger cannot run on any system with a CUDA 13.x driver without changes to its dependencies.

Environment

  • OS: Linux (RHEL-based), x86_64
  • GPU: 2× NVIDIA RTX A4000 (16 GB each)
  • NVIDIA Driver: 590.48.01 (CUDA 13.1)
  • Python: 3.11.15 (conda-forge)
  • segger: 0.1.0 (installed from dpeerlab/segger main branch)
  • PyTorch: 2.5.0+cu121 (works correctly — torch.cuda.is_available() returns True)
  • RAPIDS: 25.4.x (cudf-cu12, cuml-cu12, cugraph-cu12, cuspatial-cu12)
  • UCX: ucx-py-cu12 0.43.0, libucx-cu12 1.18.1

Reproducing the issue

segger segment -i /path/to/ist/data/ -o /path/to/output/

Immediately crashes with:

[photon:912752:0:912752] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 912752) ====
 0  .../libucs.so(ucs_handle_error+0x294)
 ...
 4  /lib64/libcuda.so.1(+0x31a708)
 5  /lib64/libcuda.so.1(cuCtxGetDevice_v2+0x20)
 6  .../libffi.so.8(+0x702a)
 ...
Segmentation fault (core dumped)

Root cause analysis

The crash occurs in the NVIDIA driver's cuCtxGetDevice_v2 function, called via ctypes/libffi by the UCX library during Python module import. The import chain is:

segger segment
  → segger.cli.segment
    → segger.data.ISTDataModule
      → segger.data.utils.anndata
        → segger.data.utils.neighbors  (line 8: `import cugraph`)
          → cugraph.__init__
            → cugraph.structure.graph_primtypes_wrapper
              → cugraph.dask.__init__
                → cugraph.dask.comms.comms
                  → raft_dask.common.comms
                    → UCX (libucp.so / libucs.so)
                      → libcuda.so.1 cuCtxGetDevice_v2  ← SEGFAULT
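
The chain above can be confirmed without involving segger at all, by logging imports as they happen. A minimal stdlib sketch (the `ImportTracer` name is mine; `python -X importtime -c "import cugraph"` gives similar information with timing):

```python
import importlib.abc
import sys

class ImportTracer(importlib.abc.MetaPathFinder):
    """Print each module name as the import machinery resolves it,
    to pinpoint exactly where a crashing import chain is triggered."""
    def find_spec(self, name, path=None, target=None):
        print("importing:", name)
        return None  # defer to the normal finders; we only observe

sys.meta_path.insert(0, ImportTracer())
```

With the tracer installed, running `import cugraph` should print the full chain down to the module whose import loads UCX, with the last line printed before the segfault identifying the culprit.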

Why it happens

  1. libcuda.so.1 is always system-global. It is provided by the NVIDIA kernel module and cannot be installed per-environment via conda or pip. On this system it is the CUDA 13.1 driver.

  2. UCX probes the CUDA driver at import time by calling cuCtxGetDevice_v2 before any CUDA context has been created. On the CUDA 13.1 driver, this results in a segfault instead of a graceful error return.

  3. RAPIDS cu12 packages ship UCX libraries compiled against CUDA 12.x, creating a mismatch with the CUDA 13.1 system driver.

  4. PyTorch handles this correctly: torch.cuda.is_available() works fine with the same driver, demonstrating that the CUDA 13.1 driver is functional and backward-compatible for well-behaved clients.

  5. cuspatial is archived and will never have CUDA 13 builds. The cuspatial repository was archived by RAPIDS on July 28, 2025. The cuspatial-cu13 entry on PyPI is a zero-version placeholder. This means segger's dependency on cuspatial is a permanent blocker for CUDA 13.x systems — not a temporary gap that will be filled by a future release.

  6. UCX is not needed by segger. UCX provides multi-node multi-GPU communication for Dask distributed workloads. Segger runs single-node and does not use Dask distributed, yet UCX is loaded unconditionally because cugraph imports its dask submodule at package init time.

What was tried (and failed)

| Attempt | Result |
| --- | --- |
| Downgrade conda cuda-toolkit to 12.1 | Same segfault — the toolkit is irrelevant; libcuda.so.1 is always system-global |
| `export UCX_MEMTYPE_CACHE=n; export UCX_TLS=tcp,self` | Same segfault — crash happens before UCX config is read |
| `unset LD_LIBRARY_PATH` | Same segfault |
| Install RAPIDS via conda (`mamba install -c rapidsai`) | Same segfault — conda's UCX also calls into the system libcuda.so.1 |
| `CUDA_VISIBLE_DEVICES=""` | No segfault, but then no GPU is available for computation |
| Uninstall UCX packages (ucx-py-cu12, libucx-cu12, etc.) | No segfault, but `import cugraph` fails with `ImportError: libucp.so.0` because cugraph unconditionally imports its Dask/distributed submodule, which requires UCX |
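
For comparison, a well-behaved client initializes the driver before querying it. The sketch below is an illustration of that pattern (not UCX's actual code path): it calls cuInit() first and returns None instead of crashing when the driver is absent or unusable.

```python
import ctypes

def probe_cuda_driver():
    """Query the CUDA driver version via the driver API, calling
    cuInit() first -- the step an uninitialized cuCtxGetDevice
    probe skips. Returns the version as an int, or None."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # no system driver library at all
    if libcuda.cuInit(0) != 0:  # CUDA_SUCCESS == 0
        return None  # driver present but not usable
    version = ctypes.c_int(0)
    if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None
    return version.value  # encoded as 1000*major + 10*minor
```

On the system described above this should report 13010 (CUDA 13.1) without any crash, which is consistent with PyTorch working fine against the same driver.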

Possible solution

  1. The most durable long-term solution would be to replace cuspatial, but I suspect it would be hard to replicate its functionality.
  2. Replace cugraph so that UCX could be removed entirely, but that does not seem like a very stable solution either.
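
If segger's cugraph usage turns out to be limited to simple graph operations (an assumption; I have not audited every call site), a CPU fallback may be feasible. A minimal sketch of connected components via union-find, the kind of routine a cugraph-free code path could fall back to:

```python
def connected_components(num_nodes, edges):
    """CPU stand-in for a GPU connected-components call:
    union-find with path halving over an edge list."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # union the two components

    # Canonical root label per node; equal labels == same component
    return [find(x) for x in range(num_nodes)]
```

For example, `connected_components(5, [(0, 1), (1, 2), (3, 4)])` groups nodes {0, 1, 2} and {3, 4} under two distinct labels. Whether this scales to segger's workloads is an open question, but it would remove UCX from the picture entirely.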
