Segmentation fault on systems with CUDA 13.x drivers due to UCX/RAPIDS dependency chain #30
Description
Summary
segger segment crashes with a segmentation fault on systems running NVIDIA driver 590.x (CUDA 13.1). The crash is caused by the UCX communication library, which is loaded transitively through cugraph at import time. UCX calls cuCtxGetDevice_v2 in the system's libcuda.so.1 (CUDA 13.1 driver) before a CUDA context is initialized, causing a segfault before any segger code actually runs.
This issue is not fixable by adjusting the CUDA toolkit or conda environment — libcuda.so.1 is always the system-global driver library. It is also not a transient problem: cuspatial has been archived (July 2025, read-only) and will never receive CUDA 13 builds, meaning segger cannot run on any system with a CUDA 13.x driver without changes to its dependencies.
Environment
- OS: Linux (RHEL-based), x86_64
- GPU: 2× NVIDIA RTX A4000 (16 GB each)
- NVIDIA Driver: 590.48.01 (CUDA 13.1)
- Python: 3.11.15 (conda-forge)
- segger: 0.1.0 (installed from the `dpeerlab/segger` `main` branch)
- PyTorch: 2.5.0+cu121 (works correctly; `torch.cuda.is_available()` returns `True`)
- RAPIDS: 25.4.x (`cudf-cu12`, `cuml-cu12`, `cugraph-cu12`, `cuspatial-cu12`)
- UCX: `ucx-py-cu12` 0.43.0, `libucx-cu12` 1.18.1
Reproducing the issue
Running

```
segger segment -i /path/to/ist/data/ -o /path/to/output/
```

immediately crashes with:

```
[photon:912752:0:912752] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid: 912752) ====
 0  .../libucs.so(ucs_handle_error+0x294)
 ...
 4  /lib64/libcuda.so.1(+0x31a708)
 5  /lib64/libcuda.so.1(cuCtxGetDevice_v2+0x20)
 6  .../libffi.so.8(+0x702a)
 ...
Segmentation fault (core dumped)
```
Root cause analysis
The crash occurs in the NVIDIA driver's cuCtxGetDevice_v2 function, called via ctypes/libffi by the UCX library during Python module import. The import chain is:
segger segment
→ segger.cli.segment
→ segger.data.ISTDataModule
→ segger.data.utils.anndata
→ segger.data.utils.neighbors (line 8: `import cugraph`)
→ cugraph.__init__
→ cugraph.structure.graph_primtypes_wrapper
→ cugraph.dask.__init__
→ cugraph.dask.comms.comms
→ raft_dask.common.comms
→ UCX (libucp.so / libucs.so)
→ libcuda.so.1 cuCtxGetDevice_v2 ← SEGFAULT
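The chain above can be confirmed independently of segger with a bare import. On an affected CUDA 13.x system the process is expected to die with the same signal 11 during the `import` line itself (a segfault, not a Python exception); the `try/except` below only guards the case where cugraph is not installed at all:

```python
# Minimal reproduction, independent of segger. On an affected system the
# process segfaults inside UCX while this import runs, before any Python
# exception can be raised.
try:
    import cugraph  # transitively loads libucp/libucs -> libcuda.so.1
    status = "imported"
except ImportError:
    status = "missing"

print(f"cugraph: {status}")
```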
Why it happens
- `libcuda.so.1` is always system-global. It is provided by the NVIDIA kernel module and cannot be installed per-environment via conda or pip. On this system it is the CUDA 13.1 driver.
- UCX probes the CUDA driver at import time by calling `cuCtxGetDevice_v2` before any CUDA context has been created. On the CUDA 13.1 driver, this results in a segfault instead of a graceful error return.
- RAPIDS `cu12` packages ship UCX libraries compiled against CUDA 12.x, creating a mismatch with the CUDA 13.1 system driver.
- PyTorch handles this correctly: `torch.cuda.is_available()` works fine with the same driver, demonstrating that the CUDA 13.1 driver is functional and backward-compatible for well-behaved clients.
- `cuspatial` is archived and will never have CUDA 13 builds. The cuspatial repository was archived by RAPIDS on July 28, 2025. The `cuspatial-cu13` entry on PyPI is a zero-version placeholder. This means segger's dependency on cuspatial is a permanent blocker for CUDA 13.x systems, not a temporary gap that will be filled by a future release.
- UCX is not needed by segger. UCX provides multi-node, multi-GPU communication for Dask distributed workloads. segger runs single-node and does not use Dask distributed, yet UCX is loaded unconditionally because `cugraph` imports its `dask` submodule at package init time.
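For contrast, it is possible to probe the driver safely before any context exists: the CUDA driver API's `cuDriverGetVersion` is documented as callable pre-context, unlike the `cuCtxGetDevice_v2` call UCX makes at import time. A ctypes sketch of such a guard (the API names are real; the wrapper itself is hypothetical, not segger or UCX code):

```python
import ctypes


def driver_cuda_version():
    """Return the driver's CUDA version as an int (e.g. 13010 for 13.1),
    or None if libcuda.so.1 is absent or the call fails.

    cuDriverGetVersion is safe to call before any CUDA context exists.
    """
    try:
        lib = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # no NVIDIA driver library on this machine
    version = ctypes.c_int(0)
    # CUDA_SUCCESS == 0
    if lib.cuDriverGetVersion(ctypes.byref(version)) != 0:
        return None
    return version.value


print(driver_cuda_version())
```

A library that checked the driver version this way could refuse (or warn) gracefully instead of crashing.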
What was tried (and failed)
| Attempt | Result |
|---|---|
| Downgrade conda `cuda-toolkit` to 12.1 | Same segfault; the toolkit is irrelevant, `libcuda.so.1` is always system-global |
| `export UCX_MEMTYPE_CACHE=n; export UCX_TLS=tcp,self` | Same segfault; the crash happens before UCX config is read |
| `unset LD_LIBRARY_PATH` | Same segfault |
| Install RAPIDS via conda (`mamba install -c rapidsai`) | Same segfault; conda's UCX also calls into the system `libcuda.so.1` |
| `CUDA_VISIBLE_DEVICES=""` | No segfault, but then no GPU is available for computation |
| Uninstall UCX packages (`ucx-py-cu12`, `libucx-cu12`, etc.) | No segfault, but `import cugraph` fails with `ImportError: libucp.so.0` because cugraph unconditionally imports its Dask/distributed submodule, which requires UCX |
Possible solutions
- The most long-term solution would be to replace cuspatial entirely, but replicating its functionality is likely difficult.
- Replace cugraph so that UCX can be removed, though this does not seem like a very stable solution either.
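If segger's cugraph usage is limited to standard graph operations (an assumption; I have not audited every call site), a CPU fallback via `scipy.sparse.csgraph` could cover e.g. connected-component labeling without pulling in UCX at all. A toy sketch with a stand-in edge list:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy edge list standing in for a transcript-neighbor graph.
edges = np.array([[0, 1], [1, 2], [3, 4]])
n_nodes = 5

# Build a sparse adjacency matrix from the edge list.
adj = csr_matrix(
    (np.ones(len(edges)), (edges[:, 0], edges[:, 1])),
    shape=(n_nodes, n_nodes),
)

# Label connected components on the CPU; no GPU or UCX involved.
n_components, labels = connected_components(adj, directed=False)
print(n_components, labels)
```

Whether this is fast enough for segger-scale graphs is an open question, but it would remove the import-time crash entirely.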