Add API reference for iris.ccl, iris.ops, and iris.x modules #345
Conversation
Co-authored-by: neoblizz <9790745+neoblizz@users.noreply.github.com>
Pull request overview
This pull request adds comprehensive API reference documentation for three core Iris modules (iris.ccl, iris.ops, and iris.x) that were previously undocumented. The documentation follows the established pattern from existing Triton and Gluon reference sections.
Changes:
- Added structured documentation for collective communication operations (CCL), fused GEMM+CCL operations (ops), and device-side tile-level primitives (x)
- Created overview pages with usage patterns and examples for each module
- Added autodoc-based API reference pages for operations, configuration classes, and core abstractions
- Updated the main API reference navigation to include links to the new sections
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| docs/reference/api-reference.md | Added navigation entries for the three new modules in the main API reference index |
| docs/reference/ccl/overview.md | Overview of collective communication library with usage patterns and examples |
| docs/reference/ccl/operations.md | Autodoc references for CCL collective operations (all_reduce, all_gather, all_to_all, reduce_scatter) |
| docs/reference/ccl/config.md | Autodoc references for CCL configuration classes (Config, ReduceOp) |
| docs/reference/ops/overview.md | Overview of fused GEMM+CCL operations with usage patterns and workspace management examples |
| docs/reference/ops/operations.md | Autodoc references for fused operations (matmul_all_reduce, all_gather_matmul, matmul_all_gather, matmul_reduce_scatter) and OpsNamespace |
| docs/reference/ops/config.md | Autodoc references for ops configuration classes (FusedConfig, FusedWorkspace) |
| docs/reference/x/overview.md | Overview of device-side tile-level primitives with algorithm selection and usage patterns |
| docs/reference/x/core.md | Autodoc references for core abstractions (TileView, TensorView, DeviceContext, AllReduceConfig) and helper functions |
| docs/reference/x/operations.md | Autodoc references for device-side collective operations (all-reduce variants, all_gather, all_to_all, reduce_scatter, gather) |
@copilot apply changes based on the comments in this thread
- Fix iris.x examples to use correct API (iris.DeviceContext, make_tensor_view)
- Update AllReduceConfig examples to use integer codes instead of strings
- Add tl.constexpr annotations to kernel parameters
- Fix heap_size undefined variable in ops examples
- Update _toc.yml to include ccl, ops, and x documentation sections
- Remove incorrect DeviceContext reference from iris.x core.md
- Clarify that operations are standalone functions, not methods

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
The _toc.yml changes from commit 88db72f were lost during the merge with main. This commit restores the TOC entries for iris.ccl, iris.ops, and iris.x documentation.

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Renamed the variable 'shmem' to 'ctx' in the iris.ccl overview documentation for consistency with the rest of the codebase.

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```markdown
## Tile (deprecated)

Legacy tile representation. Use TileView instead.
```
The section header labels iris.x.Tile as “deprecated”, but the implementation doesn’t mark it deprecated and several iris.x collectives (e.g., all_reduce_atomic, all_gather, reduce_scatter) require a Tile because they read tile.data. Please remove or reword the “deprecated” claim, and clarify when to use TileView vs Tile (TileView for coordinates/shape-only, Tile when carrying computed tile data).
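For illustration, a minimal device-side sketch of that distinction, using only the constructor forms already quoted in this review; the kernel wrapper, parameter names, and load pattern are illustrative assumptions, not the documented example:

```python
import triton
import triton.language as tl
import iris

@triton.jit
def tile_vs_tileview_example(input_ptr, M, N, stride_m, stride_n,
                             BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # TileView: coordinates/shape only -- enough for ops that locate a tile
    # inside a TensorView and read/write through the view.
    tile_view = iris.x.TileView(pid_m, pid_n, BLOCK_M, BLOCK_N)

    # Tile: additionally carries the tile's computed values, which collectives
    # such as all_reduce_atomic consume via tile.data.
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)[:, None]
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)[None, :]
    data = tl.load(input_ptr + offs_m * stride_m + offs_n * stride_n)
    tile = iris.x.Tile(pid_m, pid_n, BLOCK_M, BLOCK_N, data)
```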
```python
tile = iris.x.TileView(pid_m, pid_n, BLOCK_M, BLOCK_N)
src_view = iris.x.make_tensor_view(input_ptr, M, N, stride_m, stride_n)
dst_view = iris.x.make_tensor_view(output_ptr, M, N, stride_m, stride_n)
ctx = iris.DeviceContext.initialize(context_tensor, rank, world_size)
```
In this example, tile is a TileView, but iris.x.all_reduce_atomic expects a Tile with a populated tile.data (the local tile’s computed values). As written, the example can’t work because TileView has no data field. Update the snippet to compute a tile result (e.g., load/compute) and construct iris.x.Tile(pid_m, pid_n, BLOCK_M, BLOCK_N, data) before calling all_reduce_atomic (or use an op that accepts TileView).
Suggested change:

```diff
-tile = iris.x.TileView(pid_m, pid_n, BLOCK_M, BLOCK_N)
 src_view = iris.x.make_tensor_view(input_ptr, M, N, stride_m, stride_n)
 dst_view = iris.x.make_tensor_view(output_ptr, M, N, stride_m, stride_n)
 ctx = iris.DeviceContext.initialize(context_tensor, rank, world_size)
+# Compute the local tile's data (e.g., load from input_ptr)
+offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)[:, None]
+offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)[None, :]
+ptrs = input_ptr + offs_m * stride_m + offs_n * stride_n
+data = tl.load(ptrs)
+# Construct a Tile with populated data for the collective
+tile = iris.x.Tile(pid_m, pid_n, BLOCK_M, BLOCK_N, data)
```
```python
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    tile = iris.x.TileView(pid_m, pid_n, BLOCK_M, BLOCK_N)
    dst_view = iris.x.make_tensor_view(output_ptr, M, N, stride_m, stride_n)
    ctx = iris.DeviceContext.initialize(context_tensor, rank, world_size)

    # Call collectives directly
    iris.x.all_reduce_atomic(tile, dst_view, ctx)
    iris.x.all_gather(tile, dst_view, dim=0, ctx=ctx)
    iris.x.all_to_all(tile, dst_view, dst_view, N_per_rank, ctx)
```
This usage-pattern snippet calls iris.x.all_to_all(tile, dst_view, dst_view, N_per_rank, ctx) but N_per_rank is not defined/passed in the kernel signature, and all_to_all requires a src_view argument (signature: all_to_all(tile, src_view, dst_view, N_per_rank, ctx)). The example should define N_per_rank: tl.constexpr in the kernel params and create src_view (likely from an input pointer) rather than passing dst_view twice.
Suggested change:

```diff
+              N_per_rank: tl.constexpr,
               BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
     pid_m = tl.program_id(0)
     pid_n = tl.program_id(1)
     tile = iris.x.TileView(pid_m, pid_n, BLOCK_M, BLOCK_N)
+    src_view = iris.x.make_tensor_view(input_ptr, M, N, stride_m, stride_n)
     dst_view = iris.x.make_tensor_view(output_ptr, M, N, stride_m, stride_n)
     ctx = iris.DeviceContext.initialize(context_tensor, rank, world_size)
     # Call collectives directly
     iris.x.all_reduce_atomic(tile, dst_view, ctx)
     iris.x.all_gather(tile, dst_view, dim=0, ctx=ctx)
-    iris.x.all_to_all(tile, dst_view, dst_view, N_per_rank, ctx)
+    iris.x.all_to_all(tile, src_view, dst_view, N_per_rank, ctx)
```
```python
dummy_locks = tl.zeros((1,), dtype=tl.int32)
config = iris.x.AllReduceConfig(1, dummy_locks)
iris.x.all_reduce_ring(tile, dst_view, ctx)
```
In this algorithm-selection example, config = iris.x.AllReduceConfig(1, dummy_locks) is created but never used, and the subsequent call iris.x.all_reduce_ring(tile, dst_view, ctx) doesn’t match the actual all_reduce_ring signature (it requires src_view and dst_view). Either remove AllReduceConfig from this example and show the correct call form for all_reduce_ring(tile, src_view, dst_view, ctx), or add a real API that consumes AllReduceConfig and document that instead.
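For reference, a hedged sketch of the corrected call form this comment describes, all_reduce_ring(tile, src_view, dst_view, ctx); the kernel wrapper and setup lines mirror the snippets above and are assumptions rather than the documented example:

```python
import triton
import triton.language as tl
import iris

@triton.jit
def ring_all_reduce_example(input_ptr, output_ptr, context_tensor, rank, world_size,
                            M, N, stride_m, stride_n,
                            BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    tile = iris.x.TileView(pid_m, pid_n, BLOCK_M, BLOCK_N)
    src_view = iris.x.make_tensor_view(input_ptr, M, N, stride_m, stride_n)
    dst_view = iris.x.make_tensor_view(output_ptr, M, N, stride_m, stride_n)
    ctx = iris.DeviceContext.initialize(context_tensor, rank, world_size)

    # The ring variant reads from src_view and writes to dst_view; no
    # AllReduceConfig object is passed (per the comment above).
    iris.x.all_reduce_ring(tile, src_view, dst_view, ctx)
```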
```python
config = iris.x.AllReduceConfig(4, locks_ptr)
tile_id = pid_m * num_tiles_n + pid_n
iris.x.all_reduce_spinlock(tile, dst_view, locks_ptr, ctx)
```
This snippet has multiple undefined/mismatched pieces: tile_id = pid_m * num_tiles_n + pid_n references num_tiles_n which isn’t defined in the example, and iris.x.all_reduce_spinlock takes the locks tensor directly (no AllReduceConfig parameter), so the config = iris.x.AllReduceConfig(4, locks_ptr) line is misleading unless there is an API that consumes it. Consider removing AllReduceConfig here, define num_tiles_n (e.g., via tl.cdiv(N, BLOCK_N) or dst_view helpers), and show the correct all_reduce_spinlock(tile, dst_view, locks_ptr, ctx) call with the required locks layout.
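And a corresponding hedged sketch for the spinlock variant, following the call form and num_tiles_n suggestion in this comment; the Tile-with-data construction and the one-lock-per-output-tile layout are assumptions to verify against the iris.x reference:

```python
import triton
import triton.language as tl
import iris

@triton.jit
def spinlock_all_reduce_example(input_ptr, output_ptr, locks_ptr, context_tensor,
                                rank, world_size, M, N, stride_m, stride_n,
                                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)

    # Local tile result, wrapped with its data so the collective can reduce tile.data
    # (assumed here, following the reviewer's note about which ops read tile.data).
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)[:, None]
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)[None, :]
    data = tl.load(input_ptr + offs_m * stride_m + offs_n * stride_n)
    tile = iris.x.Tile(pid_m, pid_n, BLOCK_M, BLOCK_N, data)

    dst_view = iris.x.make_tensor_view(output_ptr, M, N, stride_m, stride_n)
    ctx = iris.DeviceContext.initialize(context_tensor, rank, world_size)

    # locks_ptr is assumed to point at a zero-initialized int32 tensor with one lock
    # per output tile; num_tiles_n can be derived on-device as tl.cdiv(N, BLOCK_N)
    # if a per-tile index is needed. The locks tensor is passed directly; no
    # AllReduceConfig is involved.
    iris.x.all_reduce_spinlock(tile, dst_view, locks_ptr, ctx)
```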
Add API reference for iris.ccl, iris.ops and iris.x
Original prompt
Add API reference for iris.ccl, iris.ops and iris.x (#344)