Skip to content

fix: preserve None projection semantics across FFI boundary in ForeignTableProvider::scan#20393

Merged
timsaucer merged 1 commit intoapache:mainfrom
Kontinuation:fix/ffi-scan-none-projection
Mar 6, 2026
Merged

fix: preserve None projection semantics across FFI boundary in ForeignTableProvider::scan#20393
timsaucer merged 1 commit intoapache:mainfrom
Kontinuation:fix/ffi-scan-none-projection

Conversation

@Kontinuation
Copy link
Member

@Kontinuation Kontinuation commented Feb 17, 2026

Which issue does this PR close?

N/A (newly discovered bug)

This is originally found in apache/sedona-db when working on a custom plan node: apache/sedona-db#611 (comment)

Rationale for this change

ForeignTableProvider::scan() converts a None projection (meaning "return all columns") into an empty RVec<usize> before passing it across the FFI boundary. On the receiving side, scan_fn_wrapper always wraps the received RVec in Some(...), passing Some(&vec![]) to the inner TableProvider::scan(). This means "project zero columns" — the exact opposite of the intended "project all columns."

The root cause is that the FFI_TableProvider::scan function signature uses RVec<usize> for the projections parameter. Since RVec<usize> cannot represent None, the None vs. empty-vec distinction is lost at the FFI boundary.

What changes are included in this PR?

Three coordinated changes in datafusion/ffi/src/table_provider.rs:

  1. FFI struct definition: Changed scan function pointer signature from RVec<usize> to ROption<RVec<usize>> for the projections parameter, matching how limit already uses ROption<usize> for the same None-vs-value distinction.

  2. Receiver side (scan_fn_wrapper): Converts ROption<RVec<usize>> via .into_option().map(...) and passes projections.as_ref() to the inner provider, preserving None semantics.

  3. Sender side (ForeignTableProvider::scan): Converts Option<&Vec<usize>> to ROption<RVec<usize>> via .into() instead of using unwrap_or_default().

Plus a new unit test test_scan_with_none_projection_returns_all_columns that directly exercises the FFI round-trip with projection=None and verifies all 3 columns are returned.

Also fixed the existing test_aggregation test to set library_marker_id = mock_foreign_marker_id so it actually exercises the FFI path instead of taking the local bypass.

How are these changes tested?

  • New test test_scan_with_none_projection_returns_all_columns: creates a 3-column MemTable, wraps it through FFI with the foreign marker set, calls scan(None), and asserts 3 columns are returned (previously returned 0).

Are these changes safe?

This is a breaking FFI ABI change to the FFI_TableProvider::scan function pointer signature. However:

  • The abi_stable crate's #[derive(StableAbi)] generates layout checks at dylib load time, so mismatched dylibs will be caught at load rather than causing silent corruption.
  • All existing providers construct FFI_TableProvider via ::new() or ::new_with_ffi_codec(), which internally wire up scan_fn_wrapper — nobody constructs the scan function pointer manually.

@github-actions github-actions bot added the ffi Changes to the ffi crate label Feb 17, 2026
…nTableProvider::scan

ForeignTableProvider::scan() converted projection: None (meaning 'all
columns') into an empty RVec<usize> via unwrap_or_default() before
passing it across the FFI boundary. On the receiving side,
scan_fn_wrapper always wrapped the received RVec in Some(...), passing
Some(&vec![]) to the inner TableProvider::scan(). This means 'project
zero columns' -- the opposite of the intended 'all columns'.

Fix by changing the FFI function signature's projections parameter from
RVec<usize> to ROption<RVec<usize>>, matching how limit already uses
ROption<usize> for the same None-vs-value distinction. Update both the
sender (ForeignTableProvider::scan) and receiver (scan_fn_wrapper) to
properly convert between Option and ROption.

Add test that verifies scan(projection=None) returns all columns.
@paleolimbot
Copy link
Member

cc @timsaucer !

Copy link
Member

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for a very clear and concise solution!

One alternative we could do that would not be breaking for the FFI boundary would be on the Foreign side to check for Some([]) and return an empty batch executor and then on the local side (wrapped fn) change an empty projections into None.

@alamb
Copy link
Contributor

alamb commented Mar 4, 2026

What shall we do with this PR? Is it ready to merge? Should we try @timsaucer 's alternate suggestion?

@Kontinuation
Copy link
Member Author

I had a hard time figuring out how to construct an execution plan producing RecordBatches with zero columns while matching the total number of records in the foreign table. This can be done but in a cumbersome way. I don't think we can simply return an empty batch executor for such cases, since we still need to produce batches with no columns.

Another approach without breaking the ABI is to expand None projection into Some([0, 1, .. n - 1]). The relationship between local projection and FFI projection is as follows:

Semantics     Local                 FFI                Local (scan_fn_wrapper)
Project all   None           -->   [0,1,..n-1]   -->  Some([0,1,..n-1])
Project none  Some([])       -->   []            -->  Some([])
Project       Some([a,b,..]) -->   [a,b,..]      -->  Some([a,b,..])

The "select all" projection represented by None will become Some([0,1,...n-1]) after the roundtrip, but the semantics should be the same.

I think we should either stick to the current approach to make the FFI APIs look the same as local APIs for better consistency and less mental burden, or expand None to Some([0, 1, .. n-1]) when transferring the projection from local side to FFI.

Copy link
Member

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I agree that your solution is preferable over the other idea I had.

@timsaucer timsaucer added this pull request to the merge queue Mar 6, 2026
Merged via the queue into apache:main with commit d72b0b8 Mar 6, 2026
30 checks passed
alamb pushed a commit to alamb/datafusion that referenced this pull request Mar 12, 2026
…nTableProvider::scan (apache#20393)

## Which issue does this PR close?

N/A (newly discovered bug)

This is originally found in apache/sedona-db when working on a custom
plan node:
apache/sedona-db#611 (comment)

## Rationale for this change

`ForeignTableProvider::scan()` converts a `None` projection (meaning
"return all columns") into an empty `RVec<usize>` before passing it
across the FFI boundary. On the receiving side, `scan_fn_wrapper` always
wraps the received `RVec` in `Some(...)`, passing `Some(&vec![])` to the
inner `TableProvider::scan()`. This means "project zero columns" — the
exact opposite of the intended "project all columns."

The root cause is that the `FFI_TableProvider::scan` function signature
uses `RVec<usize>` for the projections parameter. Since `RVec<usize>`
cannot represent `None`, the `None` vs. empty-vec distinction is lost at
the FFI boundary.

## What changes are included in this PR?

Three coordinated changes in `datafusion/ffi/src/table_provider.rs`:

1. **FFI struct definition**: Changed `scan` function pointer signature
from `RVec<usize>` to `ROption<RVec<usize>>` for the projections
parameter, matching how `limit` already uses `ROption<usize>` for the
same `None`-vs-value distinction.

2. **Receiver side** (`scan_fn_wrapper`): Converts
`ROption<RVec<usize>>` via `.into_option().map(...)` and passes
`projections.as_ref()` to the inner provider, preserving `None`
semantics.

3. **Sender side** (`ForeignTableProvider::scan`): Converts
`Option<&Vec<usize>>` to `ROption<RVec<usize>>` via `.into()` instead of
using `unwrap_or_default()`.

Plus a new unit test
`test_scan_with_none_projection_returns_all_columns` that directly
exercises the FFI round-trip with `projection=None` and verifies all 3
columns are returned.

Also fixed the existing `test_aggregation` test to set
`library_marker_id = mock_foreign_marker_id` so it actually exercises
the FFI path instead of taking the local bypass.

## How are these changes tested?

- New test `test_scan_with_none_projection_returns_all_columns`: creates
a 3-column MemTable, wraps it through FFI with the foreign marker set,
calls `scan(None)`, and asserts 3 columns are returned (previously
returned 0).

## Are these changes safe?

This is a **breaking FFI ABI change** to the `FFI_TableProvider::scan`
function pointer signature. However:

- The `abi_stable` crate's `#[derive(StableAbi)]` generates layout
checks at dylib load time, so mismatched dylibs will be caught at load
rather than causing silent corruption.
- All existing providers construct `FFI_TableProvider` via `::new()` or
`::new_with_ffi_codec()`, which internally wire up `scan_fn_wrapper` —
nobody constructs the `scan` function pointer manually.
alamb added a commit that referenced this pull request Mar 12, 2026
…ry in ForeignTableProvider::scan (#20393) (#20895)

- Part of #19692

This PR:
- Backports #20393 from
@Kontinuation to the branch-53 line

Co-authored-by: Kristin Cowalcijk <bo@wherobots.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ffi Changes to the ffi crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants