Skip to content

feat: Add substrait support for multiple files#19729

Open
kumarUjjawal wants to merge 6 commits intoapache:mainfrom
kumarUjjawal:feat/multi_file_substrait
Open

feat: Add substrait support for multiple files#19729
kumarUjjawal wants to merge 6 commits intoapache:mainfrom
kumarUjjawal:feat/multi_file_substrait

Conversation

@kumarUjjawal
Copy link
Contributor

Which issue does this PR close?

  • Closes 10864.

Rationale for this change

In #10842 support for ReadType::LocalFiles was added but it didn't support multiple files.

What changes are included in this PR?

  • Added support for multiple files
  • Added new test module to test this new feature

Are these changes tested?

  • All previous tests pass
  • Added new tests

Are there any user-facing changes?

@github-actions github-actions bot added the substrait Changes to the substrait crate label Jan 10, 2026
Copy link
Contributor

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping in from the substrait side of things 👋

Just had a chance to take a quick look, hopefully I can give a more fully review later

Comment on lines +189 to +196
// Normalize file:// to file:/// for proper URL parsing
let normalized = if path_str.starts_with("file://")
&& !path_str.starts_with("file:///")
{
path_str.replacen("file://", "file:///", 1)
} else {
path_str
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question, I realize that this is consistent with the previous code, but why is it the responsibility of this point in the code to "normalize" the url?

I put normalize in quotes because aren't we actually just taking an incorrect url and modifying it to be correct? I would consider that malformed substrait and throw an error personally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Furthermore, shouldn't we pass around the Url instead of a string file path? We may need the scheme later. Per the substrait spec, we may use the scheme to determine if we actually need to read the file from e.g. s3.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I had few thoughts:
The Substrait spec says these are "URIs" where "the protocol portion of the URI determines the storage type", but the existing physical plan producer/consumer in DF treats them as raw strings, the producer emits bare object-store paths like "file1.parquet" (not valid URIs), and the consumer passes them straight to ObjectMeta without parsing. If we require Url::parse() to succeed, plans produced by DataFusion's own physical plan producer would be rejected by our logical plan consumer.

So I was thinking we keep &[String] and pass raw URI strings and let the consumer decide how to parse. consistent with the physical plan side. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm to me that sounds like a bug on the consumer side too then. We should only produce plans with valid URIs IMO.

I'm not the datafusion expert, so I'm not sure the right approach here but I feel like it would be better to fix that bug rather than allow the incorrect substrait in the producer require accommodation in all of the consumer parts of the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into the producer and what i can understand, it appears to emit relative object-store paths by design (the scheme lives in ObjectStoreUrl separately). For our logical plan consumer, we'll require valid URIs per spec. I have made the changes here for the URI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened up #20550 to track this

Copy link
Contributor

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the whole I think it looks good, but I have a few remaining questions. The biggest one is about whether we should just let the failure to handle multiple local files just percolate up to an execution error.

I suspect that returning an error there is better than returning a potentially incorrect query result.

.map(|filename| filename.to_string_lossy().to_string())
/// Parses the URI string from a PathType variant.
/// Returns an error if the URI is malformed.
fn parse_uri(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note in case someone comes to this in the future and as the same question.

The Url crate is quite permissive, so while the type is called Url, it handles general URIs without a problem.

// Try to resolve files using the new trait method.
// If not implemented, fall back to the legacy single-file behavior
// for backward compatibility.
let provider = match consumer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a question for the people with more datafusion experience, but is this the right approach? I would think that if the consumer fails to handle multiple files, it is an execution failure. Otherwise, doesn't it risk executing the plan incorrectly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably propagate error in multi-file case.


if lf.items.len() > 1 || filename.is_none() {
return not_impl_err!("Only single file reads are supported");
if uris.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other small thing. I don't think the spec technically enforces that an empty LocalFiles is invalid. Since the schema will have to be part of the plan, it is also reasonable to just consider it an empty table 🤷

I don't know what is consistent with the normal datafusion approach here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that way, I see no reason to not just let the implementer of resolve_local_files handle this case. The only complexity then is that if you don't have any uris, then you can't make a table name.

@kumarUjjawal kumarUjjawal force-pushed the feat/multi_file_substrait branch from 487c468 to 206e05d Compare March 2, 2026 10:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants