Better document the relationship between `FileFormat::projection` / `FileFormat::filter` and `FileScanConfig::Statistics` by alamb · Pull Request #20188 · apache/datafusion

alamb · 2026-02-06T14:30:37Z

Which issue does this PR close?

Part of Existing file ordering is lost in some cases with projections #20173

Rationale for this change

I am debugging an issue related to the interplay of pre-existing orderings, filtering and projections in FileScanConfig. As part of that I am trying to understand how Statistics are handled by FileScanConfig -- specifically, whether they describe the data before or after the projection and filtering are applied.

After some study, I have found that the statistics are (supposed?) to describe the data before the Filter and Projection from the scan are applied, so let's document that better. I also found the schemas involved to be a bit confusing.

I also would like to use this PR to validate my understanding of the intended design

What changes are included in this PR?

Update documentation

Are these changes tested?

By CI

Are there any user-facing changes?

Just documentation changes, no functional changes

alamb · 2026-02-06T15:06:45Z

@adriangb @AdamGS and @zhuqi-lucas I wonder if you have some time to help me validate my understanding of how this code is all related and review this PR?

In general I have recently found that we have added so many (cool) features to DataFusion that it is harder and harder to understand how everything works together. I hope to help this situation by adding some more documentation and clarification over the next few days

adriangb

Thanks @alamb !

I honestly don't know what the intention or best design is for the statistics.
I think before the expression projection makes sense, but it doesn't make sense to populate statistics for columns that are not involved in the query.

I do wonder if it would be okay to say the statistics are coupled to the scan plan -> if we know some row groups will not be read and we can use that information to make more accurate statistics we should / can.

One 🎣 for another day: how do struct statistics fit into our stats framework?

adriangb · 2026-02-06T15:28:18Z

datafusion/datasource/src/file.rs

+    /// The output schema of this `FileSource` is this TableSchema
+    /// with [`Self::projection`] applied.


Wonder if we should add a helper method to FileSource that applies the projection?

There is ProjectionExprs::project_schema which does so. Maybe I could add a link to that
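To make the "output schema = table schema with the projection applied" statement concrete, here is a minimal sketch. `ToySchema` and this `project_schema` function are hypothetical stand-ins written only for illustration; they are not DataFusion's `TableSchema` or `ProjectionExprs::project_schema`, and they model the projection as plain column indices, while the real projection may also hold expressions.

```rust
/// Toy stand-in for a schema: an ordered list of column names.
/// (Hypothetical type, not DataFusion's `TableSchema`.)
#[derive(Debug, Clone, PartialEq)]
pub struct ToySchema {
    pub columns: Vec<String>,
}

/// Apply a projection, given as column indices into `table_schema`,
/// and return the resulting output schema.
/// Returns `None` if any index is out of bounds.
pub fn project_schema(table_schema: &ToySchema, projection: &[usize]) -> Option<ToySchema> {
    let columns = projection
        .iter()
        .map(|&i| table_schema.columns.get(i).cloned())
        .collect::<Option<Vec<_>>>()?;
    Some(ToySchema { columns })
}
```

With a table schema of `a, b, c` and a projection of `[2, 0]`, the output schema in this sketch is `c, a`: the projection both selects and reorders columns.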

adriangb · 2026-02-06T15:28:31Z

datafusion/datasource/src/file.rs

 ///
+/// # Schema information
+/// There are two important schemas for a [`FileSource`]:
+/// 1. [`Self::table_schema`] -- the schema for the overall "table"


Does this include partition columns?

I think it would be worth explicitly noting in the table_schema() doc that it includes partition columns, e.g. "the schema for the overall table (file schema + partition columns)". The FileScanConfig docs mention this, but it's easy to miss when reading FileSource alone.
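A small sketch of the "file schema + partition columns" relationship may help here. It assumes, as the removed doc line about `table_partition_cols` in this PR suggests, that partition columns come after the file columns. Both helper functions are hypothetical and only model column names.

```rust
/// Hypothetical helper: build the table-level column list by appending
/// the partition columns after the columns stored in the files.
pub fn table_columns(file_columns: &[&str], partition_columns: &[&str]) -> Vec<String> {
    file_columns
        .iter()
        .chain(partition_columns.iter())
        .map(|name| name.to_string())
        .collect()
}

/// True when `index` (an index into the table columns) refers to a
/// partition column rather than a column physically stored in the file.
/// Assumes partition columns are appended after the file columns.
pub fn is_partition_column(index: usize, num_file_columns: usize) -> bool {
    index >= num_file_columns
}
```

For example, file columns `id, value` and a partition column `year` give the table columns `id, value, year`, and index 2 is the partition column.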

adriangb · 2026-02-06T15:36:04Z

datafusion/datasource/src/file.rs

+    /// `filters` must be in terms of the unprojected table schema (file schema
+    /// plus partition columns).


👍🏻

Worth clarifying what is expected for a FileSource that:

- Applies filters independently of projections
- Applies filters after reading the data / projecting

?

I tried to clarify in 02e7617

Though I am not quite sure what you are suggesting here as I think the filters are applied (logically) before projection
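As an illustration of why filters need to be expressed against the unprojected table schema, here is a toy sketch in which the filter refers to a column that the projection drops. `Row` and `scan` are hypothetical names, and the sketch only models the logical order (filter, then project), not how any particular FileSource physically evaluates it.

```rust
/// A toy row with two file columns (`id`, `value`) and one partition
/// column (`year`). Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: i64,
    pub value: i64,
    pub year: i64,
}

/// Logically filter first, in terms of the full (unprojected) row,
/// then project down to the `value` column only.
pub fn scan(rows: &[Row], min_year: i64) -> Vec<i64> {
    rows.iter()
        // The filter references `year`, which is not part of the output.
        .filter(|r| r.year >= min_year)
        // The projection is applied (logically) after the filter.
        .map(|r| r.value)
        .collect()
}
```

If the filter were written against the projected schema instead, `year` would not be available to it at all.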

adriangb · 2026-02-06T15:39:22Z

datafusion/datasource/src/file_scan_config.rs

+/// `FileScanConfig`. Fields in a `FileScanConfig` such as Statistics represent
+/// information about the files **before** any projection or filtering is


I think this is good. I still get confused about stats before vs. after projection / filtering. I do have some caveats: for us we use an external index to prune row groups, so we can actually get more accurate statistics for min/max, num rows, etc. in a file taking into account row groups pruned. It's also useful to project the statistics (or at least only populate them for columns that are in filters or projections); no point in collecting / allocating statistics for columns that will never be used.

I agree what the code currently does is not ideal. However, before I fix it I need to understand what it currently does :)

BTW I am actually looking into the same sort of question for the EquivalenceProperties
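The difference between statistics "before" the scan's filtering and statistics that account for pruned row groups can be sketched as follows. All names here (`RowGroupStats`, `ColumnSummary`, `summarize`, `summarize_after_pruning`) are hypothetical, and the model tracks a single integer column; it is not how DataFusion's `Statistics` type is structured.

```rust
/// Per-row-group statistics for a single integer column (toy model).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RowGroupStats {
    pub num_rows: usize,
    pub min: i64,
    pub max: i64,
}

/// File-level summary derived from a set of row groups.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ColumnSummary {
    pub num_rows: usize,
    pub min: i64,
    pub max: i64,
}

/// Combine row-group statistics into one summary.
/// Returns `None` when there are no row groups.
pub fn summarize<'a>(
    groups: impl IntoIterator<Item = &'a RowGroupStats>,
) -> Option<ColumnSummary> {
    groups.into_iter().fold(None, |acc, g| {
        Some(match acc {
            None => ColumnSummary { num_rows: g.num_rows, min: g.min, max: g.max },
            Some(s) => ColumnSummary {
                num_rows: s.num_rows + g.num_rows,
                min: s.min.min(g.min),
                max: s.max.max(g.max),
            },
        })
    })
}

/// Summary that only counts row groups whose `keep` flag is true,
/// e.g. after an external index has pruned the others.
pub fn summarize_after_pruning(
    groups: &[RowGroupStats],
    keep: &[bool],
) -> Option<ColumnSummary> {
    summarize(groups.iter().zip(keep).filter(|(_, k)| **k).map(|(g, _)| g))
}
```

In this sketch, `summarize` corresponds to the behavior documented in this PR (statistics describe the files before any filtering), while `summarize_after_pruning` corresponds to the tighter statistics described in the comment above.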

adriangb · 2026-02-06T15:40:38Z

datafusion/datasource/src/file_scan_config.rs

-    /// Indexes that are higher than the number of columns of `file_schema` refer to `table_partition_cols`.
+    /// This method attempts to push down the projection to the underlying file
+    /// source if supported. If the file source does not support projection
+    /// pushdown, an error is returned.


I think we return Ok(self) -> not an error, an unchanged plan?

I double checked and I think the comment is correct and this code does return an error

FileSource::try_pushdown_projection returns Option, but then this method returns an internal error if the file source returns None

A few lines below in this diff:

    if let Some(new_source) = new_source {
        self.file_source = new_source;
    } else {
        internal_err!( // <------------ This error
            "FileSource {} does not support projection pushdown",
            self.file_source.file_type()
        )?;
    }

Hmm yeah okay. Don't love that haha but I guess we are documenting the current status quo.

Yeah, I agree -- I struggled with choosing between fixing everything I found questionable and just documenting what it was doing
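The pattern discussed in this thread, where an `Option` returned by the source is turned into an error by the caller, can be sketched in isolation like this. `ToySource`, `PushdownError` and `push_projection` are hypothetical names invented for the sketch; only the error message text is taken from the snippet quoted above.

```rust
/// Hypothetical error type standing in for an internal error.
#[derive(Debug, PartialEq)]
pub struct PushdownError(pub String);

/// Hypothetical file source that may or may not accept a projection.
#[derive(Debug)]
pub struct ToySource {
    pub file_type: &'static str,
    pub supports_projection: bool,
    pub projection: Option<Vec<usize>>,
}

impl ToySource {
    /// Returns a new source with the projection applied, or `None`
    /// if this source does not support projection pushdown.
    pub fn try_pushdown_projection(&self, projection: &[usize]) -> Option<ToySource> {
        self.supports_projection.then(|| ToySource {
            file_type: self.file_type,
            supports_projection: true,
            projection: Some(projection.to_vec()),
        })
    }
}

/// Caller-side behavior: a `None` from the source becomes an error
/// instead of silently returning the unchanged source.
pub fn push_projection(
    source: ToySource,
    projection: &[usize],
) -> Result<ToySource, PushdownError> {
    source.try_pushdown_projection(projection).ok_or_else(|| {
        PushdownError(format!(
            "FileSource {} does not support projection pushdown",
            source.file_type
        ))
    })
}
```

The alternative raised in the review, returning the unchanged source on `None`, would replace the `ok_or_else` call with something like `unwrap_or(source)`.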

AdamGS · 2026-02-06T16:28:49Z

This is great! I think @adriangb made the whole FileScanConfig <> FileSource interaction and order of operations much better than it used to be, but I'm also a fan of making things clearer.

alamb · 2026-02-06T15:20:13Z

datafusion/datasource/src/file.rs


 /// file format specific behaviors for elements in [`DataSource`]
 ///
+/// # Schema information


I think this is one of the key distinctions -- there are two relevant schemas and it is important to specify which is being used at any time

alamb · 2026-02-06T16:45:14Z

I do wonder if it would be okay to say the statistics are coupled to the scan plan -> if we know some row groups will not be read and we can use that information to make more accurate statistics we should / can.

One 🎣 for another day: how do struct statistics fit into our stats framework?

One thought I had was to use some sort of delayed statistics thing -- like have a callback to produce statistics and only compute them on demand when they are actually used. Otherwise figuring out what stats will be used is going to be a very tricky business. But maybe on demand would also be tricky
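One possible shape for the "delayed statistics" idea is a value that stores a callback and caches its result the first time it is asked for. This is only a sketch under assumptions: `ToyStatistics` and `LazyStatistics` are hypothetical types, nothing like them is claimed to exist in DataFusion, and the single-threaded `std::cell::OnceCell` is used for brevity (a shared plan would more likely need `std::sync::OnceLock`).

```rust
use std::cell::OnceCell;

/// Toy statistics value (hypothetical, not DataFusion's `Statistics`).
#[derive(Debug, Clone, PartialEq)]
pub struct ToyStatistics {
    pub num_rows: usize,
}

/// Statistics that are computed on first use and cached afterwards.
pub struct LazyStatistics<F: Fn() -> ToyStatistics> {
    compute: F,
    cached: OnceCell<ToyStatistics>,
}

impl<F: Fn() -> ToyStatistics> LazyStatistics<F> {
    pub fn new(compute: F) -> Self {
        Self { compute, cached: OnceCell::new() }
    }

    /// Runs the callback at most once, then returns the cached value.
    pub fn get(&self) -> &ToyStatistics {
        self.cached.get_or_init(|| (self.compute)())
    }

    /// True once the statistics have been requested at least once.
    pub fn is_computed(&self) -> bool {
        self.cached.get().is_some()
    }
}
```

With this shape, a plan that never asks for statistics never pays for computing them, which sidesteps the question of predicting up front which statistics will be used.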

alamb · 2026-02-06T17:41:03Z

Here is another follow on trying to make this code easier to understand

Introduce ProjectionExprs::unproject_exprs/project_exprs and improve docs #20193

zhuqi-lucas

LGTM, nice documentation improvement!

zhuqi-lucas · 2026-02-08T06:57:41Z

datafusion/datasource/src/file.rs

 ///
+/// # Schema information
+/// There are two important schemas for a [`FileSource`]:
+/// 1. [`Self::table_schema`] -- the schema for the overall "table"


I think it would be worth explicitly noting in the table_schema() doc that it includes partition columns, e.g. "the schema for the overall table (file schema + partition columns)". The FileScanConfig docs mention this, but it's easy to miss when reading FileSource alone.

alamb · 2026-02-09T13:56:59Z

Thank you @AdamGS @adriangb and @zhuqi-lucas -- I will make the additional documentation improvements you suggest as a follow on PR

alamb · 2026-02-09T14:23:08Z

Follow on PR with some additional clarifications:

More documentation on FileSource::table_schema and FileSource::projection #20242

…::filter and FileScanConfig::output_ordering (#20196)

## Which issue does this PR close?

- closes #20173
- Similar to #20188

## Rationale for this change

I spent a long time trying to figure out what was going on in our fork after a DataFusion 52 upgrade, and the root cause was that the output_ordering in DataFusion 52 does *NOT* include the projection (more details here #20173 (comment))

This was not clear to me from the DataFusion code or documentation, and I think it would be helpful to clarify this in the documentation.

## What changes are included in this PR?

1. Document FileScanConfig::output_ordering better

## Are these changes tested?

## Are there any user-facing changes?

@zhuqi-lucas

…jection` (#20242)

## Which issue does this PR close?

- Follow on to #20188

## Rationale for this change

@zhuqi-lucas and @adriangb had some good ideas on how to further improve the documentation on #20188, which I tried to implement in this PR

## What changes are included in this PR?

Add more clarity about what TableSource and FileSource::projection are

## Are these changes tested?

By CI

## Are there any user-facing changes?

Additional documentation

alamb force-pushed the alamb/document_file_scan_statistics_better branch from 6524de7 to 5525203 Compare February 6, 2026 14:30

github-actions bot added the datasource Changes to the datasource crate label Feb 6, 2026

alamb changed the title ~~Better document FileScanConfig::Statistics~~ Better document the relationship between FileFormat::projection / FileFormat::filter and FileScanConfig::Statistics Feb 6, 2026

Better document FileScanConfig::Statistics

8b74a78

alamb force-pushed the alamb/document_file_scan_statistics_better branch from 5525203 to 8b74a78 Compare February 6, 2026 15:02

alamb marked this pull request as ready for review February 6, 2026 15:04

alamb requested a review from adriangb February 6, 2026 15:05

adriangb approved these changes Feb 6, 2026


alamb commented Feb 6, 2026


Add more clarification

02e7617

github-actions bot added the physical-plan Changes to the physical-plan crate label Feb 6, 2026

alamb mentioned this pull request Feb 6, 2026

Document the relationship between FileFormat::projection / FileFormat::filter and FileScanConfig::output_ordering #20196

Merged

alamb added the documentation Improvements or additions to documentation label Feb 7, 2026

zhuqi-lucas approved these changes Feb 8, 2026


alamb added this pull request to the merge queue Feb 9, 2026

Merged via the queue into apache:main with commit cc670e8 Feb 9, 2026
32 checks passed

alamb deleted the alamb/document_file_scan_statistics_better branch February 9, 2026 14:10

alamb mentioned this pull request Feb 9, 2026

More documentation on FileSource::table_schema and FileSource::projection #20242

Merged
