Skip to content

Better document the relationship between FileFormat::projection / FileFormat::filter and FileScanConfig::Statistics #20188

Merged
alamb merged 2 commits intoapache:mainfrom
alamb:alamb/document_file_scan_statistics_better
Feb 9, 2026
Merged

Better document the relationship between FileFormat::projection / FileFormat::filter and FileScanConfig::Statistics #20188
alamb merged 2 commits intoapache:mainfrom
alamb:alamb/document_file_scan_statistics_better

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Feb 6, 2026

Which issue does this PR close?

Rationale for this change

I am debugging an issue related to the interplay of pre-existing orderings, filtering and projections in FileScanConfig. As part of that I am trying to understand how Statistics were handled by FileScanConfig -- specifically when relatively speaking are the projection and filtering applied

After some study, I have found that the statistics are (supposed?) to be before applying the Filter and Projection from the scan, so let's document that better. Also I found the schemas involved to be a bit confusing.

I also would like to use this PR to validate my understanding of the intended design

What changes are included in this PR?

Update documentation

Are these changes tested?

by CI

Are there any user-facing changes?

Just documentation changes, no functional changes

@alamb alamb force-pushed the alamb/document_file_scan_statistics_better branch from 6524de7 to 5525203 Compare February 6, 2026 14:30
@github-actions github-actions bot added the datasource Changes to the datasource crate label Feb 6, 2026
@alamb alamb changed the title Better document FileScanConfig::Statistics Better document the relationship between FileFormat::projection / FileFormat::filter and FileScanConfig::Statistics Feb 6, 2026
@alamb alamb force-pushed the alamb/document_file_scan_statistics_better branch from 5525203 to 8b74a78 Compare February 6, 2026 15:02
@alamb alamb marked this pull request as ready for review February 6, 2026 15:04
@alamb alamb requested a review from adriangb February 6, 2026 15:05
@alamb
Copy link
Contributor Author

alamb commented Feb 6, 2026

@adriangb @AdamGS and @zhuqi-lucas I wonder if you have some time to help me validate my understanding of how this code is all related and review this PR?

In general I have recently found that we have added so many (cool) features to DataFusion that it is harder and harder to understand how everything works together. I hope to help this situation by adding some more documentation and clarification over the next few days

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb !

I honestly don't know what the intention or best design is for the statistics.
I think before the expression projection makes sense, but it doesn't make sense to populate statistics for columns that are not involved in the query.

I do wonder if it would be okay to say the statistics are coupled to the scan plan -> if we know some row groups will not be read and we can use that information to make more accurate statistics we should / can.

One 🎣 for another day: how do struct statistics fit into our stats framework?

Comment on lines +79 to +80
/// The output schema of this `FileSource` is this TableSchema
/// with [`Self::projection`] applied.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonder if we should add a helper method to FileSource that applies the projection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is ProjectionExprs::project_schema which does so. Maybe I could add a link to that

///
/// # Schema information
/// There are two important schemas for a [`FileSource`]:
/// 1. [`Self::table_schema`] -- the schema for the overall "table"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this include partition columns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it be worth explicitly noting in the table_schema() doc that it includes partition columns? e.g. "the schema for the overall table (file schema + partition columns)". The FileScanConfig docs mention this, but it's easy to miss when reading FileSource alone.

Comment on lines +159 to +160
/// `filters` must be in terms of the unprojected table schema (file schema
/// plus partition columns).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

Worth clarifying what is expected for a FileSource that:

  • Applies filters independently of projections
  • Applies filters after reading the data / projecting

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to clarify in 02e7617

Though I am not quite sure what you are suggesting here as I think the filters are applied (logically) before projection

Comment on lines +65 to +66
/// `FileScanConfig`. Fields in a `FileScanConfig` such as Statistics represent
/// information about the files **before** any projection or filtering is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good. I still get confused about stats before vs. after projection / filtering. I do have some caveats: for us we use an external index to prune row groups, so we can actually get more accurate statistics for min/max, num rows, etc. in a file taking into account row groups pruned. It's also useful to project the statistics (or at least only populate them for columns that are in filters or projections); no point in collecting / allocating statistics for columns that will never be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree what the code currently does is not ideal. However, before I fix it I need to understand what it currently does :)

BTW I am actually looking into the same sort of question for the EquivalenceProperties

/// Indexes that are higher than the number of columns of `file_schema` refer to `table_partition_cols`.
/// This method attempts to push down the projection to the underlying file
/// source if supported. If the file source does not support projection
/// pushdown, an error is returned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we return Ok(self) -> not an error, an unchanged plan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked and I think the comment is correct and this code does return an error

FileSource::try_pushdown_projection returns Option, but then this method returns an internal error if the file source returns None

A few lines below in this diff:

        if let Some(new_source) = new_source {
            self.file_source = new_source;
        } else {
            internal_err!(      // <------------ This error 
                "FileSource {} does not support projection pushdown",
                self.file_source.file_type()
            )?;
        }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah okay. Don't love that haha but I guess we are documenting the current status quo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree -- I struggled with fixing everything I found questionable with just documenting what it was doing

@AdamGS
Copy link
Contributor

AdamGS commented Feb 6, 2026

This is great! I think @adriangb made the whole FileScanConfig <> FileSource interaction and order of operations much better than it used to be, but I'm also a fan of making things clearer.


/// file format specific behaviors for elements in [`DataSource`]
///
/// # Schema information
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is one of the key disitinctions -- there are two relevant schemas and it is important to specify which is being used at any time

@alamb
Copy link
Contributor Author

alamb commented Feb 6, 2026

I do wonder if it would be okay to say the statistics are coupled to the scan plan -> if we know some row groups will not be read and we can use that information to make more accurate statistics we should / can.

One 🎣 for another day: how do struct statistics fit into our stats framework?

One thought I had was to use some sort of delayed statistics thing -- like have a callback to produce statistics and only compute them on demand when they are actually used. Otherwise figuring out what stats will be used is going to be a very tricky business. But maybe on demand would also be tricky

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Feb 6, 2026
@alamb
Copy link
Contributor Author

alamb commented Feb 6, 2026

Here is another follow on trying to make this code easier to understand

Copy link
Contributor

@zhuqi-lucas zhuqi-lucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice documentation improvement!

///
/// # Schema information
/// There are two important schemas for a [`FileSource`]:
/// 1. [`Self::table_schema`] -- the schema for the overall "table"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it be worth explicitly noting in the table_schema() doc that it includes partition columns? e.g. "the schema for the overall table (file schema + partition columns)". The FileScanConfig docs mention this, but it's easy to miss when reading FileSource alone.

@alamb
Copy link
Contributor Author

alamb commented Feb 9, 2026

Thank you @AdamGS @adriangb and @zhuqi-lucas -- I will make the additional documentation improvements you suggest as a follow on PR

@alamb alamb added this pull request to the merge queue Feb 9, 2026
Merged via the queue into apache:main with commit cc670e8 Feb 9, 2026
32 checks passed
@alamb alamb deleted the alamb/document_file_scan_statistics_better branch February 9, 2026 14:10
@alamb
Copy link
Contributor Author

alamb commented Feb 9, 2026

Follow on PR with some additional clarifications:

github-merge-queue bot pushed a commit that referenced this pull request Feb 9, 2026
…::filter and FileScanConfig::output_ordering (#20196)

## Which issue does this PR close?

- closes #20173
- Similar to #20188

## Rationale for this change

I spent a long time trying to figure out what was going on in our fork
after
a DataFusion 52 upgrade, and the root cause was that the output_ordering
in DataFusion
52 does *NOT* include the projection (more details here
#20173 (comment))

This was not clear to me from the DataFusion code or documentation, and
I think
it would be helpful to clarify this in the documentation.


## What changes are included in this PR?

1. Document FileScanConfig::output_ordering better


## Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
github-merge-queue bot pushed a commit that referenced this pull request Feb 9, 2026
…jection` (#20242)

## Which issue does this PR close?

- Follow on to #20188

## Rationale for this change

@zhuqi-lucas and @adriangb had some good ideas on how to further improve
the documentation on #20188,
which I tried to implement in this PR

## What changes are included in this PR?

Add more clarity about what TableSource and FileSource::projection are

## Are these changes tested?

By CI
## Are there any user-facing changes?

Additional documentation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate documentation Improvements or additions to documentation physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants