Better document the relationship between FileFormat::projection / FileFormat::filter and FileScanConfig::Statistics #20188
Conversation
6524de7 to
5525203
Compare
FileFormat::projection / FileFormat::filter and FileScanConfig::Statistics
5525203 to
8b74a78
Compare
|
@adriangb @AdamGS and @zhuqi-lucas I wonder if you have some time to help me validate my understanding of how this code is all related and review this PR? In general I have recently found that we have added so many (cool) features to DataFusion that it is harder and harder to understand how everything works together. I hope to help this situation by adding some more documentation and clarification over the next few days |
There was a problem hiding this comment.
Thanks @alamb !
I honestly don't know what the intention or best design is for the statistics.
I think before the expression projection makes sense, but it doesn't make sense to populate statistics for columns that are not involved in the query.
I do wonder if it would be okay to say the statistics are coupled to the scan plan -> if we know some row groups will not be read and we can use that information to make more accurate statistics we should / can.
One 🎣 for another day: how do struct statistics fit into our stats framework?
| /// The output schema of this `FileSource` is this TableSchema | ||
| /// with [`Self::projection`] applied. |
There was a problem hiding this comment.
Wonder if we should add a helper method to FileSource that applies the projection?
There was a problem hiding this comment.
There is ProjectionExprs::project_schema which does so. Maybe I could add a link to that
| /// | ||
| /// # Schema information | ||
| /// There are two important schemas for a [`FileSource`]: | ||
| /// 1. [`Self::table_schema`] -- the schema for the overall "table" |
There was a problem hiding this comment.
Does this include partition columns?
There was a problem hiding this comment.
I think it be worth explicitly noting in the table_schema() doc that it includes partition columns? e.g. "the schema for the overall table (file schema + partition columns)". The FileScanConfig docs mention this, but it's easy to miss when reading FileSource alone.
datafusion/datasource/src/file.rs
Outdated
| /// `filters` must be in terms of the unprojected table schema (file schema | ||
| /// plus partition columns). |
There was a problem hiding this comment.
👍🏻
Worth clarifying what is expected for a FileSource that:
- Applies filters independently of projections
- Applies filters after reading the data / projecting
?
There was a problem hiding this comment.
I tried to clarify in 02e7617
Though I am not quite sure what you are suggesting here as I think the filters are applied (logically) before projection
| /// `FileScanConfig`. Fields in a `FileScanConfig` such as Statistics represent | ||
| /// information about the files **before** any projection or filtering is |
There was a problem hiding this comment.
I think this is good. I still get confused about stats before vs. after projection / filtering. I do have some caveats: for us we use an external index to prune row groups, so we can actually get more accurate statistics for min/max, num rows, etc. in a file taking into account row groups pruned. It's also useful to project the statistics (or at least only populate them for columns that are in filters or projections); no point in collecting / allocating statistics for columns that will never be used.
There was a problem hiding this comment.
I agree what the code currently does is not ideal. However, before I fix it I need to understand what it currently does :)
BTW I am actually looking into the same sort of question for the EquivalenceProperties
| /// Indexes that are higher than the number of columns of `file_schema` refer to `table_partition_cols`. | ||
| /// This method attempts to push down the projection to the underlying file | ||
| /// source if supported. If the file source does not support projection | ||
| /// pushdown, an error is returned. |
There was a problem hiding this comment.
I think we return Ok(self) -> not an error, an unchanged plan?
There was a problem hiding this comment.
I double checked and I think the comment is correct and this code does return an error
FileSource::try_pushdown_projection returns Option, but then this method returns an internal error if the file source returns None
A few lines below in this diff:
if let Some(new_source) = new_source {
self.file_source = new_source;
} else {
internal_err!( // <------------ This error
"FileSource {} does not support projection pushdown",
self.file_source.file_type()
)?;
}There was a problem hiding this comment.
Hmm yeah okay. Don't love that haha but I guess we are documenting the current status quo.
There was a problem hiding this comment.
Yeah, I agree -- I struggled with fixing everything I found questionable with just documenting what it was doing
|
This is great! I think @adriangb made the whole |
|
|
||
| /// file format specific behaviors for elements in [`DataSource`] | ||
| /// | ||
| /// # Schema information |
There was a problem hiding this comment.
I think this is one of the key disitinctions -- there are two relevant schemas and it is important to specify which is being used at any time
One thought I had was to use some sort of delayed statistics thing -- like have a callback to produce statistics and only compute them on demand when they are actually used. Otherwise figuring out what stats will be used is going to be a very tricky business. But maybe on demand would also be tricky |
|
Here is another follow on trying to make this code easier to understand |
zhuqi-lucas
left a comment
There was a problem hiding this comment.
LGTM, nice documentation improvement!
| /// | ||
| /// # Schema information | ||
| /// There are two important schemas for a [`FileSource`]: | ||
| /// 1. [`Self::table_schema`] -- the schema for the overall "table" |
There was a problem hiding this comment.
I think it be worth explicitly noting in the table_schema() doc that it includes partition columns? e.g. "the schema for the overall table (file schema + partition columns)". The FileScanConfig docs mention this, but it's easy to miss when reading FileSource alone.
|
Thank you @AdamGS @adriangb and @zhuqi-lucas -- I will make the additional documentation improvements you suggest as a follow on PR |
|
Follow on PR with some additional clarifications: |
…::filter and FileScanConfig::output_ordering (#20196) ## Which issue does this PR close? - closes #20173 - Similar to #20188 ## Rationale for this change I spent a long time trying to figure out what was going on in our fork after a DataFusion 52 upgrade, and the root cause was that the output_ordering in DataFusion 52 does *NOT* include the projection (more details here #20173 (comment)) This was not clear to me from the DataFusion code or documentation, and I think it would be helpful to clarify this in the documentation. ## What changes are included in this PR? 1. Document FileScanConfig::output_ordering better ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
…jection` (#20242) ## Which issue does this PR close? - Follow on to #20188 ## Rationale for this change @zhuqi-lucas and @adriangb had some good ideas on how to further improve the documentation on #20188, which I tried to implement in this PR ## What changes are included in this PR? Add more clarity about what TableSource and FileSource::projection are ## Are these changes tested? By CI ## Are there any user-facing changes? Additional documentation
Which issue does this PR close?
Rationale for this change
I am debugging an issue related to the interplay of pre-existing orderings, filtering and projections in FileScanConfig. As part of that I am trying to understand how
Statisticswere handled byFileScanConfig-- specifically when relatively speaking are the projection and filtering appliedAfter some study, I have found that the statistics are (supposed?) to be before applying the Filter and Projection from the scan, so let's document that better. Also I found the schemas involved to be a bit confusing.
I also would like to use this PR to validate my understanding of the intended design
What changes are included in this PR?
Update documentation
Are these changes tested?
by CI
Are there any user-facing changes?
Just documentation changes, no functional changes