feat: Expose Parquet Schema Adapter#10515
Conversation
Allow users of the Datafusion Parquet physical executor to define how to map parquet schema to the table schema. This can be useful as there can be layers on top of parquet like Delta or Iceberg which may also define the schema and how the schema should evolve.
|
Thank you @HawaiianSpork -- I triggered the CI checks on this PR and plan to review it carefully in the next day or two |
alamb
left a comment
There was a problem hiding this comment.
Thank you for this wonderful contribution @HawaiianSpork -- a very impressive first PR 👏
I had some minor comment suggestions, but I also think this PR could be merged as is
I have some suggestions about code organization that maybe we can do as a follow on PR.
|
|
||
| /// The SchemaMapping struct holds a mapping from the file schema to the table schema | ||
| /// and any necessary type conversions that need to be applied. | ||
| #[cfg(feature = "parquet")] |
There was a problem hiding this comment.
It is unfortunate that we have to add the
#[cfg(feature = "parquet")]
all over the place
Maybe as a follow on PR we can pull this code into its own module (e.g. datasource/schema_adaptor.rs for example)
| use std::fmt::Debug; | ||
| use std::sync::Arc; | ||
|
|
||
| /// Factory of schema adapters. |
There was a problem hiding this comment.
I know that currently the only user of SchemaAdapter is parquet, but I don't think there is anything parquet specific about the logic here.
What do you think about moving the code (and default impl) somewhere like
datafusion/core/src/datasource/schema_adapter.rs
?
Perhaps we could do that as a follow on PR as the way you have done this PR makes it easy to see what you have changed / not changed 👍
| // Create several parquet files in same directoty / table with | ||
| // same schema but different metadata |
There was a problem hiding this comment.
Is this comment valid? This seems like it really makes a two files with column id and then uses the schema adapter to add a separate column
| "+----+--------------+", | ||
| "| id | extra_column |", | ||
| "+----+--------------+", | ||
| "| 1 | foo |", |
There was a problem hiding this comment.
It might help future readers if you left some comments explaining that the point of the test was to inject a column that doesn't appear in any of the files
| /// Options for reading Parquet files | ||
| table_parquet_options: TableParquetOptions, | ||
| /// Optional user defined schema adapter | ||
| schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>, |
There was a problem hiding this comment.
I think it is about time (not this PR) to create a ParquetExecBuilder so that new optional arguments can be aded without having to change all callsites.
|
are we waiting feedbacks from @HawaiianSpork or we good to merge the PR? |
I was waiting to see if they wanted to make any of the suggestions. If not, let's merge it in |
|
@HawaiianSpork please address the suggestions, you dont need to change the code, just make a response |
|
Sorry, to keep you hanging and thank you for the quick review. I agree with the suggestions but haven't circled back around to completing. If you want to wait, I will get to the suggestions early next week. Otherwise, I am also ok with this being merged and I can open a follow on PR to address the suggestions. |
This is not a change in behavior except moving the public location of SchemaAdapter. SchemaAdapter was exposed in apache#10515 to allow callers to define their own implementation. This PR then changes the location so that it could be used in other data sources.
This is not a change in behavior except moving the public location of SchemaAdapter. SchemaAdapter was exposed in apache#10515 to allow callers to define their own implementation. This PR then changes the location so that it could be used in other data sources.
This is not a change in behavior except moving the public location of SchemaAdapter. SchemaAdapter was exposed in apache#10515 to allow callers to define their own implementation. This PR then changes the location so that it could be used in other data sources.
* refactor: Move SchemaAdapter from parquet module to data source This is not a change in behavior except moving the public location of SchemaAdapter. SchemaAdapter was exposed in #10515 to allow callers to define their own implementation. This PR then changes the location so that it could be used in other data sources. * fix comments surrounding tests to be accurate.
* feat: Expose Parquet Schema Adapter
…he#10680) * refactor: Move SchemaAdapter from parquet module to data source This is not a change in behavior except moving the public location of SchemaAdapter. SchemaAdapter was exposed in apache#10515 to allow callers to define their own implementation. This PR then changes the location so that it could be used in other data sources. * fix comments surrounding tests to be accurate.
Allow users of the Datafusion Parquet physical executor to define how to map parquet schema to the table schema.
This can be useful as there can be layers on top of parquet like Delta or Iceberg which may also define the schema and how the schema should evolve.
Which issue does this PR close?
Closes #10398
Rationale for this change
By exposing
SchemaAdapterdownstream consumers can reuseParquetExecbut allow for different interpretations of the data from the parquet.For example, delta-rs keeps the schema separate from the parquet so that schema evolution can be well controlled. The external schema can enrich the data inside the parquet files with missing nested columns or timezone information.
What changes are included in this PR?
Changes
SchemaAdapterto a public trait and addsSchemaAdapterFactoryto be passed into the constructor ofParquetExec.Are there any user-facing changes?
This change does expose a new field for
ParquetExecthat can be specified. I was able to reuse the existing documentation for the now public trait.This change adds the optional
schema_adaptor_factoryto theParquetExecstruct. If a client was creating this struct directly they would have to change their code to now specifyNonefor theschema_adaptor_factory. If a consumer was using the builder they would be unaffected.