@sungwy
Collaborator

@sungwy sungwy commented Mar 18, 2024

As a follow-up to #506, this PR adds support for adding files as DataFiles to partitioned tables.

Rather than parsing partition values out of the file path under the assumption of a Hive partitioning scheme, which is the less accurate method, this approach requires that the partition values be present in the Parquet files, and infers each partition value from the lower and upper bound values in the Parquet metadata footer.
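For contrast, here is a minimal sketch of the path-based inference that this PR avoids. The helper name `parse_hive_partition_path` and the example path are hypothetical and not PyIceberg code; the point is only that values recovered from a path come back as untyped strings and depend entirely on how the files were laid out.

```python
from typing import Dict


def parse_hive_partition_path(path: str) -> Dict[str, str]:
    """Collect key=value directory segments from a Hive-style path (hypothetical helper)."""
    partitions: Dict[str, str] = {}
    # Skip the final segment, which is the file name rather than a partition directory.
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical path; every value is a plain string, so type information is lost.
example = parse_hive_partition_path("warehouse/events/year=2024/month=03/part-0.parquet")
```

A file that was not written under such a directory layout yields no partition values at all, which is one reason the bounds in the Parquet footer are the more reliable source.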

Using the lower and upper bound values is an optimization: the client can rely on the aggregated statistics in the Parquet metadata footer instead of reading the entire Parquet file. The inference is only sound when the transform preserves order, because equal transformed lower and upper bounds then imply that every value in the file maps to the same partition value. As a result, this implementation of add_files does not support tables with partition transforms that are non-linear (not preserves_order).

Among the existing Transforms, the following Transform partitions are supported:

  • IdentityTransform
  • TruncateTransform
  • YearTransform
  • MonthTransform
  • DayTransform
  • HourTransform

The following are not:

  • VoidTransform
  • BucketTransform
  • UnknownTransform
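
The bound-based inference described above can be sketched in plain Python. This is an illustration under stated assumptions, not the PyIceberg implementation: `SketchTransform`, `infer_partition_value`, and the three example transforms are hypothetical stand-ins, and the bucket function uses a simple modulo rather than the hash that Iceberg specifies.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Callable


@dataclass(frozen=True)
class SketchTransform:
    """Hypothetical stand-in for a partition transform."""
    name: str
    apply: Callable[[Any], Any]
    preserves_order: bool


def infer_partition_value(transform: SketchTransform, lower: Any, upper: Any) -> Any:
    """Infer a file's single partition value from its column bounds."""
    if not transform.preserves_order:
        # Without order preservation, equal results at the bounds say nothing
        # about the values that lie between them.
        raise ValueError(f"{transform.name} does not preserve order; cannot infer from bounds")
    low, high = transform.apply(lower), transform.apply(upper)
    if low != high:
        raise ValueError(f"{transform.name}: file spans more than one partition value ({low} vs {high})")
    return low


identity = SketchTransform("identity", lambda v: v, preserves_order=True)
# Months since 1970-01, so a later date never maps to a smaller value.
month = SketchTransform("month", lambda d: (d.year - 1970) * 12 + (d.month - 1), preserves_order=True)
# Simplified bucket: a modulo scatters neighbouring values, so order is not preserved.
bucket = SketchTransform("bucket[4]", lambda v: v % 4, preserves_order=False)

# All dates in this hypothetical file fall in March 2024, so one month value is inferred.
march = infer_partition_value(month, date(2024, 3, 1), date(2024, 3, 31))
```

With these definitions, a file whose bounds straddle March and April raises a `ValueError`, and the bucket transform is rejected before the bounds are even compared.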

@sungwy sungwy requested review from Fokko and HonahX March 18, 2024 18:46
Contributor

@Fokko Fokko left a comment

@syun64 Thanks for working on this, this looks great!

@sungwy sungwy requested a review from Fokko March 19, 2024 15:08
@sungwy
Collaborator Author

sungwy commented Mar 19, 2024

> @syun64 Thanks for working on this, this looks great!

Thank you very much for the detailed review @Fokko. I've adopted all of your review comments 👍 - I would appreciate another round of review!

Contributor

@Fokko Fokko left a comment

This looks good, thanks again for the work @syun64

@Fokko Fokko merged commit 6989b92 into apache:main Mar 21, 2024
@sungwy
Collaborator Author

sungwy commented Mar 21, 2024

> This looks good, thanks again for the work @syun64

Thank you! As always! @Fokko

@sungwy sungwy deleted the add-files-partitioned branch March 21, 2024 20:28
@sungwy sungwy added this to the PyIceberg 0.7.0 release milestone Jul 20, 2024