-
Notifications
You must be signed in to change notification settings - Fork 414
fix: Validate SetStatisticsUpdate correctly #2866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. I think we should just deprecate the top-level snapshot_id entirely.
For context, the "before model_validator` was added in 3b53edc#diff-769b43e1d8beaa86141f694679de2bbea3604a65f987a9acd7d9e9efca193b7eR181-R193 to maintain backwards compatibility and prep for deprecation
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops didnt mean to approve
|
@kevinjqliu Ok, do you want me to change the fix so that |
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the java implementation has not deprecated the top-level snapshot_id. lets proceed with this change since it improves the current validation logic.
Thanks for the PR! Lets add a test and it should be good to go!
pyiceberg/table/update/__init__.py
Outdated
| @model_validator(mode="after") | ||
| def validate_snapshot_id(self) -> Self: | ||
| return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| @model_validator(mode="after") | |
| def validate_snapshot_id(self) -> Self: | |
| return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id}) | |
| @model_validator(mode="after") | |
| def validate_snapshot_id(self) -> Self: | |
| self.snapshot_id = self.statistics.snapshot_id | |
| return self |
nit: use direct assignment
6f2b0d7 to
aa85657
Compare
@kevinjqliu Thanks for the (quick!) review. I've changed the fix a bit:
Please let me know if you want further changes. |
00cfd92 to
fa456e3
Compare
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looks good, i found a small bug. Would be great to add a test for this case
| elif isinstance(stats, dict): | ||
| snapshot_id = stats.get("snapshot_id") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| elif isinstance(stats, dict): | |
| snapshot_id = stats.get("snapshot_id") | |
| elif isinstance(stats, dict): | |
| snapshot_id = stats.get("snapshot_id") | |
| else: | |
| snapshot_id = None | |
nit: i think we can inline the else here
pyiceberg/table/update/__init__.py
Outdated
| if isinstance(stats, StatisticsFile): | ||
| snapshot_id = stats.snapshot_id | ||
| elif isinstance(stats, dict): | ||
| snapshot_id = stats.get("snapshot_id") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| snapshot_id = stats.get("snapshot_id") | |
| snapshot_id = stats.get("snapshot-id") |
i think this should be snapshot-id since before validator takes in json as input
iceberg-python/pyiceberg/table/statistics.py
Lines 32 to 40 in fa03e08
| class StatisticsCommonFields(IcebergBaseModel): | |
| """Common fields between table and partition statistics structs found on metadata.""" | |
| snapshot_id: int = Field(alias="snapshot-id") | |
| statistics_path: str = Field(alias="statistics-path") | |
| file_size_in_bytes: int = Field(alias="file-size-in-bytes") | |
| class StatisticsFile(StatisticsCommonFields): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add a test case for this (and possibly one for the else case too)?
The current test only test the StatisticsFile instance branch
iceberg-python/tests/table/test_init.py
Lines 1370 to 1381 in fa03e08
| def test_set_statistics_update(table_v2_with_statistics: Table) -> None: | |
| snapshot_id = table_v2_with_statistics.metadata.current_snapshot_id | |
| blob_metadata = BlobMetadata( | |
| type="apache-datasketches-theta-v1", | |
| snapshot_id=snapshot_id, | |
| sequence_number=2, | |
| fields=[1], | |
| properties={"prop-key": "prop-value"}, | |
| ) | |
| statistics_file = StatisticsFile( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch with the snapshot-id. I've fixed that, and added a test specifically for the snapshot_id handling, creating a model both from a model instance, dict and json. the roundtrip testing also exercises the previous bug.
ba3eb24 to
65e0f64
Compare
Previously the pydantic @model_validator would fail because it assumed statistics was a model instance. In a "before"" validator that is not necessarily the case. Check type explicitly with isinstance instead, and handle `dict` case too.
65e0f64 to
e2517b0
Compare
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
geruh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Closes #2865
Previously the pydantic
@model_validatoronSetStatisticsUpdatewould fail because it assumed statistics was a model instance. In a "before"" validator that is not necessarily case.Check type and handle both models and dicts instead.
Before
After
Rationale for this change
Are these changes tested?
Yes, but only using the two-liners above.
Are there any user-facing changes?
No.