fix: Validate SetStatisticsUpdate correctly #2866

ragnard · 2025-12-26T22:18:20Z

Previously the pydantic @model_validator on SetStatisticsUpdate would fail because it assumed statistics was a model instance. In a "before" validator that is not necessarily the case.

Check type and handle both models and dicts instead.
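To illustrate why the assumption breaks, here is a minimal sketch with hypothetical stand-in models (Inner, Outer and the seen list are illustrative names, not pyiceberg code). It assumes pydantic v2 and only records what a "before" validator is handed for a nested field:

```python
from typing import Any, List

from pydantic import BaseModel, model_validator

# Collects the type name of the nested value seen by the validator.
seen: List[str] = []


class Inner(BaseModel):
    value: int


class Outer(BaseModel):
    inner: Inner

    @model_validator(mode="before")
    @classmethod
    def record_input_type(cls, data: Any) -> Any:
        # A "before" validator runs on the raw input, so the nested
        # value has not been converted to an Inner instance yet.
        if isinstance(data, dict):
            seen.append(type(data["inner"]).__name__)
        return data


Outer.model_validate({"inner": {"value": 1}})  # nested value is a plain dict
Outer(inner=Inner(value=1))                    # nested value is already a model

print(seen)  # ['dict', 'Inner']
```

The same validator sees a dict in one call and a model instance in the other, which is why attribute access alone is not safe there.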

Before

>>> import pyiceberg.table.update
>>> pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ragge/projects/github.com/ragnard/iceberg-python/.venv/lib/python3.14/site-packages/pydantic/main.py", line 716, in model_validate
    return cls.__pydantic_validator__.validate_python(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        obj,
        ^^^^
    ...<5 lines>...
        by_name=by_name,
        ^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ragge/projects/github.com/ragnard/iceberg-python/pyiceberg/table/update/__init__.py", line 191, in validate_snapshot_id
    data["snapshot_id"] = stats.snapshot_id
                          ^^^^^^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'snapshot_id'

After

>>> import pyiceberg.table.update
>>> pyiceberg.table.update.SetStatisticsUpdate.model_validate({'statistics': {'snapshot-id': 1234, 'file-size-in-bytes': 0, 'statistics-path': '', 'file-footer-size-in-bytes': 0, 'blob-metadata': []}})
SetStatisticsUpdate(action='set-statistics', statistics=StatisticsFile(snapshot_id=1234, statistics_path='', file_size_in_bytes=0, file_footer_size_in_bytes=0, key_metadata=None, blob_metadata=[]), snapshot_id=1234)

Rationale for this change

Are these changes tested?

Yes, but only using the two-liners above.

Are there any user-facing changes?

No.

kevinjqliu

Thanks for the PR. I think we should just deprecate the top-level snapshot_id entirely.
For context, the "before" model_validator was added in 3b53edc#diff-769b43e1d8beaa86141f694679de2bbea3604a65f987a9acd7d9e9efca193b7eR181-R193 to maintain backwards compatibility and prepare for deprecation.

kevinjqliu

Oops, didn't mean to approve.

ragnard · 2025-12-27T07:53:04Z

@kevinjqliu Ok, do you want me to change the fix so that snapshot_id is still there, but just not automatically populated?

kevinjqliu

Looks like the Java implementation has not deprecated the top-level snapshot_id. Let's proceed with this change since it improves the current validation logic.

Thanks for the PR! Let's add a test and it should be good to go!

kevinjqliu · 2025-12-28T19:14:50Z

pyiceberg/table/update/__init__.py

+    @model_validator(mode="after")
+    def validate_snapshot_id(self) -> Self:
+        return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id})


Suggested change

-    @model_validator(mode="after")
-    def validate_snapshot_id(self) -> Self:
-        return self.model_copy(update={"snapshot_id": self.statistics.snapshot_id})
+    @model_validator(mode="after")
+    def validate_snapshot_id(self) -> Self:
+        self.snapshot_id = self.statistics.snapshot_id
+        return self

nit: use direct assignment
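Direct assignment only works when the model is mutable. A minimal sketch, assuming pydantic v2 and a hypothetical frozen model (FrozenUpdate and its field names are illustrative, not pyiceberg code), of what an "after" validator runs into when it assigns to a field on a frozen model:

```python
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError, model_validator


class FrozenUpdate(BaseModel):
    # frozen=True makes instances immutable after (and during) validation.
    model_config = ConfigDict(frozen=True)

    source_id: int
    copied_id: Optional[int] = None

    @model_validator(mode="after")
    def copy_id(self) -> "FrozenUpdate":
        # Attribute assignment goes through the frozen check and is rejected.
        self.copied_id = self.source_id
        return self


rejected = False
try:
    FrozenUpdate(source_id=1)
except ValidationError:
    rejected = True

print(rejected)
```

In this sketch, construction fails rather than producing a model with copied_id filled in, so a frozen model needs a different approach, such as filling the value in before validation.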

pyiceberg/table/update/__init__.py

ragnard · 2025-12-29T19:30:43Z

> Looks like the Java implementation has not deprecated the top-level snapshot_id. Let's proceed with this change since it improves the current validation logic.
>
> Thanks for the PR! Let's add a test and it should be good to go!

@kevinjqliu Thanks for the (quick!) review. I've changed the fix a bit:

Since the model is frozen, it is not possible to use direct assignment, but it is also not possible to use model_copy like I did, because an "after" validator needs to return the same model instance. I've reverted to a "before" validator, but handle the dict case properly.
For testing, the issue is not really about tables with statistics, but that it was not possible to instantiate the SetStatisticsUpdate model from a dict. I added a model_roundtrips helper that is now used to check that any model roundtrips (model -> dict -> model) correctly.
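A roundtrip check of this kind could look roughly like the sketch below. The helper name model_roundtrips comes from the comment above, but its body here is an assumption, not the code from the PR; Example is a hypothetical model, and pydantic v2 is assumed.

```python
from pydantic import BaseModel, Field


def model_roundtrips(model: BaseModel) -> bool:
    """Return True if model -> dict -> model yields an equal model.

    Assumed implementation: dump with aliases so the dict matches the
    keys that validation expects, then rebuild and compare.
    """
    as_dict = model.model_dump(by_alias=True)
    rebuilt = type(model).model_validate(as_dict)
    return rebuilt == model


class Example(BaseModel):
    # Hypothetical model with an aliased field, like the statistics models.
    snapshot_id: int = Field(alias="snapshot-id")


example = Example.model_validate({"snapshot-id": 7})
assert model_roundtrips(example)
```

Because the rebuilt model is created from a dict, a validator that only handles model instances would fail at the model_validate step.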

Please let me know if you want further changes.

kevinjqliu

Thanks! Looks good. I found a small bug; it would be great to add a test for this case.

kevinjqliu · 2025-12-29T20:36:40Z

pyiceberg/table/update/__init__.py

+        elif isinstance(stats, dict):
+            snapshot_id = stats.get("snapshot_id")
+


Suggested change

-        elif isinstance(stats, dict):
-            snapshot_id = stats.get("snapshot_id")
+        elif isinstance(stats, dict):
+            snapshot_id = stats.get("snapshot_id")
+        else:
+            snapshot_id = None

nit: I think we can inline the else here

kevinjqliu · 2025-12-29T20:39:31Z

pyiceberg/table/update/__init__.py

+        if isinstance(stats, StatisticsFile):
+            snapshot_id = stats.snapshot_id
+        elif isinstance(stats, dict):
+            snapshot_id = stats.get("snapshot_id")


Suggested change

-            snapshot_id = stats.get("snapshot_id")
+            snapshot_id = stats.get("snapshot-id")

I think this should be snapshot-id, since the "before" validator takes the raw JSON-style input, which is keyed by the field alias:

iceberg-python/pyiceberg/table/statistics.py

Lines 32 to 40 in fa03e08

class StatisticsCommonFields(IcebergBaseModel):
    """Common fields between table and partition statistics structs found on metadata."""

    snapshot_id: int = Field(alias="snapshot-id")
    statistics_path: str = Field(alias="statistics-path")
    file_size_in_bytes: int = Field(alias="file-size-in-bytes")


class StatisticsFile(StatisticsCommonFields):
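The effect of the key choice can be seen without pydantic at all. A small sketch using only the standard library; the payload is an illustrative value shaped like the input from the PR description:

```python
import json

# Raw JSON input uses the hyphenated aliases, not the Python field names.
raw = '{"statistics": {"snapshot-id": 1234, "statistics-path": "", "file-size-in-bytes": 0}}'
stats = json.loads(raw)["statistics"]

by_field_name = stats.get("snapshot_id")  # key is absent, so this is None
by_alias = stats.get("snapshot-id")       # the key actually present in the input

print(by_field_name, by_alias)  # None 1234
```

Looking up the field name silently yields None instead of raising, which is why the dict branch needs its own test.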

Could you add a test case for this (and possibly one for the else case too)?

The current test only tests the StatisticsFile instance branch:

iceberg-python/tests/table/test_init.py

Lines 1370 to 1381 in fa03e08

def test_set_statistics_update(table_v2_with_statistics: Table) -> None:
    snapshot_id = table_v2_with_statistics.metadata.current_snapshot_id

    blob_metadata = BlobMetadata(
        type="apache-datasketches-theta-v1",
        snapshot_id=snapshot_id,
        sequence_number=2,
        fields=[1],
        properties={"prop-key": "prop-value"},
    )

    statistics_file = StatisticsFile(

ragnard · 2025-12-30

Good catch with the snapshot-id. I've fixed that and added a test specifically for the snapshot_id handling, creating the model from a model instance, a dict, and JSON. The roundtrip testing also exercises the previous bug.

Previously the pydantic @model_validator would fail because it assumed statistics was a model instance. In a "before" validator that is not necessarily the case. Check the type explicitly with isinstance instead, and handle the `dict` case too.
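The approach described above can be sketched with stand-in models. This is not the merged pyiceberg code: StatsStandIn, SetStatsStandIn and fill_snapshot_id are hypothetical names, pydantic v2 is assumed, and the sketch always overwrites the top-level snapshot_id for simplicity.

```python
from typing import Any, Optional

from pydantic import BaseModel, Field, model_validator


class StatsStandIn(BaseModel):
    # Stand-in for StatisticsFile: the input key is the hyphenated alias.
    snapshot_id: int = Field(alias="snapshot-id")


class SetStatsStandIn(BaseModel):
    statistics: StatsStandIn
    snapshot_id: Optional[int] = None

    @model_validator(mode="before")
    @classmethod
    def fill_snapshot_id(cls, data: Any) -> Any:
        if not isinstance(data, dict):
            return data
        stats = data.get("statistics")
        if isinstance(stats, StatsStandIn):
            snapshot_id = stats.snapshot_id         # already a model instance
        elif isinstance(stats, dict):
            snapshot_id = stats.get("snapshot-id")  # raw input, keyed by alias
        else:
            snapshot_id = None                      # missing or unexpected type
        return {**data, "snapshot_id": snapshot_id}


# The three input forms mentioned in the comment above.
from_instance = SetStatsStandIn(statistics=StatsStandIn.model_validate({"snapshot-id": 1}))
from_dict = SetStatsStandIn.model_validate({"statistics": {"snapshot-id": 2}})
from_json = SetStatsStandIn.model_validate_json('{"statistics": {"snapshot-id": 3}}')

print(from_instance.snapshot_id, from_dict.snapshot_id, from_json.snapshot_id)
```

Each construction path reaches a different branch or input mode of the validator, so asserting on snapshot_id for all three covers the cases raised in the review.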

kevinjqliu

LGTM!

geruh

LGTM!

kevinjqliu · 2025-12-30T19:43:58Z

Thank you @ragnard for the PR and thanks @geruh for the review!

kevinjqliu approved these changes Dec 26, 2025

kevinjqliu requested changes Dec 26, 2025

kevinjqliu reviewed Dec 28, 2025

ragnard force-pushed the fix-set-statistics-validation branch from 6f2b0d7 to aa85657 on December 29, 2025 19:23

ragnard force-pushed the fix-set-statistics-validation branch 4 times, most recently from 00cfd92 to fa456e3 on December 29, 2025 19:48

kevinjqliu reviewed Dec 29, 2025

ragnard force-pushed the fix-set-statistics-validation branch 2 times, most recently from ba3eb24 to 65e0f64 on December 30, 2025 09:10

ragnard force-pushed the fix-set-statistics-validation branch from 65e0f64 to e2517b0 on December 30, 2025 09:22

kevinjqliu approved these changes Dec 30, 2025

kevinjqliu changed the title from "fix: Validate SetStatisticsUpdate correctly (fixes #2865)" to "fix: Validate SetStatisticsUpdate correctly" on Dec 30, 2025

geruh approved these changes Dec 30, 2025

kevinjqliu merged commit 1b69a25 into apache:main Dec 30, 2025
8 checks passed

