-
Notifications
You must be signed in to change notification settings - Fork 421
Fix overwrite when filtering all the data #1023
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -38,12 +38,13 @@ | |
| from pyiceberg.catalog.rest import RestCatalog | ||
| from pyiceberg.catalog.sql import SqlCatalog | ||
| from pyiceberg.exceptions import NoSuchTableError | ||
| from pyiceberg.expressions import In | ||
| from pyiceberg.io.pyarrow import _dataframe_to_data_files | ||
| from pyiceberg.partitioning import PartitionField, PartitionSpec | ||
| from pyiceberg.schema import Schema | ||
| from pyiceberg.table import TableProperties | ||
| from pyiceberg.transforms import IdentityTransform | ||
| from pyiceberg.types import IntegerType, LongType, NestedField | ||
| from pyiceberg.types import IntegerType, LongType, NestedField, StringType | ||
| from utils import _create_table | ||
|
|
||
|
|
||
|
|
@@ -1309,3 +1310,27 @@ def test_table_v1_with_null_nested_namespace(session_catalog: Catalog, arrow_tab | |
|
|
||
| # We expect no error here | ||
| session_catalog.drop_table(identifier) | ||
|
|
||
|
|
||
| @pytest.mark.integration | ||
| def test_overwrite_all_data_with_filter(session_catalog: Catalog) -> None: | ||
| schema = Schema( | ||
| NestedField(1, "id", StringType(), required=True), | ||
| NestedField(2, "name", StringType(), required=False), | ||
| identifier_field_ids=[1], | ||
| ) | ||
|
|
||
| data = pa.Table.from_pylist( | ||
| [ | ||
| {"id": "1", "name": "Amsterdam"}, | ||
| {"id": "2", "name": "San Francisco"}, | ||
| {"id": "3", "name": "Drachten"}, | ||
| ], | ||
| schema=schema.as_arrow(), | ||
| ) | ||
|
|
||
| identifier = "default.test_overwrite_all_data_with_filter" | ||
| tbl = _create_table(session_catalog, identifier, data=[data], schema=schema) | ||
| tbl.overwrite(data, In("id", ["1", "2", "3"])) | ||
|
|
||
| assert len(tbl.scan().to_arrow()) == 3 | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: since all data match the filter, the
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not a no-op, it's deleting the whole file. The change is in the delete method, not in the overwrite method. I believe that testing the behavior is enough.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ah, I see. The change is to make
Previously, we end up trying to write an empty data file.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes exactly. One thing to note is that it would be even more correct to add this to a |
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is it more readable if inlined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I don't see much of a difference.