Positional delete does not correctly decrement snapshot summary's total-records #12823

@kevinjqliu

Apache Iceberg version

1.8.1 (latest release)

Query engine

Spark

Please describe the bug 🐞

A merge-on-read (MoR) delete that writes a positional delete file does not update total-records in the snapshot summary.

This can be seen in the pyiceberg example here, where a single row is deleted but total-records remains the same.

A copy-on-write (CoW) delete, where the data file is rewritten, does not have this problem: total-records is decremented properly, as shown here (although it is decremented from the previously miscalculated total-records).

I think this issue has persisted for quite a while. I found both #7463 and #6709.

#7463 shows that the delete (DELETE FROM default.t1 WHERE foo = 'b') produces an OVERWRITE snapshot with the following summary:

{
    "spark.app.id": "local-1682689536619",
    "changed-partition-count": "1",
    "added-position-deletes": "1",
    "total-equality-deletes": "0",
    "total-position-deletes": "1",
    "added-position-delete-files": "1",
    "added-files-size": "1490",
    "total-delete-files": "1",
    "added-delete-files": "1",
    "total-files-size": "2387",
    "total-records": "3",
    "total-data-files": "1"
}

where 'total-records': '3' is the same as in the previous snapshot, even though a row has been deleted.
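To make the mismatch concrete, here is a minimal Python sketch that compares the reported total-records against what I would expect. It assumes the expectation stated in this issue, namely that each added position delete removes one live row (this would overcount if the same row position were deleted more than once). `expected_total_records` is a hypothetical helper written for illustration, not an Iceberg or pyiceberg API, and the previous total of 3 is taken from the report in #7463.

```python
import json

# Snapshot summary copied verbatim from the OVERWRITE snapshot in #7463.
SUMMARY_JSON = """
{
    "spark.app.id": "local-1682689536619",
    "changed-partition-count": "1",
    "added-position-deletes": "1",
    "total-equality-deletes": "0",
    "total-position-deletes": "1",
    "added-position-delete-files": "1",
    "added-files-size": "1490",
    "total-delete-files": "1",
    "added-delete-files": "1",
    "total-files-size": "2387",
    "total-records": "3",
    "total-data-files": "1"
}
"""


def expected_total_records(previous_total: int, summary: dict) -> int:
    """Hypothetical helper: the total-records this issue argues for.

    Assumption (not confirmed Iceberg behavior): every added position
    delete removes exactly one live row from the table.
    """
    added_records = int(summary.get("added-records", 0))
    deleted_records = int(summary.get("deleted-records", 0))
    added_position_deletes = int(summary.get("added-position-deletes", 0))
    return previous_total + added_records - deleted_records - added_position_deletes


summary = json.loads(SUMMARY_JSON)
previous_total = 3  # total-records of the previous snapshot, per #7463

expected = expected_total_records(previous_total, summary)
reported = int(summary["total-records"])

print(f"reported total-records: {reported}")
print(f"expected total-records: {expected}")
print(f"mismatch: {reported != expected}")
```

With the summary above, the reported value stays at 3 while the expected value under this assumption is 2.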

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time

Labels: bug (Something isn't working), stale
