Skip to content

Conversation

@HonahX
Copy link
Contributor

@HonahX HonahX commented Apr 6, 2024

Fixes: #583

cc @Fokko

DateType: "date",
TimeType: "string",
TimestampType: "timestamp",
TimestamptzType: "timestamp",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Java, we map TimestamptzType to "timestamp with local time zone" for Hive version >= 3 TimestamptzType.

case TIMESTAMP:
        Types.TimestampType timestampType = (Types.TimestampType) type;
        if (HiveVersion.min(HiveVersion.HIVE_3) && timestampType.shouldAdjustToUTC()) {
          return "timestamp with local time zone";
        }
        return "timestamp";

But in python it seems we do not have a way to check the version. So I put the "timestamp" here to ensure compatibility. Please correct me if I am wrong here. Thanks!

ref: https://github.com/apache/iceberg/blob/main/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java#L143

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just set arbitrary strings in there? If so, I think timestamp with local time zone is more accurate. It would be good to validate using an integration test as well, since we have those tests already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, the content here should be constants in serdeConstants since java use these to test the Hive schema util
Among these types, timestamp with local time zone is a new one added in Hive3: ref.

Interestingly, when doing integration tests (use pyiceberg to write and spark to read) I found that Hive server will raise error for timestampabc but work normally for timestamp arbitrary string. Seems as long as the first word is within serdeConstants there will be no error loading the table. Not sure if timestamp arbitrary string will cause error in other Hive use cases.

How about we default to "timestamp with local time zone" and add a HiveCatalog property (e.g. hive.hive2-compatible-mode) to use "timestamp" here when it set to true?

(Will add the integration test soon)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context, has been a while since working with the Hive catalog.

How about we default to "timestamp with local time zone" and add a HiveCatalog property (e.g. hive.hive2-compatible-mode) to use "timestamp" here when it set to true?

I think that would be a great solution 👍

@HonahX HonahX mentioned this pull request Apr 6, 2024
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @HonahX Thanks for fixing this 👍

HonahX and others added 2 commits April 7, 2024 23:55
@HonahX
Copy link
Contributor Author

HonahX commented Apr 8, 2024

@Fokko Thanks for reviewing!

@HonahX HonahX merged commit 07442cc into apache:main Apr 8, 2024
HonahX added a commit to HonahX/iceberg-python that referenced this pull request Apr 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hive Catalog cannot create table with TimestamptzType field

2 participants