Fix Parquet with special characters in field names. #601
Conversation
rdsr
left a comment
+1
```java
  return copy;
}

public static String makeCompatibleName(String name) {
```
Nit: do you want to use this method in the org.apache.iceberg.avro.TypeToSchema#struct() API? Currently the code is:

```java
boolean isValidFieldName = AvroSchemaUtil.validAvroName(origFieldName);
String fieldName = isValidFieldName ? origFieldName : AvroSchemaUtil.sanitize(origFieldName);
```
I considered it, but isValidFieldName is reused to add the original name as a property, so I think it's fine as it is.
This also fixes Avro's special character handling.
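To make the validAvroName/sanitize pattern above concrete, here is a minimal, self-contained sketch of that style of name sanitization. It is not Iceberg's actual AvroSchemaUtil implementation; the class name, the `_x` plus hex encoding of invalid characters, and the use of Java's Unicode-aware character checks are all illustrative assumptions.

```java
// Hypothetical sketch of Avro-style field name sanitization, not Iceberg's
// real AvroSchemaUtil. Avro names roughly follow [A-Za-z_][A-Za-z0-9_]*.
public class NameSanitizer {
  // Returns true if the name already satisfies the (approximate) Avro grammar.
  static boolean validAvroName(String name) {
    if (name.isEmpty()) {
      return false;
    }
    char first = name.charAt(0);
    if (!(Character.isLetter(first) || first == '_')) {
      return false;
    }
    for (int i = 1; i < name.length(); i++) {
      char c = name.charAt(i);
      if (!(Character.isLetterOrDigit(c) || c == '_')) {
        return false;
      }
    }
    return true;
  }

  // Replaces each invalid character with "_x" plus its hex code point
  // (a hypothetical encoding chosen for this sketch).
  static String sanitize(String name) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      boolean valid = (i == 0)
          ? Character.isLetter(c) || c == '_'
          : Character.isLetterOrDigit(c) || c == '_';
      if (valid) {
        sb.append(c);
      } else {
        sb.append("_x").append(Integer.toHexString(c));
      }
    }
    return sb.toString();
  }

  // Mirrors the discussed pattern: keep valid names as-is, sanitize the rest.
  public static String makeCompatibleName(String name) {
    return validAvroName(name) ? name : sanitize(name);
  }

  public static void main(String[] args) {
    System.out.println(makeCompatibleName("data%0")); // prints "data_x250"
  }
}
```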
@rdsr, please have another look. I added tests to iceberg-data and ended up needing to fix a couple of things:
I think the second fix also addresses the case introduced by #207, where the Avro names don't match because they shouldn't be projected. Next, we should be able to fix Avro reads by using a similar pattern to iceberg-data, but one that produces Avro generics.
Thanks @rdblue. I'll have a look over the weekend.
rdsr
left a comment
LGTM @rdblue! I just had a small doubt regarding the new visitor.
```java
    0,
    Comparators.charSequences().compare("test", (CharSequence) full.getField("data%0")));

Record projected = writeAndRead("full_projection", schema, schema.select("data%0"), record);
```
nit: consider changing the description.
```java
}

private static <T> T visitArray(Type type, Schema array, AvroSchemaWithTypeVisitor<T> visitor) {
  if (array.getLogicalType() instanceof LogicalMap) {
```
Is it worthwhile to also check for AvroSchemaUtil.isKeyValueSchema(array.getElementType())?
No, but I think it would be good to check whether the Iceberg type is a map.
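The check being suggested above can be sketched with stand-in types. This is a hypothetical illustration of the design point, not Iceberg's real visitor: the enum, method names, and return values are all invented for the sketch. The idea is that when the Avro array carries the map logical type, the visitor should also confirm that the expected Iceberg type is a map before treating the elements as key/value pairs.

```java
// Hypothetical sketch of the visitArray dispatch discussed above, using
// stand-in types rather than Iceberg's real Type and Schema classes.
public class ArrayVisitSketch {
  // Stand-in for the Iceberg type ID of the expected schema node.
  enum IcebergTypeId { LIST, MAP }

  static String visitArray(IcebergTypeId expected, boolean isAvroLogicalMap) {
    if (isAvroLogicalMap) {
      // The Avro schema says this array encodes a map; cross-check that the
      // expected Iceberg type agrees before visiting key/value pairs.
      if (expected != IcebergTypeId.MAP) {
        throw new IllegalStateException("Avro map does not match expected type: " + expected);
      }
      return "visitMap";
    }
    return "visitList";
  }

  public static void main(String[] args) {
    System.out.println(visitArray(IcebergTypeId.MAP, true));   // prints "visitMap"
    System.out.println(visitArray(IcebergTypeId.LIST, false)); // prints "visitList"
  }
}
```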
```java
  return visit(iSchema.asStruct(), schema, visitor);
}

public static <T> T visit(Type iType, Schema schema, AvroSchemaWithTypeVisitor<T> visitor) {
```
When could the expected type be null? I see that we are traversing the NULL branch of the union; is it because of that? Also, I'm not clear on why we need to visit the NULL branch of the union.
The Iceberg type might be null if the Avro type has no corresponding field. For example, if we drop a column from an Iceberg schema and read an older data file, that column will not be in the read schema, but will be in file schemas.
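The dropped-column situation described above can be sketched in a few lines. This is an illustration of the idea, not Iceberg's actual code: the map, field IDs, and method names are all hypothetical. Because columns are resolved by field ID, a field present in an old file but dropped from the table has no entry in the read schema, so the expected type the visitor sees for it is null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: resolving a file field's expected Iceberg type by
// field ID. Field 3 was dropped from the table, so lookups return null.
public class ExpectedTypeSketch {
  static final Map<Integer, String> readSchema = new HashMap<>();
  static {
    readSchema.put(1, "long");   // id
    readSchema.put(2, "string"); // data
    // field 3 was dropped from the table, but still appears in old files
  }

  static String visitFileField(int fieldId) {
    String expected = readSchema.get(fieldId); // null for dropped columns
    if (expected == null) {
      return "skip"; // traverse the file schema, but produce no value
    }
    return "read as " + expected;
  }
}
```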
Merged. Thanks for reviewing, @Parth-Brahmbhatt and @rdsr!
This uses Avro's name sanitization methods to ensure that field names are compatible with Parquet. The names stored in each file don't actually matter because Iceberg uses field IDs.
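The point that stored names don't matter can be shown with a tiny sketch. The maps, field ID, and names below are hypothetical examples, not taken from the PR's test data layout: the file may store a sanitized name, but readers resolve the column through its field ID and surface the table-level name.

```java
import java.util.Map;

// Hypothetical sketch: the file stores a sanitized field name, but readers
// resolve columns by field ID, so users always see the table-level name.
public class FieldIdSketch {
  static final Map<Integer, String> fileNames = Map.of(1, "data_x250");
  static final Map<Integer, String> tableNames = Map.of(1, "data%0");

  static String displayName(int fieldId) {
    // The sanitized name in fileNames never participates in resolution.
    return tableNames.get(fieldId);
  }
}
```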