Skip to content

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Nov 1, 2019

This uses Avro's name sanitization methods to ensure that field names are compatible with Parquet. The names stored in each file don't actually matter because Iceberg uses field IDs.

Copy link
Contributor

@rdsr rdsr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

return copy;
}

public static String makeCompatibleName(String name) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: do you want to use this method in org.apache.iceberg.avro.TypeToSchema#struct() api?
Currently the code is

boolean isValidFieldName = AvroSchemaUtil.validAvroName(origFieldName);
String fieldName =  isValidFieldName ? origFieldName : AvroSchemaUtil.sanitize(origFieldName);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered it, but the isValidFieldName is reused to add the original name as a property, so I think it's fine as it is.

@rdblue
Copy link
Contributor Author

rdblue commented Nov 1, 2019

@rdsr, please have another look. I added tests to iceberg-data and ended up needing to fix a couple of things:

  • BuildAvroProjection couldn't project fields with special characters in the name because Avro would reject the projection schema. I added a couple calls to makeCompatibleName to fix this. It should be safe because special characters would cause failures in these cases before.
  • The generic DataReader created iceberg-data records with a schema converted from Avro. When using sanitized names, the records would not use the right field names for getField. The fix was to pass in the original Iceberg schema. To make this easy, I added a new visitor.

I think the second fix also addresses the case introduced by #207 when the Avro names don't match because the shouldn't be projected. Next, we should be able to fix Avro reads by using a similar pattern to iceberg-data, but one that produces Avro generics.

@rdsr
Copy link
Contributor

rdsr commented Nov 2, 2019

Thanks @rdblue . I'll have a look over the weekend

Copy link
Contributor

@rdsr rdsr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @rdblue ! I just had a small doubt regarding the new visitor.

0,
Comparators.charSequences().compare("test", (CharSequence) full.getField("data%0")));

Record projected = writeAndRead("full_projection", schema, schema.select("data%0"), record);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider changing the desc.

}

private static <T> T visitArray(Type type, Schema array, AvroSchemaWithTypeVisitor<T> visitor) {
if (array.getLogicalType() instanceof LogicalMap) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth while to also check for AvroSchemaUtil.isKeyValueSchema(array.getElementType())?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but I think it would be good to check whether the Iceberg type is a map.

return visit(iSchema.asStruct(), schema, visitor);
}

public static <T> T visit(Type iType, Schema schema, AvroSchemaWithTypeVisitor<T> visitor) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When could the expected type be null? I see that we are traversing NULL branch of the union, is it because of that? Also, I'm not clear on why do we need to visit the NULL branch of the union

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Iceberg type might be null if the Avro type has no corresponding field. For example, if we drop a column from an Iceberg schema and read an older data file, that column will not be in the read schema, but will be in file schemas.

@rdblue rdblue merged commit ad4cd25 into apache:master Nov 9, 2019
@rdblue
Copy link
Contributor Author

rdblue commented Nov 9, 2019

Merged. Thanks for reviewing, @Parth-Brahmbhatt and @rdsr!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants