Fix Parquet with special characters in field names. #601
Conversation
rdsr
left a comment
+1
```java
  return copy;
}

public static String makeCompatibleName(String name) {
```
Nit: do you want to use this method in the org.apache.iceberg.avro.TypeToSchema#struct() API? Currently the code is:

```java
boolean isValidFieldName = AvroSchemaUtil.validAvroName(origFieldName);
String fieldName = isValidFieldName ? origFieldName : AvroSchemaUtil.sanitize(origFieldName);
```
I considered it, but isValidFieldName is reused to add the original name as a property, so I think it's fine as it is.
This also fixes Avro's special character handling.
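To make the validAvroName/sanitize pattern above concrete, here is a minimal, self-contained sketch of that style of name sanitization. It is not Iceberg's actual AvroSchemaUtil implementation; the class name, the `_x` plus hex encoding of invalid characters, and the use of Java's Unicode-aware character checks are all illustrative assumptions.

```java
// Hypothetical sketch of Avro-style field name sanitization, not Iceberg's
// real AvroSchemaUtil. Avro names roughly follow [A-Za-z_][A-Za-z0-9_]*.
public class NameSanitizer {
  // Returns true if the name already satisfies the (approximate) Avro grammar.
  static boolean validAvroName(String name) {
    if (name.isEmpty()) {
      return false;
    }
    char first = name.charAt(0);
    if (!(Character.isLetter(first) || first == '_')) {
      return false;
    }
    for (int i = 1; i < name.length(); i++) {
      char c = name.charAt(i);
      if (!(Character.isLetterOrDigit(c) || c == '_')) {
        return false;
      }
    }
    return true;
  }

  // Replaces each invalid character with "_x" plus its hex code point
  // (a hypothetical encoding chosen for this sketch).
  static String sanitize(String name) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      boolean valid = (i == 0)
          ? Character.isLetter(c) || c == '_'
          : Character.isLetterOrDigit(c) || c == '_';
      if (valid) {
        sb.append(c);
      } else {
        sb.append("_x").append(Integer.toHexString(c));
      }
    }
    return sb.toString();
  }

  // Mirrors the discussed pattern: keep valid names as-is, sanitize the rest.
  public static String makeCompatibleName(String name) {
    return validAvroName(name) ? name : sanitize(name);
  }

  public static void main(String[] args) {
    System.out.println(makeCompatibleName("data%0")); // prints "data_x250"
  }
}
```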
@rdsr, please have another look. I added tests to iceberg-data and ended up needing to fix a couple of things:
I think the second fix also addresses the case introduced by #207, where the Avro names don't match because they shouldn't be projected. Next, we should be able to fix Avro reads by using a similar pattern to iceberg-data, but one that produces Avro generics.
Thanks @rdblue. I'll have a look over the weekend.
rdsr
left a comment
LGTM @rdblue! I just had a small doubt regarding the new visitor.
```java
    0,
    Comparators.charSequences().compare("test", (CharSequence) full.getField("data%0")));

Record projected = writeAndRead("full_projection", schema, schema.select("data%0"), record);
```
nit: consider changing the description.
```java
}

private static <T> T visitArray(Type type, Schema array, AvroSchemaWithTypeVisitor<T> visitor) {
  if (array.getLogicalType() instanceof LogicalMap) {
```
Is it worthwhile to also check for AvroSchemaUtil.isKeyValueSchema(array.getElementType())?
No, but I think it would be good to check whether the Iceberg type is a map.
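The check being suggested above can be sketched with stand-in types. This is a hypothetical illustration of the design point, not Iceberg's real visitor: the enum, method names, and return values are all invented for the sketch. The idea is that when the Avro array carries the map logical type, the visitor should also confirm that the expected Iceberg type is a map before treating the elements as key/value pairs.

```java
// Hypothetical sketch of the visitArray dispatch discussed above, using
// stand-in types rather than Iceberg's real Type and Schema classes.
public class ArrayVisitSketch {
  // Stand-in for the Iceberg type ID of the expected schema node.
  enum IcebergTypeId { LIST, MAP }

  static String visitArray(IcebergTypeId expected, boolean isAvroLogicalMap) {
    if (isAvroLogicalMap) {
      // The Avro schema says this array encodes a map; cross-check that the
      // expected Iceberg type agrees before visiting key/value pairs.
      if (expected != IcebergTypeId.MAP) {
        throw new IllegalStateException("Avro map does not match expected type: " + expected);
      }
      return "visitMap";
    }
    return "visitList";
  }

  public static void main(String[] args) {
    System.out.println(visitArray(IcebergTypeId.MAP, true));   // prints "visitMap"
    System.out.println(visitArray(IcebergTypeId.LIST, false)); // prints "visitList"
  }
}
```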
```java
  return visit(iSchema.asStruct(), schema, visitor);
}

public static <T> T visit(Type iType, Schema schema, AvroSchemaWithTypeVisitor<T> visitor) {
```
When could the expected type be null? I see that we are traversing the NULL branch of the union; is it because of that? Also, I'm not clear on why we need to visit the NULL branch of the union.
The Iceberg type might be null if the Avro type has no corresponding field. For example, if we drop a column from an Iceberg schema and read an older data file, that column will not be in the read schema, but will be in file schemas.
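The dropped-column situation described above can be sketched in a few lines. This is an illustration of the idea, not Iceberg's actual code: the map, field IDs, and method names are all hypothetical. Because columns are resolved by field ID, a field present in an old file but dropped from the table has no entry in the read schema, so the expected type the visitor sees for it is null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: resolving a file field's expected Iceberg type by
// field ID. Field 3 was dropped from the table, so lookups return null.
public class ExpectedTypeSketch {
  static final Map<Integer, String> readSchema = new HashMap<>();
  static {
    readSchema.put(1, "long");   // id
    readSchema.put(2, "string"); // data
    // field 3 was dropped from the table, but still appears in old files
  }

  static String visitFileField(int fieldId) {
    String expected = readSchema.get(fieldId); // null for dropped columns
    if (expected == null) {
      return "skip"; // traverse the file schema, but produce no value
    }
    return "read as " + expected;
  }
}
```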
Merged. Thanks for reviewing, @Parth-Brahmbhatt and @rdsr!
This uses Avro's name sanitization methods to ensure that field names are compatible with Parquet. The names stored in each file don't actually matter because Iceberg uses field IDs.
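The point that stored names don't matter can be shown with a tiny sketch. The maps, field ID, and names below are hypothetical examples, not taken from the PR's test data layout: the file may store a sanitized name, but readers resolve the column through its field ID and surface the table-level name.

```java
import java.util.Map;

// Hypothetical sketch: the file stores a sanitized field name, but readers
// resolve columns by field ID, so users always see the table-level name.
public class FieldIdSketch {
  static final Map<Integer, String> fileNames = Map.of(1, "data_x250");
  static final Map<Integer, String> tableNames = Map.of(1, "data%0");

  static String displayName(int fieldId) {
    // The sanitized name in fileNames never participates in resolution.
    return tableNames.get(fieldId);
  }
}
```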