Conversation
@chamikaramj
lol I missed this one
I'm wondering if we can strip out the UDF-based dynamic destinations and think about how to introduce dynamic destinations to this I/O in a portable way, based on https://s.apache.org/portable-dynamic-destinations
I left them in for a bit of abstraction, but they can be an implementation detail, and IcebergIO.writeToDestinations(...) can just take the string pattern. I haven't done that part yet; I was mostly focused on getting the main body of the transform to operate only on Rows.
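As a rough sketch of what a string-pattern destination could look like, here is a hypothetical `DestinationPattern` helper with an assumed `{field}` placeholder syntax. This is not the actual IcebergIO API; the record is modeled as a plain `Map` rather than a Beam `Row` to keep the example self-contained.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DestinationPattern {
  private static final Pattern FIELD_REF = Pattern.compile("\\{(\\w+)\\}");

  // Resolve a destination pattern like "db.{country}_events" by
  // substituting field values from the record.
  public static String resolve(String pattern, Map<String, String> fields) {
    Matcher m = FIELD_REF.matcher(pattern);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      String value = fields.get(m.group(1));
      if (value == null) {
        throw new IllegalArgumentException("No field named " + m.group(1));
      }
      m.appendReplacement(sb, Matcher.quoteReplacement(value));
    }
    m.appendTail(sb);
    return sb.toString();
  }
}
```

Because the pattern is just a string, it can cross the portability boundary without serializing any user code.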
Is it possible to easily convert "IcebergCatalog" into a portable representation for SchemaTransforms?
TBD. Leaving all "catalog" questions unresolved for this revision.
I would just limit this to PTransform&lt;PCollection&lt;Row&gt;, IcebergWriteResult&lt;Row&gt;&gt; to make this portability-first and friendly for SchemaTransforms.
Done (and even simpler)
Any idea how we got to these defaults? (If so, we should document.)
I have no idea; the number 20 must just be a guess. Some of the others appear to be BigQuery quota limitations that we can ignore. One thing we should do: I've read a lot online about the ideal Iceberg file size being 512 MB (that's what some internal Iceberg code uses, I gather), so perhaps we follow that. I'm still learning the Iceberg Java APIs and the best way to follow their best practices.
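If a 512 MB target were adopted, the rolling behavior can be illustrated with a toy model: start a new file whenever the current one would exceed the target. The `FileRoller` class and `planFiles` method below are hypothetical, not the transform's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class FileRoller {
  // Illustrative target; the discussion above suggests 512 MB.
  public static final long TARGET_FILE_BYTES = 512L * 1024 * 1024;

  // Split a stream of record sizes into file sizes, rolling to a new
  // file whenever the current one would exceed the target.
  public static List<Long> planFiles(long[] recordBytes, long targetFileBytes) {
    List<Long> files = new ArrayList<>();
    long current = 0;
    for (long b : recordBytes) {
      if (current > 0 && current + b > targetFileBytes) {
        files.add(current);
        current = 0;
      }
      current += b;
    }
    if (current > 0) {
      files.add(current);
    }
    return files;
  }
}
```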
Can we use the new DLQ framework instead? (This seems to follow the old DLQ implementation in BQ.) The new framework also considers portability aspects, for example, so it's more advantageous.
https://docs.google.com/document/d/1NGeCk6tOqF-TiGEAV7ixd_vhIiWz9sHPlCa1P_77Ajs/edit?tab=t.0#heading=h.fppublcudjbt
(can be a separate PR but we should remove the DLQ feature from this PR in that case)
I just left it out for now.
Not sure what we are doing here. Are we trying to write failed records again and flatten them with the originally written records (in the subsequent step below)?
Possibly we should be writing failed records to a DLQ?
Re-reading the code, it seems like failedWrites here are actually due to the previous WriteBundlesToFiles exceeding one of the limits provided to the transform (DEFAULT_MAX_WRITERS_PER_BUNDLE, DEFAULT_MAX_BYTES_PER_FILE). We group the known set of spilled-over records and write them in the subsequent transform, which makes sense. We should probably rename 'failedWrites' to 'spilledOverWrites'.
I have now totally refactored this and renamed everything. Thanks for your description; it helped a lot to understand how to organize it.
Probably rename to MetadataUpdateDoFn for clarity.
Done, but I still need to refactor this out anyhow.
Probably this should be followed up by another GBK and a cleanup step that deletes temp files (of this step and any failed work items).
Oh, and by the way, the files are not temporary; they become part of the table. So it is simpler than the BQ equivalent.
Seems like this has a lot of copied over logic from BQ dynamic destinations which probably we can simplify/change if we went with the new DLQ framework.
Gotcha. I actually removed all the logic and just do something extremely basic for now. I guess DLQ could be an update-incompatible change, so I'd better get that done quickly too.
Seems like org.apache.hadoop.conf.Configuration is a set of string key/value pairs.
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/conf/Configuration.html
Maybe we should just accept a set of string key/value pairs and build the Hadoop Configuration from them, to make this more portability friendly.
That makes sense. Leaving this unresolved as I did not get to this yet.
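The portable shape suggested above can be sketched as follows. The `PortableConf` helper is hypothetical, and `conf.set(key, value)` is modeled as a `BiConsumer` so the example carries no Hadoop dependency; in real code the consumer would be the Configuration's own setter.

```java
import java.util.Map;
import java.util.function.BiConsumer;

public class PortableConf {
  // A portable catalog config is just ordered string key/value pairs.
  // Applying it to a Hadoop Configuration reduces to calling
  // conf.set(key, value) for each entry, e.g. apply(props, conf::set).
  public static void apply(Map<String, String> props, BiConsumer<String, String> confSet) {
    props.forEach(confSet);
  }
}
```

Since the transform's configuration is then plain strings, it can be expressed in a SchemaTransform config Row without any Java-only types.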
kennknowles left a comment:
OK I did a major revision to clarify things and streamline the main logic around writing rows. Still need another major revision to address the remaining non-portable pieces and DLQ.
OK I have done a whole massive revision and tested it a little bit more. The only piece that I have not revised is the
It looks like this might work: https://github.com/tabular-io/iceberg-kafka-connect/blob/5ab5c538efab9ccf3cde166f36ba34189eed7187/kafka-connect/src/main/java/io/tabular/iceberg/connect/IcebergSinkConfig.java#L256
chamikaramj left a comment:
Thanks. Looks great and almost there!
sdks/java/io/iceberg/src/main/java/org/apache/beam/io/iceberg/IcebergIO.java
If this is not configurable, let's document.
It should be configurable. In testing, I have discovered that the ORC codepath doesn't work so I've changed it to throw.
Shouldn't this update be atomic for all files?
If so, we might have to push this to a separate step behind a shuffle.
The key question is what will happen if the step fails after writing some of the elements and gets retried.
All the files per destination are grouped into a single atomic commit. There are two things that could go wrong:
- Failure after the commit but before downstream processing, so a new transaction will try to append the same files. I verified that this is idempotent (and I included it as a unit test just to clarify).
- Some tables successfully commit but then there are enough failures that the pipeline itself fails. We probably can do a multi-table transaction. We would write the various files all to a manifest and then merge to a single thread and commit all the manifests at once. We don't do this for other sinks, do we?
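The idempotency claimed in point (1) can be illustrated with a toy model. This `ToyTable` is not the actual Iceberg commit API (which goes through Table.newAppend() and a catalog transaction); it only models the observable property the unit test verifies: re-committing the same data files leaves the table unchanged.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ToyTable {
  // Toy model of a table's data-file set: appending an already-present
  // file path is a no-op, so a retried commit of the same files is
  // idempotent.
  private final Set<String> dataFiles = new LinkedHashSet<>();

  public void commitAppend(Iterable<String> files) {
    for (String f : files) {
      dataFiles.add(f); // duplicate paths are ignored
    }
  }

  public int fileCount() {
    return dataFiles.size();
  }
}
```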
Yeah, (2) is fine. It's more about making sure that we don't double write if a work item fails. But if writing is idempotent it's simpler.
Sorry to be late on this; I'm just wondering whether we would need a kind of "commit coordinator" to make sure we have one commit at a time: concurrent commits could be problematic in Iceberg.
I am not that familiar with the iceberg libraries. I was under the impression that the optimistic concurrency protocol was handled by them (https://iceberg.apache.org/docs/1.5.2/reliability/#concurrent-write-operations and on filesystem tables described by https://iceberg.apache.org/spec/#file-system-tables).
Let's make sure that this is covered by unit testing.
Done, somewhat. Could use some data generators to thoroughly test.
Are these types not supported?
If so, we should fail instead of dropping them?
omg, yes, haha, I didn't notice this. Fixed: added more support and testing for some types, and we now throw for the other ones that are not yet supported. We will want to fast-follow with support, but some of the date semantics are unclear to me (e.g., an Iceberg DATE is stored as a Long, but I'm not sure exactly what it represents).
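The fail-instead-of-drop approach can be sketched as below. The `TypeMapper` class, the type-name strings, and the chosen subset of mappings are all illustrative, not the actual conversion code in the PR.

```java
public class TypeMapper {
  // Map a handful of Iceberg type names to Beam Schema type names and
  // throw, rather than silently drop, anything unrecognized.
  public static String toBeamTypeName(String icebergType) {
    switch (icebergType) {
      case "boolean": return "BOOLEAN";
      case "int":     return "INT32";
      case "long":    return "INT64";
      case "float":   return "FLOAT";
      case "double":  return "DOUBLE";
      case "string":  return "STRING";
      case "binary":  return "BYTES";
      default:
        throw new UnsupportedOperationException(
            "Unsupported Iceberg type: " + icebergType);
    }
  }
}
```

Throwing eagerly surfaces unsupported schemas at construction time instead of producing rows with missing fields.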
UUID is BYTES, not STRING?
Yeah, it is a Java UUID, which contains a byte[].
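For illustration, a `java.util.UUID` is 128 bits and serializes naturally to 16 bytes, which is why BYTES is the natural mapping. The `UuidBytes` helper below is hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytes {
  // Serialize a 128-bit UUID as 16 big-endian bytes.
  public static byte[] toBytes(UUID uuid) {
    ByteBuffer buf = ByteBuffer.allocate(16);
    buf.putLong(uuid.getMostSignificantBits());
    buf.putLong(uuid.getLeastSignificantBits());
    return buf.array();
  }

  public static UUID fromBytes(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    return new UUID(buf.getLong(), buf.getLong());
  }
}
```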
settings.gradle.kts
It doesn't look like we add anything under "sdks:java:io:catalog".
- remove Read path (will propose separately)
- re-enable checking, fix type errors
- some style adjustments
kennknowles left a comment:
Thanks for all the review!
Hello, could you please kindly update the doc below with the merged implementation?
This is a basic Iceberg sink. Somewhat in the style of BigQuery file loads:
And how it works, roughly:
I'm a bit of an Iceberg newb. Byron did the first draft and I just refactored and added some stuff to it. This has some small tests but needs integration tests and larger tests. It is a starting point for integrating with @ahmedabu98's work on managed transforms.