Migrate Avro reader to arrow-avro and remove internal conversion code #17861

Merged
alamb merged 81 commits into apache:main from getChan:arrow-avro
Mar 26, 2026

getChan commented Oct 1, 2025

Which issue does this PR close?

Rationale for this change

DataFusion previously maintained custom Avro-to-Arrow conversion logic.
This PR migrates Avro reading to arrow-avro to align behavior with upstream Arrow and remove duplicated implementation.

What changes are included in this PR?

  • Switched DataFusion Avro reader path to arrow-avro (ReaderBuilder)
  • Removed internal/legacy Avro conversion paths that are no longer needed
  • Updated crate wiring to use arrow-avro and removed prior apache-avro dependency usage in affected paths
  • Updated Avro projection flow to use arrow-avro projection support
  • Added/updated upgrade documentation for Avro API and behavior changes

Are these changes tested?

Yes.

  • Added/updated Avro reader unit tests in datafusion/datasource-avro (including projection and timestamp logical types)
  • Updated SQL logic tests in datafusion/sqllogictest/test_files/avro.slt
  • Integration is covered by existing CI/test suites for affected crates

Are there any user-facing changes?

Yes.

  1. DataFusionError::AvroError is removed.
  2. From<apache_avro::Error> for DataFusionError is removed.
  3. Re-export changed from datafusion::apache_avro to datafusion::arrow_avro.
  4. Avro feature wiring changed:
    • datafusion crate avro feature no longer enables datafusion-common/avro
    • datafusion-proto crate avro feature no longer enables datafusion-common/avro
  5. Avro decoding behavior now follows arrow-avro semantics, including:
    • Avro string values being read as Arrow Binary in this path
    • timestamp-* logical types read as UTC timezone-aware timestamps (Timestamp(..., Some("+00:00")))
    • local-timestamp-* remaining timezone-naive (Timestamp(..., None))

Upgrade notes are documented in:
docs/source/library-user-guide/upgrading/53.0.0.md
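
As a rough illustration of item 5 above, the sketch below models the timestamp mapping and the string-as-binary change using only the Rust standard library. The `AvroTimestamp`, `Unit`, and `TimestampType` types and both functions are hypothetical stand-ins written for this example; they are not arrow-avro or Arrow APIs, and only the millisecond and microsecond variants are covered.

```rust
use std::str::Utf8Error;

/// Avro timestamp logical types (a subset, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AvroTimestamp {
    TimestampMillis,
    TimestampMicros,
    LocalTimestampMillis,
    LocalTimestampMicros,
}

/// Hypothetical stand-in for Arrow's time unit; not the real Arrow type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Unit {
    Millisecond,
    Microsecond,
}

/// Hypothetical stand-in for `Timestamp(unit, timezone)`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TimestampType {
    unit: Unit,
    timezone: Option<&'static str>,
}

/// Models the mapping described in the PR: `timestamp-*` becomes
/// UTC-aware, `local-timestamp-*` stays timezone-naive.
fn timestamp_type_for(logical: AvroTimestamp) -> TimestampType {
    let (unit, timezone) = match logical {
        AvroTimestamp::TimestampMillis => (Unit::Millisecond, Some("+00:00")),
        AvroTimestamp::TimestampMicros => (Unit::Microsecond, Some("+00:00")),
        AvroTimestamp::LocalTimestampMillis => (Unit::Millisecond, None),
        AvroTimestamp::LocalTimestampMicros => (Unit::Microsecond, None),
    };
    TimestampType { unit, timezone }
}

/// Avro encodes strings as UTF-8, so a value surfaced as binary can be
/// validated and borrowed as `&str` without copying. Invalid bytes
/// produce an error instead of a panic.
fn binary_as_str(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}
```

Callers that previously relied on string columns may need an explicit conversion step along the lines of `binary_as_str`; the upgrade notes remain the reference for the supported path.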

github-actions bot added the common and datasource labels Oct 1, 2025

alamb commented Oct 2, 2025

❤️ amazing! Thank you @getChan
FYI @jecsand838 and @nathaniel-d-ef

alamb commented Oct 19, 2025

Hi @getChan -- I am preparing to make an arrow release -- have you hit any blockers while integrating the new arrow-avro crate into DataFusion?

getChan commented Oct 19, 2025

> Hi @getChan -- I am preparing to make an arrow release -- have you hit any blockers while integrating the new arrow-avro crate into DataFusion?

No, not yet. Thanks for the release.

nathaniel-d-ef commented

Thanks for jumping on this @getChan; let me know if I can help!

github-actions bot removed the common label Oct 27, 2025

alamb commented Oct 29, 2025

FYI I merged the arrow 57 upgrade to DataFusion -- so if you rebase this PR against main you'll have access to the new arrow-avro crate

github-actions bot added the core, common, and proto labels Oct 29, 2025

alamb left a comment

Thanks @getChan and @jecsand838 -- this is pretty epic work. The fact that all the tests pass is pretty incredible

I had a few small comments on the upgrade guide (I can help make them too if you want). Otherwise I think this is ready to go.

I also took the liberty of merging up from main to resolve a conflict

alamb commented Mar 25, 2026

Any other thoughts @jecsand838?

FYI @Igosuki -- it has been a long time since you contributed the original Avro reader. It is fun to see how far things have come

jecsand838 left a comment

@alamb This LGTM!

alamb commented Mar 26, 2026

Merged up to resolve a conflict

alamb commented Mar 26, 2026

go go go go!

alamb added this pull request to the merge queue Mar 26, 2026
Merged via the queue into apache:main with commit 627faba Mar 26, 2026
34 checks passed

alamb commented Mar 26, 2026

Epic work @getChan

Labels

common, core, datasource, documentation, proto, sqllogictest

Development

Successfully merging this pull request may close these issues.

Use arrow-avro for performance and improved type support

5 participants