feat: optional disable column stats #811

Open
parisni wants to merge 10 commits into apache:main from leboncoin:pr-feat-skip-stats

parisni (Contributor) commented Feb 26, 2026

What is the purpose of the pull request

This PR makes sync faster and reduces memory footprint by adding a shared source-side option to skip column stats extraction:

xtable.source.skip_column_stats=true

The flag now works consistently for Hudi, Delta, and Iceberg sources.
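As a rough illustration of how a source might read this option, here is a minimal sketch. The class and method names are hypothetical and not taken from the XTable code base, and the default of false when the key is absent is an assumption based on the flag being optional.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not part of XTable.
class SkipColumnStatsConfig {
  // Key name as given in this PR description.
  static final String SKIP_COLUMN_STATS_KEY = "xtable.source.skip_column_stats";

  // Assumption: when the key is missing, column stats are still extracted (flag is false).
  static boolean skipColumnStats(Properties sourceProperties) {
    return Boolean.parseBoolean(sourceProperties.getProperty(SKIP_COLUMN_STATS_KEY, "false"));
  }
}
```

Because the key is shared, the same lookup could in principle be reused by the Hudi, Delta, and Iceberg sources rather than each one defining its own key.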

In practical terms, this is intended for heavy tables where stats extraction is a bottleneck.
Example from a large sync: a job that previously took around 6 hours and required more than 64 GB of heap (Xmx) can be reduced to around 1 hour with about 10 GB of heap.

Brief change log

  • Added a shared source config (xtable.source.skip_column_stats) instead of per-format keys.
  • Wired skip-column-stats behavior into all three source implementations:
    • Hudi source
    • Delta source
    • Iceberg source
  • Kept required row-count behavior intact so downstream sync logic remains correct.
  • Added handling for zero-row files where needed to avoid incorrect stats behavior.
  • Improved naming consistency from “skip stats” to “skip column stats”.
  • Added integration coverage for source format × sync mode × skip flag combinations.
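
The row-count and zero-row points above can be sketched roughly as follows. This is not the code from this PR: the types are hypothetical stand-ins for per-file metadata, and treating a zero-row file as having no column stats is an assumption about what the zero-row handling does.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical types for illustration only; not the XTable model classes.
class FileStatsSketch {
  static final class ColumnStat {
    final String column;
    final long nullCount;

    ColumnStat(String column, long nullCount) {
      this.column = column;
      this.nullCount = nullCount;
    }
  }

  static final class FileStats {
    final long recordCount;
    final List<ColumnStat> columnStats;

    FileStats(long recordCount, List<ColumnStat> columnStats) {
      this.recordCount = recordCount;
      this.columnStats = columnStats;
    }
  }

  // The record count is always kept; the (potentially expensive) extractor
  // only runs when the flag is off and the file has rows.
  static FileStats build(
      long recordCount, boolean skipColumnStats, Supplier<List<ColumnStat>> columnStatsExtractor) {
    // Assumption: a zero-row file carries no meaningful column stats.
    if (skipColumnStats || recordCount == 0) {
      return new FileStats(recordCount, Collections.emptyList());
    }
    return new FileStats(recordCount, columnStatsExtractor.get());
  }
}
```

Passing the extractor as a Supplier keeps the costly work lazy, so skipping stats avoids the extraction itself rather than discarding its result afterwards.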

Verify this pull request

This change added tests and can be verified as follows:

  • Added parameterized integration test in ITConversionController for:
    • source format: Hudi / Delta / Iceberg
    • sync mode: Incremental / Full
    • xtable.source.skip_column_stats: true / false
  • Verified with:
    • mvn -pl xtable-core -Dtest=ITConversionController#testSkipColumnStatsAcrossSources test
  • Additional compile validation:
    • mvn -pl xtable-core -DskipTests compile
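
To make the size of the test matrix concrete, the sketch below enumerates the combinations listed above in plain Java. It is not the test from this PR; the enum and class names are hypothetical, and the real test presumably feeds equivalent arguments to a parameterized test method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the source format x sync mode x skip flag matrix.
class SkipColumnStatsTestMatrix {
  enum SourceFormat { HUDI, DELTA, ICEBERG }

  enum SyncMode { INCREMENTAL, FULL }

  static final class Case {
    final SourceFormat sourceFormat;
    final SyncMode syncMode;
    final boolean skipColumnStats;

    Case(SourceFormat sourceFormat, SyncMode syncMode, boolean skipColumnStats) {
      this.sourceFormat = sourceFormat;
      this.syncMode = syncMode;
      this.skipColumnStats = skipColumnStats;
    }

    @Override
    public String toString() {
      return sourceFormat + "/" + syncMode + "/skip=" + skipColumnStats;
    }
  }

  // Builds every combination: 3 formats x 2 sync modes x 2 flag values = 12 cases.
  static List<Case> allCases() {
    List<Case> cases = new ArrayList<>();
    for (SourceFormat format : SourceFormat.values()) {
      for (SyncMode mode : SyncMode.values()) {
        for (boolean skip : new boolean[] {true, false}) {
          cases.add(new Case(format, mode, skip));
        }
      }
    }
    return cases;
  }
}
```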

Trade-offs

  • When xtable.source.skip_column_stats=true, column stats are not extracted or propagated.
  • This reduces sync cost significantly, but optimizations that depend on column stats may be unavailable.
  • As a result, query performance may be reduced for some workloads.
