Conversation
* Initial commit
* Fix formatting
* Add across partitions check
* Add new test case
* Fix buggy test

…#13909) (apache#13934)
* Set utf8view as return type when input type is the same
* Verify that the returned type from a call to a scalar function matches the return type specified in the return_type function
* Match return type to utf8view

Co-authored-by: Tim Saucer <timsaucer@gmail.com>

This reverts commit 5383d30.

* fix: fetch is missed in the EnforceSorting
* fix conflict
* resolve comments from alamb
* update

…it disabled by default

…e#14415) (apache#14453)
* chore: Fixed CI
* chore
* chore: Fixed clippy
* chore

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>

* Test for string / numeric coercion
* fix tests
* Update tests
* Add tests to stringview
* add numeric coercion
```rust
=======
if let Some(comparison) = scalar.partial_cmp(current_best) {
    let is_better = if find_greater {
        comparison == std::cmp::Ordering::Greater
    } else {
        comparison == std::cmp::Ordering::Less
    };
>>>>>>> origin/branch-51
```
apache#16624 Upstream fix for using try_cmp instead of partial_cmp for scalar values.
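The motivation for the `try_cmp` change can be sketched with plain `f64` (a stand-in, not the DataFusion `ScalarValue` API): `partial_cmp` returns `None` for incomparable values, which the `if let Some(..)` pattern above silently skips, while a `try_cmp`-style API surfaces the incomparable case as an error.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for ScalarValue comparison: `partial_cmp` yields
// `None` for incomparable values (e.g. NaN); a try_cmp-style API turns
// that silent skip into an explicit error the caller must handle.
fn try_cmp(a: f64, b: f64) -> Result<Ordering, String> {
    a.partial_cmp(&b)
        .ok_or_else(|| format!("values {a} and {b} are not comparable"))
}

fn main() {
    assert_eq!(try_cmp(2.0, 1.0), Ok(Ordering::Greater));
    assert_eq!(try_cmp(1.0, 2.0), Ok(Ordering::Less));
    // NaN is incomparable: partial_cmp would return None and be skipped,
    // try_cmp reports it as an error instead.
    assert!(try_cmp(f64::NAN, 1.0).is_err());
    println!("ok");
}
```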
```rust
let mut new_plan = AnalyzeExec::new(
    self.verbose,
    self.show_statistics,
    self.metric_types.clone(),
```
The AnalyzeExec::new method now takes metric_types: Vec<MetricType> as the third argument. https://docs.rs/datafusion/51.0.0/datafusion/physical_plan/analyze/struct.AnalyzeExec.html#method.new
```diff
 ) -> Result<Option<Arc<dyn ExecutionPlan>>> {
     let mut new_plan =
-        ProjectionExec::try_new(self.expr.clone(), Arc::clone(self.input()))?;
+        ProjectionExec::try_new(self.expr().to_vec(), Arc::clone(self.input()))?;
```
The first argument is now self.expr().to_vec().
https://docs.rs/datafusion/51.0.0/datafusion/physical_plan/projection/struct.ProjectionExec.html#method.expr
```rust
true
}

#[allow(deprecated)]
```
Deprecated since 44.0.0: Use UnionExec::try_new instead
https://docs.rs/datafusion/51.0.0/datafusion/physical_plan/union/struct.UnionExec.html#method.try_new
Maybe we can switch to try_new in a follow-on PR?
Also okay to change it directly in the PR if not a big change
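The switch from `new` to `try_new` follows the usual fallible-constructor migration. A generic sketch (illustrative stand-in types, not the actual DataFusion `UnionExec` API):

```rust
// Sketch of the `new` -> `try_new` migration. `Union` and its fields are
// illustrative stand-ins, not the real UnionExec.
struct Union {
    inputs: Vec<&'static str>,
}

impl Union {
    // Old constructor: panics on invalid input, kept only for compatibility.
    #[allow(dead_code)]
    #[deprecated(note = "use `try_new` instead")]
    fn new(inputs: Vec<&'static str>) -> Self {
        Self::try_new(inputs).expect("UnionExec requires at least one input")
    }

    // try_new reports invalid input as an Err the caller can handle.
    fn try_new(inputs: Vec<&'static str>) -> Result<Self, String> {
        if inputs.is_empty() {
            return Err("UnionExec requires at least one input".to_string());
        }
        Ok(Self { inputs })
    }
}

fn main() {
    let ok = Union::try_new(vec!["scan_a", "scan_b"]).unwrap();
    assert_eq!(ok.inputs.len(), 2);
    // The empty-input case becomes an Err instead of a panic.
    assert!(Union::try_new(vec![]).is_err());
    println!("ok");
}
```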
```rust
=======
match reassign_predicate_columns(filter, &schema, true) {
    Ok(filter) => {
        match Self::add_filter_equivalence_info(
            filter,
            &mut eq_properties,
            &schema,
        ) {
            Ok(()) => {}
            Err(e) => {
                warn!("Failed to add filter equivalence info: {e}");
                #[cfg(debug_assertions)]
                panic!("Failed to add filter equivalence info: {e}");
            }
        }
    }
>>>>>>> origin/branch-51
```
```rust
=======
macro_rules! ignore_dangling_col {
    ($col:expr) => {
        if let Some(col) = $col.as_any().downcast_ref::<Column>() {
            if schema.index_of(col.name()).is_err() {
                continue;
            }
        }
    };
}

let (equal_pairs, _) = collect_columns_from_predicate(&filter);
for (lhs, rhs) in equal_pairs {
    // Ignore any binary expressions that reference non-existent columns in the current schema
    // (e.g. due to unnecessary projections being removed)
    ignore_dangling_col!(lhs);
    ignore_dangling_col!(rhs);
    eq_properties.add_equal_conditions(Arc::clone(lhs), Arc::clone(rhs))?
>>>>>>> origin/branch-51
```
```rust
<<<<<<< HEAD
.with_projection_indices(Some(vec![0, 1, 2]))
=======
.with_projection(Some(vec![0, 1, 2]))
>>>>>>> origin/branch-51
```
The FileScanConfigBuilder::with_projection() method has been deprecated in favor of with_projection_indices()
```rust
<<<<<<< HEAD
/// Number of row groups whose bloom filters were checked, tracked with matched/pruned counts
pub row_groups_pruned_bloom_filter: PruningMetrics,
/// Number of row groups whose statistics were checked, tracked with matched/pruned counts
pub row_groups_pruned_statistics: PruningMetrics,
=======
/// Number of row groups whose bloom filters were checked and matched (not pruned)
pub row_groups_matched_bloom_filter: Count,
/// Number of row groups pruned by bloom filters
pub row_groups_pruned_bloom_filter: Count,
/// Number of row groups pruned due to limit pruning.
pub limit_pruned_row_groups: Count,
/// Number of row groups whose statistics were checked and fully matched
pub row_groups_fully_matched_statistics: Count,
/// Number of row groups whose statistics were checked and matched (not pruned)
pub row_groups_matched_statistics: Count,
/// Number of row groups pruned by statistics
pub row_groups_pruned_statistics: Count,
>>>>>>> origin/branch-51
```
The metric type changed to PruningMetrics.
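The shape of the change can be sketched with a minimal stand-in: a single struct tracks matched and pruned counts together instead of spreading them across separate `Count` fields. Only the `add_matched`/`add_pruned`-style method names follow the diff; the internals here are assumed, not the real DataFusion implementation.

```rust
// Minimal PruningMetrics-style sketch (internals assumed): one value pairs
// the matched and pruned counts for a pruning check, so the total number
// of row groups checked is recoverable from a single field.
#[derive(Default, Debug)]
struct PruningMetrics {
    matched: usize,
    pruned: usize,
}

impl PruningMetrics {
    fn add_matched(&mut self, n: usize) {
        self.matched += n;
    }
    fn add_pruned(&mut self, n: usize) {
        self.pruned += n;
    }
    // Total row groups whose statistics (or bloom filters) were checked.
    fn total(&self) -> usize {
        self.matched + self.pruned
    }
}

fn main() {
    let mut row_groups_pruned_statistics = PruningMetrics::default();
    row_groups_pruned_statistics.add_matched(3);
    row_groups_pruned_statistics.add_pruned(1);
    assert_eq!(row_groups_pruned_statistics.total(), 4);
    println!("ok");
}
```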
```rust
Self {
    files_ranges_pruned_statistics,
    predicate_evaluation_errors,
    row_groups_matched_bloom_filter,
```
The row_groups_matched_bloom_filter field was missing, so I added it back.
```rust
<<<<<<< HEAD
metrics.row_groups_pruned_statistics.add_matched(1);
=======
fully_contained_candidates_original_idx.push(*idx);
metrics.row_groups_matched_statistics.add(1);
>>>>>>> origin/branch-51
```
Keeps fully_contained_candidates_original_idx plus the upstream API change to add_matched.
```rust
<<<<<<< HEAD
=======
#[cfg(feature = "parquet_encryption")]
use datafusion_common::encryption::map_config_decryption_to_decryption;
>>>>>>> origin/branch-51
```
cc @zhuqi-lucas I forgot the discussion result about the decryption
This should be safe; we ultimately disabled encryption by default, and we don't use it.
```rust
<<<<<<< HEAD
truncated_rows__ =
=======

truncated_rows__ =
>>>>>>> origin/branch-51
```
Fixed with the proto-common regen.sh script.
The structs ListingOptions, ListingTable, and ListingTableConfig are now available within the datafusion-catalog-listing crate.
```
=======
- DataSourceExec: file_groups={2 groups: [[test1.parquet], [test2.parquet]]}, projection=[a, b, c], file_type=test, pushdown_supported=true, predicate=DynamicFilterPhysicalExpr [ true ]
>>>>>>> origin/branch-51
```
Rendering format changed to DynamicFilter. Keeping upstream.
```rust
//!
//! // Workaround for `node_id` not being serializable:
//! let mut annotator = NodeIdAnnotator::new();
//! let physical_round_trip = annotate_node_id_for_execution_plan(&physical_round_trip, &mut annotator)?;
//!
```
This is a new doc test added for round-tripping an execution plan. The node_id is not currently serialized, so it had to be manually annotated before the assertion.
Is this ok?
It's ok since node_id is our internal implementation.
```rust
fn add_merge_on_top(input: DistributionContext) -> DistributionContext {
/// Updated node with an execution plan, where desired single
/// distribution is satisfied by adding [`SortPreservingMergeExec`].
fn add_merge_on_top(
```
This was previously named add_spm_on_top per the DF patch tracking doc.
```
ints Map("entries": Struct("key": Utf8, "value": Int64), unsorted) NO
strings Map("entries": Struct("key": Utf8, "value": Utf8), unsorted) NO
timestamp Utf8View NO
timestamp Utf8 NO
```
Other tests also change because of this setting: datafusion.execution.parquet.schema_force_view_types false
```
BinaryView 616161 BinaryView 616161 BinaryView 616161
BinaryView 626262 BinaryView 626262 BinaryView 626262
BinaryView 636363 BinaryView 636363 BinaryView 636363
BinaryView 646464 BinaryView 646464 BinaryView 646464
BinaryView 656565 BinaryView 656565 BinaryView 656565
BinaryView 666666 BinaryView 666666 BinaryView 666666
BinaryView 676767 BinaryView 676767 BinaryView 676767
BinaryView 686868 BinaryView 686868 BinaryView 686868
BinaryView 696969 BinaryView 696969 BinaryView 696969
Binary 616161 LargeBinary 616161 BinaryView 616161
Binary 626262 LargeBinary 626262 BinaryView 626262
Binary 636363 LargeBinary 636363 BinaryView 636363
Binary 646464 LargeBinary 646464 BinaryView 646464
Binary 656565 LargeBinary 656565 BinaryView 656565
Binary 666666 LargeBinary 666666 BinaryView 666666
Binary 676767 LargeBinary 676767 BinaryView 676767
Binary 686868 LargeBinary 686868 BinaryView 686868
Binary 696969 LargeBinary 696969 BinaryView 696969
```
datafusion.execution.parquet.schema_force_view_types false
Upgrade Steps
- Created `branch-51-upstream`, which is upstream DF 51 at commit fd35a09.
- Merged `branch-51` (not `branch-50`) into this PR branch. Last commit in fork is ff301c8 - Add restriction for enabling limit pruning.

Verified DF patches (present in this PR)
Missing patches (not yet applied to this PR)
1. make DefaultSchemaAdapter public. Commit 848fd57.
   Reason: Moved out of core to `datafusion-datasource`. Needs to be patched once more.
   Status: ✅ Fixed.
2. Add partition filters' equivalence classes info to the execution plan if it's DataSourceExec. Commit c7628fb.
   `DataSourceExec` is missing the `add_partition_filter_equivalence_info` method.
   Status: ⏳ To be verified later if this is ported upstream. If not, add it back to the fork.
Merged upstream
TODO