Refactor distinct aggregate implementations to use common buffer by Jefffrey · Pull Request #18348 · apache/datafusion

Jefffrey · 2025-10-29T08:52:03Z

Which issue does this PR close?

Relates to Support complete distinct usage for aggregate expressions #2406

Rationale for this change

Make it easier to write distinct variations of aggregate functions by refactoring some of the common code together, specifically how they maintain the complete set of distinct primitive values, as this code was duplicated across different functions.

What changes are included in this PR?

Introduce a new GenericDistinctBuffer, which has methods similar to Accumulator to manage an internal HashSet of values, so implementations like percentile_cont and sum can use it internally and only implement their own evaluate functions.
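To illustrate the shape of this refactor, here is a minimal std-only sketch. `DistinctBuffer`, `DistinctSum` and `DistinctMedian` are hypothetical names invented for this example; the real GenericDistinctBuffer works on Arrow arrays and is generic over `ArrowPrimitiveType`, which this sketch does not attempt to reproduce.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical stand-in for the shared buffer: it only tracks the set of
/// distinct values and leaves the final computation to whoever owns it.
struct DistinctBuffer<T: Eq + Hash + Copy> {
    values: HashSet<T>,
}

impl<T: Eq + Hash + Copy> DistinctBuffer<T> {
    fn new() -> Self {
        Self { values: HashSet::new() }
    }

    /// Loosely analogous to `update_batch`: fold raw input values in.
    fn update_batch(&mut self, batch: &[T]) {
        self.values.extend(batch.iter().copied());
    }

    /// Loosely analogous to `state`: export the distinct values seen so far.
    fn state(&self) -> Vec<T> {
        self.values.iter().copied().collect()
    }

    /// Loosely analogous to `merge_batch`: fold in another buffer's state.
    fn merge_batch(&mut self, state: &[T]) {
        self.values.extend(state.iter().copied());
    }
}

/// Each aggregate wraps the buffer and supplies only its own `evaluate`.
struct DistinctSum {
    buffer: DistinctBuffer<i64>,
}

impl DistinctSum {
    fn evaluate(&self) -> i64 {
        self.buffer.values.iter().sum()
    }
}

struct DistinctMedian {
    buffer: DistinctBuffer<i64>,
}

impl DistinctMedian {
    fn evaluate(&self) -> Option<f64> {
        let mut sorted: Vec<i64> = self.buffer.values.iter().copied().collect();
        sorted.sort_unstable();
        let n = sorted.len();
        if n == 0 {
            None
        } else if n % 2 == 1 {
            Some(sorted[n / 2] as f64)
        } else {
            Some((sorted[n / 2 - 1] + sorted[n / 2]) as f64 / 2.0)
        }
    }
}

fn main() {
    // Two partitions see overlapping values; the merge deduplicates them.
    let mut left = DistinctBuffer::<i64>::new();
    left.update_batch(&[1, 2, 2, 3]);
    let mut right = DistinctBuffer::<i64>::new();
    right.update_batch(&[3, 4]);
    left.merge_batch(&right.state());

    let median = DistinctMedian { buffer: right };
    println!("median of right partition: {:?}", median.evaluate());
    let sum = DistinctSum { buffer: left };
    println!("sum of merged distinct values: {}", sum.evaluate());
}
```

The point of the sketch is only the division of labour: the set maintenance lives in one place, and each aggregate differs in its final evaluation step.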

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No.

Jefffrey · 2025-10-29T08:53:33Z

datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs

It would be nice if I could pull PrimitiveDistinctCountAccumulator into the deduplication as well; however, it is specialized for types which don't need to hash through Hashable (aka non-float types), and I think there might be a performance hit if I try to force them to use Hashable 🤔

Jefffrey · 2025-10-29T08:56:05Z

datafusion/functions-aggregate-common/src/utils.rs

+/// `merge_batch` and a `Vec` of `ArrayRef` that are converted to scalar values
+/// in the final evaluation step so that we avoid expensive conversions and
+/// allocations during `update_batch`.
+pub struct GenericDistinctBuffer<T: ArrowPrimitiveType> {


Main implementation here. I toyed with the idea of making this implement Accumulator and having the different functions (like median and percentile_cont) provide their evaluate logic as a closure, but it got a bit messy. So for now they delegate their state/update_batch/merge_batch to this inner struct, which lets them grab the final set of distinct values and do their own evaluate.

alamb

Thank you @Jefffrey -- this is really quite elegant. I am sorry it took so long to review.

The only thing I think we need to do is ensure this doesn't have any impact on performance (I don't expect that it will, but I want to be sure).

Really nice 🏆

alamb · 2025-11-11T18:41:48Z

datafusion/functions-aggregate-common/src/aggregate/count_distinct/native.rs

Yeah, we definitely don't want to be hashing if we can avoid that
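For context on why floats are the odd ones out: in Rust, `f64` implements neither `Eq` nor `Hash`, so it cannot be a `HashSet` key without a wrapper, while integer types can be inserted directly. The sketch below uses a hypothetical `HashableF64` newtype that compares and hashes by bit pattern. It is an illustration of the general idea only, not DataFusion's actual `Hashable` definition.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical wrapper (not DataFusion's `Hashable`): gives `f64` the
/// `Eq` and `Hash` impls it lacks by comparing and hashing raw bits.
#[derive(Clone, Copy, Debug)]
struct HashableF64(f64);

impl PartialEq for HashableF64 {
    fn eq(&self, other: &Self) -> bool {
        self.0.to_bits() == other.0.to_bits()
    }
}

impl Eq for HashableF64 {}

impl Hash for HashableF64 {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.to_bits().hash(state);
    }
}

/// Floats have to go through the wrapper to be used as set keys.
fn count_distinct_floats(values: &[f64]) -> usize {
    values
        .iter()
        .copied()
        .map(HashableF64)
        .collect::<HashSet<_>>()
        .len()
}

/// Integers already implement `Eq + Hash`, so no wrapper is involved.
fn count_distinct_ints(values: &[i64]) -> usize {
    values.iter().copied().collect::<HashSet<_>>().len()
}

fn main() {
    println!("{}", count_distinct_floats(&[1.5, 1.5, 2.0]));
    println!("{}", count_distinct_ints(&[1, 1, 2]));
}
```

One consequence of comparing by bits in this sketch is that `0.0` and `-0.0` count as two distinct values, and identical NaN bit patterns count as one.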

alamb · 2025-11-11T18:45:48Z

datafusion/functions-aggregate-common/src/utils.rs

+            self.values.extend(arr.iter().flatten().map(Hashable));
+        } else {
+            self.values
+                .extend(arr.values().iter().cloned().map(Hashable));


nice -- this is an elegant way to special case nulls/non nulls
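A std-only sketch of the same idea, for readers without the diff context: `SimpleArray` is a hypothetical, much-simplified stand-in for an Arrow primitive array, and the `null_count() > 0` condition is an assumption, since the branch condition is not visible in the excerpt above.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for an Arrow primitive array: a values buffer
/// plus an optional validity mask (`false` marks a null slot).
struct SimpleArray {
    values: Vec<i64>,
    validity: Option<Vec<bool>>,
}

impl SimpleArray {
    fn null_count(&self) -> usize {
        self.validity
            .as_ref()
            .map_or(0, |mask| mask.iter().filter(|valid| !**valid).count())
    }

    /// Yields `None` for null slots and `Some(value)` otherwise.
    fn iter(&self) -> impl Iterator<Item = Option<i64>> + '_ {
        self.values
            .iter()
            .enumerate()
            .map(move |(i, v)| match &self.validity {
                Some(mask) if !mask[i] => None,
                _ => Some(*v),
            })
    }
}

fn extend_distinct(set: &mut HashSet<i64>, arr: &SimpleArray) {
    if arr.null_count() > 0 {
        // Slow path: check validity per element and drop the nulls.
        set.extend(arr.iter().flatten());
    } else {
        // Fast path: no nulls, so read the values buffer directly.
        set.extend(arr.values.iter().cloned());
    }
}

fn main() {
    let with_nulls = SimpleArray {
        values: vec![1, 2, 2, 0],
        validity: Some(vec![true, true, true, false]),
    };
    let without_nulls = SimpleArray {
        values: vec![3, 3, 4],
        validity: None,
    };
    let mut set = HashSet::new();
    extend_distinct(&mut set, &with_nulls);
    extend_distinct(&mut set, &without_nulls);
    println!("{} distinct non-null values", set.len());
}
```

The fast path matters because it skips the per-element validity check entirely when the batch has no nulls.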

alamb · 2025-11-11T19:28:05Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing refactor-agg-distinct (3c389c0) to 6cc73fa diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-11-11T20:26:11Z

🤖: Benchmark completed

Details

Comparing HEAD and refactor-agg-distinct
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ refactor-agg-distinct ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2681.40 ms │            2730.16 ms │ no change │
│ QQuery 1     │  1249.36 ms │            1297.64 ms │ no change │
│ QQuery 2     │  2414.27 ms │            2486.04 ms │ no change │
│ QQuery 3     │  1151.81 ms │            1187.91 ms │ no change │
│ QQuery 4     │  2243.87 ms │            2261.10 ms │ no change │
│ QQuery 5     │ 27784.40 ms │           28019.72 ms │ no change │
│ QQuery 6     │  4183.57 ms │            4200.12 ms │ no change │
│ QQuery 7     │  3719.32 ms │            3654.33 ms │ no change │
└──────────────┴─────────────┴───────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 45428.00ms │
│ Total Time (refactor-agg-distinct)   │ 45837.03ms │
│ Average Time (HEAD)                  │  5678.50ms │
│ Average Time (refactor-agg-distinct) │  5729.63ms │
│ Queries Faster                       │          0 │
│ Queries Slower                       │          0 │
│ Queries with No Change               │          8 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ refactor-agg-distinct ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.19 ms │               2.58 ms │  1.18x slower │
│ QQuery 1     │    49.89 ms │              51.05 ms │     no change │
│ QQuery 2     │   139.29 ms │             140.00 ms │     no change │
│ QQuery 3     │   154.15 ms │             164.65 ms │  1.07x slower │
│ QQuery 4     │  1071.27 ms │            1091.98 ms │     no change │
│ QQuery 5     │  1496.79 ms │            1516.96 ms │     no change │
│ QQuery 6     │     2.17 ms │               2.22 ms │     no change │
│ QQuery 7     │    55.06 ms │              56.95 ms │     no change │
│ QQuery 8     │  1417.71 ms │            1486.53 ms │     no change │
│ QQuery 9     │  1807.14 ms │            1820.02 ms │     no change │
│ QQuery 10    │   389.54 ms │             410.87 ms │  1.05x slower │
│ QQuery 11    │   446.25 ms │             456.15 ms │     no change │
│ QQuery 12    │  1386.50 ms │            1429.53 ms │     no change │
│ QQuery 13    │  2162.95 ms │            2173.12 ms │     no change │
│ QQuery 14    │  1285.62 ms │            1307.88 ms │     no change │
│ QQuery 15    │  1220.96 ms │            1255.23 ms │     no change │
│ QQuery 16    │  2689.84 ms │            2738.54 ms │     no change │
│ QQuery 17    │  2657.93 ms │            2714.66 ms │     no change │
│ QQuery 18    │  5395.06 ms │            4982.75 ms │ +1.08x faster │
│ QQuery 19    │   128.10 ms │             129.16 ms │     no change │
│ QQuery 20    │  2065.75 ms │            1967.07 ms │     no change │
│ QQuery 21    │  2324.17 ms │            2328.12 ms │     no change │
│ QQuery 22    │  4167.13 ms │            3931.84 ms │ +1.06x faster │
│ QQuery 23    │ 13031.85 ms │           12929.07 ms │     no change │
│ QQuery 24    │   219.94 ms │             221.62 ms │     no change │
│ QQuery 25    │   523.46 ms │             522.58 ms │     no change │
│ QQuery 26    │   226.94 ms │             220.89 ms │     no change │
│ QQuery 27    │  2869.38 ms │            2858.88 ms │     no change │
│ QQuery 28    │ 22775.69 ms │           24278.99 ms │  1.07x slower │
│ QQuery 29    │   967.16 ms │             969.06 ms │     no change │
│ QQuery 30    │  1337.50 ms │            1336.90 ms │     no change │
│ QQuery 31    │  1380.43 ms │            1343.02 ms │     no change │
│ QQuery 32    │  4624.40 ms │            4955.25 ms │  1.07x slower │
│ QQuery 33    │  5922.16 ms │            5932.37 ms │     no change │
│ QQuery 34    │  5927.20 ms │            5982.00 ms │     no change │
│ QQuery 35    │  2003.94 ms │            2024.77 ms │     no change │
│ QQuery 36    │   121.02 ms │             119.85 ms │     no change │
│ QQuery 37    │    52.78 ms │              52.00 ms │     no change │
│ QQuery 38    │   121.42 ms │             121.59 ms │     no change │
│ QQuery 39    │   196.08 ms │             200.65 ms │     no change │
│ QQuery 40    │    43.86 ms │              42.67 ms │     no change │
│ QQuery 41    │    40.35 ms │              38.89 ms │     no change │
│ QQuery 42    │    33.88 ms │              33.75 ms │     no change │
└──────────────┴─────────────┴───────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 94934.87ms │
│ Total Time (refactor-agg-distinct)   │ 96342.63ms │
│ Average Time (HEAD)                  │  2207.79ms │
│ Average Time (refactor-agg-distinct) │  2240.53ms │
│ Queries Faster                       │          2 │
│ Queries Slower                       │          5 │
│ Queries with No Change               │         36 │
│ Queries with Failure                 │          0 │
└──────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ refactor-agg-distinct ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 173.41 ms │             170.27 ms │     no change │
│ QQuery 2     │  27.72 ms │              26.96 ms │     no change │
│ QQuery 3     │  37.42 ms │              36.05 ms │     no change │
│ QQuery 4     │  28.96 ms │              28.30 ms │     no change │
│ QQuery 5     │  98.00 ms │              75.98 ms │ +1.29x faster │
│ QQuery 6     │  26.36 ms │              19.60 ms │ +1.34x faster │
│ QQuery 7     │ 259.23 ms │             221.02 ms │ +1.17x faster │
│ QQuery 8     │  32.46 ms │              33.29 ms │     no change │
│ QQuery 9     │ 103.31 ms │             105.18 ms │     no change │
│ QQuery 10    │  61.69 ms │              60.46 ms │     no change │
│ QQuery 11    │  17.21 ms │              16.20 ms │ +1.06x faster │
│ QQuery 12    │  51.71 ms │              51.02 ms │     no change │
│ QQuery 13    │  46.99 ms │              48.51 ms │     no change │
│ QQuery 14    │  13.95 ms │              13.82 ms │     no change │
│ QQuery 15    │  24.39 ms │              24.68 ms │     no change │
│ QQuery 16    │  24.45 ms │              24.85 ms │     no change │
│ QQuery 17    │ 153.72 ms │             149.28 ms │     no change │
│ QQuery 18    │ 329.41 ms │             329.57 ms │     no change │
│ QQuery 19    │  37.16 ms │              36.93 ms │     no change │
│ QQuery 20    │  50.78 ms │              49.59 ms │     no change │
│ QQuery 21    │ 348.92 ms │             333.10 ms │     no change │
│ QQuery 22    │  20.02 ms │              19.94 ms │     no change │
└──────────────┴───────────┴───────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                    │ 1967.28ms │
│ Total Time (refactor-agg-distinct)   │ 1874.61ms │
│ Average Time (HEAD)                  │   89.42ms │
│ Average Time (refactor-agg-distinct) │   85.21ms │
│ Queries Faster                       │         4 │
│ Queries Slower                       │         0 │
│ Queries with No Change               │        18 │
│ Queries with Failure                 │         0 │
└──────────────────────────────────────┴───────────┘

Jefffrey · 2025-11-13T10:44:10Z

The clickbench QQuery0 (I believe it's this query?) that is 1.18x slower doesn't use distinct so I don't think it's an actual slowdown.
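As a quick sanity check on how to read that row, the sketch below recomputes the ratio and the absolute difference from the timings in the clickbench_partitioned table above. The `slowdown` helper is a hypothetical name for this example and is not part of the benchmark script.

```rust
/// Hypothetical helper: returns (ratio, absolute difference in ms)
/// for a branch timing relative to HEAD.
fn slowdown(head_ms: f64, branch_ms: f64) -> (f64, f64) {
    (branch_ms / head_ms, branch_ms - head_ms)
}

fn main() {
    // Timings copied from the clickbench_partitioned table.
    let rows = [
        ("QQuery 0", 2.19, 2.58),
        ("QQuery 28", 22775.69, 24278.99),
    ];
    for (label, head_ms, branch_ms) in rows {
        let (ratio, delta_ms) = slowdown(head_ms, branch_ms);
        println!("{label}: {ratio:.2}x, {delta_ms:.2} ms difference");
    }
}
```

The 1.18x figure for QQuery 0 corresponds to well under a millisecond in absolute terms, so a ratio on a query this short can plausibly be dominated by run-to-run noise.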

Refactor distinct aggregate implementations to use common buffer

3c389c0

github-actions bot added the functions (Changes to functions implementation) label Oct 29, 2025

Jefffrey marked this pull request as ready for review October 29, 2025 09:13

alamb mentioned this pull request Nov 4, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-11-03 #18486

Closed

53 tasks

alamb approved these changes Nov 11, 2025

Jefffrey added this pull request to the merge queue Nov 13, 2025

Merged via the queue into apache:main with commit e42a0b6 Nov 13, 2025
28 checks passed

Jefffrey deleted the refactor-agg-distinct branch November 13, 2025 10:46

Jefffrey mentioned this pull request Nov 13, 2025

Make GenericDistinctBuffer generic over both Hashable and native types #18670

Open

Jefffrey merged 1 commit into apache:main from Jefffrey:refactor-agg-distinct