Refactor distinct aggregate implementations to use common buffer#18348
Refactor distinct aggregate implementations to use common buffer#18348Jefffrey merged 1 commit intoapache:mainfrom
Conversation
There was a problem hiding this comment.
It would be nice if I can pull in PrimitiveDistinctCountAccumulator to the deduplication as well, however it is specialized for types which don't need to hash through Hashable (aka non-float types) and I think there might be a performance hit if I try force them to use Hashable 🤔
There was a problem hiding this comment.
Yeah, we definitely don't want to be hashing if we can avoid taht
| /// `merge_batch` and a `Vec` of `ArrayRef` that are converted to scalar values | ||
| /// in the final evaluation step so that we avoid expensive conversions and | ||
| /// allocations during `update_batch`. | ||
| pub struct GenericDistinctBuffer<T: ArrowPrimitiveType> { |
There was a problem hiding this comment.
Main implementation here; I toyed with the idea of making this implement Accumulator and have the different functions (like median and percentile_cont) provide their evaluate logic as a closure but it got a bit messy; so for now they delegate their state/update_batch/merge_batch to this inner struct, which allows them to grab the final set of distinct values for them to do their own evaluate
alamb
left a comment
There was a problem hiding this comment.
Thank you @Jefffrey -- this is really quite elegant. I am sorry it took so long to review
the only thing I think we need to do is ensure this doesn't have any impact in performance (I don't expect that it will but want to be sure)
Really nice 🏆
There was a problem hiding this comment.
Yeah, we definitely don't want to be hashing if we can avoid taht
| self.values.extend(arr.iter().flatten().map(Hashable)); | ||
| } else { | ||
| self.values | ||
| .extend(arr.values().iter().cloned().map(Hashable)); |
There was a problem hiding this comment.
nice -- this is an elegant way to special case nulls/non nulls
|
🤖 |
|
🤖: Benchmark completed Details
|
|
The clickbench QQuery0 (I believe it's this query?) that is 1.18x slower doesn't use distinct so I don't think it's an actual slowdown. |
Which issue does this PR close?
distinctusage for aggregate expressions #2406Rationale for this change
Make it easier to write distinct variations of aggregate functions be refactoring some of the common code together; specifically how they handle maintaining the complete set of distinct primitive values, as this code was duplicated across different functions.
What changes are included in this PR?
Introduce new
GenericDistinctBufferwhich has methods similar toAccumulatorto manage an internalHashSetof values, so implementations likepercentile_contandsumcan use it internally and only implement their own evaluate functions.Are these changes tested?
Existing tests.
Are there any user-facing changes?
No.