Optimize Clickbench Query 29 by adding a new Optimizer rule#20180
Optimize Clickbench Query 29 by adding a new Optimizer rule#20180devanshu0987 wants to merge 13 commits intoapache:mainfrom
Conversation
|
run benchmark sql_planner |
|
🤖 |
|
run benchmarks |
|
Thanks @devanshu0987 -- I kicked off some benchmarks |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
I think we can implement datafusion/datafusion/expr/src/udaf.rs Lines 682 to 684 in 35ff4ab |
Hi, is this a blocking PR review comment? I am thinking that we still have the same logical flow
|
neilconway
left a comment
There was a problem hiding this comment.
Can we add a test for queries with a HAVING clause?
I wonder if it would make sense to support this optimization for other aggregates? e.g., AVG(a+k) => AVG(a) + k.
| FROM test_table; | ||
| ---- | ||
| logical_plan | ||
| 01)Projection: sum(test_table.a) AS sum_a, sum(test_table.a + Int64(1)) AS sum_a_plus_1, sum(test_table.b) AS sum_b, sum(test_table.b + Int64(2)) AS sum_b_plus_2, sum(test_table.c + Int64(3)) AS sum_c_plus_3 |
There was a problem hiding this comment.
It looks to me like the rewrite is not actually being applied here? (Contra the comment above: "Test 7: Mixed sum types (rewrites a and b, not c)").
| | ScalarValue::UInt16(_) | ||
| | ScalarValue::UInt32(_) | ||
| | ScalarValue::UInt64(_) | ||
| | ScalarValue::Float32(_) |
| if let Expr::BinaryExpr(BinaryExpr { left, op, right }) = arg | ||
| && matches!(op, Operator::Plus | Operator::Minus) | ||
| { | ||
| // Check if right side is a literal constant |
There was a problem hiding this comment.
Remove duplicate comment
|
|
||
| /// Check if a scalar value is a numeric constant | ||
| /// (guards against non-arithmetic types like strings, booleans, dates, etc.) | ||
| fn is_numeric_constant(value: &ScalarValue) -> bool { |
There was a problem hiding this comment.
How should NULL values be handled by this function? I suppose we want to return false for them?
|
My real concern with this approach is that this optimization is so clickbench specific. I realize that other systems are doing it too (aka the aforementioned ClickHouse and DuckDB PR) but I struggle to find any actual real world usecase On the other hand, the performance results are pretty compelling so let's see if we can make it work |
alamb
left a comment
There was a problem hiding this comment.
Thanks for this @devanshu0987
I think we should proceed with the high level idea of rewriting the expressions
Howevr, I really like @neilconway and @UBarney 's suggestions to use the rewrite API instead of a new optimizer rule
After providing this feedback, however, it occurs to me the "don't use function names" and "try not to add new optimizer passes" guidance isn't written down anywhere so I'll try and make a PR to do so later
|
|
||
| match inner { | ||
| Expr::AggregateFunction(agg_fn) => { | ||
| // Rule only applies to SUM |
There was a problem hiding this comment.
Checking for SUM this was is non ideal because if someone has added a user defined function with a sum that has different behavior / semantics than the built in SUM aggregate this rule will still trigger
Hi, is this a blocking PR review comment? I am not able to understand how it would be simpler?
I am thinking that we still have the same logical flow
- Detect the pattern of SUM(col + literal)
- Modify the plan appropriately.
I think it would be better for several reasons:
- A new optimizer pass adds non trivial overhead (it ends up copying each plan node I think) so we see planning time go up with each new rule we add to the base DataFusion
- It would fix the above issue
There was a problem hiding this comment.
I also think @UBarney has the same suggestion here
There was a problem hiding this comment.
Here is a PR with some suggested documentation improvements:
The solution I proposed avoids the need to locate the However, there is an issue I overlooked: the expression after simplification becomes I am still exploring the best way to handle this. One possible direction is to introduce a new rule or modify Update: I think we can handle the Projection injection directly within the simplify_exprs rule. Regarding the architecture, my current view is that the |
|
Thanks for sharing your thought process on why it should be better to write a
For other comments from @neilconway, I will take care of them post I can convert this to Thanks. |
## Which issue does this PR close? - Related to #20180 ## Rationale for this change I gave feedback to @devanshu0987 https://github.com/apache/datafusion/pull/20180/changes#r2800720037 that it was not a good idea to check for function names in optimizer rules, but then I realized that the rationale for this is not written down anywhere. ## What changes are included in this PR? Document why checking for function names in optimizer rules is not good and offer alternatives ## Are these changes tested? By CI ## Are there any user-facing changes? Just docs, no functional changes
|
I am going to take a shot at cleaning up this pR |
|
I am going to try and help get this PR over the line as it has come up while we are looking at ClickBench results |
## Which issue does this PR close? - Part of #18489 - Related to #20180 ## Rationale for this change This looks like a monster PR but I think it will be quite easy to review (it just adds some new `EXPLAIN` tests). If it would be helpful I can break it into smaller pieces I want to improve the plans for ClickBench Query 29 However, the plans for the ClickBench queries are not in our tests anywhere (so when I make the improvements in #20665 no explain plan tests change) So to start, let's start with adding the explain plans for all the queries in clickbench.slt to so it is clear what our current plans look like as well as to make it clear what the change of plans are ## What changes are included in this PR? Add explain plans to some .slt tests ## Are these changes tested? Only tests ## Are there any user-facing changes? No, this only adds tests
|
I started reviving the PR here |
|
My final rewrite is here: |
…0749) ## Which issue does this PR close? - Part of #18489 - Closes #20180 - Closes #15524 - Replaces #20665 ## Rationale for this change I [want DataFusion to be the fastest parquet engine on ClickBench](#18489). One of the queries where DataFusion is significantly slower is Query 29 which has a very strange pattern of many aggregate functions that are offset by a constant: https://github.com/apache/datafusion/blob/0ca9d6586a43c323525b2e299448e0f1af4d6195/benchmarks/queries/clickbench/queries/q29.sql#L4 This is not a pattern I have ever seen in a real query, but it seems like the engine currently at the top of the ClickBench leaderboard has a special case for this pattern. ClickHouse probably does too. See - duckdb/duckdb#15017 - Discussion on #15524 Thus I reluctantly conclude that we should have one too. ## What changes are included in this PR? This is an alternate to my first attempt. - #20665 In particular, since this is such a ClickBench specific rule, I wanted to 1. Minimize the downstream API / upgrade impact (aka not change existing APIs) 2. Optimize performance for the case where this rewrite will not apply (most times) 1. Add a rewrite `SUM(expr + scalar)` --> `SUM(expr) + scalar*COUNT(expr)` 3. Tests for same Note there are quite a few other ideas to potentially make this more general on #15524 but I am going with the simple thing of making it work for the usecase we have in hand (ClickBench) ## Are these changes tested? Yes, new tests are added ## Are there any user-facing changes? Faster performance 🚀 ``` │ QQuery 29 │ 1012.63 ms │ 139.02 ms │ +7.28x faster │ ```
Which issue does this PR close?
SUM(..)clauses #15524Rationale for this change
SUM(ResolutionWidth), SUM(ResolutionWidth + 1), SUM(ResolutionWidth + 2), ..., SUM(ResolutionWidth + 89)SUM(A + k)intoSUM(A) + k * COUNT(*)CommonSubexprEliminateto allow re use of the common SUM and COUNT expression.What changes are included in this PR?
RewriteAggregateWithConstantAre these changes tested?
Are there any user-facing changes?
No
AI Usage