Fix output schema generated by CommonSubExprEliminate by alex-spies · Pull Request #3726 · apache/datafusion

alex-spies · 2022-10-05T16:35:07Z

Which issue does this PR close?

Closes #3635.

Rationale for this change

The optimization rule CommonSubexprEliminate produces a wrong output schema in many situations: extracted sub-expressions are assigned the datatype of the overall expression rather than their own. Any logical plan that depends on the output schema then breaks.

What changes are included in this PR?

The ExprIdentifierVisitor, which determines the datatype of every sub-expression, now resolves the datatype of each sub-expression individually against the schema of the logical plan. Previously it determined the datatype of the overall expression and wrongly assigned it to every sub-expression.
Accordingly, the ExprIdentifierVisitor no longer has a datatype attribute; it has input_schema and all_schemas instead.
The latter is needed for the fall-back logic: if the datatype of a sub-expression cannot be determined using the input schema, we merge the schemas from all nodes of the overall logical plan and try again.
A test is added that verifies the datatypes in the output schema after optimizing a Filter logical plan.
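
To illustrate the first point, here is a minimal, self-contained sketch of resolving the type of each sub-expression on its own. The `DataType`, `Expr`, `Schema`, `get_type`, `sub_expr_types` and `example` names below are toy stand-ins invented for this illustration; they are not the DataFusion API, and the real visitor is more involved.

```rust
use std::collections::HashMap;

// Toy stand-ins for DataFusion's DataType, Expr and DFSchema.
// Hypothetical and heavily simplified; not the real API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Boolean,
}

#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    Cast(Box<Expr>, DataType),
    Lt(Box<Expr>, Box<Expr>),
}

type Schema = HashMap<String, DataType>;

/// Resolve the type of a single expression against a schema.
fn get_type(expr: &Expr, schema: &Schema) -> Option<DataType> {
    match expr {
        Expr::Column(name) => schema.get(name).copied(),
        Expr::Literal(_) => Some(DataType::Int64),
        Expr::Cast(_, to) => Some(*to),
        Expr::Lt(_, _) => Some(DataType::Boolean),
    }
}

/// Visit every sub-expression (children first) and record its own type,
/// instead of stamping the type of the root expression on all of them.
fn sub_expr_types(
    expr: &Expr,
    schema: &Schema,
    out: &mut Vec<(String, DataType)>,
) -> Option<()> {
    match expr {
        Expr::Cast(inner, _) => sub_expr_types(inner, schema, out)?,
        Expr::Lt(l, r) => {
            sub_expr_types(l, schema, out)?;
            sub_expr_types(r, schema, out)?;
        }
        Expr::Column(_) | Expr::Literal(_) => {}
    }
    out.push((format!("{:?}", expr), get_type(expr, schema)?));
    Some(())
}

/// Filter-style predicate `CAST(a AS Int64) < 5` over a table with `a: Int32`.
fn example() -> Vec<(String, DataType)> {
    let schema: Schema = vec![("a".to_string(), DataType::Int32)]
        .into_iter()
        .collect();
    let predicate = Expr::Lt(
        Box::new(Expr::Cast(
            Box::new(Expr::Column("a".to_string())),
            DataType::Int64,
        )),
        Box::new(Expr::Literal(5)),
    );
    let mut out = Vec::new();
    sub_expr_types(&predicate, &schema, &mut out).expect("all types resolvable");
    out
}
```

In this toy model the predicate as a whole is Boolean, but the CAST sub-expression is recorded as Int64 and the column as Int32, which is the distinction the fix restores.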

Are there any user-facing changes?

No.


alex-spies · 2022-10-05T16:36:19Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+            // expression type could not be resolved in schema, fall back to all schemas
+            let merged_schema =
+                self.all_schemas
+                    .iter()
+                    .fold(DFSchema::empty(), |mut lhs, rhs| {
+                        lhs.merge(rhs);
+                        lhs
+                    });
+            expr.get_type(&merged_schema)?
+        };


I am honestly unsure if this fall-back logic is necessary at all since the (sub-expression) data types should be resolvable just from the input schema to the respective logical plan node.

However, I did not want to break anything since this fall-back logic was already in place.

All tests pass with and without the fall-back logic.
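
For readers skimming the thread, the control flow under discussion can be sketched roughly as follows. This is a simplified illustration under stated assumptions: `Schema` is a plain map, `resolve_column_type` is a hypothetical helper, and only column lookups are modelled, whereas the actual code calls `expr.get_type` on a `DFSchema`.

```rust
use std::collections::HashMap;

// Toy stand-ins; hypothetical names, not the DataFusion API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

type Schema = HashMap<String, DataType>;

/// Try the input schema of the current plan node first. Only if the column
/// is not found there, merge every schema of the plan and try again.
fn resolve_column_type(
    name: &str,
    input_schema: &Schema,
    all_schemas: &[Schema],
) -> Option<DataType> {
    if let Some(ty) = input_schema.get(name) {
        return Some(*ty);
    }
    // Fall-back: fold all schemas into one; the first definition of a name wins.
    let merged = all_schemas.iter().fold(Schema::new(), |mut lhs, rhs| {
        for (col, ty) in rhs {
            lhs.entry(col.clone()).or_insert(*ty);
        }
        lhs
    });
    merged.get(name).copied()
}
```

If every expression is valid against the input schema of its node, the second half of this function is never reached, which is consistent with the tests passing either way.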


#3730 <-- pr to remove the fallback

Added comments that explain that the fall-back logic can likely be removed.

alex-spies · 2022-10-05T16:40:06Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+        let expected = r###"[
+    (
+        "CAST(table.a AS Int64)table.a",
+        Int64,


This used to be just Boolean before the fix.

alamb

Looks good to me -- thank you @alex-natzka

cc @waynexia -- do you have time to review this PR as well?

alamb · 2022-10-05T21:41:48Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+    /// all schemas in the logical plan, as a fall back if we cannot resolve an expression type
+    /// from the input schema alone
+    all_schemas: Vec<DFSchemaRef>,


I don't understand in what cases we wouldn't be able to resolve an expr type from the input schema alone.

The only case I can think of is when the plan node has more than one input (e.g. a Join or a Union). Apart from that, I would expect that we can always resolve the type of the expressions using the input schema.

While Joins and Unions are not (yet?) handled by this optimization rule, I think that even in these cases we should be able to construct one consolidated schema that is used to resolve the expression type - otherwise the expression probably is invalid in the first place.

Right now the fall-back logic just randomly merges all possible schemas into one; there's no guarantee that the resulting merged schema will be any good for resolving the expression at hand. That's especially the case if the logical plan involves lots of aliasing: there may be many fields from different nodes that have the same name, e.g. a, but different data types, yet the merged schema will have only one a column, the first that we encounter while merging schemas.
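
A small sketch of the name-collision concern, assuming (as described in this comment) that merging keeps the first field seen for a given name. `Schema`, `merge`, `type_of` and `merge_all` are toy names for illustration only, not `DFSchema` itself.

```rust
// Toy stand-ins; hypothetical names, not the DataFusion API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Boolean,
}

#[derive(Debug, Clone, Default)]
struct Schema {
    fields: Vec<(String, DataType)>,
}

impl Schema {
    fn new(fields: &[(&str, DataType)]) -> Self {
        Schema {
            fields: fields.iter().map(|(n, t)| (n.to_string(), *t)).collect(),
        }
    }

    /// Add the fields of `other`, skipping names that are already present
    /// (assumed first-wins behaviour, as described in the comment above).
    fn merge(&mut self, other: &Schema) {
        for (name, ty) in &other.fields {
            if !self.fields.iter().any(|(n, _)| n == name) {
                self.fields.push((name.clone(), *ty));
            }
        }
    }

    fn type_of(&self, name: &str) -> Option<DataType> {
        self.fields
            .iter()
            .find(|(n, _)| n.as_str() == name)
            .map(|(_, t)| *t)
    }
}

/// Same fold shape as the fall-back in the diff: merge everything into one schema.
fn merge_all(schemas: &[Schema]) -> Schema {
    schemas.iter().fold(Schema::default(), |mut lhs, rhs| {
        lhs.merge(rhs);
        lhs
    })
}
```

With a scan schema where `a` is Int32 and a later node that aliases a Boolean expression as `a`, the merged schema reports Int32 for `a`, regardless of which node the expression being resolved belongs to.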

I am glad I was not the only one confused :)

This random merge used to exist only in the Filter plan and was fixed once in #1925. I am ashamed to say I cannot recall why Filter was special 😥 I would put it down to my mistake, sorry for that >_<

I think it is fine to remove the else branch, as it has had no effect since #1925

I wouldn't worry -- there have been many changes in DataFusion and it was (and still is) a fast changing codebase -- the day I write perfect code without any errors will be the day I hold other people to the same standard 😆

alamb · 2022-10-05T21:44:41Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+            // expression type could not be resolved in schema, fall back to all schemas
+            let merged_schema =
+                self.all_schemas
+                    .iter()
+                    .fold(DFSchema::empty(), |mut lhs, rhs| {
+                        lhs.merge(rhs);
+                        lhs
+                    });
+            expr.get_type(&merged_schema)?
+        };


I think all exprs should be resolvable with the unified schemas, as explained above -- but maybe it is a performance optimization 🤔.

Perhaps you could leave a comment explaining that we are not sure if it is necessary

alamb · 2022-10-05T21:45:17Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

        let expr = binary_expr(
            binary_expr(
-                sum(binary_expr(col("a"), Operator::Plus, lit("1"))),
+                sum(binary_expr(col("a"), Operator::Plus, lit(1))),


I agree this test doesn't make sense as coercion should have happened before this pass

alamb

I ran all the tests without the fallback of the plan's schemas and they worked. Thus I think it is not necessary.

However, I like your incremental approach to development of keeping the old code there as it is no worse than master. I would be fine with merging this PR as is and I will create a PR that removes a workaround as a follow on.

I will wait until tomorrow to see if @waynexia would like to comment too.

alamb · 2022-10-05T21:49:56Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

+            // expression type could not be resolved in schema, fall back to all schemas
+            let merged_schema =
+                self.all_schemas
+                    .iter()
+                    .fold(DFSchema::empty(), |mut lhs, rhs| {
+                        lhs.merge(rhs);
+                        lhs
+                    });
+            expr.get_type(&merged_schema)?
+        };


This code may be a workaround from some issue we have since fixed 🤔

liukun4515 · 2022-10-06T04:29:32Z

I will review it tomorrow


alamb · 2022-10-06T11:02:21Z

Github actions is having issues I think -- the CI failures are not related to changes in this PR

andygrove · 2022-10-06T16:47:51Z

I plan on reviewing this today

waynexia · 2022-10-07T04:35:43Z

Sorry for the late reply, I plan to review it later today.

alamb · 2022-10-07T19:50:08Z

I would like to merge this PR -- I'll plan on doing so tomorrow if @waynexia and/or @andygrove haven't had a chance by then

waynexia

I've reviewed this fix and it looks reasonable to me. Thanks @alex-natzka @alamb

alamb · 2022-10-11T12:14:50Z

Thanks again @alex-natzka and @waynexia

ursabot · 2022-10-11T12:22:17Z

Benchmark runs are scheduled for baseline = e10d647 and contender = 0cf5630. 0cf5630 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Alexander Spies added 3 commits October 5, 2022 12:57

CommonSubexprEliminate: Fix additional col schema

05b0b47

Use correct types in test id_array_visitor

f930e25

Re-enable fall back schema for datatype resolution

f1909cf

Fall back to the merged schema from the whole logical plan if the input schema was not sufficient to resolve the datatype of a sub-expression. This re-enables the fallback logic added in 3860cd3 (apache#1925).

alex-spies changed the title from "Fix common subexpr eliminate schema" to "Fix output schema generated by CommonSubExprEliminate" on Oct 5, 2022

github-actions bot added the optimizer Optimizer rules label Oct 5, 2022

alamb reviewed Oct 5, 2022

alamb approved these changes Oct 5, 2022

alamb mentioned this pull request Oct 5, 2022

Remove some uneeded code in CommonSubexprEliminate #3730

Merged

Add comment on fall-back logic using all schemas

ea26e7a

Point out that it can likely be removed.

andygrove self-requested a review October 6, 2022 16:47

waynexia approved these changes Oct 8, 2022

alamb merged commit 0cf5630 into apache:master Oct 11, 2022

alex-spies deleted the fix_common_subexpr_eliminate_schema branch October 11, 2022 12:56
