Skip to content

feat: estimate cardinality for semi and anti-joins using distinct counts#20904

Open
buraksenn wants to merge 5 commits intoapache:mainfrom
buraksenn:use-ndv-for-semi-and-anti-join
Open

feat: estimate cardinality for semi and anti-joins using distinct counts#20904
buraksenn wants to merge 5 commits intoapache:mainfrom
buraksenn:use-ndv-for-semi-and-anti-join

Conversation

@buraksenn
Copy link
Contributor

Which issue does this PR close?

Does not close but part of #20766

Rationale for this change

Details are in #20766. But main idea is to use existing distinct count information to optimize joins similar to how Spark/Trino does

What changes are included in this PR?

This PR extends cardinality estimation for semi/anti joins using distinct counts

Are these changes tested?

I've added cases but not sure if I should've added benchmarks on this.

Are there any user-facing changes?

No

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Mar 12, 2026
Copy link
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a couple of minor points and a few tests to be added. The only change I'd like to see is bailing out when either side has no stats for a column pair.

None
}

/// Estimates the number of outer rows that have at least one matching
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The math looks sound to me, and coherent with that of #20846.

I was wondering if you did check other notable systems using CBO like Trino or Spark.

If so, consider adding a note, this will help reviewers trust the change, as already battle-tested elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This builds up on same assumption in the inner join in the same file estimate_inner_join_cardinality. I saw similar thing in postgres https://github.com/postgres/postgres/blob/02976b0a1718037f73fded250411b013e81fdafa/src/backend/utils/adt/selfuncs.c#L2718. I may need to check Spark and Trino again. In the epic it said about them but not sure about this.

If you have any reservations about I can close or maybe try to be more conservative on this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think we can look into what trino did, I think they had something for this, but the postgres approach makes sense

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reservations, but I think it's valuable to point to the original implementation for multiple reasons: documentation, revisit the implementation if the original source improves over it, and recognition for the original author/project.

As I said, this looks reasonable and correct to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @asolimando I understand and it totally makes sense. I could not find relevant estimation logic for this in Trino maybe I've missed this. Given Postgres approach and reference in comments I think it makes sense to go forward with this.

@buraksenn buraksenn force-pushed the use-ndv-for-semi-and-anti-join branch from 79dcc2b to ee530c3 Compare March 12, 2026 20:44
if let (Some(&o), Some(&i)) = (outer_ndv.get_value(), inner_ndv.get_value())
&& o > 0
{
selectivity *= (o.min(i) as f64) / (o as f64);
Copy link
Member

@asolimando asolimando Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at the Postgres' implementation (the branch without the "most common values", which we don't track), I noticed a potential improvement over what we have now.

Postgres uses selec = min(1, inner_ndv/outer_ndv) * (1 - nullfrac) as core formula (code is here: I have "condensed" the if/else into a single formula and renamed the variables ndv1 -> outer_ndv and ndv2 -> inner_ndv, so it's clear, but it's equivalent to the code).

We use selec = min(inner_ndv, outer_ndv) / outer_ndv (renamed again for readability here), which is equivalent to Postgres' min(1, inner_ndv/outer_ndv).

We don't account for NULL values, we are missing the * (1 - nullfrac) part of their formula.

Postgres removes them, as NULL values cannot ever match, and we could do that too, when ColumnStatistics::null_count is available.

We could do it this way:

Suggested change
selectivity *= (o.min(i) as f64) / (o as f64);
let null_frac = outer_stat.null_count
.get_value()
.map(|&nc| nc as f64 / outer_rows as f64)
.unwrap_or(0.0);
selectivity *= (o.min(i) as f64) / (o as f64) * (1.0 - null_frac);

WDYT?

If you accept the change, we would need a few additional test cases to exercise this part of the code, I can think of these cases:

  • single column with nulls on outer side
  • anti-join with nulls
  • all outer rows are null
  • multi-column with nulls on one column

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @asolimando I've added your part and will add test cases now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've adjusted comment and added test cases

buraksenn and others added 2 commits March 13, 2026 14:32
Co-authored-by: Alessandro Solimando <alessandro.solimando@gmail.com>
Copy link
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for addressing all my comments fully!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants