Skip to content

fix: Optimize !~ '.*' case to col IS NULL AND Boolean(NULL) instead of Eq ""#20702

Merged
alamb merged 9 commits intoapache:mainfrom
petern48:bug_regexp_optim
Mar 12, 2026
Merged

fix: Optimize !~ '.*' case to col IS NULL AND Boolean(NULL) instead of Eq ""#20702
alamb merged 9 commits intoapache:mainfrom
petern48:bug_regexp_optim

Conversation

@petern48
Copy link
Contributor

@petern48 petern48 commented Mar 4, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

A pre-existing optimization rule for the !~ .* (regexp not match) case rewrote the plan to Eq "", which would return empty strings as part of the result. This is incorrect and doesn't match the output without the optimization rule.

Instead, this PR rewrites the plan to simply col IS NULL AND Boolean(NULL) or, in other words, "NULL if col is NULL else false."

I've confirmed this behavior matches the result of running queries manually with the optimization rule turned off.

Are these changes tested?

Fixed expected output in tests. Added new tests for nulls

Are there any user-facing changes?

Yes, a minor bug fix. When querying s !~ .*, empty strings will no longer be included in the result which is consistent with the behavior without the optimization rule.

@petern48 petern48 changed the title Fix: Optimize ~! '.*' and '.*' cases to False instead of Eq "" fix/perf: Optimize ~! '.*' case to False instead of Eq "" Mar 4, 2026
@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Mar 4, 2026
@petern48 petern48 marked this pull request as ready for review March 4, 2026 18:28
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @petern48

/// - combinations (alternatives) of the above, will be concatenated with `OR` or `AND`
/// - `EQ .*` to NotNull
/// - `NE .*` means IS EMPTY
/// - `NE .*` to false (.* matches non-empty and empty strings, and NULL !~ '.*' results in NULL so this can never be true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is only false in the context of filters (and this code can currently be used for both filters and regular expressions)

I think null !~ '.*' is actually null (not false)

I think a correct rewrite is

CASE 
  WHEN col IS NOT NULL THEN FALSE 
  ELSE NULL
END

Copy link
Contributor Author

@petern48 petern48 Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah thanks, i didn't realize this code applied to cases outside of filters. I've fixed it and updated the PR title and description.

note: the optimizer would further rewrite the CASE expr into IS NOT NULL AND NULL when I ran the tests, so I rewrote it directly to that instead

edit: should've been IS NULL AND NULL 😬

@petern48 petern48 changed the title fix/perf: Optimize ~! '.*' case to False instead of Eq "" fix: Optimize ~! '.*' case to col IS NOT NULL AND Boolean(NULL) instead of Eq "" Mar 5, 2026
@petern48 petern48 changed the title fix: Optimize ~! '.*' case to col IS NOT NULL AND Boolean(NULL) instead of Eq "" fix: Optimize !~ '.*' case to col IS NOT NULL AND Boolean(NULL) instead of Eq "" Mar 5, 2026
@alamb alamb added the bug Something isn't working label Mar 5, 2026
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @petern48

@alamb
Copy link
Contributor

alamb commented Mar 10, 2026

Thanks again for working on this @petern48 -- here is a proposed PR to fix the logic in this PR

What do you think?

@petern48 petern48 changed the title fix: Optimize !~ '.*' case to col IS NOT NULL AND Boolean(NULL) instead of Eq "" fix: Optimize !~ '.*' case to col IS NULL AND Boolean(NULL) instead of Eq "" Mar 11, 2026
@petern48
Copy link
Contributor Author

Appreciate the little bump to get this moving again. So sorry for the delay. I didn't mean to keep you waiting. I definitely agree with your suggestion. I was cramming for a deadline, and wanted to wait until I had more time to investigate something a bit further.

I vaguely remember just copy-pasting the IS NOT NULL part from the optimized output, so I had a small suspicion that the CASE statement was rewritten incorrectly to IS NOT NULL. Though I just manually checked it again, and it seems to optimize properly to IS NULL. Maybe I just wrote the case statement incorrectly, too 🤷. Anyway, lesson learned. I need to slow down a bit nowadays 🤦 , haha. Thanks for bearing with me.

@alamb
Copy link
Contributor

alamb commented Mar 12, 2026

No worries -- thank you so much for helping push this along (crazy corner cases are always tricky)

@alamb alamb added this pull request to the merge queue Mar 12, 2026
Merged via the queue into apache:main with commit 8b412de Mar 12, 2026
34 checks passed
@petern48 petern48 deleted the bug_regexp_optim branch March 12, 2026 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: regexp simplify optimization incorrect simplifies .* pattern to Eq "" operation

2 participants