GH-40968: [C++][Gandiva] add RE2::Options set_dot_nl(true) for Like function#40970
GH-40968: [C++][Gandiva] add RE2::Options set_dot_nl(true) for Like function#40970kou merged 6 commits intoapache:mainfrom
Conversation
|
|
|
@kou could you plz review? |
| Result<std::shared_ptr<LikeHolder>> LikeHolder::Make(const std::string& sql_pattern) { | ||
| std::string pcre_pattern; | ||
| ARROW_RETURN_NOT_OK(RegexUtil::SqlLikePatternToPcre(sql_pattern, pcre_pattern)); | ||
|
|
||
| auto lholder = std::shared_ptr<LikeHolder>(new LikeHolder(pcre_pattern)); | ||
| ARROW_RETURN_IF(!lholder->regex_.ok(), | ||
| Status::Invalid("Building RE2 pattern '", pcre_pattern, | ||
| "' failed with: ", lholder->regex_.error())); | ||
|
|
||
| return lholder; | ||
| } |
There was a problem hiding this comment.
because it's not used anymore
There was a problem hiding this comment.
It seems that this is an exported API. If we remove this, we break backward compatibility. Is it expected?
There was a problem hiding this comment.
Good point. Restored
| "'like' function requires a string literal as the second parameter")); | ||
|
|
||
| RE2::Options regex_op; | ||
| regex_op.set_dot_nl(true); // set dotall mode for the regex. |
There was a problem hiding this comment.
This breaks backward compatibility, right?
Can we keep backward compatibility?
|
Could you rebase on main to fix CI failures? |
Could you please clarify the example? There are no '\n' in the target string. Because the pattern '%Space1.%' does match the string you give. |
|
I do believe it is kind of a "bug" in Gandiva, a minimal reproducing example is |
There was a problem hiding this comment.
Can this set_dot_nl be moved to the constructor, i.e. enable it for all test cases?
|
|
||
| TEST_F(TestLikeHolder, TestPcreSpecialWithNewLine) { | ||
| regex_op.set_dot_nl(true); | ||
| EXPECT_OK_AND_ASSIGN(auto const like_holder, LikeHolder::Make("%Space1.%", regex_op)); |
There was a problem hiding this comment.
Can you also add a simpler test case, e.g. 'abc\nd' LIKE '%abc%' is enough to demonstate this change.
Oh. OK. Let's "fix" this without backward compatibility. |
kou
left a comment
There was a problem hiding this comment.
+1
Gandiva function "LIKE" does not always work correctly when the string contains \n.
String value:
[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]
Pattern '%Space1%' match string, but '%Space1.%' - not.
Is this description correct? I think that %Space1% doesn't match the string because the string includes \n.
|
@kou yes, this is how we found a problem. RE2 documentation said that '.' any character, possibly including newline. If in pattern we don't use any special symbols - it work. But with special symbols and with \n in string - not. But flag handle this case. |
kou means that
is not accurate. |
niyue
left a comment
There was a problem hiding this comment.
+1
Besides the LIKE keyword, I tested postgres 16's regexp_like function, which matches the new line character by default as well
SELECT regexp_like(E'hello\nfoo', 'hello.*'); ==> true
|
@kou Could you merge this, plz? |
|
We use the description for commit message. So I want to use correct description. |
|
@kou it's already fixed |
|
Thanks. |
…Like function (apache#40970) (#68) ### Rationale for this change Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. ### What changes are included in this PR? added flag set_dot_nl(true) to LikeHolder ### Are these changes tested? add unit tests. ### Are there any user-facing changes? Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <ivan.chesnov@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0affccc. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…Like function (apache#40970) ### Rationale for this change Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. ### What changes are included in this PR? added flag set_dot_nl(true) to LikeHolder ### Are these changes tested? add unit tests. ### Are there any user-facing changes? Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <ivan.chesnov@dremio.com> Co-authored-by: Ivan Chesnov <xxxlaykxxx@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…Like f (#80) * apacheGH-40968: [C++][Gandiva] add RE2::Options set_dot_nl(true) for Like function (apache#40970) (#68) Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. added flag set_dot_nl(true) to LikeHolder add unit tests. Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <ivan.chesnov@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com> * apacheGH-43119: [CI][Packaging] Update manylinux 2014 CentOS repos that have been deprecated (apache#43121) Jobs are failing to find mirrorlist.centos.org Updating repos based on solution from: apache#43119 (comment) Via archery No * GitHub Issue: apache#43119 Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Sutou Kouhei <kou@cozmixng.org> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com> --------- Signed-off-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Raúl Cumplido <raulcumplido@gmail.com> Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
…Like function (apache#40970) (dremio#68) ### Rationale for this change Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. ### What changes are included in this PR? added flag set_dot_nl(true) to LikeHolder ### Are these changes tested? add unit tests. ### Are there any user-facing changes? Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <ivan.chesnov@dremio.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Rationale for this change
Gandiva function "LIKE" does not always work correctly when the string contains \n.
String value:
[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]Pattern '%Space1%' nor '%Space1.%' do not match.
What changes are included in this PR?
added flag set_dot_nl(true) to LikeHolder
Are these changes tested?
add unit tests.
Are there any user-facing changes?
Yes
This PR includes breaking changes to public APIs.