ARROW-9160: [C++] Implement contains for exact matches#7593
ARROW-9160: [C++] Implement contains for exact matches#7593xhochy wants to merge 8 commits intoapache:masterfrom
Conversation
|
This currently fails for chunked arrays. I thought that they should be handled by the kernel framework automatically but it seems, they aren't. |
|
@xhochy chunked arrays should be handled automatically by the function executors. I will take a look. |
|
Implemented Knuth-Morris-Pratt, got roughly a 2x speedup in the benchmark. |
|
@xhochy I'm fixing a couple issues with the implementation:
|
|
I just opened https://issues.apache.org/jira/browse/ARROW-9285 -- it should be easy to check if a kernel has mistakenly replaced a preallocated data buffer (which may be a slice of a larger output that is being populated) |
|
I also propose to rename the function from "contains_exact" to "utf8_contains". Ideas:
@maartenbreddels @pitrou any opinions about this? |
|
I think prefixes make sense. We will have sometime similar kernel names that act quite different depending on the types they work on. I would differentiate in the string/binary case the kernels between the categories:
With that, I would rename the kernel here to |
|
You match a regex, you don't contain it. |
That is one of the name clashes, we already have a match kernel. |
|
|
|
I like the prefixing by
(or s/contains/match 😄 ) There might be kernels that work on binary data, but do not work well with utf8, e.g. a Regarding case insensitive variants, I think we should expose functions to do normalization (eg replace the single codepoint ë by the letter e and the diaeresis combining character(the dots above the ë)), case folding, and removal of combining characters. That allows users to remove e.g. the diaeresis from ë to do pattern matching without diacritics. |
|
While at the difficult topic of naming, is there a conversion (agreed or emerging) for naming the functors/ArrayKernelExec implementations? I see in
|
Could you elaborate? Why is this not a problem with the lower/upper kernels? |
pitrou
left a comment
There was a problem hiding this comment.
This is a nice addition. A couple comments.
cpp/src/arrow/compute/api_scalar.h
Outdated
There was a problem hiding this comment.
I don't think defaulting to the empty string makes sense here.
There was a problem hiding this comment.
This is needed to be able to construct an instance of it in Cython. I can remove this and switch the Cython instantiation to a pointer.
python/pyarrow/compute.py
Outdated
There was a problem hiding this comment.
That's rather vague. Does contains_exact(pa.array([1,2,3]), pa.array([1,2])) return true? I don't think so.
The data preallocation is only for fixed size outputs (eg boolean, integers, floating point, etc). |
There was a problem hiding this comment.
We can move this line into then clause before line 327.
This seems reasonable to me |
|
"binary_contains_exact" still works for utf8 data, though. |
|
OK, do you have an idea for a prefix for a function that works either on BinaryArray or StringArray? This could also be handled with aliases so we could have |
|
String is a subclass of binary right? So I expect binary_contains_exact to work on strings right? |
|
Right, I was just mentioned that "binary" isn't exclusive of "utf8" in this particular instance. |
|
Other types (e.g. list) will need to have a "contains" operation, so I think it's more clear to have |
|
Adressed all comments and CI is happy, too. |
wesm
left a comment
There was a problem hiding this comment.
+1. I'll fix the benchmark function rename and then merge this
Follow function rename in benchmarks
No description provided.