ARROW-9390: [C++][Doc] Review compute function names#7755
ARROW-9390: [C++][Doc] Review compute function names#7755pitrou wants to merge 3 commits intoapache:masterfrom
Conversation
|
Also cc @xhochy @nealrichardson |
|
Valgrind works fine on this PR. |
Modified function names: * minmax -> min_max * binary_isascii -> string_isascii (only works on string types) * ascii_length -> binary_length (also make it work on binary types) * binary_contains_exact -> match_substring (other possibility: has_substring ?) * match -> index_in * isin -> is_in * list_value_lengths -> list_value_length * partition_indices -> partition_nth_indices (other kinds of partitioning would be possible, e.g. using a predicate) Document string predicate functions (ARROW-9444). Also fix the allocation of IsValid output buffer in certain cases.
0f1f905 to
66c0824
Compare
|
Rebased and fixed CI (hopefully). |
On the PR (#7593 (comment)), @xhochy mentioned that the "exact" part was important to him |
|
Well, "substring" clearly implies "exact". As for case-sensitivity, I hope that can be a flag in the options, because I don't think we want a combinatorial explosion of string matching functions. |
+1, that's a much clearer name IMO (although the R people might disagree ;-)) |
wesm
left a comment
There was a problem hiding this comment.
Thanks for doing this. Some minor comments which can be addressed as follow up also
| auto func = std::make_shared<ScalarFunction>("binary_contains_exact", Arity::Unary()); | ||
| auto exec_32 = BinaryContainsExact<StringType>::Exec; | ||
| auto exec_64 = BinaryContainsExact<LargeStringType>::Exec; | ||
| void AddMatchSubstring(FunctionRegistry* registry) { |
There was a problem hiding this comment.
I think this is fine, we can use match_substring_case_insensitive for the case insensitive version
cc @xhochy
|
Does this conflict with 1d7d919 which I just merged? |
|
No, it's rebased already. |
|
AppVeyor build: https://ci.appveyor.com/project/pitrou/arrow/builds/34089592 |
|
I think this can be merged. Any concerns? |
|
Nope. +1 here |
nealrichardson
left a comment
There was a problem hiding this comment.
Sorry I didn't get to this before this merged, but a couple of thoughts.
| MakeUnaryStringBatchKernel<AsciiLower>("ascii_lower", registry); | ||
|
|
||
| AddUnaryStringPredicate<IsAscii>("binary_isascii", registry); | ||
| AddUnaryStringPredicate<IsAscii>("string_isascii", registry); |
There was a problem hiding this comment.
Why isn't this string_is_ascii? It seems we prefer is_x to isx (cf. isin -> is_in in this PR).
There was a problem hiding this comment.
That's a good point. Should we make a followup PR?
(if it's me, it's only tomorrow)
There was a problem hiding this comment.
I can make this change today unless someone else wants to
There was a problem hiding this comment.
Also ascii_is_alnum, etc. ;-)
| UnaryStringBenchmark(state, "binary_contains_exact", &options); | ||
| static void MatchSubstring(benchmark::State& state) { | ||
| MatchSubstringOptions options("abac"); | ||
| UnaryStringBenchmark(state, "match_substring", &options); |
There was a problem hiding this comment.
Are there clear criteria for when the function gets a binary_ or string_ prefix and when it doesn't?
Related, if we haven't documented the naming conventions (haven't seen it yet but I'll keep reading), we should so that future developers/selves know what to call new functions.
There was a problem hiding this comment.
I don't think so. Personally, I find prefixes a bit pointless or even distracting.
(or, here, plain misleading, since "binary_contains_exact" didn't work on binary arrays, only string arrays...)
There was a problem hiding this comment.
Perhaps someone else wants to think about a convention?
Followup from PR apache#7755: a new doc file wasn't checked in, and Sphinx only emits a warning.
Followup from PR #7755: a new doc file wasn't checked in, and Sphinx only emits a warning. Closes #7760 from pitrou/ARROW-9390-missing-file Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
Modified function names:
(other possibility: has_substring ?)
(other kinds of partitioning would be possible, e.g. using a predicate)
Document string predicate functions (ARROW-9444).
Also fix the allocation of IsValid output buffer in certain cases.