Skip to content

[branch-53] perf: Optimize to_char to allocate less, fix NULL handling (#20635)#20885

Merged
alamb merged 1 commit intoapache:branch-53from
alamb:alamb/backport_20635_branch53
Mar 12, 2026
Merged

[branch-53] perf: Optimize to_char to allocate less, fix NULL handling (#20635)#20885
alamb merged 1 commit intoapache:branch-53from
alamb:alamb/backport_20635_branch53

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Mar 11, 2026

…20635)

## Which issue does this PR close?

- Closes apache#20634.

## Rationale for this change

The current `to_char` implementation (both scalar and array paths)
allocates a new string for every input row to hold the result of the
`ArrayFormatter`. To produce the results, it uses `StringArray::from`,
which copies.

We can do better by using a reusable buffer to store the result of
`ArrayFormatter`, and then append to `StringBuilder`. We can then
construct the final result values from the `StringBuilder` very
efficiently. This yields a 10-22% improvement on the `to_char`
microbenchmarks.

In the scalar path, we can do a bit better by having the
`ArrayFormatter` write directly into the `StringBuilder`'s buffer, which
saves a copy.

In the array path, we can't easily do this, because `to_char` has some
(dubious) logic to retry errors on `Date32` inputs by casting to
`Date64` and retrying the ArrayFormatter. Since the `StringBuilder` is
shared between all rows and there's no easy way to remove the partial
write that might happen in the case of an error, we use the intermediate
buffer here instead.

This PR also cleans up various code in `to_char`, and fixes a bug in
`NULL` handling: in the array case, if the current data value is NULL
but the format string is non-NULL, we incorrectly returned an empty
string instead of NULL.

## What changes are included in this PR?

* Optimize `to_char` (scalar and array paths) as described above
* Fix bug in NULL handling
* Add SLT test case for NULL handling bug
* Simplify and refactor various parts of `to_char`, particularly around
error handling
* Revise `to_char` benchmarks: the previous version tested `to_char` for
`Date32` inputs where the format string specifies a time value, which is
an odd corner case. Also get rid of benchmarking the scalar-scalar case;
this is very fast and occurs rarely in practice (will usually be
constant-folded before query execution).

## Are these changes tested?

Yes. Benchmarked and added new test.

## Are there any user-facing changes?

No.
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Mar 11, 2026
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb

@alamb alamb merged commit 519866c into apache:branch-53 Mar 12, 2026
48 of 51 checks passed
@alamb
Copy link
Contributor Author

alamb commented Mar 12, 2026

Thanks for the reviews @comphead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants