Skip to content

perf: Optimize to_char to allocate less, fix NULL handling#20635

Merged
alamb merged 4 commits intoapache:mainfrom
neilconway:neilc/optimize-to-char
Mar 4, 2026
Merged

perf: Optimize to_char to allocate less, fix NULL handling#20635
alamb merged 4 commits intoapache:mainfrom
neilconway:neilc/optimize-to-char

Conversation

@neilconway
Copy link
Contributor

@neilconway neilconway commented Mar 1, 2026

Which issue does this PR close?

Rationale for this change

The current to_char implementation (both scalar and array paths) allocates a new string for every input row to hold the result of the ArrayFormatter. To produce the results, it uses StringArray::from, which copies.

We can do better by using a reusable buffer to store the result of ArrayFormatter, and then append to StringBuilder. We can then construct the final result values from the StringBuilder very efficiently. This yields a 10-22% improvement on the to_char microbenchmarks.

In the scalar path, we can do a bit better by having the ArrayFormatter write directly into the StringBuilder's buffer, which saves a copy.

In the array path, we can't easily do this, because to_char has some (dubious) logic to retry errors on Date32 inputs by casting to Date64 and retrying the ArrayFormatter. Since the StringBuilder is shared between all rows and there's no easy way to remove the partial write that might happen in the case of an error, we use the intermediate buffer here instead.

This PR also cleans up various code in to_char, and fixes a bug in NULL handling: in the array case, if the current data value is NULL but the format string is non-NULL, we incorrectly returned an empty string instead of NULL.

What changes are included in this PR?

  • Optimize to_char (scalar and array paths) as described above
  • Fix bug in NULL handling
  • Add SLT test case for NULL handling bug
  • Simplify and refactor various parts of to_char, particularly around error handling
  • Revise to_char benchmarks: the previous version tested to_char for Date32 inputs where the format string specifies a time value, which is an odd corner case. Also get rid of benchmarking the scalar-scalar case; this is very fast and occurs rarely in practice (will usually be constant-folded before query execution).

Are these changes tested?

Yes. Benchmarked and added new test.

Are there any user-facing changes?

No.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Mar 1, 2026
@neilconway neilconway changed the title perf: Optimize to_char to reduce allocations, fix NULL handling perf: Optimize to_char to allocate less, fix NULL handling Mar 1, 2026
@neilconway neilconway force-pushed the neilc/optimize-to-char branch from f56f972 to 0a1245c Compare March 1, 2026 23:11
@neilconway
Copy link
Contributor Author

Benchmarks:

  ┌───────────────────────────────┬───────────┬───────────────────┬────────┐
  │           Benchmark           │ Data type │      Format       │ Change │
  ├───────────────────────────────┼───────────┼───────────────────┼────────┤
  │ array_date_only_patterns_1000 │ Date32    │ array, date-only  │ -10%   │
  ├───────────────────────────────┼───────────┼───────────────────┼────────┤
  │ array_datetime_patterns_1000  │ Date64    │ array, datetime   │ -10%   │
  ├───────────────────────────────┼───────────┼───────────────────┼────────┤
  │ array_mixed_patterns_1000     │ Date64    │ array, mixed      │ -10%   │
  ├───────────────────────────────┼───────────┼───────────────────┼────────┤
  │ scalar_date_only_pattern_1000 │ Date32    │ scalar, date-only │ -22%   │
  ├───────────────────────────────┼───────────┼───────────────────┼────────┤
  │ scalar_datetime_pattern_1000  │ Date64    │ scalar, datetime  │ -17%   │
  └───────────────────────────────┴───────────┴───────────────────┴────────┘

@alamb alamb added this pull request to the merge queue Mar 4, 2026
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @neilconway and @Jefffrey

return to_char_scalar(&expression.cast_to(&Date64, None)?, format);
// Write directly into the builder's internal buffer, then
// commit the value with append_value("").
match formatter.value(i).write(&mut builder) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Merged via the queue into apache:main with commit 340ef60 Mar 4, 2026
30 checks passed
@neilconway neilconway deleted the neilc/optimize-to-char branch March 5, 2026 13:39
alamb pushed a commit to alamb/datafusion that referenced this pull request Mar 11, 2026
…20635)

## Which issue does this PR close?

- Closes apache#20634.

## Rationale for this change

The current `to_char` implementation (both scalar and array paths)
allocates a new string for every input row to hold the result of the
`ArrayFormatter`. To produce the results, it uses `StringArray::from`,
which copies.

We can do better by using a reusable buffer to store the result of
`ArrayFormatter`, and then append to `StringBuilder`. We can then
construct the final result values from the `StringBuilder` very
efficiently. This yields a 10-22% improvement on the `to_char`
microbenchmarks.

In the scalar path, we can do a bit better by having the
`ArrayFormatter` write directly into the `StringBuilder`'s buffer, which
saves a copy.

In the array path, we can't easily do this, because `to_char` has some
(dubious) logic to retry errors on `Date32` inputs by casting to
`Date64` and retrying the ArrayFormatter. Since the `StringBuilder` is
shared between all rows and there's no easy way to remove the partial
write that might happen in the case of an error, we use the intermediate
buffer here instead.

This PR also cleans up various code in `to_char`, and fixes a bug in
`NULL` handling: in the array case, if the current data value is NULL
but the format string is non-NULL, we incorrectly returned an empty
string instead of NULL.

## What changes are included in this PR?

* Optimize `to_char` (scalar and array paths) as described above
* Fix bug in NULL handling
* Add SLT test case for NULL handling bug
* Simplify and refactor various parts of `to_char`, particularly around
error handling
* Revise `to_char` benchmarks: the previous version tested `to_char` for
`Date32` inputs where the format string specifies a time value, which is
an odd corner case. Also get rid of benchmarking the scalar-scalar case;
this is very fast and occurs rarely in practice (will usually be
constant-folded before query execution).

## Are these changes tested?

Yes. Benchmarked and added new test.

## Are there any user-facing changes?

No.
alamb added a commit that referenced this pull request Mar 12, 2026
…ing (#20635) (#20885)

- Part of #19692
- Closes #20634 on branch-53

This PR:
- Backports #20635 from
@neilconway to the branch-53 line

Co-authored-by: Neil Conway <neil.conway@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid per-row allocations in to_char

3 participants