ScalarValue::to_array_of_size is slow for StringViewArray with many buffers #20458

@neilconway

Describe the bug

  • ScalarValue::to_array_of_size calls list_to_array_of_size for list-like inputs
  • list_to_array_of_size will be called with a one-element list containing an Arrow array type, and it does:
        let arrays = repeat_n(arr, size).collect::<Vec<_>>();
        let ret = match !arrays.is_empty() {
            true => arrow::compute::concat(arrays.as_slice())?,
            false => arr.slice(0, 0),
        };
  • If the input is a StringViewArray, repeat_n creates size copies, each carrying its own list of data buffers, and concat preserves all of those buffers in its output. So if the input is, say, a StringViewArray with 500 buffers and size is 1024, the result will have about 500k buffers (500 × 1024 = 512,000).
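
To make the arithmetic above concrete, here is a small std-only sketch of the effect. It is a toy model, not arrow's actual types or concat implementation: ToyViewArray, naive_concat, repeat_sharing_buffers, make_input and buffer_counts are all hypothetical names invented for illustration. It assumes, as described above, that concatenation appends each input's buffer list without deduplicating, and contrasts that with one possible alternative that repeats only the views.

```rust
use std::sync::Arc;

/// Toy stand-in for a view array (NOT arrow's real layout): each view names
/// a buffer index plus an offset and length inside that buffer.
#[derive(Clone)]
struct ToyViewArray {
    views: Vec<(usize, usize, usize)>, // (buffer index, offset, len)
    buffers: Vec<Arc<Vec<u8>>>,
}

/// Models the behavior described in the issue: every input's buffer list is
/// appended to the output, even if the same buffers were already added.
fn naive_concat(arrays: &[ToyViewArray]) -> ToyViewArray {
    let mut out = ToyViewArray { views: Vec::new(), buffers: Vec::new() };
    for a in arrays {
        // Shift buffer indices so each view still points at its own buffer.
        let base = out.buffers.len();
        out.views.extend(a.views.iter().map(|&(b, o, l)| (b + base, o, l)));
        out.buffers.extend(a.buffers.iter().cloned());
    }
    out
}

/// Hypothetical alternative: repeat only the views and keep a single copy of
/// the buffer list, so the buffer count does not grow with `size`.
fn repeat_sharing_buffers(arr: &ToyViewArray, size: usize) -> ToyViewArray {
    let mut views = Vec::with_capacity(arr.views.len() * size);
    for _ in 0..size {
        views.extend_from_slice(&arr.views);
    }
    ToyViewArray { views, buffers: arr.buffers.clone() }
}

/// Builds an input with `num_buffers` buffers and one view per buffer.
fn make_input(num_buffers: usize) -> ToyViewArray {
    let buffers: Vec<Arc<Vec<u8>>> =
        (0..num_buffers).map(|i| Arc::new(vec![i as u8; 16])).collect();
    let views = (0..num_buffers).map(|i| (i, 0, 16)).collect();
    ToyViewArray { views, buffers }
}

/// Returns (buffer count after naive concat, buffer count when sharing).
fn buffer_counts(num_buffers: usize, size: usize) -> (usize, usize) {
    let arr = make_input(num_buffers);
    let copies = vec![arr.clone(); size];
    let naive = naive_concat(&copies);
    let shared = repeat_sharing_buffers(&arr, size);
    (naive.buffers.len(), shared.buffers.len())
}
```

In this model, buffer_counts(500, 1024) yields 512,000 buffers for the naive path and 500 for the sharing path, while both produce the same number of views. Whether a real fix should repeat views, compact the buffers first, or do something else is left open here.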

To Reproduce

No response

Expected behavior

No response

Additional context

We probably didn't see this problem before because DataFusion (DF) doesn't usually call ScalarValue::to_array_of_size on a list whose underlying StringViewArray has many data buffers. I ran into this when working on #3781. It crops up in a scenario like

SELECT ... FROM t WHERE array_has_any(f.c, (SELECT array_agg(...) FROM parquet_source_file))

The array_agg creates a StringViewArray with many data buffers. In current DF, this gets rewritten into a join, but after fixing #3781, array_has_any is instead invoked with the array_agg output as a scalar, so we go through this code path and run into some pain.

Labels: bug
