Skip to content

Conversation

@EeshanBembi
Copy link
Contributor

Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.

Changes:

  • Add gc_view_arrays() function to apply GC on view arrays
  • Integrate GC into InProgressSpillFile::append_batch()
  • Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes #19414 where GROUP BY on StringView columns created 820MB spill files instead of 33MB due to sliced arrays maintaining references to original buffers.

Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Dec 21, 2025
@EeshanBembi EeshanBembi force-pushed the fix-stringview-spill-gc branch from d3e3383 to cc6c180 Compare December 22, 2025 09:55
Add garbage collection for StringView and BinaryView arrays before spilling
to disk. This prevents sliced arrays from carrying their entire original
buffers when written to spill files.

Changes:
- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes apache#19414 where GROUP BY on StringView columns created 820MB spill files
instead of 33MB due to sliced arrays maintaining references to original buffers.

Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.
"https://produkty%2Fpulove.ru/album/login",
];

let mut urls = Vec::with_capacity(200_000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be quite heavy - maybe we can just keep the minimal reproducible version to verify that the changes are working as expected [ like the test above this ]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be great to make tests faster and use less memory.

if any_gc_performed {
Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
} else {
Ok(batch.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just return the batch without clone ?

let mut new_columns: Vec<Arc<dyn Array>> = Vec::with_capacity(batch.num_columns());
let mut any_gc_performed = false;

for array in batch.columns() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe lets exit early and return the batch as it is if there are no view arrays in the RecordBatch ?

}

fn should_gc_view_array(len: usize, data_buffers: &[arrow::buffer::Buffer]) -> bool {
if len < 10 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the number of rows a useful heuristic to not GC? Even if there are few rows, the data buffer may still be large.

return false;
}

let total_buffer_size: usize = data_buffers.iter().map(|b| b.capacity()).sum();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use some of the existing size calculation methods like get_buffer_memory_size instead of duplicating calculations?

#[cfg(test)]
const VIEW_SIZE_BYTES: usize = 16;
#[cfg(test)]
const INLINE_THRESHOLD: usize = 12;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a public constant for this MAX_INLINE_VIEW_LEN

Copy link
Contributor

@2010YOUY01 2010YOUY01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it's a good idea to include compaction inside SpillManager

One follow-on to do is refactoring the external sort, now it's doing some compaction already outside the SpillManager, so with this PR it would be doing redundant compactions, I believe we should just simply remove that

fn organize_stringview_arrays(

}

if any_gc_performed {
Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we can get rid of the any_gc_performed condition, and always go this branch to make it a little bit simpler

"https://produkty%2Fpulove.ru/album/login",
];

let mut urls = Vec::with_capacity(200_000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be great to make tests faster and use less memory.

}

/// Appends a `RecordBatch` to the spill file, initializing the writer if necessary.
/// Performs garbage collection on StringView/BinaryView arrays to reduce spill file size.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to add more comments to explain the rationale for views gc, perhaps just copy and paste from

fn organize_stringview_arrays(

- Replace row count heuristic with 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains 96% reduction in spill file sizes
@EeshanBembi EeshanBembi marked this pull request as ready for review December 31, 2025 10:38
@EeshanBembi EeshanBembi requested a review from 2010YOUY01 January 2, 2026 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] BinaryView/StringView columns are spilled without GC and results in enormous spill files

4 participants