Fix massive spill files for StringView/BinaryView columns #19444
base: main
Conversation
Force-pushed from d3e3383 to cc6c180:
Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.

Changes:
- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use a simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes apache#19414, where GROUP BY on StringView columns created 820MB spill files instead of 33MB because sliced arrays kept references to their original buffers.

Testing shows an 80-98% reduction in spill file sizes for typical GROUP BY workloads.
Force-pushed from cc6c180 to 7dfb1e2:
```rust
    "https://produkty%2Fpulove.ru/album/login",
];

let mut urls = Vec::with_capacity(200_000);
```
This might be quite heavy; maybe we can keep just the minimal reproducible version to verify that the changes work as expected (like the test above this one).
Yes, it would be great to make tests faster and use less memory.
```rust
if any_gc_performed {
    Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
} else {
    Ok(batch.clone())
```
Can we just return the batch without cloning it?
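A minimal sketch of the suggestion, using a hypothetical `Batch` stand-in (the real function works on a `RecordBatch`; everything here is illustrative): taking the argument by value lets the no-GC branch return it without a clone.

```rust
// Hypothetical stand-in for arrow's RecordBatch, just to show the ownership flow.
struct Batch {
    rows: Vec<String>,
}

// Taking `batch` by value lets the no-GC path hand it back unchanged,
// instead of calling `batch.clone()` on a borrowed reference.
fn gc_if_needed(batch: Batch, any_gc_performed: bool) -> Batch {
    if any_gc_performed {
        // Stand-in for the rebuilt batch with compacted columns.
        Batch {
            rows: batch.rows.iter().map(|s| s.trim().to_string()).collect(),
        }
    } else {
        batch // ownership is simply returned, no clone
    }
}
```

In practice a `RecordBatch` clone is shallow (the columns are `Arc`s), but taking ownership avoids even that.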
```rust
let mut new_columns: Vec<Arc<dyn Array>> = Vec::with_capacity(batch.num_columns());
let mut any_gc_performed = false;

for array in batch.columns() {
```
Maybe let's exit early and return the batch as-is if there are no view arrays in the RecordBatch?
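The early-exit check could look like this sketch, which uses a stand-in enum for brevity (the real code would match on arrow's `DataType::Utf8View` and `DataType::BinaryView`):

```rust
// Stand-in for arrow's DataType, reduced to what this check needs.
#[derive(Clone, Copy)]
enum ColType {
    Utf8View,
    BinaryView,
    Other,
}

// If no column is a view type there is nothing to compact, so the caller
// can return the input batch untouched before allocating anything.
fn has_view_columns(column_types: &[ColType]) -> bool {
    column_types
        .iter()
        .any(|t| matches!(t, ColType::Utf8View | ColType::BinaryView))
}
```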
```rust
}

fn should_gc_view_array(len: usize, data_buffers: &[arrow::buffer::Buffer]) -> bool {
    if len < 10 {
```
Is the number of rows a useful heuristic to not GC? Even if there are few rows, the data buffer may still be large.
```rust
        return false;
    }

    let total_buffer_size: usize = data_buffers.iter().map(|b| b.capacity()).sum();
```
Can we use one of the existing size-calculation methods, like get_buffer_memory_size, instead of duplicating the calculation?
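The size-only heuristic the reviewers suggest could be sketched as follows (buffer capacities are passed as plain `usize`s here for illustration; the real code would sum `Buffer::capacity()` or call `get_buffer_memory_size`, and the 10 KB threshold is the one named in this PR):

```rust
// Threshold below which compaction is not worth the copy; the PR uses 10 KB.
const GC_BUFFER_THRESHOLD: usize = 10 * 1024;

// Size-only heuristic, per the review: the row-count check is dropped,
// since even a handful of rows can pin arbitrarily large data buffers.
fn should_gc_view_array(buffer_capacities: &[usize]) -> bool {
    let total: usize = buffer_capacities.iter().sum();
    total >= GC_BUFFER_THRESHOLD
}
```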
```rust
#[cfg(test)]
const VIEW_SIZE_BYTES: usize = 16;
#[cfg(test)]
const INLINE_THRESHOLD: usize = 12;
```
There's a public constant for this: MAX_INLINE_VIEW_LEN.
2010YOUY01 left a comment:
Thanks, it's a good idea to include compaction inside SpillManager.

One follow-up is refactoring the external sort: it already does some compaction outside the SpillManager, so with this PR it would perform redundant compactions. I believe we should simply remove that.
```rust
fn organize_stringview_arrays(
```
```rust
}

if any_gc_performed {
    Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
```
nit: I think we can get rid of the any_gc_performed condition and always take this branch, to make it a little simpler.
```rust
}

/// Appends a `RecordBatch` to the spill file, initializing the writer if necessary.
/// Performs garbage collection on StringView/BinaryView arrays to reduce spill file size.
```
I recommend adding more comments to explain the rationale for the views GC; perhaps just copy and paste from `organize_stringview_arrays`.
- Replace row-count heuristic with a 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains a 96% reduction in spill file sizes
Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.
Changes:
- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use a simple threshold-based heuristic
Fixes #19414 where GROUP BY on StringView columns created 820MB spill files instead of 33MB due to sliced arrays maintaining references to original buffers.
Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.