Fix massive spill files for StringView/BinaryView columns #19444

EeshanBembi · 2025-12-21T19:53:33Z

Add garbage collection for StringView and BinaryView arrays before spilling to disk. This prevents sliced arrays from carrying their entire original buffers when written to spill files.

Changes:

- Add gc_view_arrays() function to apply GC on view arrays
- Integrate GC into InProgressSpillFile::append_batch()
- Use simple threshold-based heuristic (100+ rows, 10KB+ buffer size)

Fixes #19414 where GROUP BY on StringView columns created 820MB spill files instead of 33MB due to sliced arrays maintaining references to original buffers.

Testing shows 80-98% reduction in spill file sizes for typical GROUP BY workloads.
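To make the buffer-retention problem concrete, here is a minimal sketch using a toy model rather than the arrow-rs types. `ToyViewArray` and all of its methods are hypothetical names invented for illustration; the only point is that a zero-copy slice keeps the whole shared buffer alive, while a compaction step copies just the bytes the remaining views reference.

```rust
use std::sync::Arc;

/// Toy stand-in for a view array (hypothetical, not the arrow-rs type):
/// each view is an (offset, len) pair into one shared data buffer.
#[derive(Clone)]
struct ToyViewArray {
    buffer: Arc<Vec<u8>>,
    views: Vec<(usize, usize)>,
}

impl ToyViewArray {
    fn from_strs(values: &[&str]) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            views.push((buffer.len(), v.len()));
            buffer.extend_from_slice(v.as_bytes());
        }
        Self { buffer: Arc::new(buffer), views }
    }

    /// Zero-copy slice: only the views are narrowed, the buffer is shared.
    fn slice(&self, start: usize, len: usize) -> Self {
        Self {
            buffer: Arc::clone(&self.buffer),
            views: self.views[start..start + len].to_vec(),
        }
    }

    /// Compaction: copy only the bytes the remaining views point at.
    fn gc(&self) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::with_capacity(self.views.len());
        for &(offset, len) in &self.views {
            views.push((buffer.len(), len));
            buffer.extend_from_slice(&self.buffer[offset..offset + len]);
        }
        Self { buffer: Arc::new(buffer), views }
    }

    /// Bytes a writer would emit if it serialized the whole data buffer.
    fn buffer_bytes(&self) -> usize {
        self.buffer.len()
    }

    fn value(&self, i: usize) -> &str {
        let (offset, len) = self.views[i];
        std::str::from_utf8(&self.buffer[offset..offset + len]).unwrap()
    }
}
```

In this model, slicing a 1,000-value array down to 5 values leaves `buffer_bytes()` unchanged, and only `gc()` shrinks it to the bytes of those 5 values.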


bharath-techie · 2025-12-22T13:03:09Z

datafusion/physical-plan/src/spill/mod.rs

+            "https://produkty%2Fpulove.ru/album/login",
+        ];
+
+        let mut urls = Vec::with_capacity(200_000);


This might be quite heavy. Maybe we can keep just the minimal reproducible version to verify that the changes work as expected (like the test above this one).


bharath-techie · 2025-12-22T13:06:44Z

datafusion/physical-plan/src/spill/mod.rs

+    if any_gc_performed {
+        Ok(RecordBatch::try_new(batch.schema(), new_columns)?)
+    } else {
+        Ok(batch.clone())


Can we just return the batch without cloning it?

bharath-techie · 2025-12-22T13:08:51Z

datafusion/physical-plan/src/spill/mod.rs

+    let mut new_columns: Vec<Arc<dyn Array>> = Vec::with_capacity(batch.num_columns());
+    let mut any_gc_performed = false;
+
+    for array in batch.columns() {


Maybe let's exit early and return the batch as is if there are no view arrays in the RecordBatch?
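A minimal sketch of such an early-exit check, using a toy type enum instead of the real Arrow `DataType`. `ToyType` and `has_view_columns` are hypothetical names, not part of the PR.

```rust
/// Toy stand-in for a column's data type (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyType {
    Int64,
    Utf8,
    Utf8View,
    BinaryView,
}

/// Returns true if at least one column is a view type and therefore a
/// candidate for compaction; callers can skip all GC work otherwise.
fn has_view_columns(column_types: &[ToyType]) -> bool {
    column_types
        .iter()
        .any(|t| matches!(t, ToyType::Utf8View | ToyType::BinaryView))
}
```

With a check like this at the top of the function, batches that contain no view columns are returned untouched without allocating a new column vector.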

pepijnve · 2025-12-22T13:21:37Z

datafusion/physical-plan/src/spill/mod.rs

+}
+
+fn should_gc_view_array(len: usize, data_buffers: &[arrow::buffer::Buffer]) -> bool {
+    if len < 10 {


Is the number of rows a useful heuristic to not GC? Even if there are few rows, the data buffer may still be large.
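As a sketch of the alternative, the decision can depend on buffer size alone. The function and constant names below are hypothetical; the 10KB value is taken from the PR description, and the input is simply a list of data buffer sizes in bytes.

```rust
/// Hypothetical threshold, mirroring the 10KB figure from the PR description.
const MIN_BUFFER_BYTES_FOR_GC: usize = 10 * 1024;

/// Decide on compaction from total data buffer size only, ignoring row count.
fn should_gc(data_buffer_sizes: &[usize]) -> bool {
    let total: usize = data_buffer_sizes.iter().sum();
    total >= MIN_BUFFER_BYTES_FOR_GC
}
```

Under this rule a three-row slice that still references a 1 MiB buffer is compacted, whereas a row-count cutoff would have skipped it.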

pepijnve · 2025-12-22T13:26:23Z

datafusion/physical-plan/src/spill/mod.rs

+        return false;
+    }
+
+    let total_buffer_size: usize = data_buffers.iter().map(|b| b.capacity()).sum();


Can we use some of the existing size calculation methods like get_buffer_memory_size instead of duplicating calculations?

pepijnve · 2025-12-22T13:43:44Z

datafusion/physical-plan/src/spill/mod.rs

+#[cfg(test)]
+const VIEW_SIZE_BYTES: usize = 16;
+#[cfg(test)]
+const INLINE_THRESHOLD: usize = 12;


There's a public constant for this: MAX_INLINE_VIEW_LEN.
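For context on why these two numbers matter in the tests, here is a small sketch of the expected size of a fully compacted view array. The constant and function names are local, hypothetical stand-ins. The values follow the Arrow view layout, in which each view occupies 16 bytes and values of up to 12 bytes are stored inline in the view itself.

```rust
/// Size of one view entry in bytes (local stand-in for the test constant).
const VIEW_BYTES: usize = 16;
/// Longest value stored inline in the view; same value as MAX_INLINE_VIEW_LEN.
const INLINE_LIMIT: usize = 12;

/// Returns (bytes used by views, bytes needed in data buffers) for a fully
/// compacted array whose values have the given lengths.
fn compacted_size(value_lens: &[usize]) -> (usize, usize) {
    let views = value_lens.len() * VIEW_BYTES;
    // Only values longer than the inline limit occupy data buffer space.
    let data: usize = value_lens.iter().filter(|&&l| l > INLINE_LIMIT).sum();
    (views, data)
}
```

For value lengths of 5, 12, 13 and 40 bytes, only the last two land in a data buffer, so a compacted array needs 64 bytes of views and 53 bytes of data.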

2010YOUY01

Thanks, it's a good idea to include compaction inside SpillManager.

One follow-up is to refactor the external sort. It already does some compaction outside the SpillManager, so with this PR it would compact redundantly. I believe we should simply remove that:

datafusion/datafusion/physical-plan/src/sorts/sort.rs

Line 483 in bb9a4a7

fn organize_stringview_arrays(

2010YOUY01 · 2025-12-23T03:05:24Z

datafusion/physical-plan/src/spill/mod.rs

+    }
+
+    if any_gc_performed {
+        Ok(RecordBatch::try_new(batch.schema(), new_columns)?)


nit: I think we can get rid of the any_gc_performed condition and always take this branch, to make it a little simpler.

2010YOUY01 · 2025-12-23T03:07:17Z

datafusion/physical-plan/src/spill/mod.rs

+            "https://produkty%2Fpulove.ru/album/login",
+        ];
+
+        let mut urls = Vec::with_capacity(200_000);


Yes, it would be great to make tests faster and use less memory.

2010YOUY01 · 2025-12-23T03:09:05Z

datafusion/physical-plan/src/spill/in_progress_spill_file.rs

    }

    /// Appends a `RecordBatch` to the spill file, initializing the writer if necessary.
+    /// Performs garbage collection on StringView/BinaryView arrays to reduce spill file size.


I recommend adding more comments to explain the rationale for the view GC, perhaps by just copying them from

datafusion/datafusion/physical-plan/src/sorts/sort.rs

Line 483 in bb9a4a7

fn organize_stringview_arrays(


github-actions bot added the physical-plan Changes to the physical-plan crate label Dec 21, 2025

EeshanBembi force-pushed the fix-stringview-spill-gc branch from d3e3383 to cc6c180 Compare December 22, 2025 09:55

EeshanBembi force-pushed the fix-stringview-spill-gc branch from cc6c180 to 7dfb1e2 Compare December 22, 2025 10:00

EeshanBembi mentioned this pull request Dec 22, 2025

[Bug] BinaryView/StringView columns are spilled without GC and results in enormous spill files #19414

Open

bharath-techie reviewed Dec 22, 2025


pepijnve reviewed Dec 22, 2025


2010YOUY01 reviewed Dec 23, 2025


EeshanBembi added 2 commits December 29, 2025 17:56

Address PR review feedback for StringView/BinaryView GC

1fe162c

- Replace row count heuristic with 10KB memory threshold
- Improve documentation and add inline comments
- Remove redundant test_exact_clickbench_issue_19414
- Maintains 96% reduction in spill file sizes

Apply cargo fmt

9210e34

EeshanBembi requested review from bharath-techie and pepijnve December 31, 2025 10:38

EeshanBembi marked this pull request as ready for review December 31, 2025 10:38

EeshanBembi requested a review from 2010YOUY01 January 2, 2026 13:19
