feat(NET-92) Handle missing parquet columns as nulls #59
define-null wants to merge 5 commits into master
Conversation
- **`TableReader` trait** (`reader.rs`): added a `default_null_columns: Option<&HashSet<Name>>` parameter to `read()`.
- **`Scan`** (`scan.rs`): added a `default_null_columns` field and a `with_default_null_columns()` builder method; passes it through to `reader.read()`.
- **`ParquetFile::read()`** (`parquet/file.rs`): in Stage 3, columns in `default_null_columns` that are missing from the parquet schema are skipped instead of erroring. After reading (Stage 4), `NullArray` columns are injected for them into every record batch. This handles both projection and predicate columns: predicates will see `NullArray`s and naturally evaluate to false/null for comparisons.
- **`SnapshotTableReader::read()`** (`storage/reader.rs`): accepts the new parameter (unused for now, since storage tables are expected to always have all columns).
- **`execute_output`** (`plan.rs`): simplified to use `scan.with_default_null_columns()` instead of manual missing-column detection and null-array injection.
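The skip-then-inject behavior described for `ParquetFile::read()` can be sketched in plain Rust. This is a minimal illustration, not the PR's code: the toy `Column` type and `project_with_defaults` function are hypothetical stand-ins for arrow's `RecordBatch`/`NullArray`, with values simplified to `Option<i64>`.

```rust
use std::collections::HashSet;

// Toy stand-in for an arrow column; the real code works on RecordBatch/NullArray.
#[derive(Debug, PartialEq)]
struct Column {
    name: String,
    values: Vec<Option<i64>>,
}

// Sketch of Stages 3/4 from `ParquetFile::read()`: projected columns listed in
// `default_null_columns` that are absent from the file schema are skipped during
// projection (Stage 3) and injected as all-null columns afterwards (Stage 4).
fn project_with_defaults(
    file_columns: &[Column],
    projection: &[&str],
    default_null_columns: Option<&HashSet<String>>,
) -> Result<Vec<Column>, String> {
    let num_rows = file_columns.first().map_or(0, |c| c.values.len());
    let mut out = Vec::new();
    for &name in projection {
        if let Some(col) = file_columns.iter().find(|c| c.name == name) {
            out.push(Column { name: col.name.clone(), values: col.values.clone() });
        } else if default_null_columns.is_some_and(|s| s.contains(name)) {
            // Missing but declared default-null: inject a NullArray-equivalent.
            out.push(Column { name: name.to_string(), values: vec![None; num_rows] });
        } else {
            // Without the default-null list, a missing column is still a hard error.
            return Err(format!("column `{name}` not found in parquet schema"));
        }
    }
    Ok(out)
}
```

Predicates evaluated over the injected all-null column then compare against nulls and yield false/null, which is why both projection and predicate columns can be handled the same way.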
…upport default-null columns automatically across all phases
kalabukdima approved these changes on Mar 11, 2026
Contributes: https://linear.app/sqd-ai/issue/NET-92/correctly-propagate-errors-from-the-query-engine#comment-fb1dc348
What is this PR about?
Graceful handling of missing columns in parquet files. When a query requests a field that doesn't exist in the underlying parquet data, the system now returns null values instead of failing.
How does it work?
- The `Scan` builder accepts a list of default-null columns via `with_default_null_columns()`. When a projected column is missing from the parquet file, it is injected as a `NullArray` into the resulting `RecordBatch`.
- `Table` gains a `set_nullable()` method for declaring which columns may be absent. Currently all field columns are marked nullable via `columns()` generated by the `item_field_selection!` macro.
- The `ChunkWithDefaults` wrapper implements `Chunk` and transparently attaches default-null column info to every `scan_table()` call, covering both direct scans and relation lookups.
- `authorization_list` is excluded for now from the nullable column list.

Limitations
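How the pieces above fit together can be sketched with simplified shapes. These structs are hypothetical reductions of the PR's `Scan`, `Table`, and `ChunkWithDefaults` (the real ones carry projections, predicates, and the `Chunk` trait); only the wiring of the nullable-column set is shown.

```rust
use std::collections::HashSet;

// Simplified stand-in for the `Scan` builder from scan.rs.
#[derive(Default)]
struct Scan {
    default_null_columns: Option<HashSet<String>>,
}

impl Scan {
    fn with_default_null_columns(mut self, cols: HashSet<String>) -> Self {
        self.default_null_columns = Some(cols);
        self
    }
}

// Simplified stand-in for `Table`: `set_nullable()` declares which columns
// may be absent from the underlying parquet files.
struct Table {
    nullable: HashSet<String>,
}

impl Table {
    fn set_nullable(&mut self, cols: impl IntoIterator<Item = String>) {
        self.nullable.extend(cols);
    }
}

// Simplified stand-in for `ChunkWithDefaults`: it transparently attaches the
// table's nullable set to every scan it builds, so direct scans and relation
// lookups get the same default-null behavior.
struct ChunkWithDefaults {
    table: Table,
}

impl ChunkWithDefaults {
    fn scan_table(&self) -> Scan {
        Scan::default().with_default_null_columns(self.table.nullable.clone())
    }
}
```

The wrapper design keeps callers unchanged: anything that asks the chunk for a scan picks up the default-null columns without knowing they exist.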
There is no reliable source of schema information today. The schema may vary within a single dataset and is commonly different across datasets of the same kind. As a result, we cannot precisely declare which columns are truly nullable — instead, all field columns are currently marked as such. This is a temporary mitigation as discussed with @kalabukdima: queries will return nulls for missing columns rather than error, but the proper fix requires a well-defined schema source.