-
Notifications
You must be signed in to change notification settings - Fork 4k
Open
Description
Describe the enhancement requested
Arrow's ORC reader already supports column projection (reading only selected columns), but lacks row-level predicate pushdown. Currently, filtering rows from ORC files requires:
- Reading all rows from selected columns (all stripes)
- Applying filters post-read using Arrow compute
This is inefficient for large ORC files where only a small subset of rows match the filter criteria. ORC files store min/max statistics at the stripe level, which can be used to skip entire stripes that cannot contain matching rows, and avoids I/O for data that will be filtered out anyway.
Use Cases
- Efficiently query large ORC datasets with selective predicates
- Enable predicate pushdown for Iceberg tables stored in ORC format
- Match the filtering capabilities already available for Parquet files
Component(s)
Python, C++
Reactions are currently unavailable