ORC Predicate Pushdown #48986

@cbb330

Describe the enhancement requested

Arrow's ORC reader already supports column projection (reading only selected columns), but lacks row-level predicate pushdown. Currently, filtering rows from ORC files requires:

  1. Reading all rows from the selected columns, across all stripes
  2. Applying filters post-read using Arrow compute

This is inefficient for large ORC files where only a small fraction of rows match the filter. ORC files store min/max statistics at the stripe level, which can be used to skip entire stripes that cannot contain matching rows, avoiding I/O for data that would be filtered out anyway.
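
To illustrate the stripe-skipping idea, here is a minimal sketch in plain Python. It is not Arrow or ORC API: `StripeStats` and `stripes_to_read` are hypothetical names, the statistics are hard-coded rather than read from a file, and the column is assumed to be a non-null integer column.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StripeStats:
    """Hypothetical holder for one column's min/max within a single stripe."""
    index: int
    min_value: int
    max_value: int


def stripes_to_read(stats: List[StripeStats], op: str, literal: int) -> List[int]:
    """Return indices of stripes whose [min, max] range may satisfy `column <op> literal`."""
    keep = []
    for s in stats:
        if op == "==":
            may_match = s.min_value <= literal <= s.max_value
        elif op == ">":
            may_match = s.max_value > literal
        elif op == ">=":
            may_match = s.max_value >= literal
        elif op == "<":
            may_match = s.min_value < literal
        elif op == "<=":
            may_match = s.min_value <= literal
        else:
            # Unsupported operator: stay conservative and read the stripe.
            may_match = True
        if may_match:
            keep.append(s.index)
    return keep


# Made-up statistics for three stripes of an integer column.
stats = [
    StripeStats(index=0, min_value=1, max_value=100),
    StripeStats(index=1, min_value=101, max_value=200),
    StripeStats(index=2, min_value=201, max_value=300),
]

# Predicate: column > 250. Only the last stripe can contain matching rows.
selected = stripes_to_read(stats, ">", 250)
print(selected)  # [2]
```

Stripes that survive this check can still contain non-matching rows, so the row-level filter still has to be applied after reading. The statistics only decide which stripes can be skipped entirely.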

Use Cases

  1. Efficiently query large ORC datasets with selective predicates
  2. Enable predicate pushdown for Iceberg tables stored in ORC format
  3. Match the filtering capabilities already available for Parquet files

Component(s)

Python, C++
