Task #3: Implement GetOrcColumnIndex function #9

Merged
cbb330 merged 1 commit into main from task-3-get-orc-column-index
Feb 20, 2026

@cbb330 cbb330 commented Feb 20, 2026

Summary

Implements a column index resolution function that maps a FieldRef to its ORC column index using the schema manifest.

Changes

  • Implemented GetOrcColumnIndex helper function:

    • Takes a compute::FieldRef and OrcSchemaManifest
    • Resolves FieldRef to field path using FieldRef.FindOne()
    • Traverses manifest tree following field path indices
    • Supports both top-level and nested field resolution
    • Returns std::optional<int> with column index or nullopt
  • Added includes:

    • <optional> for std::optional return type
    • arrow/compute/api_scalar.h for FieldRef/FieldPath

Implementation Details

Resolution Process:

  1. Use FieldRef.FindOne() to resolve field in schema → FieldPath
  2. Extract field path indices (e.g., [0, 2] for nested access)
  3. Traverse manifest:
    • First index → top-level field in manifest.schema_fields
    • Subsequent indices → nested children
  4. Validate bounds at each level
  5. Check if final field is a leaf node (is_leaf())
  6. Return column_index if leaf, nullopt otherwise
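
The traversal in steps 3 to 6 can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code merged in this PR: `OrcSchemaField` and `OrcSchemaManifest` are simplified stand-ins for the real manifest types, `ColumnIndexForPath` is a hypothetical name, `is_leaf()` is assumed to mean "has no children", and the field path is passed as a plain `std::vector<int>` in place of the `FieldPath` that `FieldRef.FindOne()` would produce in step 1.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-ins for the PR's manifest types; the real definitions
// live in the Arrow ORC code and may differ.
struct OrcSchemaField {
  int column_index = -1;                 // ORC column id of this field
  std::vector<OrcSchemaField> children;  // empty for primitive (leaf) fields
  // Assumption: a field is a leaf exactly when it has no children.
  bool is_leaf() const { return children.empty(); }
};

struct OrcSchemaManifest {
  std::vector<OrcSchemaField> schema_fields;  // one entry per top-level field
};

// Walks the manifest along an already-resolved field path (steps 3 to 6).
// `path` plays the role of the FieldPath indices from step 1.
std::optional<int> ColumnIndexForPath(const OrcSchemaManifest& manifest,
                                      const std::vector<int>& path) {
  if (path.empty()) return std::nullopt;

  // The first index selects from schema_fields, later ones from children.
  const std::vector<OrcSchemaField>* level = &manifest.schema_fields;
  const OrcSchemaField* current = nullptr;
  for (int index : path) {
    // Validate bounds at every level before indexing.
    if (index < 0 || static_cast<std::size_t>(index) >= level->size()) {
      return std::nullopt;
    }
    current = &(*level)[index];
    level = &current->children;
  }

  // Containers have no direct statistics, so only leaves yield an index.
  if (!current->is_leaf()) return std::nullopt;
  return current->column_index;
}
```

Returning `std::nullopt` for every failure mode keeps the caller's handling uniform: a missing field, an out-of-range index, and a container field all mean "no column statistics available for this reference".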

Return Values:

  • std::optional<int> containing column index for leaf fields
  • std::nullopt if:
    • Field not found in schema
    • Index out of bounds
    • Field is a container (struct/list/map) without direct statistics

Examples:

  • Top-level field "age" → column index (e.g., 2)
  • Nested field "address.city" → column index (e.g., 5)
  • Container field "address" → nullopt (no direct statistics)
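
To illustrate how a caller might consume the `std::optional<int>` result, here is a small sketch. It is an assumption about downstream use, not code from this PR: `StatsLookup`, `PruningDecision`, and `DecideForField` are hypothetical names, and the only behavior taken from the description above is that a leaf yields a column index while a container or unresolved field yields `nullopt`.

```cpp
#include <cassert>
#include <optional>

// Hypothetical outcome of trying to use statistics for one referenced field.
enum class StatsLookup { kUseColumnStatistics, kSkipPruning };

struct PruningDecision {
  StatsLookup action;
  int column_index;  // only meaningful when action == kUseColumnStatistics
};

// Maps the result of column index resolution to a pruning decision.
// nullopt (field not found, index out of bounds, or container field) means
// no statistics can be consulted, so the field is not used for pruning.
PruningDecision DecideForField(std::optional<int> orc_column_index) {
  if (!orc_column_index.has_value()) {
    return {StatsLookup::kSkipPruning, -1};
  }
  return {StatsLookup::kUseColumnStatistics, *orc_column_index};
}
```

With the examples above, "age" (index 2) and "address.city" (index 5) would select column statistics, while "address" would fall through to the skip branch.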

Testing

  • Manual code review completed
  • Follows FieldRef resolution patterns from compute module
  • Build verification pending

Task Reference

Completes Task #3 from task_list.json - Core Data Structures phase
Depends on Task #2 (complete)
Enables predicate evaluation tasks

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Implemented GetOrcColumnIndex helper function that:
  - Resolves FieldRef to ORC column index using manifest
  - Uses FieldRef.FindOne() to locate field in schema
  - Traverses manifest tree following field path indices
  - Handles both top-level and nested fields
  - Returns column_index for leaf nodes (primitives with statistics)
  - Returns std::nullopt for containers or not found

- Added necessary includes:
  - <optional> for std::optional return type
  - arrow/compute/api_scalar.h for FieldRef and FieldPath

Implementation details:
- Top-level fields accessed via manifest.schema_fields[index]
- Nested fields traversed via current_field->children[index]
- Validates indices at each level to prevent out-of-bounds
- Only returns column_index if field is leaf (has statistics)
- Containers (struct/list/map) return nullopt

Verified: Manual code review - follows FieldRef resolution pattern

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@cbb330 cbb330 merged commit aad9ee1 into main Feb 20, 2026
45 of 74 checks passed
@cbb330 cbb330 deleted the task-3-get-orc-column-index branch February 20, 2026 22:22
@cbb330 cbb330 mentioned this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
- Added PredicateField struct to hold resolved field information
- Implemented ResolvePredicateFields() helper function
- Resolves field references in predicates to ORC column indices
- Uses OrcSchemaManifest for Arrow-to-ORC column mapping
- Traverses nested field paths (structs only)
- Filters to leaf nodes only (containers don't have statistics)
- Type support check (currently int32/int64 only)
- Returns vector of PredicateField entities

Implementation details:
- Uses compute::FieldsInExpression() to extract field refs
- Uses FieldRef.FindOneOrNone() for schema matching
- Traverses OrcSchemaField tree for nested paths
- Validates field indices and struct types
- PredicateField includes: field_ref, arrow_field_index, orc_column_index, data_type, supports_statistics

Verified: Manual code review following Parquet TestRowGroups pattern (lines 945-960)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
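
For orientation, the `PredicateField` record described in this commit message might look roughly like the sketch below. Only the member names and the int32/int64 restriction come from the commit message; the member types, the `TypeId` enum (a stand-in for Arrow's type ids), and the `MakePredicateField` helper are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for Arrow's type ids; only a few are listed.
enum class TypeId { kInt32, kInt64, kString, kStruct };

// Member names follow the commit message; the member types are guesses.
struct PredicateField {
  std::string field_ref;     // stand-in for the FieldRef
  int arrow_field_index;     // position of the field in the Arrow schema
  int orc_column_index;      // resolved ORC column index
  TypeId data_type;          // type of the resolved field
  bool supports_statistics;  // whether statistics pushdown is supported
};

// Type support check: the commit message states int32/int64 only.
bool SupportsStatistics(TypeId type) {
  return type == TypeId::kInt32 || type == TypeId::kInt64;
}

// Hypothetical helper that fills in supports_statistics from the type.
PredicateField MakePredicateField(std::string name, int arrow_index,
                                  int orc_column, TypeId type) {
  return PredicateField{std::move(name), arrow_index, orc_column, type,
                        SupportsStatistics(type)};
}
```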
cbb330 added a commit that referenced this pull request Feb 20, 2026
@cbb330 cbb330 mentioned this pull request Feb 20, 2026
cbb330 added a commit that referenced this pull request Feb 20, 2026
Adds comprehensive task tracking and progress documentation for the
ongoing ORC predicate pushdown implementation project.

## Changes
- task_list.json: Complete 35-task breakdown with dependencies
  - Tasks #0, #0.5, #1, #2 marked as complete (on feature branches)
  - Tasks #3-#35 pending implementation
  - Organized by phase: Prerequisites, Core, Metadata, Predicate, Scan, Testing, Future
- claude-progress.txt: Comprehensive project status document
  - Codebase structure and build instructions
  - Work completed on feature branches (not yet merged)
  - Current main branch state
  - Next steps and implementation strategy
  - Parquet mirroring patterns and Allium spec alignment

## Context
This is an initialization session to establish baseline tracking for the
ORC predicate pushdown project. Previous sessions (1-4) completed initial
tasks on feature branches. This consolidates that progress and provides
a clear roadmap for future implementation sessions.

## Related Work
- Allium spec: orc-predicate-pushdown.allium (already on main)
- Feature branches: task-0-statistics-api-v2, task-0.5-stripe-selective-reading,
  task-1-orc-schema-manifest, task-2-build-orc-schema-manifest (not yet merged)

## Next Steps
Future sessions will implement tasks #3+ via individual feature branch PRs.
cbb330 added a commit that referenced this pull request Feb 20, 2026
- Added SupportsStatistics helper to check if type supports statistics pushdown
- Created PredicateField struct to hold field resolution information
- Implemented ResolvePredicateFields to extract and resolve field references from predicates
- Currently supports int32 and int64 types
- Skips non-leaf fields and unsupported types
- Handles nested field resolution correctly

Verified:
- Resolves field references using FieldsInExpression
- Uses GetOrcColumnIndex for ORC column mapping
- Handles nested structs by traversing match indices
- Returns comprehensive field information for statistics evaluation
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}
