Skip to content

Conversation

@feichai0017
Copy link

Which issue does this PR close?

  • Closes #N/A (optimizer maintenance cleanup; no existing issue).

Rationale for this change

  • Streamline RequiredIndices::add_expr to collect column indices in a single traversal, avoiding temporary HashSet builds and duplicate walks. This reduces small per-query overhead in the projection optimizer and removes unused helpers.

What changes are included in this PR?

  • Refactor add_expr to push column indices during TreeNode::apply traversal, including outer-ref subquery expressions.
  • Add local helpers collect_outer_ref_exprs / push_column_index; remove unused outer_columns and redundant imports.
  • No behavioral change; maintenance/perf-only.

Are these changes tested?

  • Ran full cargo test (pass).

Are there any user-facing changes?

  • No user-facing changes.

Copilot AI review requested due to automatic review settings December 22, 2025 04:21
@github-actions github-actions bot added the optimizer Optimizer rules label Dec 22, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes the RequiredIndices column collection in the projection optimizer by refactoring from a two-pass approach (using column_refs() + outer_columns()) to a single tree traversal. The goal is to reduce per-query overhead by eliminating temporary HashSet allocations and redundant expression walks.

Key changes:

  • Refactored add_expr() to use a single TreeNode::apply() traversal instead of calling column_refs() followed by outer_columns()
  • Added helper functions collect_outer_ref_exprs() and push_column_index() to support the new traversal approach
  • Removed unused outer_columns() and outer_columns_helper_multi() functions from mod.rs along with the unused HashSet import

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
datafusion/optimizer/src/optimize_projections/required_indices.rs Refactored add_expr() to use single-pass traversal with new helper functions; updated imports to include TreeNode
datafusion/optimizer/src/optimize_projections/mod.rs Removed now-unused outer_columns() and outer_columns_helper_multi() functions; removed unused HashSet import

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@xudong963
Copy link
Member

@feichai0017 Do you have a micro benchmark for the change?

@feichai0017
Copy link
Author

Not yet, let me run micro benchmark to compare it with old implementation @xudong963

@feichai0017
Copy link
Author

Ran the new required_indices micro-bench on an M3 Pro MacBook Pro (36 GB RAM). Results from the latest run:

  • new_wide (200-column projection): ~30 µs (median ~31.3 µs)
  • old_wide: ~42–44 µs (median ~41.2 µs)
  • new_outer_ref: ~26 ns
  • old_outer_ref: ~12–13 ns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants