Feature Request: Parallel Analysis with Ray and Incremental Caching #16

@rahlk

Description

Is your feature request related to a problem? Please describe.
The current implementation processes files sequentially, leading to slow analysis times for large Python codebases. Unchanged files are re-analyzed on every run, and even a minor code change triggers a full re-analysis of the project. Together, these issues make analysis unnecessarily slow for projects with hundreds or thousands of files.

Describe the solution you'd like
Introduce Ray-based parallelization to leverage multiple CPU cores and, potentially, multiple machines. Add CLI options for controlling the degree of parallelism (a rough sketch of the parallel path follows the list):

  • --use-ray: Enable Ray-based parallel analysis.
  • --nproc=%: Specify the percentage of available CPU cores to use.
  • --nproc=all: Use all available CPU cores.
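A minimal sketch of what the Ray-based path could look like, assuming a per-file analysis entry point (the `analyze_file` task below is a stand-in that just parses the file; the real analyzer would replace its body) and a helper that translates `--nproc` into a core count:

```python
import ast
import os
import ray


def resolve_nproc(value: str) -> int:
    """Translate --nproc into an absolute core count ('all' or a percentage)."""
    total = os.cpu_count() or 1
    if value == "all":
        return total
    return max(1, total * int(value.rstrip("%")) // 100)


@ray.remote
def analyze_file(path: str) -> dict:
    """Stand-in per-file analysis; the existing sequential analyzer would run here."""
    with open(path, encoding="utf-8") as fh:
        tree = ast.parse(fh.read())
    return {
        "file": path,
        "functions": [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)],
    }


def analyze_project(paths: list[str], nproc: str = "all") -> list[dict]:
    """Fan file-level analysis out as Ray tasks and collect the results."""
    ray.init(num_cpus=resolve_nproc(nproc), ignore_reinit_error=True)
    try:
        return ray.get([analyze_file.remote(p) for p in paths])
    finally:
        ray.shutdown()
```

Keeping the unit of work at one file per task means the sequential code path can stay untouched when --use-ray is not passed, which helps preserve backward compatibility.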

For incremental analysis, add a new argument (a caching sketch follows the list):

  • --file-name=<file>: Analyze only the specified file, leveraging cached results for unchanged files.
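A rough sketch of the incremental path, assuming a JSON cache keyed by a SHA-256 of the file contents (the `CACHE_DIR` name and layout here are illustrative, not the project's actual cache structure):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".analysis_cache")  # illustrative; would reuse the existing cache directory


def file_digest(path: Path) -> str:
    """Content-based key: unchanged bytes -> unchanged digest -> cache hit."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def analyze_one(path: Path, analyze) -> dict:
    """Analyze a single file (--file-name), reusing the cached result when the
    content hash is unchanged."""
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / f"{file_digest(path)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = analyze(path)  # the existing per-file analysis, passed in as a callable
    cache_file.write_text(json.dumps(result))
    return result
```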

Describe alternatives you've considered

  • Threading/Multiprocessing: Less scalable than Ray and does not support distributed computing.
  • File-level caching only: Would still re-analyze entire files when only one function changes.
  • Simple timestamp-based caching: Less reliable than content-based hashing for detecting changes.

Additional context
Expected performance improvements:

  • Significant speedup with Ray parallelization on multi-core systems.
  • Faster subsequent runs with incremental caching.
  • Efficient handling of minor code changes with SHA-based updates.

Implementation should maintain backward compatibility and integrate with the existing cache directory structure.
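To avoid re-analyzing a whole file when only one function changes (the weakness of file-level-only caching noted above), the SHA-based updates could hash each function body separately, so only changed functions are re-processed. A sketch, not the actual cache schema (keyed by function name for brevity; a real implementation would use qualified names):

```python
import ast
import hashlib


def function_hashes(source: str) -> dict[str, str]:
    """Map each function name to a SHA-256 of its source, so a one-function
    edit invalidates only that function's cached entry."""
    tree = ast.parse(source)
    hashes = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            segment = ast.get_source_segment(source, node) or ""
            hashes[node.name] = hashlib.sha256(segment.encode("utf-8")).hexdigest()
    return hashes


def changed_functions(old: dict[str, str], new: dict[str, str]) -> set[str]:
    """Names whose hash differs from the previous run (new or edited functions)."""
    return {name for name, digest in new.items() if old.get(name) != digest}
```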

Metadata

Labels: enhancement (New feature or request)