Is your feature request related to a problem? Please describe.
The current implementation processes files sequentially, leading to slow analysis times for large Python codebases. Unchanged files are re-analyzed on every run, and minor code changes trigger full project re-analysis. These issues result in inefficiencies, especially for projects with hundreds or thousands of files.
Describe the solution you'd like
Introduce Ray-based parallelization to leverage multiple CPU cores and potentially multiple machines. Add CLI options for controlling the number of processes:
- --use-ray: Enable Ray-based parallel analysis.
- --nproc=%: Specify the percentage of available CPU cores to use.
- --nproc=all: Use all available CPU cores.
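The proposed flags could be wired up roughly as follows. This is a sketch only: the flag names come from this request, while the argparse wiring and the resolve_nproc helper are illustrative assumptions about how a percentage or "all" would be translated into a worker count.

```python
# Illustrative CLI wiring for the proposed flags (not the tool's real parser).
import argparse
import os

def resolve_nproc(value):
    """Translate --nproc=all or a percentage string into a worker count."""
    total = os.cpu_count() or 1
    if value == "all":
        return total
    # e.g. --nproc=50 would mean "use 50% of the available cores"
    return max(1, total * int(value) // 100)

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--use-ray", action="store_true",
                        help="Enable Ray-based parallel analysis")
    parser.add_argument("--nproc", default="all",
                        help="Percentage of CPU cores to use, or 'all'")
    parser.add_argument("--file-name",
                        help="Analyze only the specified file")
    return parser
```

With this layout, `build_parser().parse_args(["--use-ray", "--nproc", "50"])` enables Ray and requests half of the available cores.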
For incremental analysis, add a new argument:
- --file-name=<file>: Analyze only the specified file, leveraging cached results for unchanged files.
Describe alternatives you've considered
- Threading/Multiprocessing: Less scalable than Ray and does not support distributed computing.
- File-level caching only: Would still re-analyze entire files when only one function changes.
- Simple timestamp-based caching: Less reliable than content-based hashing for detecting changes.
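The last point can be demonstrated directly: rewriting or touching a file updates its mtime even when the content, and therefore its hash, is unchanged, so a timestamp-keyed cache would needlessly re-analyze it. A small self-contained demonstration (hypothetical helper names):

```python
# Why content hashing beats timestamps: a simulated `touch` changes the
# mtime but leaves the SHA-256 digest identical.
import hashlib
import os
import tempfile

def sha256_of(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def demo_touch():
    """Return (hash_unchanged, mtime_changed) after touching a file."""
    fd, path = tempfile.mkstemp(suffix=".py")
    with os.fdopen(fd, "w") as f:
        f.write("x = 1\n")
    old_hash = sha256_of(path)
    old_mtime = os.path.getmtime(path)
    os.utime(path, (old_mtime + 60, old_mtime + 60))  # simulate `touch`
    hash_unchanged = sha256_of(path) == old_hash
    mtime_changed = os.path.getmtime(path) != old_mtime
    os.remove(path)
    return hash_unchanged, mtime_changed
```

Here demo_touch() returns (True, True): the hash correctly reports "no change" while the timestamp reports a spurious one.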
Additional context
Expected performance improvements:
- Significant speedup with Ray parallelization on multi-core systems.
- Faster subsequent runs with incremental caching.
- Efficient handling of minor code changes with SHA-based updates.
Implementation should maintain backward compatibility and integrate with the existing cache directory structure.
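One way to keep that backward compatibility is to make the Ray path optional, falling back to the existing serial loop when Ray is not installed or not requested. This is a sketch under that assumption; analyze_file is a stand-in for the real per-file analysis, not the project's actual function.

```python
# Parallel analysis via Ray when available, with a serial fallback so
# behaviour is unchanged on installations without Ray.
try:
    import ray
except ImportError:
    ray = None

def analyze_file(path):
    # placeholder: the real implementation would parse and analyze the file
    return {"path": path, "ok": True}

def analyze_project(paths, use_ray=False, nproc=None):
    if use_ray and ray is not None:
        ray.init(num_cpus=nproc, ignore_reinit_error=True)
        remote_analyze = ray.remote(analyze_file)
        # one Ray task per file; ray.get blocks until all results arrive
        return ray.get([remote_analyze.remote(p) for p in paths])
    # default path: the current sequential behaviour
    return [analyze_file(p) for p in paths]
```

Since `analyze_project(paths)` without flags follows the serial branch, existing workflows are unaffected unless --use-ray is passed.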