Upstream/rag api merge v0.7.3 #19

Open
paychex-joser wants to merge 22 commits into main from
upstream/rag_api_merge_v0.7.3
Conversation

@paychex-joser

No description provided.

MarcAmick and others added 21 commits December 10, 2025 13:49
…a#214)

* Process documents using async pipeline with concurrent embedding generation and DB insertion.
This provides better memory efficiency by immediately inserting embeddings as they're generated.

* - made max size of embedding queue configurable with EMBEDDING_MAX_QUEUE_SIZE env var

* implemented fixes for copilot code review

* implemented fixes for copilot code review
- white space changes and remove unneeded import

* implemented fixes for copilot code review
- added documentation in function header for executor argument

* feat: improve streaming document embeddings with bug fixes and tests

This PR implements several critical fixes and improvements for the batch
embedding feature that streams document embeddings to reduce memory consumption.

## Critical Fixes
- Fix blocking async in _process_documents_batched_sync by wrapping sync
  calls in run_in_executor()
- Unify rollback behavior: add `if all_ids:` guard to async path to only
  rollback when documents were actually inserted
- Fix producer error handling: move `put(None)` to finally block to always
  signal completion
- Add task_done() call for None signal in consumer
- Fix consumer error handling: put exception object in results_queue instead
  of empty list, enabling proper error propagation
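The signalling fixes above (sentinel in `finally`, `task_done()` for the sentinel, exceptions pushed into the results list) can be illustrated with a minimal asyncio sketch; the embed-and-insert step is a stand-in, and only the producer/consumer pattern mirrors the commit:

```python
import asyncio

async def batch_producer(batches, queue):
    # Always signal completion, even if batch generation fails,
    # so the consumer never waits forever on the queue.
    try:
        for batch in batches:
            await queue.put(batch)
    finally:
        await queue.put(None)

async def embedding_consumer(queue, results):
    while True:
        batch = await queue.get()
        if batch is None:
            queue.task_done()  # account for the sentinel too
            break
        try:
            # stand-in for "generate embeddings and insert into the DB"
            results.append([x * 10 for x in batch])
        except Exception as exc:
            results.append(exc)  # propagate the error, not an empty list
        finally:
            queue.task_done()

async def run_pipeline(batches):
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # bounded, like EMBEDDING_MAX_QUEUE_SIZE
    results: list = []
    await asyncio.gather(
        batch_producer(batches, queue),
        embedding_consumer(queue, results),
    )
    return results
```

Because the sentinel is emitted in `finally`, a producer crash still unblocks the consumer, and a consumer error surfaces to the caller instead of being swallowed.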

## Code Quality Improvements
- Remove dead `# temp_vector_store` comment
- Remove ineffective `del` statements (don't help with memory in Python)
- Fix all f-string logging to use %-style formatting
- Update docstrings to accurately describe function behavior
- Add type hints (AsyncPgVector, PgVector, ThreadPoolExecutor)
- Rename functions for clarity: embedding_producer → batch_producer,
  database_consumer → embedding_consumer
- Add early return for empty document lists

## Tests (31 new tests)
- tests/test_batch_processing.py: 23 unit tests covering async pipeline,
  sync batched processing, edge cases, and producer-consumer pattern
- tests/test_batch_processing_integration.py: 8 integration tests including
  memory optimization verification with tracemalloc

## Documentation
- Add EMBEDDING_BATCH_SIZE and EMBEDDING_MAX_QUEUE_SIZE to README env vars
- Add "Embedding Batch Processing" section with configuration guide
- Add "Running Tests" section with commands and test categories
- Improve inline comments in config.py explaining batch processing trade-offs
- Add pytest.ini for test configuration

Closes danny-avila#213

* feat: add utility function for batch calculation and update tests

This commit introduces a new utility function `calculate_num_batches` to streamline batch size calculations across the document processing routes. The function handles various edge cases, including zero items and batch sizes of zero. Additionally, existing tests have been updated to utilize this new function, ensuring accurate batch count calculations and improving test coverage for edge cases.
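A plausible shape for such a utility, assuming the edge-case semantics described above (zero items gives zero batches; a batch size of zero means batching is disabled, so everything goes in one batch):

```python
def calculate_num_batches(num_items: int, batch_size: int) -> int:
    # Zero items always means zero batches.
    if num_items <= 0:
        return 0
    # batch_size <= 0 means batching is disabled: one batch for everything.
    if batch_size <= 0:
        return 1
    # Ceiling division without importing math.
    return -(-num_items // batch_size)
```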

* refactor: update vector store mocking in tests to use AsyncPgVector

This commit modifies the test setup in `test_main.py` to replace the mocking of vector store methods with the `AsyncPgVector` class. The changes include updating the signatures of dummy functions to accept `self` and ensuring that all relevant methods are patched correctly. This enhances the accuracy of the tests by aligning them with the asynchronous implementation of the vector store.

* test: clear cache for embedding function in vector store tests

This commit updates the test setup in `test_main.py` to clear the cache of the `get_cached_query_embedding` function before running tests. This ensures that the mock embedding function is used consistently, improving the reliability of the tests. Additionally, the comment for overriding the embedding function has been clarified to indicate that it uses a dummy implementation that does not call OpenAI.

* test: enhance vector store tests with dummy embedding function

This commit updates the test setup in `test_main.py` to clear the cache of the `get_cached_query_embedding` function and replaces it with a dummy implementation that returns predefined embeddings. This change improves the reliability of the tests by ensuring consistent behavior during test execution.

---------

Co-authored-by: Marc Amick <MarcAmick@jhu.edu>
Co-authored-by: Danny Avila <danny@librechat.ai>
* 📚 docs: Update README with clean install instructions and enhance requirements files

- Added a section in README for clean installation of dependencies, including steps for local development, lite version, and Docker.
- Updated `requirements.txt` and `requirements.lite.txt` to bump `langchain-core` to v0.3.80 and `pypdf` to v6.4.0, and incremented `langchain_text_splitters` to v0.3.9.

* 🔧 chore: Add `tenacity` dependency to requirements files

- Updated `requirements.txt` and `requirements.lite.txt` to include `tenacity>=9.0.0` for improved retry handling in asynchronous operations.
…la#221)

* 🔧 refactor: Document Processing and Health Check Functions

* Refactored `_prepare_documents_sync` to handle document preparation synchronously, improving performance and clarity.
* Updated `store_data_in_vector_db` to utilize the new synchronous document preparation function, ensuring non-blocking behavior in the async context.
* Changed `is_health_ok` to be asynchronous, allowing for proper handling of health checks for different database types.

These changes streamline document processing and improve the responsiveness of health checks in the application.

* 🔧 refactor: Update Event Loop Retrieval in Document Processing

* Replaced `asyncio.get_event_loop()` with `asyncio.get_running_loop()` in `_process_documents_batched_sync` and `store_data_in_vector_db` to ensure compatibility with the current asyncio context.
* This change enhances the reliability of asynchronous operations by using the correct method for obtaining the running event loop.
* Added a step to free up disk space in the GitHub Actions workflow, utilizing the `jlumbroso/free-disk-space` action to optimize resource usage during builds.
* Updated caching strategy for Docker images to improve build performance by specifying cache-from and cache-to options.

These changes aim to streamline the CI/CD process and ensure efficient resource management.
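The event-loop change can be sketched as follows (helper names are illustrative, not the repo's exact signatures): `get_running_loop()` fails loudly when called outside a coroutine, whereas `get_event_loop()` may silently create a second loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def _prepare_documents_sync(docs):
    # stand-in for the real synchronous document preparation
    return [d.upper() for d in docs]

async def store_data_in_vector_db(docs, executor: ThreadPoolExecutor):
    # Correct inside a coroutine: returns the loop actually running us.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, _prepare_documents_sync, docs)
```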
* Fix CVE-2025-68664

Fix "LangGrinch" vulnerability:
https://cyata.ai/blog/langgrinch-langchain-core-cve-2025-68664/

Signed-off-by: Jeremie Daniel

* Fix CVE-2025-68664

Fix "LangGrinch" critical vulnerability: 
https://cyata.ai/blog/langgrinch-langchain-core-cve-2025-68664/

Signed-off-by: Rahul Golam
…ts (danny-avila#227)

* fix(CVE-2025-68413): restrict local file embeds to upload directory

* fix(CVE_2025_68414): validate file path for embed and embed-upload endpoints

* fix(security): additional endpoint for same issue as CVE_2025_68414 per @valiumaggelein

* fix(CVE-2025-68414): enhance file path validation to prevent traversal attacks and add comprehensive tests

* fix: improve file path validation in embed and extract endpoints to prevent path traversal vulnerabilities; add tests for entity_id parameter handling

---------

Co-authored-by: Danny Avila <danny@librechat.ai>
* chore: Update `unstructured` dependency to version 0.18.32 in requirements files

* chore: Update dependencies in requirements files to latest versions

- Bumped `pypdf` from 6.4.0 to 6.7.3
- Updated `python-multipart` from 0.0.19 to 0.0.22
- Upgraded `cryptography` from 45.0.5 to 46.0.5
* Introduced a script to automate the creation of a Python virtual environment.
* The script checks for the existence of a specified requirements file, removes any existing virtual environment, and installs dependencies from the requirements file.
* Provides instructions for activating the virtual environment after setup.
…idation Crashes (danny-avila#256)

* Fix race condition in embed_file_upload temp file path

Concurrent uploads with the same filename would write to the same temp
path because embed_file_upload used the filename directly under
RAG_UPLOAD_DIR, unlike embed_file which namespaces under user_id. Two
users uploading "report.pdf" simultaneously would corrupt each other's
data. Add user_id subdirectory to match embed_file's behavior.

* Fix connection leak and event loop blocking in mongo_health_check

Every health check call created a MongoClient, opened a connection pool,
and never closed it. Since health checks run frequently, this leaked
connections continuously. Also, MongoClient() and admin.command() are
blocking I/O calls that were running directly in an async function,
blocking the event loop.

Fix both: wrap blocking calls in asyncio.to_thread() and always close
the client in a finally block.
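A minimal sketch of that fix, with a factory standing in for `MongoClient` so the blocking-I/O and always-close behavior is the only thing shown:

```python
import asyncio

def _blocking_ping(client_factory):
    # Client construction and the ping command are both blocking I/O;
    # they run in a worker thread, never on the event loop.
    client = client_factory()
    try:
        client.ping()
        return True
    finally:
        client.close()  # always release the connection pool

async def mongo_health_check(client_factory) -> bool:
    try:
        return await asyncio.to_thread(_blocking_ping, client_factory)
    except Exception:
        return False
```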

* Fix validation_exception_handler crash and body echo

request.body() re-reads the entire upload into memory inside the error
handler, and body.decode() raises UnicodeDecodeError on binary/non-ASCII
content, turning a 422 into a 500. The response also echoed potentially
large or sensitive upload content back to the caller. Remove the body
read entirely — the validation errors already contain all useful info.

* Remove dead exception branch in generate_digest

str.encode("utf-8") on a Python str never raises UnicodeEncodeError for
any valid Unicode string. The except branch handled surrogate characters
with an encode-decode-encode round-trip that produces the same result as
encode("utf-8", "ignore") directly. Replace with a single expression.
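The resulting single expression looks roughly like this (the hash algorithm is assumed for illustration; the point is the one-step `errors="ignore"` encode):

```python
import hashlib

def generate_digest(page_content: str) -> str:
    # "ignore" drops unencodable code points (e.g. lone surrogates)
    # in one step, replacing the old encode/decode/encode round-trip.
    return hashlib.md5(page_content.encode("utf-8", "ignore")).hexdigest()
```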

* Enable EMBEDDING_BATCH_SIZE by default (500)

The batch processing pipeline is fully built but disabled by default
(EMBEDDING_BATCH_SIZE=0). This means large documents embed all chunks in
one blocking call. Set default to 500 to enable streaming embeddings
into the DB while the rest are still being computed — the highest-leverage memory improvement with zero new code.

* Stream CSV encoding conversion instead of reading entire file

Non-UTF-8 CSV files were read entirely into a single Python string
before writing to the temp file. For large files (e.g. 100MB UTF-16),
this causes unnecessary peak memory allocation. Stream in 64KB chunks
instead.
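A sketch of the streaming conversion (function name hypothetical): Python's text layer buffers partial multi-byte sequences across chunk boundaries, so chunked reads are safe even for UTF-16 input.

```python
def convert_encoding_streaming(src_path, src_encoding, dst_path, chunk_chars=64 * 1024):
    # Re-encode to UTF-8 one chunk at a time instead of reading the
    # whole file into a single Python string.
    with open(src_path, "r", encoding=src_encoding) as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
```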

* Close asyncpg connection pool on shutdown

The pool was created during startup but never closed on shutdown.
Rolling deploys or OOM-killed pods leave connections in idle state until
TCP timeout, consuming PostgreSQL's finite connection limit. Close the
pool cleanly in the lifespan teardown.

* Address review findings: unique temp paths, guarded shutdown, tests

Finding 1: Wrap close_pool() in try/except so an exception doesn't
skip thread_pool.shutdown().

Finding 2+5: Use UUID-suffixed temp filenames via _make_unique_temp_path
in all three upload endpoints (embed, embed-upload, text). This prevents
same-user concurrent upload collisions (not just cross-user). Also moves
save_upload_file_async inside the try/finally block so temp files are
cleaned up even when the save itself fails.

Finding 3+6: Restore tuning guidance comment for EMBEDDING_BATCH_SIZE
explaining the default of 500 and when to increase.

Finding 4: Add tests for _make_unique_temp_path (uniqueness, user
isolation, extension preservation, traversal rejection) and
generate_digest (normal strings, unicode, surrogates, determinism).
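A plausible shape for the unique-temp-path helper described above (the real one also runs the path through `validate_file_path`; this sketch only shows the UUID-suffix and user-isolation idea):

```python
import uuid
from pathlib import Path

def make_unique_temp_path(upload_dir: str, user_id: str, filename: str) -> Path:
    # Strip any client-supplied directory components, then suffix the
    # stem with a UUID so concurrent uploads of the same filename
    # (even by the same user) never share a temp path.
    base = Path(filename).name
    stem, ext = Path(base).stem, Path(base).suffix
    return Path(upload_dir) / user_id / f"{stem}_{uuid.uuid4().hex}{ext}"
```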

* Clean up remaining nits from review

- Drop redundant second validate_file_path call in _make_unique_temp_path;
  the UUID suffix is [0-9a-f]{32} with no path separators, so it cannot
  escape the already-validated directory. Resolve the path directly.
- Fold os.makedirs into the try/finally block in all three upload
  endpoints so an OSError (permissions, disk full) is caught by the
  existing error handler instead of propagating as an unhandled 500.
- Fix hardcoded Unix separator in test assertion; use Path.parent.name
  instead of string containment check.
* 🦥 feat: Add lazy_load() support to document loaders

Add lazy_load() to SafePyPDFLoader and switch call sites from
loader.load to lazy_load(), establishing the pattern for future
streaming optimizations. Memory benchmarks confirm CSV streaming
sees ~67% reduction; PDF sees modest gains; Unstructured loaders
show no difference (as expected — they load fully internally).

* 🩺 fix: Prevent page duplication in SafePyPDFLoader fallback path

Fix critical bug where try/except around yield-from caused
already-yielded pages to be duplicated when the image-extraction
fallback triggered mid-stream. The extract_images=True path now
collects eagerly to ensure atomic fallback; extract_images=False
remains fully streaming.
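The duplication bug and its fix can be reduced to a small generator sketch (names are illustrative; in the real loader the exception is pypdf's Filter error and the fallback re-parses with image extraction):

```python
def safe_lazy_load(stream_pages, fallback_pages, fallback_exc):
    # Buggy version: `try: yield from stream_pages() except: yield from
    # fallback_pages()` re-emits pages already yielded before the
    # failure. Collecting eagerly makes the fallback atomic.
    try:
        pages = list(stream_pages())
    except fallback_exc:
        pages = list(fallback_pages())
    yield from pages
```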

Also addresses code review findings:
- Add mock-based tests for fallback correctness (no duplication,
  load() delegation, non-Filter error propagation)
- Fix file handle leaks in test helpers (open without with)
- Add tracemalloc.is_tracing() guards in benchmarks
- Use collections.abc.Iterator instead of types.GeneratorType
- Restore explanatory comment on bare raise
- Remove vacuous assert (list() never returns None)
- Add real assertions to memory benchmark tests
danny-avila#258)

* 🚰 fix: Close leaked database connections and temp files on shutdown

- MongoDB: Store MongoClient reference in factory module so it can be
  closed on app shutdown (was created anonymously, never closed)
- SQLAlchemy: Dispose PGVector's engine on shutdown (connection pool
  was never released, separate from the asyncpg pool that was managed)
- embed_local_file: Move cleanup_temp_encoding_file to finally block
  so CSV temp files are cleaned even when store_data_in_vector_db fails
- Bound results_queue in async pipeline to EMBEDDING_MAX_QUEUE_SIZE
  to prevent unbounded memory growth on large documents

* 🩺 fix: Revert results_queue deadlock, fix remaining temp leak, correct shutdown order

- Revert results_queue to unbounded: bounding it deadlocked any document
  producing >EMBEDDING_MAX_QUEUE_SIZE batches, since the drain loop runs
  only after asyncio.gather() returns but the consumer blocks on put()
  when the queue is full. The queue holds only UUID lists (negligible memory).
- Fix same temp file leak in load_file_content that was fixed in
  embed_local_file — affects /embed, /embed-upload, /text routes.
- Swap shutdown order: drain thread pool before disposing connections,
  so no threads are mid-SQL when the engine pool is torn down.
- Close previous MongoClient before overwriting in get_vector_store.
- Add docstrings and type annotations to factory functions.
- Add 6 tests for close_vector_store_connections (mongo close, engine
  dispose, idempotency, no-bind, None, re-call closes previous client).

* 🧪 test: Add test for load_file_content cleanup on lazy_load failure

Verifies that cleanup_temp_encoding_file is called in the finally
block even when lazy_load() raises, closing the last test coverage
gap from the review.
…ny-avila#259)

* feat: Migrate cmetadata from JSON to JSONB and add GIN index for efficient filtering

* fix: Correct DO block dollar-quoting, add lock_timeout, schema filter, and tests

Addresses code review findings on the JSON-to-JSONB migration:
- Fix invalid $..$ dollar-quoting to $$...$$ (was preventing app startup)
- Add SET LOCAL lock_timeout = '10s' before ALTER TABLE
- Add table_schema = current_schema() filter to information_schema query
- Add docstring and inline comments (rollback SQL, index redundancy note)
- Replace dummy test with capturing tests that verify actual SQL content
- Add use_jsonb=True regression tests for both sync and async factory paths
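An illustrative version of the migration statement (table name and exact SQL are assumptions, not the repo's actual migration); note the body is `$$`-quoted, which is the dollar-quoting typo this commit fixes:

```python
MIGRATION_SQL = """
DO $$
BEGIN
    IF EXISTS (
        SELECT 1
        FROM information_schema.columns
        WHERE table_name = 'langchain_pg_embedding'
          AND column_name = 'cmetadata'
          AND data_type = 'json'
          AND table_schema = current_schema()
    ) THEN
        -- Bound the wait for the ACCESS EXCLUSIVE lock ALTER TABLE takes.
        SET LOCAL lock_timeout = '10s';
        ALTER TABLE langchain_pg_embedding
            ALTER COLUMN cmetadata TYPE jsonb USING cmetadata::jsonb;
    END IF;
END
$$;
"""
```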
…er (danny-avila#260)

* chore: Upgrade langchain and related dependencies; refactor async execution in vector store

* fix: Handle None for RAG_GOOGLE_API_KEY and update default EMBEDDINGS_MODEL for Google Vertex AI

* fix: Correct package names and enforce version constraints for pymongo and boto3

* refactor: Simplify thread pool retrieval in AsyncPgVector class

* test: Add unit tests for DummyAsyncPgVector methods in test_async_pg_vector.py

* feat: Implement async executor wrapper to handle StopIteration in AsyncPgVector methods; add corresponding unit test

* chore: unused import

* chore: Update VertexAI embedding model and add additional environment variables for Google Cloud integration
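The StopIteration-safe executor wrapper mentioned above can be sketched like this (names hypothetical): a `StopIteration` that escapes into a coroutine is surfaced by asyncio as an opaque `RuntimeError`, so the wrapper converts it explicitly with a clear message.

```python
import asyncio
import functools

def _trap_stopiteration(fn, *args, **kwargs):
    try:
        return fn(*args, **kwargs)
    except StopIteration as exc:
        # Re-raise as a plain exception so it propagates cleanly
        # through run_in_executor into the awaiting coroutine.
        raise RuntimeError("sync call raised StopIteration") from exc

async def run_sync(fn, *args):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None, functools.partial(_trap_stopiteration, fn, *args)
    )
```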
…vila#265)

The unstructured library calls nltk.pos_tag() when classifying short
text fragments during markdown parsing, which requires the
averaged_perceptron_tagger_eng resource in NLTK 3.9+. Without it,
uploading .md files fails with LookupError in environments without
outbound internet (the bug is masked elsewhere because unstructured
silently auto-downloads the package at runtime).

Fixes danny-avila#141
…anny-avila#264)

* fix: add missing msoffcrypto-tool package for xlsx file support

* chore: add major version upper bound to msoffcrypto-tool constraint

Pin msoffcrypto-tool to >=6.0.0,<7 (consistent with other >= deps in
this project like pymongo, pydantic, boto3) to prevent accidental
breakage from future major version bumps.

---------

Co-authored-by: Danny Avila <danny@librechat.ai>
… ID Collisions (danny-avila#266)

* Remove hard coded documents field for add_documents function

The add_documents function for mongo has a docs parameter, whereas the same function for postgres has a documents parameter. This difference makes the function fail for mongo.

* fix: resolve ID collision and LSP violation in AtlasMongoVector.add_documents

- Rename `docs` param to `documents` to match VectorStore base class,
  fixing the root cause of the TypeError for MongoDB users
- Replace per-batch sequential ID generation with content-digest-based
  IDs to prevent cross-batch collisions that caused silent data loss
  for files exceeding EMBEDDING_BATCH_SIZE in chunk count
- Add empty document guard
- Update type annotation and docstring in _process_documents_batched_sync
  to reflect that it handles both PgVector and AtlasMongoVector
- Add regression tests for cross-batch ID uniqueness and positional
  dispatch compatibility
- Align DummyVectorStore in conftest with base class signature
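The content-digest ID idea can be sketched as follows (the exact hash and key composition are assumptions): IDs derived from content are stable and globally unique across batches, unlike per-batch sequential counters that repeat when a file spans more than one batch.

```python
import hashlib

def content_digest_id(text: str, file_id: str) -> str:
    # Keying on file_id plus chunk content keeps identical chunks in
    # different files distinct while staying deterministic per chunk.
    return hashlib.sha256(f"{file_id}:{text}".encode("utf-8", "ignore")).hexdigest()
```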

* fix: address remaining review findings (R1-R3)

- R1: Replace vacuous test_empty_documents_returns_empty with one that
  calls the real AtlasMongoVector.add_documents via bound method
- R2: Swap test names/stores so the primary regression test uses a
  strict legacy-param store (no **kwargs) that would fail with keyword
  dispatch, and rename the base-class-signature test accordingly
- R3: Add Union["PgVector", "AtlasMongoVector"] type annotation to
  _process_documents_batched_sync; add docstring to add_documents
  explaining that caller-supplied ids are intentionally ignored in
  favor of content-digest-based IDs

---------

Co-authored-by: Danny Avila <danny@librechat.ai>
Brings in Paychex customizations:
- PaychexDockerfile with Paychex root certificate
- CI/CD workflows for n1, n2a, and prod environments
- Azure ACR integration
- paychex-root.pem certificate

This merge combines upstream v0.7.3 features with Paychex infrastructure.
- Add missing averaged_perceptron_tagger_eng NLTK package (from upstream danny-avila#265)
- Remove duplicate WORKDIR /app line
- Maintains Python 3.12-slim and Paychex certificate integration