Upstream/rag api merge v0.7.3 #19

Open
paychex-joser wants to merge 22 commits into main from
upstream/rag_api_merge_v0.7.3
Conversation

@paychex-joser

No description provided.

MarcAmick and others added 21 commits December 10, 2025 13:49
…a#214)

* Process documents using async pipeline with concurrent embedding generation and DB insertion.
This provides better memory efficiency by immediately inserting embeddings as they're generated.

* - made max size of embedding queue configurable with EMBEDDING_MAX_QUEUE_SIZE env var

* implemented fixes for copilot code review

* implemented fixes for copilot code review
- white space changes and remove unneeded import

* implemented fixes for copilot code review
- added documentation in function header for executor argument

* feat: improve streaming document embeddings with bug fixes and tests

This PR implements several critical fixes and improvements for the batch
embedding feature that streams document embeddings to reduce memory consumption.

## Critical Fixes
- Fix blocking async in _process_documents_batched_sync by wrapping sync
  calls in run_in_executor()
- Unify rollback behavior: add `if all_ids:` guard to async path to only
  rollback when documents were actually inserted
- Fix producer error handling: move `put(None)` to finally block to always
  signal completion
- Add task_done() call for None signal in consumer
- Fix consumer error handling: put exception object in results_queue instead
  of empty list, enabling proper error propagation
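The signalling fixes above (sentinel in `finally`, `task_done()` for the sentinel, exceptions pushed into the results list) can be illustrated with a minimal asyncio sketch; the embed-and-insert step is a stand-in, and only the producer/consumer pattern mirrors the commit:

```python
import asyncio

async def batch_producer(batches, queue):
    # Always signal completion, even if batch generation fails,
    # so the consumer never waits forever on the queue.
    try:
        for batch in batches:
            await queue.put(batch)
    finally:
        await queue.put(None)

async def embedding_consumer(queue, results):
    while True:
        batch = await queue.get()
        if batch is None:
            queue.task_done()  # account for the sentinel too
            break
        try:
            # stand-in for "generate embeddings and insert into the DB"
            results.append([x * 10 for x in batch])
        except Exception as exc:
            results.append(exc)  # propagate the error, not an empty list
        finally:
            queue.task_done()

async def run_pipeline(batches):
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)  # bounded, like EMBEDDING_MAX_QUEUE_SIZE
    results: list = []
    await asyncio.gather(
        batch_producer(batches, queue),
        embedding_consumer(queue, results),
    )
    return results
```

Because the sentinel is emitted in `finally`, a producer crash still unblocks the consumer, and a consumer error surfaces to the caller instead of being swallowed.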

## Code Quality Improvements
- Remove dead `# temp_vector_store` comment
- Remove ineffective `del` statements (don't help with memory in Python)
- Fix all f-string logging to use %-style formatting
- Update docstrings to accurately describe function behavior
- Add type hints (AsyncPgVector, PgVector, ThreadPoolExecutor)
- Rename functions for clarity: embedding_producer → batch_producer,
  database_consumer → embedding_consumer
- Add early return for empty document lists

## Tests (31 new tests)
- tests/test_batch_processing.py: 23 unit tests covering async pipeline,
  sync batched processing, edge cases, and producer-consumer pattern
- tests/test_batch_processing_integration.py: 8 integration tests including
  memory optimization verification with tracemalloc

## Documentation
- Add EMBEDDING_BATCH_SIZE and EMBEDDING_MAX_QUEUE_SIZE to README env vars
- Add "Embedding Batch Processing" section with configuration guide
- Add "Running Tests" section with commands and test categories
- Improve inline comments in config.py explaining batch processing trade-offs
- Add pytest.ini for test configuration

Closes danny-avila#213

* feat: add utility function for batch calculation and update tests

This commit introduces a new utility function `calculate_num_batches` to streamline batch size calculations across the document processing routes. The function handles various edge cases, including zero items and batch sizes of zero. Additionally, existing tests have been updated to utilize this new function, ensuring accurate batch count calculations and improving test coverage for edge cases.
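A plausible shape for such a utility, assuming the edge-case semantics described above (zero items gives zero batches; a batch size of zero means batching is disabled, so everything goes in one batch):

```python
def calculate_num_batches(num_items: int, batch_size: int) -> int:
    # Zero items always means zero batches.
    if num_items <= 0:
        return 0
    # batch_size <= 0 means batching is disabled: one batch for everything.
    if batch_size <= 0:
        return 1
    # Ceiling division without importing math.
    return -(-num_items // batch_size)
```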

* refactor: update vector store mocking in tests to use AsyncPgVector

This commit modifies the test setup in `test_main.py` to replace the mocking of vector store methods with the `AsyncPgVector` class. The changes include updating the signatures of dummy functions to accept `self` and ensuring that all relevant methods are patched correctly. This enhances the accuracy of the tests by aligning them with the asynchronous implementation of the vector store.

* test: clear cache for embedding function in vector store tests

This commit updates the test setup in `test_main.py` to clear the cache of the `get_cached_query_embedding` function before running tests. This ensures that the mock embedding function is used consistently, improving the reliability of the tests. Additionally, the comment for overriding the embedding function has been clarified to indicate that it uses a dummy implementation that does not call OpenAI.

* test: enhance vector store tests with dummy embedding function

This commit updates the test setup in `test_main.py` to clear the cache of the `get_cached_query_embedding` function and replaces it with a dummy implementation that returns predefined embeddings. This change improves the reliability of the tests by ensuring consistent behavior during test execution.

---------

Co-authored-by: Marc Amick <MarcAmick@jhu.edu>
Co-authored-by: Danny Avila <danny@librechat.ai>
* 📚 docs: Update README with clean install instructions and enhance requirements files

- Added a section in README for clean installation of dependencies, including steps for local development, lite version, and Docker.
- Updated `requirements.txt` and `requirements.lite.txt` to bump `langchain-core` to v0.3.80 and `pypdf` to v6.4.0, and incremented `langchain_text_splitters` to v0.3.9.

* 🔧 chore: Add `tenacity` dependency to requirements files

- Updated `requirements.txt` and `requirements.lite.txt` to include `tenacity>=9.0.0` for improved retry handling in asynchronous operations.
…la#221)

* 🔧 refactor: Document Processing and Health Check Functions

* Refactored `_prepare_documents_sync` to handle document preparation synchronously, improving performance and clarity.
* Updated `store_data_in_vector_db` to utilize the new synchronous document preparation function, ensuring non-blocking behavior in the async context.
* Changed `is_health_ok` to be asynchronous, allowing for proper handling of health checks for different database types.

These changes streamline document processing and improve the responsiveness of health checks in the application.

* 🔧 refactor: Update Event Loop Retrieval in Document Processing

* Replaced `asyncio.get_event_loop()` with `asyncio.get_running_loop()` in `_process_documents_batched_sync` and `store_data_in_vector_db` to ensure compatibility with the current asyncio context.
* This change enhances the reliability of asynchronous operations by using the correct method for obtaining the running event loop.
* Added a step to free up disk space in the GitHub Actions workflow, utilizing the `jlumbroso/free-disk-space` action to optimize resource usage during builds.
* Updated caching strategy for Docker images to improve build performance by specifying cache-from and cache-to options.

These changes aim to streamline the CI/CD process and ensure efficient resource management.
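The event-loop change can be sketched as follows (helper names are illustrative, not the repo's exact signatures): `get_running_loop()` fails loudly when called outside a coroutine, whereas `get_event_loop()` may silently create a second loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def _prepare_documents_sync(docs):
    # stand-in for the real synchronous document preparation
    return [d.upper() for d in docs]

async def store_data_in_vector_db(docs, executor: ThreadPoolExecutor):
    # Correct inside a coroutine: returns the loop actually running us.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, _prepare_documents_sync, docs)
```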
* Fix CVE-2025-68664

Fix "LangGrinch" vulnerability:
https://cyata.ai/blog/langgrinch-langchain-core-cve-2025-68664/

Signed-off-by: Jeremie Daniel

* Fix CVE-2025-68664

Fix "LangGrinch" critical vulnerability: 
https://cyata.ai/blog/langgrinch-langchain-core-cve-2025-68664/

Signed-off-by: Rahul Golam
…ts (danny-avila#227)

* fix(CVE-2025-68413): restrict local file embeds to upload directory

* fix(CVE_2025_68414): validate file path for embed and embed-upload endpoints

* fix(security): additional endpoint for same issue as CVE_2025_68414 per @valiumaggelein

* fix(CVE-2025-68414): enhance file path validation to prevent traversal attacks and add comprehensive tests

* fix: improve file path validation in embed and extract endpoints to prevent path traversal vulnerabilities; add tests for entity_id parameter handling

---------

Co-authored-by: Danny Avila <danny@librechat.ai>
* chore: Update `unstructured` dependency to version 0.18.32 in requirements files

* chore: Update dependencies in requirements files to latest versions

- Bumped `pypdf` from 6.4.0 to 6.7.3
- Updated `python-multipart` from 0.0.19 to 0.0.22
- Upgraded `cryptography` from 45.0.5 to 46.0.5
* Introduced a script to automate the creation of a Python virtual environment.
* The script checks for the existence of a specified requirements file, removes any existing virtual environment, and installs dependencies from the requirements file.
* Provides instructions for activating the virtual environment after setup.
…idation Crashes (danny-avila#256)

* Fix race condition in embed_file_upload temp file path

Concurrent uploads with the same filename would write to the same temp
path because embed_file_upload used the filename directly under
RAG_UPLOAD_DIR, unlike embed_file which namespaces under user_id. Two
users uploading "report.pdf" simultaneously would corrupt each other's
data. Add user_id subdirectory to match embed_file's behavior.

* Fix connection leak and event loop blocking in mongo_health_check

Every health check call created a MongoClient, opened a connection pool,
and never closed it. Since health checks run frequently, this leaked
connections continuously. Also, MongoClient() and admin.command() are
blocking I/O calls that were running directly in an async function,
blocking the event loop.

Fix both: wrap blocking calls in asyncio.to_thread() and always close
the client in a finally block.
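A minimal sketch of that fix, with a factory standing in for `MongoClient` so the blocking-I/O and always-close behavior is the only thing shown:

```python
import asyncio

def _blocking_ping(client_factory):
    # Client construction and the ping command are both blocking I/O;
    # they run in a worker thread, never on the event loop.
    client = client_factory()
    try:
        client.ping()
        return True
    finally:
        client.close()  # always release the connection pool

async def mongo_health_check(client_factory) -> bool:
    try:
        return await asyncio.to_thread(_blocking_ping, client_factory)
    except Exception:
        return False
```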

* Fix validation_exception_handler crash and body echo

request.body() re-reads the entire upload into memory inside the error
handler, and body.decode() raises UnicodeDecodeError on binary/non-ASCII
content, turning a 422 into a 500. The response also echoed potentially
large or sensitive upload content back to the caller. Remove the body
read entirely — the validation errors already contain all useful info.

* Remove dead exception branch in generate_digest

str.encode("utf-8") on a Python str never raises UnicodeEncodeError for
any valid Unicode string. The except branch handled surrogate characters
with an encode-decode-encode round-trip that produces the same result as
encode("utf-8", "ignore") directly. Replace with a single expression.
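The resulting single expression looks roughly like this (the hash algorithm is assumed for illustration; the point is the one-step `errors="ignore"` encode):

```python
import hashlib

def generate_digest(page_content: str) -> str:
    # "ignore" drops unencodable code points (e.g. lone surrogates)
    # in one step, replacing the old encode/decode/encode round-trip.
    return hashlib.md5(page_content.encode("utf-8", "ignore")).hexdigest()
```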

* Enable EMBEDDING_BATCH_SIZE by default (500)

The batch processing pipeline is fully built but disabled by default
(EMBEDDING_BATCH_SIZE=0). This means large documents embed all chunks in
one blocking call. Set default to 500 to enable streaming embeddings
into the DB while the rest are still being computed — the highest-leverage memory improvement with zero new code.

* Stream CSV encoding conversion instead of reading entire file

Non-UTF-8 CSV files were read entirely into a single Python string
before writing to the temp file. For large files (e.g. 100MB UTF-16),
this causes unnecessary peak memory allocation. Stream in 64KB chunks
instead.
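A sketch of the streaming conversion (function name hypothetical): Python's text layer buffers partial multi-byte sequences across chunk boundaries, so chunked reads are safe even for UTF-16 input.

```python
def convert_encoding_streaming(src_path, src_encoding, dst_path, chunk_chars=64 * 1024):
    # Re-encode to UTF-8 one chunk at a time instead of reading the
    # whole file into a single Python string.
    with open(src_path, "r", encoding=src_encoding) as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
```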

* Close asyncpg connection pool on shutdown

The pool was created during startup but never closed on shutdown.
Rolling deploys or OOM-killed pods leave connections in idle state until
TCP timeout, consuming PostgreSQL's finite connection limit. Close the
pool cleanly in the lifespan teardown.

* Address review findings: unique temp paths, guarded shutdown, tests

Finding 1: Wrap close_pool() in try/except so an exception doesn't
skip thread_pool.shutdown().

Finding 2+5: Use UUID-suffixed temp filenames via _make_unique_temp_path
in all three upload endpoints (embed, embed-upload, text). This prevents
same-user concurrent upload collisions (not just cross-user). Also moves
save_upload_file_async inside the try/finally block so temp files are
cleaned up even when the save itself fails.

Finding 3+6: Restore tuning guidance comment for EMBEDDING_BATCH_SIZE
explaining the default of 500 and when to increase.

Finding 4: Add tests for _make_unique_temp_path (uniqueness, user
isolation, extension preservation, traversal rejection) and
generate_digest (normal strings, unicode, surrogates, determinism).
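A plausible shape for the unique-temp-path helper described above (the real one also runs the path through `validate_file_path`; this sketch only shows the UUID-suffix and user-isolation idea):

```python
import uuid
from pathlib import Path

def make_unique_temp_path(upload_dir: str, user_id: str, filename: str) -> Path:
    # Strip any client-supplied directory components, then suffix the
    # stem with a UUID so concurrent uploads of the same filename
    # (even by the same user) never share a temp path.
    base = Path(filename).name
    stem, ext = Path(base).stem, Path(base).suffix
    return Path(upload_dir) / user_id / f"{stem}_{uuid.uuid4().hex}{ext}"
```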

* Clean up remaining nits from review

- Drop redundant second validate_file_path call in _make_unique_temp_path;
  the UUID suffix is [0-9a-f]{32} with no path separators, so it cannot
  escape the already-validated directory. Resolve the path directly.
- Fold os.makedirs into the try/finally block in all three upload
  endpoints so an OSError (permissions, disk full) is caught by the
  existing error handler instead of propagating as an unhandled 500.
- Fix hardcoded Unix separator in test assertion; use Path.parent.name
  instead of string containment check.
* 🦥 feat: Add lazy_load() support to document loaders

Add lazy_load() to SafePyPDFLoader and switch call sites from
loader.load to lazy_load(), establishing the pattern for future
streaming optimizations. Memory benchmarks confirm CSV streaming
sees ~67% reduction; PDF sees modest gains; Unstructured loaders
show no difference (as expected — they load fully internally).

* 🩺 fix: Prevent page duplication in SafePyPDFLoader fallback path

Fix critical bug where try/except around yield-from caused
already-yielded pages to be duplicated when the image-extraction
fallback triggered mid-stream. The extract_images=True path now
collects eagerly to ensure atomic fallback; extract_images=False
remains fully streaming.
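The duplication bug and its fix can be reduced to a small generator sketch (names are illustrative; in the real loader the exception is pypdf's Filter error and the fallback re-parses with image extraction):

```python
def safe_lazy_load(stream_pages, fallback_pages, fallback_exc):
    # Buggy version: `try: yield from stream_pages() except: yield from
    # fallback_pages()` re-emits pages already yielded before the
    # failure. Collecting eagerly makes the fallback atomic.
    try:
        pages = list(stream_pages())
    except fallback_exc:
        pages = list(fallback_pages())
    yield from pages
```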

Also addresses code review findings:
- Add mock-based tests for fallback correctness (no duplication,
  load() delegation, non-Filter error propagation)
- Fix file handle leaks in test helpers (open without with)
- Add tracemalloc.is_tracing() guards in benchmarks
- Use collections.abc.Iterator instead of types.GeneratorType
- Restore explanatory comment on bare raise
- Remove vacuous assert (list() never returns None)
- Add real assertions to memory benchmark tests
danny-avila#258)

* 🚰 fix: Close leaked database connections and temp files on shutdown

- MongoDB: Store MongoClient reference in factory module so it can be
  closed on app shutdown (was created anonymously, never closed)
- SQLAlchemy: Dispose PGVector's engine on shutdown (connection pool
  was never released, separate from the asyncpg pool that was managed)
- embed_local_file: Move cleanup_temp_encoding_file to finally block
  so CSV temp files are cleaned even when store_data_in_vector_db fails
- Bound results_queue in async pipeline to EMBEDDING_MAX_QUEUE_SIZE
  to prevent unbounded memory growth on large documents

* 🩺 fix: Revert results_queue deadlock, fix remaining temp leak, correct shutdown order

- Revert results_queue to unbounded: bounding it deadlocked any document
  producing >EMBEDDING_MAX_QUEUE_SIZE batches, since the drain loop runs
  only after asyncio.gather() returns but the consumer blocks on put()
  when the queue is full. The queue holds only UUID lists (negligible memory).
- Fix same temp file leak in load_file_content that was fixed in
  embed_local_file — affects /embed, /embed-upload, /text routes.
- Swap shutdown order: drain thread pool before disposing connections,
  so no threads are mid-SQL when the engine pool is torn down.
- Close previous MongoClient before overwriting in get_vector_store.
- Add docstrings and type annotations to factory functions.
- Add 6 tests for close_vector_store_connections (mongo close, engine
  dispose, idempotency, no-bind, None, re-call closes previous client).

* 🧪 test: Add test for load_file_content cleanup on lazy_load failure

Verifies that cleanup_temp_encoding_file is called in the finally
block even when lazy_load() raises, closing the last test coverage
gap from the review.
…ny-avila#259)

* feat: Migrate cmetadata from JSON to JSONB and add GIN index for efficient filtering

* fix: Correct DO block dollar-quoting, add lock_timeout, schema filter, and tests

Addresses code review findings on the JSON-to-JSONB migration:
- Fix invalid $..$ dollar-quoting to $$...$$ (was preventing app startup)
- Add SET LOCAL lock_timeout = '10s' before ALTER TABLE
- Add table_schema = current_schema() filter to information_schema query
- Add docstring and inline comments (rollback SQL, index redundancy note)
- Replace dummy test with capturing tests that verify actual SQL content
- Add use_jsonb=True regression tests for both sync and async factory paths
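An illustrative version of the migration statement (table name and exact SQL are assumptions, not the repo's actual migration); note the body is `$$`-quoted, which is the dollar-quoting typo this commit fixes:

```python
MIGRATION_SQL = """
DO $$
BEGIN
    IF EXISTS (
        SELECT 1
        FROM information_schema.columns
        WHERE table_name = 'langchain_pg_embedding'
          AND column_name = 'cmetadata'
          AND data_type = 'json'
          AND table_schema = current_schema()
    ) THEN
        -- Bound the wait for the ACCESS EXCLUSIVE lock ALTER TABLE takes.
        SET LOCAL lock_timeout = '10s';
        ALTER TABLE langchain_pg_embedding
            ALTER COLUMN cmetadata TYPE jsonb USING cmetadata::jsonb;
    END IF;
END
$$;
"""
```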
…er (danny-avila#260)

* chore: Upgrade langchain and related dependencies; refactor async execution in vector store

* fix: Handle None for RAG_GOOGLE_API_KEY and update default EMBEDDINGS_MODEL for Google Vertex AI

* fix: Correct package names and enforce version constraints for pymongo and boto3

* refactor: Simplify thread pool retrieval in AsyncPgVector class

* test: Add unit tests for DummyAsyncPgVector methods in test_async_pg_vector.py

* feat: Implement async executor wrapper to handle StopIteration in AsyncPgVector methods; add corresponding unit test

* chore: unused import

* chore: Update VertexAI embedding model and add additional environment variables for Google Cloud integration
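The StopIteration-safe executor wrapper mentioned above can be sketched like this (names hypothetical): a `StopIteration` that escapes into a coroutine is surfaced by asyncio as an opaque `RuntimeError`, so the wrapper converts it explicitly with a clear message.

```python
import asyncio
import functools

def _trap_stopiteration(fn, *args, **kwargs):
    try:
        return fn(*args, **kwargs)
    except StopIteration as exc:
        # Re-raise as a plain exception so it propagates cleanly
        # through run_in_executor into the awaiting coroutine.
        raise RuntimeError("sync call raised StopIteration") from exc

async def run_sync(fn, *args):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None, functools.partial(_trap_stopiteration, fn, *args)
    )
```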
…vila#265)

The unstructured library calls nltk.pos_tag() when classifying short
text fragments during markdown parsing, which requires the
averaged_perceptron_tagger_eng resource in NLTK 3.9+. Without it,
uploading .md files fails with LookupError in environments without
outbound internet (the bug is masked elsewhere because unstructured
silently auto-downloads the package at runtime).

Fixes danny-avila#141
…anny-avila#264)

* fix: add missing msoffcrypto-tool package for xlsx file support

* chore: add major version upper bound to msoffcrypto-tool constraint

Pin msoffcrypto-tool to >=6.0.0,<7 (consistent with other >= deps in
this project like pymongo, pydantic, boto3) to prevent accidental
breakage from future major version bumps.

---------

Co-authored-by: Danny Avila <danny@librechat.ai>
… ID Collisions (danny-avila#266)

* Remove hard coded documents field for add_documents function

The add_documents function for mongo has a docs parameter, whereas the same function for postgres has a documents parameter. This difference makes the function fail for mongo.

* fix: resolve ID collision and LSP violation in AtlasMongoVector.add_documents

- Rename `docs` param to `documents` to match VectorStore base class,
  fixing the root cause of the TypeError for MongoDB users
- Replace per-batch sequential ID generation with content-digest-based
  IDs to prevent cross-batch collisions that caused silent data loss
  for files exceeding EMBEDDING_BATCH_SIZE in chunk count
- Add empty document guard
- Update type annotation and docstring in _process_documents_batched_sync
  to reflect that it handles both PgVector and AtlasMongoVector
- Add regression tests for cross-batch ID uniqueness and positional
  dispatch compatibility
- Align DummyVectorStore in conftest with base class signature
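The content-digest ID idea can be sketched as follows (the exact hash and key composition are assumptions): IDs derived from content are stable and globally unique across batches, unlike per-batch sequential counters that repeat when a file spans more than one batch.

```python
import hashlib

def content_digest_id(text: str, file_id: str) -> str:
    # Keying on file_id plus chunk content keeps identical chunks in
    # different files distinct while staying deterministic per chunk.
    return hashlib.sha256(f"{file_id}:{text}".encode("utf-8", "ignore")).hexdigest()
```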

* fix: address remaining review findings (R1-R3)

- R1: Replace vacuous test_empty_documents_returns_empty with one that
  calls the real AtlasMongoVector.add_documents via bound method
- R2: Swap test names/stores so the primary regression test uses a
  strict legacy-param store (no **kwargs) that would fail with keyword
  dispatch, and rename the base-class-signature test accordingly
- R3: Add Union["PgVector", "AtlasMongoVector"] type annotation to
  _process_documents_batched_sync; add docstring to add_documents
  explaining that caller-supplied ids are intentionally ignored in
  favor of content-digest-based IDs

---------

Co-authored-by: Danny Avila <danny@librechat.ai>
Brings in Paychex customizations:
- PaychexDockerfile with Paychex root certificate
- CI/CD workflows for n1, n2a, and prod environments
- Azure ACR integration
- paychex-root.pem certificate

This merge combines upstream v0.7.3 features with Paychex infrastructure.
- Add missing averaged_perceptron_tagger_eng NLTK package (from upstream danny-avila#265)
- Remove duplicate WORKDIR /app line
- Maintains Python 3.12-slim and Paychex certificate integration