feat: Add Amazon S3 Vectors document store integration#3149
feat: Add Amazon S3 Vectors document store integration#3149dotKokott wants to merge 7 commits intodeepset-ai:mainfrom
Conversation
Implements issue deepset-ai#2110 - Amazon S3 Vectors document store integration with: - S3VectorsDocumentStore: full DocumentStore protocol (count, write, filter, delete) - S3VectorsEmbeddingRetriever: embedding-based retrieval with metadata filtering - Filter conversion from Haystack format to S3 Vectors filter syntax - Auto-creation of vector buckets and indexes - AWS credential support via Secret (or default credential chain) - 49 unit tests covering store, retriever, filters, and serialization - README with usage examples and known limitations
…rkflow - boto3 lower bound set to 1.42.0 (when s3vectors service was added) - pydoc filename changed to amazon_s3_vectors.md (underscores, matching folder name) - Quote $GITHUB_OUTPUT in workflow to fix shellcheck SC2086
- Flatten test classes into standalone functions (matching pinecone/qdrant pattern) - Assert full serialized dict structure in to_dict/from_dict tests - Use Mock(spec=...) for retriever tests instead of MagicMock+patch - Verify _embedding_retrieval call args match exactly - Add test_from_dict_no_filter_policy (backward compat) - Add test_init_is_lazy
Remove tests that just verify mock plumbing (count, write, delete calling the mock client). Keep tests that verify our actual logic: - Serialization roundtrip (full dict structure) - Score conversion (cosine + euclidean) - Filter conversion (pure function with real logic) - Duplicate policy batch checks (SKIP/NONE) - Document <-> S3 vector conversion - Input validation Before: 49 unit tests (many testing mock behavior) After: 26 unit tests (all testing our code) + 12 integration tests
- Class docstring: top_k cap, dimension limit, metadata limits, float32 only - write_documents: embedding required, 40KB metadata limit - _embedding_retrieval: top_k=100 cap, no embeddings in response - Retriever run: top_k=100, server-side filters, no embeddings returned
…ity, deduplicate retrieval logic - Replace hand-rolled _apply_filters_in_memory/_document_matches/_compare with haystack.utils.filters.document_matches_filter (same utility used by InMemoryDocumentStore). Gains NOT operator, nested dotted field paths, and date comparison support for free. (-65 lines) - Deduplicate blob/content reconstruction in _embedding_retrieval() by reusing _s3_vector_to_document() + dataclasses.replace() (-20 lines) - Make filter_documents() warning conditional on filters actually being provided (no warning when listing all documents)
1df9666 to
90c4977
Compare
CI: Integration tests need AWS credential setupThe integration tests currently run unconditionally in CI with no AWS credentials configured. The tests have a
What needs to happenThe workflow should match the # Do not authenticate on PRs from forks and on PRs created by dependabot
- name: AWS authentication
id: aws-auth
if: github.event_name == 'schedule' || (github.event.pull_request.head.repo.full_name == github.repository && !startsWith(github.event.pull_request.head.ref, 'dependabot/'))
uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
with:
aws-region: us-east-1
role-to-assume: ${{ secrets.AWS_S3_VECTORS_CI_ROLE_ARN }}
- name: Run integration tests
if: success() && steps.aws-auth.outcome == 'success'
run: hatch run test:integration-cov-append-retryPrerequisites (maintainer action required)
|
|
@dotKokott I'll try to take a look in the next few days. Have you tried the integration yourself in a real-world setting with AWS? |
I have tried all integration tests and examples on my AWS account. However I did not try with any large datasets. That might be next thing to validate: does this work as expected with real load. |
Related Issues
Proposed Changes:
Adds an Amazon S3 Vectors document store integration — a serverless vector storage capability native to S3.
Components:
S3VectorsDocumentStore— full DocumentStore protocol (write, count, filter, delete)S3VectorsEmbeddingRetriever— embedding-based retrieval with server-side metadata filteringKey design decisions:
1 - distance) for Haystack conventionfilter_documents()useslist_vectors(returnData=True, returnMetadata=True)with client-side filtering (warning logged) since S3 Vectors has no standalone filter APIDuplicatePolicy.SKIP/NONE(batches of 100)Known limitations (documented in README):
top_kcapped at 100 (service limit)query_vectorsdoes not return embedding datafloat32,cosine/euclidean, eventual consistencyHow did you test it?
pytestmarkcredential guard for CIhatch run test:all,hatch run fmt,hatch run test:typesexamples/example.py) verified against live AWSNotes for the reviewer
This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.
Structure and test style follow the Pinecone integration pattern.
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:.