diff --git a/content/patterns/rag-llm-cpu/_index.md b/content/patterns/rag-llm-cpu/_index.md
new file mode 100644
index 000000000..cdbed326d
--- /dev/null
+++ b/content/patterns/rag-llm-cpu/_index.md
@@ -0,0 +1,74 @@
---
title: RAG LLM Chatbot on CPU
date: 2025-10-24
tier: sandbox
summary: This pattern deploys a CPU-based LLM, your choice of several RAG DB providers, and a simple chatbot UI that exposes the configuration and results of the RAG queries.
rh_products:
  - Red Hat OpenShift Container Platform
  - Red Hat OpenShift GitOps
  - Red Hat OpenShift AI
partners:
  - Microsoft
  - IBM Fusion
industries:
  - General
aliases: /rag-llm-cpu/
links:
  github: https://github.com/validatedpatterns-sandbox/rag-llm-cpu
  install: getting-started
  bugs: https://github.com/validatedpatterns-sandbox/rag-llm-cpu/issues
  feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
---

# **CPU-based RAG LLM chatbot**

## **Introduction**

The CPU-based RAG LLM chatbot Validated Pattern deploys a retrieval-augmented generation (RAG) chatbot on Red Hat OpenShift by using Red Hat OpenShift AI.
The pattern runs entirely on CPU nodes without requiring GPU hardware, which provides a cost-effective and accessible solution for environments where GPU resources are limited or unavailable.
This pattern provides a secure, flexible, and production-ready starting point for building and deploying on-premise generative AI applications.

## **Target audience**

This pattern is intended for the following users:

- **Developers & Data Scientists** who want to build and experiment with RAG-based large language model (LLM) applications.
- **MLOps & DevOps Engineers** who are responsible for deploying and managing AI/ML workloads on OpenShift.
- **Architects** who evaluate cost-effective methods for delivering generative AI capabilities on-premise.

## **Why Use This Pattern?**

- **Cost-Effective**: The pattern runs entirely on CPU nodes, which removes the need for expensive and scarce GPU resources.
- **Flexible**: The pattern supports multiple vector database backends, such as Elasticsearch, PGVector, and Microsoft SQL Server, to integrate with existing data infrastructure.
- **Transparent**: The Gradio frontend exposes the internals of the RAG query and LLM prompts, which provides insight into the generation process.
- **Extensible**: The pattern uses open-source standards, such as KServe and OpenAI-compatible APIs, to serve as a foundation for complex applications.

## **Architecture Overview**

At a high level, the components work together in the following sequence:

1. A user enters a query into the **Gradio UI**.
2. The backend application, using **LangChain**, queries a configured **vector database** to retrieve relevant documents.
3. These documents are combined with the original query from the user into a prompt.
4. The prompt is sent to the **KServe-deployed LLM**, which runs via **llama.cpp** on a CPU node.
5. The LLM generates a response, which is streamed back to the **Gradio UI**.
6. **Vault** provides the necessary credentials for the vector database and HuggingFace token at runtime.

![Overview](/images/rag-llm-cpu/rag-augmented-query.png)

_Figure 1. Overview of a RAG query from the user's perspective._
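Because steps 4 and 5 go through an OpenAI-compatible API, you can also exercise the served model directly, without the Gradio UI. The following is a minimal sketch, not part of the pattern itself; the service name, namespace, and model name are assumptions taken from this pattern's defaults (the same `cpu-inference-service-predictor` endpoint that the frontend uses), so adjust them to match your deployment:

```sh
# Minimal sketch: call the CPU-served LLM over its OpenAI-compatible
# chat completions endpoint from a pod inside the cluster.
# Service, namespace, and model names are this pattern's defaults;
# adjust them for your environment.
curl -s http://cpu-inference-service-predictor.rag-llm-cpu.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral-7b-instruct-v0.2.Q5_0.gguf",
        "messages": [{"role": "user", "content": "What is a Validated Pattern?"}],
        "max_tokens": 128
      }'
```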
## **Prerequisites**

Before you begin, ensure that you have access to the following resources:

- A Red Hat OpenShift cluster version 4.x. (The recommended size is at least two `m5.4xlarge` nodes.)
- A HuggingFace API token.
- The `Podman` command-line tool.

## **What This Pattern Provides**

- A [KServe](https://github.com/kserve/kserve)-based LLM deployed to [Red Hat OpenShift AI](https://www.redhat.com/en/products/ai/openshift-ai) that runs entirely on a CPU node with a [llama.cpp](https://github.com/ggml-org/llama.cpp) runtime.
- A choice of one or more vector database providers to serve as a RAG backend with configurable web-based or Git repository-based sources. Vector embedding and document retrieval are implemented with [LangChain](https://docs.langchain.com/oss/python/langchain/overview).
- [Vault](https://developer.hashicorp.com/vault)-based secret management for a HuggingFace API token and credentials for supported databases, such as [Elasticsearch](https://www.elastic.co/docs/solutions/search/vector), [PGVector](https://github.com/pgvector/pgvector), and [Microsoft SQL Server](https://learn.microsoft.com/en-us/sql/sql-server/ai/vectors?view=sql-server-ver17).
- A [Gradio](https://www.gradio.app/)-based frontend for connecting to multiple [OpenAI API-compatible](https://github.com/openai/openai-openapi) LLMs. This frontend exposes the internals of the RAG query and LLM prompts so that users have insight into the running processes.
diff --git a/content/patterns/rag-llm-cpu/configure.md b/content/patterns/rag-llm-cpu/configure.md
new file mode 100644
index 000000000..942988791
--- /dev/null
+++ b/content/patterns/rag-llm-cpu/configure.md
@@ -0,0 +1,286 @@
---
title: Configuring this pattern
weight: 20
aliases: /rag-llm-cpu/configure/
---

# **Configuring this pattern**

This guide covers common customizations, such as changing the default large language model (LLM), adding new models, and configuring retrieval-augmented generation (RAG) data sources. This guide assumes that you have already completed the [Getting started](/rag-llm-cpu/getting-started/) guide.

## **Configuration overview**

ArgoCD manages this pattern by using GitOps. All application configurations are defined in the `values-prod.yaml` file. To customize a component, complete the following steps:

1. **Enable an override:** In the `values-prod.yaml` file, locate the application that you want to change, such as `llm-inference-service`, and add an `extraValueFiles:` entry that points to a new override file, such as `$patternref/overrides/llm-inference-service.yaml`.
2. **Create the override file:** Create the new `.yaml` file in the `/overrides` directory.
3. **Add settings:** Add the specific values that you want to change to the new file.
4. **Commit and synchronize:** Commit your changes and allow ArgoCD to synchronize the application, as shown in the sketch after this list.
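The commit-and-synchronize step is a standard Git workflow. The following is a minimal sketch; the branch name and the override file name are assumptions (it presumes ArgoCD tracks the `main` branch of your fork and uses the override file from the first task below), so substitute your own:

```sh
# Minimal sketch of the commit-and-synchronize step.
# Branch and file names are assumptions; use your own.
git add values-prod.yaml overrides/llm-inference-service.yaml
git commit -m "Add override for llm-inference-service"
git push origin main
# ArgoCD detects the new commit and synchronizes the application.
```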
## **Task: Changing the default LLM**

By default, the pattern deploys the `mistral-7b-instruct-v0.2.Q5_0.gguf` model. You can change this to a different model, such as a different quantization, or adjust the resource usage. To change the default LLM, create an override file for the existing `llm-inference-service` application.

1. **Enable the override:**
In the `values-prod.yaml` file, update the `llm-inference-service` application to use an override file:
    ```yaml
    clusterGroup:
      # ...
      applications:
        # ...
        llm-inference-service:
          name: llm-inference-service
          namespace: rag-llm-cpu
          chart: llm-inference-service
          chartVersion: 0.3.*
          extraValueFiles: # <-- ADD THIS BLOCK
            - $patternref/overrides/llm-inference-service.yaml
    ```

2. **Create the override file:**
Create a new file named `overrides/llm-inference-service.yaml`. The following example switches to a different model file (Q8_0) and increases the CPU and memory requests:
    ```yaml
    inferenceService:
      resources: # <-- Increased allocated resources
        requests:
          cpu: "8"
          memory: 12Gi
        limits:
          cpu: "12"
          memory: 24Gi

    servingRuntime:
      args:
        - --model
        - /models/mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed model file

    model:
      repository: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
      files:
        - mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed file to download
    ```

## **Task: Adding a second LLM**

You can deploy an additional LLM and add it to the demonstration user interface (UI). The following example deploys the KServe HuggingFace runtime instead of `llama.cpp`. This process requires two steps: deploying the new LLM and configuring the frontend UI.

### **Step 1: Deploying the new LLM service**

1. **Define the new application:**
In the `values-prod.yaml` file, add a new application named `another-llm-inference-service` to the applications list.

    ```yaml
    clusterGroup:
      # ...
      applications:
        # ...
        another-llm-inference-service: # <-- ADD THIS NEW APPLICATION
          name: another-llm-inference-service
          namespace: rag-llm-cpu
          chart: llm-inference-service
          chartVersion: 0.3.*
          extraValueFiles:
            - $patternref/overrides/another-llm-inference-service.yaml
    ```

2. **Create the override file:**
Create a new file named `overrides/another-llm-inference-service.yaml`. This file defines the new model and disables the creation of resources, such as secrets, that the first LLM already created.
    ```yaml
    dsc:
      initialize: false
    externalSecret:
      create: false

    # Define the new InferenceService
    inferenceService:
      name: hf-inference-service # <-- New service name
      minReplicas: 1
      maxReplicas: 1
      resources:
        requests:
          cpu: "8"
          memory: 32Gi
        limits:
          cpu: "12"
          memory: 32Gi

    # Define the new runtime (HuggingFace)
    servingRuntime:
      name: hf-runtime
      port: 8080
      image: docker.io/kserve/huggingfaceserver:latest
      modelFormat: huggingface
      args:
        - --model_dir
        - /models
        - --model_name
        - /models/Mistral-7B-Instruct-v0.3
        - --http_port
        - "8080"

    # Define the new model to download
    model:
      repository: mistralai/Mistral-7B-Instruct-v0.3
      files:
        - generation_config.json
        - config.json
        - model.safetensors.index.json
        - model-00001-of-00003.safetensors
        - model-00002-of-00003.safetensors
        - model-00003-of-00003.safetensors
        - tokenizer.model
        - tokenizer.json
        - tokenizer_config.json
    ```

    > **IMPORTANT:** A known issue in the model-downloading container requires that you explicitly list all files that you want to download from the HuggingFace repository. Ensure that you list every file required for the model to run.

### **Step 2: Adding the new LLM to the demonstration UI**

Configure the frontend to recognize the new LLM.

1. **Edit the frontend overrides**:
Open the `overrides/rag-llm-frontend-values.yaml` file.
2. **Update LLM_URLS:**
Add the URL of the new service to the `LLM_URLS` environment variable. The URL uses the `http://<inference-service-name>-predictor/v1` format, or `http://<inference-service-name>-predictor/openai/v1` for the HuggingFace runtime.
In the `overrides/rag-llm-frontend-values.yaml` file:

    ```yaml
    env:
      # ...
      - name: LLM_URLS
        value: '["http://cpu-inference-service-predictor/v1","http://hf-inference-service-predictor/openai/v1"]'
    ```
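After ArgoCD synchronizes both steps, you can optionally confirm that the second model server is ready before selecting it in the UI. The following is a minimal sketch; the resource names are taken from the override example above, so adjust them if you chose different names:

```sh
# Minimal sketch: confirm the new InferenceService reports Ready
# before selecting it in the demonstration UI.
# Names are taken from the override example above; adjust as needed.
oc get inferenceservice hf-inference-service -n rag-llm-cpu
oc get pods -n rag-llm-cpu | grep hf-inference-service
```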
## **Task: Customizing RAG data sources**

By default, the pattern ingests data from the Validated Patterns documentation. You can change this to point to public Git repositories or web pages.

1. **Edit the vector database overrides:**
Open the `overrides/vector-db-values.yaml` file.
2. **Update sources:**
Modify the `repoSources` and `webSources` keys. You can add any publicly available Git repository or public web URL. The job also processes PDF files from `webSources`.
In the `overrides/vector-db-values.yaml` file:

    ```yaml
    providers:
      qdrant:
        enabled: true
      mssql:
        enabled: true

    vectorEmbedJob:
      repoSources:
        - repo: https://github.com/your-org/your-docs.git # <-- Your repo
          globs:
            - "**/*.md"
      webSources:
        - https://your-company.com/product-manual.pdf # <-- Your PDF
      chunking:
        size: 4096
    ```

## **Task: Adding a new RAG database provider**

By default, the pattern enables `qdrant` and `mssql`. You can also enable `redis`, `pgvector`, or `elastic`. This process requires three steps: adding secrets, enabling the database, and configuring the UI.

### **Step 1: Updating the secrets file**

1. If the new database requires credentials, add them to the main secrets file:

    ```sh
    $ vim ~/values-secret-rag-llm-cpu.yaml
    ```
2. Add the necessary credentials. For example:

    ```yaml
    secrets:
      # ...
      - name: pgvector
        fields:
          - name: user
            value: user # <-- Update the user
          - name: password
            value: password # <-- Update the password
          - name: db
            value: db # <-- Update the db
    ```

> **NOTE:** For information about the expected values, see the [`values-secret.yaml.template`](https://github.com/validatedpatterns-sandbox/rag-llm-cpu/blob/main/values-secret.yaml.template) file.

### **Step 2: Enabling the provider in the vector database chart**

Edit the `overrides/vector-db-values.yaml` file and set `enabled: true` for the providers that you want to add.

In the `overrides/vector-db-values.yaml` file:

```yaml
providers:
  qdrant:
    enabled: true
  mssql:
    enabled: true
  pgvector: # <-- ADD THIS
    enabled: true
  elastic: # <-- OR THIS
    enabled: true
```
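After ArgoCD synchronizes this change, you can optionally verify that the new provider's pod started and that its Vault-backed credentials synchronized before moving on to the UI configuration in Step 3. This is a minimal sketch that assumes the pattern's default `rag-llm-cpu` namespace and the External Secrets-based secret flow described above:

```sh
# Minimal sketch: check that the new provider's pod is running and
# that its Vault-backed secret has synchronized.
# The namespace is this pattern's default; adjust if you changed it.
oc get pods -n rag-llm-cpu
oc get externalsecrets -n rag-llm-cpu
```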
### **Step 3: Adding the provider to the demonstration UI**

Edit the `overrides/rag-llm-frontend-values.yaml` file to configure the UI:

1. Add the secrets for the new provider to the `dbProvidersSecret.vault` list.
2. Add the connection details for the new provider to the `dbProvidersSecret.providers` list.

The following example shows the configuration for non-default RAG database providers:

In the `overrides/rag-llm-frontend-values.yaml` file:

```yaml
dbProvidersSecret:
  vault:
    - key: mssql
      field: sapassword
    - key: pgvector # <-- Add this block
      field: user
    - key: pgvector
      field: password
    - key: pgvector
      field: db
    - key: elastic # <-- Add this block
      field: user
    - key: elastic
      field: password
  providers:
    - type: qdrant # <-- Example for Qdrant
      collection: docs
      url: http://qdrant-service:6333
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: mssql # <-- Example for MSSQL
      table: docs
      connection_string: >-
        Driver={ODBC Driver 18 for SQL Server};
        Server=mssql-service,1433;
        Database=embeddings;
        UID=sa;
        PWD={{ .mssql_sapassword }};
        TrustServerCertificate=yes;
        Encrypt=no;
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: redis # <-- Example for Redis
      index: docs
      url: redis://redis-service:6379
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: elastic # <-- Example for Elastic
      index: docs
      url: http://elastic-service:9200
      user: "{{ .elastic_user }}"
      password: "{{ .elastic_password }}"
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: pgvector # <-- Example for PGVector
      collection: docs
      url: >-
        postgresql+psycopg://{{ .pgvector_user }}:{{ .pgvector_password }}@pgvector-service:5432/{{ .pgvector_db }}
      embedding_model: sentence-transformers/all-mpnet-base-v2
```
diff --git a/content/patterns/rag-llm-cpu/getting-started.md b/content/patterns/rag-llm-cpu/getting-started.md
new file mode 100644
index 000000000..f727808ff
--- /dev/null
+++ b/content/patterns/rag-llm-cpu/getting-started.md
@@ -0,0 +1,88 @@
---
title: Getting Started
weight: 10
aliases: /rag-llm-cpu/getting-started/
---

## Prerequisites

* Podman is installed on your system.
* You are logged into a Red Hat OpenShift 4 cluster with administrator permissions.

## Deploying the pattern

1. Fork the [rag-llm-cpu](https://github.com/validatedpatterns-sandbox/rag-llm-cpu) Git repository.

2. Clone the forked repository by running the following command:

    ```sh
    $ git clone git@github.com:your-username/rag-llm-cpu.git
    ```

3. Navigate to the root directory of your Git repository:

    ```sh
    $ cd rag-llm-cpu
    ```

4. Create a local copy of the secret values file by running the following command:

    ```sh
    $ cp values-secret.yaml.template ~/values-secret-rag-llm-cpu.yaml
    ```

5. Create an API token on [HuggingFace](https://huggingface.co/).

6. Update the secret values file:

    ```sh
    $ vim ~/values-secret-rag-llm-cpu.yaml
    ```

    > **NOTE**: Update the value of the `token` field in the `huggingface` section with the API token from the previous step. By default, this pattern deploys Microsoft SQL Server as a retrieval-augmented generation (RAG) database provider. Update the `sapassword` field in the `mssql` section. If you plan to use other database providers, update those secrets as well.

7. To install the pattern without modifications, run the following commands:

    ```sh
    $ ./pattern.sh oc whoami --show-console
    ```

    The output displays the cluster where the pattern will be installed. If the correct cluster is not displayed, log into your OpenShift cluster.

    ```sh
    $ ./pattern.sh make install
    ```

    ArgoCD deploys the components after you run the install command.
    To check the status of the components after the installation completes, run the following command:

    ```sh
    $ ./pattern.sh make argo-healthcheck
    ```

8. To make changes to the pattern before installation, such as using different RAG database providers or changing the large language model (LLM), see [Configuring this Pattern](/rag-llm-cpu/configure/).

## Verifying the installation

1. Confirm that all applications are successfully installed:

    ```sh
    $ ./pattern.sh make argo-healthcheck
    ```

    It might take several minutes for all applications to synchronize and reach a healthy state because the process includes downloading the LLM models and populating the RAG databases.

    ![Healthcheck](/images/rag-llm-cpu/healthcheck.png)

2. Open the **RAG LLM Demo UI** by clicking the link in the **Red Hat applications** menu.

    ![9Dots](/images/rag-llm-cpu/9dots.png)

3. Confirm that the configured LLMs and RAG database providers are available. Verify that a query in the chatbot triggers a response from the selected RAG database and LLM.

    > **NOTE**: The CPU-based LLM might take approximately one minute to start streaming a response during the first query because the system must load the model into memory.

    ![App](/images/rag-llm-cpu/app.png)

## Next Steps

After the pattern is running, you can customize the configuration. See [Configuring this Pattern](/rag-llm-cpu/configure/) for information about changing the LLM, adding RAG sources, or switching vector databases.
diff --git a/content/patterns/rag-llm-cpu/ibm-fusion.md b/content/patterns/rag-llm-cpu/ibm-fusion.md
new file mode 100644
index 000000000..73dd4f4d8
--- /dev/null
+++ b/content/patterns/rag-llm-cpu/ibm-fusion.md
@@ -0,0 +1,9 @@
---
title: GPU-free RAG LLM pattern on IBM Fusion
weight: 30
aliases: /rag-llm-cpu/ibm-fusion/
---

# **GPU-free RAG LLM pattern on IBM Fusion**

This pattern can also be deployed with IBM Fusion. For more details, see the [IBM Community Post](https://community.ibm.com/community/user/blogs/saif-adil/2026/01/08/deploying-a-gpu-free-rag-llm-pattern-on-ibm-fusion).
diff --git a/static/images/rag-llm-cpu/9dots.png b/static/images/rag-llm-cpu/9dots.png
new file mode 100644
index 000000000..fe318b5a6
Binary files /dev/null and b/static/images/rag-llm-cpu/9dots.png differ
diff --git a/static/images/rag-llm-cpu/app.png b/static/images/rag-llm-cpu/app.png
new file mode 100644
index 000000000..36705b8d7
Binary files /dev/null and b/static/images/rag-llm-cpu/app.png differ
diff --git a/static/images/rag-llm-cpu/healthcheck.png b/static/images/rag-llm-cpu/healthcheck.png
new file mode 100644
index 000000000..06d30ad45
Binary files /dev/null and b/static/images/rag-llm-cpu/healthcheck.png differ
diff --git a/static/images/rag-llm-cpu/rag-augmented-query.png b/static/images/rag-llm-cpu/rag-augmented-query.png
new file mode 100644
index 000000000..d166c53ea
Binary files /dev/null and b/static/images/rag-llm-cpu/rag-augmented-query.png differ