74 changes: 74 additions & 0 deletions content/patterns/rag-llm-cpu/_index.md
@@ -0,0 +1,74 @@
---
title: RAG LLM Chatbot on CPU
date: 2025-10-24
tier: sandbox
summary: This pattern deploys a CPU-based LLM, your choice of several RAG database providers, and a simple chatbot UI that exposes the configuration and results of the RAG queries.
rh_products:
- Red Hat OpenShift Container Platform
- Red Hat OpenShift GitOps
- Red Hat OpenShift AI
partners:
- Microsoft
- IBM Fusion
industries:
- General
aliases: /rag-llm-cpu/
links:
github: https://github.com/validatedpatterns-sandbox/rag-llm-cpu
install: getting-started
bugs: https://github.com/validatedpatterns-sandbox/rag-llm-cpu/issues
feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
---

# **CPU-based RAG LLM chatbot**

## **Introduction**

The CPU-based RAG LLM chatbot Validated Pattern deploys a retrieval-augmented generation (RAG) chatbot on Red Hat OpenShift by using Red Hat OpenShift AI.
The pattern runs entirely on CPU nodes without requiring GPU hardware, which provides a cost-effective and accessible solution for environments where GPU resources are limited or unavailable.
This pattern provides a secure, flexible, and production-ready starting point for building and deploying on-premise generative AI applications.

## **Target audience**

This pattern is intended for the following users:

- **Developers & Data Scientists** who want to build and experiment with RAG-based large language model (LLM) applications.
- **MLOps & DevOps Engineers** who are responsible for deploying and managing AI/ML workloads on OpenShift.
- **Architects** who evaluate cost-effective methods for delivering generative AI capabilities on-premise.

## **Why Use This Pattern?**

- **Cost-Effective**: The pattern runs entirely on CPU nodes, which removes the need for expensive and scarce GPU resources.
- **Flexible**: The pattern supports multiple vector database backends, such as Elasticsearch, PGVector, and Microsoft SQL Server, to integrate with existing data infrastructure.
- **Transparent**: The Gradio frontend exposes the internals of the RAG query and LLM prompts, which provides insight into the generation process.
- **Extensible**: The pattern uses open-source standards, such as KServe and OpenAI-compatible APIs, to serve as a foundation for complex applications.

## **Architecture Overview**

At a high level, the components work together in the following sequence:

1. A user enters a query into the **Gradio UI**.
2. The backend application, using **LangChain**, queries a configured **vector database** to retrieve relevant documents.
3. These documents are combined with the original query from the user into a prompt.
4. The prompt is sent to the **KServe-deployed LLM**, which runs via **llama.cpp** on a CPU node.
5. The LLM generates a response, which is streamed back to the **Gradio UI**.
6. **Vault** provides the necessary credentials for the vector database and HuggingFace token at runtime.

![Overview](/images/rag-llm-cpu/rag-augmented-query.png)

_Figure 1. Overview of a RAG query from the user's perspective._
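To make the flow concrete, the following is a minimal smoke-test sketch that queries the model's OpenAI-compatible endpoint directly, bypassing the UI. It assumes the pattern's default service name (`cpu-inference-service-predictor`) and namespace (`rag-llm-cpu`), and that the command runs from a pod inside the cluster:

```sh
# Hypothetical smoke test of the llama.cpp chat endpoint; the service name and
# namespace are the pattern defaults and may differ in your deployment.
curl -s http://cpu-inference-service-predictor.rag-llm-cpu.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral-7b-instruct",
        "messages": [{"role": "user", "content": "What is a Validated Pattern?"}]
      }'
# The "model" value is illustrative; llama.cpp typically serves the single loaded model.
```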

## **Prerequisites**

Before you begin, ensure that you have access to the following resources:

- A Red Hat OpenShift cluster version 4.x. (The recommended size is at least two `m5.4xlarge` nodes.)
- A HuggingFace API token.
- The `Podman` command-line tool.
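A quick sanity check of the client tooling might look like the following sketch (the `oc` client is assumed for cluster access, and the token value is a placeholder):

```sh
# Confirm the client tooling is available
podman --version
oc version --client

# Keep the HuggingFace token at hand for the secrets file created during installation
export HF_TOKEN=<your-huggingface-api-token>
```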

## **What This Pattern Provides**

- A [KServe](https://github.com/kserve/kserve)-based LLM deployed to [Red Hat OpenShift AI](https://www.redhat.com/en/products/ai/openshift-ai) that runs entirely on a CPU-node with a [llama.cpp](https://github.com/ggml-org/llama.cpp) runtime.
- A choice of one or more vector database providers to serve as a RAG backend with configurable web-based or Git repository-based sources. Vector embedding and document retrieval are implemented with [LangChain](https://docs.langchain.com/oss/python/langchain/overview).
- [Vault](https://developer.hashiCorp.com/vault)-based secret management for a HuggingFace API token and credentials for the supported databases ([Elasticsearch](https://www.elastic.co/docs/solutions/search/vector), [PGVector](https://github.com/pgvector/pgvector), [Microsoft SQL Server](https://learn.microsoft.com/en-us/sql/sql-server/ai/vectors?view=sql-server-ver17)).
- A [Gradio](https://www.gradio.app/)-based frontend for connecting to multiple [OpenAI API-compatible](https://github.com/openai/openai-openapi) LLMs. This frontend exposes the internals of the RAG query and LLM prompts so that users have insight into the generation process.
286 changes: 286 additions & 0 deletions content/patterns/rag-llm-cpu/configure.md
@@ -0,0 +1,286 @@
---
title: Configuring this pattern
weight: 20
aliases: /rag-llm-cpu/configure/
---

# **Configuring this pattern**

This guide covers common customizations, such as changing the default large language model (LLM), adding new models, and configuring retrieval-augmented generation (RAG) data sources. This guide assumes that you have already completed the [Getting started](/rag-llm-cpu/getting-started/) guide.

## **Configuration overview**

ArgoCD manages this pattern by using GitOps. All application configurations are defined in the `values-prod.yaml` file. To customize a component, complete the following steps:

1. **Enable an override:** In the `values-prod.yaml` file, locate the application that you want to change, such as `llm-inference-service`, and add an `extraValueFiles:` entry that points to a new override file, such as `$patternref/overrides/llm-inference-service.yaml`.
2. **Create the override file:** Create the new `.yaml` file in the `/overrides` directory.
3. **Add settings:** Add the specific values that you want to change to the new file.
4. **Commit and synchronize:** Commit your changes and allow ArgoCD to synchronize the application, as shown in the sketch after this list.
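In practice, step 4 is an ordinary Git push followed by an ArgoCD reconciliation. A minimal sketch, assuming your working copy is your fork of the pattern repository and that you have cluster-admin access:

```sh
# Commit the override and the updated values file to the branch the pattern tracks
git add values-prod.yaml overrides/llm-inference-service.yaml
git commit -m "Add override for llm-inference-service"
git push

# Watch ArgoCD pick up the change; list Application resources across namespaces
oc get applications.argoproj.io -A
```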

## **Task: Changing the default LLM**

By default, the pattern deploys the `mistral-7b-instruct-v0.2.Q5_0.gguf` model. You can change this to a different model or quantization, or adjust the resource usage. To change the default LLM, create an override file for the existing `llm-inference-service` application.

1. **Enable the override:**
In the `values-prod.yaml` file, update the `llm-inference-service` application to use an override file:
```yaml
clusterGroup:
  # ...
  applications:
    # ...
    llm-inference-service:
      name: llm-inference-service
      namespace: rag-llm-cpu
      chart: llm-inference-service
      chartVersion: 0.3.*
      extraValueFiles: # <-- ADD THIS BLOCK
        - $patternref/overrides/llm-inference-service.yaml
```

2. **Create the override file:**
Create a new file named `overrides/llm-inference-service.yaml`. The following example switches to a different model file (Q8_0) and increases the CPU and memory requests:
```yaml
inferenceService:
  resources: # <-- Increased allocated resources
    requests:
      cpu: "8"
      memory: 12Gi
    limits:
      cpu: "12"
      memory: 24Gi

servingRuntime:
  args:
    - --model
    - /models/mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed model file

model:
  repository: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
  files:
    - mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed file to download
```
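After ArgoCD synchronizes, you can confirm that KServe rolled out the change. A hedged verification sketch (the resource name `cpu-inference-service` and the KServe pod label are assumptions based on the pattern defaults):

```sh
# The InferenceService should report READY once the new model file is loaded
oc get inferenceservice -n rag-llm-cpu

# The predictor logs show which GGUF file llama.cpp loaded
oc logs -n rag-llm-cpu -l serving.kserve.io/inferenceservice=cpu-inference-service --tail=20
```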

## **Task: Adding a second LLM**

You can deploy an additional LLM and add it to the demonstration user interface (UI). The following example deploys the KServe HuggingFace serving runtime instead of `llama.cpp`. This process requires two steps: deploying the new LLM and configuring the frontend UI.

### **Step 1: Deploying the new LLM service**

1. **Define the new application:**
In the `values-prod.yaml` file, add a new application named `another-llm-inference-service` to the applications list.

```yaml
clusterGroup:
  # ...
  applications:
    # ...
    another-llm-inference-service: # <-- ADD THIS NEW APPLICATION
      name: another-llm-inference-service
      namespace: rag-llm-cpu
      chart: llm-inference-service
      chartVersion: 0.3.*
      extraValueFiles:
        - $patternref/overrides/another-llm-inference-service.yaml
```

2. **Create the override file:**
Create a new file named `overrides/another-llm-inference-service.yaml`. This file defines the new model and disables the creation of resources, such as secrets, that the first LLM already created.
```yaml
dsc:
  initialize: false
externalSecret:
  create: false

# Define the new InferenceService
inferenceService:
  name: hf-inference-service # <-- New service name
  minReplicas: 1
  maxReplicas: 1
  resources:
    requests:
      cpu: "8"
      memory: 32Gi
    limits:
      cpu: "12"
      memory: 32Gi

# Define the new runtime (KServe HuggingFace server)
servingRuntime:
  name: hf-runtime
  port: 8080
  image: docker.io/kserve/huggingfaceserver:latest
  modelFormat: huggingface
  args:
    - --model_dir
    - /models
    - --model_name
    - /models/Mistral-7B-Instruct-v0.3
    - --http_port
    - "8080"

# Define the new model to download
model:
  repository: mistralai/Mistral-7B-Instruct-v0.3
  files:
    - generation_config.json
    - config.json
    - model.safetensors.index.json
    - model-00001-of-00003.safetensors
    - model-00002-of-00003.safetensors
    - model-00003-of-00003.safetensors
    - tokenizer.model
    - tokenizer.json
    - tokenizer_config.json
```

> **IMPORTANT:** A known issue in the model-downloading container requires that you explicitly list all files that you want to download from the HuggingFace repository. Ensure that you list every file required for the model to run.
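To fill in the `files:` list accurately, you can enumerate a repository's contents through the public HuggingFace Hub API. A sketch, assuming `jq` is installed and that gated repositories require your API token:

```sh
# List every file in the repository (the Authorization header is needed for gated models)
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  https://huggingface.co/api/models/mistralai/Mistral-7B-Instruct-v0.3 \
  | jq -r '.siblings[].rfilename'
```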

### **Step 2: Adding the new LLM to the demonstration UI**

Configure the frontend to recognize the new LLM.

1. **Edit the frontend overrides**:
Open the `overrides/rag-llm-frontend-values.yaml` file.
2. **Update LLM_URLS:**
Add the URL of the new service to the `LLM_URLS` environment variable. The URL follows the `http://<service-name>-predictor/v1` format, or `http://<service-name>-predictor/openai/v1` for the HuggingFace runtime.
In the `overrides/rag-llm-frontend-values.yaml` file:

```yaml
env:
  # ...
  - name: LLM_URLS
    value: '["http://cpu-inference-service-predictor/v1","http://hf-inference-service-predictor/openai/v1"]'
```
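After the frontend redeploys, a quick check that both endpoints answer the OpenAI-compatible model-listing call can be run from any pod in the `rag-llm-cpu` namespace (the `/openai/v1` path for the HuggingFace runtime follows the assumption above):

```sh
# Both services should enumerate the models they serve
curl -s http://cpu-inference-service-predictor/v1/models
curl -s http://hf-inference-service-predictor/openai/v1/models
```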

## **Task: Customizing RAG data sources**

By default, the pattern ingests data from the Validated Patterns documentation. You can change this to point to public Git repositories or web pages.

1. **Edit the vector database overrides:**
Open the `overrides/vector-db-values.yaml` file.
2. **Update sources:**
Modify the `repoSources` and `webSources` keys. You can add any publicly available Git repository or public web URL. The job also processes PDF files from `webSources`.
In the `overrides/vector-db-values.yaml` file:

```yaml
providers:
  qdrant:
    enabled: true
  mssql:
    enabled: true

vectorEmbedJob:
  repoSources:
    - repo: https://github.com/your-org/your-docs.git # <-- Your repo
      globs:
        - "**/*.md"
  webSources:
    - https://your-company.com/product-manual.pdf # <-- Your PDF
  chunking:
    size: 4096
```
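Because the embedding runs as a one-shot Kubernetes job, changed sources are not re-ingested until the job runs again. One hedged approach is to delete the completed job and let ArgoCD re-create it on the next sync (the job name below is hypothetical; check the real name first):

```sh
# Find the embedding job's actual name
oc get jobs -n rag-llm-cpu

# Delete the completed job so that ArgoCD re-creates and re-runs it
oc delete job vector-embed-job -n rag-llm-cpu   # hypothetical job name
```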

## **Task: Adding a new RAG database provider**

By default, the pattern enables `qdrant` and `mssql`. You can also enable `redis`, `pgvector`, or `elastic`. This process requires three steps: adding secrets, enabling the database, and configuring the UI.

### **Step 1: Updating the secrets file**

1. If the new database requires credentials, add them to the main secrets file:

```sh
vim ~/values-secret-rag-llm-cpu.yaml
```
2. Add the necessary credentials. For example:

```yaml
secrets:
  # ...
  - name: pgvector
    fields:
      - name: user
        value: user # <-- Update the user
      - name: password
        value: password # <-- Update the password
      - name: db
        value: db # <-- Update the db
```

> **NOTE:** For information about the expected values, see the [`values-secret.yaml.template`](https://github.com/validatedpatterns-sandbox/rag-llm-cpu/blob/main/values-secret.yaml.template) file.
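After editing the file, load the secrets into Vault by using the standard Validated Patterns workflow, a sketch assuming that you run it from the root of your pattern clone:

```sh
# Reloads ~/values-secret-rag-llm-cpu.yaml into Vault through the pattern utility container
./pattern.sh make load-secrets
```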

### **Step 2: Enabling the provider in the vector database chart**

Edit the `overrides/vector-db-values.yaml` file and set `enabled: true` for the providers that you want to add.

In the `overrides/vector-db-values.yaml` file:

```yaml
providers:
  qdrant:
    enabled: true
  mssql:
    enabled: true
  pgvector: # <-- ADD THIS
    enabled: true
  elastic: # <-- OR THIS
    enabled: true
```
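After ArgoCD synchronizes, the pods for the new providers should appear alongside the defaults. A quick hedged check (pod names vary by provider chart):

```sh
# Expect pgvector and/or elastic pods in addition to the qdrant and mssql ones
oc get pods -n rag-llm-cpu
```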

### **Step 3: Adding the provider to the demonstration UI**

Edit the `overrides/rag-llm-frontend-values.yaml` file to configure the UI:

1. Add the secrets for the new provider to the `dbProvidersSecret.vault` list.
2. Add the connection details for the new provider to the `dbProvidersSecret.providers` list.

The following example shows the configuration for non-default RAG database providers:

In the `overrides/rag-llm-frontend-values.yaml` file:

```yaml
dbProvidersSecret:
  vault:
    - key: mssql
      field: sapassword
    - key: pgvector # <-- Add this block
      field: user
    - key: pgvector
      field: password
    - key: pgvector
      field: db
    - key: elastic # <-- Add this block
      field: user
    - key: elastic
      field: password
  providers:
    - type: qdrant # <-- Example for Qdrant
      collection: docs
      url: http://qdrant-service:6333
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: mssql # <-- Example for MSSQL
      table: docs
      connection_string: >-
        Driver={ODBC Driver 18 for SQL Server};
        Server=mssql-service,1433;
        Database=embeddings;
        UID=sa;
        PWD={{ .mssql_sapassword }};
        TrustServerCertificate=yes;
        Encrypt=no;
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: redis # <-- Example for Redis
      index: docs
      url: redis://redis-service:6379
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: elastic # <-- Example for Elastic
      index: docs
      url: http://elastic-service:9200
      user: "{{ .elastic_user }}"
      password: "{{ .elastic_password }}"
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: pgvector # <-- Example for PGVector
      collection: docs
      url: >-
        postgresql+psycopg://{{ .pgvector_user }}:{{ .pgvector_password }}@pgvector-service:5432/{{ .pgvector_db }}
      embedding_model: sentence-transformers/all-mpnet-base-v2
```