74 changes: 74 additions & 0 deletions content/patterns/rag-llm-cpu/_index.md
@@ -0,0 +1,74 @@
---
title: RAG LLM Chatbot on CPU
date: 2025-10-24
tier: sandbox
summary: This pattern deploys a CPU-based LLM, your choice of several RAG database providers, and a simple chatbot UI that exposes the configuration and results of the RAG queries.
rh_products:
- Red Hat OpenShift Container Platform
- Red Hat OpenShift GitOps
- Red Hat OpenShift AI
partners:
- Microsoft
- IBM Fusion
industries:
- General
aliases: /rag-llm-cpu/
links:
github: https://github.com/validatedpatterns-sandbox/rag-llm-cpu
install: getting-started
bugs: https://github.com/validatedpatterns-sandbox/rag-llm-cpu/issues
feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform
---

# **CPU-based RAG LLM chatbot**

## **Introduction**

The CPU-based RAG LLM chatbot Validated Pattern deploys a retrieval-augmented generation (RAG) chatbot on Red Hat OpenShift by using Red Hat OpenShift AI.
The pattern runs entirely on CPU nodes without requiring GPU hardware, which provides a cost-effective and accessible solution for environments where GPU resources are limited or unavailable.
This pattern provides a secure, flexible, and production-ready starting point for building and deploying on-premise generative AI applications.

## **Target audience**

This pattern is intended for the following users:

- **Developers & Data Scientists** who want to build and experiment with RAG-based large language model (LLM) applications.
- **MLOps & DevOps Engineers** who are responsible for deploying and managing AI/ML workloads on OpenShift.
- **Architects** who evaluate cost-effective methods for delivering generative AI capabilities on-premise.

## **Why Use This Pattern?**

- **Cost-Effective**: The pattern runs entirely on CPU nodes, which removes the need for expensive and scarce GPU resources.
- **Flexible**: The pattern supports multiple vector database backends, such as Elasticsearch, PGVector, and Microsoft SQL Server, to integrate with existing data infrastructure.
- **Transparent**: The Gradio frontend exposes the internals of the RAG query and LLM prompts, which provides insight into the generation process.
- **Extensible**: The pattern uses open-source standards, such as KServe and OpenAI-compatible APIs, to serve as a foundation for complex applications.

## **Architecture Overview**

At a high level, the components work together in the following sequence:

1. A user enters a query into the **Gradio UI**.
2. The backend application, using **LangChain**, queries a configured **vector database** to retrieve relevant documents.
3. These documents are combined with the original query from the user into a prompt.
4. The prompt is sent to the **KServe-deployed LLM**, which runs via **llama.cpp** on a CPU node.
5. The LLM generates a response, which is streamed back to the **Gradio UI**.
6. **Vault** provides the necessary credentials for the vector database and HuggingFace token at runtime.

![Overview](/images/rag-llm-cpu/rag-augmented-query.png)

_Figure 1. Overview of a RAG query from the user's perspective._
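To make the flow concrete, the following is a minimal smoke-test sketch that queries the model's OpenAI-compatible endpoint directly, bypassing the UI. It assumes the pattern's default service name (`cpu-inference-service-predictor`) and namespace (`rag-llm-cpu`), and that the command runs from a pod inside the cluster:

```sh
# Hypothetical smoke test of the llama.cpp chat endpoint; the service name and
# namespace are the pattern defaults and may differ in your deployment.
curl -s http://cpu-inference-service-predictor.rag-llm-cpu.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral-7b-instruct",
        "messages": [{"role": "user", "content": "What is a Validated Pattern?"}]
      }'
# The "model" value is illustrative; llama.cpp typically serves the single loaded model.
```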

## **Prerequisites**

Before you begin, ensure that you have access to the following resources:

- A Red Hat OpenShift cluster version 4.x. (The recommended size is at least two `m5.4xlarge` nodes.)
- A HuggingFace API token.
- The `Podman` command-line tool.
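A quick sanity check of the client tooling might look like the following sketch (the `oc` client is assumed for cluster access, and the token value is a placeholder):

```sh
# Confirm the client tooling is available
podman --version
oc version --client

# Keep the HuggingFace token at hand for the secrets file created during installation
export HF_TOKEN=<your-huggingface-api-token>
```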

## **What This Pattern Provides**

- A [KServe](https://github.com/kserve/kserve)-based LLM deployed to [Red Hat OpenShift AI](https://www.redhat.com/en/products/ai/openshift-ai) that runs entirely on a CPU-node with a [llama.cpp](https://github.com/ggml-org/llama.cpp) runtime.
- A choice of one or more vector database providers to serve as a RAG backend with configurable web-based or Git repository-based sources. Vector embedding and document retrieval are implemented with [LangChain](https://docs.langchain.com/oss/python/langchain/overview).
- [Vault](https://developer.hashiCorp.com/vault)-based secret management for a HuggingFace API token and credentials for the supported databases ([Elasticsearch](https://www.elastic.co/docs/solutions/search/vector), [PGVector](https://github.com/pgvector/pgvector), [Microsoft SQL Server](https://learn.microsoft.com/en-us/sql/sql-server/ai/vectors?view=sql-server-ver17)).
- A [Gradio](https://www.gradio.app/)-based frontend for connecting to multiple [OpenAI API-compatible](https://github.com/openai/openai-openapi) LLMs. This frontend exposes the internals of the RAG query and LLM prompts so that users have insight into the generation process.
286 changes: 286 additions & 0 deletions content/patterns/rag-llm-cpu/configure.md
@@ -0,0 +1,286 @@
---
title: Configuring this pattern
weight: 20
aliases: /rag-llm-cpu/configure/
---

# **Configuring this pattern**

This guide covers common customizations, such as changing the default large language model (LLM), adding new models, and configuring retrieval-augmented generation (RAG) data sources. This guide assumes that you have already completed the [Getting started](/rag-llm-cpu/getting-started/) guide.

## **Configuration overview**

ArgoCD manages this pattern by using GitOps. All application configurations are defined in the `values-prod.yaml` file. To customize a component, complete the following steps:

1. **Enable an override:** In the `values-prod.yaml` file, locate the application that you want to change, such as `llm-inference-service`, and add an `extraValueFiles:` entry that points to a new override file, such as `$patternref/overrides/llm-inference-service.yaml`.
2. **Create the override file:** Create the new `.yaml` file in the `/overrides` directory.
3. **Add settings:** Add the specific values that you want to change to the new file.
4. **Commit and synchronize:** Commit your changes and allow ArgoCD to synchronize the application, as shown in the sketch after this list.
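In practice, step 4 is an ordinary Git push followed by an ArgoCD reconciliation. A minimal sketch, assuming your working copy is your fork of the pattern repository and that you have cluster-admin access:

```sh
# Commit the override and the updated values file to the branch the pattern tracks
git add values-prod.yaml overrides/llm-inference-service.yaml
git commit -m "Add override for llm-inference-service"
git push

# Watch ArgoCD pick up the change; list Application resources across namespaces
oc get applications.argoproj.io -A
```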

## **Task: Changing the default LLM**

By default, the pattern deploys the `mistral-7b-instruct-v0.2.Q5_0.gguf` model. You can change this to a different model or quantization, or adjust the resource usage. To change the default LLM, create an override file for the existing `llm-inference-service` application.

1. **Enable the override:**
In the `values-prod.yaml` file, update the `llm-inference-service` application to use an override file:
```yaml
clusterGroup:
  # ...
  applications:
    # ...
    llm-inference-service:
      name: llm-inference-service
      namespace: rag-llm-cpu
      chart: llm-inference-service
      chartVersion: 0.3.*
      extraValueFiles: # <-- ADD THIS BLOCK
        - $patternref/overrides/llm-inference-service.yaml
```

2. **Create the override file:**
Create a new file named `overrides/llm-inference-service.yaml`. The following example switches to a different model file (Q8_0) and increases the CPU and memory requests:
```yaml
inferenceService:
  resources: # <-- Increased allocated resources
    requests:
      cpu: "8"
      memory: 12Gi
    limits:
      cpu: "12"
      memory: 24Gi

servingRuntime:
  args:
    - --model
    - /models/mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed model file

model:
  repository: TheBloke/Mistral-7B-Instruct-v0.2-GGUF
  files:
    - mistral-7b-instruct-v0.2.Q8_0.gguf # <-- Changed file to download
```
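After ArgoCD synchronizes, you can confirm that KServe rolled out the change. A hedged verification sketch (the resource name `cpu-inference-service` and the KServe pod label are assumptions based on the pattern defaults):

```sh
# The InferenceService should report READY once the new model file is loaded
oc get inferenceservice -n rag-llm-cpu

# The predictor logs show which GGUF file llama.cpp loaded
oc logs -n rag-llm-cpu -l serving.kserve.io/inferenceservice=cpu-inference-service --tail=20
```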

## **Task: Adding a second LLM**

You can deploy an additional LLM and add it to the demonstration user interface (UI). The following example deploys the KServe HuggingFace serving runtime instead of `llama.cpp`. This process requires two steps: deploying the new LLM and configuring the frontend UI.

### **Step 1: Deploying the new LLM service**

1. **Define the new application:**
In the `values-prod.yaml` file, add a new application named `another-llm-inference-service` to the applications list.

```yaml
clusterGroup:
  # ...
  applications:
    # ...
    another-llm-inference-service: # <-- ADD THIS NEW APPLICATION
      name: another-llm-inference-service
      namespace: rag-llm-cpu
      chart: llm-inference-service
      chartVersion: 0.3.*
      extraValueFiles:
        - $patternref/overrides/another-llm-inference-service.yaml
```

2. **Create the override file:**
Create a new file named `overrides/another-llm-inference-service.yaml`. This file defines the new model and disables the creation of resources, such as secrets, that the first LLM already created.
```yaml
dsc:
  initialize: false
externalSecret:
  create: false

# Define the new InferenceService
inferenceService:
  name: hf-inference-service # <-- New service name
  minReplicas: 1
  maxReplicas: 1
  resources:
    requests:
      cpu: "8"
      memory: 32Gi
    limits:
      cpu: "12"
      memory: 32Gi

# Define the new runtime (KServe HuggingFace server)
servingRuntime:
  name: hf-runtime
  port: 8080
  image: docker.io/kserve/huggingfaceserver:latest
  modelFormat: huggingface
  args:
    - --model_dir
    - /models
    - --model_name
    - /models/Mistral-7B-Instruct-v0.3
    - --http_port
    - "8080"

# Define the new model to download
model:
  repository: mistralai/Mistral-7B-Instruct-v0.3
  files:
    - generation_config.json
    - config.json
    - model.safetensors.index.json
    - model-00001-of-00003.safetensors
    - model-00002-of-00003.safetensors
    - model-00003-of-00003.safetensors
    - tokenizer.model
    - tokenizer.json
    - tokenizer_config.json
```

> **IMPORTANT:** A known issue in the model-downloading container requires that you explicitly list all files that you want to download from the HuggingFace repository. Ensure that you list every file required for the model to run.
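To fill in the `files:` list accurately, you can enumerate a repository's contents through the public HuggingFace Hub API. A sketch, assuming `jq` is installed and that gated repositories require your API token:

```sh
# List every file in the repository (the Authorization header is needed for gated models)
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  https://huggingface.co/api/models/mistralai/Mistral-7B-Instruct-v0.3 \
  | jq -r '.siblings[].rfilename'
```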

### **Step 2: Adding the new LLM to the demonstration UI**

Configure the frontend to recognize the new LLM.

1. **Edit the frontend overrides**:
Open the `overrides/rag-llm-frontend-values.yaml` file.
2. **Update LLM_URLS:**
Add the URL of the new service to the `LLM_URLS` environment variable. The URL follows the `http://<service-name>-predictor/v1` format, or `http://<service-name>-predictor/openai/v1` for the HuggingFace runtime.
In the `overrides/rag-llm-frontend-values.yaml` file:

```yaml
env:
  # ...
  - name: LLM_URLS
    value: '["http://cpu-inference-service-predictor/v1","http://hf-inference-service-predictor/openai/v1"]'
```
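After the frontend redeploys, a quick check that both endpoints answer the OpenAI-compatible model-listing call can be run from any pod in the `rag-llm-cpu` namespace (the `/openai/v1` path for the HuggingFace runtime follows the assumption above):

```sh
# Both services should enumerate the models they serve
curl -s http://cpu-inference-service-predictor/v1/models
curl -s http://hf-inference-service-predictor/openai/v1/models
```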

## **Task: Customizing RAG data sources**

By default, the pattern ingests data from the Validated Patterns documentation. You can change this to point to public Git repositories or web pages.

1. **Edit the vector database overrides:**
Open the `overrides/vector-db-values.yaml` file.
2. **Update sources:**
Modify the `repoSources` and `webSources` keys. You can add any publicly available Git repository or public web URL. The job also processes PDF files from `webSources`.
In the `overrides/vector-db-values.yaml` file:

```yaml
providers:
  qdrant:
    enabled: true
  mssql:
    enabled: true

vectorEmbedJob:
  repoSources:
    - repo: https://github.com/your-org/your-docs.git # <-- Your repo
      globs:
        - "**/*.md"
  webSources:
    - https://your-company.com/product-manual.pdf # <-- Your PDF
  chunking:
    size: 4096
```
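Because the embedding runs as a one-shot Kubernetes job, changed sources are not re-ingested until the job runs again. One hedged approach is to delete the completed job and let ArgoCD re-create it on the next sync (the job name below is hypothetical; check the real name first):

```sh
# Find the embedding job's actual name
oc get jobs -n rag-llm-cpu

# Delete the completed job so that ArgoCD re-creates and re-runs it
oc delete job vector-embed-job -n rag-llm-cpu   # hypothetical job name
```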

## **Task: Adding a new RAG database provider**

By default, the pattern enables `qdrant` and `mssql`. You can also enable `redis`, `pgvector`, or `elastic`. This process requires three steps: adding secrets, enabling the database, and configuring the UI.

### **Step 1: Updating the secrets file**

1. If the new database requires credentials, add them to the main secrets file:

```sh
vim ~/values-secret-rag-llm-cpu.yaml
```
2. Add the necessary credentials. For example:

```yaml
secrets:
  # ...
  - name: pgvector
    fields:
      - name: user
        value: user # <-- Update the user
      - name: password
        value: password # <-- Update the password
      - name: db
        value: db # <-- Update the db
```

> **NOTE:** For information about the expected values, see the [`values-secret.yaml.template`](https://github.com/validatedpatterns-sandbox/rag-llm-cpu/blob/main/values-secret.yaml.template) file.
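After editing the file, load the secrets into Vault by using the standard Validated Patterns workflow, a sketch assuming that you run it from the root of your pattern clone:

```sh
# Reloads ~/values-secret-rag-llm-cpu.yaml into Vault through the pattern utility container
./pattern.sh make load-secrets
```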

### **Step 2: Enabling the provider in the vector database chart**

Edit the `overrides/vector-db-values.yaml` file and set `enabled: true` for the providers that you want to add.

In the `overrides/vector-db-values.yaml` file:

```yaml
providers:
  qdrant:
    enabled: true
  mssql:
    enabled: true
  pgvector: # <-- ADD THIS
    enabled: true
  elastic: # <-- OR THIS
    enabled: true
```
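After ArgoCD synchronizes, the pods for the new providers should appear alongside the defaults. A quick hedged check (pod names vary by provider chart):

```sh
# Expect pgvector and/or elastic pods in addition to the qdrant and mssql ones
oc get pods -n rag-llm-cpu
```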

### **Step 3: Adding the provider to the demonstration UI**

Edit the `overrides/rag-llm-frontend-values.yaml` file to configure the UI:

1. Add the secrets for the new provider to the `dbProvidersSecret.vault` list.
2. Add the connection details for the new provider to the `dbProvidersSecret.providers` list.

The following example shows the configuration for non-default RAG database providers:

In the `overrides/rag-llm-frontend-values.yaml` file:

```yaml
dbProvidersSecret:
  vault:
    - key: mssql
      field: sapassword
    - key: pgvector # <-- Add this block
      field: user
    - key: pgvector
      field: password
    - key: pgvector
      field: db
    - key: elastic # <-- Add this block
      field: user
    - key: elastic
      field: password
  providers:
    - type: qdrant # <-- Example for Qdrant
      collection: docs
      url: http://qdrant-service:6333
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: mssql # <-- Example for MSSQL
      table: docs
      connection_string: >-
        Driver={ODBC Driver 18 for SQL Server};
        Server=mssql-service,1433;
        Database=embeddings;
        UID=sa;
        PWD={{ .mssql_sapassword }};
        TrustServerCertificate=yes;
        Encrypt=no;
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: redis # <-- Example for Redis
      index: docs
      url: redis://redis-service:6379
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: elastic # <-- Example for Elastic
      index: docs
      url: http://elastic-service:9200
      user: "{{ .elastic_user }}"
      password: "{{ .elastic_password }}"
      embedding_model: sentence-transformers/all-mpnet-base-v2
    - type: pgvector # <-- Example for PGVector
      collection: docs
      url: >-
        postgresql+psycopg://{{ .pgvector_user }}:{{ .pgvector_password }}@pgvector-service:5432/{{ .pgvector_db }}
      embedding_model: sentence-transformers/all-mpnet-base-v2
```