This repo is a guide on how to structure genAI experiments, with a particular focus on the thought process and decisions involved in selecting a chunking strategy.
This repo is for software engineers who are starting out building generative AI applications, in particular Retrieval Augmented Generation (RAG) systems. If you've already built the "hello, world!" RAG apps and are wondering how to improve system performance to win over your users, this guide is for you.
As with most data problems, there's no single optimal configuration that works for all datasets. This guide outlines how to experiment with your chunking strategies, which levers you can pull, and how to measure performance.
We restrict the scope to chunking methodologies, but may expand to other aspects of RAG if there is enough demand.
- Azure Open AI resource
- Deployments of:
  - an embedding model
  - an LLM (we suggest gpt-4 for Q&A generation, and gpt-35-turbo-16k elsewhere)
- Python 3.10 onwards (tested on 3.11)
- Port 8080 available - MLFlow runs against port 8080 by default. If this is an issue you can follow these steps:
  - Change the forwarded port in the devcontainer.json and (re)build
  - Update the `!mlflow server --host 127.0.0.1 --port 8080` command in each notebook to reflect your port of choice
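For reference, the forwarded port is controlled by the `forwardPorts` field of devcontainer.json; a fragment switching to port 5001 might look like the following (other fields in your file are unaffected):

```json
{
  "forwardPorts": [5001]
}
```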
- You can either use the requirements.txt with your environment management tool of choice (conda, mamba, venv, etc.)
- Or use the devcontainer :)
- Create a .env file using the sample file
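The .env file typically holds your Azure OpenAI connection details. The variable names below are illustrative only - mirror whatever names the provided sample file defines:

```shell
# Illustrative only - use the variable names from the sample file
AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
AZURE_OPENAI_API_KEY="<your-key>"
```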
- Run and follow along with `00-Chunking Strategies.ipynb`
- Run and follow along with `01-Baseline Strategy.ipynb`
- Run and follow along with `02-Recursive Chunking.ipynb`
- Run and follow along with `03-Semantic Chunking.ipynb`
- View the results in MLFlow (launched from within the notebooks)
NOTE: Given the nature of RAG, running these notebooks from scratch can take some time and can be resource intensive. Some example outputs have been provided throughout in the chunking_Strategies/data folder to allow for quick exploration. Feel free to update the parameters at the top of the experiment notebooks and/or use different models in your .env file to run your own experiment and compare the results.
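For a flavour of what the notebooks compare, fixed-size and recursive chunking can be sketched as below. This is a simplified stand-in, not the notebooks' implementation; function names and defaults are ours:

```python
def chunk_fixed(text, chunk_size=200, overlap=50):
    """Naive fixed-size chunking with character overlap between chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks


def chunk_recursive(text, chunk_size=200, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first, recursing into finer
    separators only for pieces that still exceed chunk_size."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits; keep accumulating
        elif len(piece) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(chunk_recursive(piece, chunk_size, rest))
        else:
            if current:
                chunks.append(current)   # flush the full chunk
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Fixed-size chunking ignores document structure entirely, while the recursive variant tries to keep paragraphs and sentences intact - which is exactly the trade-off the experiments measure.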
- An approach to experimentation that can be used for any data / ML problem (see experiments)
- An overview of 2 popular chunking strategies and a comparison (which is only valid for the data used in this experiment!)
- A comprehensive overview of the decisions and influencing factors involved in choosing a chunking strategy
- Pre-generated evaluation data and experiment results
- Python code that is not suitable for production!