Dimantarian/chunky_monkey

Chunking Strategies

This repo is a guide to structuring genAI experiments, with a particular focus on the thought process and the decisions to be made when selecting a chunking strategy.

Who is this for

This repo is for software engineers who are starting out building generative AI applications, in particular Retrieval Augmented Generation (RAG) systems. If you've already built the "hello, world!" RAG apps and are wondering how to improve system performance to win over your users, this is for you.

As with most data problems, there's no single optimal configuration that works for all datasets. This guide outlines how to experiment with your chunking strategies, identifies the levers you can pull, and shows how to measure performance.

We restrict the scope to chunking methodologies, but may expand to other aspects of RAG if there is enough demand.
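To make "levers you can pull" concrete, the two most common knobs are chunk size and overlap. Here is a minimal, illustrative sketch of fixed-size chunking with overlap; the function and its parameters are hypothetical examples, not code from this repo:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    chunk_size and overlap are the two levers most often tuned first.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

# Example: the alphabet split into chunks of 10 characters with a 2-character overlap
pieces = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=2)
```

Larger chunks give the LLM more context per retrieval but dilute the embedding; overlap reduces the chance of splitting an answer across a chunk boundary at the cost of index size. These trade-offs are exactly what the experiments in this repo explore.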

Prereqs

  • Azure OpenAI resource
  • Deployments of:
    • an embedding model
    • an LLM (we suggest gpt-4 for Q&A generation, and gpt-35-turbo-16k elsewhere)
  • Python 3.10 or later (tested on 3.11)
  • Port 8080 available - MLflow runs on port 8080 by default. If this is an issue, follow these steps:
    • Change the forwarded port in the devcontainer.json and (re)build
    • Update the !mlflow server --host 127.0.0.1 --port 8080 command in each notebook to reflect your port of choice
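If you do change the port, the relevant devcontainer.json setting looks something like this (8081 is just an example value, not a port this repo requires):

```json
{
  "forwardPorts": [8081]
}
```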

Environment

  • You can either use the requirements.txt and your environment management tool of choice (conda, mamba, venv, etc.)
  • Or use the devcontainer :)

Setup

NOTE: Given the nature of RAG, running these notebooks from scratch can take some time and be resource intensive. Example outputs are provided throughout in the chunking_Strategies/data folder to allow for quick exploration. Feel free to update the parameters at the top of the experiment notebooks and/or use different models in your .env file to run your own experiment and compare the results.

What you'll find in this repo

  • An approach to experimentation that can be used for any data / ML problem (see experiments)
  • An overview of 2 popular chunking strategies and a comparison (that is only valid for the data used in this experiment!)
  • A comprehensive overview of the decisions and influencing factors when deciding on a chunking strategy
  • Pre-generated evaluation data and experiment results
  • Python code that is not suitable for production!
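Experiment results like these are typically scored with a simple retrieval metric such as hit-rate@k: the fraction of evaluation questions whose answer-bearing chunk appears in the top-k retrieved chunks. A minimal, illustrative sketch (the function and data shapes here are examples, not this repo's actual evaluation code):

```python
def hit_rate_at_k(results: list[list[str]], expected: list[str], k: int = 3) -> float:
    """Fraction of queries whose expected chunk appears in the top-k retrieved chunks.

    results[i] is the ranked list of chunk ids retrieved for query i;
    expected[i] is the chunk id that actually contains the answer.
    """
    hits = sum(1 for retrieved, gold in zip(results, expected) if gold in retrieved[:k])
    return hits / len(expected)

# Two evaluation queries: the first hits within the top 3, the second misses
score = hit_rate_at_k(
    results=[["c1", "c7", "c3"], ["c9", "c2", "c5"]],
    expected=["c7", "c4"],
    k=3,
)
```

Computing the same metric across several chunking configurations is what lets you compare them on your own data rather than relying on the comparison in this repo.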
