TabRAG: Improving Tabular Document Question Answering for Retrieval Augmented Generation via Structured Representations

Jacob Si^1* | Mike Qu^2* | Michelle Lee¹ | Marek Rei¹ | Yingzhen Li¹

¹Imperial College London ²Columbia University

Figure 1: The TabRAG Architecture, a parsing-based RAG pipeline designed specifically for tables.

Installation

Enviroment Installation.

Clone this repository and navigate to it in your terminal. Create an environment using a preferred package manager.

Note: can replace conda with uv.

conda create --name tabrag python=3.10
conda activate tabrag

Installing Dependencies

pip install torch
pip install 'git+https://github.com/facebookresearch/detectron2.git' --no-build-isolation
pip install pymupdf
pip install faiss-gpu
pip install timm
pip install shapely
pip install qwen_vl_utils
pip install gdown
pip install opencv-python
pip install pypdf
pip install arxiv
pip install weasyprint
pip install datasets
pip install vllm
pip install sentence_transformers
pip install numpy==1.26.4
pip install accelerate

Optional: Installing Tesseract OCR and PyTesseract

This guide explains how to install a baseline in our paper, Tesseract OCR, and use it in Python via PyTesseract by building from source. Official build guide: https://tesseract-ocr.github.io/tessdoc/Compiling.html

Install Python OCR dependencies inside your project environment

pip install pytesseract Pillow

Create a build directory

mkdir -p $HOME/tesseract_build
cd $HOME/tesseract_build

Download

Tesseract Source

git clone https://github.com/tesseract-ocr/tesseract.git

Leptonica Source

git clone https://github.com/DanBloomberg/leptonica.git

Build & install locally

cd $HOME/tesseract_build/leptonica
./autobuild
./configure --prefix=$HOME/tesseract_build/install
make -j$(nproc)
make install

cd ../tesseract
./autogen.sh
LIBLEPT_HEADERSDIR=$HOME/tesseract_build/install/include ./configure \
  --prefix=$HOME/tesseract_build/install \
  --with-extra-libraries=$HOME/tesseract_build/install/lib
make -j$(nproc)
make install

Verify installation

$HOME/tesseract_build/install/bin/tesseract --version

Set environment variable for running any OCR script

export PATH=$HOME/tesseract_build/install/bin:$PATH
export LD_LIBRARY_PATH=$HOME/tesseract_build/install/lib:$LD_LIBRARY_PATH
export TESSDATA_PREFIX=$HOME/tesseract_build/install/share/tessdata

which tesseract
tesseract --version

Download language data for tesseract to perform OCR

mkdir -p $HOME/tesseract_build/install/share/tessdata
cd $HOME/tesseract_build/install/share/tessdata
wget https://github.com/tesseract-ocr/tessdata_best/raw/main/eng.traineddata

Layout Model

Microsoft's DiT model (Document Image Transformer) is used for layout extraction: https://github.com/microsoft/unilm/tree/master/dit

Download the DiT-Large checkpoint pretrained on the Publaynet Dataset: https://huggingface.co/HYPJUDY/dit/resolve/main/dit-fts/icdar19modern_dit-l_cascade.pth

Move it to the project directory.

Datasets

Enter the datasets/ folder

cd datasets

TAT-DQA:

mkdir tatdqa
cd tatdqa
gdown https://drive.google.com/uc?id=1iqe5r-qgQZLhGtM4G6LkNp9S6OCwOF2L
# unzip the folder
gdown https://drive.google.com/uc?id=1ZQjjIC0BB14l6t9b1Ryq0t-CNAP6iC2J

python process_tatdqa.py

MP-DocVQA:

mkdir mpdocvqa
cd mpdocvqa
wget https://datasets.cvc.uab.es/rrc/DocVQA/Task4/images.tar.gz --no-check-certificate
tar -xvf images.tar.gz
python process_mpdocvqa.py # get documents with tables # EDIT run this will do
# python filter_mpdocvqa.py # select 500 pages based on qa:pages ratio
# python indent_mpdocvqa.py # visibility of val.json

WikiTableQuestions:

mkdir wikitablequestions
cd wikitablequestions
wget https://github.com/ppasupat/WikiTableQuestions/archive/refs/tags/v1.0.2.tar.gz
tar -xvf v1.0.2.tar.gz
python process_wikitq.py

TableVQA:

mkdir tablevqa
cd tablevqa
python process_tablevqa.py

ComTQA:

mkdir comtqa
cd comtqa
download the dataset: https://huggingface.co/datasets/ByteDance/ComTQA/blob/main/README.md
process the dataset:
python process_comtqa.py
python process_qa.py

Note that for new datasets, please convert all PDFs into single-page images then organize them as follows:

datasets/
├── {dataset_name}/
│   ├── retrieval
|   |   ├── {folder1_name} # all page images in this folder are treated as the same "document"
|   |   |   ├── page0.jpg
|   |   |   ├── page1.jpg
|   |   |   ├── ......

See TAT-DQA as reference.

Build Ragstore

To build a Ragstore, we can choose between using an externally-served VLM using VLLM, or using a provider like HuggingFace directly.

To serve a VLM externally:

vllm serve "Qwen/Qwen3-VL-8B-Instruct" --dtype auto --tensor-parallel-size 1 --max_model_len 96000 --gpu-memory-utilization 0.95 --port 1707

The first step involves construct the Self-Generated ICL examples. Depending on whether you choose to use HuggingFace or VLLM:

python generate_icl.py --model Qwen/Qwen3-VL-2B-Instruct --dataset tatdqa --use_hf

python generate_icl.py --model Qwen/Qwen3-VL-2B-Instruct --dataset tatdqa --use_vllm --vllm_ip localhost --vllm_port 2222

Afterwards, we can generate the TabRAG rationales and build the vector databases for downstream retrieval:

python main.py --model tabrag --mode generation --dataset tatdqa --vlm Qwen/Qwen3-VL-2B-Instruct --embedder Qwen/Qwen3-Embedding-0.6B --use_hf

python main.py --model tabrag --mode generation --dataset tatdqa --vlm Qwen/Qwen3-VL-2B-Instruct --embedder Qwen/Qwen3-Embedding-0.6B --use_vllm --vllm_ip localhost --vllm_port 2222

Query Engine

After constructing the vector database, we can perform queries to receive responses grounded in retrieved documents.

For inference, we can choose between using an externally-served VLM using VLLM, OpenAI's models, or a provider like HuggingFace directly.

To serve a LLM externally:

vllm serve "Qwen/Qwen3-8B" --dtype auto --tensor-parallel-size 1 --max_model_len 96000 --gpu-memory-utilization 0.95 --port 1707

Now, we can perform queries using the below commands, depending on the LLM inference provider

python inference.py --data_path "storages/[path_to_ragstore]" --embedder Qwen/Qwen3-Embedding-4B --model gpt-5-mini --use_openai

python inference.py --data_path "storages/[path_to_ragstore]" --embedder Qwen/Qwen3-Embedding-4B --model Qwen/Qwen3-8B --use_hf

python inference.py --data_path "storages/[path_to_ragstore]" --embedder Qwen/Qwen3-Embedding-4B --model Qwen/Qwen3-8B --use_hf --vllm_ip localhost --vllm_port 2222

Name		Name	Last commit message	Last commit date
Latest commit History 86 Commits
datasets		datasets
figs		figs
icl		icl
object_detection		object_detection
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
generate_icl.py		generate_icl.py
inference.py		inference.py
main.py		main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TabRAG: Improving Tabular Document Question Answering for Retrieval Augmented Generation via Structured Representations

Installation

Enviroment Installation.

Layout Model

Datasets

Build Ragstore

Query Engine

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

License

jacobyhsi/TabRAG

Folders and files

Latest commit

History

Repository files navigation

TabRAG: Improving Tabular Document Question Answering for Retrieval Augmented Generation via Structured Representations

Installation

Enviroment Installation.

Layout Model

Datasets

Build Ragstore

Query Engine

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages