(WIP)Feat(benchmark): Add benchmark/RAG : RAG system evaluation framework#825

Draft
sponge225 wants to merge 23 commits into volcengine:main from sponge225:feat/rag

@sponge225 sponge225 commented Mar 20, 2026

Description

RAG benchmark

RAG benchmark is a framework for evaluating the performance of OpenViking's RAG (Retrieval-Augmented Generation) system. It supports multiple datasets and multiple evaluation metrics.

Features

  • Supports the Locomo, FinanceBench, Qasper, and SyllabusQA datasets
  • Complete evaluation pipeline: data preparation → vector retrieval → LLM generation → auto-grading
  • Evaluation metrics: Recall, F1 Score, Accuracy
  • Flexible YAML configuration
  • Extensible design
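
The description does not show how the metrics are computed. As a rough illustration only, the sketch below uses token-level F1 between a generated answer and a gold answer (a common choice for QA benchmarks) and recall over retrieved document IDs. The function names `token_f1` and `retrieval_recall` are hypothetical and are not taken from this PR; the actual implementation in benchmark/RAG may differ.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer (hypothetical helper)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def retrieval_recall(retrieved_ids, gold_ids) -> float:
    """Fraction of gold evidence IDs that appear among the retrieved IDs."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(retrieved_ids)) / len(gold)
```

Accuracy would then be the share of questions an LLM grader marks as correct, which is not sketched here because it depends on the grading prompt.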

See benchmark/RAG/README.md for detailed documentation.

Related Issue

Summary

This PR adds the RAG benchmark framework.

Closes #885

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions

Failed to generate code suggestions for PR

@sponge225 sponge225 changed the title from "Feat(benchmark): Add benchmark/RAG : RAG system evaluation framework" to "(WIP)Feat(benchmark): Add benchmark/RAG : RAG system evaluation framework" Mar 20, 2026
@sponge225 sponge225 marked this pull request as draft March 20, 2026 12:55
…pdates

- Add complete dataset sampling scripts with document-level sampling
- Implement filtering logic consistent with adapters (exclude category 5 for Locomo, no answer for SyllabusQA, unanswerable for Qasper)
- Update configuration from raw_data/dataset_dir to dataset_path for clarity
- Enhance adapters with improved path handling and data loading
- Add gitignore for data and output directories
- Add dependencies (datasets, pandas, tavily-python)
- Add test files and documentation
- Implement stratified sampling for Locomo (by category 1-4)
- Implement stratified sampling for SyllabusQA (by question_type)
- Implement stratified sampling for Qasper (by answer type: extractive/free_form/yes_no)
- Implement stratified sampling for FinanceBench (by question_type)
- Add proper handling when sample size cannot be evenly split:
  - Display warning message
  - Distribute remaining QAs to first N categories
  - Fall back to random sampling if sample size too small
- Update prepare_dataset.py to support both 'random' and 'stratified' modes
- Set default sampling mode to 'random'
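
The uneven-split handling described in this commit message can be sketched as follows. This is a minimal sketch under assumptions, not the code from prepare_dataset.py: the names `split_targets` and `sample_qas`, the `key` parameter, and the warning texts are all hypothetical. It also does not reallocate the shortfall when a category holds fewer QAs than its target.

```python
import random
import warnings
from collections import defaultdict


def split_targets(categories, sample_size):
    """Split sample_size across categories; leftovers go to the first N categories."""
    base, remainder = divmod(sample_size, len(categories))
    if remainder:
        warnings.warn(
            f"sample_size={sample_size} cannot be split evenly across "
            f"{len(categories)} categories; the first {remainder} get one extra QA"
        )
    return {cat: base + (1 if i < remainder else 0) for i, cat in enumerate(categories)}


def sample_qas(qas, sample_size, key, mode="random", seed=0):
    """Sample QA dicts in 'random' (default) or 'stratified' mode."""
    rng = random.Random(seed)
    if sample_size >= len(qas):
        return list(qas)
    if mode == "stratified":
        groups = defaultdict(list)
        for qa in qas:
            groups[qa[key]].append(qa)
        categories = sorted(groups)
        if sample_size >= len(categories):
            targets = split_targets(categories, sample_size)
            picked = []
            for cat in categories:
                # Never ask for more items than the category actually holds.
                k = min(targets[cat], len(groups[cat]))
                picked.extend(rng.sample(groups[cat], k))
            return picked
        # Too few samples to give every category at least one QA.
        warnings.warn("sample_size is smaller than the number of categories; "
                      "falling back to random sampling")
    return rng.sample(qas, sample_size)
```

With four Locomo categories and a sample size of 10, this allocation gives 3, 3, 2, 2.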

@qin-ctx qin-ctx left a comment

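Review summary: This PR introduces a standalone RAG benchmark framework. The module breakdown is reasonable overall, and the README and dataset coverage are fairly complete. However, there are currently 3 blocking runtime problems: 1) the structure of the default configuration file config/config.yaml does not match the runtime code, so the default entry point fails immediately because the execution node is missing; 2) in the num_docs + sample_size + random path of sample_locomo(), dicts are put into a set, which raises TypeError: unhashable type: 'dict'; 3) in the pure stratified path of sample_locomo(), an int is added to a list, which also raises an error. Because all of these problems sit on entry points covered by the README's main workflow, I am requesting changes for now.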

@qin-ctx qin-ctx self-assigned this Mar 25, 2026
- Fix two bugs:
  1. num_docs + sample_size + random path: use int indices instead of dict tuples
  2. pure stratified path: use len() for list length calculation
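
The two failure patterns and their fixes can be reproduced in isolation. The sketch below is a reconstruction of the pattern only; the real sample_locomo() body is not shown in this conversation, and the helper names `buggy_pick`, `pick_by_index`, and `total_available` are hypothetical.

```python
import random

# Toy QA records standing in for Locomo entries (illustrative data only).
qas = [{"question": f"q{i}", "category": 1 + i % 4} for i in range(8)]


def buggy_pick(qas, k, rng):
    """Bug 1 pattern: dicts are unhashable, so set.add() raises TypeError."""
    chosen = set()
    while len(chosen) < k:
        chosen.add(rng.choice(qas))  # TypeError: unhashable type: 'dict'
    return list(chosen)


def pick_by_index(qas, k, rng):
    """Fix: sample unique int indices, then look the QA dicts up."""
    indices = rng.sample(range(len(qas)), k)
    return [qas[i] for i in sorted(indices)]


def total_available(groups):
    """Fix for bug 2: add len(items), not the list itself, to an int total."""
    # The failing pattern was equivalent to `total = 0; total += items`,
    # which raises TypeError because an int cannot be added to a list.
    return sum(len(items) for items in groups.values())
```

Sampling indices also guarantees uniqueness without needing a set at all.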

- Extract common sampling utilities:
  - calculate_category_targets()
  - stratified_sample_with_reallocation()
  - random_sample_qas()
  - sample_docs_stratified()
  - sample_docs_random()

- Reduce code duplication by ~60-70%
- Improve maintainability and readability
- Keep full backward compatibility
- Add FinanceBench to supported datasets list
- Change to template configuration format
- Add execution: section for better organization
- Remove duplicate monitor.worker_end(success=False) call in run_generation()
- The _process_generation_task() already calls worker_end() in its exception handler
- This prevents double-counting of failed tasks and distorted statistics
- Add JSON file support to _get_required_syllabi()
- Extract syllabus names from JSON keys (same format as _load_from_json())
- This ensures data_prepare() processes correct docx files when using JSON input
- Replace 'raise e' with bare 'raise' to preserve original traceback
- Also remove unused 'e' variable since we don't need it
- This makes debugging easier by showing where the exception actually occurred
- In Locomo prompt, use gold_answer_str instead of gold_answer
- This ensures consistent formatting when gold_answer is a list
- Both Locomo and Generic prompts now use the same ' | ' separated format
- Replace manual common ancestor calculation with os.path.commonpath()
- os.path.commonpath() handles all OS path separators correctly
- Add try-except to handle ValueError when no common path exists
- More robust than manual split(os.sep) approach
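
A minimal sketch of the common-ancestor change described in the last commit message, assuming the helper receives a list of file paths. The function name `common_ancestor` and the choice to return None on failure are assumptions; the PR's actual fallback behavior is not shown here.

```python
import os


def common_ancestor(paths):
    """Return the deepest directory shared by all paths, or None if there is none."""
    if not paths:
        return None
    try:
        # Normalizing to absolute paths avoids the ValueError that
        # os.path.commonpath raises when absolute and relative paths are mixed.
        return os.path.commonpath([os.path.abspath(p) for p in paths])
    except ValueError:
        # Still possible, for example, for paths on different Windows drives.
        return None
```

Unlike a manual split on os.sep, os.path.commonpath compares whole path components, so /data/ab and /data/abc share /data rather than /data/ab.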
Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

[Feature]: Add RAG (Retrieval-Augmented Generation) Benchmark System for OpenViking

3 participants