(WIP) Feat(benchmark): Add benchmark/RAG: RAG system evaluation framework #825
Draft
sponge225 wants to merge 23 commits into volcengine:main from
Conversation
…pdates
- Add complete dataset sampling scripts with document-level sampling
- Implement filtering logic consistent with the adapters (exclude category 5 for Locomo, no-answer for SyllabusQA, unanswerable for Qasper)
- Update configuration from raw_data/dataset_dir to dataset_path for clarity
- Enhance adapters with improved path handling and data loading
- Add gitignore for data and output directories
- Add dependencies (datasets, pandas, tavily-python)
- Add test files and documentation
- Implement stratified sampling for Locomo (by category 1-4)
- Implement stratified sampling for SyllabusQA (by question_type)
- Implement stratified sampling for Qasper (by answer type: extractive/free_form/yes_no)
- Implement stratified sampling for FinanceBench (by question_type)
- Add proper handling when the sample size cannot be split evenly:
  - Display a warning message
  - Distribute the remaining QAs to the first N categories
  - Fall back to random sampling if the sample size is too small
- Update prepare_dataset.py to support both 'random' and 'stratified' modes
- Set the default sampling mode to 'random'
# Conflicts:
#   uv.lock
qin-ctx (Collaborator) requested changes on Mar 25, 2026
Review summary (translated from Chinese): This PR introduces a standalone RAG benchmark framework. The module split is reasonable overall, and the README and the range of supported datasets are fairly complete. However, there are currently three blocking runtime issues: 1) the structure of the default config file config/config.yaml is inconsistent with the runtime code, so the default entry point fails immediately because the execution node is missing; 2) in the num_docs + sample_size + random path, sample_locomo() puts dicts into a set, which raises TypeError: unhashable type: 'dict'; 3) in the pure stratified path, sample_locomo() adds an int to a list, which also fails immediately. Since all of these issues sit on entry points covered by the README's main workflow, I'm requesting changes for now.
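The second issue is reproducible in isolation. The snippet below illustrates the class of bug the reviewer describes and the kind of fix the follow-up commit applies; the variable names are illustrative, not the actual sample_locomo() code.

```python
# Buggy pattern: dicts are mutable and unhashable, so they cannot be
# set members.
docs = [{"doc_id": 0}, {"doc_id": 1}]

try:
    seen = {d for d in docs}          # raises TypeError
except TypeError as e:
    print(e)                          # unhashable type: 'dict'

# Fix: track hashable int indices instead of the dict objects, then
# dereference back into the document list.
seen = {i for i in range(len(docs))}
picked = [docs[i] for i in sorted(seen)]
```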
- Fix two bugs:
  1. num_docs + sample_size + random path: use int indices instead of dict tuples
  2. pure stratified path: use len() for list length calculation
- Extract common sampling utilities:
  - calculate_category_targets()
  - stratified_sample_with_reallocation()
  - random_sample_qas()
  - sample_docs_stratified()
  - sample_docs_random()
- Reduce code duplication by ~60-70%
- Improve maintainability and readability
- Keep full backward compatibility
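Of the extracted utilities, only the names appear in the commit message; the body below is a guess at what calculate_category_targets() might look like, shown to make the quota-splitting idea concrete. It deliberately omits the reallocation step that stratified_sample_with_reallocation() presumably adds.

```python
def calculate_category_targets(category_sizes, sample_size):
    """Hypothetical sketch: map each category to its sampling quota.
    Splits sample_size evenly, gives the remainder to the first
    categories, and caps each quota at what the category can supply
    (no reallocation of the shortfall in this simplified version)."""
    base, remainder = divmod(sample_size, len(category_sizes))
    targets = {}
    for i, (cat, available) in enumerate(sorted(category_sizes.items())):
        want = base + (1 if i < remainder else 0)
        targets[cat] = min(want, available)
    return targets
```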
- Add FinanceBench to the supported datasets list
- Change to a template configuration format
- Add an execution: section for better organization
- Remove the duplicate monitor.worker_end(success=False) call in run_generation()
- _process_generation_task() already calls worker_end() in its exception handler
- This prevents double-counting of failed tasks and distorted statistics
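The double-counting mechanism is easy to demonstrate with a toy monitor. The class and function names below mirror the commit message but the bodies are invented for illustration; they are not the benchmark's actual code.

```python
class Monitor:
    """Toy failure counter standing in for the benchmark's monitor."""
    def __init__(self):
        self.failed = 0

    def worker_end(self, success):
        if not success:
            self.failed += 1

def process_task(monitor, task):
    try:
        task()
    except Exception:
        # The inner handler already records the failure...
        monitor.worker_end(success=False)
        raise

def run_generation(monitor, task):
    try:
        process_task(monitor, task)
    except Exception:
        # ...so the outer handler must NOT call worker_end() again,
        # or every failed task is counted twice.
        pass
```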
- Add JSON file support to _get_required_syllabi()
- Extract syllabus names from JSON keys (same format as _load_from_json())
- This ensures data_prepare() processes the correct docx files when using JSON input
- Replace 'raise e' with a bare 'raise' to preserve the original traceback
- Also remove the now-unused 'e' variable
- This makes debugging easier by showing where the exception actually occurred
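A bare `raise` inside an `except` block re-raises the active exception with its traceback intact, so debuggers and logs still point at the original failure site. A small self-contained demonstration (function names are illustrative):

```python
import traceback

def inner():
    raise ValueError("boom")        # the real failure site

def reraise(task):
    try:
        task()
    except ValueError:
        raise                       # bare raise: traceback left untouched

try:
    reraise(inner)
except ValueError as exc:
    frames = traceback.extract_tb(exc.__traceback__)

# The deepest frame still points into inner(), where the error occurred,
# and the reraise() frame is preserved in the call chain.
```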
- In the Locomo prompt, use gold_answer_str instead of gold_answer
- This ensures consistent formatting when gold_answer is a list
- Both the Locomo and Generic prompts now use the same ' | '-separated format
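The shared formatting rule described above can be sketched as a small helper; the function name is hypothetical, only the ' | ' separator comes from the commit message.

```python
def format_gold_answer(gold_answer):
    """Sketch: join list-valued gold answers with ' | '; pass scalars
    through as plain strings, so both prompts see the same format."""
    if isinstance(gold_answer, list):
        return " | ".join(str(a) for a in gold_answer)
    return str(gold_answer)
```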
- Replace the manual common-ancestor calculation with os.path.commonpath()
- os.path.commonpath() handles all OS path separators correctly
- Add a try-except to handle the ValueError raised when no common path exists
- More robust than the manual split(os.sep) approach
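The pattern the commit describes looks roughly like this (wrapper name is illustrative). os.path.commonpath() raises ValueError for an empty sequence and for a mix of absolute and relative paths, which is what the try-except guards against.

```python
import os

def common_ancestor(paths):
    """Return the deepest common ancestor of `paths`, or None when no
    common path exists (empty input, mixed absolute/relative paths,
    or different drives on Windows)."""
    try:
        return os.path.commonpath(paths)
    except ValueError:
        return None
```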
Description
RAG benchmark
RAG benchmark is a framework for evaluating the performance of Openviking's RAG (Retrieval-Augmented Generation) system, supporting multiple datasets and evaluation metrics.
Features
See benchmark/RAG/README.md for detailed documentation.
Related Issue
Summary
This PR adds the RAG benchmark framework.
Closes #885
Type of Change
Changes Made
Testing
Checklist
Screenshots (if applicable)
Additional Notes