Claude Code plugins for Braintrust - LLM evaluation, logging, observability, and tracing.
Enables AI agents to use Braintrust for LLM evaluation, logging, and observability.
claude plugin marketplace add braintrustdata/braintrust-claude-plugin
claude plugin install braintrust@braintrust-claude-plugin
Automatically trace Claude Code conversations to Braintrust.
claude plugin install trace-claude-code@braintrust-claude-plugin
See trace-claude-code/SKILL.md for setup instructions.
This repo includes skills built on the open Agent Skills format, compatible with Claude Code, Cursor, Amp, and other agents.
Install all skills:
curl -sL https://github.com/braintrustdata/braintrust-claude-plugin/archive/main.tar.gz | tar -xz -C ~/.claude/skills --strip-components=2 braintrust-claude-plugin-main/skills
Available skills:
- using-braintrust - Evaluation, logging, and SQL queries
- trace-claude-code - Automatic conversation tracing
Create a .env file in your project directory:
BRAINTRUST_API_KEY=your-api-key-here
The plugin scripts automatically load .env files from the current directory or parent directories.
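The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual loader: the function names `find_dotenv`, `parse_dotenv`, and `load_dotenv` are hypothetical, and the parsing only covers simple KEY=value lines.

```python
import os
from pathlib import Path
from typing import Optional


def find_dotenv(start: Path) -> Optional[Path]:
    """Return the nearest .env at or above `start`, or None if there is none."""
    start = start.resolve()
    # Check the starting directory first, then each parent up to the root.
    for directory in (start, *start.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None


def parse_dotenv(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_dotenv(start: Path) -> dict:
    """Load the nearest .env into os.environ without overriding existing variables."""
    path = find_dotenv(start)
    if path is None:
        return {}
    values = parse_dotenv(path.read_text())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values
```

Calling `load_dotenv(Path.cwd())` before creating a Braintrust client would make `BRAINTRUST_API_KEY` available from a project-level .env. Variables already exported in the shell take precedence in this sketch, which is a design assumption rather than documented plugin behavior.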
The plugin includes ready-to-use scripts for common operations:
Query logs with SQL:
uv run query_logs.py --project "My Project" --query "SELECT count(*) as count FROM logs WHERE created > now() - interval 1 day"
Log data:
uv run log_data.py --project "My Project" --input "hello" --output "world"
Run evaluations:
uv run run_eval.py --project "My Project" --data '[{"input": "test", "expected": "test"}]'
The skill teaches Claude how to use the Braintrust SDK correctly:
import braintrust
from autoevals import Factuality

# Correct Eval() usage - project name is FIRST POSITIONAL arg
braintrust.Eval(
    "My Project",  # NOT project_name="My Project"
    data=lambda: [...],
    task=lambda input: ...,
    scores=[Factuality],
)

# Logging with flush
logger = braintrust.init_logger(project="My Project")
logger.log(input="hello", output="world")
logger.flush()  # Important! Sends buffered logs before the script exits
The skill teaches Claude to write SQL queries for Braintrust logs:
SELECT input, output, created FROM logs WHERE created > now() - interval 1 day LIMIT 10
SQL quirks in Braintrust:
- Use hour(), day(), month(), year() instead of date_trunc()
- Intervals use the format interval 1 day (no quotes, singular unit)
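The interval rule is easy to get wrong when queries are built programmatically, so here is a small sketch of a helper that always emits the unquoted, singular form. The helper names `interval` and `recent_logs_query` are hypothetical, and the set of accepted units is an assumption, not a list confirmed by Braintrust.

```python
# Units accepted by this sketch. ASSUMPTION: not a confirmed list of what Braintrust supports.
ALLOWED_UNITS = {"minute", "hour", "day", "month", "year"}


def interval(amount: int, unit: str) -> str:
    """Format an interval the way the README describes: no quotes, singular unit."""
    normalized = unit.strip().lower().removesuffix("s")  # "days" -> "day"
    if normalized not in ALLOWED_UNITS:
        raise ValueError(f"unsupported interval unit: {unit!r}")
    if amount <= 0:
        raise ValueError("interval amount must be positive")
    return f"interval {int(amount)} {normalized}"


def recent_logs_query(amount: int, unit: str, limit: int = 10) -> str:
    """Build a query for logs created within the given window."""
    return (
        "SELECT input, output, created FROM logs "
        f"WHERE created > now() - {interval(amount, unit)} LIMIT {int(limit)}"
    )
```

For example, `recent_logs_query(1, "days")` produces the same query shown above, which could then be passed to the `--query` flag of `query_logs.py`.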
braintrust-claude-plugin/
├── .claude-plugin/
│   ├── plugin.json          # Plugin manifest
│   └── marketplace.json     # Marketplace index
├── skills/
│   ├── using-braintrust/
│   │   ├── SKILL.md         # Evaluation & logging skill
│   │   └── scripts/         # Helper scripts
│   │       ├── query_logs.py
│   │       ├── log_data.py
│   │       └── run_eval.py
│   └── trace-claude-code/
│       ├── SKILL.md         # Claude Code tracing skill
│       └── hooks/
│           └── stop_hook.sh # Hook script
├── evals/                   # Evaluation suite
│   ├── eval_e2e_*.py        # End-to-end tests
│   └── eval_*.py            # Baseline tests
└── README.md
- Python 3.12+
- uv package manager
Test the plugin without installing from marketplace:
claude --plugin-dir /path/to/braintrust-claude-plugin
The evals/ directory contains tests that verify the skill works correctly (e.g., Claude generates valid SQL queries, logs data properly).
cd evals
export BRAINTRUST_API_KEY="your-key"
# Run all evals
uv run braintrust eval .
# Run specific eval
uv run braintrust eval eval_e2e_log_fetch.py
# Install hooks
uv run pre-commit install
# Run all hooks
uv run pre-commit run --all-files
After making changes:
- Bump version in .claude-plugin/plugin.json and .claude-plugin/marketplace.json
- Commit and push
- Users update with: claude plugin marketplace update braintrust-claude-plugin
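Keeping the two manifests in sync by hand is error-prone, so the version bump could be scripted along these lines. This is a sketch under stated assumptions, not part of the repo: the helper names are hypothetical, plugin.json is assumed to have a top-level "version" key holding a MAJOR.MINOR.PATCH string, and every "version" field found in either file is assumed to track the plugin version. Adjust it if your manifest layout differs.

```python
import json
from pathlib import Path


def bump_patch(version: str) -> str:
    """Increment the patch component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


def set_versions(node, new_version: str) -> int:
    """Set every string-valued "version" field in a JSON tree; return how many changed."""
    count = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "version" and isinstance(value, str):
                node[key] = new_version
                count += 1
            else:
                count += set_versions(value, new_version)
    elif isinstance(node, list):
        for item in node:
            count += set_versions(item, new_version)
    return count


def bump_manifests(repo_root: Path) -> str:
    """Bump the patch version in plugin.json and mirror it into marketplace.json."""
    plugin_path = repo_root / ".claude-plugin" / "plugin.json"
    market_path = repo_root / ".claude-plugin" / "marketplace.json"
    # ASSUMPTION: plugin.json carries the current version at the top level.
    current = json.loads(plugin_path.read_text())["version"]
    new_version = bump_patch(current)
    for path in (plugin_path, market_path):
        data = json.loads(path.read_text())
        set_versions(data, new_version)
        path.write_text(json.dumps(data, indent=2) + "\n")
    return new_version
```

Running `bump_manifests(Path("."))` from the repo root before committing would update both files in one step.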