Conversation
Add domain-agnostic framing (legal, medical, documentation, support) and rename "Code-aware" to "Structure-aware" in the feature list.
Separate quality benchmark system (bench/quality.ts) that measures compression fidelity independently from the existing perf/regression suite.

Includes:
- quality-analysis.ts: compressed-only retention metrics, semantic fidelity scoring (fact extraction + negation detection), per-message quality breakdown, and recencyWindow tradeoff sweep
- quality-scenarios.ts: 6 edge-case scenarios (single-char, giant message, code-only, entity-dense, prose-only, mixed languages)
- quality.ts: standalone runner with --save/--check against its own baseline namespace (bench/baselines/quality/)
- backfill.ts: retroactively generates quality baselines for older git refs via temporary worktrees

Key design decisions:
- Retention is measured only on compressed messages (fixes the all-1.0 masking problem in the existing analyzeRetention)
- Code block integrity is byte-identical verification, not just fence count
- Zero-tolerance regression on code block integrity, 5% on entity retention, 10% on fact retention
- Completely isolated from the existing --check (separate baseline files)
- Backfilled v1.0.0 baseline for historical comparison
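The byte-identical integrity check described above could look roughly like this sketch (function names and the fence-parsing regex are illustrative assumptions, not the actual quality-analysis.ts implementation): every code block extracted from the original must reappear verbatim in the compressed output.

```typescript
// Hypothetical sketch: byte-identical code block integrity check.
// Extracts fenced code block bodies and verifies each original block
// survives compression exactly, rather than merely counting fences.
function extractCodeBlocks(text: string): string[] {
  const blocks: string[] = [];
  const fence = /```\w*\n([\s\S]*?)```/g;
  let m: RegExpExecArray | null;
  while ((m = fence.exec(text)) !== null) {
    blocks.push(m[1]); // the block body, byte-for-byte
  }
  return blocks;
}

function codeBlocksIntact(original: string, compressed: string): boolean {
  const kept = new Set(extractCodeBlocks(compressed));
  // Every original block must be present verbatim in the compressed text.
  return extractCodeBlocks(original).every((block) => kept.has(block));
}
```

A fence-count check would pass even if a block's body was truncated; comparing the extracted bodies byte-for-byte is what justifies the zero-tolerance regression threshold.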
… LLM judge

Replace broken quality metrics (keywordRetention, factRetention, negationErrors) with five meaningful ones: task-based probes (~70 across 13 scenarios), information density, compressed-only quality score, negative compression detection, and summary coherence checks.

- Add ProbeDefinition type and getProbesForScenario() with curated probes
- Add computeInformationDensity(), computeCompressedQualityScore(), detectNegativeCompressions(), checkCoherence() analysis functions
- Add min-output-chars probes to catch over-aggressive compression
- Add lang aliases to countCodeBlocks (typescript/ts, python/py, yaml/yml)
- Fix regression thresholds: coherence/negativeCompressions now track increases from baseline, not just zero-to-nonzero transitions
- Add --llm-judge flag with multi-provider support (OpenAI, Anthropic, Gemini, Ollama) for LLM-as-judge scoring (display-only, not in baseline)
- Add Gemini provider to bench/llm.ts (@google/genai SDK)
- Add bench:quality:judge npm script
- Update docs/benchmarks.md with quality metrics, probes, LLM judge, and regression-threshold documentation
- Update CLAUDE.md with quality benchmark commands
- Re-save quality baseline with the new format
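The lang-alias change to countCodeBlocks might be sketched as follows (a hypothetical simplification; the real signature in bench/ may differ): aliases are canonicalized before counting so `ts` and `typescript` fences tally together.

```typescript
// Hypothetical sketch: language-alias-aware code block counting.
// Aliases listed in the commit message (typescript/ts, python/py, yaml/yml)
// are folded into one canonical name before tallying.
const LANG_ALIASES: Record<string, string> = {
  ts: "typescript",
  py: "python",
  yml: "yaml",
};

function canonicalLang(lang: string): string {
  const lower = lang.toLowerCase();
  return LANG_ALIASES[lower] ?? lower;
}

function countCodeBlocks(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  const fence = /```(\w*)\n[\s\S]*?```/g;
  let m: RegExpExecArray | null;
  while ((m = fence.exec(text)) !== null) {
    const lang = canonicalLang(m[1] || "plain");
    counts.set(lang, (counts.get(lang) ?? 0) + 1);
  }
  return counts;
}
```

Without the alias fold, a compressor that rewrote a ```ts fence as ```typescript would look like a lost-plus-gained block pair instead of an unchanged one.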
…ments

# Conflicts:
#	package-lock.json
- Bump version to 1.3.0
- Add quality-history documentation with version comparison
- Add --features flag for opt-in feature benchmarking
- Update CHANGELOG with all 1.3.0 changes
- Save baselines for v1.3.0
- Regenerate benchmark-results.md
- Link quality-history.md from README and docs index
# Conflicts:
#	CHANGELOG.md
#	CLAUDE.md
#	README.md
Re-apply: version bump to 1.3.0, CHANGELOG 1.3.0 section, quality benchmark npm scripts, CLAUDE.md commands, Gemini provider in llm.ts, quality-history link in README and docs index, @google/genai devDep.
v1.3.0 — Quality benchmark overhaul + LLM judge
Summary
Switch to `node:util.styleText` (Node 20+) and bump `engines` to `>=20` in package.json. The library code itself is pure ESM and technically runs on Node 18, but the test runner can't. Since `.nvmrc` targets 22 and coverage already required 20+, this aligns everything.

Test plan