
v1.3.0 — Quality benchmark overhaul + LLM judge#20

Merged
SimplyLiz merged 85 commits into `main` from `feature/v2-improvements`
Mar 21, 2026

Conversation

@SimplyLiz (Owner)

Summary

  • Quality benchmark overhaul: replaced broken metrics (keywordRetention, factRetention, negationErrors) with five meaningful ones: task-based probes (~70 across 13 scenarios), information density, compressed-only quality score, negative compression detection, and summary coherence checks
  • LLM-as-judge scoring (--llm-judge): optional multi-provider evaluation (OpenAI, Anthropic, Gemini, Ollama) — display-only, not in baselines
  • Opt-in feature comparison (--features): benchmarks each v2 opt-in feature against baseline to measure impact
  • Quality history (docs/quality-history.md): version-over-version quality tracking across v1.0.0→v1.3.0 with opt-in feature impact analysis
  • Gemini provider support for LLM benchmarks via @google/genai SDK
  • Merged dependabot PRs: dev deps bump + codeql-action v3→v4
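To make the probe idea concrete, here is a minimal sketch of what a task-based probe could look like. The shapes and names are hypothetical — the real `ProbeDefinition` and `getProbesForScenario()` live in `bench/quality.ts` and may differ — but the core idea is the same: a probe asks "can a reader still answer X from the compressed output?" and checks that the required fragments survived.

```typescript
// Hypothetical probe shape — the repo's actual ProbeDefinition may differ.
interface ProbeDefinition {
  id: string;
  description: string;
  // The probe passes only if every required fragment survives compression.
  mustContain: string[];
}

function runProbe(probe: ProbeDefinition, compressedText: string): boolean {
  return probe.mustContain.every((frag) => compressedText.includes(frag));
}

const probe: ProbeDefinition = {
  id: 'db-port',
  description: 'Can a reader still recover the database port?',
  mustContain: ['5432'],
};

console.log(runProbe(probe, 'Postgres runs on port 5432.')); // true
console.log(runProbe(probe, 'Postgres is configured.'));     // false
```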

Key quality findings

| Finding | Detail |
| --- | --- |
| Code preservation | 100% across all versions, all scenarios |
| v1.1.0 entity regression | Structured content 100%→68%, Entity-dense 68%→53%, Mixed languages 100%→67% |
| Conversation flow | Fixes deep-conversation probes (33%→100%) but destroys Long Q&A (entity retention→7%) |
| Semantic clustering | Compresses code-only messages it shouldn't (100%→75% probe pass) |
| Importance/contradiction | Zero measurable impact on current scenarios |

Test plan

  • npm run build — compiles
  • npm test — 663 tests pass
  • npm run lint && npm run format:check — clean
  • npm run bench:quality — all 13 scenarios run
  • npm run bench:quality:check — passes against baseline
  • npm run bench:quality:judge — LLM judge runs with Gemini/OpenAI/Ollama
  • npm run bench:quality:features — feature comparison runs
  • npm run bench:save — main baseline saved

SimplyLiz and others added 30 commits February 25, 2026 06:14
…ode to v6

- CLAUDE.md with architecture docs and branching strategy
- SECURITY.md with vulnerability reporting policy
- CHANGELOG.md reformatted to Keep a Changelog spec
- .nvmrc pinning Node 22
- Bump actions/setup-node v4 → v6
Exercises every public export as a real npm consumer would — catches
broken exports maps, missing tarball files, and ESM resolution failures
that unit tests cannot detect. Covers 26 scenarios including compress,
uncompress round-trips, dedup, token budgets, async paths, tool_calls,
re-compression, recursive uncompress, and large conversations.
Add package structure validation (publint --strict) and TypeScript type
resolution checks (attw) to the e2e pipeline. Artifacts (.tgz,
e2e/node_modules, e2e/package-lock.json) are now cleaned up after every
run. E2e job added to CI in parallel with existing jobs, gating publish.
…d error paths

Replaces custom pass/fail harness with node:test + node:assert/strict.
Strengthens fuzzy dedup (asserts messages_fuzzy_deduped > 0) and
tool_calls (verifies non-tool messages are compressed). Adds 7 error
handling tests covering TypeError contracts and graceful null/empty
content. Merges develop to resolve conflicts.
Add domain-agnostic framing (legal, medical, documentation, support)
and rename "Code-aware" to "Structure-aware" in feature list.
- Add inline .env parser in bench/run.ts (no dependency, won't override existing vars)
- Probe localhost:11434/api/tags to auto-detect Ollama without env vars
- Add LLM result types and save/load in bench/baseline.ts
- Auto-save LLM results to bench/baselines/llm/<provider>-<model>.json
- Extend doc generator with LLM comparison tables when result files exist
- Add .env.example template with commented-out provider keys
- Update skip message to mention Ollama auto-detection
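A dependency-free `.env` parser with these properties can be sketched as follows. This is an illustration of the behavior the commit describes (skip comments, tolerate an `export ` prefix, strip matching quotes, never override already-set variables), not the exact code in `bench/run.ts`:

```typescript
// Minimal .env parser sketch: no dependencies, never overrides existing vars.
function parseDotEnv(src: string, env: Record<string, string | undefined>): void {
  for (const rawLine of src.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue; // skip blanks and comments
    const noExport = line.startsWith('export ') ? line.slice(7) : line;
    const eq = noExport.indexOf('=');
    if (eq === -1) continue;
    const key = noExport.slice(0, eq).trim();
    let value = noExport.slice(eq + 1).trim();
    // Strip one layer of matching single or double quotes.
    if (
      value.length >= 2 &&
      ((value.startsWith('"') && value.endsWith('"')) ||
        (value.startsWith("'") && value.endsWith("'")))
    ) {
      value = value.slice(1, -1);
    }
    if (env[key] === undefined) env[key] = value; // existing vars win
  }
}

const env: Record<string, string | undefined> = { OPENAI_API_KEY: 'already-set' };
parseDotEnv('export OPENAI_API_KEY="from-file"\nOLLAMA_HOST=localhost:11434', env);
// env.OPENAI_API_KEY stays 'already-set'; env.OLLAMA_HOST is 'localhost:11434'
```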
… metrics

LLM benchmarks previously ran automatically when API keys were
detected, silently burning money on every `npm run bench`. Now
requires explicit `--llm` flag (`npm run bench:llm`).

Additions:
- Technical explanation scenario (pure prose, no code fences)
- vsDet expansion metric (LLM ratio / deterministic ratio)
- Token budget + LLM section (deterministic vs llm-escalate)
- bench:llm npm script

Fixes:
- .env parser: strip quotes, handle `export` prefix
- loadAllLlmResults: try/catch per file for malformed JSON
- Ollama: verify model availability via /api/tags response
- Anthropic: guard against empty content array
- LLM benchmark loop: per-scenario try/catch
- Doc generation: scenario count 7→8, add Technical explanation
…ture

- --save: writes current.json + history/v{version}.json, regenerates docs
- --check: compares against current.json, exits non-zero on regression
- --tolerance N: allows N% deviation (0% default, deterministic)
- Baselines reorganized: current.json at root, history/ for versioned
  snapshots, llm/ for non-deterministic reference data
- bench:llm added to package.json for explicit LLM benchmark runs
- Doc generation references correct baseline paths
Split docs/benchmarks.md into two files:
- docs/benchmarks.md: hand-written handbook (how to run, scenarios,
  interpreting results, regression testing)
- docs/benchmark-results.md: auto-generated by bench:save with Mermaid
  xychart-beta charts, summary table, and polished data presentation

Rewrite generateBenchmarkDocs() with compression ratio chart, dedup
impact chart, LLM comparison chart, key findings callout, and
conditional sections for LLM data and version history.
…pie chart

Add shields.io badges, unicode progress bars, reduction % and message
count columns to the compression table, a Mermaid pie chart for message
outcomes, and collapsible details sections for LLM provider tables.
Drop progress bar column from compression table — unicode blocks render
with variable width in GitHub's proportional-font tables. Switch LLM
comparison chart from double bar (stacked) to bar+line so both series
are visible side by side.
Interleave "Scenario (Det)" and "Scenario (LLM)" labels on the x-axis
so each scenario gets two side-by-side bars in a single series, avoiding
Mermaid's stacked-bar behavior.
Mermaid xychart can't do grouped bars — stacks or overlaps labels.
Replace with a clean comparison table showing Det vs Best LLM ratio,
delta percentage, and winner per scenario.
…arison

Render comparison as paired horizontal bars inside a fenced code block
(monospace), replacing the broken Mermaid chart. Each scenario shows
Det and LLM bars side by side with ratios and a star for LLM wins.
compressSync and compressAsync were identical (~180 lines each) except
for 2 summarize call sites. Replace both with a single compressGen
generator that yields summarize requests, driven by thin sync/async
runners. Removes 149 lines of duplication, no public API changes.
…CII charts

- Cross-provider summary table with avg ratio, vsDet, budget fits, time
- Fuzzy dedup table gains "vs Base" column highlighting improvements
- ASCII comparison charts now render for all providers, not just best
Single-page demo that lets users paste conversations in plain-text
chat format, adjust compression settings, and see results with an
inline diff view highlighting what changed.

- esbuild bundles src/index.ts → demo/bundle.js (IIFE, global CCE)
- Plain-text input format (role: message, blank line separates)
- All CompressOptions exposed: recencyWindow, tokenBudget, preserve,
  dedup, fuzzyDedup, fuzzyThreshold, forceConverge
- Line-level diff output: red/strikethrough for removed, green for
  added, tags for preserved/compressed/removed messages
- 5 example conversations: coding assistant, technical prose,
  structured + credentials, short chat, deep conversation
- npm scripts: demo:build, demo
feat(demo): browser-based demo app
Measure each dist/*.js file and total after tsc build. Adds
BundleSizeResult type, comparison loop for --check regression
detection, doc section with table, and gzip badge.
Bumps the dev-deps group with 1 update: [publint](https://github.com/publint/publint/tree/HEAD/packages/publint).


Updates `publint` from 0.3.17 to 0.3.18
- [Release notes](https://github.com/publint/publint/releases)
- [Changelog](https://github.com/publint/publint/blob/master/packages/publint/CHANGELOG.md)
- [Commits](https://github.com/publint/publint/commits/publint@0.3.18/packages/publint)

---
updated-dependencies:
- dependency-name: publint
  dependency-version: 0.3.18
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: dev-deps
...

Signed-off-by: dependabot[bot] <support@github.com>
gzip output varies across zlib versions (macOS vs Ubuntu CI), so
only raw bytes are regression-checked. gzipBytes remains tracked
in baselines and docs as informational.
SimplyLiz and others added 27 commits March 20, 2026 20:11
- New entropyScorer option: plug in a small LM for self-information
  based sentence importance scoring (Selective Context paper)
- entropyScorerMode: 'replace' (entropy only) or 'augment' (weighted
  average with heuristic, default)
- src/entropy.ts: splitSentences, normalizeScores, combineScores utils
- Sync and async paths supported; async scorer throws in sync mode
- Zero new dependencies: scorer is user-provided function
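A rough sketch of the `augment` mode described above: entropy scores from the user-provided scorer are min-max normalized, then blended with the heuristic score via a weighted average. The 0.5 weight and function shapes are assumptions for illustration; the real helpers live in `src/entropy.ts`.

```typescript
// Min-max normalize raw entropy scores into [0, 1].
function normalizeScores(raw: number[]): number[] {
  const min = Math.min(...raw);
  const max = Math.max(...raw);
  if (max === min) return raw.map(() => 0.5); // flat input → neutral scores
  return raw.map((s) => (s - min) / (max - min));
}

// 'augment' mode: weighted average of heuristic and normalized entropy.
function combineScores(heuristic: number[], entropy: number[], weight = 0.5): number[] {
  const norm = normalizeScores(entropy);
  return heuristic.map((h, i) => (1 - weight) * h + weight * norm[i]);
}

console.log(combineScores([0.2, 0.8], [10, 30])); // ≈ [0.1, 0.9]
```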
- Detects Q&A pairs, request→action→confirmation chains, corrections,
  and acknowledgment patterns in message history
- Groups flow chains into single compression units producing more
  coherent summaries (e.g., "Q: how does X work? → A: it uses Y")
- conversationFlow option: opt-in, default false
- Flow chains override soft preservation (recency, short content)
  but not hard blocks (system role, dedup, tool_calls)
…uto)

- compressionDepth option controls summarization aggressiveness
- gentle: standard sentence selection (default, backward compatible)
- moderate: 50% tighter budgets for more aggressive compression
- aggressive: entity-only stubs for maximum ratio
- auto: progressively tries gentle → moderate → aggressive until
  tokenBudget fits, with quality gate (stops if quality < 0.60)
- Both sync and async paths supported
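The `auto` escalation described above can be sketched as a simple loop. Everything below is hypothetical scaffolding (`compressAtDepth`, `countTokens`, `scoreQuality` stand in for internals not shown here); only the shape — try each depth in order, stop when the budget fits, bail out if quality drops below the 0.60 gate — comes from the commit message.

```typescript
type Depth = 'gentle' | 'moderate' | 'aggressive';

function autoCompress(
  text: string,
  tokenBudget: number,
  compressAtDepth: (text: string, depth: Depth) => string,
  countTokens: (text: string) => number,
  scoreQuality: (text: string) => number,
): string {
  let best = text;
  for (const depth of ['gentle', 'moderate', 'aggressive'] as const) {
    const candidate = compressAtDepth(text, depth);
    if (scoreQuality(candidate) < 0.6) break; // quality gate: stop escalating
    best = candidate;
    if (countTokens(candidate) <= tokenBudget) return candidate; // budget fits
  }
  return best; // best acceptable attempt, even if the budget was never met
}
```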
- Coreference tracking (coreference option): when a compressed message
  defines an entity referenced by a preserved message, the definition
  is inlined into the summary to prevent orphaned references
- Semantic clustering (semanticClustering option): groups messages by
  topic using TF-IDF cosine similarity + entity overlap Jaccard, then
  compresses each cluster as a unit for better topic coherence
- Both features are opt-in, zero new dependencies
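The two similarity signals named above can be sketched as follows. This uses raw term-frequency vectors as a stand-in for TF-IDF, and Jaccard overlap of entity sets; how the two scores are blended inside the library is not shown here.

```typescript
// Term-frequency vector for a message (TF-IDF stand-in for illustration).
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const tok of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    tf.set(tok, (tf.get(tok) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  for (const [tok, wa] of a) dot += wa * (b.get(tok) ?? 0);
  const norm = (m: Map<string, number>) =>
    Math.sqrt([...m.values()].reduce((s, v) => s + v * v, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Jaccard overlap of two entity sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

console.log(jaccard(new Set(['db', 'port']), new Set(['db', 'host']))); // 1/3
```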
- Segments text into Elementary Discourse Units with dependency graph
- Clause boundary detection via discourse markers (then, because, which...)
- Pronoun/demonstrative, temporal, and causal dependency edges
- When selecting EDUs for summary, dependency parents are included
  (up to 2 levels) to prevent incoherent output
- discourseAware option: opt-in, default false
- 8 adversarial test cases: pronoun-heavy, scattered entities,
  correction chains, code-interleaved prose, near-duplicates with
  critical differences, 10k+ char messages, mixed SQL/JSON/bash,
  and full round-trip integrity with all features enabled
- Update roadmap: 14 of 16 items complete
- ML token classifier (mlTokenClassifier option): per-token keep/remove
  classification via user-provided model (LLMLingua-2 style). Includes
  sync/async support, whitespace tokenizer, mock classifier for testing
- A/B comparison tool (npm run bench:compare): side-by-side comparison
  of default vs v2 features across coding, deep conversation, and
  agentic scenarios. Reports ratio, quality, entity retention, tokens
- All 16/16 roadmap items now complete
…tion

- bench/run.ts: new Quality Metrics (v2) table showing entity retention,
  structural integrity, reference coherence, and quality score per scenario
- bench/baseline.ts: QualityResult type, quality section in generated docs,
  average quality score in summary table
- bench/compare.ts: add Long Q&A and Technical explanation scenarios,
  rename V2 option set to "V2 balanced" (no relevanceThreshold)
- flow.ts: exclude messages with code fences from flow chain detection
  to prevent Q&A chains from dropping code content
- package.json: add bench:compare script
- New docs/v2-features.md: full documentation for all 11 new features
  with usage examples, how-it-works sections, and explicit tradeoff
  analysis for each feature
- docs/api-reference.md: updated exports listing, 13 new options in
  CompressOptions table, 5 new result fields, new types
  (MLTokenClassifier, TokenClassification)
- docs/token-budget.md: added tiered budget strategy and compression
  depth sections with cross-links
- docs/README.md: added V2 Features to index
- Each feature documents: what it does, how to use it, how it works
  internally, and what you give up (the tradeoff)
- Flow chains and clusters no longer skip non-member messages between
  chain endpoints. Previously, a chain spanning indices [1,4] would
  skip indices 2,3 even if they weren't chain members (dropping code)
- Importance threshold raised from 0.35 to 0.65. The old threshold
  preserved nearly all messages in entity-rich conversations, reducing
  compression ratio by up to 30% with no quality benefit
- EDU scorer replaced length-based heuristic with information-density
  scoring (identifiers, numbers, emphasis) to avoid keeping long filler
  clauses over short technical ones
- Quick reference table, feature section, and TSDoc all flag the 8-28%
  ratio regression without a custom ML scorer
- Explain why: dependency tracking inherently fights compression by
  pulling in parent EDUs, and the rule-based scorer can't distinguish
  load-bearing dependencies from decorative ones
- Recommend using exported segmentEDUs/scoreEDUs/selectEDUs directly
  with a custom scorer instead of the discourseAware option
- Remove discourseAware from recommended feature combinations
Adaptive entity-aware budgets were changing default compression output
(6% regression on coding scenario) because extractEntities was called
unconditionally. Now entity-adaptive budgets only activate when
compressionDepth is explicitly set to moderate/aggressive/auto.

Default path (no v2 options) now produces identical output to develop.
- Flow chains and clusters only mark themselves as processed AFTER
  successful compression. Previously they were marked on entry,
  causing non-compressed chain members to be silently dropped
- Semantic clusters restricted to consecutive indices only —
  non-consecutive merges broke round-trip because uncompress can't
  restore interleaved message ordering
- Added V2 Features Comparison section to bench reporter showing
  each feature individually and recommended combo vs default, with
  per-scenario ratio/quality and delta row
- All 8 scenarios × 8 configs pass round-trip verification
feat: v2 compression features — quality metrics, flow detection, tiered budget, depth control
Separate quality benchmark system (bench/quality.ts) that measures
compression fidelity independently from the existing perf/regression
suite. Includes:

- quality-analysis.ts: compressed-only retention metrics, semantic
  fidelity scoring (fact extraction + negation detection), per-message
  quality breakdown, and recencyWindow tradeoff sweep
- quality-scenarios.ts: 6 edge case scenarios (single-char, giant
  message, code-only, entity-dense, prose-only, mixed languages)
- quality.ts: standalone runner with --save/--check against its own
  baseline namespace (bench/baselines/quality/)
- backfill.ts: retroactively generates quality baselines for older
  git refs via temporary worktrees

Key design decisions:
- Retention measured only on compressed messages (fixes the all-1.0
  masking problem in the existing analyzeRetention)
- Code block integrity is byte-identical verification, not just fence
  count
- Zero-tolerance regression on code block integrity, 5% on entity
  retention, 10% on fact retention
- Completely isolated from existing --check (separate baseline files)
- Backfilled v1.0.0 baseline for historical comparison
… LLM judge

Replace broken quality metrics (keywordRetention, factRetention, negationErrors)
with five meaningful ones: task-based probes (~70 across 13 scenarios),
information density, compressed-only quality score, negative compression
detection, and summary coherence checks.

- Add ProbeDefinition type and getProbesForScenario() with curated probes
- Add computeInformationDensity(), computeCompressedQualityScore(),
  detectNegativeCompressions(), checkCoherence() analysis functions
- Add min-output-chars probes to catch over-aggressive compression
- Add lang aliases to countCodeBlocks (typescript/ts, python/py, yaml/yml)
- Fix regression thresholds: coherence/negativeCompressions track increases
  from baseline, not just zero-to-nonzero transitions
- Add --llm-judge flag with multi-provider support (OpenAI, Anthropic,
  Gemini, Ollama) for LLM-as-judge scoring (display-only, not in baseline)
- Add Gemini provider to bench/llm.ts (@google/genai SDK)
- Add bench:quality:judge npm script
- Update docs/benchmarks.md with quality metrics, probes, LLM judge, and
  regression threshold documentation
- Update CLAUDE.md with quality benchmark commands
- Re-save quality baseline with new format
…op/github/codeql-action-4

chore(deps): bump github/codeql-action from 3 to 4
…/dev-deps-10041a4c1d

chore(deps-dev): bump the dev-deps group across 1 directory with 6 updates
- Bump version to 1.3.0
- Add quality history documentation with version comparison
- Add --features flag for opt-in feature benchmarking
- Update CHANGELOG with all 1.3.0 changes
- Save baselines for v1.3.0
- Regenerate benchmark-results.md
- Link quality-history.md from README and docs index
# Conflicts:
#	CHANGELOG.md
#	CLAUDE.md
#	README.md
Re-apply: version bump to 1.3.0, CHANGELOG 1.3.0 section, quality
benchmark npm scripts, CLAUDE.md commands, Gemini provider in llm.ts,
quality-history link in README and docs index, @google/genai devDep.
@SimplyLiz SimplyLiz merged commit d396e7b into main Mar 21, 2026
5 of 8 checks passed
@SimplyLiz SimplyLiz deleted the feature/v2-improvements branch March 21, 2026 18:03
Check failure — Code scanning / CodeQL

**Polynomial regular expression used on uncontrolled data (High).** The sentence-splitting pattern `/[^.!?\n]+[.!?]+/g` and related regexes depend on library input and may run slowly on strings with many repetitions of `'\n'`, `' '`, or `'!'`. Flagged locations:

- numbered-step detection for reasoning chains (`const stepMatches = text.match(NUMBERED_STEP_RE);`)
- `bestSentenceScore()` — `const sentences = text.match(/[^.!?\n]+[.!?]+/g);`
- the entropy score combiner (`rawScores: number[]`, `mode: 'replace' | 'augment'`) — `const sentences = text.match(/[^.!?\n]+[.!?]+/g) ?? [text.trim()];`
- the `entropyScorer` branch of the compression loop — same pattern
- entity-definition inlining — `const sentences = sourceContent.match(/[^.!?\n]+[.!?]+/g) ?? [sourceContent];`
- `segmentEDUs()` — same pattern (sentence split before clause segmentation)
- `splitSentences()` — same pattern
- `extractMessageEntities()` — `content.match(re)` over `CAMEL_RE`, `PASCAL_RE`, `SNAKE_RE`, `VOWELLESS_RE`, `FILE_REF_RE`

**Incomplete URL substring sanitization (High, test).** In the `catches URLs` test, `e.includes('https://example.com/docs')` matches that substring anywhere in the URL, so arbitrary hosts may come before or after it.

Copilot could not generate an autofix suggestion for these alerts.
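One way to address the polynomial-regex alerts above — offered as a sketch, not the repo's actual fix — is to replace the backtracking pattern `/[^.!?\n]+[.!?]+/g` with a single left-to-right scan, which is O(n) regardless of input shape. Edge-case handling (e.g. runs of bare terminators) differs slightly from the regex.

```typescript
// Linear-time sentence splitter: one pass, no regex backtracking.
function splitSentencesLinear(text: string): string[] {
  const out: string[] = [];
  let start = 0;
  let i = 0;
  while (i < text.length) {
    const ch = text[i];
    if (ch === '.' || ch === '!' || ch === '?') {
      // Consume the whole run of terminators ("?!", "...").
      while (i < text.length && '.!?'.includes(text[i])) i++;
      const body = text.slice(start, i).trim();
      // Emit only if there is real content, not just terminators.
      if ([...body].some((c) => !'.!?'.includes(c))) out.push(body);
      start = i;
    } else if (ch === '\n') {
      // Like the regex, a newline ends a fragment without emitting it.
      start = ++i;
    } else {
      i++;
    }
  }
  return out; // trailing fragments without a terminator are dropped, as before
}

console.log(splitSentencesLinear('Works fine. Even with bangs!! And tails'));
// → ["Works fine.", "Even with bangs!!"]
```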
