
v1.3.0 — Quality benchmark overhaul + LLM judge#20

Merged
SimplyLiz merged 85 commits into `main` from `feature/v2-improvements`
Mar 21, 2026

Conversation

@SimplyLiz (Owner)

Summary

  • Quality benchmark overhaul: replaced broken metrics (keywordRetention, factRetention, negationErrors) with five meaningful ones: task-based probes (~70 across 13 scenarios), information density, compressed-only quality score, negative compression detection, and summary coherence checks
  • LLM-as-judge scoring (--llm-judge): optional multi-provider evaluation (OpenAI, Anthropic, Gemini, Ollama) — display-only, not in baselines
  • Opt-in feature comparison (--features): benchmarks each v2 opt-in feature against baseline to measure impact
  • Quality history (docs/quality-history.md): version-over-version quality tracking across v1.0.0→v1.3.0 with opt-in feature impact analysis
  • Gemini provider support for LLM benchmarks via @google/genai SDK
  • Merged dependabot PRs: dev deps bump + codeql-action v3→v4
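To make the probe idea concrete, here is a minimal sketch of what a task-based probe could look like. The shapes and names are hypothetical — the real `ProbeDefinition` and `getProbesForScenario()` live in `bench/quality.ts` and may differ — but the core idea is the same: a probe asks "can a reader still answer X from the compressed output?" and checks that the required fragments survived.

```typescript
// Hypothetical probe shape — the repo's actual ProbeDefinition may differ.
interface ProbeDefinition {
  id: string;
  description: string;
  // The probe passes only if every required fragment survives compression.
  mustContain: string[];
}

function runProbe(probe: ProbeDefinition, compressedText: string): boolean {
  return probe.mustContain.every((frag) => compressedText.includes(frag));
}

const probe: ProbeDefinition = {
  id: 'db-port',
  description: 'Can a reader still recover the database port?',
  mustContain: ['5432'],
};

console.log(runProbe(probe, 'Postgres runs on port 5432.')); // true
console.log(runProbe(probe, 'Postgres is configured.'));     // false
```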

Key quality findings

| Finding | Detail |
| --- | --- |
| Code preservation | 100% across all versions, all scenarios |
| v1.1.0 entity regression | Structured content 100%→68%, Entity-dense 68%→53%, Mixed languages 100%→67% |
| Conversation flow | Fixes deep-conversation probes (33%→100%) but destroys Long Q&A (entity retention→7%) |
| Semantic clustering | Compresses code-only messages it shouldn't (100%→75% probe pass) |
| Importance/contradiction | Zero measurable impact on current scenarios |

Test plan

  • npm run build — compiles
  • npm test — 663 tests pass
  • npm run lint && npm run format:check — clean
  • npm run bench:quality — all 13 scenarios run
  • npm run bench:quality:check — passes against baseline
  • npm run bench:quality:judge — LLM judge runs with Gemini/OpenAI/Ollama
  • npm run bench:quality:features — feature comparison runs
  • npm run bench:save — main baseline saved

SimplyLiz and others added 30 commits February 25, 2026 06:14
…ode to v6

- CLAUDE.md with architecture docs and branching strategy
- SECURITY.md with vulnerability reporting policy
- CHANGELOG.md reformatted to Keep a Changelog spec
- .nvmrc pinning Node 22
- Bump actions/setup-node v4 → v6
Exercises every public export as a real npm consumer would — catches
broken exports maps, missing tarball files, and ESM resolution failures
that unit tests cannot detect. Covers 26 scenarios including compress,
uncompress round-trips, dedup, token budgets, async paths, tool_calls,
re-compression, recursive uncompress, and large conversations.
Add package structure validation (publint --strict) and TypeScript type
resolution checks (attw) to the e2e pipeline. Artifacts (.tgz,
e2e/node_modules, e2e/package-lock.json) are now cleaned up after every
run. E2e job added to CI in parallel with existing jobs, gating publish.
…d error paths

Replaces custom pass/fail harness with node:test + node:assert/strict.
Strengthens fuzzy dedup (asserts messages_fuzzy_deduped > 0) and
tool_calls (verifies non-tool messages are compressed). Adds 7 error
handling tests covering TypeError contracts and graceful null/empty
content. Merges develop to resolve conflicts.
Add domain-agnostic framing (legal, medical, documentation, support)
and rename "Code-aware" to "Structure-aware" in feature list.
- Add inline .env parser in bench/run.ts (no dependency, won't override existing vars)
- Probe localhost:11434/api/tags to auto-detect Ollama without env vars
- Add LLM result types and save/load in bench/baseline.ts
- Auto-save LLM results to bench/baselines/llm/<provider>-<model>.json
- Extend doc generator with LLM comparison tables when result files exist
- Add .env.example template with commented-out provider keys
- Update skip message to mention Ollama auto-detection
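A dependency-free `.env` parser with these properties can be sketched as follows. This is an illustration of the behavior the commit describes (skip comments, tolerate an `export ` prefix, strip matching quotes, never override already-set variables), not the exact code in `bench/run.ts`:

```typescript
// Minimal .env parser sketch: no dependencies, never overrides existing vars.
function parseDotEnv(src: string, env: Record<string, string | undefined>): void {
  for (const rawLine of src.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue; // skip blanks and comments
    const noExport = line.startsWith('export ') ? line.slice(7) : line;
    const eq = noExport.indexOf('=');
    if (eq === -1) continue;
    const key = noExport.slice(0, eq).trim();
    let value = noExport.slice(eq + 1).trim();
    // Strip one layer of matching single or double quotes.
    if (
      value.length >= 2 &&
      ((value.startsWith('"') && value.endsWith('"')) ||
        (value.startsWith("'") && value.endsWith("'")))
    ) {
      value = value.slice(1, -1);
    }
    if (env[key] === undefined) env[key] = value; // existing vars win
  }
}

const env: Record<string, string | undefined> = { OPENAI_API_KEY: 'already-set' };
parseDotEnv('export OPENAI_API_KEY="from-file"\nOLLAMA_HOST=localhost:11434', env);
// env.OPENAI_API_KEY stays 'already-set'; env.OLLAMA_HOST is 'localhost:11434'
```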
… metrics

LLM benchmarks previously ran automatically when API keys were
detected, silently burning money on every `npm run bench`. Now
requires explicit `--llm` flag (`npm run bench:llm`).

Additions:
- Technical explanation scenario (pure prose, no code fences)
- vsDet expansion metric (LLM ratio / deterministic ratio)
- Token budget + LLM section (deterministic vs llm-escalate)
- bench:llm npm script

Fixes:
- .env parser: strip quotes, handle `export` prefix
- loadAllLlmResults: try/catch per file for malformed JSON
- Ollama: verify model availability via /api/tags response
- Anthropic: guard against empty content array
- LLM benchmark loop: per-scenario try/catch
- Doc generation: scenario count 7→8, add Technical explanation
…ture

- --save: writes current.json + history/v{version}.json, regenerates docs
- --check: compares against current.json, exits non-zero on regression
- --tolerance N: allows N% deviation (0% default, deterministic)
- Baselines reorganized: current.json at root, history/ for versioned
  snapshots, llm/ for non-deterministic reference data
- bench:llm added to package.json for explicit LLM benchmark runs
- Doc generation references correct baseline paths
Split docs/benchmarks.md into two files:
- docs/benchmarks.md: hand-written handbook (how to run, scenarios,
  interpreting results, regression testing)
- docs/benchmark-results.md: auto-generated by bench:save with Mermaid
  xychart-beta charts, summary table, and polished data presentation

Rewrite generateBenchmarkDocs() with compression ratio chart, dedup
impact chart, LLM comparison chart, key findings callout, and
conditional sections for LLM data and version history.
…pie chart

Add shields.io badges, unicode progress bars, reduction % and message
count columns to the compression table, a Mermaid pie chart for message
outcomes, and collapsible details sections for LLM provider tables.
Drop progress bar column from compression table — unicode blocks render
with variable width in GitHub's proportional-font tables. Switch LLM
comparison chart from double bar (stacked) to bar+line so both series
are visible side by side.
Interleave "Scenario (Det)" and "Scenario (LLM)" labels on the x-axis
so each scenario gets two side-by-side bars in a single series, avoiding
Mermaid's stacked-bar behavior.
Mermaid xychart can't do grouped bars — stacks or overlaps labels.
Replace with a clean comparison table showing Det vs Best LLM ratio,
delta percentage, and winner per scenario.
…arison

Render comparison as paired horizontal bars inside a fenced code block
(monospace), replacing the broken Mermaid chart. Each scenario shows
Det and LLM bars side by side with ratios and a star for LLM wins.
compressSync and compressAsync were identical (~180 lines each) except
for 2 summarize call sites. Replace both with a single compressGen
generator that yields summarize requests, driven by thin sync/async
runners. Removes 149 lines of duplication, no public API changes.
…CII charts

- Cross-provider summary table with avg ratio, vsDet, budget fits, time
- Fuzzy dedup table gains "vs Base" column highlighting improvements
- ASCII comparison charts now render for all providers, not just best
Single-page demo that lets users paste conversations in plain-text
chat format, adjust compression settings, and see results with an
inline diff view highlighting what changed.

- esbuild bundles src/index.ts → demo/bundle.js (IIFE, global CCE)
- Plain-text input format (role: message, blank line separates)
- All CompressOptions exposed: recencyWindow, tokenBudget, preserve,
  dedup, fuzzyDedup, fuzzyThreshold, forceConverge
- Line-level diff output: red/strikethrough for removed, green for
  added, tags for preserved/compressed/removed messages
- 5 example conversations: coding assistant, technical prose,
  structured + credentials, short chat, deep conversation
- npm scripts: demo:build, demo
feat(demo): browser-based demo app
Measure each dist/*.js file and total after tsc build. Adds
BundleSizeResult type, comparison loop for --check regression
detection, doc section with table, and gzip badge.
Bumps the dev-deps group with 1 update: [publint](https://github.com/publint/publint/tree/HEAD/packages/publint).


Updates `publint` from 0.3.17 to 0.3.18
- [Release notes](https://github.com/publint/publint/releases)
- [Changelog](https://github.com/publint/publint/blob/master/packages/publint/CHANGELOG.md)
- [Commits](https://github.com/publint/publint/commits/publint@0.3.18/packages/publint)

---
updated-dependencies:
- dependency-name: publint
  dependency-version: 0.3.18
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: dev-deps
...

Signed-off-by: dependabot[bot] <support@github.com>
gzip output varies across zlib versions (macOS vs Ubuntu CI), so
only raw bytes are regression-checked. gzipBytes remains tracked
in baselines and docs as informational.
SimplyLiz and others added 27 commits March 20, 2026 20:11
- New entropyScorer option: plug in a small LM for self-information
  based sentence importance scoring (Selective Context paper)
- entropyScorerMode: 'replace' (entropy only) or 'augment' (weighted
  average with heuristic, default)
- src/entropy.ts: splitSentences, normalizeScores, combineScores utils
- Sync and async paths supported; async scorer throws in sync mode
- Zero new dependencies: scorer is user-provided function
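A rough sketch of the `augment` mode described above: entropy scores from the user-provided scorer are min-max normalized, then blended with the heuristic score via a weighted average. The 0.5 weight and function shapes are assumptions for illustration; the real helpers live in `src/entropy.ts`.

```typescript
// Min-max normalize raw entropy scores into [0, 1].
function normalizeScores(raw: number[]): number[] {
  const min = Math.min(...raw);
  const max = Math.max(...raw);
  if (max === min) return raw.map(() => 0.5); // flat input → neutral scores
  return raw.map((s) => (s - min) / (max - min));
}

// 'augment' mode: weighted average of heuristic and normalized entropy.
function combineScores(heuristic: number[], entropy: number[], weight = 0.5): number[] {
  const norm = normalizeScores(entropy);
  return heuristic.map((h, i) => (1 - weight) * h + weight * norm[i]);
}

console.log(combineScores([0.2, 0.8], [10, 30])); // ≈ [0.1, 0.9]
```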
- Detects Q&A pairs, request→action→confirmation chains, corrections,
  and acknowledgment patterns in message history
- Groups flow chains into single compression units producing more
  coherent summaries (e.g., "Q: how does X work? → A: it uses Y")
- conversationFlow option: opt-in, default false
- Flow chains override soft preservation (recency, short content)
  but not hard blocks (system role, dedup, tool_calls)
…uto)

- compressionDepth option controls summarization aggressiveness
- gentle: standard sentence selection (default, backward compatible)
- moderate: 50% tighter budgets for more aggressive compression
- aggressive: entity-only stubs for maximum ratio
- auto: progressively tries gentle → moderate → aggressive until
  tokenBudget fits, with quality gate (stops if quality < 0.60)
- Both sync and async paths supported
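The `auto` escalation described above can be sketched as a simple loop. Everything below is hypothetical scaffolding (`compressAtDepth`, `countTokens`, `scoreQuality` stand in for internals not shown here); only the shape — try each depth in order, stop when the budget fits, bail out if quality drops below the 0.60 gate — comes from the commit message.

```typescript
type Depth = 'gentle' | 'moderate' | 'aggressive';

function autoCompress(
  text: string,
  tokenBudget: number,
  compressAtDepth: (text: string, depth: Depth) => string,
  countTokens: (text: string) => number,
  scoreQuality: (text: string) => number,
): string {
  let best = text;
  for (const depth of ['gentle', 'moderate', 'aggressive'] as const) {
    const candidate = compressAtDepth(text, depth);
    if (scoreQuality(candidate) < 0.6) break; // quality gate: stop escalating
    best = candidate;
    if (countTokens(candidate) <= tokenBudget) return candidate; // budget fits
  }
  return best; // best acceptable attempt, even if the budget was never met
}
```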
- Coreference tracking (coreference option): when a compressed message
  defines an entity referenced by a preserved message, the definition
  is inlined into the summary to prevent orphaned references
- Semantic clustering (semanticClustering option): groups messages by
  topic using TF-IDF cosine similarity + entity overlap Jaccard, then
  compresses each cluster as a unit for better topic coherence
- Both features are opt-in, zero new dependencies
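The two similarity signals named above can be sketched as follows. This uses raw term-frequency vectors as a stand-in for TF-IDF, and Jaccard overlap of entity sets; how the two scores are blended inside the library is not shown here.

```typescript
// Term-frequency vector for a message (TF-IDF stand-in for illustration).
function termFreq(text: string): Map<string, number> {
  const tf = new Map<string, number>();
  for (const tok of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    tf.set(tok, (tf.get(tok) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term vectors.
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  for (const [tok, wa] of a) dot += wa * (b.get(tok) ?? 0);
  const norm = (m: Map<string, number>) =>
    Math.sqrt([...m.values()].reduce((s, v) => s + v * v, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Jaccard overlap of two entity sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  const inter = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

console.log(jaccard(new Set(['db', 'port']), new Set(['db', 'host']))); // 1/3
```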
- Segments text into Elementary Discourse Units with dependency graph
- Clause boundary detection via discourse markers (then, because, which...)
- Pronoun/demonstrative, temporal, and causal dependency edges
- When selecting EDUs for summary, dependency parents are included
  (up to 2 levels) to prevent incoherent output
- discourseAware option: opt-in, default false
- 8 adversarial test cases: pronoun-heavy, scattered entities,
  correction chains, code-interleaved prose, near-duplicates with
  critical differences, 10k+ char messages, mixed SQL/JSON/bash,
  and full round-trip integrity with all features enabled
- Update roadmap: 14 of 16 items complete
- ML token classifier (mlTokenClassifier option): per-token keep/remove
  classification via user-provided model (LLMLingua-2 style). Includes
  sync/async support, whitespace tokenizer, mock classifier for testing
- A/B comparison tool (npm run bench:compare): side-by-side comparison
  of default vs v2 features across coding, deep conversation, and
  agentic scenarios. Reports ratio, quality, entity retention, tokens
- All 16/16 roadmap items now complete
…tion

- bench/run.ts: new Quality Metrics (v2) table showing entity retention,
  structural integrity, reference coherence, and quality score per scenario
- bench/baseline.ts: QualityResult type, quality section in generated docs,
  average quality score in summary table
- bench/compare.ts: add Long Q&A and Technical explanation scenarios,
  rename V2 option set to "V2 balanced" (no relevanceThreshold)
- flow.ts: exclude messages with code fences from flow chain detection
  to prevent Q&A chains from dropping code content
- package.json: add bench:compare script
- New docs/v2-features.md: full documentation for all 11 new features
  with usage examples, how-it-works sections, and explicit tradeoff
  analysis for each feature
- docs/api-reference.md: updated exports listing, 13 new options in
  CompressOptions table, 5 new result fields, new types
  (MLTokenClassifier, TokenClassification)
- docs/token-budget.md: added tiered budget strategy and compression
  depth sections with cross-links
- docs/README.md: added V2 Features to index
- Each feature documents: what it does, how to use it, how it works
  internally, and what you give up (the tradeoff)
- Flow chains and clusters no longer skip non-member messages between
  chain endpoints. Previously, a chain spanning indices [1,4] would
  skip indices 2,3 even if they weren't chain members (dropping code)
- Importance threshold raised from 0.35 to 0.65. The old threshold
  preserved nearly all messages in entity-rich conversations, reducing
  compression ratio by up to 30% with no quality benefit
- EDU scorer replaced length-based heuristic with information-density
  scoring (identifiers, numbers, emphasis) to avoid keeping long filler
  clauses over short technical ones
- Quick reference table, feature section, and TSDoc all flag the 8-28%
  ratio regression without a custom ML scorer
- Explain why: dependency tracking inherently fights compression by
  pulling in parent EDUs, and the rule-based scorer can't distinguish
  load-bearing dependencies from decorative ones
- Recommend using exported segmentEDUs/scoreEDUs/selectEDUs directly
  with a custom scorer instead of the discourseAware option
- Remove discourseAware from recommended feature combinations
Adaptive entity-aware budgets were changing default compression output
(6% regression on coding scenario) because extractEntities was called
unconditionally. Now entity-adaptive budgets only activate when
compressionDepth is explicitly set to moderate/aggressive/auto.

Default path (no v2 options) now produces identical output to develop.
- Flow chains and clusters only mark themselves as processed AFTER
  successful compression. Previously they were marked on entry,
  causing non-compressed chain members to be silently dropped
- Semantic clusters restricted to consecutive indices only —
  non-consecutive merges broke round-trip because uncompress can't
  restore interleaved message ordering
- Added V2 Features Comparison section to bench reporter showing
  each feature individually and recommended combo vs default, with
  per-scenario ratio/quality and delta row
- All 8 scenarios × 8 configs pass round-trip verification
feat: v2 compression features — quality metrics, flow detection, tiered budget, depth control
Separate quality benchmark system (bench/quality.ts) that measures
compression fidelity independently from the existing perf/regression
suite. Includes:

- quality-analysis.ts: compressed-only retention metrics, semantic
  fidelity scoring (fact extraction + negation detection), per-message
  quality breakdown, and recencyWindow tradeoff sweep
- quality-scenarios.ts: 6 edge case scenarios (single-char, giant
  message, code-only, entity-dense, prose-only, mixed languages)
- quality.ts: standalone runner with --save/--check against its own
  baseline namespace (bench/baselines/quality/)
- backfill.ts: retroactively generates quality baselines for older
  git refs via temporary worktrees

Key design decisions:
- Retention measured only on compressed messages (fixes the all-1.0
  masking problem in the existing analyzeRetention)
- Code block integrity is byte-identical verification, not just fence
  count
- Zero-tolerance regression on code block integrity, 5% on entity
  retention, 10% on fact retention
- Completely isolated from existing --check (separate baseline files)
- Backfilled v1.0.0 baseline for historical comparison
… LLM judge

Replace broken quality metrics (keywordRetention, factRetention, negationErrors)
with five meaningful ones: task-based probes (~70 across 13 scenarios),
information density, compressed-only quality score, negative compression
detection, and summary coherence checks.

- Add ProbeDefinition type and getProbesForScenario() with curated probes
- Add computeInformationDensity(), computeCompressedQualityScore(),
  detectNegativeCompressions(), checkCoherence() analysis functions
- Add min-output-chars probes to catch over-aggressive compression
- Add lang aliases to countCodeBlocks (typescript/ts, python/py, yaml/yml)
- Fix regression thresholds: coherence/negativeCompressions track increases
  from baseline, not just zero-to-nonzero transitions
- Add --llm-judge flag with multi-provider support (OpenAI, Anthropic,
  Gemini, Ollama) for LLM-as-judge scoring (display-only, not in baseline)
- Add Gemini provider to bench/llm.ts (@google/genai SDK)
- Add bench:quality:judge npm script
- Update docs/benchmarks.md with quality metrics, probes, LLM judge, and
  regression threshold documentation
- Update CLAUDE.md with quality benchmark commands
- Re-save quality baseline with new format
…op/github/codeql-action-4

chore(deps): bump github/codeql-action from 3 to 4
…/dev-deps-10041a4c1d

chore(deps-dev): bump the dev-deps group across 1 directory with 6 updates
- Bump version to 1.3.0
- Add quality history documentation with version comparison
- Add --features flag for opt-in feature benchmarking
- Update CHANGELOG with all 1.3.0 changes
- Save baselines for v1.3.0
- Regenerate benchmark-results.md
- Link quality-history.md from README and docs index
# Conflicts:
#	CHANGELOG.md
#	CLAUDE.md
#	README.md
Re-apply: version bump to 1.3.0, CHANGELOG 1.3.0 section, quality
benchmark npm scripts, CLAUDE.md commands, Gemini provider in llm.ts,
quality-history link in README and docs index, @google/genai devDep.
@SimplyLiz SimplyLiz merged commit d396e7b into main Mar 21, 2026
5 of 8 checks passed
@SimplyLiz SimplyLiz deleted the feature/v2-improvements branch March 21, 2026 18:03
Check failure — Code scanning / CodeQL

**Polynomial regular expression used on uncontrolled data (High).** The sentence-splitting pattern `/[^.!?\n]+[.!?]+/g` and related regexes depend on library input and may run slowly on strings with many repetitions of `'\n'`, `' '`, or `'!'`. Flagged locations:

- numbered-step detection for reasoning chains (`const stepMatches = text.match(NUMBERED_STEP_RE);`)
- `bestSentenceScore()` — `const sentences = text.match(/[^.!?\n]+[.!?]+/g);`
- the entropy score combiner (`rawScores: number[]`, `mode: 'replace' | 'augment'`) — `const sentences = text.match(/[^.!?\n]+[.!?]+/g) ?? [text.trim()];`
- the `entropyScorer` branch of the compression loop — same pattern
- entity-definition inlining — `const sentences = sourceContent.match(/[^.!?\n]+[.!?]+/g) ?? [sourceContent];`
- `segmentEDUs()` — same pattern (sentence split before clause segmentation)
- `splitSentences()` — same pattern
- `extractMessageEntities()` — `content.match(re)` over `CAMEL_RE`, `PASCAL_RE`, `SNAKE_RE`, `VOWELLESS_RE`, `FILE_REF_RE`

**Incomplete URL substring sanitization (High, test).** In the `catches URLs` test, `e.includes('https://example.com/docs')` matches that substring anywhere in the URL, so arbitrary hosts may come before or after it.

Copilot could not generate an autofix suggestion for these alerts.
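One way to address the polynomial-regex alerts above — offered as a sketch, not the repo's actual fix — is to replace the backtracking pattern `/[^.!?\n]+[.!?]+/g` with a single left-to-right scan, which is O(n) regardless of input shape. Edge-case handling (e.g. runs of bare terminators) differs slightly from the regex.

```typescript
// Linear-time sentence splitter: one pass, no regex backtracking.
function splitSentencesLinear(text: string): string[] {
  const out: string[] = [];
  let start = 0;
  let i = 0;
  while (i < text.length) {
    const ch = text[i];
    if (ch === '.' || ch === '!' || ch === '?') {
      // Consume the whole run of terminators ("?!", "...").
      while (i < text.length && '.!?'.includes(text[i])) i++;
      const body = text.slice(start, i).trim();
      // Emit only if there is real content, not just terminators.
      if ([...body].some((c) => !'.!?'.includes(c))) out.push(body);
      start = i;
    } else if (ch === '\n') {
      // Like the regex, a newline ends a fragment without emitting it.
      start = ++i;
    } else {
      i++;
    }
  }
  return out; // trailing fragments without a terminator are dropped, as before
}

console.log(splitSentencesLinear('Works fine. Even with bangs!! And tails'));
// → ["Works fine.", "Even with bangs!!"]
```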
