v1.3.0 — Quality benchmark overhaul + LLM judge #20
Conversation
…ode to v6

- CLAUDE.md with architecture docs and branching strategy
- SECURITY.md with vulnerability reporting policy
- CHANGELOG.md reformatted to Keep a Changelog spec
- .nvmrc pinning Node 22
- Bump actions/setup-node v4 → v6
Exercises every public export as a real npm consumer would — catches broken exports maps, missing tarball files, and ESM resolution failures that unit tests cannot detect. Covers 26 scenarios including compress, uncompress round-trips, dedup, token budgets, async paths, tool_calls, re-compression, recursive uncompress, and large conversations.
Add package structure validation (publint --strict) and TypeScript type resolution checks (attw) to the e2e pipeline. Artifacts (.tgz, e2e/node_modules, e2e/package-lock.json) are now cleaned up after every run. E2e job added to CI in parallel with existing jobs, gating publish.
…d error paths

Replaces custom pass/fail harness with node:test + node:assert/strict. Strengthens fuzzy dedup (asserts messages_fuzzy_deduped > 0) and tool_calls (verifies non-tool messages are compressed). Adds 7 error handling tests covering TypeError contracts and graceful null/empty content. Merges develop to resolve conflicts.
Add e2e smoke test suite
Add domain-agnostic framing (legal, medical, documentation, support) and rename "Code-aware" to "Structure-aware" in feature list.
- Add inline .env parser in bench/run.ts (no dependency, won't override existing vars)
- Probe localhost:11434/api/tags to auto-detect Ollama without env vars
- Add LLM result types and save/load in bench/baseline.ts
- Auto-save LLM results to bench/baselines/llm/<provider>-<model>.json
- Extend doc generator with LLM comparison tables when result files exist
- Add .env.example template with commented-out provider keys
- Update skip message to mention Ollama auto-detection
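The inline parser's contract (dependency-free, never overriding variables that are already set) can be sketched roughly as follows. The function name and exact parsing rules here are illustrative, not the actual bench/run.ts code:

```typescript
// Hypothetical sketch of a dependency-free .env parser. It mirrors the
// behavior described above: parse KEY=VALUE lines, strip one layer of
// quotes, tolerate an `export ` prefix, skip comments, and never
// override variables that already exist in the target environment.
export function applyDotEnv(
  contents: string,
  env: Record<string, string | undefined>,
): void {
  for (const rawLine of contents.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // blanks and comments
    const withoutExport = line.startsWith("export ") ? line.slice(7) : line;
    const eq = withoutExport.indexOf("=");
    if (eq <= 0) continue; // not a KEY=VALUE line
    const key = withoutExport.slice(0, eq).trim();
    let value = withoutExport.slice(eq + 1).trim();
    // Strip a single layer of matching quotes.
    if (
      (value.startsWith('"') && value.endsWith('"')) ||
      (value.startsWith("'") && value.endsWith("'"))
    ) {
      value = value.slice(1, -1);
    }
    if (env[key] === undefined) env[key] = value; // never override
  }
}
```

A runner would read .env once at startup and apply it to `process.env` before probing providers.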
… metrics

LLM benchmarks previously ran automatically when API keys were detected, silently burning money on every `npm run bench`. Now requires explicit `--llm` flag (`npm run bench:llm`).

Additions:
- Technical explanation scenario (pure prose, no code fences)
- vsDet expansion metric (LLM ratio / deterministic ratio)
- Token budget + LLM section (deterministic vs llm-escalate)
- bench:llm npm script

Fixes:
- .env parser: strip quotes, handle `export` prefix
- loadAllLlmResults: try/catch per file for malformed JSON
- Ollama: verify model availability via /api/tags response
- Anthropic: guard against empty content array
- LLM benchmark loop: per-scenario try/catch
- Doc generation: scenario count 7→8, add Technical explanation
…ture
- --save: writes current.json + history/v{version}.json, regenerates docs
- --check: compares against current.json, exits non-zero on regression
- --tolerance N: allows N% deviation (0% default, deterministic)
- Baselines reorganized: current.json at root, history/ for versioned snapshots, llm/ for non-deterministic reference data
- bench:llm added to package.json for explicit LLM benchmark runs
- Doc generation references correct baseline paths
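The `--check`/`--tolerance` comparison reduces to a small predicate per metric. A sketch under assumed names (the real baseline.ts structures this differently):

```typescript
// Illustrative sketch of the --check / --tolerance logic: a metric is a
// regression when it deviates from the baseline by more than tolerancePct
// percent. With the default tolerance of 0, any deviation fails, which is
// what makes the deterministic suite exact.
export function isRegression(
  baseline: number,
  current: number,
  tolerancePct: number = 0,
): boolean {
  if (baseline === 0) return current !== 0; // avoid division by zero
  const deviationPct = (Math.abs(current - baseline) / Math.abs(baseline)) * 100;
  return deviationPct > tolerancePct;
}
```

A runner would evaluate this for every tracked metric against current.json and exit non-zero if any call returns true.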
Split docs/benchmarks.md into two files:
- docs/benchmarks.md: hand-written handbook (how to run, scenarios, interpreting results, regression testing)
- docs/benchmark-results.md: auto-generated by bench:save with Mermaid xychart-beta charts, summary table, and polished data presentation

Rewrite generateBenchmarkDocs() with compression ratio chart, dedup impact chart, LLM comparison chart, key findings callout, and conditional sections for LLM data and version history.
…pie chart

Add shields.io badges, unicode progress bars, reduction % and message count columns to the compression table, a Mermaid pie chart for message outcomes, and collapsible details sections for LLM provider tables.
Drop progress bar column from compression table — unicode blocks render with variable width in GitHub's proportional-font tables. Switch LLM comparison chart from double bar (stacked) to bar+line so both series are visible side by side.
Interleave "Scenario (Det)" and "Scenario (LLM)" labels on the x-axis so each scenario gets two side-by-side bars in a single series, avoiding Mermaid's stacked-bar behavior.
Mermaid xychart can't do grouped bars — stacks or overlaps labels. Replace with a clean comparison table showing Det vs Best LLM ratio, delta percentage, and winner per scenario.
…arison

Render comparison as paired horizontal bars inside a fenced code block (monospace), replacing the broken Mermaid chart. Each scenario shows Det and LLM bars side by side with ratios and a star for LLM wins.
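Paired monospace bars of this kind could be produced by something like the sketch below. The names and the star convention (assuming a smaller ratio means better compression) are illustrative, not the actual bench code:

```typescript
// Sketch: paired horizontal bars in plain ASCII, intended to be wrapped
// in a fenced code block so GitHub renders them in a monospace font.
// Assumption: a smaller compression ratio is better, so the star marks
// the LLM row when its ratio beats the deterministic one.
export function pairedBars(
  scenario: string,
  detRatio: number,
  llmRatio: number,
  width: number = 30,
): string {
  const bar = (ratio: number) =>
    "#".repeat(Math.round(ratio * width)).padEnd(width);
  const star = llmRatio < detRatio ? " *" : "";
  return [
    `${scenario}`,
    `  Det ${bar(detRatio)} ${detRatio.toFixed(2)}`,
    `  LLM ${bar(llmRatio)} ${llmRatio.toFixed(2)}${star}`,
  ].join("\n");
}
```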
compressSync and compressAsync were identical (~180 lines each) except for 2 summarize call sites. Replace both with a single compressGen generator that yields summarize requests, driven by thin sync/async runners. Removes 149 lines of duplication, no public API changes.
…CII charts

- Cross-provider summary table with avg ratio, vsDet, budget fits, time
- Fuzzy dedup table gains "vs Base" column highlighting improvements
- ASCII comparison charts now render for all providers, not just best
Single-page demo that lets users paste conversations in plain-text chat format, adjust compression settings, and see results with an inline diff view highlighting what changed.

- esbuild bundles src/index.ts → demo/bundle.js (IIFE, global CCE)
- Plain-text input format (role: message, blank line separates)
- All CompressOptions exposed: recencyWindow, tokenBudget, preserve, dedup, fuzzyDedup, fuzzyThreshold, forceConverge
- Line-level diff output: red/strikethrough for removed, green for added, tags for preserved/compressed/removed messages
- 5 example conversations: coding assistant, technical prose, structured + credentials, short chat, deep conversation
- npm scripts: demo:build, demo
feat(demo): browser-based demo app
Measure each dist/*.js file and total after tsc build. Adds BundleSizeResult type, comparison loop for --check regression detection, doc section with table, and gzip badge.
Bumps the dev-deps group with 1 update: [publint](https://github.com/publint/publint/tree/HEAD/packages/publint).

Updates `publint` from 0.3.17 to 0.3.18
- [Release notes](https://github.com/publint/publint/releases)
- [Changelog](https://github.com/publint/publint/blob/master/packages/publint/CHANGELOG.md)
- [Commits](https://github.com/publint/publint/commits/publint@0.3.18/packages/publint)

---
updated-dependencies:
- dependency-name: publint
  dependency-version: 0.3.18
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: dev-deps
...

Signed-off-by: dependabot[bot] <support@github.com>
gzip output varies across zlib versions (macOS vs Ubuntu CI), so only raw bytes are regression-checked. gzipBytes remains tracked in baselines and docs as informational.
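A sketch of how such a measurement might look. The field layout of BundleSizeResult shown here is an assumption, but it reflects the policy above: raw bytes are the stable, regression-checked number, while gzip bytes are recorded as informational only:

```typescript
import { gzipSync } from "node:zlib";

// Illustrative bundle-size measurement (assumed shape, not the actual
// bench code). rawBytes is deterministic across machines and is what the
// --check comparison uses; gzipBytes varies with the zlib version, so it
// is tracked in baselines and docs but never gated on.
export interface BundleSizeResult {
  file: string;
  rawBytes: number; // regression-checked
  gzipBytes: number; // informational only
}

export function measureBundle(file: string, contents: string): BundleSizeResult {
  const raw = Buffer.from(contents, "utf8");
  return {
    file,
    rawBytes: raw.byteLength,
    gzipBytes: gzipSync(raw).byteLength,
  };
}
```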
- New entropyScorer option: plug in a small LM for self-information based sentence importance scoring (Selective Context paper)
- entropyScorerMode: 'replace' (entropy only) or 'augment' (weighted average with heuristic, default)
- src/entropy.ts: splitSentences, normalizeScores, combineScores utils
- Sync and async paths supported; async scorer throws in sync mode
- Zero new dependencies: scorer is user-provided function
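The combination step for the two modes might look like the sketch below. The actual src/entropy.ts signatures and the averaging weights are assumptions (equal weighting shown):

```typescript
// Illustrative sketch of score combination for the two modes described
// above: 'replace' uses the entropy scores outright, 'augment' averages
// them with the heuristic scores (equal weights assumed here).
export function combineScores(
  heuristic: number[],
  entropy: number[],
  mode: "replace" | "augment",
): number[] {
  if (mode === "replace") return entropy.slice();
  return heuristic.map((h, i) => (h + entropy[i]) / 2);
}

// Min-max normalization so arbitrary user-provided scorer output becomes
// comparable to the 0..1 heuristic scores before combining.
export function normalizeScores(scores: number[]): number[] {
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  if (max === min) return scores.map(() => 0.5); // flat input: neutral score
  return scores.map((s) => (s - min) / (max - min));
}
```

Normalizing first is what lets a raw negative-log-probability scorer and the built-in heuristic live on the same scale.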
- Detects Q&A pairs, request→action→confirmation chains, corrections, and acknowledgment patterns in message history
- Groups flow chains into single compression units producing more coherent summaries (e.g., "Q: how does X work? → A: it uses Y")
- conversationFlow option: opt-in, default false
- Flow chains override soft preservation (recency, short content) but not hard blocks (system role, dedup, tool_calls)
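One of these patterns, Q&A pair detection, can be sketched as follows. The heuristics shown are illustrative stand-ins, not the library's actual rules:

```typescript
// Rough sketch of Q&A pair detection: a user message that looks like a
// question, immediately followed by an assistant reply, forms one chain
// that would then be compressed as a unit ("Q: ... → A: ...").
// The question heuristic here is deliberately simple and hypothetical.
export interface Msg {
  role: "user" | "assistant" | "system";
  content: string;
}

function looksLikeQuestion(text: string): boolean {
  const t = text.trim().toLowerCase();
  return t.endsWith("?") || /^(how|what|why|where|when|can|does|is)\b/.test(t);
}

export function detectQAPairs(messages: Msg[]): Array<[number, number]> {
  const pairs: Array<[number, number]> = [];
  for (let i = 0; i + 1 < messages.length; i++) {
    if (
      messages[i].role === "user" &&
      looksLikeQuestion(messages[i].content) &&
      messages[i + 1].role === "assistant"
    ) {
      pairs.push([i, i + 1]); // [question index, answer index]
    }
  }
  return pairs;
}
```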
…uto)

- compressionDepth option controls summarization aggressiveness
  - gentle: standard sentence selection (default, backward compatible)
  - moderate: 50% tighter budgets for more aggressive compression
  - aggressive: entity-only stubs for maximum ratio
  - auto: progressively tries gentle → moderate → aggressive until tokenBudget fits, with quality gate (stops if quality < 0.60)
- Both sync and async paths supported
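The auto mode's escalation loop can be pictured like this. It is an illustrative sketch; the real implementation runs inside the full compression pipeline, and how it treats a sub-gate result may differ:

```typescript
// Sketch of 'auto' depth escalation: try each depth in order, stop as
// soon as the result fits tokenBudget, and refuse to escalate further
// once quality drops below the 0.60 gate.
type Depth = "gentle" | "moderate" | "aggressive";

interface DepthResult {
  tokens: number;
  quality: number; // 0..1
}

export function autoDepth(
  compressAt: (depth: Depth) => DepthResult,
  tokenBudget: number,
): Depth {
  const order: Depth[] = ["gentle", "moderate", "aggressive"];
  let chosen: Depth = "gentle";
  for (const depth of order) {
    const result = compressAt(depth);
    chosen = depth;
    if (result.tokens <= tokenBudget) break; // budget fits: done
    if (result.quality < 0.6) break; // quality gate: stop escalating
  }
  return chosen;
}
```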
- Coreference tracking (coreference option): when a compressed message defines an entity referenced by a preserved message, the definition is inlined into the summary to prevent orphaned references
- Semantic clustering (semanticClustering option): groups messages by topic using TF-IDF cosine similarity + entity overlap Jaccard, then compresses each cluster as a unit for better topic coherence
- Both features are opt-in, zero new dependencies
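The two similarity signals behind the clustering can be sketched in a few lines. Note the simplification: this uses plain term-frequency cosine, whereas the real clustering additionally weights terms by inverse document frequency:

```typescript
// Cosine similarity over term-frequency vectors (TF-IDF simplified to
// plain TF for brevity). 1.0 means identical term distributions.
export function cosineSimilarity(a: string, b: string): number {
  const tf = (text: string) => {
    const counts = new Map<string, number>();
    for (const w of text.toLowerCase().split(/\W+/).filter(Boolean)) {
      counts.set(w, (counts.get(w) ?? 0) + 1);
    }
    return counts;
  };
  const va = tf(a);
  const vb = tf(b);
  let dot = 0;
  for (const [w, c] of va) dot += c * (vb.get(w) ?? 0);
  const norm = (v: Map<string, number>) =>
    Math.sqrt([...v.values()].reduce((s, c) => s + c * c, 0));
  const denom = norm(va) * norm(vb);
  return denom === 0 ? 0 : dot / denom;
}

// Jaccard overlap of two entity sets: |A ∩ B| / |A ∪ B|.
export function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let inter = 0;
  for (const x of a) if (b.has(x)) inter++;
  return inter / (a.size + b.size - inter);
}
```

A cluster would combine both signals, so two messages that share vocabulary and mention the same identifiers land in the same compression unit.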
- Segments text into Elementary Discourse Units with dependency graph
- Clause boundary detection via discourse markers (then, because, which...)
- Pronoun/demonstrative, temporal, and causal dependency edges
- When selecting EDUs for summary, dependency parents are included (up to 2 levels) to prevent incoherent output
- discourseAware option: opt-in, default false
- 8 adversarial test cases: pronoun-heavy, scattered entities, correction chains, code-interleaved prose, near-duplicates with critical differences, 10k+ char messages, mixed SQL/JSON/bash, and full round-trip integrity with all features enabled
- Update roadmap: 14 of 16 items complete
- ML token classifier (mlTokenClassifier option): per-token keep/remove classification via user-provided model (LLMLingua-2 style). Includes sync/async support, whitespace tokenizer, mock classifier for testing
- A/B comparison tool (npm run bench:compare): side-by-side comparison of default vs v2 features across coding, deep conversation, and agentic scenarios. Reports ratio, quality, entity retention, tokens
- All 16/16 roadmap items now complete
…tion

- bench/run.ts: new Quality Metrics (v2) table showing entity retention, structural integrity, reference coherence, and quality score per scenario
- bench/baseline.ts: QualityResult type, quality section in generated docs, average quality score in summary table
- bench/compare.ts: add Long Q&A and Technical explanation scenarios, rename V2 option set to "V2 balanced" (no relevanceThreshold)
- flow.ts: exclude messages with code fences from flow chain detection to prevent Q&A chains from dropping code content
- package.json: add bench:compare script
- New docs/v2-features.md: full documentation for all 11 new features with usage examples, how-it-works sections, and explicit tradeoff analysis for each feature
- docs/api-reference.md: updated exports listing, 13 new options in CompressOptions table, 5 new result fields, new types (MLTokenClassifier, TokenClassification)
- docs/token-budget.md: added tiered budget strategy and compression depth sections with cross-links
- docs/README.md: added V2 Features to index
- Each feature documents: what it does, how to use it, how it works internally, and what you give up (the tradeoff)
- Flow chains and clusters no longer skip non-member messages between chain endpoints. Previously, a chain spanning indices [1,4] would skip indices 2,3 even if they weren't chain members (dropping code)
- Importance threshold raised from 0.35 to 0.65. The old threshold preserved nearly all messages in entity-rich conversations, reducing compression ratio by up to 30% with no quality benefit
- EDU scorer replaced length-based heuristic with information-density scoring (identifiers, numbers, emphasis) to avoid keeping long filler clauses over short technical ones
- Quick reference table, feature section, and TSDoc all flag the 8-28% ratio regression without a custom ML scorer
- Explain why: dependency tracking inherently fights compression by pulling in parent EDUs, and the rule-based scorer can't distinguish load-bearing dependencies from decorative ones
- Recommend using exported segmentEDUs/scoreEDUs/selectEDUs directly with a custom scorer instead of the discourseAware option
- Remove discourseAware from recommended feature combinations
Adaptive entity-aware budgets were changing default compression output (6% regression on coding scenario) because extractEntities was called unconditionally. Now entity-adaptive budgets only activate when compressionDepth is explicitly set to moderate/aggressive/auto. Default path (no v2 options) now produces identical output to develop.
- Flow chains and clusters only mark themselves as processed AFTER successful compression. Previously they were marked on entry, causing non-compressed chain members to be silently dropped
- Semantic clusters restricted to consecutive indices only — non-consecutive merges broke round-trip because uncompress can't restore interleaved message ordering
- Added V2 Features Comparison section to bench reporter showing each feature individually and recommended combo vs default, with per-scenario ratio/quality and delta row
- All 8 scenarios × 8 configs pass round-trip verification
feat: v2 compression features — quality metrics, flow detection, tiered budget, depth control
Separate quality benchmark system (bench/quality.ts) that measures compression fidelity independently from the existing perf/regression suite.

Includes:
- quality-analysis.ts: compressed-only retention metrics, semantic fidelity scoring (fact extraction + negation detection), per-message quality breakdown, and recencyWindow tradeoff sweep
- quality-scenarios.ts: 6 edge case scenarios (single-char, giant message, code-only, entity-dense, prose-only, mixed languages)
- quality.ts: standalone runner with --save/--check against its own baseline namespace (bench/baselines/quality/)
- backfill.ts: retroactively generates quality baselines for older git refs via temporary worktrees

Key design decisions:
- Retention measured only on compressed messages (fixes the all-1.0 masking problem in the existing analyzeRetention)
- Code block integrity is byte-identical verification, not just fence count
- Zero-tolerance regression on code block integrity, 5% on entity retention, 10% on fact retention
- Completely isolated from existing --check (separate baseline files)
- Backfilled v1.0.0 baseline for historical comparison
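The compressed-only retention decision can be illustrated with a sketch. The keyword-overlap metric below is a simple stand-in for the real analysis; the point is the filtering step:

```typescript
// Sketch of compressed-only retention: measure retention only over
// messages that were actually compressed. Averaging in preserved
// messages, which trivially retain 100% of their keywords, masks real
// losses behind a wall of 1.0 scores (the all-1.0 problem noted above).
interface MessagePair {
  original: string;
  result: string;
  compressed: boolean; // false = preserved verbatim
}

function keywordRetention(original: string, result: string): number {
  const words = new Set(
    original.toLowerCase().split(/\W+/).filter((w) => w.length > 3),
  );
  if (words.size === 0) return 1;
  const lower = result.toLowerCase();
  let kept = 0;
  for (const w of words) if (lower.includes(w)) kept++;
  return kept / words.size;
}

export function compressedOnlyRetention(pairs: MessagePair[]): number {
  const compressed = pairs.filter((p) => p.compressed);
  if (compressed.length === 0) return 1; // nothing was compressed
  const sum = compressed.reduce(
    (s, p) => s + keywordRetention(p.original, p.result),
    0,
  );
  return sum / compressed.length;
}
```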
… LLM judge

Replace broken quality metrics (keywordRetention, factRetention, negationErrors) with five meaningful ones: task-based probes (~70 across 13 scenarios), information density, compressed-only quality score, negative compression detection, and summary coherence checks.

- Add ProbeDefinition type and getProbesForScenario() with curated probes
- Add computeInformationDensity(), computeCompressedQualityScore(), detectNegativeCompressions(), checkCoherence() analysis functions
- Add min-output-chars probes to catch over-aggressive compression
- Add lang aliases to countCodeBlocks (typescript/ts, python/py, yaml/yml)
- Fix regression thresholds: coherence/negativeCompressions track increases from baseline, not just zero-to-nonzero transitions
- Add --llm-judge flag with multi-provider support (OpenAI, Anthropic, Gemini, Ollama) for LLM-as-judge scoring (display-only, not in baseline)
- Add Gemini provider to bench/llm.ts (@google/genai SDK)
- Add bench:quality:judge npm script
- Update docs/benchmarks.md with quality metrics, probes, LLM judge, and regression threshold documentation
- Update CLAUDE.md with quality benchmark commands
- Re-save quality baseline with new format
…op/github/codeql-action-4

chore(deps): bump github/codeql-action from 3 to 4
…/dev-deps-10041a4c1d

chore(deps-dev): bump the dev-deps group across 1 directory with 6 updates
…ments

# Conflicts:
#	package-lock.json
- Bump version to 1.3.0
- Add quality history documentation with version comparison
- Add --features flag for opt-in feature benchmarking
- Update CHANGELOG with all 1.3.0 changes
- Save baselines for v1.3.0
- Regenerate benchmark-results.md
- Link quality-history.md from README and docs index
# Conflicts:
#	CHANGELOG.md
#	CLAUDE.md
#	README.md
Re-apply: version bump to 1.3.0, CHANGELOG 1.3.0 section, quality benchmark npm scripts, CLAUDE.md commands, Gemini provider in llm.ts, quality-history link in README and docs index, @google/genai devDep.
Check failures (Code scanning / CodeQL):

Polynomial regular expression used on uncontrolled data (High), flagged on eight diff lines, all regex matches over message content:

- reasoning-chain detection ("3+ numbered steps AND 1+ weak anchor"): `const stepMatches = text.match(NUMBERED_STEP_RE);`
- `bestSentenceScore(text)`: `const sentences = text.match(/[^.!?\n]+[.!?]+/g);`
- entropy score mapping (`rawScores`, `mode: 'replace' | 'augment'`): `const sentences = text.match(/[^.!?\n]+[.!?]+/g) ?? [text.trim()];`
- the compress path's `discourseAware`/`entropyScorer` branches: `const sentences = text.match(/[^.!?\n]+[.!?]+/g) ?? [text.trim()];`
- coreference inlining (finding the sentence where each entity first appears): `const sentences = sourceContent.match(/[^.!?\n]+[.!?]+/g) ?? [sourceContent];`
- `segmentEDUs(text)`: `const sentences = text.match(/[^.!?\n]+[.!?]+/g) ?? [text.trim()];`
- `splitSentences(text)`: `const sentences = text.match(/[^.!?\n]+[.!?]+/g);`
- `extractMessageEntities(content)`, looping over CAMEL_RE, PASCAL_RE, SNAKE_RE, VOWELLESS_RE, and FILE_REF_RE: `const matches = content.match(re);`

Incomplete URL substring sanitization (High, test), in the `it('catches URLs')` test: `expect(entities.some((e) => e.includes('https://example.com/docs'))).toBe(true);`
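Most of the polynomial-regex alerts point at variants of the same sentence-splitting pattern `/[^.!?\n]+[.!?]+/g`. One backtracking-free alternative is a single linear character scan; the rewrite below is illustrative, not the patch the project actually shipped:

```typescript
// Linear-time replacement for /[^.!?\n]+[.!?]+/g: one pass, no
// backtracking. A sentence is a non-empty run of characters ending in a
// run of terminators (. ! ?); newlines cut a fragment without emitting
// it, and fragments that start on a terminator are skipped, matching
// the original regex's requirement of at least one leading
// non-terminator character.
export function splitSentencesLinear(text: string): string[] {
  const sentences: string[] = [];
  const isTerm = (c: string) => c === "." || c === "!" || c === "?";
  let start = 0;
  let i = 0;
  while (i < text.length) {
    if (text[i] === "\n") {
      start = ++i; // newline: drop the unterminated fragment
    } else if (isTerm(text[i])) {
      while (i < text.length && isTerm(text[i])) i++; // consume terminator run
      if (i - start > 0 && !isTerm(text[start])) {
        sentences.push(text.slice(start, i));
      }
      start = i;
    } else {
      i++;
    }
  }
  return sentences;
}
```

Because the scan never revisits a character, worst-case time is O(n) regardless of input, which is exactly what the CodeQL query is asking for.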
Summary
- LLM judge (`--llm-judge`): optional multi-provider evaluation (OpenAI, Anthropic, Gemini, Ollama) — display-only, not in baselines
- Feature benchmarking (`--features`): benchmarks each v2 opt-in feature against baseline to measure impact
- Quality history (`docs/quality-history.md`): version-over-version quality tracking across v1.0.0→v1.3.0 with opt-in feature impact analysis
- Gemini provider via the `@google/genai` SDK

Key quality findings
Test plan
- `npm run build` — compiles
- `npm test` — 663 tests pass
- `npm run lint && npm run format:check` — clean
- `npm run bench:quality` — all 13 scenarios run
- `npm run bench:quality:check` — passes against baseline
- `npm run bench:quality:judge` — LLM judge runs with Gemini/OpenAI/Ollama
- `npm run bench:quality:features` — feature comparison runs
- `npm run bench:save` — main baseline saved