
feat: token-based client anonymization layer (MRPC 1.6 / SBM AI compliance)#475

Draft
Copilot wants to merge 4 commits into main from copilot/implement-client-anonymization-strategy

Conversation


Copilot AI commented Mar 8, 2026

Client PII, PHI, and legal strategy currently reach cloud AI providers in the clear. This adds a local intercept layer that tokenizes sensitive entities before transmission and reverses the mapping after the response—so providers never see underlying identities.

ClientAnonymizationService (Cyrano/src/services/client-anonymization.ts)

  • anonymize(text, sessionId?) — extracts entities via regex (person, org, location, date, money, email, phone, SSN, account numbers, statutes, case citations) and replaces them with deterministic session-scoped tokens (PERSON_1, COMPANY_2, DATE_3, etc.)
  • deanonymize(text, sessionId) — reverses substitution locally using the in-memory token map; the mapping never leaves the process
  • assessRiskCategory(text) — classifies content per MRPC 1.6 / SBM AI guidance:
    • Cat 1 – generic/safe
    • Cat 2 – strategy-rich (settlement, litigation, privilege, negotiation keywords)
    • Cat 3 – hard PII present (SSN, account number, DOB phrases)
  • Session management — in-memory, 4-hour TTL with automatic eviction; destroySession() for explicit cleanup; multi-turn support via shared session ID
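The token-map mechanics described above can be sketched as follows — an illustrative TypeScript reduction, not the actual Cyrano implementation. Only a single entity pattern (email) is shown for brevity, and all names (`TokenMap`, `sessions`) are assumptions:

```typescript
// Session-scoped token map: original value -> deterministic token.
// The mapping lives only in process memory and never leaves it.
type TokenMap = Map<string, string>;

const sessions = new Map<string, TokenMap>();

function anonymize(text: string, sessionId: string): string {
  const map = sessions.get(sessionId) ?? new Map<string, string>();
  sessions.set(sessionId, map);
  let counter = map.size;
  // One pattern shown for brevity; the real service layers patterns for
  // person, org, location, date, money, phone, SSN, statutes, etc.
  const emailRe = /[\w.+-]+@[\w-]+\.[\w.]+/g;
  return text.replace(emailRe, (match) => {
    // Deterministic within a session: the same entity always maps
    // to the same token, enabling coherent multi-turn conversations.
    if (!map.has(match)) map.set(match, `EMAIL_${++counter}`);
    return map.get(match)!;
  });
}

function deanonymize(text: string, sessionId: string): string {
  const map = sessions.get(sessionId);
  if (!map) return text;
  let out = text;
  // Reverse every substitution locally; no network involved.
  for (const [original, token] of map) {
    out = out.split(token).join(original);
  }
  return out;
}
```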

AIService.call() integration (Cyrano/src/services/ai-service.ts)

New fields on AICallOptions:

// opt-in per call
await aiService.call('anthropic', prompt, {
  anonymize: true,
  anonymizationSessionId: existingSessionId, // optional, for multi-turn
});
  • Category 3 prompts throw before any network I/O — identifiable PII is hard-blocked from all cloud providers
  • Anonymization/de-anonymization is transparent to callers; the returned string always contains the original values
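A minimal standalone sketch of the pre-network hard block. The `mockCall` helper and its error message are assumptions; the PR only specifies that Category 3 prompts throw before any network I/O:

```typescript
// Stand-in for AIService.call(): an SSN pattern triggers the Category 3
// hard block before any provider request would be made. The real checks
// are broader (account numbers, DOB phrases, etc.).
const SSN_RE = /\b\d{3}-\d{2}-\d{4}\b/;

async function mockCall(
  prompt: string,
  opts: { anonymize?: boolean },
): Promise<string> {
  if (opts.anonymize && SSN_RE.test(prompt)) {
    // Thrown before any network I/O: no bytes leave the process.
    throw new Error('Category 3 content blocked from cloud providers');
  }
  return `response to: ${prompt}`; // stand-in for the provider call
}
```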

Tests (Cyrano/tests/services/client-anonymization.test.ts)

26 unit tests covering entity replacement, within-session determinism, round-trip fidelity, risk category classification, session lifecycle, and TTL eviction.

Original prompt

This section details the original issue you should resolve

<issue_title>Client Confidentiality, Strategy, and Anonymization</issue_title>
<issue_description>Are we taking sufficient steps to safeguard client information, PII, PHI, and legal strategy? Consider the following questions and recommendations:

What Counts as Defensible Anonymization for the State Bar of Michigan

From a cautious, SBM‑compliant angle, you want something closer to HIPAA‑style de‑identification than informal anonymization:

HIPAA’s expert‑determination standard requires that an expert conclude there is a “very small risk” that the information could be used, alone or with reasonably available data, to identify an individual.

Applied to legal facts, that means systematically stripping or tokenizing: names, identifiers, unique dates, specific dollar figures, highly specific locations, rare job titles, and any combination that makes the fact pattern obviously about a particular client/matter.

Best practice that’s emerging: a technical anonymization layer that intercepts text before it hits the AI, replaces all sensitive entities with deterministic tokens (e.g., CLIENT_A, COMPANY_X, DATE_1), and then reverses the mapping locally after the AI responds. In that workflow, the provider never sees the underlying identities at all.
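The three-step intercept workflow described above might look like this end to end, with a stubbed provider. Token names mirror the examples in the text (COMPANY_X style); the helper name and the single company pattern are illustrative assumptions:

```typescript
// 1. Intercept and tokenize, 2. send only tokens, 3. reverse locally.
async function askAnonymized(
  text: string,
  provider: (prompt: string) => Promise<string>,
): Promise<string> {
  const map = new Map<string, string>(); // token -> original, kept local
  let i = 0;
  // Step 1: replace sensitive entities with deterministic tokens
  // (one company-name pattern shown for brevity).
  const tokenized = text.replace(/\b[A-Z][a-z]+ (?:Inc|LLC|Corp)\b/g, (m) => {
    const token = `COMPANY_${String.fromCharCode(65 + i++)}`; // COMPANY_A, ...
    map.set(token, m);
    return token;
  });
  // Step 2: only the tokenized text ever reaches the provider.
  const reply = await provider(tokenized);
  // Step 3: reverse the mapping locally after the response.
  let restored = reply;
  for (const [token, original] of map) {
    restored = restored.split(token).join(original);
  }
  return restored;
}
```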

Even with strong anonymization, what you still must check:

Even if you’re comfortable that the text is non‑identifiable:

Are you still exposing confidential legal strategy or work product?

Privilege and MRPC 1.6 cover more than just “who the client is.” Sharing unique litigation strategy, settlement posture, or internal risk assessments with a third‑party AI can still be a confidentiality issue, even if the person is anonymized.

What do the provider’s terms say?

Michigan’s AI FAQ stresses that many AI tools “utilize the information entered to learn,” so inputs may be stored and regurgitated. If the provider keeps and reuses your “anonymized” data, that can still prejudice clients or expose work product, even if they can’t easily attach a name.

Could a subpoena or breach hurt your client anyway?

Ethics and practice‑management guidance points out that if opposing counsel could subpoena the AI provider for “all prompts related to industry X mergers in 2025,” you might still be uncomfortable, even without explicit names, because your strategies or deal structures could be reconstructed.

So anonymization is necessary but not sufficient; you still need a risk assessment of what you’re actually revealing and under what contractual/security regime.

Practical, conservative rule set

If you want a clean, defensible line for your own practice:

Category 1 – Truly anonymized & generic:

• Matter is converted via a robust token‑based system or equivalent, operated locally.
• No combination of facts would reasonably allow identification of the client or matter by an outsider.
• You are not transmitting unique strategy or internal mental impressions, just generic drafting/structuring questions.
→ Using mainstream AI tools here is relatively low risk, though you should still prefer providers with no‑training and strong security terms.

Category 2 – Anonymized but fact‑specific or strategy‑rich:

• You’ve stripped names, but the fact pattern is unusual, high‑profile, or includes distinctive dollar amounts/timing; or you’re discussing concrete litigation/negotiation strategy.
→ Treat as still confidential; only use if (a) provider is on an enterprise, no‑training, contractually locked‑down tier, and (b) you’d be comfortable explaining the use to the client and to a judge.

Category 3 – Identifiable or sensitive by context:

• The story is obviously about a particular person or organization (local public figure, single major employer in a small town, etc.), or involves particularly sensitive categories (health, crime, immigration, harassment with rare facts).
→ Do not send to general‑purpose cloud AI at all; use local/self‑hosted models or specialized legal AI tools with SOC2‑grade assurances and explicit no‑retention/no‑training language.
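The three-category triage above could be approximated in code like this. The keyword lists and patterns are illustrative placeholders, not the service's actual rules:

```typescript
type RiskCategory = 1 | 2 | 3;

// Conservative ordering: check the most restrictive category first.
function assessRiskCategory(text: string): RiskCategory {
  // Category 3: hard identifiers present (SSN shown; real checks
  // would also cover account numbers, DOB phrases, and context clues).
  if (/\b\d{3}-\d{2}-\d{4}\b/.test(text)) return 3;
  // Category 2: strategy-rich language, confidential even with names stripped.
  if (/\b(settlement|litigation|privilege|negotiation)\b/i.test(text)) return 2;
  // Category 1: generic drafting/structuring questions.
  return 1;
}
```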

How are we currently “anonymizing” and “de-anonymizing” client information for legal research and writing, client communications, etc. (manual edits, search/replace, a script, or something like a proxy layer)? Whatever it is, it should be able to pass a HIPAA‑style “very small risk” test and align with the recent privilege cases, even though users are unlikely to be covered entities under HIPAA.</issue_description>

Comments on the Issue (you are @copilot in this section)



Copilot AI and others added 2 commits March 8, 2026 23:03
Co-authored-by: MightyPrytanis <219587333+MightyPrytanis@users.noreply.github.com>
Co-authored-by: MightyPrytanis <219587333+MightyPrytanis@users.noreply.github.com>
Copilot AI changed the title [WIP] Update client information anonymization procedures feat: token-based client anonymization layer (MRPC 1.6 / SBM AI compliance) Mar 8, 2026


Development

Successfully merging this pull request may close these issues.

Client Confidentiality, Strategy, and Anonymization
