Update README with chunking strategy #98

@paullizer

Description

Chunking Strategy

Simple Chat processes a wide range of document types using a consistent chunking strategy to enable optimal context handling and LLM performance. Below is an overview of how we chunk content depending on the file type:

General Principles

  • Chunk Size: Targeting ~400 words of meaningful content per chunk. Some formats (HTML, Markdown) require larger text windows (up to 1200 words) due to formatting overhead that inflates token count.
  • Minimum Chunk Size: Chunks with fewer than 600 words are merged into the preceding chunk to reduce context fragmentation.
  • Table Handling: When chunking might split tables, we replicate the table header in each chunk to preserve readability.
  • Code Blocks (Markdown): If a code block is split, each piece is re-wrapped in code-fence markers (```) so the block stays intact.
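
The same high-level pattern recurs in several of the plain-text paths below. A minimal sketch, using hypothetical helper names (split_by_words, merge_small_chunks) rather than the repo's actual functions:

```python
import re

def split_by_words(text: str, target: int = 400) -> list[str]:
    # Greedily group whitespace-separated words into ~target-word chunks.
    words = re.findall(r"\S+", text)
    return [" ".join(words[i:i + target]) for i in range(0, len(words), target)]

def merge_small_chunks(chunks: list[str], minimum: int = 600) -> list[str]:
    # Fold any chunk shorter than `minimum` words into the chunk before it,
    # as done for the header-split HTML/Markdown chunks described below.
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(chunk.split()) < minimum:
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    return merged
```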

File-Type Specific Chunking

PDF

  • Sent to Document Intelligence for OCR + layout parsing.
  • If a PDF is larger than 500 MB or longer than 2,000 pages, it is broken into 500-page parts.
  • Each part is sent separately to Document Intelligence.
  • All chunks from each part are saved under the original document in AI Search; parts exist only to work around service limits.
  • Chunked by page.
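
A minimal sketch of the part-splitting step, assuming the pypdf package (the app may use a different PDF library); each part is then sent to Document Intelligence on its own:

```python
from pypdf import PdfReader, PdfWriter

PAGES_PER_PART = 500  # works around the Document Intelligence per-request limit

def split_pdf_into_parts(path: str) -> list[str]:
    reader = PdfReader(path)
    part_paths: list[str] = []
    for part_no, start in enumerate(range(0, len(reader.pages), PAGES_PER_PART), start=1):
        writer = PdfWriter()
        for i in range(start, min(start + PAGES_PER_PART, len(reader.pages))):
            writer.add_page(reader.pages[i])
        part_path = f"{path}.part{part_no}.pdf"
        with open(part_path, "wb") as f:
            writer.write(f)
        part_paths.append(part_path)
    return part_paths
```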

DOCX

  • Sent to Document Intelligence.
  • Chunked by ~400 words, approximating an A4 page.

DOC / DOCM

  • Processed with the Python package docx2txt.
  • Chunked by ~400 words, approximating an A4 page.
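
A minimal sketch of this path, reusing the illustrative split_by_words helper from the General Principles sketch; docx2txt.process is the only call taken from the description above:

```python
import docx2txt

def chunk_legacy_word(path: str) -> list[str]:
    text = docx2txt.process(path)     # extract plain text from the document
    return split_by_words(text, 400)  # ~400 words, approximating an A4 page
```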

PPTX

  • Sent to Document Intelligence.
  • Chunked by slide (page).

Images (.jpg, .jpeg, .png, .bmp, .tiff, .tif, .heif)

  • Sent to Document Intelligence for OCR.
  • One chunk per image.

TXT

  • Processed using regex word splitting.
  • Chunked by 400 words.
  • See process_txt_file.
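
For plain text this reduces to the word splitter from the General Principles sketch; an illustrative usage (the real logic lives in process_txt_file):

```python
with open("notes.txt", encoding="utf-8", errors="ignore") as f:
    chunks = split_by_words(f.read(), target=400)
```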

HTML

  • Uses RecursiveCharacterTextSplitter.
  • Header-based chunking:
    • Initially split by <h1> tags.
    • Chunks >1200 words are recursively split on progressively lower headers (<h2>, <h3>, … <h5>).
  • Tables: If a table spans chunks, ensure headers are repeated per chunk.
  • Minimum chunk size: Merge chunks <600 words into preceding ones.
  • Goal: Maintain 400 words of informational content per chunk, accounting for token inflation from HTML tags.
  • See process_html_file.
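
A rough sketch of the header-based approach, assuming a simple <h1> regex split and a character budget for the recursive fallback (process_html_file is the real implementation; the 8000-character budget standing in for ~1200 words of tag-inflated HTML is an assumption):

```python
import re
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_html(html: str) -> list[str]:
    # First pass: split at <h1> boundaries, keeping each heading with its section.
    sections = re.split(r"(?=<h1\b)", html, flags=re.IGNORECASE)
    # Fallback for oversized sections: recurse through lower heading levels.
    fallback = RecursiveCharacterTextSplitter(
        separators=["<h2", "<h3", "<h4", "<h5", " ", ""],
        chunk_size=8000,
        chunk_overlap=0,
    )
    chunks: list[str] = []
    for section in sections:
        if len(section.split()) > 1200:
            chunks.extend(fallback.split_text(section))
        else:
            chunks.append(section)
    return [c for c in chunks if c.strip()]
```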

Markdown (.md)

  • Uses MarkdownHeaderTextSplitter.
  • Initial split by # headers (h1 → h5).
  • Chunks >1200 words undergo recursive splitting.
  • Table & Code Block Handling:
    • Tables: Re-add headers if split.
    • Code: Re-wrap each piece in code-fence markers (```) if a split occurs.
  • Minimum 600-word chunks enforced.
  • See process_md_file.
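
A minimal sketch of the initial header split (process_md_file is the real implementation; the import path may differ by LangChain version):

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"), ("##", "h2"), ("###", "h3"), ("####", "h4"), ("#####", "h5"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

with open("README.md", encoding="utf-8") as f:
    sections = splitter.split_text(f.read())  # one Document per header section

# Sections over ~1200 words are then split further, table headers re-added,
# and split code blocks re-wrapped in fences, as described above.
```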

JSON

  • Uses RecursiveJsonSplitter, a splitter designed for JSON data structures.
  • Structural splitting:
    • Understands JSON objects, arrays, and nesting
    • max_chunk_size=4000 characters
    • convert_lists=True - converts JSON arrays to index-keyed objects so they can be split like nested objects
  • Maintains validity: Each chunk is valid, parseable JSON.
  • Empty chunk filtering: Skips trivial chunks like {}, [], or empty strings.
  • Goal: Preserve JSON structure while creating semantically meaningful chunks that respect object/array boundaries.
  • See process_json.
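
A minimal sketch with the parameters listed above, assuming a JSON object at the document root (process_json is the real implementation):

```python
import json
from langchain_text_splitters import RecursiveJsonSplitter

def chunk_json(raw: str) -> list[str]:
    data = json.loads(raw)
    splitter = RecursiveJsonSplitter(max_chunk_size=4000)
    # split_text returns JSON strings; convert_lists turns arrays into
    # index-keyed objects so they can be split like nested objects.
    texts = splitter.split_text(json_data=data, convert_lists=True)
    # Drop trivial chunks such as "{}", "[]", or empty strings.
    return [t for t in texts if t.strip() not in ("", "{}", "[]")]
```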

XML

  • Uses RecursiveCharacterTextSplitter with XML-aware separators.
  • Structure-preserving chunking:
    • Separators prioritized: \n\n → \n → > (end of XML tags) → space → character
    • Splits at logical boundaries to maintain tag integrity
  • Chunked by 4000 characters.
  • Goal: Preserve XML structure by splitting at tag boundaries rather than mid-element, ensuring chunks are more semantically meaningful for LLM processing.
  • See process_xml.
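
A minimal sketch of that configuration (process_xml is the real code; chunk_overlap=0 is an assumption):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

xml_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ">", " ", ""],  # blank lines, lines, tag ends, words
    chunk_size=4000,
    chunk_overlap=0,
)

with open("data.xml", encoding="utf-8") as f:
    xml_chunks = xml_splitter.split_text(f.read())
```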

YAML / YML

  • Uses RecursiveCharacterTextSplitter with YAML-aware separators.
  • Structure-preserving chunking:
    • Separators prioritized: \n\n → \n → "- " (YAML list items) → space → character
    • Splits at logical boundaries to maintain YAML structure
  • Chunked by 4000 characters.
  • Goal: Preserve YAML hierarchy and list structures by splitting at section boundaries and list items rather than mid-key or mid-value.
  • See process_yaml.
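
The YAML path follows the same pattern as XML; only the separator list changes (process_yaml is the real code):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

yaml_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", "- ", " ", ""],  # sections, lines, list items, words
    chunk_size=4000,
    chunk_overlap=0,
)
```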

LOG

  • Processed using line-based chunking to maintain log record integrity.
  • Never splits mid-line to preserve complete log entries.
  • Line-Level Chunking:
    1. Split file by lines using splitlines(keepends=True) to preserve line endings.
    2. Accumulate complete lines until the running total reaches the target word count (≈1000 words).
    3. When adding the next line would exceed the target AND the chunk already has content:
      • Finalize current chunk
      • Start new chunk with current line
    4. If a single line exceeds the target, it gets its own chunk to prevent infinite loops.
    5. Emit chunks with complete log records.
  • Goal: Provide substantial log context (1000 words) while ensuring no log entry is split across chunks.
  • See process_log.
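
A minimal sketch of the algorithm above (process_log is the real implementation):

```python
def chunk_log(text: str, target_words: int = 1000) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for line in text.splitlines(keepends=True):  # keep original line endings
        line_words = len(line.split())
        # Finalize before a line that would push past the target, but only if the
        # chunk already has content; an oversized single line still gets its own chunk.
        if current and current_words + line_words > target_words:
            chunks.append("".join(current))
            current, current_words = [], 0
        current.append(line)
        current_words += line_words
    if current:
        chunks.append("".join(current))
    return chunks
```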

Tabular (CSV, XLSX, XLS, XLSM)

  • CSV:
    • Parsed with pandas.
    • Chunked by rows into ~800-character chunks (one or more full rows per chunk).
    • Header row prepended to each chunk.
  • XLSX / XLS:
    • Each worksheet is treated as a separate file:
      • Filename format: filename-tabname.ext
    • Chunked by rows (~800 characters).
    • Header rows are preserved.
    • Excel formulas are evaluated to return only the computed value.
  • See process_tabular_file.
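
A minimal sketch of the CSV path (process_tabular_file is the real implementation and also handles the per-worksheet Excel cases); the comma re-joining here is a simplification:

```python
import pandas as pd

def chunk_csv(path: str, max_chars: int = 800) -> list[str]:
    df = pd.read_csv(path)
    header = ",".join(map(str, df.columns)) + "\n"
    chunks: list[str] = []
    current = header                      # every chunk starts with the header row
    for _, row in df.iterrows():
        line = ",".join(map(str, row.tolist())) + "\n"
        if len(current) + len(line) > max_chars and current != header:
            chunks.append(current)
            current = header
        current += line
    if current != header:
        chunks.append(current)
    return chunks
```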

Video Files

  • Transcript Extraction:
    • Use Azure Video Indexer to generate a full transcript with timestamps and, if needed, confidence scores.
    • Retrieve the index JSON via the Get Video Index API; inspect the insights.transcript array for line segments whose instances carry start/end timestamps.
  • Line-Level Chunking (preferred for simplicity):
    1. Iterate insights.transcript, splitting each segment’s text into words.
    2. Accumulate segments until the aggregate covers ≈30 seconds.
    3. Emit a chunk with:
      • startTime = first segment’s instances[0].start.
      • text = concatenation of segment texts.
    4. Reset the accumulator and continue.
    5. See process_video_file; a sketch follows below.
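
A rough sketch of that grouping, taking the insights.transcript array as input; the field shapes (text, instances[0].start/end) follow the description above, and timestamp parsing is simplified:

```python
def chunk_video_transcript(transcript: list[dict], window_seconds: float = 30.0) -> list[dict]:
    def to_seconds(ts: str) -> float:
        # Video Indexer timestamps look like "0:00:12.34" (simplified parsing).
        h, m, s = ts.split(":")
        return int(h) * 3600 + int(m) * 60 + float(s)

    chunks: list[dict] = []
    current_texts: list[str] = []
    chunk_start = None
    for seg in transcript:
        instance = seg["instances"][0]
        if chunk_start is None:
            chunk_start = instance["start"]
        current_texts.append(seg["text"])
        if to_seconds(instance["end"]) - to_seconds(chunk_start) >= window_seconds:
            chunks.append({"startTime": chunk_start, "text": " ".join(current_texts)})
            current_texts, chunk_start = [], None
    if current_texts:
        chunks.append({"startTime": chunk_start, "text": " ".join(current_texts)})
    return chunks
```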

Audio Files

  • Transcript Extraction (Azure Speech Services):
    • Use the Azure Speech Services Speech-to-Text REST API to generate a full transcript for audio files, including timestamps and confidence scores for each word or phrase.
    • Submit the audio file to the Speech Services endpoint and retrieve the resulting transcription JSON, which provides a phrase-level transcript ("recognizedPhrases" with "offset"/"duration").
  • Line-Level Chunking:
    1. Parse the transcription JSON to obtain an array of phrase segments, each containing text, offset (start time), and duration.
    2. Split each segment’s text into words.
    3. Accumulate segments until the total word count ≈400.
    4. Emit a chunk with:
      • startTime = the first segment’s offset formatted as HH:MM:SS.sss.
      • text = concatenation of segment texts.
    5. Reset count and continue until all segments are processed.
    6. See process_audio_file.
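
A rough sketch of that grouping, assuming the phrase segments have already been parsed into text plus offset-in-seconds pairs (the key names below are illustrative; process_audio_file is the real implementation):

```python
def format_timestamp(seconds: float) -> str:
    # Render seconds as HH:MM:SS.sss for the chunk's startTime.
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def chunk_audio_transcript(segments: list[dict], target_words: int = 400) -> list[dict]:
    # segments: [{"text": "...", "offset_seconds": 12.3}, ...] parsed from the
    # transcription JSON.
    chunks: list[dict] = []
    current_texts: list[str] = []
    word_count, start = 0, None
    for seg in segments:
        if start is None:
            start = seg["offset_seconds"]
        current_texts.append(seg["text"])
        word_count += len(seg["text"].split())
        if word_count >= target_words:
            chunks.append({"startTime": format_timestamp(start), "text": " ".join(current_texts)})
            current_texts, word_count, start = [], 0, None
    if current_texts:
        chunks.append({"startTime": format_timestamp(start), "text": " ".join(current_texts)})
    return chunks
```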
