Update README with chunking strategy #98

@paullizer

Description

Chunking Strategy

Simple Chat processes a wide range of document types using a consistent chunking strategy to enable optimal context handling and LLM performance. Below is an overview of how we chunk content depending on the file type:

General Principles

  • Chunk Size: Targeting ~400 words of meaningful content per chunk. Some formats (HTML, Markdown) require larger text windows (up to 1200 words) due to formatting overhead that inflates token count.
  • Minimum Chunk Size: Chunks with fewer than 600 words are merged into the preceding chunk to reduce context fragmentation.
  • Table Handling: When chunking might split tables, we replicate the table header in each chunk to preserve readability.
  • Code Blocks (Markdown): If a code block is split, each piece is re-wrapped in code-fence markers (```) so the block stays intact.
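
The same high-level pattern recurs in several of the plain-text paths below. A minimal sketch, using hypothetical helper names (split_by_words, merge_small_chunks) rather than the repo's actual functions:

```python
import re

def split_by_words(text: str, target: int = 400) -> list[str]:
    # Greedily group whitespace-separated words into ~target-word chunks.
    words = re.findall(r"\S+", text)
    return [" ".join(words[i:i + target]) for i in range(0, len(words), target)]

def merge_small_chunks(chunks: list[str], minimum: int = 600) -> list[str]:
    # Fold any chunk shorter than `minimum` words into the chunk before it,
    # as done for the header-split HTML/Markdown chunks described below.
    merged: list[str] = []
    for chunk in chunks:
        if merged and len(chunk.split()) < minimum:
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    return merged
```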

File-Type Specific Chunking

PDF

  • Sent to Document Intelligence for OCR + layout parsing.
  • If a PDF is larger than 500 MB or longer than 2,000 pages, it is broken into 500-page parts.
  • Each part is sent separately to Document Intelligence.
  • All chunks from each part are saved under the original document in AI Search; parts exist only to work around service limits.
  • Chunked by page.
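
A minimal sketch of the part-splitting step, assuming the pypdf package (the app may use a different PDF library); each part is then sent to Document Intelligence on its own:

```python
from pypdf import PdfReader, PdfWriter

PAGES_PER_PART = 500  # works around the Document Intelligence per-request limit

def split_pdf_into_parts(path: str) -> list[str]:
    reader = PdfReader(path)
    part_paths: list[str] = []
    for part_no, start in enumerate(range(0, len(reader.pages), PAGES_PER_PART), start=1):
        writer = PdfWriter()
        for i in range(start, min(start + PAGES_PER_PART, len(reader.pages))):
            writer.add_page(reader.pages[i])
        part_path = f"{path}.part{part_no}.pdf"
        with open(part_path, "wb") as f:
            writer.write(f)
        part_paths.append(part_path)
    return part_paths
```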

DOCX

  • Sent to Document Intelligence.
  • Chunked by ~400 words, approximating an A4 page.

DOC / DOCM

  • Processed with the Python package docx2txt.
  • Chunked by ~400 words, approximating an A4 page.
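
A minimal sketch of this path, reusing the illustrative split_by_words helper from the General Principles sketch; docx2txt.process is the only call taken from the description above:

```python
import docx2txt

def chunk_legacy_word(path: str) -> list[str]:
    text = docx2txt.process(path)     # extract plain text from the document
    return split_by_words(text, 400)  # ~400 words, approximating an A4 page
```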

PPTX

  • Sent to Document Intelligence.
  • Chunked by slide (page).

Images (.jpg, .jpeg, .png, .bmp, .tiff, .tif, .heif)

  • Sent to Document Intelligence for OCR.
  • One chunk per image.

TXT

  • Processed using regex word splitting.
  • Chunked by 400 words.
  • See process_txt_file.
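
For plain text this reduces to the word splitter from the General Principles sketch; an illustrative usage (the real logic lives in process_txt_file):

```python
with open("notes.txt", encoding="utf-8", errors="ignore") as f:
    chunks = split_by_words(f.read(), target=400)
```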

HTML

  • Uses RecursiveCharacterTextSplitter.
  • Header-based chunking:
    • Initially split by <h1> tags.
    • Chunks >1200 words are recursively split on progressively lower headers (<h2>, <h3>, … <h5>).
  • Tables: If a table spans chunks, ensure headers are repeated per chunk.
  • Minimum chunk size: Merge chunks <600 words into preceding ones.
  • Goal: Maintain 400 words of informational content per chunk, accounting for token inflation from HTML tags.
  • See process_html_file.
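
A rough sketch of the header-based approach, assuming a simple <h1> regex split and a character budget for the recursive fallback (process_html_file is the real implementation; the 8000-character budget standing in for ~1200 words of tag-inflated HTML is an assumption):

```python
import re
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_html(html: str) -> list[str]:
    # First pass: split at <h1> boundaries, keeping each heading with its section.
    sections = re.split(r"(?=<h1\b)", html, flags=re.IGNORECASE)
    # Fallback for oversized sections: recurse through lower heading levels.
    fallback = RecursiveCharacterTextSplitter(
        separators=["<h2", "<h3", "<h4", "<h5", " ", ""],
        chunk_size=8000,
        chunk_overlap=0,
    )
    chunks: list[str] = []
    for section in sections:
        if len(section.split()) > 1200:
            chunks.extend(fallback.split_text(section))
        else:
            chunks.append(section)
    return [c for c in chunks if c.strip()]
```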

Markdown (.md)

  • Uses MarkdownHeaderTextSplitter.
  • Initial split by # headers (h1 → h5).
  • Chunks >1200 words undergo recursive splitting.
  • Table & Code Block Handling:
    • Tables: Re-add headers if split.
    • Code: Re-wrap each piece in code-fence markers (```) if a split occurs.
  • Minimum 600-word chunks enforced.
  • See process_md_file.
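
A minimal sketch of the initial header split (process_md_file is the real implementation; the import path may differ by LangChain version):

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"), ("##", "h2"), ("###", "h3"), ("####", "h4"), ("#####", "h5"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

with open("README.md", encoding="utf-8") as f:
    sections = splitter.split_text(f.read())  # one Document per header section

# Sections over ~1200 words are then split further, table headers re-added,
# and split code blocks re-wrapped in fences, as described above.
```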

JSON

  • Uses RecursiveJsonSplitter, a splitter designed for JSON data structures.
  • Structural splitting:
    • Understands JSON objects, arrays, and nesting
    • max_chunk_size=4000 characters
    • convert_lists=True - converts JSON arrays to index-keyed objects so they can be split like nested objects
  • Maintains validity: Each chunk is valid, parseable JSON.
  • Empty chunk filtering: Skips trivial chunks like {}, [], or empty strings.
  • Goal: Preserve JSON structure while creating semantically meaningful chunks that respect object/array boundaries.
  • See process_json.
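
A minimal sketch with the parameters listed above, assuming a JSON object at the document root (process_json is the real implementation):

```python
import json
from langchain_text_splitters import RecursiveJsonSplitter

def chunk_json(raw: str) -> list[str]:
    data = json.loads(raw)
    splitter = RecursiveJsonSplitter(max_chunk_size=4000)
    # split_text returns JSON strings; convert_lists turns arrays into
    # index-keyed objects so they can be split like nested objects.
    texts = splitter.split_text(json_data=data, convert_lists=True)
    # Drop trivial chunks such as "{}", "[]", or empty strings.
    return [t for t in texts if t.strip() not in ("", "{}", "[]")]
```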

XML

  • Uses RecursiveCharacterTextSplitter with XML-aware separators.
  • Structure-preserving chunking:
    • Separators prioritized: \n\n → \n → > (end of XML tags) → space → character
    • Splits at logical boundaries to maintain tag integrity
  • Chunked by 4000 characters.
  • Goal: Preserve XML structure by splitting at tag boundaries rather than mid-element, ensuring chunks are more semantically meaningful for LLM processing.
  • See process_xml.
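
A minimal sketch of that configuration (process_xml is the real code; chunk_overlap=0 is an assumption):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

xml_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ">", " ", ""],  # blank lines, lines, tag ends, words
    chunk_size=4000,
    chunk_overlap=0,
)

with open("data.xml", encoding="utf-8") as f:
    xml_chunks = xml_splitter.split_text(f.read())
```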

YAML / YML

  • Uses RecursiveCharacterTextSplitter with YAML-aware separators.
  • Structure-preserving chunking:
    • Separators prioritized: \n\n → \n → "- " (YAML list items) → space → character
    • Splits at logical boundaries to maintain YAML structure
  • Chunked by 4000 characters.
  • Goal: Preserve YAML hierarchy and list structures by splitting at section boundaries and list items rather than mid-key or mid-value.
  • See process_yaml.
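
The YAML path follows the same pattern as XML; only the separator list changes (process_yaml is the real code):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

yaml_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", "- ", " ", ""],  # sections, lines, list items, words
    chunk_size=4000,
    chunk_overlap=0,
)
```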

LOG

  • Processed using line-based chunking to maintain log record integrity.
  • Never splits mid-line to preserve complete log entries.
  • Line-Level Chunking:
    1. Split file by lines using splitlines(keepends=True) to preserve line endings.
    2. Accumulate complete lines until the running total reaches the target word count (≈1000 words).
    3. When adding the next line would exceed the target AND the chunk already has content:
      • Finalize current chunk
      • Start new chunk with current line
    4. If a single line exceeds the target, it gets its own chunk to prevent infinite loops.
    5. Emit chunks with complete log records.
  • Goal: Provide substantial log context (1000 words) while ensuring no log entry is split across chunks.
  • See process_log.
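
A minimal sketch of the algorithm above (process_log is the real implementation):

```python
def chunk_log(text: str, target_words: int = 1000) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for line in text.splitlines(keepends=True):  # keep original line endings
        line_words = len(line.split())
        # Finalize before a line that would push past the target, but only if the
        # chunk already has content; an oversized single line still gets its own chunk.
        if current and current_words + line_words > target_words:
            chunks.append("".join(current))
            current, current_words = [], 0
        current.append(line)
        current_words += line_words
    if current:
        chunks.append("".join(current))
    return chunks
```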

Tabular (CSV, XLSX, XLS, XLSM)

  • CSV:
    • Parsed with pandas.
    • Chunked by rows into ~800-character chunks (one or more full rows per chunk).
    • Header row prepended to each chunk.
  • XLSX / XLS:
    • Each worksheet is treated as a separate file:
      • Filename format: filename-tabname.ext
    • Chunked by rows (~800 characters).
    • Header rows are preserved.
    • Excel formulas are evaluated to return only the computed value.
  • See process_tabular_file.
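
A minimal sketch of the CSV path (process_tabular_file is the real implementation and also handles the per-worksheet Excel cases); the comma re-joining here is a simplification:

```python
import pandas as pd

def chunk_csv(path: str, max_chars: int = 800) -> list[str]:
    df = pd.read_csv(path)
    header = ",".join(map(str, df.columns)) + "\n"
    chunks: list[str] = []
    current = header                      # every chunk starts with the header row
    for _, row in df.iterrows():
        line = ",".join(map(str, row.tolist())) + "\n"
        if len(current) + len(line) > max_chars and current != header:
            chunks.append(current)
            current = header
        current += line
    if current != header:
        chunks.append(current)
    return chunks
```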

Video Files

  • Transcript Extraction:
    • Use Azure Video Indexer to generate a full transcript with timestamps and, if needed, confidence scores.
    • Retrieve the index JSON via the Get Video Index API; inspect the insights.transcript array for line segments whose instances carry start/end timestamps.
  • Line-Level Chunking (preferred for simplicity):
    1. Iterate insights.transcript, splitting each segment’s text into words.
    2. Accumulate segments until the aggregate covers ≈30 seconds.
    3. Emit a chunk with:
      • startTime = first segment’s instances[0].start.
      • text = concatenation of segment texts.
    4. Reset the accumulator and continue.
    5. See process_video_file; a sketch follows below.
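
A rough sketch of that grouping, taking the insights.transcript array as input; the field shapes (text, instances[0].start/end) follow the description above, and timestamp parsing is simplified:

```python
def chunk_video_transcript(transcript: list[dict], window_seconds: float = 30.0) -> list[dict]:
    def to_seconds(ts: str) -> float:
        # Video Indexer timestamps look like "0:00:12.34" (simplified parsing).
        h, m, s = ts.split(":")
        return int(h) * 3600 + int(m) * 60 + float(s)

    chunks: list[dict] = []
    current_texts: list[str] = []
    chunk_start = None
    for seg in transcript:
        instance = seg["instances"][0]
        if chunk_start is None:
            chunk_start = instance["start"]
        current_texts.append(seg["text"])
        if to_seconds(instance["end"]) - to_seconds(chunk_start) >= window_seconds:
            chunks.append({"startTime": chunk_start, "text": " ".join(current_texts)})
            current_texts, chunk_start = [], None
    if current_texts:
        chunks.append({"startTime": chunk_start, "text": " ".join(current_texts)})
    return chunks
```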

Audio Files

  • Transcript Extraction (Azure Speech Services):
    • Use the Azure Speech Services Speech-to-Text REST API to generate a full transcript for audio files, including timestamps and confidence scores for each word or phrase.
    • Submit the audio file to the Speech Services endpoint and retrieve the resulting transcription JSON, which provides a phrase-level transcript ("recognizedPhrases" with "offset"/"duration").
  • Line-Level Chunking:
    1. Parse the transcription JSON to obtain an array of phrase segments, each containing text, offset (start time), and duration.
    2. Split each segment’s text into words.
    3. Accumulate segments until the total word count ≈400.
    4. Emit a chunk with:
      • startTime = the first segment’s offset formatted as HH:MM:SS.sss.
      • text = concatenation of segment texts.
    5. Reset count and continue until all segments are processed.
    6. See process_audio_file.
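
A rough sketch of that grouping, assuming the phrase segments have already been parsed into text plus offset-in-seconds pairs (the key names below are illustrative; process_audio_file is the real implementation):

```python
def format_timestamp(seconds: float) -> str:
    # Render seconds as HH:MM:SS.sss for the chunk's startTime.
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

def chunk_audio_transcript(segments: list[dict], target_words: int = 400) -> list[dict]:
    # segments: [{"text": "...", "offset_seconds": 12.3}, ...] parsed from the
    # transcription JSON.
    chunks: list[dict] = []
    current_texts: list[str] = []
    word_count, start = 0, None
    for seg in segments:
        if start is None:
            start = seg["offset_seconds"]
        current_texts.append(seg["text"])
        word_count += len(seg["text"].split())
        if word_count >= target_words:
            chunks.append({"startTime": format_timestamp(start), "text": " ".join(current_texts)})
            current_texts, word_count, start = [], 0, None
    if current_texts:
        chunks.append({"startTime": format_timestamp(start), "text": " ".join(current_texts)})
    return chunks
```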
