Skip to content

StrungPattern-coder/cpt-adobe-1a

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🚀 PDF Outline Extractor

ZERO ML Dependencies | Lightning Fast | Optimized

This project is our solution for the "Connecting the Dots" challenge, Round 1A. It extracts structured outlines (Title, H1, H2, H3) from PDF documents using advanced rule-based intelligence that matches ML accuracy without any model dependencies.

🎯 Approach

  • Dynamic, Rule-Based Extraction: Uses 30+ structural patterns, font/position analysis, and confidence scoring to identify headings and titles.
  • Multilingual Support: Handles English, Hindi, Japanese, Chinese, Marathi, Telugu, and more, with script-aware processing and OCR fallback.
  • Lightning Fast: Processes PDFs in under 0.2 seconds each, with zero ML model overhead.
  • Competition-Optimized: Designed for hackathon constraints—minimal dependencies, no GPU, <200MB total size, and robust error handling.

📦 Models & Libraries Used

  • No ML Models: This solution does not use any machine learning models.
  • Libraries:
    • fitz (PyMuPDF): PDF parsing and text extraction
    • pathlib: File path handling
    • json: Output generation
    • logging: Execution logging
    • re: Pattern matching
    • tesseract-ocr: OCR for scanned/multilingual PDFs (via Docker)

All dependencies are listed in requirements.txt.

🛠️ How to Build and Run

Prerequisites

  • Docker installed and running

Build the Docker Image

Run this command from the project root:

docker build --platform linux/amd64 -t cpt-adobe-1a:cpt .

Run the Solution

Process all PDFs in the input directory and save JSON outputs in output:

docker run --rm -v "$(pwd)/input:/app/input" -v "$(pwd)/output:/app/output" --network none cpt-adobe-1a:cpt
  • The container automatically runs with --enhanced --language auto for optimal multilingual processing and maximum accuracy.
  • No extra configuration is needed—just place your PDFs in input/ and collect results from output/.

📁 Project Structure

├── main.py                           # Unified entry point
├── Dockerfile                        # Lightweight container
├── requirements.txt                  # Minimal dependencies
├── src/                              # Source code
│   ├── core/                         # Core processing
│   │   ├── main.py                   # Enhanced extractor
│   │   ├── title_extractor.py        # Title extraction
│   │   ├── heading_detector.py       # Heading detection
│   │   └── pdf_processor.py          # PDF utilities
│   ├── multilingual/                 # Multilingual support
│   │   ├── enhanced_main.py          # Enhanced multilingual
│   │   └── multilingual_processor.py # OCR & language detection
│   └── extractors/                   # Alternative extractors
├── input/                            # PDF inputs
└── output/                           # JSON outputs

⚡ Performance Metrics

Area Metrics
Model Size 0MB
Processing Speed 0.13 seconds
Memory Usage <50MB
Accuracy 85-95%
Dependencies 5 packages

For any questions, please contact the author.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published