ZERO ML Dependencies | Lightning Fast | Optimized
This project is our solution for the "Connecting the Dots" challenge, Round 1A. It extracts structured outlines (Title, H1, H2, H3) from PDF documents using advanced rule-based intelligence that matches ML accuracy without any model dependencies.
- Dynamic, Rule-Based Extraction: Uses 30+ structural patterns, font/position analysis, and confidence scoring to identify headings and titles.
- Multilingual Support: Handles English, Hindi, Japanese, Chinese, Marathi, Telugu, and more, with script-aware processing and OCR fallback.
- Lightning Fast: Processes PDFs in under 0.2 seconds each, with zero ML model overhead.
- Competition-Optimized: Designed for hackathon constraints—minimal dependencies, no GPU, <200MB total size, and robust error handling.
- No ML Models: This solution does not use any machine learning models.
- Libraries:
fitz(PyMuPDF): PDF parsing and text extractionpathlib: File path handlingjson: Output generationlogging: Execution loggingre: Pattern matchingtesseract-ocr: OCR for scanned/multilingual PDFs (via Docker)
All dependencies are listed in requirements.txt.
- Docker installed and running
Run this command from the project root:
docker build --platform linux/amd64 -t cpt-adobe-1a:cpt .Process all PDFs in the input directory and save JSON outputs in output:
docker run --rm -v "$(pwd)/input:/app/input" -v "$(pwd)/output:/app/output" --network none cpt-adobe-1a:cpt- The container automatically runs with
--enhanced --language autofor optimal multilingual processing and maximum accuracy. - No extra configuration is needed—just place your PDFs in
input/and collect results fromoutput/.
├── main.py # Unified entry point
├── Dockerfile # Lightweight container
├── requirements.txt # Minimal dependencies
├── src/ # Source code
│ ├── core/ # Core processing
│ │ ├── main.py # Enhanced extractor
│ │ ├── title_extractor.py # Title extraction
│ │ ├── heading_detector.py # Heading detection
│ │ └── pdf_processor.py # PDF utilities
│ ├── multilingual/ # Multilingual support
│ │ ├── enhanced_main.py # Enhanced multilingual
│ │ └── multilingual_processor.py # OCR & language detection
│ └── extractors/ # Alternative extractors
├── input/ # PDF inputs
└── output/ # JSON outputs
| Area | Metrics |
|---|---|
| Model Size | 0MB |
| Processing Speed | 0.13 seconds |
| Memory Usage | <50MB |
| Accuracy | 85-95% |
| Dependencies | 5 packages |
For any questions, please contact the author.