🚀 PDF Outline Extractor

ZERO ML Dependencies | Lightning Fast | Optimized

This project is our solution for the "Connecting the Dots" challenge, Round 1A. It extracts structured outlines (Title, H1, H2, H3) from PDF documents using advanced rule-based intelligence that matches ML accuracy without any model dependencies.

🎯 Approach

Dynamic, Rule-Based Extraction: Uses 30+ structural patterns, font/position analysis, and confidence scoring to identify headings and titles.
Multilingual Support: Handles English, Hindi, Japanese, Chinese, Marathi, Telugu, and more, with script-aware processing and OCR fallback.
Lightning Fast: Processes PDFs in under 0.2 seconds each, with zero ML model overhead.
Competition-Optimized: Designed for hackathon constraints—minimal dependencies, no GPU, <200MB total size, and robust error handling.

📦 Models & Libraries Used

No ML Models: This solution does not use any machine learning models.
Libraries:
- fitz (PyMuPDF): PDF parsing and text extraction
- pathlib: File path handling
- json: Output generation
- logging: Execution logging
- re: Pattern matching
- tesseract-ocr: OCR for scanned/multilingual PDFs (via Docker)

All dependencies are listed in requirements.txt.

🛠️ How to Build and Run

Prerequisites

Docker installed and running

Build the Docker Image

Run this command from the project root:

docker build --platform linux/amd64 -t cpt-adobe-1a:cpt .

Run the Solution

Process all PDFs in the input directory and save JSON outputs in output:

docker run --rm -v "$(pwd)/input:/app/input" -v "$(pwd)/output:/app/output" --network none cpt-adobe-1a:cpt

The container automatically runs with --enhanced --language auto for optimal multilingual processing and maximum accuracy.
No extra configuration is needed—just place your PDFs in input/ and collect results from output/.

📁 Project Structure

├── main.py                           # Unified entry point
├── Dockerfile                        # Lightweight container
├── requirements.txt                  # Minimal dependencies
├── src/                              # Source code
│   ├── core/                         # Core processing
│   │   ├── main.py                   # Enhanced extractor
│   │   ├── title_extractor.py        # Title extraction
│   │   ├── heading_detector.py       # Heading detection
│   │   └── pdf_processor.py          # PDF utilities
│   ├── multilingual/                 # Multilingual support
│   │   ├── enhanced_main.py          # Enhanced multilingual
│   │   └── multilingual_processor.py # OCR & language detection
│   └── extractors/                   # Alternative extractors
├── input/                            # PDF inputs
└── output/                           # JSON outputs

⚡ Performance Metrics

Area	Metrics
Model Size	0MB
Processing Speed	0.13 seconds
Memory Usage	<50MB
Accuracy	85-95%
Dependencies	5 packages

For any questions, please contact the author.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🚀 PDF Outline Extractor

🎯 Approach

📦 Models & Libraries Used

🛠️ How to Build and Run

Prerequisites

Build the Docker Image

Run the Solution

📁 Project Structure

⚡ Performance Metrics

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
input		input
output		output
src		src
Dockerfile		Dockerfile
README.md		README.md
Team_CPT_1A_presentation.pdf		Team_CPT_1A_presentation.pdf
main.py		main.py
requirements.txt		requirements.txt

StrungPattern-coder/cpt-adobe-1a

Folders and files

Latest commit

History

Repository files navigation

🚀 PDF Outline Extractor

🎯 Approach

📦 Models & Libraries Used

🛠️ How to Build and Run

Prerequisites

Build the Docker Image

Run the Solution

📁 Project Structure

⚡ Performance Metrics

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages