A YouTube video transcription tool that integrates video downloading with speech-to-text conversion. This is a personal side project, built to cover my own transcription needs by combining multiple open-source projects.
This project combines the following technologies:
- yt-dlp: YouTube video downloading
- WhisperX: High-accuracy speech-to-text transcription
- MCP (Model Context Protocol): Server mode support
Features:
- Support for multiple output formats (text, srt, vtt)
- Multiple model options (tiny, base, small, medium, large-v2, large-v3, breeze-asr-25)
- Speaker diarization support with configurable speaker count
- Redis/Valkey caching for downloaded audio files
- Containerized deployment with GPU support
- MCP server mode with HTTP transport and OAuth2 authentication
```bash
# Build and start MCP server container
docker compose build
docker compose up -d mcp-server

# Execute CLI commands in the running container (reuses models for efficiency)
docker compose exec mcp-server uv run transcript-extractor "https://youtube.com/watch?v=VIDEO_ID" --model large-v3

# Use Breeze ASR 25 for Mandarin-English code-switching
docker compose exec mcp-server uv run transcript-extractor "https://youtube.com/watch?v=VIDEO_ID" --model breeze-asr-25

# Output to file
docker compose exec mcp-server uv run transcript-extractor "https://youtube.com/watch?v=VIDEO_ID" \
  --format srt \
  --model large-v3 \
  --output /app/output/transcript.srt
```

CLI Options:
- --format, -f: Output format (text, srt, vtt)
- --model, -m: Model name (tiny, base, small, medium, large-v2, large-v3, breeze-asr-25)
- --language, -l: Language code (zh, en, etc.)
- --device: Device to run transcription on (cpu, cuda)
- --compute-type: Compute precision (float16, float32, int8)
- --diarize: Enable speaker diarization (requires HF_TOKEN)
- --num-speakers: Number of speakers (if known)
- --min-speakers: Minimum number of speakers
- --max-speakers: Maximum number of speakers
- --verbose, -v: Show processing progress
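The diarization flags above can be combined with the standard command; a minimal sketch (hf_xxx is a placeholder token, and passing it via -e assumes it is not already set in the container environment):

```bash
# Speaker diarization example: requires a valid HuggingFace token
docker compose exec -e HF_TOKEN=hf_xxx mcp-server uv run transcript-extractor \
  "https://youtube.com/watch?v=VIDEO_ID" \
  --model large-v3 \
  --diarize \
  --num-speakers 2
```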
The MCP server uses the streamable HTTP transport and requires a separate OAuth2 authorization server.
```bash
# Start MCP server
docker compose up -d mcp-server

# Check server status
docker compose ps
docker compose logs mcp-server

# Stop server
docker compose down mcp-server
```

MCP Server Configuration:
- MCP_TRANSPORT=http: Use HTTP transport
- MCP_HOST: Server host (default: 0.0.0.0)
- MCP_PORT: Server port (default: 8080)
- AUTH_SERVER_URL: OAuth2 authentication server URL
- MCP_JWT_AUDIENCE: JWT audience (default: transcript-extractor)
- MAX_WHISPER_MODEL: Maximum allowed model (server hardware limits)
Storage and Caching:
- DOWNLOAD_DIR: Directory for downloaded audio files (default: ./downloads)
- MODEL_STORE_DIR: Directory for caching Whisper models (default: ./models)
- VALKEY_HOST/PORT/DB: Redis/Valkey connection settings
- VALKEY_PASSWORD/USERNAME: Redis/Valkey authentication
Additional Features:
- HF_TOKEN: HuggingFace token (required for speaker diarization)
- DEBUG_ENABLED/PORT: Remote debugging configuration
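Putting the variables above together, a configuration file might look like this minimal sketch (every value is an illustrative placeholder; in particular, the MAX_WHISPER_MODEL value format is an assumption):

```bash
# Example .env (placeholder values)
MCP_TRANSPORT=http
MCP_HOST=0.0.0.0
MCP_PORT=8080
AUTH_SERVER_URL=https://auth.example.com
MCP_JWT_AUDIENCE=transcript-extractor
MAX_WHISPER_MODEL=large-v3

DOWNLOAD_DIR=./downloads
MODEL_STORE_DIR=./models
VALKEY_HOST=localhost
VALKEY_PORT=6379

HF_TOKEN=hf_xxx  # only needed for --diarize
```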
When using Docker, output files are stored at:
- Container path: /app/output/
- Local path: ./output/
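For example, assuming the default volume mapping above, a file written inside the container appears on the host:

```bash
# Write the transcript inside the container...
docker compose exec mcp-server uv run transcript-extractor "https://youtube.com/watch?v=VIDEO_ID" \
  --format srt --output /app/output/transcript.srt

# ...then read it from the host
cat ./output/transcript.srt
```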
Requirements:
- Python 3.11-3.12
- Docker & Docker Compose
- NVIDIA GPU (optional, for acceleration)
- FFmpeg (included in container)
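A quick way to sanity-check the prerequisites before building (the GPU check is only relevant if you plan to use CUDA acceleration):

```bash
# Verify Docker Compose is available
docker compose version

# Optional: confirm the host sees an NVIDIA GPU
nvidia-smi
```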
This project uses uv for dependency management:
```bash
# Install development dependencies
uv sync

# Run tests
uv run python -m pytest

# Local CLI execution
uv run transcript-extractor "https://youtube.com/watch?v=VIDEO_ID"

# Local MCP server
uv run transcript-extractor-mcp
```

Known Issues:

The Breeze ASR 25 model (specialized for Mandarin-English code-switching) produces incorrect timestamps in the timed output formats (SRT/VTT). This is caused by this project's parsing and formatting logic when processing the model's output, not by a limitation of the model itself. The timing inaccuracy affects subtitle file generation.
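Until this is fixed, a possible workaround (assuming, as described above, that only the timed formats are affected) is to request plain text output with this model:

```bash
# Workaround sketch: avoid SRT/VTT with breeze-asr-25
docker compose exec mcp-server uv run transcript-extractor \
  "https://youtube.com/watch?v=VIDEO_ID" \
  --model breeze-asr-25 \
  --format text
```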
To prevent VRAM leaks from accumulating across multiple transcription requests, the system automatically clears the GPU memory cache after each transcription. This may cause the following side effects:
- Concurrent Request Risk: If multiple transcription requests are running simultaneously, memory cleanup from one request may interfere with others
- Performance Impact: Models need to be reloaded more frequently, affecting overall throughput
- Recommendation: For production environments with concurrent usage, consider implementing request queuing or using separate instances (a simple client-side version is sketched after this section)
This trade-off prioritizes memory stability over concurrent performance. Future versions may implement more sophisticated memory management strategies.
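As a minimal client-side sketch of the queuing recommendation, requests can be serialized so only one transcription runs at a time (the URLs are placeholders):

```bash
# Naive client-side queue: process URLs strictly one at a time
for url in \
  "https://youtube.com/watch?v=VIDEO_ID_1" \
  "https://youtube.com/watch?v=VIDEO_ID_2"; do
  docker compose exec mcp-server uv run transcript-extractor "$url" --model large-v3
done
```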
This is a personal side project that integrates multiple excellent open-source tools to meet daily video transcription needs.