Building the infrastructure that makes AI run fast, cheap, and reliably at scale.
I'm a Senior MLOps / AI Platform Engineer with 5+ years shipping production systems across two deep specializations:
- LLM Inference Infrastructure: vLLM, SGLang, MCP-based agents, RAG architectures.
- Computer Vision Pipelines: real-time object detection, multi-object tracking, and segmentation at millions of frames per week.
I've led teams of 4+ engineers, contributed to open-source SDKs that grew downloads 10x, and optimized model serving for high throughput.
Currently: MCP-based agent orchestration and pushing LLM inference latency boundaries with vLLM + SGLang.
I've taken systems from prototype to production across LLM infra and computer vision. A few highlights:
- 3x latency reduction on a multi-modal RAG platform over a 1M+ document knowledge base
- 80% GPU memory savings with LoRA/PEFT adapters for cost-efficient production fine-tuning
- 0.97 mAP on car dent detection & segmentation for an insurance client
- 90% MOTA on sports analytics pipelines processing millions of frames/week
- 10x SDK download growth via Clarifai Python SDK & CLI contributions
- NVIDIA Smart City Hackathon finalist (Asia-Pacific): pothole detection with RT-DETR
Docwhisper
A document Q&A system built on a full RAG pipeline: ingest PDFs, chunk and embed them into a vector store, then retrieve and answer with a locally running LLM via Ollama. Fast, private, and runs entirely on your machine. Includes MLflow-based observability: query traces, retrieval quality metrics, and latency are tracked per run for easy debugging and iteration.
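The ingest → retrieve → answer loop described above can be sketched in a few lines. This toy version uses a bag-of-words embedding, an in-memory "vector store", and cosine similarity; all helper names are illustrative, not Docwhisper's actual code, and a real pipeline would swap in a sentence-transformer and an Ollama call for generation.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding; a real pipeline would use a sentence-transformer
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(doc, size=8):
    # Split a document into fixed-size word windows
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Ingest: chunk the document and embed each chunk into an in-memory store
doc = ("Invoices are processed nightly. Refunds are issued within "
       "five business days after approval.")
store = [(c, embed(c)) for c in chunk(doc)]

# Retrieve: rank chunks against the query, keep the best match as context
query = "How long do refunds take?"
best = max(store, key=lambda item: cosine(embed(query), item[1]))[0]

# Answer: the retrieved context is packed into a prompt for a local LLM
prompt = f"Context: {best}\n\nQuestion: {query}"
```

The same shape scales up: replace `embed` with a real embedding model, the list with a vector database, and the final prompt with a streamed Ollama chat call.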
local-slm-bench
Run up to 3 models simultaneously with real-time token streaming and per-model metrics. A built-in benchmark suite covers 18 prompts across 10 categories (reasoning, code, math, safety). Interactive Plotly charts visualize throughput, TTFT, RAM usage, and quality-vs-speed trade-offs, all on local hardware.
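TTFT and throughput both fall out of token timestamps on the streaming loop. A minimal sketch with a simulated token stream (the generator is a stand-in, not the tool's real model interface):

```python
import time

def stream_tokens():
    # Stand-in for a model's streaming output; yields tokens with a small delay
    for tok in ["The", " answer", " is", " 42", "."]:
        time.sleep(0.01)
        yield tok

start = time.perf_counter()
ttft = None
n_tokens = 0
for tok in stream_tokens():
    if ttft is None:
        ttft = time.perf_counter() - start  # time to first token
    n_tokens += 1
elapsed = time.perf_counter() - start
throughput = n_tokens / elapsed  # tokens per second over the whole response
```

Running the same loop per model, per prompt, gives the raw numbers behind the throughput and TTFT charts.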
Clarifai Python SDK (Contributor)
Drove significant growth through CLI improvements, new API surface coverage, and DX enhancements, contributing to a 10x increase in monthly downloads.
multi-agent-research
Production-grade multi-agent pipeline: Orchestrator → Search Agent → Summarizer → Critic, connected via A2A typed messaging and MCP servers for tools, memory, and web search. Tracks token usage, latency, cost, and confidence scores per agent. Streamlit dashboard with Plotly charts for visual analytics.
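The typed-message handoff between agents can be sketched with a dataclass envelope routed through the pipeline; field and agent names here are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass
class AgentMessage:
    # Typed envelope passed between agents; fields are illustrative
    sender: str
    recipient: str
    content: str
    tokens_used: int = 0

def search_agent(msg: AgentMessage) -> AgentMessage:
    return AgentMessage("search", "summarizer",
                        f"results for '{msg.content}'", tokens_used=120)

def summarizer_agent(msg: AgentMessage) -> AgentMessage:
    return AgentMessage("summarizer", "critic",
                        f"summary of {msg.content}", tokens_used=80)

def critic_agent(msg: AgentMessage) -> AgentMessage:
    return AgentMessage("critic", "orchestrator",
                        f"approved: {msg.content}", tokens_used=40)

# Orchestrator: route the message through each agent, accumulating token cost
pipeline = [search_agent, summarizer_agent, critic_agent]
msg = AgentMessage("orchestrator", "search", "LLM inference trends")
total_tokens = 0
for agent in pipeline:
    msg = agent(msg)
    total_tokens += msg.tokens_used
```

Typed envelopes make per-agent accounting (tokens, latency, cost) trivial: every hop returns a message whose metadata the orchestrator can aggregate before rendering it on a dashboard.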
Other Projects
| Project | Description | Stack |
|---|---|---|
| PyTorch Object Detect & Track | Real-time multi-object detection and tracking in video | Python, PyTorch, YOLO |


