Principal-Grade Site Reliability Engineer | Polyglot Systems Architect | AI & Deep Learning Engineer
Location: Pune, Maharashtra, India
Email: [email protected]
GitHub: github.com/nkitan
LinkedIn: linkedin.com/in/ankitdas2k
I serve as a critical infrastructure engineer at the intersection of Extreme-Scale Distributed Systems, Low-Level Systems Programming, and Artificial Intelligence.
Currently, I am a Site Reliability Engineer at PhonePe, creating and maintaining the digital backbone for India's largest fintech platform. I possess a holistic "Full Stack" understanding of software reliability, from tuning kernel parameters on Ubuntu Pro & RHEL bare metal clusters to training deep neural networks for computer vision.
I take on high-stakes ownership as the Primary On-Call for my organization's Edge Infrastructure and Edge InfoSec, where I defend and manage traffic for hundreds of millions of daily transactions across verticals including Fintech, Stock Trading, and Hyperlocal Logistics.
I operate within a lean, elite team responsible for the reliability and uptime of an ecosystem that powers a significant portion of India's digital economy.
- User Base: Supporting over 600 Million Registered Users.
- Transaction Volume: Handling 33 Crore (330+ Million) daily transactions.
- Financial Impact: Managing critical systems responsible for an Annualized Total Payment Value (TPV) of INR 150 Lakh Crore (as of March 2025).
- Ecosystem Coverage:
  - PhonePe (Fintech): UPI, Wallet, and Banking integrations.
  - Indus AppStore: A localized Android app store platform.
  - Share.market: A real-time stock market trading platform.
  - Pincode: A hyperlocal delivery and logistics platform.
Site Reliability Engineer | PhonePe Pvt Ltd
Pune, Maharashtra | February 2023 – Present
As a core member of the SRE team, I architect, deploy, and maintain the edge and backend infrastructure. My role demands a blend of systems engineering, network architecture, and security operations.
- In-House WAF Development: Co-architected and developed proprietary, in-house Web Application Firewalls (WAF) to augment vendor solutions. These custom systems provide granular, low-latency threat mitigation specific to high-velocity fintech workloads.
- Distributed Config Synchronization: Built bespoke internal control plane services that guarantee strict configuration synchronization for diverse proxy clusters across geographically distributed Data Centers, effectively eliminating config drift in active-active environments.
- OS & Virtualization: Deep expertise in managing Ubuntu Pro and Red Hat Enterprise Linux (RHEL) environments. Experience spans bare-metal deployments, Type 1 hypervisors (KVM), and Type 2 emulation (QEMU).
- Traffic Engineering: Architecting L4/L7 traffic flow using NGINX, HAProxy, and F5 BIG-IP.
- Stateful Services: Ensuring high availability for mission-critical stateful clusters including ZooKeeper, Aerospike, and Redis.
- Primary On-Call: Acting as the first line of defense for the entire organization's Edge Infrastructure and Edge InfoSec.
- Disaster Recovery (DR): Leading comprehensive DR drills to ensure business continuity across multiple active-active Data Centers.
- Observability: Architecting custom monitoring stacks using InfluxDB, OpenTSDB, Prometheus, and Grafana to track the heartbeat of distributed systems.
- Incident Management: Debugging complex production issues involving network latency, kernel panics, and distributed consensus failures.
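The configuration-synchronization item above can be illustrated with a minimal sketch. This is not the production control plane, only the general idea of detecting drift by comparing content fingerprints. The function names and node names are hypothetical.

```python
import hashlib


def digest(config_text):
    """Return a stable SHA-256 fingerprint of a rendered config."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def find_drift(desired, observed):
    """Return the sorted names of nodes whose config differs from `desired`.

    `observed` maps a (hypothetical) node name to the config text it reports.
    """
    want = digest(desired)
    return sorted(node for node, text in observed.items() if digest(text) != want)
```

A control loop built on this idea would re-push the desired config to every node that `find_drift` returns, then check again.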
For over 2.5 years, I have served as a technical bridge for high-stakes integrations.
- Banking Partners: Managing critical infrastructure integrations with major financial institutions like Yes Bank, Axis Bank, and ICICI.
- Telecom Operators: Coordinating with Vodafone and other telecom providers for network reliability and SMS/OTP delivery pipelines.
- Cloud & AI Vendors: Leading technical relations and integrations with Cloudflare, Akamai, AWS, Azure, and OpenAI.
- Hybrid Security Strategy: Collaborating with external WAF providers while simultaneously deploying in-house security layers to mitigate DDoS attacks and secure API endpoints.
Remote | January 2022 – July 2022
- Distributed Systems: Led the design and development of a Distributed Web Scraper using Java, Hibernate, Selenium, and MySQL.
- Cloud Optimization: Engineered the system to run on AWS EC2 Spot Instances, implementing check-pointing and idempotency to handle instance preemptions, significantly reducing operational costs.
- Architecture: Translated high-level business requirements into a performant, fault-tolerant system.
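The check-pointing and idempotency approach mentioned above can be sketched roughly as follows. The original system was written in Java; this Python version is only an illustration, and `fetch`, `store`, and the checkpoint file layout are assumptions rather than the real design.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of already-processed URLs, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as fh:
        return set(json.load(fh))


def save_checkpoint(path, done):
    """Write the checkpoint atomically so a preemption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, path)  # atomic rename onto the final path


def run(urls, checkpoint_path, fetch, store):
    """Process each URL at most once across restarts.

    `fetch` is a hypothetical callable and `store` a hypothetical dict-like
    sink keyed by URL, so repeating a write overwrites instead of duplicating.
    """
    done = load_checkpoint(checkpoint_path)
    for url in urls:
        if url in done:
            continue  # finished before the last preemption
        store[url] = fetch(url)  # idempotent upsert
        done.add(url)
        save_checkpoint(checkpoint_path, done)
    return done
```

If the instance is reclaimed between the write and the checkpoint save, the URL is fetched again on restart, and the keyed write makes that repeat harmless.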
Pune | May 2020 – October 2020
- Initiated and maintained corporate relationships, liaising with external stakeholders to drive partnership opportunities.
I am a polyglot engineer who believes in using the right tool for the job, whether it's low-level memory management in Rust or high-level AI orchestration in Python.
| Domain | Proficiencies |
|---|---|
| Systems Languages | Rust (High-performance tooling), C (Legacy/Kernel), Golang (Microservices), C++ |
| Scripting & AI | Python (Automation, AI/ML, Data), TypeScript/JavaScript (Frontend/Node), Java |
| Operating Systems | Ubuntu Pro, RHEL (Red Hat Enterprise Linux), Bare Metal Linux, Debian |
| Virtualization | KVM (Type 1 Hypervisor), QEMU (Type 2 Emulation), Docker, Kubernetes |
| Traffic & Networking | NGINX, HAProxy, F5 BIG-IP, Cloudflare, Akamai, L4/L7 Load Balancing, WAFs |
| Databases & Messaging | Aerospike, PostgreSQL, Redis, MongoDB, MariaDB, Elasticsearch, RabbitMQ, Kafka |
| Observability | InfluxDB, OpenTSDB, Prometheus, Grafana, ELK Stack, Custom Bash/Python Monitoring |
| AI & ML Frameworks | TensorFlow, PyTorch, LangChain, Hugging Face, OpenAI API, Google Gemini, Ollama |
| Web Technologies | ReactJS, NextJS, Node.js, ExpressJS, Axum (Rust), FastAPI, HTML5, CSS3 |
| DevOps & Tools | Ansible, Terraform, Git, GitHub Actions, Selenium, Keycloak |
I bridge the gap between Systems Engineering and Artificial Intelligence, focusing on performance optimization and production-grade deployment of Generative AI.
- Architecture: Multi-Agent System using LangChain and Google Gemini.
- Capabilities: An intelligent financial management platform that orchestrates multiple AI agents to analyze spending habits.
- Computer Vision: Implemented advanced OCR pipelines to process crumpled, handwritten, and multi-language physical receipts.
- Tech Stack: Python, FastAPI, Firebase, React Native.
FinanceBud
- Performance: High-performance financial analysis tool engineered for real-time UPI reporting.
- Innovation: Leveraged MCP (Model Context Protocol) to create persistent context backends.
- Optimization: Achieved 60-80% lower latency through intelligent query caching, connection pooling, and parallel tool call processing.
- Tech Stack: Python, FastAPI, WebSocket, SQLite, Ollama/OpenAI.
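The caching and parallel tool call ideas listed above can be sketched with `asyncio`. This is an illustrative sketch under assumptions, not FinanceBud's actual code: `ToolRunner` and the tool names are hypothetical, and connection pooling is left out.

```python
import asyncio


class ToolRunner:
    """Runs tool calls concurrently and memoizes results by (tool, argument)."""

    def __init__(self, tools):
        self.tools = tools   # name -> async callable (hypothetical tools)
        self._cache = {}     # (name, arg) -> asyncio.Task

    def _call(self, name, arg):
        key = (name, arg)
        if key not in self._cache:
            # Cache the task itself so concurrent duplicates share one call.
            self._cache[key] = asyncio.create_task(self.tools[name](arg))
        return self._cache[key]

    async def run(self, calls):
        """Await all (name, arg) calls in parallel, preserving input order."""
        return await asyncio.gather(*(self._call(n, a) for n, a in calls))
```

In this sketch a failed call stays cached and re-raises on later lookups. A real implementation would likely evict failed entries and bound the cache size.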
Enhanced Lip Reading (Research)
- Domain: Computer Vision & Deep Learning.
- Implementation: Developed an Automated Lip Reading (ALR) model using Temporal Convolutional Networks (TCN) and ResNet18.
- Training: Trained on the massive "Lip Reading in the Wild" (LRW) dataset to parse meaningful text purely from visual lip movement data.
- Tech Stack: Python, PyTorch, Docker, Django.
Crabfull
- Language: Rust
- Description: A high-performance log viewer designed to handle massive log files with speed and memory safety, demonstrating the capabilities of Rust in system tooling.
Autowall
- Language: Rust
- Description: A lightweight, resource-efficient daemon for Windows that automates wallpaper management.
Diffie
- Tech: Web Stack
- Description: A "File Diff Viewer as a Service" allowing users to visualize code changes seamlessly in the browser.
Muid
- Tech: Frontend
- Description: A Markdown WYSIWYG editor built for developers who need real-time preview and editing capabilities.
Indian Institute of Technology (IIT), Roorkee | June 2024 – December 2024
- Core Modules: Generative AI, Large Language Models (LLMs), Deep Learning, Natural Language Processing (NLP), Computer Vision, Prompt Engineering.
- Capstone: Built end-to-end AI applications using RAG and Agentic workflows.
Symbiosis Institute of Technology, Pune | June 2019 – May 2023
- Relevant Coursework: Deep Learning, Compiler Construction, Distributed Systems, Data Warehousing & Mining, Computer Organization, Networking, Theory of Computation.
"Reliability is not an option; it's the baseline. Whether managing a billion-dollar fintech pipeline or training a neural network, the goal is the same: Build systems that work, scale, and endure."



