5D Parallelism for Transformer Training
complete
2.10M tokens/sec on 8-GPU data-parallel training runs
Built the parallel training stack for a GPT-style model as a capstone for a hands-on distributed systems workshop. Implemented data, tensor, pipeline, context, and expert parallelism from scratch, then benchmarked throughput, memory, and convergence across 1–8 GPU configurations.
PyTorch, Distributed Training, CUDA, Python, ML Systems, Infrastructure, Transformers
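To illustrate the data-parallel part of the project above, here is a toy sketch in plain Python with no GPUs and no PyTorch. It is not the project's code: the one-weight model, the `shard` helper, and `all_reduce_mean` are hypothetical stand-ins chosen for illustration. It shows that averaging per-rank gradients over equally sized shards reproduces the full-batch gradient, which is the property data-parallel training relies on.

```python
# Toy simulation of data-parallel gradient averaging (illustrative names only).
# Model: y_hat = w * x, loss = mean((w * x - y) ** 2).

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def shard(items, world_size):
    """Split items into world_size contiguous, equally sized shards."""
    assert len(items) % world_size == 0, "sketch assumes equal shard sizes"
    size = len(items) // world_size
    return [items[i * size:(i + 1) * size] for i in range(world_size)]

def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages one value per simulated rank."""
    return sum(values) / len(values)

def data_parallel_grad(w, xs, ys, world_size):
    """Each simulated rank computes a local gradient, then the results are averaged."""
    local_grads = [
        grad_mse(w, shard_x, shard_y)
        for shard_x, shard_y in zip(shard(xs, world_size), shard(ys, world_size))
    ]
    return all_reduce_mean(local_grads)

xs = [float(i) for i in range(1, 9)]   # 8 made-up samples
ys = [3.0 * x for x in xs]             # targets from a true weight of 3.0
w = 0.5

full = grad_mse(w, xs, ys)
dp = data_parallel_grad(w, xs, ys, world_size=4)
print(abs(full - dp) < 1e-9)
```

The equivalence holds here because every shard has the same size. With unequal shards, the per-rank gradients would need to be weighted by shard size before averaging.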
Scaling a Multimodal Tutor Model on Modal
complete
63.05 req/s peak text throughput; 916 ms p95 TTFT on mixed multimodal traffic
Benchmarked Qwen3-VL-4B-Instruct on Modal with vLLM, comparing H100 replica scaling, tensor parallelism, concurrency limits, and mixed multimodal traffic to find a low-latency, high-throughput serving configuration.
Modal, vLLM, Inference, H100, Multimodal, Qwen, Benchmarking
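The two headline metrics for the serving benchmark above, requests per second and p95 time to first token (TTFT), can be computed from raw request logs. The sketch below assumes a nearest-rank percentile definition, since the page does not say which one the benchmark used. The function names and sample values are hypothetical and are not measurements from the project.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def requests_per_second(num_requests, wall_seconds):
    """Throughput over a benchmark window measured in wall-clock seconds."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return num_requests / wall_seconds

# Hypothetical TTFT samples in milliseconds (100, 110, ..., 290).
ttft_ms = [100 + 10 * i for i in range(20)]

p95 = percentile_nearest_rank(ttft_ms, 95)
rps = requests_per_second(num_requests=120, wall_seconds=4.0)
print(p95, rps)
```

Nearest-rank always returns an observed sample, which keeps tail-latency numbers easy to trace back to a specific request. Interpolating definitions can give slightly different values on small samples.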
ICU Deterioration Warning System
complete
110M+ hourly ICU records; 0.995 AUROC deterioration prediction
Built an ICU early deterioration warning system on 110M+ hourly records from 50,920 patients (MIMIC-IV), using XGBoost with 12-hour rolling-window features to predict vasopressor initiation, intubation, CRRT, or death within 12 hours.
MIMIC, scikit, python
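As a rough illustration of the rolling-window features mentioned above, the sketch below summarizes each hour using only the current and preceding readings of one vital sign. It is an assumption-laden toy and not the project's feature pipeline: the `rolling_features` helper, the chosen statistics (mean, min, max, delta), and the heart-rate values are all hypothetical, and the example uses a 3-hour window for brevity where the project description states 12 hours.

```python
from collections import deque

def rolling_features(hourly_values, window=12):
    """For each hour, summarize the current reading and up to (window - 1) earlier ones."""
    buf = deque(maxlen=window)  # oldest reading is dropped automatically
    rows = []
    for value in hourly_values:
        buf.append(value)
        rows.append({
            "mean": sum(buf) / len(buf),
            "min": min(buf),
            "max": max(buf),
            "delta": buf[-1] - buf[0],  # change across the window
        })
    return rows

# Hypothetical hourly heart-rate readings for one patient.
heart_rate = [80, 82, 85, 90, 96, 104]

features = rolling_features(heart_rate, window=3)
print(features[-1])
```

Because each row looks only backward in time, the features for a given hour never include readings from the prediction horizon that follows it.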