01
Hospital-Level Surgical Site Infection Risk Estimation with Hierarchical Bayesian Models
Analysis of 2024 California hospital colon surgery data estimating facility-level surgical site infection risk. Compares logistic regression, a generalized linear mixed model (GLMM), and both non-hierarchical and hierarchical Bayesian binomial models. The hierarchical model surfaced substantial facility-level variation that logistic regression masked, and partial pooling stabilized the otherwise noisy risk estimates at low-volume hospitals. Includes a manuscript-quality report with abstract, methods, results, and conclusions formatted in LaTeX.
R
JAGS
ggplot2
plotly
MyST
LaTeX
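The partial pooling mentioned in project 01 can be illustrated with a minimal Python sketch. This is not the project's JAGS model: it uses a fixed Beta prior centered on the pooled rate as a simplified stand-in for a full hierarchical model, and the facility names, counts, and `PRIOR_STRENGTH` value are all hypothetical.

```python
# Illustrative only: hypothetical counts, not the California hospital data.
# A fixed Beta prior stands in for the hierarchical prior, which a full
# hierarchical model would estimate from the data instead of fixing.

def partial_pool(infections, procedures, prior_mean, prior_strength):
    """Posterior mean infection risk under a Beta prior.

    prior_strength behaves like a count of pseudo-procedures contributed
    by the prior, so small facilities are pulled harder toward prior_mean.
    """
    alpha = prior_mean * prior_strength
    beta = (1.0 - prior_mean) * prior_strength
    return (infections + alpha) / (procedures + alpha + beta)

# Hypothetical facilities: (name, infections, procedures)
facilities = [("low_volume", 2, 10), ("high_volume", 40, 1000)]

# Complete pooling: one rate shared by every facility.
pooled_rate = sum(i for _, i, _ in facilities) / sum(n for _, _, n in facilities)

PRIOR_STRENGTH = 50  # assumed value, chosen only for illustration

results = {}
for name, infections, procedures in facilities:
    raw = infections / procedures  # no pooling: each facility on its own
    shrunk = partial_pool(infections, procedures, pooled_rate, PRIOR_STRENGTH)
    results[name] = (raw, shrunk)
    print(f"{name}: raw={raw:.3f} partially pooled={shrunk:.3f}")
```

The low-volume facility's estimate moves most of the way toward the pooled rate, while the high-volume facility's estimate barely changes, which is the stabilizing behavior the description refers to.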
02
Deploying a Deep Learning System for Breast Ultrasound Cancer Detection
End-to-end ML engineering project classifying breast ultrasound images as benign or malignant. Covers the full lifecycle: exploratory analysis, neural network architecture experiments (fully connected and convolutional models with dropout, learning rate scheduling, and data augmentation), model serialization, containerization, and a deployed prediction API. CPU-only configurations were used to reduce deployment size without sacrificing reproducibility. The emphasis is on the gap between research notebooks and production-ready services.
Python
PyTorch
FastAPI
Docker
Fly.io
Poetry
03
Recurrent Event Modeling for Bladder Cancer Recurrence Using Survival Analysis
Recurrent-events survival analysis of the bladder cancer dataset, implementing and comparing the Andersen-Gill, Lin-Wei-Yang-Ying (LWYY), Prentice-Williams-Peterson (PWP), and Wei-Lin-Weissfeld (WLW) models alongside a Cox frailty extension. The project demonstrates how modeling choices shift when events can recur within the same subject, and why standard time-to-first-event survival methods are insufficient for that structure. Includes dynamic visualizations and an integrated bibliography.
R
Quarto
ggplot2
plotly
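One way to see how the data structure changes in project 03 when events recur is the counting-process layout, where a subject's follow-up is split into one risk interval per event. The Python sketch below is a rough illustration only: the function name, field names, and example subject are hypothetical and are not taken from the project, which is written in R.

```python
# Illustrative only: a hypothetical subject, not a row from the bladder dataset.

def counting_process_rows(event_times, follow_up_end):
    """Split one subject's follow-up into (start, stop] risk intervals.

    Each recurrence closes an interval with event=1; any remaining
    follow-up after the last recurrence becomes a censored interval.
    """
    rows = []
    start = 0
    for number, time in enumerate(sorted(event_times), start=1):
        rows.append({"start": start, "stop": time, "event": 1,
                     "event_number": number})
        start = time
    if follow_up_end > start:
        rows.append({"start": start, "stop": follow_up_end, "event": 0,
                     "event_number": len(event_times) + 1})
    return rows

# Hypothetical subject: recurrences at months 3 and 9, followed to month 14.
rows = counting_process_rows([3, 9], follow_up_end=14)
for row in rows:
    gap = row["stop"] - row["start"]  # gap time since the previous event
    print(row, "gap_time:", gap)
```

Under this layout, an Andersen-Gill model uses the `(start, stop, event)` intervals with a common baseline hazard, while a PWP model additionally stratifies on `event_number` and, in its gap-time form, uses the interval length instead of the total time scale.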
04
Architecting a Cloud-Native Medicare Data Pipeline and Analytics Platform
A production-grade data engineering project built to demonstrate full-stack pipeline development beyond the modeling layer. Monthly U.S. Medicare enrollment data is ingested from the CMS public API, staged in Amazon S3, loaded into Amazon Redshift Serverless via Airflow-orchestrated COPY jobs, and transformed with dbt into analytics-ready mart tables. Infrastructure is fully provisioned with OpenTofu, authentication uses IAM roles throughout with no hardcoded credentials, and a GitHub Actions CI pipeline validates code quality and dbt model compilation on every push. The resulting marts power an interactive Streamlit dashboard with national enrollment trends and a state-level choropleth map. Additional data sources and dashboards are in development.
Python
Apache Airflow
dbt
Amazon Redshift
Amazon S3
EC2
OpenTofu
Docker
Streamlit
GitHub Actions
05
Design-Based Analysis of a Cluster Randomized Trial for Schistosomiasis Control
Applied analysis of a cluster randomized trial with reproducible data processing, descriptive statistics, and treatment-group comparisons. A permutation test was used rather than parametric alternatives because the trial's randomization mechanism is the basis for inference, not an assumed data-generating distribution. The analysis follows the logic of the design rather than defaulting to standard regression approaches.
R
Quarto
ggplot2
plotly
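The design-based logic in project 05 can be sketched as an exact permutation test that re-randomizes treatment at the cluster level. The Python example below is a minimal illustration under stated assumptions: the cluster-level outcomes are invented, the design is assumed to be a simple complete randomization of four of eight clusters, and the test statistic (difference in mean cluster-level outcome) is one reasonable choice, not necessarily the one used in the project.

```python
from itertools import combinations

# Illustrative only: hypothetical cluster-level summaries (e.g. prevalence
# per cluster), not data from the trial.
outcomes = [0.12, 0.18, 0.10, 0.15, 0.30, 0.27, 0.35, 0.22]
treated = (0, 1, 2, 3)  # assumed: clusters 0-3 were randomized to treatment

def mean_diff(treated_idx):
    """Difference in mean cluster-level outcome, treated minus control."""
    treated_set = set(treated_idx)
    t = [y for i, y in enumerate(outcomes) if i in treated_set]
    c = [y for i, y in enumerate(outcomes) if i not in treated_set]
    return sum(t) / len(t) - sum(c) / len(c)

observed = mean_diff(treated)

# Enumerate every assignment the randomization could have produced.
assignments = list(combinations(range(len(outcomes)), len(treated)))
null_distribution = [mean_diff(a) for a in assignments]

# Two-sided p-value: share of assignments at least as extreme as observed.
# The small tolerance guards against floating-point ties.
p_value = sum(
    abs(d) >= abs(observed) - 1e-12 for d in null_distribution
) / len(null_distribution)

print(f"observed difference: {observed:.4f}")
print(f"assignments enumerated: {len(assignments)}")
print(f"permutation p-value: {p_value:.4f}")
```

Because the reference distribution comes from the set of possible cluster assignments, the p-value depends only on the randomization mechanism and not on a distributional model for the outcomes.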
06
Inference vs Prediction: How Statistics and Machine Learning Approach Data Problems
A Reveal.js and Quarto presentation tracing the distinctions and overlaps between data science, statistics, and machine learning, with particular focus on the inference–prediction divide and what it means for how problems are framed and evaluated.
Reveal.js
Quarto
R
Markdown