MS · MPH · CPH

Rob Daniels

Applied statistician and data scientist. Previously at Harvard University and the CDC.

Selected Projects

01

Hospital-Level Surgical Site Infection Risk Estimation with Hierarchical Bayesian Models

Analysis of 2024 California hospital colon surgery data estimating facility-level surgical site infection risk. Compared logistic regression, GLMM, and both non-hierarchical and hierarchical Bayesian binomial models. The hierarchical model surfaced significant facility-level variation that logistic regression masked, and partial pooling corrected for the instability of risk estimates at low-volume hospitals. Includes a manuscript-quality report with abstract, methods, results, and conclusions formatted in LaTeX.

R JAGS ggplot2 plotly MyST LaTeX
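
To illustrate the partial-pooling idea described above, here is a minimal Python sketch. It is not the project's JAGS model: it swaps in a simplified conjugate Beta-Binomial shrinkage with a fixed prior. The function name `partial_pool`, the `prior_strength` value, and the hospital counts are all hypothetical, chosen only to show how low-volume estimates are pulled toward the pooled rate.

```python
def partial_pool(events, procedures, prior_strength=50.0):
    """Shrink each hospital's raw infection rate toward the pooled rate.

    Assumption: a Beta prior centred on the pooled rate, with a fixed
    pseudo-sample size `prior_strength` (hypothetical, not estimated).
    """
    pooled_rate = sum(events) / sum(procedures)
    a = pooled_rate * prior_strength          # prior pseudo-events
    b = (1.0 - pooled_rate) * prior_strength  # prior pseudo-non-events
    # Posterior mean of a Beta(a + y, b + n - y) distribution.
    return [(y + a) / (n + a + b) for y, n in zip(events, procedures)]


# Hypothetical counts: two low-volume hospitals and one high-volume hospital.
events = [0, 2, 30]
procedures = [5, 10, 1000]

raw = [y / n for y, n in zip(events, procedures)]
pooled = sum(events) / sum(procedures)
shrunk = partial_pool(events, procedures)

for r, s, n in zip(raw, shrunk, procedures):
    print(f"n={n:5d}  raw={r:.3f}  partially pooled={s:.3f}")
```

The low-volume hospitals move substantially toward the pooled rate, while the high-volume hospital's estimate barely changes. A full hierarchical model would estimate the amount of shrinkage from the data rather than fixing it in advance.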

02

Deploying a Deep Learning System for Breast Ultrasound Cancer Detection

End-to-end ML engineering project classifying breast ultrasound images as benign or malignant. Covers the full lifecycle: exploratory analysis, neural network architecture experiments (fully connected and convolutional models with dropout, learning rate scheduling, and data augmentation), model serialization, containerization, and a deployed prediction API. CPU-only configurations were used to reduce deployment size without sacrificing reproducibility. The emphasis is on the gap between research notebooks and production-ready services.

Python PyTorch FastAPI Docker Fly.io Poetry

03

Recurrent Event Modeling for Bladder Cancer Recurrence Using Survival Analysis

Recurrent-events survival analysis of the bladder cancer dataset, implementing and comparing the Andersen-Gill, Lin-Wei-Yang-Ying (LWYY), Prentice-Williams-Peterson (PWP), and Wei-Lin-Weissfeld (WLW) models alongside a Cox frailty extension. The project demonstrates how modeling choices shift when events can recur within the same subject, and why standard survival methods are insufficient for that structure. Includes dynamic visualizations and an integrated bibliography.

R Quarto ggplot2 plotly
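
One way to see why recurrent events change the modeling is in the data layout itself. The Python sketch below is illustrative only and is not taken from the project, which uses R. It builds counting-process intervals of the kind an Andersen-Gill model consumes, then resets the clock after each event to get the gap-time form used by one PWP variant. The function names, the field names, and the example subject are hypothetical.

```python
def counting_process_rows(event_times, follow_up_end):
    """Split one subject's follow-up into (start, stop] risk intervals.

    Each recurrence closes an interval with status=1. Any remaining
    follow-up after the last event becomes a censored interval (status=0).
    `stratum` is the event number, which PWP-style models stratify on.
    """
    rows = []
    start = 0
    for k, t in enumerate(event_times, start=1):
        rows.append({"start": start, "stop": t, "status": 1, "stratum": k})
        start = t
    if follow_up_end > start:
        rows.append({"start": start, "stop": follow_up_end,
                     "status": 0, "stratum": len(event_times) + 1})
    return rows


def gap_time_rows(rows):
    """Reset the clock to zero at the start of every interval."""
    return [{**r, "start": 0, "stop": r["stop"] - r["start"]} for r in rows]


# Hypothetical subject: recurrences at months 3 and 10, followed to month 15.
total_time = counting_process_rows([3, 10], follow_up_end=15)
gap_time = gap_time_rows(total_time)

for row in total_time:
    print(row)
```

A standard single-event Cox analysis would keep only the first row per subject, discarding the later recurrences that these layouts retain.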

04

Architecting a Cloud-Native Medicare Data Pipeline and Analytics Platform

A production-grade data engineering project built to demonstrate full-stack pipeline development beyond the modeling layer. Monthly U.S. Medicare enrollment data is ingested from the CMS public API, staged in Amazon S3, loaded into Amazon Redshift Serverless via Airflow-orchestrated COPY jobs, and transformed with dbt into analytics-ready mart tables. Infrastructure is fully provisioned with OpenTofu, authentication uses IAM roles throughout with no hardcoded credentials, and a GitHub Actions CI pipeline validates code quality and dbt model compilation on every push. The resulting marts power an interactive Streamlit dashboard with national enrollment trends and a state-level choropleth map. Additional data sources and dashboards are in development.

Python Apache Airflow dbt Amazon Redshift Amazon S3 EC2 OpenTofu Docker Streamlit GitHub Actions

05

Design-Based Analysis of a Cluster Randomized Trial for Schistosomiasis Control

Applied analysis of a cluster randomized trial with reproducible data processing, descriptive statistics, and treatment-group comparisons. A permutation test was used rather than parametric alternatives because the trial's randomization mechanism is the basis for inference, not an assumed data-generating distribution. The analysis follows the logic of the design rather than defaulting to standard regression approaches.

R Quarto ggplot2 plotly
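
The randomization-based reasoning above can be sketched in a few lines of Python. This is an illustration, not the project's code: it assumes simple complete randomization of clusters with no stratification or matching (the actual trial design may differ), uses the difference in mean cluster-level outcomes as the test statistic, and runs on made-up prevalence values. The function name `cluster_permutation_test` is hypothetical.

```python
from itertools import combinations
from statistics import mean


def cluster_permutation_test(cluster_outcomes, treated):
    """Exact two-sided permutation test at the cluster level.

    cluster_outcomes: dict mapping cluster id -> cluster-level summary
    treated: set of cluster ids assigned to treatment
    Assumption: every split with the same number of treated clusters
    was equally likely under the design.
    """
    ids = sorted(cluster_outcomes)

    def diff(treated_ids):
        t = [cluster_outcomes[c] for c in ids if c in treated_ids]
        u = [cluster_outcomes[c] for c in ids if c not in treated_ids]
        return mean(t) - mean(u)

    observed = diff(set(treated))
    # Re-randomize: enumerate every possible assignment of clusters to arms.
    null = [diff(set(combo)) for combo in combinations(ids, len(treated))]
    # Small tolerance so floating-point ties count as "at least as extreme".
    extreme = sum(abs(d) >= abs(observed) - 1e-12 for d in null)
    return observed, extreme / len(null)


# Hypothetical cluster-level infection prevalence for eight villages.
prevalence = {"A": 0.10, "B": 0.12, "C": 0.08, "D": 0.11,
              "E": 0.25, "F": 0.30, "G": 0.22, "H": 0.27}
observed, p_value = cluster_permutation_test(prevalence, {"A", "B", "C", "D"})
print(f"observed difference = {observed:.4f}, p = {p_value:.4f}")
```

Permuting whole clusters rather than individuals keeps the unit of inference aligned with the unit of randomization. With many clusters, exhaustive enumeration becomes impractical and a random sample of re-assignments would be used instead.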

06

Inference vs Prediction: How Statistics and Machine Learning Approach Data Problems

A Reveal.js and Quarto presentation tracing the distinctions and overlaps between data science, statistics, and machine learning, with particular focus on the inference–prediction divide and what it means for how problems are framed and evaluated.

Reveal.js Quarto R Markdown