Data Science vs. Statistics vs. AI vs. Machine Learning: A Comparative Overview

Robert Daniels

Data Scientist and Statistician

License

This presentation is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

Please give appropriate credit if you reuse or adapt this work.

Who I am

Rob Daniels, MS, MPH, CPH

Experience: 13+ years in applied statistics, data science, machine learning

Former roles:

Statistician at Harvard University
Contractor statistician at Centers for Disease Control and Prevention (CDC)

Overview

Distinguish between AI, machine learning, data science, and statistics
Distinguish statistical models (SM) vs. machine learning models (ML)
Guidelines on when and how to apply SM vs. ML

AI vs. machine learning vs. data science vs. statistics

Venn diagram

Morris et al. (2024). Current and future applications of artificial intelligence in surgery: implications for clinical practice and research. Frontiers in Surgery, 11. https://doi.org/10.3389/fsurg.2024.1393898.

Artificial intelligence (AI)

Simulation of human intelligence in machines that are programmed to think, learn, and make decisions

Artificial narrow intelligence (ANI)
• “One-trick-pony” applications
• E.g., smart speaker, self-driving car, web search

Generative AI
• More general-purpose uses
• E.g., ChatGPT, Claude, Gemini, DeepSeek, DALL-E 2

“You won’t lose your job to AI, but you may lose your job to someone who knows how to use AI”

Agentic AI
• More autonomous decisions/actions
• E.g., OpenAI’s Operator

Artificial general intelligence (AGI)
• Can do anything a human can do

Ng, A. (2021). What is AI? [Video lecture]. AI for Everyone. Coursera. https://www.coursera.org/learn/ai-for-everyone

What is machine learning?

“The science of getting computers to learn without being explicitly programmed”

Supervised learning
• Maps input to output: X → Y labels
• Classification, regression, neural networks

Examples:
– X-ray image → tumor/no tumor
– Words → next word (LLM)
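As a rough illustration of the X → Y mapping above, here is a minimal sketch of a supervised classifier (1-nearest-neighbor) in plain Python. The two numeric features and all training values are invented for illustration; a real imaging model would learn from pixel data, typically with a neural network.

```python
import math

# Hypothetical labeled training data: two made-up image-derived features
# (each scaled to 0-1) paired with a known label. Values are invented.
TRAIN_X = [(0.20, 0.10), (0.30, 0.20), (0.25, 0.15),
           (0.80, 0.70), (0.90, 0.80), (0.75, 0.90)]
TRAIN_Y = ["no tumor", "no tumor", "no tumor",
           "tumor", "tumor", "tumor"]


def predict_1nn(x, train_x=TRAIN_X, train_y=TRAIN_Y):
    """Return the label of the training example closest to x (Euclidean distance)."""
    distances = [math.dist(x, xi) for xi in train_x]
    nearest = distances.index(min(distances))
    return train_y[nearest]


if __name__ == "__main__":
    # New, unlabeled inputs: the model maps each X to a predicted Y.
    print(predict_1nn((0.85, 0.75)))
    print(predict_1nn((0.20, 0.20)))
```

The key point is that the labels in TRAIN_Y are what make this supervised: the model only learns the mapping because each input comes with a known answer.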

Unsupervised learning
• No labels
• Clustering, anomaly detection, data reduction

Examples:
– Customer segmentation
– Fraud detection
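Customer segmentation can be sketched with a small k-means clustering routine. This is a simplified one-dimensional version with a deterministic starting point, and the spending figures are hypothetical; production work would normally use a library implementation on many features.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Cluster 1-D values into k groups with plain k-means (Lloyd's algorithm)."""
    if k < 2 or k > len(values):
        raise ValueError("k must be between 2 and the number of values")
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break  # Converged: assignments no longer change the centers.
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    # Hypothetical annual spend per customer; note there are no labels.
    annual_spend = [120, 150, 130, 900, 950, 1000, 140, 980]
    centers, clusters = kmeans_1d(annual_spend, k=2)
    print(centers)
    print(clusters)
```

No segment labels are supplied anywhere; the two groups emerge from the structure of the data alone.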

Reinforcement learning
• Rewards and punishments guide learning

Examples:
– Self-driving cars
– Robotics
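The reward-driven loop can be sketched with an epsilon-greedy multi-armed bandit, one of the simplest reinforcement learning setups. The reward probabilities, step count, and epsilon below are arbitrary illustrative choices, and real applications such as driving or robotics involve far richer states and actions.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn which arm pays best by trial and error (epsilon-greedy)."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms        # How many times each arm was pulled.
    estimates = [0.0] * n_arms   # Running average reward per arm.
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # Explore: try a random arm.
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # Exploit.
        # The environment returns a reward of 1 or 0 (no reward).
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the average reward for the chosen arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


if __name__ == "__main__":
    # Hypothetical environment: arm 2 pays off most often.
    estimates, counts = run_bandit([0.2, 0.5, 0.8])
    print(estimates)
    print(counts)
```

The agent is never told which arm is best; it infers this from the rewards it receives, which is the defining feature of reinforcement learning.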

Deep learning

Subset of machine learning using multi-layered neural networks to model complex patterns
Commonly used types of neural networks (NNs):

Recurrent neural networks (RNNs): natural language processing, speech recognition
Generative adversarial networks (GANs): image generation, style transfer
Convolutional neural networks (CNNs): image processing, object detection

Convolutional neural network diagram created by Claude AI (Anthropic, 2025). https://claude.ai.
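To give a feel for what a convolutional layer computes, here is a minimal sketch of a single 2-D convolution in plain Python. The image and kernel values are invented, and the function omits padding, stride, bias, and learning. As in most deep learning frameworks, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = sum(image[i + di][j + dj] * kernel[di][dj]
                        for di in range(kh) for dj in range(kw))
            row.append(total)
        output.append(row)
    return output


if __name__ == "__main__":
    # Hypothetical 4x4 grayscale image: dark left half, bright right half.
    image = [[0, 0, 1, 1],
             [0, 0, 1, 1],
             [0, 0, 1, 1],
             [0, 0, 1, 1]]
    # Hand-picked kernel that responds to dark-to-bright vertical edges.
    kernel = [[-1, 1],
              [-1, 1]]
    for row in convolve2d(image, kernel):
        print(row)
```

In a trained CNN the kernel values are learned from data rather than chosen by hand, and many such kernels are stacked across layers.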

Data science vs. statistics

Scope

Data science:
• Broader, multi-disciplinary
• Integrates computer science, data analysis, ML, data visualization, and domain knowledge

Statistics:
• Narrower; focused on theory and methods for collecting, analyzing, interpreting, and presenting data
• Emphasizes understanding relationships and uncertainty in data

Core principles

Data science:
• Algorithm-based
• Prediction-focused

Statistics:
• Inference, hypothesis testing, estimation
• Rooted in probability and mathematical rigor

Problem-solving approach

Data science:
• Exploratory, iterative
• Optimizing predictions

Statistics:
• Usually hypothesis-driven
• Emphasis on testing and validating assumptions

Data science vs. statistics continued

Tools and techniques

Data science:
• More programming (e.g., Python, SQL)
• ML, deep learning, big data tools (e.g., Spark)

Statistics:
• Specialized statistical software (e.g., R, SAS, Stata, Mplus)
• Specialized methods (e.g., mixed modeling)

Data

Data science:
• Large-scale
• Unstructured, semi-structured, and structured

Statistics:
• Primarily structured
• Smaller samples, well-defined study designs

Applications

Data science:
• Tech industry, business intelligence
• Fraud detection, NLP, computer vision

Statistics:
• Clinical trials
• Survey analysis, data collection, risk modeling

Statistics vs. machine learning

Inference vs. prediction

Inference involves drawing conclusions about a population based on a sample of data; it is concerned with understanding relationships, estimating parameters, and testing hypotheses
- Focuses on why something happens
- E.g., identifying which factors significantly impact disease risk

Prediction refers to using a model to forecast future outcomes based on current or historical data
- Focuses on what will happen
- E.g., predicting whether a patient will develop a disease given a set of health metrics

Traditional statistical models are typically designed for inference and emphasize interpretability, but they can be used for both inference and prediction

Machine learning models optimize prediction and offer limited inference
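The contrast between inference and prediction can be sketched with one simple linear regression fitted by ordinary least squares. The data are invented, and the critical value 2.447 is the two-sided 95% t value for 6 degrees of freedom, hard-coded here as an assumption to avoid a dependency on a statistics library.

```python
import math


def fit_line(x, y, t_crit):
    """Fit y = intercept + slope * x by least squares and summarize the slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    ci_slope = (slope - t_crit * se_slope, slope + t_crit * se_slope)
    return {"slope": slope, "intercept": intercept,
            "se_slope": se_slope, "ci_slope": ci_slope}


def predict(fit, x_new):
    """Point prediction for a new observation."""
    return fit["intercept"] + fit["slope"] * x_new


if __name__ == "__main__":
    # Hypothetical exposure (x) and outcome (y) for 8 subjects.
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
    fit = fit_line(x, y, t_crit=2.447)  # t value for n - 2 = 6 df (assumed)
    # Inference: how strongly is x related to y, and how uncertain is that?
    print("slope:", fit["slope"], "95% CI:", fit["ci_slope"])
    # Prediction: what outcome do we expect for a new subject with x = 10?
    print("predicted y at x = 10:", predict(fit, 10))
```

The same fitted model answers both questions, but the inferential output (slope with its confidence interval) is what distinguishes it from a pure prediction tool.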

Statistical models vs. machine learning models

Statistical models are probabilistic and explicitly consider uncertainty

Statistical models start with preconceived structure and typically assume additivity of predictor effects when specifying the model

Machine learning models are more empirical and allow for higher-order interactions that are not pre-specified; they do not attempt to isolate the effect of any single variable

Harrell, Frank. (2018, April 30). Road map for choosing between statistical modeling and machine learning. Statistical Thinking. https://www.fharrell.com/post/stat-ml/.

When to use SM vs. ML

ML may work best when overall prediction is the goal and there is no need to describe the impact of any one predictor

ML typically requires large sample sizes (on the order of 200 events per candidate predictor)

ML might work better for high signal-to-noise situations (e.g., visual and sound recognition)

ML tends to work well when learning a simple concept

ML tends to work poorly when:
- Learning complex concepts from small amounts of data
- Performing on new types of data (data the model was not trained on)

Harrell, Frank. (2018, April 30). Road map for choosing between statistical modeling and machine learning. Statistical Thinking. https://www.fharrell.com/post/stat-ml/.
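The sample size guideline above translates into a quick back-of-the-envelope check. The sketch below uses the 200-events-per-candidate-predictor figure from the slide as a default; the function name is hypothetical, and for a binary outcome "events" is assumed to mean the count of the less frequent outcome class.

```python
def max_candidate_predictors(n_events, events_per_predictor=200):
    """Rough upper bound on candidate predictors for a given number of events."""
    if n_events < 0 or events_per_predictor <= 0:
        raise ValueError("n_events must be >= 0 and events_per_predictor > 0")
    return n_events // events_per_predictor


if __name__ == "__main__":
    # Hypothetical study: 10,000 patients, 1,000 of whom develop the disease.
    n_events = 1000
    print(max_candidate_predictors(n_events))
```

With 1,000 events, this rule of thumb supports only about 5 candidate predictors for an ML model, which shows how quickly a seemingly large dataset can become small.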

Additional AI/ML considerations

AI/ML roles

Data scientist
ML engineer (and, more recently, AI engineer)
Data engineer
- Organize and manage data, data access, security in a cost-effective way
- Needed for projects with large data (terabyte scale or larger)
AI product manager
Training resources:
- Emphasis on concepts and algorithms, not just specific AI implementations

One second rule

What is it?

If a human can complete a task in about one second of thought, AI may be able to automate it

Why it works
• Pattern recognition: AI learns from data like human intuition
• Low complexity: quick decisions are ideal for AI
• Automation potential: repetitive, structured tasks fit well

Examples
• Object recognition in images
• Text prediction
• Document classification

When the rule breaks down
• Deep reasoning required
• Ambiguity and subjectivity

Considerations for LLMs

Retrieval-augmented generation (RAG)
- Combines an existing LLM with a custom knowledge base to provide up-to-date, domain-specific answers
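The retrieval half of RAG can be sketched in a few lines. This toy version ranks documents by bag-of-words cosine similarity and assembles a prompt; the knowledge base, function names, and prompt wording are all hypothetical, real systems usually use learned embeddings and a vector store, and the call to the LLM itself is deliberately left out.

```python
import math
import re
from collections import Counter

# Hypothetical custom knowledge base (invented content).
KNOWLEDGE_BASE = [
    "Clinic hours are 9am to 5pm on weekdays.",
    "Flu vaccines are available to all staff in October.",
    "Parking permits are renewed every January.",
]


def tokenize(text):
    """Lowercase word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, documents, top_k=1):
    """Combine retrieved context with the question for an existing LLM."""
    context = "\n".join(retrieve(question, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("When are flu vaccines available?", KNOWLEDGE_BASE))
```

Because the knowledge base sits outside the model, it can be updated at any time without retraining the LLM.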

Fine-tuning pre-trained models
- Requires less data and compute power than training from scratch
- Uses existing models that are fine-tuned on domain-specific data