Title slide

Data Science vs. Statistics vs. AI vs. Machine Learning: A Comparative Overview



Robert Daniels

Data Scientist and Statistician

License

This presentation is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

Please give appropriate credit if you reuse or adapt this work.

Who I am

Rob Daniels, MS, MPH, CPH

Experience: 13+ years in applied statistics, data science, machine learning

Former roles:

  • Statistician at Harvard University
  • Contractor statistician at Centers for Disease Control and Prevention (CDC)

Overview

  • Distinguish between AI, machine learning, data science, and statistics
  • Distinguish between statistical models (SM) and machine learning (ML) models
  • Guidelines on when and how to apply SM vs. ML

AI vs. machine learning vs. data science vs. statistics

Venn diagram

Morris et al. (2024). Current and future applications of artificial intelligence in surgery: implications for clinical practice and research. Frontiers in Surgery, 11. https://doi.org/10.3389/fsurg.2024.1393898.

Artificial intelligence (AI)

Simulation of human intelligence in machines that are programmed to think, learn, and make decisions

Artificial narrow intelligence (ANI)
• “One-trick-pony” applications
• E.g., smart speaker, self-driving car, web search

Generative AI
• More general-purpose uses
• E.g., ChatGPT, Claude, Gemini, DeepSeek, DALL-E 2

“You won’t lose your job to AI, but you may lose your job to someone who knows how to use AI”

Agentic AI
• More autonomous decisions/actions
• E.g., OpenAI’s Operator

Artificial general intelligence (AGI)
• Anything a human can do

Ng, A. (2021). What is AI? [Video lecture]. AI for Everyone. Coursera. https://www.coursera.org/learn/ai-for-everyone

What is machine learning?

“The science of getting computers to learn without being explicitly programmed”

Supervised learning
• Maps input to output: X → Y labels
• Classification, regression, neural networks

Examples:
– X-ray image → tumor/no tumor
– Words → next word (LLM)
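
To make the X → Y mapping concrete, here is a minimal sketch of supervised classification using a 1-nearest-neighbor rule. The two numeric features and the six labeled examples are made up for illustration (they are not real imaging data), and the names `TRAIN` and `predict` are hypothetical.

```python
from math import dist

# Hypothetical labeled training data: (feature vector X, label Y).
# The two features are invented measurements, not real X-ray data.
TRAIN = [
    ((1.0, 1.2), "no tumor"),
    ((0.8, 0.9), "no tumor"),
    ((1.1, 0.7), "no tumor"),
    ((3.0, 3.2), "tumor"),
    ((3.4, 2.9), "tumor"),
    ((2.8, 3.5), "tumor"),
]


def predict(x, train=TRAIN):
    """Return the label of the training example closest to x (1-nearest neighbor)."""
    _, label = min(train, key=lambda pair: dist(pair[0], x))
    return label


if __name__ == "__main__":
    print(predict((0.9, 1.0)))  # a point near the "no tumor" examples
    print(predict((3.1, 3.0)))  # a point near the "tumor" examples
```

The labels are what make this supervised: the model can only map a new X to a Y because every training example came with its Y attached.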

Unsupervised learning
• No labels
• Clustering, anomaly detection, data reduction

Examples:
– Customer segmentation
– Fraud detection
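
As a rough illustration of clustering without labels, the sketch below runs plain k-means on a handful of invented customer records (annual spend, visits per month). The data, the function name `kmeans`, and the choice to start from the first k points are all assumptions made to keep the example short and deterministic.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Plain k-means: returns (centroids, cluster index assigned to each point)."""
    centroids = list(points[:k])  # simple deterministic start: the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Hypothetical customers: (annual spend in $1,000s, visits per month).
CUSTOMERS = [(1.0, 2.0), (9.0, 8.5), (1.5, 1.8), (8.7, 9.1), (1.2, 2.4), (9.3, 8.8)]

if __name__ == "__main__":
    centroids, segments = kmeans(CUSTOMERS, k=2)
    print(segments)
```

No outcome column is supplied; the two segments emerge only from how the points group together.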

Reinforcement learning
• Rewards and punishments guide learning

Examples:
– Self-driving cars
– Robotics
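
The reward-and-punishment idea can be sketched with tabular Q-learning on a toy problem far simpler than driving or robotics: an agent in a five-cell corridor earns +1 for reaching the right end and a small penalty for every other step. The environment, the parameter values, and the function names are hypothetical choices for illustration.

```python
import random

N_STATES = 5        # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)  # step left or step right


def step(state, action):
    """Move along the corridor; reward +1 at the goal, small penalty otherwise."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else -0.01
    return next_state, reward


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            next_state, reward = step(state, action)
            done = next_state == N_STATES - 1
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            # Nudge the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q


def greedy_policy(q):
    """Best-known action for each non-terminal cell."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

Nothing tells the agent that "right" is correct; it learns this only from the rewards it collects while exploring.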

Deep learning

  • Subset of machine learning using multi-layered neural networks to model complex patterns

  • Commonly used types of NN:

    • Recurrent neural networks (RNNs) (natural language processing, speech recognition)

    • Generative adversarial networks (GANs) (image generation, style transfer)

    • Convolutional neural networks (CNNs) (image processing, object detection)

Convolutional neural network diagram created by Claude AI (Anthropic, 2025). https://claude.ai.
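
The core operation inside a CNN layer can be sketched in a few lines: slide a small filter over an image and record how strongly each patch matches it. The 4x4 image and the hand-picked vertical-edge filter below are invented for illustration; in a trained CNN the filter values are learned from data rather than written by hand.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


# Hypothetical 4x4 grayscale image: dark left half, bright right half.
IMAGE = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked vertical-edge filter (an assumption; real filters are learned).
KERNEL = [
    [-1, 1],
    [-1, 1],
]

if __name__ == "__main__":
    for row in relu(conv2d(IMAGE, KERNEL)):
        print(row)  # each row is [0, 2, 0]: the filter fires only at the edge
```

Stacking many such filters, layer after layer, is what lets a CNN build from edges up to more complex patterns.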

Data science vs. statistics

Scope

Data science:
  • Broader, multi-disciplinary
  • Integrates computer science, data analysis, ML, data visualization, and domain knowledge
Statistics:
  • Narrower; focused on theory and methods for collecting, analyzing, interpreting, and presenting data
  • Emphasizes understanding relationships and uncertainty in data

Core principles

Data science:
  • Algorithm-based
  • Prediction-focused
Statistics:
  • Inference, hypothesis testing, estimation
  • Rooted in probability and mathematical rigor

Problem-solving approach

Data science:
  • Exploratory, iterative
  • Optimizing predictions
Statistics:
  • Usually hypothesis-driven
  • Emphasis on testing and validating assumptions

(1/2)

Data science vs. statistics continued

Tools and techniques

Data science:
  • More programming (e.g., Python, SQL)
  • ML, deep learning, big data tools (e.g., Spark)
Statistics:
  • Specialized statistical software (e.g., R, SAS, Stata, Mplus)
  • Specialized methods (e.g., mixed modeling)

Data

Data science:
  • Large-scale
  • Unstructured, semi-structured, and structured
Statistics:
  • Primarily structured
  • Smaller samples, well-defined study designs

Applications

Data science:
  • Tech industry, business intelligence
  • Fraud detection, NLP, computer vision
Statistics:
  • Clinical trials
  • Survey analysis, data collection, risk modeling

(2/2)

Statistics vs. machine learning

Inference vs. prediction

  • Inference involves drawing conclusions about a population based on a sample of data; it is concerned with understanding relationships, estimating parameters, and testing hypotheses

    • Focuses on why something happens
    • E.g., identifying which factors significantly impact disease risk
  • Prediction refers to using a model to forecast future outcomes based on current or historical data

    • Focuses on what will happen
    • E.g., predicting whether a patient will develop a disease given a set of health metrics
  • Traditional statistical models are typically designed for inference and emphasize interpretability, but they can be used for both inference and prediction
  • Machine learning models optimize prediction and offer limited inference
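
One fitted model can serve both purposes, which the following sketch tries to show with simple linear regression. The slope and its standard error speak to inference (how strongly is the predictor related to the outcome, and how uncertain is that estimate?), while plugging in a new value speaks to prediction. The age and risk-score numbers are invented, and `fit_line` is a hypothetical helper.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, standard error of the slope).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    slope_se = sqrt(residual_ss / (n - 2) / sxx)
    return intercept, slope, slope_se


# Hypothetical data: patient age and a made-up disease risk score.
AGE = [30, 40, 50, 60, 70]
RISK = [2.1, 2.9, 4.2, 4.8, 6.0]

if __name__ == "__main__":
    intercept, slope, slope_se = fit_line(AGE, RISK)
    # Inference: estimated effect of age and its uncertainty.
    print(f"slope = {slope:.3f}, standard error = {slope_se:.4f}")
    # Prediction: expected risk score for a new 55-year-old patient.
    print(f"predicted risk at age 55 = {intercept + slope * 55:.2f}")
```

A slope that is large relative to its standard error is evidence of a real association; the prediction line uses the same coefficients but makes no claim about why risk changes with age.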

Statistical models vs. machine learning models

  • Statistical models are probabilistic and explicitly consider uncertainty
  • Statistical models start with preconceived structure and typically assume additivity of predictor effects when specifying the model
  • Machine learning models are more empirical and allow for higher-order interactions that are not pre-specified; they do not attempt to isolate the effect of any single variable

Harrell, Frank. (2018, April 30). Road map for choosing between statistical modeling and machine learning. Statistical Thinking. https://www.fharrell.com/post/stat-ml/.
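
The additivity point can be illustrated with a deliberately extreme toy case. In the invented data below the outcome depends only on an interaction (y = 1 when exactly one of two binary predictors is 1). An additive model with no pre-specified interaction term cannot capture this, while a fully empirical model can. The per-cell averaging used here is only a stand-in for a flexible ML method such as a decision tree; both function names are hypothetical.

```python
from statistics import mean

# Hypothetical data with a pure interaction: y = 1 only when exactly one of x1, x2 is 1.
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)] * 5


def fit_additive(data):
    """Additive model y ~ b0 + b1*x1 + b2*x2, with no interaction term.

    Fitting each slope separately is valid here only because x1 and x2 are
    uncorrelated in this balanced design.
    """
    ys = [y for _, y in data]
    y_bar = mean(ys)
    slopes, x_bars = [], []
    for j in range(2):
        xs = [x[j] for x, _ in data]
        x_bar = mean(xs)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
        sxx = sum((xi - x_bar) ** 2 for xi in xs)
        slopes.append(sxy / sxx)
        x_bars.append(x_bar)
    intercept = y_bar - sum(b * xb for b, xb in zip(slopes, x_bars))
    return lambda x: intercept + sum(b * xi for b, xi in zip(slopes, x))


def fit_cell_means(data):
    """Empirical model: average outcome for each observed (x1, x2) combination."""
    cells = {}
    for x, y in data:
        cells.setdefault(x, []).append(y)
    table = {x: mean(ys) for x, ys in cells.items()}
    return lambda x: table[x]


if __name__ == "__main__":
    additive = fit_additive(DATA)
    empirical = fit_cell_means(DATA)
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, additive(x), empirical(x))
```

The additive fit predicts 0.5 everywhere because neither predictor has an effect on its own. The empirical fit recovers the pattern, but it offers no single coefficient describing what x1 or x2 does.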

When to use SM vs. ML

  • ML may work best when overall prediction is the goal and there is no need to describe the impact of any one predictor
  • ML typically requires large sample sizes (on the order of 200 events per candidate predictor)
  • ML might work better for high signal-to-noise situations (e.g., visual and sound recognition)
  • ML tends to work well when learning a simple concept
  • ML tends to work poorly when:

    • Learning complex concepts from small amounts of data
    • Applied to new types of data (data unlike what the model was trained on)

Harrell, Frank. (2018, April 30). Road map for choosing between statistical modeling and machine learning. Statistical Thinking. https://www.fharrell.com/post/stat-ml/.
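
The figure of 200 events per candidate predictor translates into a quick back-of-the-envelope check. The sketch below assumes a binary outcome and counts "events" as the less frequent of the two outcome classes, a common convention that is stated here as an assumption; the function names and the example numbers are hypothetical.

```python
def limiting_events(n_cases, n_non_cases):
    """Assumed convention: the effective number of events is the rarer outcome class."""
    return min(n_cases, n_non_cases)


def max_candidate_predictors(n_events, events_per_predictor=200):
    """How many candidate predictors the sample supports under the rule of thumb."""
    return n_events // events_per_predictor


if __name__ == "__main__":
    # Hypothetical study: 10,000 patients, 800 of whom develop the disease.
    events = limiting_events(n_cases=800, n_non_cases=9200)
    print(max_candidate_predictors(events))  # 4
```

Under this rule, a study that sounds large (10,000 patients) supports only a handful of candidate predictors for an ML approach, because what counts is the number of events, not the number of rows.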

Additional AI/ML considerations

AI/ML roles

  • Data scientist
  • ML engineer (also new: AI engineer)
  • Data engineer
    • Organizes and manages data, data access, and security in a cost-effective way
    • Needed for projects with large data (terabyte scale or larger)
  • AI product manager
  • Training resources:
    • Emphasis on concepts and algorithms, not just specific AI implementations

One second rule

What is it?
• If a human can complete a task in about one second of thought, AI may be able to automate it

Why it works
• Pattern recognition: AI learns from data like human intuition
• Low complexity: quick decisions are ideal for AI
• Automation potential: repetitive, structured tasks fit well

Examples
• Object recognition in images
• Text prediction
• Document classification

When the rule breaks down
• Deep reasoning required
• Ambiguity and subjectivity

Considerations for LLMs

  • Retrieval-augmented generation (RAG)

    • Combines an existing LLM with a custom knowledge base to provide up-to-date, domain-specific answers
  • Fine-tuning pre-trained models

    • Requires less data and compute power than training from scratch
    • Uses existing models that are fine-tuned on domain-specific data
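
The retrieval half of RAG can be sketched without any LLM at all: find the knowledge-base passage most similar to the question, then paste it into the prompt. This sketch uses word-count cosine similarity as a stand-in for the embedding search a production system would typically use, and it stops at building the prompt rather than calling a model. The knowledge base, the prompt wording, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; a real one would hold an organization's own documents.
KNOWLEDGE_BASE = [
    "Clinic hours are 8 am to 5 pm on weekdays.",
    "Flu vaccines are available to all staff in October.",
    "Data requests must be approved by the privacy officer.",
]


def tokenize(text):
    """Lowercased word counts; a crude substitute for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, documents):
    """Assemble the augmented prompt that would be sent to an LLM."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("When are flu vaccines available?", KNOWLEDGE_BASE))
```

Because the knowledge base sits outside the model, updating an answer means editing a document rather than retraining, which is the main practical contrast with fine-tuning.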