What is tidymodels?

Max Kuhn

tidymodels: Our job is to make modeling data with R ~~suck less~~ better.

It’s actually pretty good

“Modeling” includes everything from classical statistical methods to machine learning.

How do we make it better?

  • Consistent and unsurprising APIs and outputs
  • Tidyverse-like syntax
  • Upgrade data analysis tools available to users

(sort of like systems engineers for modeling)

Who is tidymodels?

A basic example

Two laboratory tests are used to predict whether someone has a specific infectious disease.

(these are real data)


library(tidymodels)
lab_data %>% 
  ggplot(
    aes(lab_test_1, lab_test_2, 
        col = disease, pch = disease)) + 
  geom_point(cex = 2, alpha = 1 / 2) + 
  coord_equal()

Fit a classification tree

cart_fit <- 
  decision_tree() %>% 
  set_mode("classification") %>% 
  fit(disease ~ ., data = lab_data)
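Once fit, every parsnip model uses the same prediction interface, which is part of the "consistent and unsurprising APIs" point above. A minimal sketch, continuing from the `cart_fit` and `lab_data` objects in the chunk above:

```r
library(tidymodels)

# Hard class predictions: a tibble with a .pred_class factor column
predict(cart_fit, new_data = lab_data)

# Class probabilities: one .pred_<level> column per outcome level
predict(cart_fit, new_data = lab_data, type = "prob")
```

The return value is always a tibble with one row per row of `new_data`, so predictions can be column-bound back onto the original data safely.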

Maybe try a neural network

nnet_fit <- 
  mlp(hidden_units = 5, penalty = 0.01) %>% 
  set_mode("classification") %>% 
  fit(disease ~ ., data = lab_data)
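Note that only the model specification changed; the fit and prediction calls are identical. As a sketch of that consistency, `augment()` attaches predictions to the data and yardstick metrics share one interface (assuming the `nnet_fit` from the chunk above):

```r
library(tidymodels)

# augment() binds .pred_class and .pred_* probability columns onto lab_data
nnet_preds <- augment(nnet_fit, new_data = lab_data)

# yardstick metrics all follow the same truth/estimate pattern
nnet_preds %>% accuracy(truth = disease, estimate = .pred_class)
```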

We have a lot of tools to optimize models

set.seed(1)
nnet_tune <- 
  mlp(
    hidden_units = tune(), 
    penalty = 0.01
  ) %>% 
  set_mode("classification") %>% 
  tune_grid(
    disease ~ ., 
    resamples = vfold_cv(lab_data), 
    grid = 10
  )
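After tuning, helper functions summarize the resampled results and splice the winning parameters back into the specification. A minimal sketch, assuming the `nnet_tune` object above and the area under the ROC curve as the metric:

```r
library(tidymodels)

# Rank the candidate hidden_units values by resampled performance
show_best(nnet_tune, metric = "roc_auc")

# Replace tune() with the best value and fit on the full data
final_fit <- 
  mlp(hidden_units = tune(), penalty = 0.01) %>% 
  set_mode("classification") %>% 
  finalize_model(select_best(nnet_tune, metric = "roc_auc")) %>% 
  fit(disease ~ ., data = lab_data)
```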

Strong points

  • Many important guard rails
  • Very good at feature engineering (RECIPES!!!)
  • Parallel processing and server-based computations
  • Leverage other frameworks: tensorflow, torch, h2o, spark (some)
  • Great documentation
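On the feature engineering point: a recipe declares preprocessing steps that are estimated on the training data and then applied identically to new data, which is one of the guard rails against information leakage. A minimal sketch (the normalization and PCA steps are illustrative choices, not from the talk):

```r
library(tidymodels)

# Preprocessing as a sequence of steps, estimated from lab_data
lab_rec <- 
  recipe(disease ~ ., data = lab_data) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors(), num_comp = 2)

# A workflow bundles the recipe with a model so both travel together
wf_fit <- 
  workflow() %>% 
  add_recipe(lab_rec) %>% 
  add_model(mlp(hidden_units = 5, penalty = 0.01) %>% set_mode("classification")) %>% 
  fit(data = lab_data)
```

Because the recipe lives inside the workflow, `predict(wf_fit, new_data)` applies the preprocessing automatically before the model sees the data.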

Latest feature: survival analysis

Censored data:

  • “My food was delivered 27m after I ordered it” (complete)
  • “I ordered my food 12m ago but it’s not here yet” (censored)

This requires specialized tools to fit and evaluate the quality of models.
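In tidymodels, these tools come from the censored extension package, which adds survival engines behind the same specification interface. A sketch under stated assumptions: `delivery_data` is a hypothetical data set (not from the talk) with a `time` column and a `delivered` event indicator (1 = complete, 0 = censored):

```r
library(tidymodels)
library(censored)  # parsnip extension providing survival model engines

# Same spec-then-fit pattern; the outcome is a survival::Surv() object
surv_fit <- 
  survival_reg() %>% 
  set_engine("survival") %>% 
  fit(Surv(time, delivered) ~ ., data = delivery_data)

# Predicted probability of still waiting at 15 and 30 minutes
predict(surv_fit, new_data = delivery_data, 
        type = "survival", eval_time = c(15, 30))
```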

Example: oblique random forest predictions for customer churn.