What is tidymodels?

Max Kuhn

tidymodels: Our job is to make modeling data with R ~~suck less~~ better.

It’s actually pretty good

“Modeling” includes everything from classical statistical methods to machine learning.

How do we make it better?

  • Consistent and unsurprising APIs and outputs
  • Tidyverse-like syntax
  • Upgrade data analysis tools available to users

(sort of like systems engineers for modeling)

Who is tidymodels?

A basic example

Two laboratory tests are used to predict whether someone has a specific infectious disease.

(these are real data)


library(tidymodels)
lab_data %>% 
  ggplot(
    aes(lab_test_1, lab_test_2, 
        col = disease, pch = disease)) + 
  geom_point(cex = 2, alpha = 1 / 2) + 
  coord_equal()

Fit a classification tree

cart_fit <- 
  decision_tree() %>% 
  set_mode("classification") %>% 
  fit(disease ~ ., data = lab_data)
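Once fit, every parsnip model uses the same prediction interface, which is part of the "consistent and unsurprising APIs" point above. A minimal sketch, continuing from the `cart_fit` and `lab_data` objects in the chunk above:

```r
library(tidymodels)

# Hard class predictions: a tibble with a .pred_class factor column
predict(cart_fit, new_data = lab_data)

# Class probabilities: one .pred_<level> column per outcome level
predict(cart_fit, new_data = lab_data, type = "prob")
```

The return value is always a tibble with one row per row of `new_data`, so predictions can be column-bound back onto the original data safely.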

Maybe try a neural network

nnet_fit <- 
  mlp(hidden_units = 5, penalty = 0.01) %>% 
  set_mode("classification") %>% 
  fit(disease ~ ., data = lab_data)
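Note that only the model specification changed; the fit and prediction calls are identical. As a sketch of that consistency, `augment()` attaches predictions to the data and yardstick metrics share one interface (assuming the `nnet_fit` from the chunk above):

```r
library(tidymodels)

# augment() binds .pred_class and .pred_* probability columns onto lab_data
nnet_preds <- augment(nnet_fit, new_data = lab_data)

# yardstick metrics all follow the same truth/estimate pattern
nnet_preds %>% accuracy(truth = disease, estimate = .pred_class)
```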

We have a lot of tools to optimize models

set.seed(1)
nnet_tune <- 
  mlp(
    hidden_units = tune(), 
    penalty = 0.01
  ) %>% 
  set_mode("classification") %>% 
  tune_grid(
    disease ~ ., 
    resamples = vfold_cv(lab_data), 
    grid = 10
  )
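After tuning, helper functions summarize the resampled results and splice the winning parameters back into the specification. A minimal sketch, assuming the `nnet_tune` object above and the area under the ROC curve as the metric:

```r
library(tidymodels)

# Rank the candidate hidden_units values by resampled performance
show_best(nnet_tune, metric = "roc_auc")

# Replace tune() with the best value and fit on the full data
final_fit <- 
  mlp(hidden_units = tune(), penalty = 0.01) %>% 
  set_mode("classification") %>% 
  finalize_model(select_best(nnet_tune, metric = "roc_auc")) %>% 
  fit(disease ~ ., data = lab_data)
```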

Strong points

  • Many important guard rails
  • Very good at feature engineering (RECIPES!!!)
  • Parallel processing and server-based computations
  • Leverage other frameworks: tensorflow, torch, h2o, spark (some)
  • Great documentation
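On the feature engineering point: a recipe declares preprocessing steps that are estimated on the training data and then applied identically to new data, which is one of the guard rails against information leakage. A minimal sketch (the normalization and PCA steps are illustrative choices, not from the talk):

```r
library(tidymodels)

# Preprocessing as a sequence of steps, estimated from lab_data
lab_rec <- 
  recipe(disease ~ ., data = lab_data) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors(), num_comp = 2)

# A workflow bundles the recipe with a model so both travel together
wf_fit <- 
  workflow() %>% 
  add_recipe(lab_rec) %>% 
  add_model(mlp(hidden_units = 5, penalty = 0.01) %>% set_mode("classification")) %>% 
  fit(data = lab_data)
```

Because the recipe lives inside the workflow, `predict(wf_fit, new_data)` applies the preprocessing automatically before the model sees the data.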

Latest feature: survival analysis

Censored data:

  • “My food was delivered 27m after I ordered it” (complete)
  • “I ordered my food 12m ago but it’s not here yet” (censored)

This requires specialized tools to fit and evaluate the quality of models.
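In tidymodels, these tools come from the censored extension package, which adds survival engines behind the same specification interface. A sketch under stated assumptions: `delivery_data` is a hypothetical data set (not from the talk) with a `time` column and a `delivered` event indicator (1 = complete, 0 = censored):

```r
library(tidymodels)
library(censored)  # parsnip extension providing survival model engines

# Same spec-then-fit pattern; the outcome is a survival::Surv() object
surv_fit <- 
  survival_reg() %>% 
  set_engine("survival") %>% 
  fit(Surv(time, delivered) ~ ., data = delivery_data)

# Predicted probability of still waiting at 15 and 30 minutes
predict(surv_fit, new_data = delivery_data, 
        type = "survival", eval_time = c(15, 30))
```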

Example: oblique random forest predictions for customer churn.