R has always had a rich set of modeling tools that it inherited from S. For example, the formula interface has made it simple to specify potentially complex model structures.
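As a small illustration of that flexibility, a single formula can express interactions and in-line transformations (this sketch uses the built-in mtcars data; the particular model is only for show):

```r
# Interactions (hp * wt expands to hp + wt + hp:wt) and
# transformations (log(disp)) specified directly in the formula:
fit <- lm(mpg ~ hp * wt + log(disp), data = mtcars)
summary(fit)
```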
R has cutting-edge models. Many researchers in various domains use R as their primary computing environment and their work often results in R packages.
It is easy to port or link to other applications. R doesn’t try to be everything to everyone.
Modeling in R
However, there is a huge consistency problem. For example:
There are two primary methods for specifying which terms are in a model (the formula and x/y interfaces). Not all models have both.
99% of model functions automatically generate dummy variables.
Many package developers don’t know much about the language and omit OOP and other core R components.
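A small illustration of the automatic dummy-variable generation mentioned above, using base R's formula machinery:

```r
# The formula method expands a factor into dummy (indicator) columns
# automatically; the reference level ("cat") is absorbed into the intercept:
species <- data.frame(species = factor(c("cat", "dog", "fish")))
model.matrix(~ species, data = species)
```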
Two examples follow…
Between-Package Inconsistency
The syntax for computing predicted class probabilities:
MASS package: predict(lda_fit)
stats package: predict(glm_fit, type = "response")
mda package: type = "posterior"
rpart package: type = "prob"
RWeka package: type = "probability"
and so on.
Model Interfaces
Which of these packages has both a formula and non-formula (x/y) interface to the model?
glmnet (matrix only)
ranger (both but weirdly)
rpart (formula only)
survival (formula only)
xgboost (special sparse matrix only, classes are zero-based integers)
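Concretely, fitting a model to the same data requires different call shapes depending on the package (a sketch using mtcars; the objects are illustrative):

```r
library(glmnet)
library(rpart)

# glmnet: x/y matrix interface only
fit_glmnet <- glmnet(x = as.matrix(mtcars[, -1]), y = mtcars$mpg)

# rpart: formula interface only
fit_rpart <- rpart(mpg ~ ., data = mtcars)
```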
Is there such a thing as a systems statistician?
tidymodels: Our job is to make modeling data with R ~~less frustrating~~ better.
It’s actually pretty good
“Modeling” includes everything from classical statistical methods to machine learning.
The Tidyverse
All tidyverse packages share an underlying design philosophy, grammar, and data structures.
The principles of the tidyverse:
Reuse existing data structures.
Compose simple functions with the pipe.
Embrace functional programming.
Design for humans.
This results in more specific conventions around interfaces, function naming, etc.
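The pipe principle in a small sketch (using dplyr; the summary itself is only for illustration):

```r
library(dplyr)

# Each step is a simple function; the pipe composes them left to right:
mtcars %>%
  filter(cyl == 4) %>%
  summarize(mean_mpg = mean(mpg))
```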
The Tidyverse
For example, we try to use common prefixes for auto-complete: tune_grid(), tune_bayes(), …
We’ll split the data into training (60%), validation (20%), and testing (20%).
Stratification helps ensure the three outcome distributions are about the same.
```r
set.seed(91)
delivery_split <- initial_validation_split(
  deliveries,
  prop = c(0.6, 0.2),
  strata = time_to_delivery
)
delivery_split
#> <Training/Validation/Testing/Total>
#> <6004/2004/2004/10012>

delivery_train <- training(delivery_split)
delivery_val   <- validation(delivery_split)

# To treat it as a single resample:
delivery_rs <- validation_set(delivery_split)
```
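The workflow code below references a model (linear_mod) and a recipe (delivery_rec) without defining them; a minimal sketch of plausible definitions (the penalty value and the specific preprocessing steps are assumptions, not taken from the original):

```r
library(tidymodels)

# A linear regression fit with the glmnet engine; the penalty is illustrative
linear_mod <- linear_reg(penalty = 0.1) %>%
  set_engine("glmnet")

# A basic preprocessing recipe for the delivery data
delivery_rec <- recipe(time_to_delivery ~ ., data = delivery_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())
```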
We can optionally bundle the recipe and model together into a workflow:
```r
glmnet_wflow <- workflow() %>%
  add_model(linear_mod) %>%
  add_recipe(delivery_rec)  # or add_formula() or add_variables()
```
Fitting and prediction are very easy:
```r
glmnet_fit <- fit(glmnet_wflow, data = delivery_train)

# Very easy to use compared to glmnet::predict():
predict(glmnet_fit, delivery_val %>% slice(1:5))
#> # A tibble: 5 × 1
#>   .pred
#>   <dbl>
#> 1  23.4
#> 2  15.3
#> 3  31.5
#> 4  20.8
#> 5  28.1
```
A Better Interface
fit_resamples() uses the out-of-sample data to estimate performance:
```r
ctrl <- control_resamples(save_pred = TRUE)

glmnet_res <- glmnet_wflow %>%
  # We can use our validation set!
  fit_resamples(resamples = delivery_rs, control = ctrl)

collect_metrics(glmnet_res)
#> # A tibble: 2 × 6
#>   .metric .estimator  mean     n std_err .config
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>
#> 1 rmse    standard   2.36      1      NA Preprocessor1_Model1
#> 2 rsq     standard   0.885     1      NA Preprocessor1_Model1
```
Plot the Data!
The only way to be comfortable with your data is to never look at them.
Recently released:
conformal inference tools for prediction intervals
In-process:
model fairness metrics and modeling techniques
causal inference methods
a general set of post-processing tools
Thanks
Thanks for the invitation to speak today and sharing your Mate!
The tidymodels team: Hannah Frick, Emil Hvitfeldt, and Simon Couch.
Special thanks to the other folks who contributed so much to tidymodels: Davis Vaughan, Julia Silge, Edgar Ruiz, Alison Hill, Desirée De Leon, our previous interns, and the tidyverse team.