tidymodels for Production

Max Kuhn and Emil Hvitfeldt

Posit PBC

Slides + sources on github:
topepo/tidymodels-for-production

Production Aspects for ML

  • The Data
  • The Model
  • Infrastructure
  • Post-Deployment

Example data

Data on daily ridership on the Chicago “L” trains:

library(lubridate)

chi_data <- 
  Chicago |> 
  mutate(day = wday(date, label = TRUE)) |>
  select(ridership, date, day, Clark_Lake:Irving_Park)

chicago_original <- chi_data |> filter(date <= as.Date("2014-01-01"))
# 4,728 rows

chicago_updated  <- chi_data |> filter(date <= as.Date("2015-01-01")) 
# 5,093 rows

chicago_last     <- chi_data |> filter(date  > as.Date("2015-01-01")) 
#   605 rows

Two example models

Here are two models (a single decision tree and a boosted decision tree) that we’ll use:


library(tidymodels)

# A CART classification tree
cart_wflow <- workflow(ridership ~ ., decision_tree(mode = "regression"))
cart_fit <- fit(cart_wflow, chicago_original)

# An xgboost collection of trees
xgb_wflow <- workflow(ridership ~ ., boost_tree(mode = "regression"))
xgb_fit <- fit(xgb_wflow, chicago_original)

Ingesting Data

We often get our data in some text-based format (e.g., CSV, JSON, etc.) that may lack context compared to the original training/testing data.

  • How to map categories to indicators.
  • The timezone for dates (compared to the original data).

How does R (and tidymodels) deal with this?
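
As a minimal base-R sketch of the first bullet: a character column arriving from JSON must be re-encoded with the same ordered levels used at training time. The level set below matches the `day` column produced by wday(..., label = TRUE) in an English locale:

```r
# New data arrives (e.g., from JSON) with 'day' as a plain character column:
incoming <- data.frame(day = c("Mon", "Sat"))

# Re-encode it as the same ordered factor used in the training data:
day_levels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
incoming$day <- factor(incoming$day, levels = day_levels, ordered = TRUE)

is.ordered(incoming$day)  # TRUE
levels(incoming$day)      # the same seven levels as the training data
```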

Data Consistency

tidymodels uses a “zero-row slice” of the data to record important aspects of the data:

ptype <- chicago_original[0,]
str(ptype)
#> tibble [0 × 7] (S3: tbl_df/tbl/data.frame)
#>  $ ridership       : num(0) 
#>  $ date            : 'Date' num(0) 
#>  $ day             : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 
#>  $ Clark_Lake      : num(0) 
#>  $ Clinton         : num(0) 
#>  $ Merchandise_Mart: num(0) 
#>  $ Irving_Park     : num(0)


It (and vetiver) will ensure that newly ingested data for prediction are encoded the same way and alert the user otherwise.
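
A hand-rolled version of that check might look like the following sketch (tidymodels and vetiver do a more thorough comparison automatically; the helper and toy columns here are hypothetical):

```r
# Compare the classes of incoming columns against a stored zero-row prototype:
check_ptype <- function(new_data, ptype) {
  stopifnot(all(names(ptype) %in% names(new_data)))
  identical(
    lapply(new_data[names(ptype)], class),
    lapply(ptype, class)
  )
}

# Toy example with a made-up prototype:
ptype <- data.frame(ridership = numeric(0), station = character(0))
good  <- data.frame(ridership = 14.7, station = "Clark_Lake")
bad   <- data.frame(ridership = "14.7", station = "Clark_Lake")  # wrong type

check_ptype(good, ptype)  # TRUE
check_ptype(bad, ptype)   # FALSE
```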

Preparing and saving the model object

There are two operations you might want to use on your model object:

  • butcher: remove ancillary data that are not used for prediction.
  • bundle: ensure that all of the model data are captured for production.

Both of these are automatically used by vetiver.

Saved Model Size


# How much space does the model take?
cart_fit |> object.size() |> print(units = "Mb")
#> 0.9 Mb


Also, butcher::weigh(cart_fit) will tell you how much space each element of the model will consume.

What takes up the most space?

butcher::weigh(cart_fit)
#> # A tibble: 89 × 2
#>    object                                 size
#>    <chr>                                 <dbl>
#>  1 fit.fit.fit.y                        0.342 
#>  2 fit.fit.fit.where                    0.323 
#>  3 pre.mold.predictors.date             0.0379
#>  4 pre.mold.predictors.Clark_Lake       0.0379
#>  5 pre.mold.predictors.Clinton          0.0379
#>  6 pre.mold.predictors.Merchandise_Mart 0.0379
#>  7 pre.mold.predictors.Irving_Park      0.0379
#>  8 pre.mold.outcomes.ridership          0.0379
#>  9 pre.mold.predictors.day              0.0199
#> 10 fit.fit.fit.terms                    0.008 
#> # ℹ 79 more rows

Trimming the model

The butcher package can be used to remove anything that isn’t needed for prediction.


library(butcher)
# Before: 
cart_fit |> object.size() |> print(units = "Mb")
#> 0.9 Mb

# After
cart_fit |> butcher() |> object.size() |> print(units = "Mb")
#> 0.4 Mb

Getting Everything that Defines the Model

A few model types don’t follow traditional R behavior: the fitted model object alone doesn’t contain everything needed to make predictions.

  • Examples: xgboost, lightgbm, catboost, bart, tensorflow, and others


To productionize these, we need to keep all of the model data in a single object.


That’s what the bundle package does.

Boosting Example

Consider our previously fit xgboost model. It retains the model information in a pointer to external memory:


xgb_fit |> extract_fit_engine() |> pluck("ptr")
#> <pointer: 0x107ef14b0>


Using the usual save() command won’t capture this extra data, so the model won’t work outside of this R session.

Bundling

The interface is very simple:


library(bundle)

xgb_bund_fit <- bundle(xgb_fit)
xgb_bund_fit
#> bundled workflow object.

xgb_fit_remade <- unbundle(xgb_bund_fit)
xgb_fit_remade |> extract_fit_engine() |> pluck("ptr")
#> <pointer: 0x107ed5bc0>

Note that the pointer references a different location in memory than the original object.
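
Because a bundle is an ordinary, self-contained R object, standard serialization now works; a sketch (the file name is arbitrary):

```r
# Persist the bundled model with base R serialization:
saveRDS(xgb_bund_fit, "xgb_bundle.rds")

# Later, in a different R session or on another machine:
xgb_restored <- unbundle(readRDS("xgb_bundle.rds"))
predict(xgb_restored, chicago_updated)
```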

Model Deployment

Deploying with SQL

Some model types (linear models, trees) can efficiently be turned into SQL expressions. This allows us to take a fitted model and have it predict directly in a database.


Similarly, this can be done with feature engineering and postprocessing steps too.


The orbital package interfaces directly with tidymodels workflows.

Using orbital

Apply orbital() to a fitted workflow.

(tree and tree_depth are set low for easy viewing on slides)


rec_spec <- recipe(ridership ~ ., data = chicago_original) |>
  step_rm(date, day) |>
  step_normalize(all_numeric_predictors())

stump_spec <- boost_tree(mode = "regression", trees = 5, tree_depth = 2)

xgb_rec_wflow <- workflow(rec_spec, stump_spec)
xgb_rec_fit <- fit(xgb_rec_wflow, chicago_original)

library(orbital)
xgb_orb <- orbital(xgb_rec_fit)

Orbital objects


The orbital object contains only the calculations needed to produce predictions.


xgb_orb
#> 
#> ── orbital Object ──────────────────────────────────────────────────────────────
#> • Clark_Lake = (Clark_Lake - 13.13903) / 6.401423
#> • Clinton = (Clinton - 2.295728) / 0.9426555
#> • Merchandise_Mart = (Merchandise_Mart - 4.441421) / 2.362541
#> • Irving_Park = (Irving_Park - 3.322339) / 1.115012
#> • .pred = dplyr::case_when(Clinton < -1.247251 & Clark_Lake < -0.7446833 ...
#> ────────────────────────────────────────────────────────────────────────────────
#> 5 equations in total.

Predicting with orbital objects

predict(xgb_orb, chicago_original)
#> # A tibble: 4,728 × 1
#>    .pred
#>    <dbl>
#>  1 14.7 
#>  2 14.7 
#>  3 14.7 
#>  4 14.7 
#>  5 14.7 
#>  6  5.50
#>  7  5.50
#>  8 12.3 
#>  9 14.7 
#> 10 14.7 
#> # ℹ 4,718 more rows
predict(xgb_rec_fit, chicago_original)
#> # A tibble: 4,728 × 1
#>    .pred
#>    <dbl>
#>  1 14.7 
#>  2 14.7 
#>  3 14.7 
#>  4 14.7 
#>  5 14.7 
#>  6  5.50
#>  7  5.50
#>  8 12.3 
#>  9 14.7 
#> 10 14.7 
#> # ℹ 4,718 more rows

Predicting in databases

If you have a connection to a database with the data in the correct format, then you can predict on it directly.

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = ":memory:")
chicago_original_sqlite <- copy_to(con, chicago_original)

predict(xgb_orb, chicago_original_sqlite)
#> # Source:   SQL [?? x 1]
#> # Database: sqlite 3.51.0 []
#>    .pred
#>    <dbl>
#>  1 14.7 
#>  2 14.7 
#>  3 14.7 
#>  4 14.7 
#>  5 14.7 
#>  6  5.50
#>  7  5.50
#>  8 12.3 
#>  9 14.7 
#> 10 14.7 
#> # ℹ more rows

Generating SQL

Using show_query() or orbital_sql() will generate the SQL.

show_query(xgb_orb, con)
#> (`Clark_Lake` - 13.1390333121827) / 6.40142339936784 AS Clark_Lake
#> (`Clinton` - 2.29572810913705) / 0.942655487237437 AS Clinton
#> (`Merchandise_Mart` - 4.44142100253806) / 2.36254062814617 AS Merchandise_Mart
#> (`Irving_Park` - 3.32233851522842) / 1.11501202758246 AS Irving_Park
#> ((((CASE
#> WHEN (`Clinton` < -1.24725115 AND `Clark_Lake` < -0.744683325) THEN (-2.69900179)
#> WHEN ((`Clinton` >= -1.24725115 OR (`Clinton` IS NULL)) AND `Clark_Lake` < -0.744683325) THEN (-1.81820333)
#> WHEN (`Clark_Lake` < 0.713742316 AND (`Clark_Lake` >= -0.744683325 OR (`Clark_Lake` IS NULL))) THEN 0.662370682
#> WHEN ((`Clark_Lake` >= 0.713742316 OR (`Clark_Lake` IS NULL)) AND (`Clark_Lake` >= -0.744683325 OR (`Clark_Lake` IS NULL))) THEN 1.63549113
#> END + CASE
#> WHEN (`Clark_Lake` < -1.26659846 AND `Clark_Lake` < -0.744683325) THEN (-1.91823995)
#> WHEN ((`Clark_Lake` >= -1.26659846 OR (`Clark_Lake` IS NULL)) AND `Clark_Lake` < -0.744683325) THEN (-1.22058809)
#> WHEN (`Clark_Lake` < 0.582209051 AND (`Clark_Lake` >= -0.744683325 OR (`Clark_Lake` IS NULL))) THEN 0.381911665
#> WHEN ((`Clark_Lake` >= 0.582209051 OR (`Clark_Lake` IS NULL)) AND (`Clark_Lake` >= -0.744683325 OR (`Clark_Lake` IS NULL))) THEN 1.09545982
#> END) + CASE
#> WHEN (`Merchandise_Mart` < -1.78258145 AND `Clinton` < -0.623481333) THEN 0.270338893
#> WHEN ((`Merchandise_Mart` >= -1.78258145 OR (`Merchandise_Mart` IS NULL)) AND `Clinton` < -0.623481333) THEN (-1.2570411)
#> WHEN (`Clark_Lake` < 0.948533833 AND (`Clinton` >= -0.623481333 OR (`Clinton` IS NULL))) THEN 0.376703531
#> WHEN ((`Clark_Lake` >= 0.948533833 OR (`Clark_Lake` IS NULL)) AND (`Clinton` >= -0.623481333 OR (`Clinton` IS NULL))) THEN 1.07266045
#> END) + CASE
#> WHEN (`Clark_Lake` < -1.48545611 AND `Clinton` < -0.761389613) THEN (-1.06872714)
#> WHEN ((`Clark_Lake` >= -1.48545611 OR (`Clark_Lake` IS NULL)) AND `Clinton` < -0.761389613) THEN (-0.646734238)
#> WHEN (`Clark_Lake` < 0.459111422 AND (`Clinton` >= -0.761389613 OR (`Clinton` IS NULL))) THEN 0.0673953071
#> WHEN ((`Clark_Lake` >= 0.459111422 OR (`Clark_Lake` IS NULL)) AND (`Clinton` >= -0.761389613 OR (`Clinton` IS NULL))) THEN 0.542465925
#> END) + CASE
#> WHEN (`Merchandise_Mart` < -1.78258145 AND `Clinton` < -0.947035372) THEN 0.60846895
#> WHEN ((`Merchandise_Mart` >= -1.78258145 OR (`Merchandise_Mart` IS NULL)) AND `Clinton` < -0.947035372) THEN (-0.685996652)
#> WHEN (`Irving_Park` < 0.538704038 AND (`Clinton` >= -0.947035372 OR (`Clinton` IS NULL))) THEN 0.031657584
#> WHEN ((`Irving_Park` >= 0.538704038 OR (`Irving_Park` IS NULL)) AND (`Clinton` >= -0.947035372 OR (`Clinton` IS NULL))) THEN 0.399935812
#> END) + 13.133584 AS .pred

Versioning and Deploying with vetiver

vetiver is a tool for deploying models as REST APIs.

  • There are R and Python versions of the package.

  • Perhaps more than any other deployment tool, it is designed to be simple and straightforward.

  • It is also designed to correctly ingest data and make deployment and documentation very easy.

  • vetiver exploits the tools in tidymodels for tracking package dependencies.

Convert the model to vetiver

We start by ingesting the model and information about it:

library(vetiver)
xgb_vet <- vetiver_model(xgb_fit, model_name = "boosted-chicago")
xgb_vet
#> 
#> ── boosted-chicago ─ <bundled_workflow> model for deployment 
#> A xgboost regression modeling workflow using 6 features

# Example of stored information: 
xgb_vet$metadata$required_pkgs
#> [1] "parsnip"   "workflows" "xgboost"

Pinning the model

Although vetiver can be used on its own, it works very nicely with the pins package.

This can use many cloud storage options and can save, share, and version the model:


library(pins)
model_board <- board_temp(versioned = TRUE)
model_board |> vetiver_pin_write(xgb_vet)

Model updating

As more data accrue, we often automate model updates, refitting the same workflow on the new data. We can add the new version to the model board:

xgb_new_vet <- 
  fit(xgb_wflow, chicago_updated) |> 
  vetiver_model(model_name = "boosted-chicago")

model_board |> vetiver_pin_write(xgb_new_vet)
model_board |> pin_versions("boosted-chicago")
#> # A tibble: 2 × 3
#>   version                created             hash 
#>   <chr>                  <dttm>              <chr>
#> 1 20260205T233540Z-40cd9 2026-02-05 18:35:40 40cd9
#> 2 20260205T233540Z-7bd76 2026-02-05 18:35:40 7bd76

Reverting models

Suppose the data pull for the last model was incorrect, and we need to revert to an earlier model version.


last_best_version <- 
  model_board |> 
  pin_versions("boosted-chicago") |> 
  slice(1) |> 
  pluck("version")

model_board |> 
  pin_download("boosted-chicago", version = last_best_version)
#> [1] "/var/folders/ml/6750yrxn1ds3gsf4d8b6bb600000gn/T/RtmpKp4aTT/pins-e9c138cce201/boosted-chicago/20260205T233540Z-40cd9/boosted-chicago.rds"

Alternatively, pin_version_delete() can remove the bad version entirely.
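
A self-contained sketch of that deletion flow, using a throwaway board and plain data frames (the same calls apply to vetiver model pins; the pin name is made up):

```r
library(pins)

# Two writes to a versioned board create two versions of the pin:
board <- board_temp(versioned = TRUE)
pin_write(board, data.frame(x = 1), "demo")
pin_write(board, data.frame(x = 2), "demo")

# Remove one version; exactly one remains:
bad_version <- pin_versions(board, "demo")$version[2]
pin_version_delete(board, "demo", version = bad_version)
nrow(pin_versions(board, "demo"))  # 1
```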

Deploying models

In R, vetiver uses the plumber package to deploy models:


library(plumber)
vetiver_api(pr(), xgb_vet)
#> # Plumber router with 4 endpoints, 4 filters, and 1 sub-router.
#> # Use `pr_run()` on this object to start the API.
#> ├──[queryString]
#> ├──[body]
#> ├──[cookieParser]
#> ├──[sharedSecret]
#> ├──/logo
#> │  │ # Plumber static router serving from directory: /Users/max/Library/R/arm64/4.5/library/vetiver
#> ├──/metadata (GET)
#> ├──/ping (GET)
#> ├──/predict (POST)
#> └──/prototype (GET)
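
To actually serve and query the API, a sketch (the host and port are hypothetical, and pr_run() blocks the current R session):

```r
# Serve the model locally (this call blocks until the API is stopped):
# pr() |> vetiver_api(xgb_vet) |> pr_run(port = 8080)

# From another session or process, POST new data to the /predict endpoint:
# endpoint <- vetiver_endpoint("http://127.0.0.1:8080/predict")
# predict(endpoint, slice(chicago_last, 1:3))
```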

Plumber deployment file

vetiver_write_plumber(model_board, "boosted-chicago") will write out a template for using plumber to deploy:


# Generated by the vetiver package; edit with care

library(pins)
library(plumber)
library(rapidoc)
library(vetiver)

# Packages needed to generate model predictions
if (FALSE) {
    library(parsnip)
    library(workflows)
    library(xgboost)
}
b <- board_folder(path = "/var/folders/ml/6750yrxn1ds3gsf4d8b6bb600000gn/T/RtmpQGaLaS/pins-113c17c9d1b4e")
v <- vetiver_pin_read(b, "boosted-chicago", version = "20260201T205357Z-b3640")

#* @plumber
function(pr) {
    pr |> vetiver_api(v)
}

Docker deployment file

vetiver_write_docker(xgb_vet) will write out a Docker template:


# Generated by the vetiver package; edit with care

FROM rocker/r-ver:4.5.2
ENV RENV_CONFIG_REPOS_OVERRIDE https://packagemanager.rstudio.com/cran/latest

RUN apt-get update -qq && apt-get install -y --no-install-recommends \
  libcurl4-openssl-dev \
  libicu-dev \
  libsodium-dev \
  libssl-dev \
  libx11-dev \
  make \
  zlib1g-dev \
  && apt-get clean

COPY vetiver_renv.lock renv.lock
RUN Rscript -e "install.packages('renv')"
RUN Rscript -e "renv::restore()"
COPY plumber.R /opt/ml/plumber.R
EXPOSE 8000
ENTRYPOINT ["R", "-e", "pr <- plumber::plumb('/opt/ml/plumber.R'); pr$run(host = '0.0.0.0', port = 8000)"]
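
To build and serve the generated image, something like the following (the image name is hypothetical; run from the directory containing the Dockerfile, plumber.R, and vetiver_renv.lock):

```shell
docker build -t boosted-chicago .
docker run --rm -p 8000:8000 boosted-chicago
```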

Monitoring the Data and Performance

tidymodels has basic tools to compute weekly performance:

xgb_fit |> 
  augment(chicago_last) |> 
  sliding_period(
    date, period = "week",
    lookback = 0, assess_stop = 1) |> 
  mutate(
    weekly = map(splits, analysis),
    Error = map_dbl(weekly, ~ rmse(.x, ridership, .pred)$.estimate),
    Date = map_vec(weekly, ~ .x |> pluck("date") |> min())
  ) |> 
  ggplot(aes(Date, Error)) + 
  geom_point() + 
  geom_smooth()
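
vetiver packages this monitoring pattern into helper functions; a hedged sketch (the metrics pin name is made up):

```r
library(vetiver)

# Compute rolling weekly metrics for the new data:
new_metrics <- 
  augment(xgb_fit, chicago_last) |>
  vetiver_compute_metrics(date, "week", ridership, .pred)

# Version the metrics alongside the model, then plot them over time:
vetiver_pin_metrics(model_board, new_metrics, "boosted-chicago-metrics")
vetiver_plot_metrics(new_metrics)
```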

Monitoring the Data and Performance

Data drifts; models don’t

There are a few specialized tools to make sure that your prediction population has not shifted.


vetiver also includes an R Markdown template, called a “model card,” for documenting the model and its training process.

Other resources

Thanks for listening!