Here are two models (a single decision tree and a boosted decision tree) that we’ll use:
library(tidymodels)

# A CART regression tree
cart_wflow <- workflow(ridership ~ ., decision_tree(mode = "regression"))
cart_fit <- fit(cart_wflow, chicago_original)

# An xgboost collection of trees
xgb_wflow <- workflow(ridership ~ ., boost_tree(mode = "regression"))
xgb_fit <- fit(xgb_wflow, chicago_original)
Ingesting Data
We often get our data in a text-based format (e.g., csv, json, etc.) that may not carry the context of the original training/testing data, such as:
How to map categories to indicators.
Timezone for dates (compared to the original data)
How does R (and tidymodels) deal with this?
Data Consistency
tidymodels uses a “zero-row slice” of the data to record important aspects of the data:
Note that the pointer references a different location in memory than the original object.
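The zero-row idea can be illustrated in base R. This is a minimal sketch, not the tidymodels internals: the data frame `train`, the incoming `new_raw`, and all column names are hypothetical. It shows that an empty slice still carries column types, factor levels, and the time zone, which is enough to restore context for data read from a file.

```r
# Toy training data (hypothetical; stands in for the original training set)
train <- data.frame(
  line  = factor(c("blue", "red", "blue"), levels = c("blue", "green", "red")),
  date  = as.POSIXct(c("2024-01-01", "2024-01-02", "2024-01-03"),
                     tz = "America/Chicago"),
  rides = c(10.5, 12.1, 9.8)
)

# The zero-row slice: no observations, but column names, types,
# factor levels, and the time zone attribute are all retained.
ptype <- train[0, ]

# New data as it might arrive from a csv file: plain character and numeric
new_raw <- data.frame(
  line  = c("red", "green"),
  date  = c("2024-02-01", "2024-02-02"),
  rides = c(11, 13),
  stringsAsFactors = FALSE
)

# Restore the training-time context using only the slice
new_data <- new_raw
new_data$line <- factor(new_raw$line, levels = levels(ptype$line))
new_data$date <- as.POSIXct(new_raw$date, tz = attr(ptype$date, "tzone"))

str(ptype)
str(new_data)
```

Because the factor levels come from the slice rather than from the new data, a later conversion to indicator columns would produce the same columns as at training time, even when a level is absent from the new rows.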
Model Deployment
Deploying with SQL
Some model types (linear models, trees) can efficiently be turned into SQL expressions. This allows us to take a fitted model and have it predict directly in a database.
Similarly, this can be done with feature engineering and postprocessing too.
The orbital package interfaces directly with tidymodels workflows.
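To see why a fitted model can become a SQL expression, here is a hand-rolled sketch for a linear model in base R. It does not use orbital and is not how orbital is implemented; the helper `lm_to_sql()` and the `mtcars` example are assumptions made for illustration only.

```r
# Fit a small linear model on a built-in data set (illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- coef(fit)

# Hypothetical helper: write the prediction equation as a SQL expression
lm_to_sql <- function(coefs, outcome = ".pred") {
  intercept <- coefs[["(Intercept)"]]
  slopes <- coefs[names(coefs) != "(Intercept)"]
  # One "(`column` * coefficient)" term per predictor
  terms <- sprintf("(`%s` * %.15g)", names(slopes), slopes)
  rhs <- paste(c(sprintf("%.15g", intercept), terms), collapse = " + ")
  paste0(rhs, " AS ", outcome)
}

sql_expr <- lm_to_sql(coefs)
sql_expr
```

The resulting string is only arithmetic on column names, so a database can evaluate it inside a SELECT without R being available. Trees follow the same logic, with each split becoming a condition in a CASE WHEN expression.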
Using orbital
Apply orbital() to a fitted workflow.
(trees and tree_depth are set low for easy viewing on slides)
Using show_query() or orbital_sql() will generate the SQL.
show_query(xgb_orb, con)
#> (`Clark_Lake` - 13.1390333121827) / 6.40142339936784 AS Clark_Lake
#> (`Clinton` - 2.29572810913705) / 0.942655487237437 AS Clinton
#> (`Merchandise_Mart` - 4.44142100253806) / 2.36254062814617 AS Merchandise_Mart
#> (`Irving_Park` - 3.32233851522842) / 1.11501202758246 AS Irving_Park
#> ((((CASE
#> WHEN (`Clinton` < -1.24725115 AND `Clark_Lake` < -0.744683325) THEN (-2.69900179)
#> WHEN ((`Clinton` >= -1.24725115 OR (`Clinton` IS NULL)) AND `Clark_Lake` < -0.744683325) THEN (-1.81820333)
#> WHEN (`Clark_Lake` < 0.713742316 AND (`Clark_Lake` >= -0.744683325 OR (`Clark_Lake` IS NULL))) THEN 0.662370682
#> WHEN ((`Clark_Lake` >= 0.713742316 OR (`Clark_Lake` IS NULL)) AND (`Clark_Lake` >= -0.744683325 OR (`Clark_Lake` IS NULL))) THEN 1.63549113
#> END + CASE
#> WHEN (`Clark_Lake` < -1.26659846 AND `Clark_Lake` < -0.744683325) THEN (-1.91823995)
#> WHEN ((`Clark_Lake` >= -1.26659846 OR (`Clark_Lake` IS NULL)) AND `Clark_Lake` < -0.744683325) THEN (-1.22058809)
#> WHEN (`Clark_Lake` < 0.582209051 AND (`Clark_Lake` >= -0.744683325 OR (`Clark_Lake` IS NULL))) THEN 0.381911665
#> WHEN ((`Clark_Lake` >= 0.582209051 OR (`Clark_Lake` IS NULL)) AND (`Clark_Lake` >= -0.744683325 OR (`Clark_Lake` IS NULL))) THEN 1.09545982
#> END) + CASE
#> WHEN (`Merchandise_Mart` < -1.78258145 AND `Clinton` < -0.623481333) THEN 0.270338893
#> WHEN ((`Merchandise_Mart` >= -1.78258145 OR (`Merchandise_Mart` IS NULL)) AND `Clinton` < -0.623481333) THEN (-1.2570411)
#> WHEN (`Clark_Lake` < 0.948533833 AND (`Clinton` >= -0.623481333 OR (`Clinton` IS NULL))) THEN 0.376703531
#> WHEN ((`Clark_Lake` >= 0.948533833 OR (`Clark_Lake` IS NULL)) AND (`Clinton` >= -0.623481333 OR (`Clinton` IS NULL))) THEN 1.07266045
#> END) + CASE
#> WHEN (`Clark_Lake` < -1.48545611 AND `Clinton` < -0.761389613) THEN (-1.06872714)
#> WHEN ((`Clark_Lake` >= -1.48545611 OR (`Clark_Lake` IS NULL)) AND `Clinton` < -0.761389613) THEN (-0.646734238)
#> WHEN (`Clark_Lake` < 0.459111422 AND (`Clinton` >= -0.761389613 OR (`Clinton` IS NULL))) THEN 0.0673953071
#> WHEN ((`Clark_Lake` >= 0.459111422 OR (`Clark_Lake` IS NULL)) AND (`Clinton` >= -0.761389613 OR (`Clinton` IS NULL))) THEN 0.542465925
#> END) + CASE
#> WHEN (`Merchandise_Mart` < -1.78258145 AND `Clinton` < -0.947035372) THEN 0.60846895
#> WHEN ((`Merchandise_Mart` >= -1.78258145 OR (`Merchandise_Mart` IS NULL)) AND `Clinton` < -0.947035372) THEN (-0.685996652)
#> WHEN (`Irving_Park` < 0.538704038 AND (`Clinton` >= -0.947035372 OR (`Clinton` IS NULL))) THEN 0.031657584
#> WHEN ((`Irving_Park` >= 0.538704038 OR (`Irving_Park` IS NULL)) AND (`Clinton` >= -0.947035372 OR (`Clinton` IS NULL))) THEN 0.399935812
#> END) + 13.133584 AS .pred
Versioning and Deploying with vetiver
vetiver is a tool for deploying models as REST APIs, a widely used interface style.
There are R and Python versions of the package.
Perhaps more than any other deployment tool, it is designed to be simple and straightforward.
It is also designed to correctly ingest data and make deployment and documentation very easy.
vetiver exploits the tools in tidymodels for tracking package dependencies.
Convert the model to vetiver
We start by ingesting the model and information about it:
library(vetiver)

xgb_vet <- vetiver_model(xgb_fit, model_name = "boosted-chicago")
xgb_vet
#> 
#> ── boosted-chicago ─ <bundled_workflow> model for deployment 
#> A xgboost regression modeling workflow using 6 features

# Example of stored information: 
xgb_vet$metadata$required_pkgs
#> [1] "parsnip" "workflows" "xgboost"
Pinning the model
Although vetiver can be used on its own, it works very nicely with the pins package.
This can use many cloud storage options and can save, share, and version the model:
In R, vetiver uses the plumber package to deploy models:
library(plumber)

vetiver_api(pr(), xgb_vet)
#> # Plumber router with 4 endpoints, 4 filters, and 1 sub-router.
#> # Use `pr_run()` on this object to start the API.
#> ├──[queryString]
#> ├──[body]
#> ├──[cookieParser]
#> ├──[sharedSecret]
#> ├──/logo
#> │ │ # Plumber static router serving from directory: /Users/max/Library/R/arm64/4.5/library/vetiver
#> ├──/metadata (GET)
#> ├──/ping (GET)
#> ├──/predict (POST)
#> └──/prototype (GET)
Plumber deployment file
vetiver_write_plumber(model_board, "boosted-chicago") will write out a template for using plumber to deploy:
# Generated by the vetiver package; edit with care

library(pins)
library(plumber)
library(rapidoc)
library(vetiver)

# Packages needed to generate model predictions
if (FALSE) {
  library(parsnip)
  library(workflows)
  library(xgboost)
}
b <- board_folder(path = "/var/folders/ml/6750yrxn1ds3gsf4d8b6bb600000gn/T/RtmpQGaLaS/pins-113c17c9d1b4e")
v <- vetiver_pin_read(b, "boosted-chicago", version = "20260201T205357Z-b3640")

#* @plumber
function(pr) {
  pr |> vetiver_api(v)
}
Docker deployment file
vetiver_write_docker(xgb_vet) will write out a Docker template:
# Generated by the vetiver package; edit with care
FROM rocker/r-ver:4.5.2
ENV RENV_CONFIG_REPOS_OVERRIDE https://packagemanager.rstudio.com/cran/latest
RUN apt-get update -qq && apt-get install -y --no-install-recommends \
libcurl4-openssl-dev \
libicu-dev \
libsodium-dev \
libssl-dev \
libx11-dev \
make \
zlib1g-dev \
&& apt-get clean
COPY vetiver_renv.lock renv.lock
RUN Rscript -e "install.packages('renv')"
RUN Rscript -e "renv::restore()"
COPY plumber.R /opt/ml/plumber.R
EXPOSE 8000
ENTRYPOINT ["R", "-e", "pr <- plumber::plumb('/opt/ml/plumber.R'); pr$run(host = '0.0.0.0', port = 8000)"]
Monitoring the Data and Performance
tidymodels has basic tools to compute weekly performance:
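The idea behind weekly performance can be sketched in base R. This is a stand-in, not the tidymodels or vetiver functions: the `monitor` data frame is simulated purely for illustration, and the `rmse()` helper and column names are assumptions.

```r
set.seed(1)

# Simulated monitoring data: four weeks of daily outcomes and predictions
# (entirely made up for illustration)
monitor <- data.frame(
  date = seq(as.Date("2024-01-01"), by = "day", length.out = 28)
)
monitor$ridership <- 10 + rnorm(28)
monitor$.pred <- monitor$ridership + rnorm(28, sd = 0.5)

# Label each row with the Monday that starts its week
monitor$week <- cut(monitor$date, breaks = "week")

# Root mean squared error for one set of outcomes and predictions
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

# One row of metrics per week
weekly <- do.call(rbind, lapply(split(monitor, monitor$week), function(d) {
  data.frame(
    week = as.Date(d$week[1]),
    n    = nrow(d),
    rmse = rmse(d$ridership, d$.pred)
  )
}))
weekly
```

Tracking a table like `weekly` over time makes it easier to spot a drift in performance that a single overall metric would hide.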