Conformal Inference with Tidymodels

Max Kuhn

or… how to make prediction intervals with no parametric assumptions

but first…

questions: pos.it/slido-CD

Um OK. What’s a prediction interval?

Prediction interval with coverage level \(1-\alpha\):

A range of values that is likely to contain the value of a single new observation with probability \(1-\alpha\).

It gives a sense of the variability in a new prediction.

Let’s start with some data…

Is this new data point discordant?

A Simple Probability Statement

Without making parametric assumptions, we could say

\[Pr[Q_L< Y_{n_c} < Q_U] = 1 - \alpha\] where \(Q_L\) and \(Q_U\) are quantiles excluding \(\alpha/2\) tail areas.


So, for some \(\alpha\), we could say that new data between \(Q_L\) and \(Q_U\) are “likely” to conform to our original, reference distribution.

Use quantiles to define “conformity”
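The probability statement above boils down to taking empirical tail quantiles of a reference sample. As a language-agnostic sketch (the talk itself uses R, and every name here is illustrative):

```python
import numpy as np

def conformity_bounds(reference, alpha=0.10):
    """Empirical quantiles that exclude alpha/2 in each tail."""
    q_lo = np.quantile(reference, alpha / 2)
    q_hi = np.quantile(reference, 1 - alpha / 2)
    return q_lo, q_hi

# A new observation "conforms" if it lands inside [q_lo, q_hi]
rng = np.random.default_rng(322)
reference = rng.normal(size=1_000)
q_lo, q_hi = conformity_bounds(reference)
```
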

What if the samples were residuals?

Linear regression with spline features

Compute residuals on other data

Compute interval of “conforming” values

Center interval around predictions
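The four steps above are the whole split-conformal recipe. A minimal sketch in plain Python (not the tidymodels implementation; the symmetric-residual form here also skips the small finite-sample correction to the quantile level):

```python
import numpy as np

def split_conformal(pred_cal, y_cal, pred_new, alpha=0.10):
    """Center a fixed-width interval of 'conforming' residual
    values around each new prediction."""
    resid = np.abs(y_cal - pred_cal)            # residuals on held-out data
    half_width = np.quantile(resid, 1 - alpha)  # bound excluding an alpha tail
    return pred_new - half_width, pred_new + half_width

# Toy model: predictions equal the true mean, residuals are N(0, 1) noise
rng = np.random.default_rng(837)
x_cal = rng.uniform(size=500)
y_cal = x_cal + rng.normal(size=500)
lower, upper = split_conformal(x_cal, y_cal, pred_new=0.5)
```
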

Is this a prediction interval?

Conformal intervals use a completely different approach to produce intervals with average coverage of \(1-\alpha\)

(usually, for many conformal methods)


For the statisticians out there: the methods have a strong frequentist theme, similar to nonparametric inference.

Pros

  • Basic methods assume only exchangeability of the data (weaker than an i.i.d. assumption).
  • Can work with any regression or classification model.
    • We’ve only implemented it for regression models so far.
  • Relatively fast (except for “full” conformal inference).

Cons

  • Extrapolating beyond the training/calibration sets is problematic.
    • Some methods may not reflect the extrapolation in the interval width.
    • Other conformal methods can produce especially bad results.
    • The applicable package can be a big help in identifying extrapolation.
  • Probably not great for small sample sizes.

Set up the data

Simulated data:

  • train_data (\(n_{tr}\) = 1,000) for model training
  • test_data (\(n_{te}\) = 500) for final evaluation
  • cal_data (\(n_{c}\) = 500), a calibration set used only to get good estimates of the residual distribution

library(tidymodels)

# Setup some resampling for later use
set.seed(322)
cv_folds <- vfold_cv(train_data, v = 10, strata = outcome)

A support vector machine model

svm_spec <- 
  svm_rbf() %>% 
  set_mode("regression")

svm_wflow <- workflow(outcome ~ predictor, svm_spec)

svm_fit <- svm_wflow %>% fit(data = train_data)


Now let’s look at three functions for producing intervals…

Split Conformal Inference

Use a calibration set to create fixed-width intervals.

Split Conformal Inference

library(probably)

conf_split <- int_conformal_split(svm_fit, cal_data = cal_data)

conf_split_test <- 
  predict(conf_split, test_data, level = 0.90) %>% 
  bind_cols(test_data)

conf_split_test %>% slice(1)
#> # A tibble: 1 × 5
#>   .pred .pred_lower .pred_upper predictor outcome
#>   <dbl>       <dbl>       <dbl>     <dbl>   <dbl>
#> 1 0.264       0.179       0.350   0.00528   0.160

Split Conformal Inference

CV+ Inference

Use out-of-sample predictions from V-fold cross-validation to produce residuals.

Also produces fixed-width intervals.

Theory exists only for V-fold cross-validation.

  • You can use other resampling methods at your own risk (with a warning)
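In the CV+ construction (Barber et al.), each training point contributes its out-of-fold residual together with the prediction, at the new point, from the fold model that never saw it. A sketch in plain Python with illustrative names (the real work is done by int_conformal_cv()):

```python
import numpy as np

def cv_plus_interval(fold_pred_new, oof_resid, alpha=0.10):
    """CV+ interval at one new point.
    fold_pred_new[i]: prediction from the model whose fold excluded point i
    oof_resid[i]:     that point's out-of-fold residual"""
    r = np.abs(oof_resid)
    lower = np.quantile(fold_pred_new - r, alpha)
    upper = np.quantile(fold_pred_new + r, 1 - alpha)
    return lower, upper

# Toy check: identical fold models predicting 2.0, N(0, 1) residuals
rng = np.random.default_rng(322)
lower, upper = cv_plus_interval(np.full(500, 2.0), rng.normal(size=500))
```
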

CV+ Inference

# 'extract' to save the 10 fitted models
ctrl <- control_resamples(save_pred = TRUE, extract = I)

svm_resampled <- 
  svm_wflow %>% 
  fit_resamples(resamples = cv_folds, control = ctrl)

conf_cv <- int_conformal_cv(svm_resampled)

conf_cv_test <- 
  predict(conf_cv, test_data, level = 0.90) %>% 
  bind_cols(test_data)

CV+ Inference

Conformalized quantile regression

Use a quantile regression model to estimate bounds.

  • Can directly estimate the intervals

Example for using linear quantile regression:

Conformalized quantile regression


We actually use quantile random forests (for better or worse).

Produces variable-width intervals.
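The "conformalized" part works like this (in the style of Romano et al.): score each calibration point by how far it falls outside its estimated quantile bounds, then widen the bounds by a quantile of those scores. A plain-Python sketch with a deliberately trivial quantile model (the real function wraps quantregForest):

```python
import numpy as np

def cqr_adjustment(q_lo_cal, q_hi_cal, y_cal, alpha=0.10):
    """Positive scores mean y fell outside [q_lo, q_hi]; negative, inside."""
    scores = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)
    return np.quantile(scores, 1 - alpha)

def cqr_interval(q_lo_new, q_hi_new, adjustment):
    # Widen (or shrink, if the raw bounds were too wide) by the adjustment
    return q_lo_new - adjustment, q_hi_new + adjustment

# Trivial "quantile model": constant bounds that are a bit too narrow
rng = np.random.default_rng(837)
y_cal = rng.normal(size=500)
adj = cqr_adjustment(np.full(500, -1.0), np.full(500, 1.0), y_cal)
lower, upper = cqr_interval(-1.0, 1.0, adj)
```
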

Conformalized quantile regression

# We have to pass the data sets and pre-set the interval coverage:
set.seed(837)
conf_qntl <-
  int_conformal_quantile(svm_fit,
                         train_data = train_data,
                         cal_data = cal_data,  #<- split sample 
                         level = 0.90,
                         # Can pass options to `quantregForest()`:
                         ntree = 2000)

conf_qntl_test <- 
  predict(conf_qntl, test_data) %>% 
  bind_cols(test_data)

Conformalized quantile regression

Does it work?

For our test set, the coverage for 90% prediction intervals:

  • Split conformal: 89.6%
  • CV+: 88%
  • Conformalized quantile regression: 89.6%
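For reference, "coverage" here is just the fraction of test-set outcomes that land inside their intervals. A plain-Python sketch (the values mimic the .pred_lower/.pred_upper output above, but this is not probably code):

```python
import numpy as np

def empirical_coverage(y, lower, upper):
    """Fraction of outcomes that fall inside their prediction intervals."""
    return float(np.mean((y >= lower) & (y <= upper)))

# Three observations, two inside their intervals -> coverage 2/3
cov = empirical_coverage(np.array([0.16, 0.40, 0.90]),
                         np.array([0.10, 0.35, 0.10]),
                         np.array([0.35, 0.60, 0.50]))
```
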

I also did a lot of simulations to make sure that the coverage was on-target.

These can be found at https://github.com/topepo/conformal_sim.

What’s next?

  • Classification models
    • The focus is often on producing sets of predicted classes that have equivocal probabilities.
  • New methodologies as they pop up.

Thanks: the tidymodels/tidyverse groups, Joe Rickert, and the conference committee.

Learning More