Prediction interval with level \(\alpha\):
A range of values that contains the value of a single new observation with probability \(1-\alpha\).
It gives a sense of the variability in a new prediction.
Without making parametric assumptions, we could say
\[Pr[Q_L< Y_{n_c} < Q_U] = 1 - \alpha\] where \(Q_L\) and \(Q_U\) are quantiles excluding \(\alpha/2\) tail areas.
So, for some \(\alpha\), we could say that new data between \(Q_L\) and \(Q_U\) are “likely” to conform to our original, reference distribution.
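As a minimal sketch of the quantile statement above, the two bounds can be estimated directly from a reference sample. The simulated standard normal data and the names `reference`, `new_obs`, `q_lower`, and `q_upper` are assumptions made for illustration only.

```r
set.seed(1)
alpha <- 0.10

# Hypothetical reference sample; any numeric vector would do
reference <- rnorm(1000)

# Empirical quantiles that exclude alpha / 2 in each tail
q <- quantile(reference, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
q_lower <- q[1]
q_upper <- q[2]

# Fraction of new draws from the same distribution that land inside
new_obs <- rnorm(5000)
inside <- new_obs > q_lower & new_obs < q_upper
mean(inside)
```

The printed proportion should sit near \(1-\alpha\), with no distributional form assumed when computing the bounds.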
Conformal intervals use a completely different approach to produce intervals with average coverage of \(1-\alpha\)
(usually; this holds for many conformal methods)
For the statisticians out there: the methods have a strong Frequentist theme similar to nonparametric inference.
Simulated data:
train_data (\(n_{tr}\) = 1,000) for model training
test_data (\(n_{te}\) = 500) for final evaluation
cal_data (\(n_{c}\) = 500) a calibration set, used only to get good estimates of the residual distribution
Now let’s look at three functions for producing intervals…
Use a calibration set to create fixed width intervals.
library(probably)
conf_split <- int_conformal_split(svm_fit, cal_data = cal_data)
conf_split_test <-
  predict(conf_split, test_data, level = 0.90) %>%
  bind_cols(test_data)
conf_split_test %>% slice(1)
#> # A tibble: 1 × 5
#>   .pred .pred_lower .pred_upper predictor outcome
#>   <dbl>       <dbl>       <dbl>     <dbl>   <dbl>
#> 1 0.264       0.179       0.350   0.00528   0.160
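To make the idea behind the split method concrete, here is a minimal base R sketch of split conformal inference. It is not the code inside `int_conformal_split()`. The simulated data, the `sim()` helper, and the polynomial `lm()` standing in for the SVM are all assumptions made for illustration.

```r
set.seed(2)

# Hypothetical data generator (not the data used in these slides)
sim <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = sin(2 * pi * x) + rnorm(n, sd = 0.3))
}
train <- sim(1000)
cal   <- sim(500)
test  <- sim(500)

# Stand-in model; any regression model with a predict() method works
fit <- lm(y ~ poly(x, 5), data = train)

alpha <- 0.10

# Absolute residuals on the calibration set only
abs_resid <- abs(cal$y - predict(fit, cal))
n_c <- length(abs_resid)

# Finite-sample-corrected order statistic of the residuals
k <- ceiling((n_c + 1) * (1 - alpha))
half_width <- sort(abs_resid)[k]

# The same half-width is applied to every new prediction
pred  <- predict(fit, test)
lower <- pred - half_width
upper <- pred + half_width

coverage <- mean(test$y >= lower & test$y <= upper)
coverage
```

Every interval has width `2 * half_width`, which is why the method is described as fixed width.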
Use out-of-sample predictions from V-fold cross-validation to produce residuals.
The intervals are also fixed length.
The supporting theory covers only V-fold cross-validation.
# 'extract' to save the 10 fitted models
ctrl <- control_resamples(save_pred = TRUE, extract = I)
svm_resampled <-
  svm_wflow %>%
  fit_resamples(resamples = cv_folds, control = ctrl)
conf_cv <- int_conformal_cv(svm_resampled)
conf_cv_test <-
  predict(conf_cv, test_data, level = 0.90) %>%
  bind_cols(test_data)
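The sketch below shows one simplified way that out-of-sample residuals from V-fold cross-validation can yield a fixed-length interval: pool the held-out absolute residuals, take a high order statistic, and center the band on the average of the fold models' predictions. This is an assumption-laden illustration, not necessarily how `int_conformal_cv()` computes its bounds. The data, the `lm()` stand-in model, and all object names are hypothetical.

```r
set.seed(3)
n <- 1000
x <- runif(n)
dat <- data.frame(x = x, y = sin(2 * pi * x) + rnorm(n, sd = 0.3))
new_dat <- data.frame(x = c(0.1, 0.5, 0.9))

# Assign each row to one of V = 10 folds at random
v <- 10
fold <- sample(rep(seq_len(v), length.out = n))

abs_resid  <- numeric(n)
fold_preds <- matrix(NA_real_, nrow = nrow(new_dat), ncol = v)

for (k in seq_len(v)) {
  held_out <- fold == k
  # Fit on the other V - 1 folds (stand-in model, not an SVM)
  fit_k <- lm(y ~ poly(x, 5), data = dat[!held_out, ])
  # Residuals only for rows this model did not see
  abs_resid[held_out] <- abs(dat$y[held_out] - predict(fit_k, dat[held_out, ]))
  # Keep each fold model's prediction for the new points
  fold_preds[, k] <- predict(fit_k, new_dat)
}

alpha <- 0.10
half_width <- sort(abs_resid)[ceiling((n + 1) * (1 - alpha))]

center <- rowMeans(fold_preds)
intervals <- data.frame(
  .pred       = center,
  .pred_lower = center - half_width,
  .pred_upper = center + half_width
)
intervals
```

No separate calibration set is needed here, since each residual comes from a model that never saw that row. The cost is keeping all V fitted models around for prediction.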
Use a quantile regression model to estimate bounds.
Example using linear quantile regression:
We actually use quantile random forests (for better or worse)
Produces variable length intervals.
# We have to pass the data sets and pre-set the interval coverage:
set.seed(837)
conf_qntl <-
  int_conformal_quantile(svm_fit,
                         train_data = train_data,
                         cal_data = cal_data,  #<- split sample
                         level = 0.90,
                         # Can pass options to `quantregForest()`:
                         ntree = 2000)
conf_qntl_test <-
  predict(conf_qntl, test_data) %>%
  bind_cols(test_data)
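For reference, the usual conformalized quantile regression recipe can be written as follows. This assumes the function follows that standard recipe, and the symbols \(\hat{q}_L\), \(\hat{q}_U\), \(E_i\), and \(\hat{Q}\) are notation introduced here rather than taken from the slides. With \(\hat{q}_L(x)\) and \(\hat{q}_U(x)\) the fitted lower and upper quantile models, the calibration scores are

\[E_i = \max\left\{\hat{q}_L(x_i) - y_i,\; y_i - \hat{q}_U(x_i)\right\}, \quad i = 1, \ldots, n_c\]

and the interval for a new point \(x\) is

\[\left[\hat{q}_L(x) - \hat{Q},\; \hat{q}_U(x) + \hat{Q}\right]\]

where \(\hat{Q}\) is the \(\lceil (n_c + 1)(1 - \alpha) \rceil\)-th smallest value of \(E_i\). The interval length changes with \(x\) because \(\hat{q}_U(x) - \hat{q}_L(x)\) does; \(\hat{Q}\) only shifts both bounds by a constant.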
For our test set, the coverage for 90% prediction intervals:
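Empirical coverage is the proportion of test-set outcomes that fall inside their own intervals. A minimal sketch of that calculation follows; the `coverage()` helper and the `toy` values are hypothetical, with column names chosen to match the prediction output shown earlier.

```r
# Proportion of rows whose outcome lies within its interval
coverage <- function(data) {
  mean(data$outcome >= data$.pred_lower & data$outcome <= data$.pred_upper)
}

# Made-up values: the first outcome falls below its lower bound
toy <- data.frame(
  outcome     = c(0.16, 0.40, 1.20, -0.30),
  .pred_lower = c(0.18, 0.10, 0.90, -0.50),
  .pred_upper = c(0.35, 0.60, 1.40, -0.10)
)

coverage(toy)
#> [1] 0.75
```

The same helper could be applied to objects such as `conf_split_test`, `conf_cv_test`, or `conf_qntl_test`, since each binds the interval columns to the test set's `outcome` column.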
I also did a lot of simulations to make sure that the coverage was on-target.
These can be found at https://github.com/topepo/conformal_sim.
What’s next?
Thanks: the tidymodels/tidyverse groups, Joe Rickert, and the conference committee.
tidymodels.org
awesome-conformal-prediction on GitHub.