The Post-Modeling Model to Fix the Model

Max Kuhn (Posit)

An unparalleled work of staggering genius.

After a long campaign of data analysis, one model was able to conquer the rest: Naive Bayes!

With 1,514 training set samples and 56 predictors, our amazing model has an area under the ROC curve of 0.86!


What could go wrong?

😱

Most predictions are zero or one?

In a lot of cases, we are confidently incorrect.

This seems… bad.


The model is able to separate the classes, but the probabilities are not realistic.

They aren’t well calibrated: when the model predicts a 90% chance of an event, the event should actually occur about 90% of the time.

Some tidymodels code

library(tidymodels)
library(discrim)   # the discrim extension package provides naive_Bayes()

set.seed(8928)

# Training/test split and 10-fold cross-validation, both stratified by class
split <- initial_split(all_data, strata = class)
data_tr <- training(split)
data_te <- testing(split)
data_rs <- vfold_cv(data_tr, strata = class)

# A naive Bayes workflow that uses all of the predictors
bayes_wflow <-
  workflow() %>%
  add_formula(class ~ .) %>%
  add_model(naive_Bayes())

# Metrics to compute, plus a control object that retains the holdout predictions
cls_met <- metric_set(roc_auc, brier_class)
ctrl <- control_resamples(save_pred = TRUE)

# The resampling results from 10-fold cross-validation:
bayes_res <-
  bayes_wflow %>%
  fit_resamples(data_rs, metrics = cls_met, control = ctrl)

The probably package

probably has functions for post-processing model results, including:

  • equivocal zones (a quick sketch follows this list)
  • probability threshold optimization

(and in the most recent version)

  • conformal inference prediction intervals
  • calibration visualization and mitigation (Edgar Ruiz did most of this!)
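As a quick illustration of the first bullet, here is a minimal sketch of an equivocal zone. make_two_class_pred() is from probably; the ±0.05 buffer and applying it to the resampled predictions are illustrative choices rather than part of this analysis.

library(probably)

# Gather the holdout predictions saved during resampling
preds <- collect_predictions(bayes_res)

preds %>%
  mutate(
    .pred_eqz = make_two_class_pred(
      estimate  = .pred_event,   # probability of the event class
      levels    = levels(class), # factor levels of the outcome
      threshold = 0.5,           # the usual 50% cutoff
      buffer    = 0.05           # probabilities within 0.45–0.55 are marked equivocal
    )
  )

Equivocal predictions can then be handled separately (e.g., flagged as uncertain) instead of being forced into a hard class call.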

We’ll look at the calibration tools today. There are several ways to assess whether a model is well calibrated.

Assessing calibration issues

Conventional calibration plots

library(probably)

# Calibration plot using binned probability ranges, computed from the
# resampled holdout predictions
cal_plot_breaks(bayes_res)


There are also methods for data frames of predictions.
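For example, the same plot can be made from the collected holdout predictions themselves (a small sketch; the truth and estimate columns are the same ones used later in these slides):

collect_predictions(bayes_res) %>% 
  cal_plot_breaks(class, .pred_event)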

Moving window calibration plots

# The same idea, computed over overlapping, moving windows of probabilities
bayes_res %>% 
  cal_plot_windowed(
    step_size = 0.025
  )

Logistic (GAM) calibration plots

# A smooth calibration curve estimated with a logistic (GAM) fit
cal_plot_logistic(bayes_res)

What can we do about it?

If we don’t have a model with better separation and calibration, we can post-process the predictions.

  • Logistic regression
  • Isotonic regression
  • Isotonic regression (resampled)
  • Beta calibration

These models can estimate the trends and “un-bork” the predictions; a sketch of the corresponding probably functions follows.
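Each of those methods has an estimator in probably that can be applied directly to the resampling results. A minimal sketch, assuming the package’s cal_estimate_*() naming; Beta calibration is the one used on the next slides.

# Each call returns a calibration object that cal_apply() can use later
logit_cal <- cal_estimate_logistic(bayes_res)       # logistic regression
iso_cal   <- cal_estimate_isotonic(bayes_res)       # isotonic regression
boot_cal  <- cal_estimate_isotonic_boot(bayes_res)  # resampled isotonic regression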

What data can we use?

Ideally, we would reserve some data to estimate the mis-calibration patterns.

If not, we could use the holdout predictions from resampling (or a validation set). This is a little risky but doable.
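If a separate calibration set had been carved out, the estimation step might look like the sketch below. Here data_cal is hypothetical (it is not part of this analysis), and the truth/estimate arguments are assumed to mirror the columns used with the plotting functions elsewhere in these slides.

# Predict the hypothetical reserved calibration set with the fitted workflow
cal_pred <- augment(fit(bayes_wflow, data_tr), data_cal)

# Estimate the mis-calibration from these predictions instead of the resamples
beta_cal_sep <- cal_estimate_beta(cal_pred, truth = class, estimate = .pred_event)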


The Brier score is a nice performance metric that can measure both effectiveness and calibration (a small worked example follows the bullets).

For 2 classes:

  • Brier score = 0 šŸ’Æ
  • Brier score = 1/2 😢
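A small worked example, assuming the original sum-over-both-classes convention (the convention under which the endpoints above are 0 for a perfect model and 1/2 for a coin-flip model); yardstick::brier_class() may scale things differently.

# Per-sample Brier score, summing the squared error over both classes.
# The reported metric is the average of this quantity over all samples.
brier_2class <- function(p_event, y_event) {
  (y_event - p_event)^2 + ((1 - y_event) - (1 - p_event))^2
}

brier_2class(p_event = 0.9, y_event = 1)  # confident and correct -> 0.02
brier_2class(p_event = 0.5, y_event = 1)  # a coin flip           -> 0.50
brier_2class(p_event = 0.9, y_event = 0)  # confident and wrong   -> 1.62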

With and without Beta calibration

cal_validate_beta(bayes_res, metrics = cls_met) %>% 
  collect_metrics() %>% 
  arrange(.metric)
#> # A tibble: 4 × 7
#>   .metric     .type        .estimator  mean     n std_err .config
#>   <chr>       <chr>        <chr>      <dbl> <int>   <dbl> <chr>  
#> 1 brier_class uncalibrated binary     0.201    10 0.0102  config 
#> 2 brier_class calibrated   binary     0.145    10 0.00450 config 
#> 3 roc_auc     uncalibrated binary     0.857    10 0.00945 config 
#> 4 roc_auc     calibrated   binary     0.857    10 0.00942 config

# Estimate the Beta calibration from the resampled holdout predictions
beta_cal <- cal_estimate_beta(bayes_res)

Test set results - Raw

# Fit the workflow on the training set and predict the test set
nb_fit <- fit(bayes_wflow, data_tr)
nb_pred <- augment(nb_fit, data_te)

# Test set metrics for the raw (uncalibrated) probabilities
nb_pred %>% cls_met(class, .pred_event)
#> # A tibble: 2 × 3
#>   .metric     .estimator .estimate
#>   <chr>       <chr>          <dbl>
#> 1 roc_auc     binary         0.841
#> 2 brier_class binary         0.225

Test set results - Calibrated

# Recalibrate the test set probabilities with the Beta calibration object
nb_pred_fixed <- nb_pred %>% cal_apply(beta_cal)

# Test set metrics for the calibrated probabilities
nb_pred_fixed %>% cls_met(class, .pred_event)
#> # A tibble: 2 × 3
#>   .metric     .estimator .estimate
#>   <chr>       <chr>          <dbl>
#> 1 roc_auc     binary         0.841
#> 2 brier_class binary         0.152

Test set results

nb_pred_fixed %>% 
  cal_plot_windowed(
    class, 
    .pred_event, 
    step_size = 0.025
  )

What’s next?

We will be updating workflow objects with post-processors towards the end of the year.

This means that we can:

  • bind the model fit with pre- and post-processing results
  • automatically calibrate new results using predict(workflow, new_data).

Thanks

Again, Edgar Ruiz did the majority of the work on calibration methods!