Evaluating Time-to-Event Models is Hard

Max Kuhn

Probability Predictions

Compute Metrics at Specific Times

Classification(ish) Metrics

Most dynamic metrics convert the observed event time to a binary event/non-event representation (at a specific evaluation time).

From there, we can apply existing classification metrics, such as

  • Brier Score (for calibration)
  • Area under the ROC curve (for separation)

We’ll talk about both of these.

There are more details on dynamic metrics at tidymodels.org.

Converting to Events

For a specific evaluation time point \(\tau\), we convert the observed event time to a binary event/non-event version (if possible) (\(y_{i\tau} \in \{0, 1\}\)).

\[ y_{i\tau} = \begin{cases} 1 & \text{if } t_{i} \leq \tau \text{ and event} \\ 0 & \text{if } t_{i} > \tau \text{ and either} \\ \text{missing} & \text{if } t_{i} \leq \tau \text{ and censored} \end{cases} \]
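
As a rough illustration, here is a minimal sketch of that conversion (not the yardstick internals), assuming an observed time vector, a 0/1 event indicator, and a hypothetical helper name:

# A sketch of the event conversion at evaluation time tau; NA marks rows
# whose outcome is unknown because they were censored before tau.
binary_at_tau <- function(time, event, tau) {
  dplyr::case_when(
    time <= tau & event == 1 ~ 1,        # event observed by tau
    time > tau               ~ 0,        # still at risk at tau
    TRUE                     ~ NA_real_  # censored before tau: unknown
  )
}

binary_at_tau(time = c(10, 50, 120), event = c(1, 0, 1), tau = 90)
#> [1]  1 NA  0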

Converting to Events

Dealing with Missing Outcome Data

Without censored data points, this conversion would yield appropriate performance estimates since no event outcomes would be missing.

  • Otherwise, there is the potential for bias due to missingness.


We’ll use tools from causal inference to compensate by creating a propensity score based on the probability of being censored/missing.

Censoring Weights

We currently use a simple, non-informative censoring model called a “reverse Kaplan-Meier” curve.


For a given evaluation time, we can compute the probability of any sample being censored at \(\tau\).


Our metrics use case weights that are the inverse of this probability. See Graf et al. (1999).
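
As a rough sketch of how such weights could be computed outside of tidymodels, one can fit the censoring distribution with the survival package by flipping the event indicator; the data frame and column names below are placeholders:

library(survival)

# Fit a Kaplan-Meier curve to the *censoring* distribution ("reverse KM") by
# flipping the 0/1 event indicator; `surv_data`, `time`, and `status` are
# hypothetical.
rev_km <- survfit(Surv(time, 1 - status) ~ 1, data = surv_data)

# Probability of remaining uncensored at evaluation time tau
tau <- 90
prob_uncensored <- summary(rev_km, times = tau)$surv

# Inverse probability of censoring weight
ipc_weight <- 1 / prob_uncensored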

Sum of Weights Over Time

Brier Score

The Brier score is a calibration metric originally meant for classification models:

\[ Brier = \frac{1}{N}\sum_{i=1}^N\sum_{k=1}^C (y_{ik} - \hat{\pi}_{ik})^2 \]

For our application, we have two classes and case weights:

\[ Brier_{\tau} = \frac{1}{W_\tau}\sum_{i=1}^N\sum_{k=1}^C w_{i\tau}(y_{ik\tau} - \hat{\pi}_{ik\tau})^2 \]
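
For intuition, here is a minimal sketch of this weighted, two-class computation (not the yardstick implementation), assuming vectors `y` (0/1 indicator at \(\tau\), NA if unknown), `pi_hat` (predicted event probability at \(\tau\)), and `w` (censoring weights):

# Weighted Brier score at a single evaluation time; rows with an unknown
# outcome are dropped before averaging.
brier_at_tau <- function(y, pi_hat, w) {
  keep <- !is.na(y)
  y <- y[keep]; pi_hat <- pi_hat[keep]; w <- w[keep]
  # Two classes: squared error for the event class plus the non-event class
  sum(w * ((y - pi_hat)^2 + ((1 - y) - (1 - pi_hat))^2)) / sum(w)
}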

Brier Score Code

# Out-of-sample predictions at many time points
# Results contain survival probabilities and case weights in a 
# list column called `.pred`
val_pred <- augment(mod_fit, new_data = churn_val, eval_time = 5:230)

val_brier <- brier_survival(val_pred, truth = event_time, .pred)
val_brier %>% filter(.eval_time %in% seq(30, 210, by = 30))
#> # A tibble: 7 × 4
#>   .metric        .estimator .eval_time .estimate
#>   <chr>          <chr>           <dbl>     <dbl>
#> 1 brier_survival standard           30   0.00255
#> 2 brier_survival standard           60   0.00879
#> 3 brier_survival standard           90   0.0243 
#> 4 brier_survival standard          120   0.0690 
#> 5 brier_survival standard          150   0.0977 
#> 6 brier_survival standard          180   0.153  
#> 7 brier_survival standard          210   0.188

Integrated Brier Score

brier_survival_integrated(val_pred, truth = event_time, .pred)
#> # A tibble: 1 × 3
#>   .metric                   .estimator .estimate
#>   <chr>                     <chr>          <dbl>
#> 1 brier_survival_integrated standard      0.0783

Brier Scores Over Evaluation Time

Calibration Over Time

Area Under the ROC Curve

This is more straightforward.


We can use the standard ROC curve machinery once we have the indicators, probabilities, and censoring weights at evaluation time \(\tau\) (Hung and Chiang, 2010).


ROC curves measure the separation between events and non-events and are ignorant of how well-calibrated the probabilities are.

Area Under the ROC Curve

val_roc_auc <- roc_auc_survival(val_pred, truth = event_time, .pred)
val_roc_auc %>% filter(.eval_time %in% seq(30, 210, by = 30))
#> # A tibble: 7 × 4
#>   .metric          .estimator .eval_time .estimate
#>   <chr>            <chr>           <dbl>     <dbl>
#> 1 roc_auc_survival standard           30     0.999
#> 2 roc_auc_survival standard           60     0.999
#> 3 roc_auc_survival standard           90     0.984
#> 4 roc_auc_survival standard          120     0.961
#> 5 roc_auc_survival standard          150     0.928
#> 6 roc_auc_survival standard          180     0.835
#> 7 roc_auc_survival standard          210     0.859

ROC AUC Over Evaluation Time

Conclusion

  • Bad news: statistically, this is pretty convoluted.

  • Good news: tidymodels handles the details and provides a clean syntax for some complex statistics.


A huge thanks to the tidymodels team: Hannah Frick, Emil Hvitfeldt, and Simon Couch!

Details

“Reverse Kaplan-Meier” Curve

We assume a non-informative censoring model when computing the probability of being censored.

For each prediction at evaluation time \(\tau\), we compute the probability at an adjusted time:

\[ t_i^* = \begin{cases} t_i - \epsilon & \text{if } t_i \le \tau \\ \tau - \epsilon & \text{if } t_i > \tau \end{cases} \]
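
In code form, a minimal sketch of that adjustment (the offset \(\epsilon\) and the function name are illustrative); the reverse Kaplan-Meier curve is then evaluated at \(t_i^*\) to get the censoring probability:

# Adjusted time used to look up the probability of being censored
adjusted_time <- function(time, tau, eps = .Machine$double.eps^0.5) {
  ifelse(time <= tau, time - eps, tau - eps)
}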