Most dynamic metrics convert the observed event time to a binary event/non-event representation (at a specific evaluation time).
From there, we can apply existing classification metrics, such as:
The Brier score (for calibration)
The area under the ROC curve (for separation)
We’ll talk about both of these.
There are more details on dynamic metrics at tidymodels.org.
Converting to Events
For a specific evaluation time point \(\tau\), we convert the observed event time to a binary event/non-event version (if possible) (\(y_{i\tau} \in \{0, 1\}\)).
\[
y_{i\tau} =
\begin{cases}
1 & \text{if } t_{i} \leq \tau \text{ and event} \\
0 & \text{if } t_{i} > \tau \text{ and either} \\
\text{missing} & \text{if } t_{i} \leq \tau \text{ and censored}
\end{cases}
\]
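As a concrete illustration, here is a minimal R sketch of this conversion; `obs_time` and `status` (1 = event, 0 = censored) are hypothetical stand-ins for the observed data:

# Binarize observed event times at evaluation time tau
binarize_at <- function(obs_time, status, tau) {
  dplyr::case_when(
    obs_time <= tau & status == 1 ~ 1,        # event observed by tau
    obs_time > tau                ~ 0,        # still at risk past tau
    TRUE                          ~ NA_real_  # censored before tau: unknown
  )
}

binarize_at(obs_time = c(50, 100, 200), status = c(1, 0, 0), tau = 150)
#> [1]  1 NA  0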
Dealing with Missing Outcome Data
Without censored data points, this conversion would yield appropriate performance estimates since no event outcomes would be missing.
Otherwise, there is the potential for bias due to missingness.
We’ll use tools from causal inference to compensate, creating a propensity score from the probability of being censored (and therefore missing).
Censoring Weights
We currently use a simple, non-informative censoring model called a “reverse Kaplan-Meier” curve.
For a given evaluation time, we can compute the probability of any sample being censored at \(\tau\).
Our metrics use case weights that are the inverse of this probability; see Graf et al. (1999).
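A minimal sketch of the idea with the survival package; `dat`, `obs_time`, and `status` (1 = event, 0 = censored) are hypothetical:

library(survival)

# Flip the status indicator so that censoring becomes the "event";
# a Kaplan-Meier fit to the flipped data estimates the probability
# of remaining uncensored over time (the "reverse Kaplan-Meier" curve).
rkm <- survfit(Surv(obs_time, 1 - status) ~ 1, data = dat)

# Probability of being uncensored at tau and the inverse-probability
# case weight built from it:
tau <- 150
prob_uncensored <- summary(rkm, times = tau)$surv
case_weight <- 1 / prob_uncensored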
Sum of Weights Over Time
Brier Score
The Brier score is a calibration metric originally meant for classification models.
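Using the indicator \(y_{i\tau}\) and the censoring weights from above, the score at evaluation time \(\tau\) can be sketched as a weighted squared error in the spirit of Graf et al. (1999), where \(\hat{p}_{i\tau}\) is the predicted probability of an event by \(\tau\) and \(w_{i\tau}\) is the case weight (the exact yardstick computation may differ in details):
\[
\text{Brier}(\tau) = \frac{1}{N}\sum_{i=1}^{N} w_{i\tau} \left( y_{i\tau} - \hat{p}_{i\tau} \right)^2
\]
Computing it on the validation set: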
# Out-of-sample predictions at many time points.
# Results contain survival probabilities and case weights in a
# list column called `.pred`
val_pred <- augment(mod_fit, new_data = churn_val, eval_time = 5:230)

val_brier <- brier_survival(val_pred, truth = event_time, .pred)

val_brier %>% filter(.eval_time %in% seq(30, 210, by = 30))
#> # A tibble: 7 × 4
#>   .metric        .estimator .eval_time .estimate
#>   <chr>          <chr>           <dbl>     <dbl>
#> 1 brier_survival standard           30   0.00255
#> 2 brier_survival standard           60   0.00879
#> 3 brier_survival standard           90   0.0243
#> 4 brier_survival standard          120   0.0690
#> 5 brier_survival standard          150   0.0977
#> 6 brier_survival standard          180   0.153
#> 7 brier_survival standard          210   0.188
Integrated Brier Score
brier_survival_integrated(val_pred, truth = event_time, .pred)
#> # A tibble: 1 × 3
#>   .metric                   .estimator .estimate
#>   <chr>                     <chr>          <dbl>
#> 1 brier_survival_integrated standard      0.0783
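As a sanity check, the per-time results can be integrated by hand; the trapezoidal approximation below is a rough sketch, not the exact yardstick computation:

# Trapezoidal integration of the Brier scores over the evaluation
# times, normalized by the width of the evaluation-time range.
val_brier %>%
  summarize(
    approx_integrated = sum(
      diff(.eval_time) * (head(.estimate, -1) + tail(.estimate, -1)) / 2
    ) / diff(range(.eval_time))
  )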
Brier Scores Over Evaluation Time
Calibration Over Time
Area Under the ROC Curve
This is more straightforward.
We can use the standard ROC curve machinery once we have the indicators, probabilities, and censoring weights at evaluation time \(\tau\) (Hung and Chiang, 2010).
ROC curves measure the separation between events and non-events and are ignorant of how well-calibrated the probabilities are.
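Conceptually, once we have the binary outcomes, predicted probabilities, and weights at \(\tau\), the computation reduces to a weighted binary ROC problem. A minimal sketch with yardstick; the data frame `binary_tau` and its columns are hypothetical stand-ins for the converted data:

library(yardstick)

# binary_tau: hypothetical data frame with
#   obs          factor with levels "event"/"non_event" at tau
#                (rows censored before tau removed)
#   .pred_event  predicted probability of an event by tau
#   ipcw         inverse probability of censoring weight
roc_auc(binary_tau, truth = obs, .pred_event, case_weights = ipcw)

In practice, roc_auc_survival() packages all of this up: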
val_roc_auc <- roc_auc_survival(val_pred, truth = event_time, .pred)

val_roc_auc %>% filter(.eval_time %in% seq(30, 210, by = 30))
#> # A tibble: 7 × 4
#>   .metric          .estimator .eval_time .estimate
#>   <chr>            <chr>           <dbl>     <dbl>
#> 1 roc_auc_survival standard           30     0.999
#> 2 roc_auc_survival standard           60     0.999
#> 3 roc_auc_survival standard           90     0.984
#> 4 roc_auc_survival standard          120     0.961
#> 5 roc_auc_survival standard          150     0.928
#> 6 roc_auc_survival standard          180     0.835
#> 7 roc_auc_survival standard          210     0.859
ROC AUC Over Evaluation Time
Conclusion
Bad news: statistically, this is pretty convoluted.
Good news: tidymodels handles the details and provides a clean syntax for some complex statistics.
A huge thanks to the tidymodels team: Hannah Frick, Emil Hvitfeldt, and Simon Couch!
Details
“Reverse Kaplan-Meier” Curve
We assume a non-informative censoring model: the "reverse Kaplan-Meier" curve treats censoring as the event of interest, which lets us compute the probability of being censored at any time.
For each prediction at evaluation time \(\tau\), we compute the censoring probability at an adjusted time:
\[
t_{i}^{*} =
\begin{cases}
t_{i} - \epsilon & \text{if } t_{i} \leq \tau \\
\tau & \text{if } t_{i} > \tau
\end{cases}
\]
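Reusing the hypothetical `rkm` fit, `obs_time` vector, and `tau` from the earlier sketch, the weights could be computed along these lines:

# Adjusted times: just before the observed time for events by tau,
# and tau itself for observations still at risk at tau.
eps <- 1e-5
adj_time <- ifelse(obs_time <= tau, obs_time - eps, tau)

# Probability of remaining uncensored at each adjusted time
# (extend = TRUE guards against times past the last follow-up),
# and the resulting inverse-probability case weights:
prob_uncensored <- sapply(
  adj_time,
  function(t) summary(rkm, times = t, extend = TRUE)$surv
)
w <- 1 / prob_uncensored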