Survival analysis with tidymodels: The censored package

class: title-slide, left, middle
background-position: 85% 50%
background-size: 30%
background-color: #F9F8F3

.pull-left-a-lot[

# Survival analysis with tidymodels: The censored package

### Max Kuhn (RStudio) and Hannah Frick (RStudio)

]

---
layout: false
class: inverse, middle, center

## [`tidymodels.org`](https://www.tidymodels.org/)

## _Tidy Modeling with R_ ([`tmwr.org`](https://www.tmwr.org/))

<br>

## Slides: [`rstd.io/censored-rpharma`](https://rstd.io/censored-rpharma)

## Blog: [`tidyverse.org/blog`](https://www.tidyverse.org/blog/2021/11/survival-analysis-parsnip-adjacent/)

## Discussion: [`rstd.io/censored-feedback`](https://rstd.io/censored-feedback)

---
# Censored data

These are where the precise measurement of a time point is not complete.

To illustrate, we'll use data from a [Sliced competition](https://www.kaggle.com/c/sliced-s01e10-playoffs-2/) on adoptions of shelter animals.

The data set is for > 24K dogs and includes predictors based on breed (34), color (15), sex, and other characteristics.

The `age` of the dogs is the outcome with an event indicator called `adopted`.

---
# Types of censoring

* _right censored_: Sadie was born on 2020-01-01 and was not yet adopted as of 2021-11-03.

* _left censored_: We don't know when Sadie was born but she was alive as of 2020-05-01 and was adopted on 2021-11-03.

* _interval censoring_: We don't know when Sadie was born but she was alive as of 2020-05-01 and was not yet adopted as of 2021-11-03.

There are many other aspects of these data, such as multiple event times, correlated event times, and/or competing risks.

---
# The censored package

This package, currently on GitHub, adds engines for the `parsnip` package so that we can fit various types of event time models.

The goals for the package are:

* Provide a clean, consistent user interface and object types.

* Enable additional "guard rails" for these particular models (e.g. empirical validation, eliminate information leakage, etc.).

* Allow users access to additional models and techniques.

To install:

```r
require("remotes")
remotes::install_github("tidymodels/censored")
```

---
# Two Good Dogs for prediction

.pull-left[

Artemis:

]
.pull-right[
Summer:

<img src="images/summer.jpeg" width="70%" style="display: block; margin: auto;" />
]

---
# Two Good Dogs for prediction

Shorthand; all other predictors have values of zero

.pull-left[

```r
artemis <-
  tibble(
    birth_date = 2012,
    sex = "female",
    spay_neuter = "fixed",
    named = 1,
    beagle = 1,
    hound = 1,
    mix = 1,
    color_tricolor = 1,
    pooch = "Artemis"
  )
```

]
.pull-right[

```r
summer <-
  tibble(
    birth_date = 2016,
    sex = "female",
    spay_neuter = "fixed",
    named = 1,
    labrador = 1,
    mix = 1,
    retriever = 1,
    color_yellow = 1,
    pooch = "Summer"
  ) 
```
]

These two samples are in a data frame called `our_dogs`.

---
# Fitting the basic Cox model

```r
library(survival)
library(tidymodels)
library(censored)
tidymodels_prefer()

cph_fit <-
  proportional_hazards() %>%
  fit(Surv(age, adopted) ~ ., data = dogs)
cph_fit
```

```
## parsnip model object
## 
## Fit time:  1.3s 
## Call:
## survival::coxph(formula = Surv(age, adopted) ~ ., data = data, 
##     model = TRUE, x = TRUE)
## 
##                     coef exp(coef) se(coef)     z      p
## birth_date         0.712     2.038    0.005 150.9 <2e-16
## sexmale           -0.063     0.939    0.013  -4.7  2e-06
## spay_neuterintact -0.286     0.751    0.021 -13.6 <2e-16
## named             -0.370     0.691    0.022 -16.6 <2e-16
## american           0.138     1.148    0.060   2.3  0.021
## australian         0.116     1.123    0.053   2.2  0.028
## beagle            -0.177     0.838    0.050  -3.5  4e-04
## border             0.106     1.112    0.073   1.5  0.144
## boxer             -0.046     0.955    0.043  -1.1  0.284
## bull              -0.042     0.958    0.129  -0.3  0.742
## bulldog           -0.189     0.827    0.067  -2.8  0.005
## catahoula          0.070     1.072    0.054   1.3  0.197
## cattle            -0.055     0.947    0.062  -0.9  0.383
## chihuahua         -0.110     0.896    0.061  -1.8  0.070
## collie             0.045     1.046    0.074   0.6  0.539
## dachshund          0.135     1.144    0.036   3.8  1e-04
## german            -0.099     0.906    0.054  -1.8  0.066
## great              0.050     1.052    0.054   0.9  0.353
## hound              0.097     1.101    0.048   2.0  0.043
## husky             -0.413     0.661    0.133  -3.1  0.002
## jack               0.095     1.100    0.158   0.6  0.549
## labrador          -0.167     0.846    0.073  -2.3  0.021
## miniature          0.082     1.085    0.060   1.4  0.174
## mix               -0.143     0.866    0.018  -7.9  2e-15
## pit               -0.123     0.884    0.131  -0.9  0.345
## pointer            0.071     1.073    0.056   1.3  0.204
## poodle            -0.075     0.928    0.065  -1.2  0.250
## rat               -0.053     0.948    0.063  -0.8  0.400
## retriever          0.255     1.290    0.072   3.5  4e-04
## rottweiler        -0.021     0.979    0.059  -0.4  0.717
## russell           -0.051     0.950    0.154  -0.3  0.742
## schnauzer          0.105     1.110    0.074   1.4  0.156
## shepherd           0.126     1.134    0.049   2.5  0.011
## shorthair          0.145     1.156    0.060   2.4  0.016
## siberian           0.308     1.360    0.140   2.2  0.028
## staffordshire     -0.254     0.776    0.056  -4.6  5e-06
## terrier            0.068     1.070    0.037   1.9  0.063
## yorkshire         -0.096     0.908    0.064  -1.5  0.135
## color_black        0.018     1.018    0.019   0.9  0.346
## color_blue         0.018     1.018    0.035   0.5  0.604
## color_brindle      0.138     1.148    0.029   4.7  3e-06
## color_brown       -0.033     0.968    0.022  -1.5  0.145
## color_buff         0.271     1.311    0.058   4.7  3e-06
## color_chocolate    0.084     1.088    0.043   1.9  0.051
## color_cream        0.094     1.098    0.054   1.7  0.082
## color_gray        -0.180     0.836    0.053  -3.4  7e-04
## color_merle       -0.015     0.986    0.056  -0.3  0.796
## color_red          0.032     1.032    0.034   0.9  0.343
## color_sable        0.427     1.533    0.057   7.5  9e-14
## color_tan          0.004     1.004    0.022   0.2  0.871
## color_tricolor     0.146     1.157    0.037   3.9  1e-04
## color_white        0.082     1.086    0.017   4.9  1e-06
## color_yellow       0.128     1.137    0.056   2.3  0.022
## 
## Likelihood ratio test=38739  on 53 df, p=<2e-16
## n= 24134, number of events= 23018
```

---
# Tidied

```r
tidy(cph_fit)
```

```
## # A tibble: 53 × 5
##    term              estimate std.error statistic  p.value
##    <chr>                <dbl>     <dbl>     <dbl>    <dbl>
##  1 birth_date          0.712    0.00472   151.    0       
##  2 sexmale            -0.0632   0.0133     -4.74  2.09e- 6
##  3 spay_neuterintact  -0.286    0.0210    -13.6   4.17e-42
##  4 named              -0.370    0.0223    -16.6   6.92e-62
##  5 american            0.138    0.0600      2.31  2.12e- 2
##  6 australian          0.116    0.0528      2.20  2.80e- 2
##  7 beagle             -0.177    0.0501     -3.53  4.09e- 4
##  8 border              0.106    0.0727      1.46  1.44e- 1
##  9 boxer              -0.0465   0.0434     -1.07  2.84e- 1
## 10 bull               -0.0425   0.129      -0.329 7.42e- 1
## # … with 43 more rows
```

There are prediction methods for `type = c("linear_pred", "survival", "time")`.

---
# Survival Probabilities

.code80[

.pull-left-a-lot[

```r
times <- 1:3000
predict(cph_fit, our_dogs, type = "survival", time = times) 
```

tidymodels always generates _one row of predictions_ for each row of the input data set.

]
.pull-right-a-little[

```
## # A tibble: 2 × 1
##   .pred               
##   <list>              
## 1 <tibble [3,000 × 2]>
## 2 <tibble [3,000 × 2]>
```
]

]

---
# Survival Probabilities

.code80[

.pull-left-a-lot[

```r
predict(cph_fit, our_dogs, type = "survival", time = times) %>% 
  bind_cols(our_dogs %>% select(pooch)) %>%
  unnest(cols = c(.pred))
```

The column names are standardized for each type of prediction.

]
.pull-right-a-little[

```
## # A tibble: 6,000 × 3
##    .time .pred_survival pooch  
##    <int>          <dbl> <chr>  
##  1     1           1    Artemis
##  2     2           1.00 Artemis
##  3     3           1.00 Artemis
##  4     4           1.00 Artemis
##  5     5           1.00 Artemis
##  6     6           1.00 Artemis
##  7     7           1.00 Artemis
##  8     8           1.00 Artemis
##  9     9           1.00 Artemis
## 10    10           1.00 Artemis
## # … with 5,990 more rows
```
]

]

---
# Survival Probabilities

.code80[

.pull-left-a-lot[

```r
# Let's make a plotting function
surv_curves <- function(x) {
  bind_cols(x, our_dogs %>% select(pooch)) %>%
  unnest(.pred) %>%
  mutate(years = .time / 365) %>% 
  ggplot(aes(x = years, y = .pred_survival, group = pooch, 
             col = pooch)) +
  geom_step(lwd = 1.25) + 
  labs(x = "Age (years)", y = "Survival")
}

predict(cph_fit, our_dogs, type = "survival", time = times) %>% 
  surv_curves()
```

]
.pull-right-a-little[

<img src="images/cph-surv-probs-1.svg" width="100%" style="display: block; margin: auto;" />
]

]

---
# A regularized model with strata

`glmnet` has an option for fitting Cox models but...

* matrices are required (i.e. no formulas)
 * strata as specified as part of the _outcome_ in its own special syntax and function (`stratifySurv()`).

We allow the standard formula interface for this model and you can use the usual  `strata()` function:

```r
cph_glmnet_strata_fit <-
  proportional_hazards(penalty = 0.1, mixture = 0.75) %>%
  set_engine("glmnet") %>%
  fit(Surv(age, adopted) ~ . + strata(sex), data = dogs)
```

Prediction code looks the same.

---
# A bagged tree

The nice thing about `parsnip` models is that the model syntax is very similar from model-to-model. For example, if we use a bagged tree:

.code80[

.pull-left-a-lot[

```r
bag_fit <-
  bag_tree() %>%
  set_mode("censored regression") %>%
  set_engine("rpart", times = 50) %>%
  fit(Surv(age, adopted) ~ ., data = dogs)

predict(bag_fit, our_dogs, type = "survival", time = times) %>% 
  surv_curves()
```

]
.pull-right-a-little[

<img src="images/tree-1.svg" width="100%" style="display: block; margin: auto;" />
]

]

Other models: `boost_tree()`, `decision_tree()`, `rand_forest()`, and `survival_reg()`.

---
# Opionionated choices

* Linear predictors are always increasing with time. tidymodels assumes that you can compute performance without knowing about the model type or other data.

* Many survival ROC methods are not designed for assessing predictions. We will implement the one that is most appropriate.

* Current specifications for baseline survival groups are very limited. We'll give a more versatile tool for declaring the baseline.

---
# What's next

* Time-dependent covariates
* Additional models/prediction methods
* Performance metrics (in the `yardstick` package)
* Model tuning (with the `tune` package)

# Thanks

Special thanks to Emil Hvitfeldt for creating the first version of censored.