class: title-slide, left, middle background-position: 85% 50% background-size: 30% background-color: #F9F8F3 .pull-left-a-lot[ # Survival analysis with tidymodels: The censored package ### Max Kuhn (RStudio) and Hannah Frick (RStudio) ] --- layout: false class: inverse, middle, center ## [`tidymodels.org`](https://www.tidymodels.org/) ## _Tidy Modeling with R_ ([`tmwr.org`](https://www.tmwr.org/)) <br> ## Slides: [`rstd.io/censored-rpharma`](https://rstd.io/censored-rpharma) ## Blog: [`tidyverse.org/blog`](https://www.tidyverse.org/blog/2021/11/survival-analysis-parsnip-adjacent/) ## Discussion: [`rstd.io/censored-feedback`](https://rstd.io/censored-feedback) --- # Censored data These are where the precise measurement of a time point is not complete. To illustrate, we'll use data from a [Sliced competition](https://www.kaggle.com/c/sliced-s01e10-playoffs-2/) on adoptions of shelter animals. The data set is for > 24K dogs and includes predictors based on breed (34), color (15), sex, and other characteristics. The `age` of the dogs is the outcome with an event indicator called `adopted`. --- # Types of censoring * _right censored_: Sadie was born on 2020-01-01 and was not yet adopted as of 2021-11-03. * _left censored_: We don't know when Sadie was born but she was alive as of 2020-05-01 and was adopted on 2021-11-03. * _interval censoring_: We don't know when Sadie was born but she was alive as of 2020-05-01 and was not yet adopted as of 2021-11-03. There are many other aspects of these data, such as multiple event times, correlated event times, and/or competing risks. --- # The censored package This package, currently on GitHub, adds engines for the `parsnip` package so that we can fit various types of event time models. The goals for the package are: * Provide a clean, consistent user interface and object types. * Enable additional "guard rails" for these particular models (e.g. empirical validation, eliminate information leakage, etc.). * Allow users access to additional models and techniques. To install: ```r require("remotes") remotes::install_github("tidymodels/censored") ``` --- # Two Good Dogs for prediction .pull-left[ Artemis: <img src="images/artemis.jpeg" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ Summer: <img src="images/summer.jpeg" width="70%" style="display: block; margin: auto;" /> ] --- # Two Good Dogs for prediction Shorthand; all other predictors have values of zero .pull-left[ ```r artemis <- tibble( birth_date = 2012, sex = "female", spay_neuter = "fixed", named = 1, beagle = 1, hound = 1, mix = 1, color_tricolor = 1, pooch = "Artemis" ) ``` ] .pull-right[ ```r summer <- tibble( birth_date = 2016, sex = "female", spay_neuter = "fixed", named = 1, labrador = 1, mix = 1, retriever = 1, color_yellow = 1, pooch = "Summer" ) ``` ] These two samples are in a data frame called `our_dogs`. --- # Fitting the basic Cox model ```r library(survival) library(tidymodels) library(censored) tidymodels_prefer() cph_fit <- proportional_hazards() %>% fit(Surv(age, adopted) ~ ., data = dogs) cph_fit ``` ``` ## parsnip model object ## ## Fit time: 1.3s ## Call: ## survival::coxph(formula = Surv(age, adopted) ~ ., data = data, ## model = TRUE, x = TRUE) ## ## coef exp(coef) se(coef) z p ## birth_date 0.712 2.038 0.005 150.9 <2e-16 ## sexmale -0.063 0.939 0.013 -4.7 2e-06 ## spay_neuterintact -0.286 0.751 0.021 -13.6 <2e-16 ## named -0.370 0.691 0.022 -16.6 <2e-16 ## american 0.138 1.148 0.060 2.3 0.021 ## australian 0.116 1.123 0.053 2.2 0.028 ## beagle -0.177 0.838 0.050 -3.5 4e-04 ## border 0.106 1.112 0.073 1.5 0.144 ## boxer -0.046 0.955 0.043 -1.1 0.284 ## bull -0.042 0.958 0.129 -0.3 0.742 ## bulldog -0.189 0.827 0.067 -2.8 0.005 ## catahoula 0.070 1.072 0.054 1.3 0.197 ## cattle -0.055 0.947 0.062 -0.9 0.383 ## chihuahua -0.110 0.896 0.061 -1.8 0.070 ## collie 0.045 1.046 0.074 0.6 0.539 ## dachshund 0.135 1.144 0.036 3.8 1e-04 ## german -0.099 0.906 0.054 -1.8 0.066 ## great 0.050 1.052 0.054 0.9 0.353 ## hound 0.097 1.101 0.048 2.0 0.043 ## husky -0.413 0.661 0.133 -3.1 0.002 ## jack 0.095 1.100 0.158 0.6 0.549 ## labrador -0.167 0.846 0.073 -2.3 0.021 ## miniature 0.082 1.085 0.060 1.4 0.174 ## mix -0.143 0.866 0.018 -7.9 2e-15 ## pit -0.123 0.884 0.131 -0.9 0.345 ## pointer 0.071 1.073 0.056 1.3 0.204 ## poodle -0.075 0.928 0.065 -1.2 0.250 ## rat -0.053 0.948 0.063 -0.8 0.400 ## retriever 0.255 1.290 0.072 3.5 4e-04 ## rottweiler -0.021 0.979 0.059 -0.4 0.717 ## russell -0.051 0.950 0.154 -0.3 0.742 ## schnauzer 0.105 1.110 0.074 1.4 0.156 ## shepherd 0.126 1.134 0.049 2.5 0.011 ## shorthair 0.145 1.156 0.060 2.4 0.016 ## siberian 0.308 1.360 0.140 2.2 0.028 ## staffordshire -0.254 0.776 0.056 -4.6 5e-06 ## terrier 0.068 1.070 0.037 1.9 0.063 ## yorkshire -0.096 0.908 0.064 -1.5 0.135 ## color_black 0.018 1.018 0.019 0.9 0.346 ## color_blue 0.018 1.018 0.035 0.5 0.604 ## color_brindle 0.138 1.148 0.029 4.7 3e-06 ## color_brown -0.033 0.968 0.022 -1.5 0.145 ## color_buff 0.271 1.311 0.058 4.7 3e-06 ## color_chocolate 0.084 1.088 0.043 1.9 0.051 ## color_cream 0.094 1.098 0.054 1.7 0.082 ## color_gray -0.180 0.836 0.053 -3.4 7e-04 ## color_merle -0.015 0.986 0.056 -0.3 0.796 ## color_red 0.032 1.032 0.034 0.9 0.343 ## color_sable 0.427 1.533 0.057 7.5 9e-14 ## color_tan 0.004 1.004 0.022 0.2 0.871 ## color_tricolor 0.146 1.157 0.037 3.9 1e-04 ## color_white 0.082 1.086 0.017 4.9 1e-06 ## color_yellow 0.128 1.137 0.056 2.3 0.022 ## ## Likelihood ratio test=38739 on 53 df, p=<2e-16 ## n= 24134, number of events= 23018 ``` --- # Tidied ```r tidy(cph_fit) ``` ``` ## # A tibble: 53 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 birth_date 0.712 0.00472 151. 0 ## 2 sexmale -0.0632 0.0133 -4.74 2.09e- 6 ## 3 spay_neuterintact -0.286 0.0210 -13.6 4.17e-42 ## 4 named -0.370 0.0223 -16.6 6.92e-62 ## 5 american 0.138 0.0600 2.31 2.12e- 2 ## 6 australian 0.116 0.0528 2.20 2.80e- 2 ## 7 beagle -0.177 0.0501 -3.53 4.09e- 4 ## 8 border 0.106 0.0727 1.46 1.44e- 1 ## 9 boxer -0.0465 0.0434 -1.07 2.84e- 1 ## 10 bull -0.0425 0.129 -0.329 7.42e- 1 ## # … with 43 more rows ``` There are prediction methods for `type = c("linear_pred", "survival", "time")`. --- # Survival Probabilities .code80[ .pull-left-a-lot[ ```r times <- 1:3000 predict(cph_fit, our_dogs, type = "survival", time = times) ``` tidymodels always generates _one row of predictions_ for each row of the input data set. ] .pull-right-a-little[ ``` ## # A tibble: 2 × 1 ## .pred ## <list> ## 1 <tibble [3,000 × 2]> ## 2 <tibble [3,000 × 2]> ``` ] ] --- # Survival Probabilities .code80[ .pull-left-a-lot[ ```r predict(cph_fit, our_dogs, type = "survival", time = times) %>% bind_cols(our_dogs %>% select(pooch)) %>% unnest(cols = c(.pred)) ``` The column names are standardized for each type of prediction. ] .pull-right-a-little[ ``` ## # A tibble: 6,000 × 3 ## .time .pred_survival pooch ## <int> <dbl> <chr> ## 1 1 1 Artemis ## 2 2 1.00 Artemis ## 3 3 1.00 Artemis ## 4 4 1.00 Artemis ## 5 5 1.00 Artemis ## 6 6 1.00 Artemis ## 7 7 1.00 Artemis ## 8 8 1.00 Artemis ## 9 9 1.00 Artemis ## 10 10 1.00 Artemis ## # … with 5,990 more rows ``` ] ] --- # Survival Probabilities .code80[ .pull-left-a-lot[ ```r # Let's make a plotting function surv_curves <- function(x) { bind_cols(x, our_dogs %>% select(pooch)) %>% unnest(.pred) %>% mutate(years = .time / 365) %>% ggplot(aes(x = years, y = .pred_survival, group = pooch, col = pooch)) + geom_step(lwd = 1.25) + labs(x = "Age (years)", y = "Survival") } predict(cph_fit, our_dogs, type = "survival", time = times) %>% surv_curves() ``` ] .pull-right-a-little[ <img src="images/cph-surv-probs-1.svg" width="100%" style="display: block; margin: auto;" /> ] ] --- # A regularized model with strata `glmnet` has an option for fitting Cox models but... * matrices are required (i.e. no formulas) * strata as specified as part of the _outcome_ in its own special syntax and function (`stratifySurv()`). We allow the standard formula interface for this model and you can use the usual `strata()` function: ```r cph_glmnet_strata_fit <- proportional_hazards(penalty = 0.1, mixture = 0.75) %>% set_engine("glmnet") %>% fit(Surv(age, adopted) ~ . + strata(sex), data = dogs) ``` Prediction code looks the same. --- # A bagged tree The nice thing about `parsnip` models is that the model syntax is very similar from model-to-model. For example, if we use a bagged tree: .code80[ .pull-left-a-lot[ ```r bag_fit <- bag_tree() %>% set_mode("censored regression") %>% set_engine("rpart", times = 50) %>% fit(Surv(age, adopted) ~ ., data = dogs) predict(bag_fit, our_dogs, type = "survival", time = times) %>% surv_curves() ``` ] .pull-right-a-little[ <img src="images/tree-1.svg" width="100%" style="display: block; margin: auto;" /> ] ] Other models: `boost_tree()`, `decision_tree()`, `rand_forest()`, and `survival_reg()`. --- # Opionionated choices * Linear predictors are always increasing with time. tidymodels assumes that you can compute performance without knowing about the model type or other data. * Many survival ROC methods are not designed for assessing predictions. We will implement the one that is most appropriate. * Current specifications for baseline survival groups are very limited. We'll give a more versatile tool for declaring the baseline. --- # What's next * Time-dependent covariates * Additional models/prediction methods * Performance metrics (in the `yardstick` package) * Model tuning (with the `tune` package) # Thanks Special thanks to Emil Hvitfeldt for creating the first version of censored.