How long will you live? Should you spring for that AppleCare+ warranty for your iPhone? When will your buddy pay you back for that lunch?
For centuries, soothsayers have striven to understand the lifespan of things – be they patient longevity, product lifecycles, or even time to loan default. Nowadays, scientists have turned away from reading tea leaves and toward survival analysis – a complex data science method for predicting not only whether an event will happen (the death of a patient, the failure of a product or machine, default on a payment, and so on) but when this event is likely to occur.
But there is a catch. Until now, the tools of survival analysis have only been applicable in certain settings. This is due to the inherent heterogeneity of what is being analyzed: differences in patient lifestyles, demographics, product usage patterns, and so on.
New research by Goizueta’s Donald Lee, associate professor of information systems and operations management and of biostatistics and bioinformatics, has yielded a new tool that greatly extends survival analysis to broader use cases.
“Historically, scientists have used classic survival analysis tools to predict the lifespan of different things in different fields, from products to patients,” Lee said. “Since the 1950s, the Kaplan-Meier estimator has been the benchmark for analyzing lifetime data, particularly in clinical trials. The next breakthrough came in the 1970s when the Cox proportional hazards model was introduced, which allows researchers to incorporate variables that can affect the predictability of things like patient mortality.”
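For readers curious about what the Kaplan-Meier estimator actually computes, here is a minimal sketch in plain Python. It is an illustration only, not code from Lee's work: the function name `kaplan_meier` and the five-item data set are made up for this example. At each time an event is observed, the running survival probability is multiplied by the fraction of subjects still under observation who did not experience the event.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: how long each subject was followed.
    observed:  True if the event (e.g. failure) was seen at that time,
               False if the subject was censored (dropped out still "alive").
    """
    events = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # every subject leaves the risk set at its own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Fraction of those still at risk that survive past time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical data: months until five phones failed (True) or were
# lost to follow-up while still working (False).
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
for time, prob in curve:
    print(f"month {time}: estimated survival {prob:.2f}")
```

With this toy data the estimated survival probability steps down to about 0.8, 0.6 and 0.3 at months 2, 3 and 5. Note that the estimator produces one curve for the whole group; it has no way to account for differences between individual phones or patients, which is the gap the Cox model was introduced to fill.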
The problem with the existing survival analysis tools, Lee said, is that they make certain assumptions that can skew the predictions if the assumptions are not met.
“There are very few existing tools that can incorporate variables without imposing assumptions on how they affect survival, let alone when there are a lot of variables that can also change over time. For example, two iPhones will have different lifespans depending on the temperature at which they are stored, amongst many other factors. But it’s unlikely that storing your phone at 30 degrees will halve its lifespan compared to storing it at 60 degrees. This sort of linear relationship is commonly assumed by existing tools.”
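To make the kind of assumption Lee describes concrete, the Cox proportional hazards model is usually written as below. The notation is the standard textbook form rather than anything taken from Lee's paper: h(t | x) is the risk of failure at time t for an item with variables x, h_0(t) is a shared baseline risk, and β holds the fitted coefficients.

```latex
h(t \mid x) = h_0(t)\, \exp\!\left(\beta^{\top} x\right)
\qquad\Longrightarrow\qquad
\frac{h(t \mid x_1)}{h(t \mid x_2)} = \exp\!\left(\beta^{\top} (x_1 - x_2)\right)
```

In this form the logarithm of the risk is a linear function of the variables, and the ratio of risks between any two items stays the same at every point in time. If storage temperature were the only variable, each extra degree would multiply the risk by the same fixed factor, which is the sort of rigid relationship the quote refers to.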
Lee’s team developed a new survival methodology based on something called gradient boosting: a machine learning technique that combines decision trees to yield predictions. The method, Lee said, is totally assumption-free (or nonparametric in technical parlance) and can deal with a large number of variables that can change continuously over time, making it significantly more general than existing methods. Nothing like it has been seen until now, he noted.
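The general idea behind gradient boosting can be sketched in a few lines of plain Python. This is emphatically not BoXHED and not Lee's algorithm: it is a generic toy version for an ordinary prediction task, with one input variable, the simplest possible decision trees (a single split each), and invented numbers. The function names `fit_stump`, `boost` and `predict` are hypothetical. Each round fits a small tree to the errors left over by the trees before it and adds a scaled-down copy of that tree to the running prediction.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) for the best one-split tree."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def stump_value(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, rounds=100, learning_rate=0.1):
    """Fit a boosted ensemble of one-split trees under squared error."""
    base = sum(ys) / len(ys)  # start from the plain average
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_value(stump, x)
                       for x, p in zip(xs, predictions)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, x) for s in stumps)


# Invented data: storage temperature (degrees) versus phone lifespan (years).
xs = [10, 20, 30, 40, 50, 60]
ys = [5.0, 5.0, 4.8, 4.0, 2.5, 1.0]
model = boost(xs, ys)
print(f"predicted lifespan at 25 degrees: {predict(model, 25):.2f} years")
```

Because the trees are free to split wherever the data suggests, the combined model can follow a curve like the one in this toy data, flat at low temperatures and dropping sharply at high ones, without being told in advance what shape the relationship should take.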
“Calculating the survival rate of anything is super complex because of the variables. Say you want to create an app for a smart watch that monitors the wearer’s vitals and use this information to create a real-time warning indicator for stroke. Doing this accurately is difficult for two reasons,” Lee explained.
“First, a large number of variables may be relevant to stroke risk, and the variables can interact in ways that break the assumptions central to existing survival analysis methods. And second, variables like blood pressure vary over time, and it is the recent measurements that are most informative. This introduces an additional time dimension that further complicates things.”
The software implementation of Lee’s method, BoXHED, overcomes both issues and allows scientists to develop real-time predictive models for conditions like stroke. The trained model can then be ported to a watch app, where it tells the wearer if and when they’re likely to have a stroke. Running a trained model in this way is known as inferencing in machine learning lingo.
The implications, Lee said, are huge.
“BoXHED now opens the door for modern applications of survival analysis. In previous research, I have looked at the design of early warning mortality indicators for patients with advanced cancer and also for patients in the ICU. These use other methods to make predictions at fixed points in time, but now they can be transformed into real-time warning indicators using BoXHED.”
He cited the case of end-stage cancer patients who are often better served by hospice care than by aggressive therapy.
“Accurate predictions of survival are absolutely critical for care planning. In previous analyses, we have seen that using existing predictive models to inform end-of-life care planning can potentially avert $1.9 million in medical costs and 1,600 days of unnecessary inpatient care per 1,000 patient visits in the United States. BoXHED is likely to lead to even better results.”
Lee’s research paper is forthcoming in the Annals of Statistics. He has also released BoXHED as open-source software, which can radically improve the accuracy of survival analysis across a breadth of applications. The paper describing BoXHED was published at the International Conference on Machine Learning, and the latest version of the software can be found at www.github.com/boxhed.