Survival Analysis with Machine Learning for Predicting the Risk of Adverse Outcomes

Study type
Protocol
Date of Approval
Study reference ID
17_224
Lay Summary

The growing availability of complex and comprehensive clinical datasets, such as electronic health records, together with innovations in machine learning and increases in computing power, has made possible new data-driven approaches for automatic variable (or feature) extraction and selection, and for discovering interdependencies among variables. To date, however, empirical evidence for the utility of such tools in clinical applications has been limited.

The purpose of this study is to use the Clinical Practice Research Datalink (CPRD) to evaluate how machine learning and deep learning compare with conventional statistical survival models applied to electronic health records. To this end, we will focus on one particular problem, predicting the risk of emergency hospital admission for avoidable causes, and compare the performance of our models with previously published reports that investigated the same question using standard statistical survival analysis [1].

Technical Summary

We will study patients aged between 18 and 100 years with a valid Index of Multiple Deprivation (IMD) score who have been registered in the system for at least one year before the beginning of the study period. Data for each patient are aggregated up to the beginning of the study period and used to predict the risk of an emergency admission over follow-up periods of different lengths, e.g., 3, 6, 12, 24, and 48 months.
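
As an illustration only, the inclusion criteria and the fixed follow-up horizons could be expressed along the following lines. This is a minimal sketch under stated assumptions: the record field names (birth_date, registration_date, imd_score, first_emergency_admission) are hypothetical and are not CPRD field names, a month is approximated as 30.4375 days, and censoring by death or deregistration is deliberately ignored.

```python
from datetime import date

# Follow-up horizons in months, taken from the examples given in the text.
HORIZONS_MONTHS = (3, 6, 12, 24, 48)
DAYS_PER_MONTH = 30.4375  # assumption: average month length


def is_eligible(patient, study_start):
    """Check the inclusion criteria for one (hypothetical) patient record."""
    # Age is approximated using an average year length of 365.25 days.
    age = (study_start - patient["birth_date"]).days / 365.25
    registered_days = (study_start - patient["registration_date"]).days
    return (
        18 <= age <= 100
        and patient["imd_score"] is not None
        and registered_days >= 365
    )


def outcome_at_horizon(patient, study_start, horizon_months):
    """Return (duration_in_days, event_observed), censored at the horizon.

    Assumes first_emergency_admission is the first admission on or after
    study_start, or None if no admission was recorded.
    """
    horizon_days = round(horizon_months * DAYS_PER_MONTH)
    admission = patient["first_emergency_admission"]
    if admission is not None and admission >= study_start:
        days_to_event = (admission - study_start).days
        if days_to_event <= horizon_days:
            return days_to_event, True
    # No admission within the horizon: censor at the end of follow-up.
    return horizon_days, False
```

Calling outcome_at_horizon once per value in HORIZONS_MONTHS yields one time-to-event outcome per follow-up length, so that the same baseline features can be reused across horizons.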

We will first build a Cox proportional hazards model [1,2], which will serve as the baseline in our comparisons. We will investigate various techniques for preprocessing the data [3,4], as well as for building better prediction models. We will use ensemble-learning techniques such as Random Forest [5] and Gradient Boosting Machines [6], which have been shown to perform well on various types of data. Both methods carry out variable selection and modelling simultaneously, which makes them convenient to use; they also rely on minimal parametric assumptions, which makes them appropriate for mixed-type feature spaces such as ours. More details are provided in section N (Data/Statistical Analysis).
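
For readers unfamiliar with the baseline, the quantity that a Cox proportional hazards model maximises can be sketched in a few lines. The function below is an illustrative sketch, not the study's implementation: the protocol does not specify software, the function and argument names are hypothetical, and tied event times are handled with the simple Breslow convention (each event is compared against the full risk set at its time).

```python
import math


def cox_partial_log_likelihood(beta, covariates, durations, events):
    """Cox partial log-likelihood for coefficients `beta`.

    covariates: list of feature rows, one per patient
    durations:  follow-up time for each patient
    events:     True if the event was observed, False if censored
    """
    # Linear predictor (log relative hazard) for each patient.
    scores = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    total = 0.0
    for i, (t_i, observed) in enumerate(zip(durations, events)):
        if not observed:
            continue  # censored patients contribute only through risk sets
        # Risk set: patients still under observation at time t_i.
        risk_sum = sum(
            math.exp(scores[j])
            for j, t_j in enumerate(durations)
            if t_j >= t_i
        )
        total += scores[i] - math.log(risk_sum)
    return total
```

Fitting the model amounts to choosing beta to maximise this value. The ensemble methods mentioned above avoid the log-linear form of the score altogether, which is one reason they are treated as comparators rather than extensions of the baseline.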

Collaborators

Kazem Rahimi - Chief Investigator - The George Institute for Global Health
Fatemeh Rahimian - Corresponding Applicant - The George Institute for Global Health
Amir Hossein Payberah - Collaborator - The George Institute for Global Health
Dexter Canoy - Collaborator - The George Institute for Global Health
Reza Salimi Khorshidi - Collaborator - The George Institute for Global Health

Linkages

HES Accident and Emergency; HES Admitted Patient Care; ONS Death Registration Data; Patient Level Index of Multiple Deprivation