Application of machine learning algorithms on electronic health records to examine patterns of disease clustering and trajectories

Study type
Protocol
Date of Approval
Study reference ID
20_095
Lay Summary

Understanding how different health conditions develop and combine, and how these combinations affect an individual’s risk of further health problems, can help policy planning and inform future research. The availability of data from patient records, together with new ways of analysing complex data, could help improve the accuracy of existing methods for predicting who might be at risk. It could also help characterise patterns in recorded health data of conditions that tend to occur together, which could be of clinical relevance. To address these questions, we plan to conduct our research in two phases. Phase 1 focuses on refining our analytical approaches and using them to discover diseases that often cluster together; Phase 2 will explore the impact of these disease clusters on future health. In Phase 1 (this protocol), we will use machine learning methods to develop advanced analytical techniques, applied to de-identified health records, to find out which diseases tend to co-occur and how these patterns of co-occurrence evolve over time. In Phase 2 (protocol to be submitted after conducting Phase 1), we will explore how these disease clusters are associated with future risk of death and other health outcomes.

Technical Summary

Phenotyping (characterisation of individuals with specific features of interest, such as exposure to a risk factor of interest, a health outcome, or a target variable) is an important step in electronic health records (EHR)-based research. Building on our previous work and expertise, we aim to analyse EHRs to gain a better understanding of evolving patterns of multimorbidity and to define ‘high-throughput’ phenotyping based on machine learning methods. We will study patients aged ≥16 years with primary health care data linked to other administrative databases, such as Hospital Episode Statistics (HES) and mortality records, to capture patients’ health care journeys. The purpose of this two-stage study is to use the Clinical Practice Research Datalink (CPRD) to: tackle the methodological challenges of a dynamic framework for efficient and scalable phenotyping; describe and characterise disease clusters and trajectories in the population; explore the nature of the associations between diseases in poorly understood disease clusters and their underlying determinants; examine the consequences of multimorbidity; and provide estimates of uncertainty for these risk associations. Using machine learning methods, we aim to conduct valid, accurate and reliable phenotyping, which could improve the accuracy of predictive modelling. Through the application of state-of-the-art machine learning approaches that capture complex temporal disease interactions, we will identify and describe disease clusters and trajectories (Phase 1). By taking account of the entire disease trajectory of individuals, we will assess the health impact and survival associated with the disease clusters identified through the data-driven methodology (Phase 2, separate protocol), using conventional statistical methods for survival and longitudinal data.
Through stepwise elimination of spurious disease-disease interactions and the application of multiple methods, we will work towards models that help identify likely causal links and explanatory factors, particularly for less well-understood disease cluster associations.

Health Outcomes to be Measured

In this proposal (Phase 1), we will develop machine learning methods to identify disease clusters and their trajectories. We will characterise co-occurrences of chronic diseases (long-term conditions), some of which we have previously identified as clinically important in the UK or on the basis of criteria set by relevant clinical guidelines.
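
As a purely illustrative sketch of what characterising co-occurrence can mean in its simplest form, the Python snippet below counts how often pairs of conditions appear in the same patient record and compares that with the count expected if the conditions were independent. This is not the method of this protocol, which does not specify its algorithms here; the function name `cooccurrence_lift`, the observed/expected ratio, and the toy records are all assumptions made for illustration, and the records are invented, not CPRD data.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_lift(patient_conditions):
    """Return {(a, b): observed / expected} for every condition pair seen together.

    `patient_conditions` is a list with one list of condition names per patient.
    The expected count assumes the two conditions occur independently.
    (Hypothetical helper for illustration only.)
    """
    n_patients = len(patient_conditions)
    single_counts = Counter()
    pair_counts = Counter()
    for conditions in patient_conditions:
        # De-duplicate and sort so each pair has one canonical ordering.
        unique = sorted(set(conditions))
        single_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    lift = {}
    for (a, b), observed in pair_counts.items():
        expected = single_counts[a] * single_counts[b] / n_patients
        lift[(a, b)] = observed / expected
    return lift


# Toy, made-up records (not real patient data).
records = [
    ["hypertension", "diabetes"],
    ["hypertension", "diabetes", "ckd"],
    ["asthma"],
    ["hypertension"],
]
lift = cooccurrence_lift(records)
for pair, ratio in sorted(lift.items()):
    print(pair, round(ratio, 2))
```

A ratio above 1 means the pair appears together more often than independence would suggest. A static count like this ignores the ordering and timing of diagnoses, which is the aspect the temporal machine learning approaches described in the Technical Summary are intended to capture.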

Collaborators

Kazem Rahimi - Chief Investigator - The George Institute for Global Health
Kazem Rahimi - Corresponding Applicant - The George Institute for Global Health
Abdelaali Hassaine - Collaborator - University of Oxford
Mohammad Mamouei - Collaborator - University of Oxford
Rema Ramakrishnan - Collaborator - University of Oxford
Shishir Rao - Collaborator - University of Oxford
Yikuan Li - Collaborator - University of Oxford
Yutong Cai - Collaborator - University of Oxford
Zhengxian Fan - Collaborator - University of Oxford

Former Collaborators

Yajie Zhu - Collaborator - University of Oxford

Linkages

HES Accident and Emergency; HES Admitted Patient Care; HES Outpatient; ONS Death Registration Data; Patient Level Index of Multiple Deprivation; Practice Level Index of Multiple Deprivation