We often want to understand what happens to patients who have certain diseases or undergo certain procedures, but many other factors can influence the final outcome. For example, patients might have other diseases that carry a bigger risk of death than the disease being studied. Taking all these other factors into account can become complicated, given the increasing number of co-existing illnesses that people have in our ageing population. Therefore, researchers often combine each patient's disease burden into a single score that predicts their risk of death. This score can then be used in studies to adjust for underlying differences in patients' risk of death.
The purpose of this study is to produce a new method for summarising the burden of life-limiting disease that each patient has, making use of the increased amount of information available in linked primary care and secondary care data. In particular, I will use novel methods that systematically assess the patterns in combinations of diseases rather than just simple counts of diseases. This will provide a better tool for researchers to use in studies using databases like the CPRD.
Identifying patterns of co-morbidity that predict patient groups with reduced survival can identify risk groupings that can be used to adjust for case mix. However, the quantity of clinical coding in linked routine datasets means that systematically assessing each potential combination through traditional statistical methods is not feasible. Previous work has used Bayesian information sharing between similar codes in a fixed a priori hierarchy. This current proposal will use supervised learning to identify a bespoke hierarchy, allowing codes from different disease categories to be grouped together. The method uses latent Dirichlet allocation to learn a latent distribution for these categories. This has been adapted to be supervised by a penalised Cox proportional hazards model predicting mortality, which will be fitted using cyclical coordinate descent so as to be stable with high-dimensional data. In particular, this topic modelling method allows for heterogeneous subgroups for which different combinations of predictors might be relevant. This better reflects the case mix of patients in the general population and might therefore adjust better for this case mix in future studies.
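To illustrate the fitting step only, the following is a minimal sketch of cyclical coordinate descent for a penalised Cox model. It is not the study's implementation: it assumes a lasso (L1) penalty and Breslow handling of tied times, neither of which is specified above, and it omits the latent Dirichlet allocation step entirely. The function names, the toy data and the penalty values are hypothetical.

```python
import math


def soft_threshold(z, lam):
    """Shrink z towards zero by lam (the lasso proximal step)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def coord_grad_hess(X, time, event, beta, j):
    """Gradient and curvature of the mean negative log partial likelihood
    with respect to beta[j], using Breslow handling of tied times."""
    n = len(time)
    # Relative hazard exp(x'beta) for every patient.
    w = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in X]
    grad = hess = 0.0
    for i in range(n):
        if not event[i]:
            continue  # censored patients contribute only through risk sets
        risk = [k for k in range(n) if time[k] >= time[i]]
        s0 = sum(w[k] for k in risk)
        mean = sum(w[k] * X[k][j] for k in risk) / s0
        second = sum(w[k] * X[k][j] ** 2 for k in risk) / s0
        grad -= X[i][j] - mean
        hess += second - mean ** 2
    return grad / n, hess / n


def fit_lasso_cox(X, time, event, lam, max_sweeps=200, tol=1e-8):
    """Cycle over coefficients, updating each by a soft-thresholded
    one-dimensional Newton step, until no coefficient changes by tol."""
    beta = [0.0] * len(X[0])
    for _ in range(max_sweeps):
        largest_change = 0.0
        for j in range(len(beta)):
            g, h = coord_grad_hess(X, time, event, beta, j)
            if h < 1e-12:
                continue  # no variation in this covariate within risk sets
            new = soft_threshold(h * beta[j] - g, lam) / h
            largest_change = max(largest_change, abs(new - beta[j]))
            beta[j] = new
        if largest_change < tol:
            break
    return beta


# Hypothetical toy cohort: two binary indicators, follow-up time, death flag.
X = [[1, 0], [1, 1], [0, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 1]]
time = [1, 2, 3, 4, 5, 6, 7, 8]
event = [1, 1, 1, 1, 1, 1, 0, 0]

beta_sparse = fit_lasso_cox(X, time, event, lam=10.0)  # heavy penalty
beta_fit = fit_lasso_cox(X, time, event, lam=0.01)     # light penalty
```

Updating one coefficient at a time means each step needs only a scalar gradient and curvature rather than a full Hessian, which is one reason this kind of algorithm stays stable when the number of candidate predictors is large. A heavy penalty shrinks every coefficient to exactly zero, while a light penalty lets informative predictors enter the model.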
Dates of all-cause mortality for the whole cohort will be extracted from the linked Office for National Statistics (ONS) death register. All deaths in England are coded and recorded in the ONS death register from death certificates, using the standardised rules established by the WHO. Deaths from all causes within the 4 years of follow-up will define the outcome events for the cohort.
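The outcome definition above can be sketched as a small function that turns dates into a follow-up time and an event indicator. This is an illustrative sketch under stated assumptions, not the study's extraction code: the names `index_date` and `linkage_end` are hypothetical, 4 years is approximated as 1461 days, and patients are assumed to be censored at the earlier of 4 years or the end of linked death register coverage.

```python
from datetime import date, timedelta

FOLLOW_UP = timedelta(days=1461)  # assumption: 4 years taken as 1461 days


def survival_outcome(index_date, death_date, linkage_end):
    """Return (days_of_follow_up, event) for one patient.

    event is 1 if death from any cause is recorded within follow-up,
    and 0 if the patient is censored. death_date is None when no death
    is recorded in the linked register.
    """
    end = min(index_date + FOLLOW_UP, linkage_end)
    if death_date is not None:
        if death_date < index_date:
            raise ValueError("death recorded before start of follow-up")
        if death_date <= end:
            return (death_date - index_date).days, 1
    return (end - index_date).days, 0


# Hypothetical patients, all entering the cohort on 1 January 2010.
died_early = survival_outcome(date(2010, 1, 1), date(2011, 1, 1), date(2016, 1, 1))
died_late = survival_outcome(date(2010, 1, 1), date(2015, 6, 1), date(2016, 1, 1))
alive = survival_outcome(date(2010, 1, 1), None, date(2016, 1, 1))
short_linkage = survival_outcome(date(2010, 1, 1), None, date(2011, 1, 1))
```

A death after the 4-year window is treated as censoring at the end of follow-up rather than as an event, so it does not count towards the outcome.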
Colin Crooks - Chief Investigator - University of Nottingham
Colin Crooks - Corresponding Applicant - University of Nottingham