Predicting smoking pack-years using routinely collected smoking data

Study type
Protocol
Date of Approval
Study reference ID
23_003049
Lay Summary

Information on smoking is important for many different study questions. Whilst current smoking status is often recorded by GPs, it can be challenging to understand the extent of an individual's smoking habits over their entire life, even though this information is crucial for certain research projects to account for the individual's risk of disease. Our study aims to create a way to estimate a person's total smoking exposure, known as "pack-years", using routine electronic health records from GPs. We will create two different methods to do this: one using a traditional statistical technique called linear regression, and another using advanced computer techniques known as machine learning. Each technique will give us a "prediction model" that can be used to estimate a patient's pack-years even if this is not recorded by their GP. These models will be built using information related to smoking in the patient record, and the results of each model will be compared with each other and with other techniques to find out which one works best for predicting pack-years. We will then check whether the patients whose data were used to build the models are similar to the general population of patients who have smoked. Our goal is to accurately predict an individual's smoking history so that researchers can better account for the impact of smoking on the health conditions they are studying, leading to more accurate study results and higher-quality future research that can benefit patients.

Technical Summary

In this study, we will develop prediction models for estimating smoking pack-years using data from the CPRD Aurum dataset. The primary objective is to construct two distinct models, one employing linear regression and another using a machine learning random forest model, and subsequently compare their performance using R-squared, mean squared error, and concordance, to ascertain the most efficacious approach for predicting pack-years. Our predictors will comprise smoking codes (included as binary present/absent or as the number of times recorded) and the length of time for which the patient was classed in each smoking category. The aim is to accurately predict pack-years from available healthcare data, allowing a more useful account of smoking's confounding effects in future epidemiological analyses when pack-year data are missing. Using only smoking codes to generate predictors ensures that the predicted pack-years can be included as a covariate in other investigations without the possibility of predictors (such as comorbidity) being included twice in future models. We anticipate establishing a reliable, robust, and parsimonious model for predicting smoking pack-years, thus enabling better control for the confounding effects of smoking in subsequent research that uses EHR data. Applicability will be assessed by comparing the characteristics of patients with pack-years recorded to those of patients in whom pack-years have not been recorded, using t-tests for continuous variables and chi-squared tests for categorical variables. For this part of the analysis we will also use IMD data so that we can include socioeconomic status. We will compare the results of analyses that use derived pack-years from our prediction models to those obtained using multiple imputation. This will be undertaken using an example regression model which assesses the effect of age on GP attendance rates, adjusting for other covariates that include smoking pack-years.
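
As a rough illustration of the model comparison described above, the following Python sketch computes mean squared error, R-squared, and a concordance measure for two sets of predictions. It is not the study's code. The summary does not say which concordance statistic will be used, so the sketch assumes a pairwise (rank-order) concordance; Lin's concordance correlation coefficient would be another reasonable reading. The function names, the `recorded` values, and the `model_a` and `model_b` predictions are all made up for illustration.

```python
from itertools import combinations
from statistics import mean


def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted values."""
    return mean((o - p) ** 2 for o, p in zip(observed, predicted))


def r_squared(observed, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def pairwise_concordance(observed, predicted):
    """Share of patient pairs ranked in the same order by both series.

    ASSUMPTION: "concordance" is read here as rank-order agreement.
    Pairs tied on the observed value are skipped; pairs tied on the
    predicted value count as half concordant.
    """
    concordant = 0.0
    usable = 0
    for (o1, p1), (o2, p2) in combinations(zip(observed, predicted), 2):
        if o1 == o2:
            continue
        usable += 1
        if p1 == p2:
            concordant += 0.5
        elif (o1 < o2) == (p1 < p2):
            concordant += 1.0
    return concordant / usable


# Made-up recorded pack-years and two hypothetical sets of predictions.
recorded = [5.0, 12.0, 20.0, 33.0, 41.0]
model_a = [7.0, 10.0, 22.0, 30.0, 45.0]
model_b = [15.0, 9.0, 18.0, 40.0, 28.0]

for name, preds in (("model_a", model_a), ("model_b", model_b)):
    print(
        name,
        round(mean_squared_error(recorded, preds), 3),
        round(r_squared(recorded, preds), 3),
        round(pairwise_concordance(recorded, preds), 3),
    )
```

In practice these metrics would be computed on data held out from model fitting, so that the linear regression and random forest models are compared on patients neither has seen.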

Health Outcomes to be Measured

Smoking pack-years
GP attendance rate (for Aim 6 only)
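
For reference, one pack-year is conventionally defined as smoking 20 cigarettes (one pack) per day for one year. The short Python sketch below shows that arithmetic; the function name and the example figures are hypothetical and are not taken from the study.

```python
def pack_years(cigarettes_per_day, years_smoked, cigarettes_per_pack=20):
    """Packs smoked per day multiplied by the number of years smoked."""
    return cigarettes_per_day / cigarettes_per_pack * years_smoked


# Hypothetical patient: 10 cigarettes a day for 30 years.
example = pack_years(10, 30)
print(example)
```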

Collaborators

Jennifer Quint - Chief Investigator - Imperial College London
Alexander Adamson - Corresponding Applicant - Imperial College London
Alex Bottle - Collaborator - Imperial College London
Xizhuo Chu - Collaborator - Imperial College London

Linkages

Patient Level Index of Multiple Deprivation