Assessment of LungFlag model - a lung cancer early detection algorithm

Study type
Protocol
Date of Approval
Study reference ID
22_001844
Lay Summary

Most lung cancers are diagnosed at an advanced stage. Stage at diagnosis is strongly associated with survival, and localized lung cancer has the most favorable prognosis. Timely identification of individuals at high risk of developing lung cancer can therefore help patients receive earlier intervention and improve long-term outcomes. A number of risk models have been developed to predict lung cancer onset and death over varying periods of time, based purely on basic demographic and clinical information. The LungFlag model is a machine learning model that uses both routine clinical and laboratory data, which could provide a more comprehensive basis for lung cancer detection.
The LungFlag model was developed using data from Kaiser Permanente Southern California (KPSC) in the US, an integrated healthcare system that provides comprehensive care. To evaluate the performance of the LungFlag model on an external database from a different geographic area, the proposed study aims to validate the model in the CPRD population. The findings from this study will provide opportunities to use data science techniques to help physicians schedule timely lung cancer screenings, enabling early disease detection and better patient outcomes.

Technical Summary

This is a retrospective cohort study using secondary data from the CPRD GOLD and CPRD Aurum databases, linked with the datasets described below. The study aims to validate the LungFlag model, which detects lung cancer patients using routine clinical and laboratory information. The LungFlag model is intended to help physicians schedule timely lung cancer screenings, enabling early disease detection and better patient outcomes.
The study cohort includes patients in the CPRD databases who were 45 to 90 years old between 2014 and 2018. For each patient, the LungFlag model generates a risk score that indicates the likelihood of the patient developing lung cancer within 2 years. The exposure is defined by a pre-defined threshold on the risk score: patients scoring above the threshold are classified as high-risk, and those scoring below it as average-risk. The primary outcome is a diagnosis of lung cancer, identified from the NCRAS Cancer Registration Data. Patient characteristics, relevant clinical information, and routine laboratory results will be extracted from the CPRD databases, and hospitalization information will be extracted from the HES APC data. This information is fed into the model to calculate each patient's risk score. Data from the practice-level IMD and patient-level Rural-Urban Classification files are used in the subgroup analysis to understand model performance stratified by socioeconomic status.
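The threshold-based exposure definition can be illustrated with a minimal sketch. The threshold value, patient identifiers, and scores here are hypothetical placeholders, not values from the protocol, and the handling of a score exactly equal to the threshold (treated as average-risk) is an assumption, since the protocol does not specify it.

```python
from dataclasses import dataclass


@dataclass
class ScoredPatient:
    patient_id: str
    risk_score: float  # hypothetical model output; the real scale is not stated


def assign_risk_group(score: float, threshold: float) -> str:
    """Return 'high-risk' for scores above the threshold, else 'average-risk'.

    Assumption: a score equal to the threshold counts as average-risk.
    """
    return "high-risk" if score > threshold else "average-risk"


# Hypothetical threshold and scores, for illustration only.
HYPOTHETICAL_THRESHOLD = 0.05
patients = [
    ScoredPatient("p1", 0.12),
    ScoredPatient("p2", 0.01),
]

groups = {
    p.patient_id: assign_risk_group(p.risk_score, HYPOTHETICAL_THRESHOLD)
    for p in patients
}
```

In a real analysis the threshold would be fixed before the CPRD data are scored, so that the exposure groups are not tuned to the validation outcome.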
The LungFlag model is based on the Medial Infrastructure for inferring predictive models, which uses the Extreme Gradient Boosting (XGBoost) algorithm. The model was derived on the Kaiser Permanente dataset and validated on both Kaiser Permanente and Geisinger data in the United States.
Model performance in the CPRD data will be evaluated primarily by the rate ratio (RR), comparing the incidence rates of lung cancer between the high-risk and average-risk groups flagged by the LungFlag model. Additionally, the odds ratio (OR), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and area under the receiver operating characteristic curve (AUC) will be calculated.
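As a rough illustration of how these measures relate to one another, the sketch below computes them from a 2x2 table and person-time totals using standard textbook definitions. All counts, person-years, and scores are invented for the example, the function names are hypothetical, and the AUC is computed with the pairwise (Mann-Whitney) formulation, which may differ from the study's actual implementation.

```python
def rate_ratio(events_high, person_years_high, events_avg, person_years_avg):
    """Incidence rate in the high-risk group divided by that in the average-risk group."""
    return (events_high / person_years_high) / (events_avg / person_years_avg)


def two_by_two_metrics(tp, fp, fn, tn):
    """Standard measures from a 2x2 table of flag (high-risk or not) vs. lung cancer."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "odds_ratio": (tp * tn) / (fp * fn),
    }


def auc(case_scores, noncase_scores):
    """Probability that a random case scores higher than a random non-case.

    Ties count as half a win (Mann-Whitney formulation).
    """
    wins = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_scores) * len(noncase_scores))


# Invented numbers, for illustration only.
rr = rate_ratio(events_high=40, person_years_high=2000,
                events_avg=60, person_years_avg=18000)
metrics = two_by_two_metrics(tp=40, fp=960, fn=60, tn=8940)
example_auc = auc([0.9, 0.4, 0.7], [0.1, 0.4, 0.3, 0.8])
```

The rate ratio uses person-time denominators, while the odds ratio and the other 2x2 measures use patient counts, so the two can differ even on the same cohort.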

Health Outcomes to be Measured

The primary outcome to be measured is the diagnosis of lung cancer, per the Cancer Registry and/or pathology, for non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC).
● NSCLC: Adenocarcinoma; Squamous cell carcinoma; Large cell carcinoma; NSCLC
● SCLC
● Not otherwise specified (NOS) or Poorly differentiated

Collaborators

Yue Jin - Chief Investigator - F. Hoffmann-La Roche Ltd
Yue Jin - Corresponding Applicant - F. Hoffmann-La Roche Ltd
Alon Lanyado - Collaborator - Medial EarlySign
Coby Metzger - Collaborator - Medial EarlySign
Eitan Israeli - Collaborator - Medial EarlySign
Eran Netanel Choman - Collaborator - Medial EarlySign
Iori Namekawa - Collaborator - Roche
Joanna Harton - Collaborator - Genentech, Inc.
Matthew Kent - Collaborator - Genentech, Inc.
Thanh Ton - Collaborator - Genentech, Inc.

Former Collaborators

Joanna Harton - Collaborator - Genentech, Inc.
Matthew Kent - Collaborator - Genentech, Inc.

Linkages

HES Admitted Patient Care; HES Outpatient; NCRAS Cancer Registration Data; No additional NCRAS data required; Practice Level Index of Multiple Deprivation; Rural-Urban Classification