Quantification of potential undiagnosed Pulmonary Arterial Hypertension (PAH) patients among patients with chronic respiratory disorders and PAH-related cardiovascular conditions through machine learning models in the UK CPRD-HES databases

Study type
Protocol
Date of Approval
Study reference ID
21_000580
Lay Summary

Pulmonary arterial hypertension (PAH) is a rare disease affecting 53 per million people in the UK. Without timely diagnosis and treatment, PAH can cause irreversible damage to the blood vessels of the lungs, making them narrower over time. Narrowed vessels restrict blood flow, raising blood pressure and forcing the heart to pump harder. Since this condition eventually leads to heart failure, early diagnosis is vital so that patients can be treated as soon as possible and the worsening of the disease can be slowed.

However, people with PAH often have their disease diagnosed late. One reason is that PAH may cause no clear symptoms, or only very general ones, in its early stages, such as feeling out of breath after climbing stairs. A second reason is that a definitive diagnosis of PAH is complex and requires an invasive procedure (right heart catheterisation) to measure blood pressure in the lungs and heart.

This study will create a computer program to distinguish people with PAH from those without PAH using de-identified data from their healthcare records. The program will be used to estimate the current number of people in the UK with potentially undiagnosed PAH. In the future, it could help doctors screen their patients for risk of PAH and increase the rate and speed of PAH diagnosis.

Technical Summary

Pulmonary arterial hypertension (PAH) is a rare disease that often remains asymptomatic in its early stages. Affected patients are often undiagnosed or misdiagnosed because the symptomatology of PAH resembles that of other chronic respiratory or cardiovascular disorders, and the final diagnosis of PAH requires an invasive right heart catheterisation (RHC) to exclude other forms of pulmonary hypertension.

This study aims to develop a machine learning solution to quantify undiagnosed PAH patients among patients initially diagnosed with chronic respiratory or cardiovascular diseases, and to determine features that help estimate the probability of having PAH. Patient data are heterogeneous and noisy; standard biostatistical methods, such as one-step binary regression or classification, handle this heterogeneity poorly and produce inflated false-positive rates.

We propose a two-step approach comprising parallel machine learning (ML) techniques and using data from three patient cohorts constructed from CPRD GOLD and Aurum and from HES (OP, APC, DID, and A&E):

1. Confirmed PAH cohort – PAH diagnosis within 180 days post-RHC.
2. Potential PAH cohort – Chronic respiratory disorders or cardiovascular conditions with symptomatology similar to that of PAH, but not yet confirmed as PAH via RHC.
3. Confirmed non-PAH cohort – RHC and no PAH diagnosis within 180 days post-RHC.
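
As an illustration only, the three cohort definitions above can be read as a simple assignment rule. The sketch below is not the study's implementation: the function name, argument names, and cohort labels are hypothetical, and it assumes that a diagnosis on the day of the RHC counts as "within 180 days post-RHC" and that a PAH diagnosis recorded before the RHC does not.

```python
from datetime import date, timedelta
from typing import Optional

# From the protocol summary: PAH diagnosis within 180 days post-RHC.
POST_RHC_WINDOW = timedelta(days=180)


def assign_cohort(
    rhc_date: Optional[date],
    pah_diagnosis_date: Optional[date],
    has_related_condition: bool,
) -> Optional[str]:
    """Return a hypothetical cohort label, or None if the patient fits no cohort."""
    if rhc_date is not None:
        if pah_diagnosis_date is not None:
            gap = pah_diagnosis_date - rhc_date
            # Assumption: day 0 is inside the window; pre-RHC diagnoses are not.
            if timedelta(0) <= gap <= POST_RHC_WINDOW:
                return "confirmed_pah"
        return "confirmed_non_pah"
    # No RHC on record: only patients with a chronic respiratory disorder or a
    # PAH-related cardiovascular condition enter the potential PAH cohort.
    if has_related_condition:
        return "potential_pah"
    return None
```

For example, under these assumptions a patient with an RHC on 1 January 2020 and a PAH diagnosis on 1 March 2020 would be labelled `confirmed_pah`, while the same patient with no diagnosis until December 2020 would be labelled `confirmed_non_pah`.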

The approach uses a semi-supervised learning technique to identify a subgroup of patients within the potential PAH cohort whose features most closely resemble those of the confirmed PAH cohort. Independently, a binary classifier is trained to distinguish between the confirmed PAH and confirmed non-PAH cohorts. Finally, the results of the two parallel approaches are merged to further refine the potential PAH subgroup. All mathematically significant features will be identified and further evaluated.
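
The summary does not specify how the two parallel outputs are merged. One plausible reading, sketched below purely as an assumption, is to keep only those potential PAH patients who are flagged by both steps. The function name, the per-patient score dictionaries, and both thresholds are hypothetical and are not taken from the protocol.

```python
from typing import Dict, Set


def merge_candidates(
    similarity_to_pah: Dict[str, float],
    classifier_probability: Dict[str, float],
    similarity_threshold: float = 0.8,   # hypothetical cut-off
    probability_threshold: float = 0.5,  # hypothetical cut-off
) -> Set[str]:
    """Keep patient IDs that pass both the semi-supervised and classifier steps.

    similarity_to_pah: score from the semi-supervised step for each patient in
        the potential PAH cohort (higher means closer to confirmed PAH).
    classifier_probability: PAH probability from the binary classifier trained
        on the confirmed PAH and confirmed non-PAH cohorts.
    """
    return {
        patient_id
        for patient_id, similarity in similarity_to_pah.items()
        if similarity >= similarity_threshold
        # Patients missing a classifier score are treated as not flagged.
        and classifier_probability.get(patient_id, 0.0) >= probability_threshold
    }


# Toy scores for three made-up patients; only "p1" passes both steps.
refined = merge_candidates(
    {"p1": 0.90, "p2": 0.85, "p3": 0.40},
    {"p1": 0.70, "p2": 0.20, "p3": 0.90},
)
estimated_undiagnosed = len(refined)
```

Under this reading, the size of the refined subgroup gives the estimate of potentially undiagnosed PAH patients. Requiring agreement between both steps is a conservative choice, in line with the stated concern about inflated false-positive rates.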

Insights from the model will allow a deeper understanding of patient subgroups with higher rates of undiagnosed or misdiagnosed PAH and of areas of unmet need where additional PAH education and awareness are required. Ultimately, patients could be diagnosed with PAH earlier, improving their chances of survival.

Health Outcomes to be Measured

The study has three main outcomes. First, an estimate of the number of undiagnosed PAH patients among patients with chronic respiratory disorders and cardiovascular conditions. Second, an understanding of the impact of comorbid conditions, demographics, treatments, procedures, and laboratory test values on a patient's estimated probability of having PAH. Third, model performance metrics, including sensitivity, specificity, and precision-recall curves.
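
For reference, the performance metrics named above follow from the confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions; the function names and the toy labels and scores are illustrative assumptions, not study data or study code.

```python
from typing import Dict, List, Sequence, Tuple


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def metrics_at(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> Dict[str, float]:
    """Sensitivity, specificity and precision when score >= threshold means 'PAH'."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted_pah = score >= threshold
        if label == 1 and predicted_pah:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_pah:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),  # also called recall
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
    }


def precision_recall_points(y_true: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(recall, precision) pairs, one per distinct score used as a threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        m = metrics_at(y_true, scores, threshold)
        points.append((m["sensitivity"], m["precision"]))
    return points


# Toy example: 1 = confirmed PAH, 0 = confirmed non-PAH.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.6, 0.7, 0.2, 0.3, 0.1]
toy_metrics = metrics_at(toy_labels, toy_scores, threshold=0.5)
toy_curve = precision_recall_points(toy_labels, toy_scores)
```

Because PAH is rare, precision-recall curves are generally more informative than accuracy alone: a model that labels every patient as non-PAH would be highly accurate but would have zero sensitivity.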

Collaborators

Hammad Shahid - Chief Investigator - Janssen Pharmaceutica NV
Eva-Maria Didden - Corresponding Applicant - Actelion Pharmaceuticals Ltd
Brenda Reinhart - Collaborator - ZS Associates
Denys Wahl - Collaborator - Janssen-Cilag EMEA
Flora Ashley Daniels - Collaborator - Cilag GmbH International
Hammad Shahid - Collaborator - Janssen Pharmaceutica NV
Manish Kumar - Collaborator - ZS Associates
Pratyush Khare - Collaborator - ZS Associates
Prerna Goel - Collaborator - ZS Associates
Sanchita Porwal - Collaborator - ZS Associates
Shaishav Jain - Collaborator - ZS Associates

Former Collaborators

Priyansh Jain - Collaborator - ZS Associates

Linkages

HES Accident and Emergency; HES Admitted Patient Care; HES Diagnostic Imaging Dataset; HES Outpatient