Quantification of potential undiagnosed Pulmonary Arterial Hypertension (PAH) patients among patients with chronic respiratory disorders and PAH-related cardiovascular conditions through machine learning models in the UK CPRD-HES databases

Date of Approval
Application Number
Technical Summary

Pulmonary arterial hypertension (PAH) is a rare disease that often remains asymptomatic in its early stages. Affected patients are often undiagnosed or misdiagnosed because the symptomatology of PAH resembles that of other chronic respiratory or cardiovascular disorders, and a final diagnosis of PAH requires an invasive right heart catheterisation (RHC) to exclude other forms of pulmonary hypertension.

This study aims to develop a machine learning solution to quantify undiagnosed PAH patients among patients initially diagnosed with chronic respiratory or cardiovascular diseases, and to determine features that help estimate the probability of having PAH. Patient data are heterogeneous and noisy; standard biostatistical methods, such as one-step binary regression or classification, handle this heterogeneity poorly, resulting in inflated false-positive rates.

We propose a two-step approach comprising parallel ML techniques and utilising data from three patient cohorts constructed from CPRD GOLD and Aurum and from HES (OP, APC, DID, and A&E):

1. Confirmed PAH cohort – PAH diagnosis within 180 days post-RHC
2. Potential PAH cohort – Chronic respiratory disorders or cardiovascular conditions with symptomatology similar to that of PAH, but not yet confirmed as PAH via RHC
3. Confirmed non-PAH cohort – RHC and no PAH diagnosis within 180 days post-RHC
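
The cohort definitions above can be illustrated with a minimal sketch. It is not the study's extraction logic: the function name, the argument names, and the handling of edge cases are assumptions made for illustration. In particular, the sketch assumes the 180-day window is inclusive and starts on the day of the RHC, and that a PAH diagnosis recorded before the RHC or after the window does not count as confirmation.

```python
from datetime import date
from typing import Optional

WINDOW_DAYS = 180  # window length taken from the cohort definitions


def assign_cohort(
    rhc_date: Optional[date],
    pah_dx_date: Optional[date],
    has_similar_condition: bool,
) -> Optional[str]:
    """Assign a patient to one of the three cohorts (hypothetical helper).

    rhc_date: date of right heart catheterisation, or None if no RHC.
    pah_dx_date: date of the PAH diagnosis, or None if never diagnosed.
    has_similar_condition: True if the patient has a chronic respiratory or
        cardiovascular condition with PAH-like symptomatology.
    """
    if rhc_date is not None:
        if pah_dx_date is not None:
            days_after_rhc = (pah_dx_date - rhc_date).days
            # Assumption: day 0 and day 180 both fall inside the window.
            if 0 <= days_after_rhc <= WINDOW_DAYS:
                return "confirmed_pah"
        # RHC performed, but no PAH diagnosis inside the window.
        return "confirmed_non_pah"
    if has_similar_condition:
        # No RHC yet, but a condition that resembles PAH.
        return "potential_pah"
    return None  # outside all three cohorts
```

For example, a patient with an RHC on 1 January 2020 and a PAH diagnosis on 1 March 2020 would fall into the confirmed PAH cohort under these assumptions.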

The approach uses a semi-supervised learning technique to identify a subgroup of patients within the potential PAH cohort whose features most closely resemble those of the confirmed PAH cohort. Independently, a binary classifier is trained to distinguish between the confirmed PAH and confirmed non-PAH cohorts. Finally, the outputs of the two parallel approaches are merged to further refine the potential PAH subgroup. All mathematically significant features will be identified and further evaluated.
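
The overall shape of the two parallel steps and their merge can be sketched as follows. The summary does not name the algorithms used, so this sketch substitutes deliberately simple stand-ins: a distance-to-centroid filter in place of the semi-supervised step, a nearest-centroid rule in place of the binary classifier, and a set intersection as the merge. The function names, the `radius` parameter, and the feature vectors are all hypothetical.

```python
from math import dist
from statistics import fmean
from typing import List, Sequence


def centroid(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a set of feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def refine_potential(
    pah: Sequence[Sequence[float]],
    non_pah: Sequence[Sequence[float]],
    potential: Sequence[Sequence[float]],
    radius: float,
) -> List[int]:
    """Return indices of potential-cohort patients flagged by both steps."""
    pah_c = centroid(pah)
    non_c = centroid(non_pah)

    # Step A (stand-in for the semi-supervised step): keep potential
    # patients whose features lie within `radius` of the confirmed PAH centroid.
    similar = {i for i, x in enumerate(potential) if dist(x, pah_c) <= radius}

    # Step B (stand-in for the binary classifier): a nearest-centroid rule
    # fitted on confirmed PAH versus confirmed non-PAH patients.
    classified = {
        i for i, x in enumerate(potential) if dist(x, pah_c) < dist(x, non_c)
    }

    # Merge: keep only patients selected by both parallel steps.
    return sorted(similar & classified)


# Toy feature vectors, invented for illustration only.
pah_rows = [[1.0, 1.0], [1.2, 0.8]]
non_pah_rows = [[5.0, 5.0], [4.8, 5.2]]
potential_rows = [[1.1, 0.9], [4.9, 5.1], [2.5, 2.5]]
flagged = refine_potential(pah_rows, non_pah_rows, potential_rows, radius=1.0)
```

Requiring agreement between the two steps is one plausible way to reduce false positives, since a patient must both resemble the confirmed PAH cohort and be separated from the confirmed non-PAH cohort; the study's actual merging rule may differ.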

Insights from the model will allow a deeper understanding of patient subgroups with higher rates of undiagnosed or misdiagnosed PAH and of areas of unmet need where additional PAH education and awareness are required. Ultimately, this could allow patients to be diagnosed with PAH earlier, improving their chances of survival.

Health Outcomes to be Measured

The study has three main outcomes. First, an estimate of the number of undiagnosed PAH patients among patients with chronic respiratory disorders and cardiovascular conditions. Second, an understanding of the impact of comorbid conditions, demographics, treatments, procedures, and laboratory test values on a patient's estimated probability of having PAH. Finally, model performance metrics, including sensitivity, specificity, and precision-recall curves.
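
The performance metrics named above follow from the confusion matrix at a given probability threshold. The sketch below shows one way to compute them with the Python standard library; the function names, labels, and scores are illustrative assumptions and are not study data or study code.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels coded 1 (PAH) and 0 (non-PAH)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Safe division: None when the metric is undefined."""
    return numerator / denominator if denominator else None


def metrics_at(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Dict[str, Optional[float]]:
    """Sensitivity, specificity, and precision at one score threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": ratio(tp, tp + fn),  # recall: share of PAH cases found
        "specificity": ratio(tn, tn + fp),  # share of non-PAH correctly excluded
        "precision": ratio(tp, tp + fp),    # share of flagged patients with PAH
    }


def precision_recall_points(
    y_true: Sequence[int], scores: Sequence[float]
) -> List[Tuple[Optional[float], Optional[float]]]:
    """(recall, precision) pairs, one per distinct score used as threshold."""
    points = []
    for thr in sorted(set(scores), reverse=True):
        m = metrics_at(y_true, scores, thr)
        points.append((m["sensitivity"], m["precision"]))
    return points


# Invented labels and scores, for illustration only.
labels = [1, 1, 0, 0, 1]
probs = [0.9, 0.4, 0.6, 0.1, 0.8]
summary = metrics_at(labels, probs, threshold=0.5)
curve = precision_recall_points(labels, probs)
```

Because PAH is rare, precision-recall points are usually more informative than accuracy alone: a classifier can reach high specificity while still flagging mostly false positives.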


Eva-Maria Didden - Chief Investigator - Actelion Pharmaceuticals Ltd
Eva-Maria Didden - Corresponding Applicant - Actelion Pharmaceuticals Ltd
Brenda Reinhart - Collaborator - ZS Associates
Denys Wahl - Collaborator - Janssen-Cilag EMEA
Flora Ashley Daniels - Collaborator - Not from an Organisation
Hammad Shahid - Collaborator - Janssen Pharmaceutica NV
Manish Kumar - Collaborator - ZS Associates
Pratyush Khare - Collaborator - ZS Associates
Prerna Goel - Collaborator - ZS Associates
Sanchita Porwal - Collaborator - ZS Associates
Shaishav Jain - Collaborator - ZS Associates


HES Accident and Emergency; HES Admitted Patient Care; HES Diagnostic Imaging Dataset; HES Outpatient