Multi-database imputation to adjust for confounders within distributed data drug safety networks – a methodological study using real-world data from the UK Clinical Practice Research Datalink

Study type
Protocol
Date of Approval
Study reference ID
17_133
Lay Summary

Healthcare databases may contain information on patient demographics, prescriptions, diagnoses, and laboratory test results. This information can be used to see whether a drug on the market is harmful or works as intended. Sometimes, multiple healthcare databases are analysed together so that there is enough data. However, one problem with this approach is that some databases may not contain information that others do. This problem may produce invalid research results. In this paper, we discuss a method to correct this problem. In our method, missing information is estimated from databases that do contain the data of interest. To demonstrate our method, we will present a case study using the Clinical Practice Research Datalink (CPRD). Our case study assesses the effect of statins (cholesterol-lowering drugs) on the risk of heart attacks. Our study population will be patients with newly treated type 2 diabetes and no history of heart problems or recent statin use. We will mimic a situation with multiple healthcare databases by dividing patients into mock databases. In some of these, we will remove important data commonly absent from healthcare databases (for example, body mass index [BMI]). We will then compare the results of analyses that do and do not use our method.

Technical Summary

Confounders such as smoking history, BMI, and laboratory values are not always captured by databases used in distributed data drug safety networks. Data access restrictions preclude traditional missing data techniques in these settings. We propose a method called “multi-database imputation” to correct for bias introduced when some databases are missing confounders. Our method leverages the “validation” databases that capture the confounders of interest to generate posterior predictive distributions from which values are sampled for multiple imputation in the other, “missing” databases. We will demonstrate our method’s utility by estimating the effect of statins on myocardial infarction (MI) among patients with type 2 diabetes. We will identify a population of patients with newly treated type 2 diabetes without cardiovascular disease (CVD) or recent history of statin use in the CPRD linked to the Hospital Episode Statistics (HES) and Office for National Statistics (ONS) databases. Patients will be divided into “validation” or “missing” databases based on geographic region. In the “missing” databases, smoking status, BMI, HbA1c, and serum cholesterol levels will be ignored. We will compare meta-analysed hazard ratios (HRs) for the effect of current statin use on MI with and without imputation of missing values using Cox models adjusted for baseline confounders.
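
The imputation and pooling steps described above can be illustrated with a minimal Python sketch. It is not the protocol's implementation, and it rests on several assumptions that the summary does not state: the confounder is continuous (for example BMI) and modelled with a normal linear regression under a noninformative prior; only model summaries, not patient records, leave the validation database; imputation-specific log HRs are combined with Rubin's rules; and database-specific log HRs are pooled with a fixed-effect inverse-variance meta-analysis. All function names are hypothetical, and the Cox model fit that would produce each log HR is left out.

```python
import numpy as np


def fit_imputation_model(X, y):
    """Fit OLS in a validation database; return summaries that could be shared."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = float(resid @ resid) / (n - p)
    return beta_hat, xtx_inv, sigma2_hat, n - p


def draw_imputations(model, X_missing, m, rng):
    """Draw m imputed confounder vectors from the posterior predictive distribution."""
    beta_hat, xtx_inv, sigma2_hat, df = model
    draws = []
    for _ in range(m):
        # Residual variance from its scaled inverse chi-square posterior.
        sigma2 = sigma2_hat * df / rng.chisquare(df)
        # Coefficients from their conditional normal posterior.
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
        # Predictive draw: linear predictor plus residual noise.
        noise = rng.normal(0.0, np.sqrt(sigma2), size=X_missing.shape[0])
        draws.append(X_missing @ beta + noise)
    return draws


def rubin_pool(estimates, variances):
    """Combine m estimates (e.g. log HRs) and their variances with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    within = variances.mean()
    between = estimates.var(ddof=1)
    return float(estimates.mean()), float(within + (1 + 1 / m) * between)


def fixed_effect_meta(log_hrs, variances):
    """Inverse-variance fixed-effect pooling of database-specific log HRs."""
    weights = 1.0 / np.asarray(variances, dtype=float)
    pooled = float(np.sum(weights * np.asarray(log_hrs, dtype=float)) / np.sum(weights))
    return pooled, float(1.0 / np.sum(weights))
```

In this sketch, each "missing" database would fit its adjusted Cox model once per imputed dataset, pass the resulting log HRs and variances to `rubin_pool`, and contribute the pooled pair to `fixed_effect_meta` alongside the estimates from the "validation" databases.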

Health Outcomes to be Measured

The primary study outcome will be time to fatal or nonfatal MI, excluding perioperative MIs. Outcomes will be ascertained using HES and ONS data, with the earliest code date as the event date. MIs recorded in hospital will be considered events if the MI code appears in the primary or secondary diagnostic position. Patients who have codes related to perioperative MIs will be censored at that point and will not be counted as having a study outcome.
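
The outcome rule above can be sketched in Python as follows. This is an illustration under assumptions, not the study's code: the `MIRecord` fields and the `ascertain_outcome` function are hypothetical, the position restriction is applied to HES records only, perioperative codes are assumed to censor regardless of diagnostic position, and a perioperative code on the same date as a qualifying MI code is assumed to take precedence.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class MIRecord:
    code_date: date
    source: str              # "HES" or "ONS" (hypothetical labels)
    diagnosis_position: int  # 1 = primary, 2 = secondary; not used for ONS
    perioperative: bool      # True if the code relates to a perioperative MI


def ascertain_outcome(records: List[MIRecord]) -> Optional[Tuple[date, str]]:
    """Return (date, "event") or (date, "censored"); None if nothing qualifies."""
    qualifying = [
        r for r in records
        if r.perioperative or r.source == "ONS" or r.diagnosis_position <= 2
    ]
    if not qualifying:
        return None
    # Earliest code date wins; on ties a perioperative code sorts first (assumption).
    first = min(qualifying, key=lambda r: (r.code_date, not r.perioperative))
    return first.code_date, "censored" if first.perioperative else "event"
```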

Collaborators

Samy Suissa - Chief Investigator - Sir Mortimer B Davis Jewish General Hospital
Kristian Filion - Corresponding Applicant - McGill University
Andrea Benedetti - Collaborator - Research Institute of the McGill University Health Centre
Colin Dormuth - Collaborator - McGill University
Matthew Secrest - Collaborator - McGill University
Pauline Reynier - Collaborator - Sir Mortimer B Davis Jewish General Hospital
Robert Platt - Collaborator - McGill University

Linkages

HES Admitted Patient Care; ONS Death Registration Data