Evaluation of ethnicity recording and ascertainment methods using linked data from the UK's Clinical Practice Research Datalink and Hospital Episode Statistics

Study type
Date of Approval
Study reference ID
Lay Summary

The Clinical Practice Research Datalink (CPRD) maintains a repository of primary care electronic healthcare records in the United Kingdom (UK). Primary care data from CPRD-contributing GP practices in England can be linked to secondary care data from Hospital Episode Statistics (HES). As ethnicity can be recorded in different datasets and at different times, researchers must use an algorithm to select the most appropriate ethnicity record for each patient, based on the frequency and recency of recording.

This study aims to evaluate different methods for selecting the most appropriate ethnicity record for patients who have multiple records, and to select a single method for assigning the most likely ethnicity to each patient. We will test several methods that prioritise different characteristics of the ethnicity records, such as the date, number, and location of the records, and compare the resulting distributions with the UK ethnic distribution from the Census. This study will also evaluate how many patients have no ethnicity recorded and how many patients declined to have their ethnicity recorded. We will assess the representativeness of the ethnicity distribution by comparing it with the ethnicity distributions from the 2021 Census.

This project will provide a better understanding of ethnicity recording in CPRD and HES and will help to improve the creation and interpretation of research results. This may lead to the translation of those results into improved clinical practice for patients, allowing for the targeting of interventions and a reduction in health inequalities.

Technical Summary

Ethnicity data is recorded in the Clinical Practice Research Datalink (CPRD) GOLD and CPRD Aurum primary care databases. Ethnicity data may also be obtained via secondary care data linked to CPRD from the Hospital Episode Statistics (HES) datasets for patients in England. Patients may have multiple ethnicity records within primary and/or secondary care, meaning researchers must determine which ethnicity is most appropriately assigned to a patient using bespoke algorithms that consider the frequency, the recency, and the location of recording.
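As an illustration only, one possible prioritisation (most frequent category first, ties broken by the most recent record) can be sketched as follows. The function name, the tuple layout, the category labels, and the example records are all hypothetical, and this sketch is not the algorithm evaluated in the study; in particular, it ignores the location of recording.

```python
from collections import Counter
from datetime import date


def most_likely_ethnicity(records):
    """Pick one ethnicity category for a single patient.

    `records` is a hypothetical list of (category, record_date, source)
    tuples. The most frequently recorded category wins; ties are broken
    by the most recent record date. Returns None when there are no records.
    """
    if not records:
        return None
    # Frequency of each category across all of the patient's records.
    counts = Counter(category for category, _, _ in records)
    # Most recent record date for each category.
    latest = {}
    for category, record_date, _ in records:
        if category not in latest or record_date > latest[category]:
            latest[category] = record_date
    # Rank by (frequency, recency); the source of the record is ignored here.
    return max(counts, key=lambda category: (counts[category], latest[category]))


# Made-up example: two categories recorded twice each, so recency decides.
records = [
    ("Asian", date(2015, 3, 1), "primary care"),
    ("White", date(2018, 6, 9), "HES"),
    ("Asian", date(2012, 1, 5), "HES"),
    ("White", date(2010, 2, 2), "primary care"),
]
print(most_likely_ethnicity(records))
```

Swapping the order of the two elements in the ranking tuple would give a recency-first prioritisation, which is the kind of variation the comparison of methods is concerned with.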

This study aims to evaluate different prioritisations within an algorithm for ascertaining the most likely ethnicity using CPRD and linked HES data, in order to create a standardised method for obtaining and ascertaining ethnicity from these data sources. The resulting ethnic distributions will be assessed for inter-rater reliability using Cohen’s kappa and compared with the national distributions of ethnicity in the United Kingdom (UK) based on Census data from 2011 and 2021. Additionally, this study will evaluate the types of missing ethnicity, such as not coded and declined to provide ethnicity, in these data sources. We will assess the representativeness of the ethnicity distribution (proportions of the population) in CPRD and in CPRD combined with HES compared with the ethnicity distributions from the 2011 and 2021 Censuses in the UK, Great Britain, England, Scotland, Wales, and Northern Ireland.
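For reference, Cohen's kappa compares the observed agreement between two sets of assignments with the agreement expected by chance. The sketch below is a minimal, unweighted implementation; the function name and the example labels are hypothetical, and treating each method's per-patient assignment as a "rater" is an assumption about how the statistic would be applied here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences.

    Each sequence holds the category assigned to the same patients, in the
    same order, by one of two methods.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of patients given the same category by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of marginal proportions, summed over categories.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both methods assign one identical category to every patient.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up assignments from two hypothetical methods for four patients.
method_a = ["White", "White", "Asian", "Black"]
method_b = ["White", "Asian", "Asian", "Black"]
kappa = cohens_kappa(method_a, method_b)
```

In this made-up example the methods agree on three of four patients (observed agreement 0.75) against a chance agreement of 5/16, giving a kappa of 7/11, or about 0.64.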

This project will benefit patients and researchers through the creation of a standardised methodology for the ascertainment of ethnicity using CPRD and linked HES datasets. This will provide a CPRD-recommended option for assigning the most likely ethnicity to patients, for researchers wishing to investigate the relationships between ethnicity and health in these data sources.

Health Outcomes to be Measured

Counts and proportions of the CPRD populations in each ethnic category. These will be compared with the national distributions of ethnicity in the UK based on Census data, initially from 2011 and then from 2021 when it becomes available.
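A minimal sketch of this comparison is given below, assuming each patient has already been assigned a single category. The function names are hypothetical, and both the assigned categories and the reference proportions are made-up placeholder values, not CPRD or Census figures.

```python
from collections import Counter


def category_proportions(assigned):
    """Return the proportion of patients in each assigned category."""
    counts = Counter(assigned)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def proportion_differences(cohort_props, reference_props):
    """Cohort proportion minus reference proportion for every category.

    Categories missing from either side are treated as a proportion of 0.
    """
    categories = set(cohort_props) | set(reference_props)
    return {
        c: cohort_props.get(c, 0.0) - reference_props.get(c, 0.0)
        for c in categories
    }


# Placeholder data for illustration only (not real CPRD or Census values).
assigned = ["White"] * 6 + ["Asian"] * 3 + ["Black"] * 1
reference = {"White": 0.5, "Asian": 0.25, "Black": 0.25}

props = category_proportions(assigned)
diffs = proportion_differences(props, reference)
```

A positive difference indicates a category that is over-represented in the cohort relative to the reference distribution, and a negative difference indicates under-representation.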


Eleanor Axson - Chief Investigator - CPRD
Suhail Shiekh - Corresponding Applicant - CPRD
Helen Booth - Collaborator - CPRD
Rachael Williams - Collaborator - CPRD
Suhail Shiekh - Collaborator - CPRD


HES Accident and Emergency;HES Admitted Patient Care;HES Diagnostic Imaging Dataset;HES Outpatient;Patient Level Index of Multiple Deprivation;Practice Level Index of Multiple Deprivation;Practice Level Rural-Urban Classification;Rural-Urban Classification