Anonymised primary care patient data can be individually linked to secondary care and other health and area-based datasets. This linkage enables CPRD to provide a fuller picture of the patient care record to support vital public health research, informing advances in patient safety and delivery of care. CPRD is expanding its healthcare data and research services to increase both the coverage of primary care data and the number of datasets that are linked and made available on a routine basis to the research community.
Linked datasets currently available include:
- Small area level data:
  - Patient postcode linked deprivation measures
  - Practice postcode linked deprivation measures
  - Index of Multiple Deprivation (IMD)
  - Townsend Deprivation Index
  - Carstairs Index
  - Rural-Urban Classification
- Data from NHS Digital:
  - Hospital Episode Statistics (HES) Admitted Patient Care (HES APC) data
  - HES Outpatient (HES OP) data
  - HES Accident and Emergency (HES A&E) data
  - HES Diagnostic Imaging Dataset (HES DID)
  - HES Patient Reported Outcomes Measures (PROMs) data
  - Death registration data from the Office for National Statistics (ONS)
  - Mental Health Dataset (MHDS)
- Cancer data from Public Health England (PHE):
  - Cancer Registration
  - Systemic Anti-Cancer Therapy (SACT) Dataset
  - National Radiotherapy Dataset (RTDS)
  - Cancer Patient Experience Survey (CPES)
  - Quality of Life of Cancer Survivors in England: Pilot Survey (QOLP)
  - Quality of Life of Colorectal Cancer Survivors in England: Patient Reported Outcome Measures Survey (QOLC)
Publication: Padmanabhan S, Carty L, Cameron E, Ghosh RE, Williams R, Strongman H. Approach to record linkage of primary care data from Clinical Practice Research Datalink to other health-related patient data: overview and implications. Eur J Epidemiol, 2018.
Availability of linked data
Linkage of CPRD primary care data with other patient level datasets is available for English practices that have consented to participate in the linkage scheme. Each GP practice participating in CPRD's collection of primary care data can choose to revoke its consent for data collection at any point.
CPRD respects all patient opt-outs. Data for patients who have registered an opt-out are not extracted for CPRD research or for data linkage.
The latest full set of linkage data, referred to as set 18, is available for both CPRD GOLD, based on the Vision software system, and CPRD Aurum, based on EMIS software.
CPRD GOLD linkage data include patients from 416 practices. These linkages cover approximately 74% of the contributing CPRD GOLD practices in the August 2019 build that are located in England, and roughly 50% of contributing CPRD GOLD practices in the UK. A total of 10,800,187 patients are eligible for linkage.
CPRD Aurum linkage data include patients from 890 practices. These linkages cover approximately 99% of CPRD Aurum practices available in the August 2019 build, all of which are in England. 28,618,186 patients are currently eligible for linkage.
The next linkage set (set 19) is being made available as a phased release, with priority linkages being expedited to support COVID-19 research at this time. The set 19 ONS deaths data (to April 2020), HES APC (to end March 2020) and small area data are now available, with 9,213,965 acceptable patients in the CPRD GOLD June 2020 build and 32,409,356 acceptable patients in the CPRD Aurum June 2020 build eligible for at least one linkage. The set 19 data for the other standard linkages will be released over the coming months.
If you are unsure about which linked dataset and/or source file should be used in your study, please contact us on email@example.com
Access to linked data
Access to patient level data is dependent on approval of a study protocol by the Independent Scientific Advisory Committee (ISAC). All required linked data sources must be requested on the application form. Additionally, researchers who are first time users of a linked dataset must contact the CPRD Observational Research Team to discuss their requirements before submitting their application. Linked data are only provided by CPRD as part of a data extract that is linked to CPRD primary care data.
Classifications based on the population characteristics of small areas or neighbourhoods (and the individuals who live there) are available for linkage to CPRD primary care data. CPRD has linked GP practice postcodes and eligible patient residence postcodes for both CPRD GOLD and CPRD Aurum to some of the most commonly requested area level data. This includes several measures of area level deprivation and a rural-urban classification. These measures can be used as a proxy for socio-demographic and socio-economic data which are generally poorly recorded in the primary care data given they do not directly relate to a patient's care.
For each measure, the postcode of the practice or patient residence is mapped, using a postcode lookup file, to a Lower Layer Super Output Area (LSOA), a Super Output Area (SOA) in Northern Ireland, or a datazone (DZ) in Scotland.
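The two-step mapping described above (postcode to small area, then small area to deprivation measure) can be sketched as follows. This is a minimal illustration only: the lookup tables, postcodes, area codes and decile values are all made up, and the function names are hypothetical rather than part of any CPRD tooling.

```python
from typing import Optional

# Hypothetical lookup tables; every code and value here is invented for illustration.
POSTCODE_TO_AREA = {
    "AB1 2CD": "AREA0001",
    "EF3 4GH": "AREA0002",
}
AREA_TO_DECILE = {
    "AREA0001": 3,
    "AREA0002": 9,
}


def normalise_postcode(postcode: str) -> str:
    """Upper-case the postcode and collapse runs of whitespace to one space."""
    return " ".join(postcode.upper().split())


def deprivation_decile(postcode: str) -> Optional[int]:
    """Map postcode -> small area -> deprivation decile.

    Returns None when either lookup step has no match.
    """
    area = POSTCODE_TO_AREA.get(normalise_postcode(postcode))
    if area is None:
        return None
    return AREA_TO_DECILE.get(area)
```

Only the final banded value leaves the function; the intermediate postcode and area code stay internal, which mirrors the disclosure-control intent described in this section.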
Patient postcode linked measures are available for patients in English practices that have consented to participate in the linkage scheme. The latest available patient postcode of residence is mapped to an LSOA boundary. The LSOA of residence then allows linkage to the following LSOA-level deprivation measures:
- 2004 English Index of Multiple Deprivation
- 2007 English Index of Multiple Deprivation
- 2010 English Index of Multiple Deprivation
- 2015 English Index of Multiple Deprivation (composite and individual domains)
- Townsend Deprivation Index: calculated using unadjusted 2001 census data
- Carstairs Index using 2011 census data
Data are provided as quintiles, deciles or twentiles of the deprivation score to prevent disclosure of patient location. To prevent deductive disclosure of a patient's area of residence, researchers will only be provided with one of the above linked datasets for any one study. Access is provided by CPRD subject to ISAC approval.
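As a worked illustration of how quintiles, deciles and twentiles relate, the sketch below collapses a finer band into a coarser one. It assumes that bands are numbered from 1 upwards in the same direction and are nested equal-sized groups of the same ranking; whether that holds for a given linked file should be checked against its documentation. The function name is hypothetical.

```python
def collapse_band(band: int, fine_n: int, coarse_n: int) -> int:
    """Collapse a band from a finer scheme (e.g. 20 twentiles) into a
    coarser scheme (e.g. 5 quintiles).

    Assumes bands are numbered 1..fine_n and that each coarse band is
    made up of fine_n // coarse_n consecutive fine bands.
    """
    if fine_n % coarse_n != 0:
        raise ValueError("coarse_n must divide fine_n exactly")
    if not 1 <= band <= fine_n:
        raise ValueError("band is outside 1..fine_n")
    per_group = fine_n // coarse_n
    # Integer ceiling of band / per_group.
    return (band + per_group - 1) // per_group
```

For example, under these assumptions `collapse_band(3, 10, 5)` places decile 3 in quintile 2, and `collapse_band(20, 20, 5)` places twentile 20 in quintile 5.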
The general practice postcode linkages are available for all practices in CPRD GOLD and CPRD Aurum. They use the general practice postcode, which is linked via the LSOA, the SOA in Northern Ireland or the datazone (DZ) in Scotland.
The general practice postcode linkage includes several well-known area-based measures of deprivation, of which two (Index of Multiple Deprivation and Carstairs Index) are available at the LSOA level for linkage to CPRD primary care data through the practice postcode. These measures are:
- 2015 English Index of Multiple Deprivation (composite and individual domains)
- 2016 Scottish Index of Multiple Deprivation (composite and individual domains)
- 2017 Northern Ireland Index of Multiple Deprivation (composite and individual domains)
- 2014 Welsh Index of Multiple Deprivation (composite and individual domains)
- Carstairs Index: England, Wales and Scotland calculated using 2011 census data
As standard, the most recent national Indices of Deprivation are provided for each country. It is important to note that the IMD indices are not comparable between countries in the UK. Older versions of the deprivation scores can be provided on request. The data are updated monthly. Data are provided as quintiles or deciles of the deprivation score to prevent disclosure of patient location. To prevent deductive disclosure of the location of a practice, researchers will only be provided with one of the above linked datasets for any one study. Access is provided by CPRD subject to ISAC approval.
It may be important to distinguish between rural and urban areas when investigating differences in social and economic characteristics of small areas. Populations can vary in their composition between urban and rural areas, as can access to services, employment and educational opportunities, and quality of life. The measures available for patient (England only) and practice postcode are:
- 2011 England and Wales Rural-Urban classification
- 2015 Northern Ireland Rural-Urban classification
- 2016 Scottish Rural-Urban classification
Access is provided by CPRD subject to ISAC approval.
For more information about data linkage and prices please contact CPRD Enquiries on firstname.lastname@example.org
NHS Digital has responsibility for standardising, collecting and publishing data and information from across the health and social care system in England.
CPRD linked data from NHS Digital include Hospital Episode Statistics (HES), a database containing details of all admissions, A&E attendances and outpatient appointments at NHS hospitals in England, as well as ONS mortality data and Mental Health Datasets.
HES Admitted Patient Care (HES APC) data contain details of all admissions to, or attendances at, English NHS healthcare providers. They include private patients treated in NHS hospitals, patients resident outside of England and care delivered by treatment centres (including those in the independent sector) funded by the NHS. All NHS healthcare providers in England, including acute hospital trusts, primary care trusts and mental health trusts, provide data.
HES APC data include the complete set of hospital episode information for each linked patient with a hospitalisation record: admission and discharge dates, diagnoses (with the primary diagnosis identified), the specialists under whom the patient was seen, and procedures undertaken. In addition, augmented care data (intensive and/or high dependency levels of care) and maternity data are available.
Diagnostic data recorded in HES are coded using the International Classification of Diseases, 10th revision (ICD-10); procedure information is coded using the UK Office of Population Censuses and Surveys classification (OPCS), version 4.6.
Requests for HES APC data access are subject to prior ISAC approval.
The latest release of HES APC data (set 19) covers the period April 1997 to March 2020.
Please click on the link below to download the documentation which provides an overview of the HES APC data linked to CPRD primary care patients.
More information about HES APC data can be found in the data resource profile below, and from a number of recent concordance and validation studies.
Publication: Herbert A, Wijlaars L, Zylbersztejn A, Cromwell D, Hardelid P. Data Resource Profile: Hospital Episode Statistics Admitted Patient Care (HES APC). International Journal of Epidemiology, Volume 46, Issue 4, August 2017, Pages 1093–1093i.
Publication: Thorn JC, Turner EL, Hounsome L, the CAP trial group, et al. Validating the use of Hospital Episode Statistics data and comparison of costing methodologies for economic evaluation: an end-of-life case study from the Cluster randomised triAl of PSA testing for Prostate cancer (CAP). BMJ Open 2016;6:e011063.
Publication: Saine, ME et al. (2019). Concordance of hospitalizations between Clinical Practice Research Datalink and linked Hospital Episode Statistics among patients treated with oral antidiabetic therapies. Pharmacoepidemiol Drug Saf. issn: 1053-8569. doi: 10.1002/pds.4853
Publication: McDonald, L, CJ Sammon, et al. (2018). Under-recording of hospital bleeding events in UK primary care: a linked Clinical Practice Research Datalink and Hospital Episode Statistics study. Clin Epidemiol 10, pp. 1155–1168. issn: 1179-1349. doi: 10.2147/clep.s170304.
Publication: Williams, R et al. (2018). Cancer recording in patients with and without type 2 diabetes in the Clinical Practice Research Datalink primary care data and linked hospital admission data: a cohort study. BMJ Open 8.5, e020827. issn: 2044-6055. doi: 10.1136/bmjopen-2017-020827.
HES Outpatient (HES OP) data are a collection of individual records of outpatient appointments occurring in England only. The data include information on the type of outpatient consultation, appointment dates, the main specialty and treatment specialty under which the patient was treated, referral source, waiting times, clinical diagnosis and procedures performed. HES OP data can be used to support health resource utilisation studies, clarify clinical health care pathways and enable variations in the uptake of services to be evaluated, for example by gender and age.
Access to linked HES OP data is subject to prior ISAC approval.
The latest release of HES OP data (set 18) covers the period April 2003 to June 2019.
Please click on the link below to download the documentation relating to HES Outpatient data.
Useful information can be found in the following validation study on the coverage of HES OP resource-use data in comparison to medical records from a cluster randomised trial:
Publication: Thorn JC, Turner E, Hounsome L, Walsh E, Donovan JL, Verne J, Neal DE, Hamdy FC, Martin RM, Noble SM. Validation of the Hospital Episode Statistics Outpatient Dataset in England. Pharmacoeconomics, 34 (2), 161-8, Feb 2016.
HES Accident and Emergency (HES A&E) data consist of individual records of patient care administered in the accident and emergency setting in England. These data are a subset of national A&E data collected by NHS England to monitor the national standard that 95% of patients attending A&E should wait no longer than 4 hours from arrival to admission, transfer or discharge. A&E data are submitted by A&E providers of all types in England. Data collected include details about patients’ attendance, outcomes of attendance, waiting times, referral source, A&E diagnosis, A&E treatment (drugs prescribed are not recorded), A&E investigations and Health Resource Group. HES A&E may be used to clarify the health care pathway, to quantify health resource use and costs in the emergency setting, and to assess variations in the uptake of emergency services over time.
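The arithmetic behind the 4-hour standard mentioned above can be sketched as follows. This is an illustration only: it takes a plain list of waiting times in minutes (arrival to admission, transfer or discharge) rather than any actual HES A&E field, and the function names are hypothetical.

```python
from typing import Sequence

FOUR_HOURS_MIN = 4 * 60   # the 4-hour threshold, in minutes
STANDARD = 0.95           # the 95% national standard quoted in the text


def share_within_four_hours(waits_minutes: Sequence[float]) -> float:
    """Return the proportion of attendances with a wait of 4 hours or less."""
    if not waits_minutes:
        raise ValueError("at least one attendance is required")
    within = sum(1 for wait in waits_minutes if wait <= FOUR_HOURS_MIN)
    return within / len(waits_minutes)


def meets_standard(waits_minutes: Sequence[float]) -> bool:
    """True when the share within 4 hours reaches the 95% standard."""
    return share_within_four_hours(waits_minutes) >= STANDARD
```

A wait of exactly 240 minutes is counted as within the standard here, following the wording "no longer than 4 hours".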
Access to HES A&E data is subject to prior ISAC approval.
The latest release of HES A&E data (set 18) covers the period April 2007 to June 2019.
Please click on the link below to download the documentation relating to HES Accident & Emergency data.
The Diagnostic Imaging Dataset (DID) is a collection of detailed information about diagnostic imaging tests, such as x-rays and MRI scans, taken from NHS providers' radiological information systems. The DID includes information on imaging tests carried out from 1 April 2012 on NHS patients in England. It does not include the images that are produced as a result of these tests. The DID captures information about referral source and patient type, details of the test (type of test and body site), plus items about waiting times for each diagnostic imaging event, from time of test request through to time of reporting. The DID enables analysis of demographic and geographic variation in access to different test types and different providers.
The DID is routinely linked to Hospital Episode Statistics (HES) through NHS Digital. This existing HES DID dataset has now been linked to CPRD primary care data enabling users to analyse patient care pathways. Access to HES DID data is subject to prior ISAC approval.
The latest release of HES DID data (set 18) covers the period April 2012 to June 2019.
Please click on the link below to download the documentation relating to the HES Diagnostic Imaging Dataset.
The HES Patient Reported Outcomes Measures (PROMs) programme covers common elective surgical procedures performed in NHS England including groin hernia operations, hip replacements, knee replacements and varicose vein operations. The programme covers over 300 NHS hospitals and Independent Sector Providers in England that undertake elective operations. The purpose of PROMs is to capture patients’ own assessments of their health and health-related quality of life, shortly before and some months after surgery. Patient questionnaires administered comprise a disease-specific instrument, a generic instrument and a series of additional questions about the patient’s health and symptoms. Note that the mandatory national PROMs collections for varicose vein surgery and groin hernia surgery ended on 1 October 2017.
Access to linked HES PROMs data is subject to prior ISAC approval, and these data are only available for non-commercial purposes such as academic research or research relating to the delivery of services to the National Health Service (NHS). Case-by-case evaluation of requests involving commercial interests will be required.
The latest release of HES PROMs data (set 18) covers the period April 2009 to June 2019.
Please click on the link below to download the documentation relating to HES PROMs data.
Death registration data are provided by the Office for National Statistics (ONS) and include information on the official date and causes of death (using ICD codes).
Access to ONS Death Registration data is subject to prior ISAC approval.
The latest release of ONS Death Registration Data (set 19) covers the period 2 January 1998 to 20 April 2020.
Please note that late registration for some deaths means that the proportion of deaths captured is lower for the last year of the coverage period, and this proportion is likely to differ by age at death and cause of death. The effect is especially pronounced for the last 1-2 weeks of available death data, which show an undercount of the total number of deaths because they do not capture deaths whose registration has been delayed. For example, deaths referred to coroners in England, Wales and Northern Ireland cannot be registered until investigations have been concluded, which can result in delays of months or years.
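One way an analyst might allow for this undercount is to end follow-up some days before the end of the coverage period. The sketch below shows the date arithmetic only. The 14-day default is an illustrative assumption based on the "last 1-2 weeks" noted above, not a CPRD recommendation, and the function name is hypothetical.

```python
from datetime import date, timedelta


def conservative_end(coverage_end: date, lag_days: int = 14) -> date:
    """Return an analysis end date set lag_days before the coverage end.

    lag_days is chosen by the analyst; 14 is only an illustrative default.
    """
    if lag_days < 0:
        raise ValueError("lag_days must not be negative")
    return coverage_end - timedelta(days=lag_days)


# Set 19 ONS death registration coverage ends on 20 April 2020.
SET_19_END = date(2020, 4, 20)
analysis_end = conservative_end(SET_19_END)
```

A longer lag may be appropriate for outcomes where coroner referral is common, since those registrations can be delayed by months or years.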
Please click on the link below to download the documentation relating to ONS death registration data.
For more information please refer to the ONS User guide to mortality statistics, the ONS analysis exploring the impact of registration delays on mortality statistics and the associated dataset used for this report.
Further details can be found in three studies investigating the impact of the choice of data source in estimating mortality.
Publication: Gallagher, AM et al. (2019). The accuracy of date of death recording in the Clinical Practice Research Datalink GOLD database in England compared with the Office for National Statistics death registrations. Pharmacoepidemiol Drug Saf. issn: 1053-8569. doi: 10.1002/pds.4747.
Publication: Harshfield, A et al. (2018). Do GPs accurately record date of death? A UK observational analysis. BMJ Support Palliat Care. issn: 2045-435x. doi: 10.1136/bmjspcare-2018-001514.
Publication: Gallagher, AM et al. (2016). The Impact of the Choice of Data Source in Record Linkage Studies Estimating Mortality in Venous Thromboembolism. PLoS One 11.2, e0148349. issn: 1932-6203. doi: 10.1371/journal.pone.0148349.
The Mental Health Dataset (MHDS) is a collection of patient records of individuals who accessed secondary care adult mental health services and who are thought to be suffering from a mental illness. The data include information about the type and location of care received, different episodes of care received within a spell of illness and the events that occurred such as recording of Health of the Nation Outcome Scales (HoNOS) scores, Patient Health Questionnaire (PHQ-9) scores or diagnoses. MHDS data can be used to support research into resource utilisation and provide information about patient access to secondary mental health care services. This can be useful to understand patient pathways and consider associations between primary care and access to and outcomes recorded in secondary mental health care services.
Access to linked MHDS data is subject to the prior approval of the Independent Scientific Advisory Committee (ISAC).
The latest release of MHDS data (set 18) covers the period April 2007 to November 2015. Due to a number of changes in the structure and variables recorded in the MHDS the data are provided in two formats. Data collected between April 2007 and March 2011 are provided in a first format and data collected between April 2011 and November 2015 are provided in a second, slightly different, format.
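Because the two formats are split by collection date, a study that spans both periods needs to route each record to the right format. A minimal sketch follows, assuming each period runs from the first day of its start month to the last day of its end month; the exact cut-off days and the function name are assumptions made for illustration.

```python
from datetime import date
from typing import Optional

# Assumed whole-month boundaries for the two MHDS formats in set 18.
FORMAT_1 = (date(2007, 4, 1), date(2011, 3, 31))
FORMAT_2 = (date(2011, 4, 1), date(2015, 11, 30))


def mhds_format(record_date: date) -> Optional[int]:
    """Return 1 or 2 for the MHDS format covering record_date,
    or None when the date falls outside set 18 coverage."""
    if FORMAT_1[0] <= record_date <= FORMAT_1[1]:
        return 1
    if FORMAT_2[0] <= record_date <= FORMAT_2[1]:
        return 2
    return None
```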
Please click on the link below to download the documentation relating to the Mental Health Dataset.
Cancer data are provided by Public Health England (PHE) via the National Cancer Registration and Analysis Service (NCRAS). Linked NCRAS CPRD datasets include Cancer Registration data, the Systemic Anti-Cancer Therapy (SACT) Dataset, the National Radiotherapy Dataset (RTDS) and the Cancer Patient Experience Survey (CPES).
Access to cancer data is subject to prior ISAC approval.
The cancer registration data contain a record for each registrable tumour diagnosed or treated in England, of which the NCRAS has been notified. Cancers are coded using the International Classification of Diseases for Oncology, revision 3, 2011. They are also back-mapped to the International Classification of Diseases, 10th revision (ICD-10).
The latest release of PHE cancer registration data (set 18) covers the period January 1990 to December 2016.
More information about the cancer registration data can be found in the data resource profile published by PHE:
Publication: Henson KE, Elliss-Brookes L, Coupland VH, Payne E, Vernon S, Rous B, Rashbass J. Data Resource Profile: National Cancer Registration Dataset in England. International Journal of Epidemiology, dyz076.
Further details can be found in three studies comparing recording of cancer across data sources.
Publication: Strongman H, Williams R, Bhaskaran K. What are the implications of using individual and combined sources of routinely collected data to identify and characterise incident site-specific cancers? a concordance and validation study using linked English electronic health records data. BMJ Open 2020; 10:e037719. doi: 10.1136/bmjopen-2020-037719
Publication: Arhi, CS, A Bottle, et al. (2018). Comparison of cancer diagnosis recording between the Clinical Practice Research Datalink, Cancer Registry and Hospital Episodes Statistics. Cancer Epidemiol 57, pp. 148–157. issn: 1877-7821. doi: 10.1016/j.canep.2018.08.009.
Publication: Margulis, AV, J Fortuny, et al. (2018a). Validation of Cancer Cases Using Primary Care, Cancer Registry, and Hospitalization Data in the United Kingdom. Epidemiology 29.2, pp. 308–313. issn: 1044-3983. doi: 10.1097/ede.0000000000000786.
The SACT dataset covers chemotherapy treatment for all solid tumour and haematological malignancies, including those in clinical trials. Information is included about programme and regimen of treatment, and the outcome for each treatment. In the latest linkage release (set 18) SACT data are available for patients with tumours recorded in the cancer registration data from January 2014 to September 2017. Data prior to January 2014 are also available but should be used with caution due to incomplete ascertainment during this period.
More information about the SACT data can be found in the data resource profile published by PHE:
Publication: Bright CJ, Lawton S, Benson S, Bomb M, Dodwell D, Henson KE, McPhail S, Miller L, Rashbass J, Turnbull A, Smittenaar R. Data Resource Profile: The Systemic Anti-Cancer Therapy (SACT) Dataset. International Journal of Epidemiology, dyz137.
The RTDS dataset contains records of radiotherapy services provided since April 2009, including teletherapy and brachytherapy. All radiotherapy delivered in England to patients in NHS facilities, or in private facilities where delivery was funded by the NHS, is included. Brachytherapy delivered for the treatment of non-malignant disease, radiotherapy delivered using unsealed sources, and non-therapeutic exposures delivered using radiotherapy machines (e.g. imaging) are not included. In the latest linkage release (set 18) RTDS data are available for patients with tumours recorded in the cancer registration data from April 2012 to September 2017.
The CPES data include information from patients who have responded to the survey about their cancer journey, from their initial GP visit prior to diagnosis, through diagnosis and treatment, to the ongoing management of their cancer. Data are available for five waves of the survey conducted from 2010 to 2015.
Quality of Life of Cancer Survivors in England: Pilot Patient Reported Outcomes Measures Survey (2011)
The Quality of Life of Cancer Survivors in England: Pilot Survey (2011) was commissioned by the Department of Health as part of the National Cancer Survivorship Initiative (NCSI). The survey was conducted by Quality Health in conjunction with three cancer registries in England. The survey measured the overall quality of life of representative samples of cancer survivors with breast cancer, colorectal cancer, prostate cancer and non-Hodgkin’s lymphoma (NHL) diagnosed between July 2006 and July 2010. Quality of life was assessed at one of four time points after diagnosis: approximately one, two, three or five years. As this was a pilot survey, numbers are small and data governance issues will need to be carefully considered on a study by study basis. Outcome items in the survey are made up of EuroQol 5-level (EQ-5D), Functional Assessment of Cancer Therapy (FACT), and Social Difficulties Inventory (SDI) items.
The Quality of Life of Colorectal Cancer Survivors in England: Patient Reported Outcome Measures survey is a national survey that was commissioned by the Department of Health as a follow-on from the pilot study in July 2011, which was undertaken to confirm the value of collecting PROMs data on breast cancer, prostate cancer, colorectal cancer and non-Hodgkin’s lymphoma. It includes survey data from 34,467 patients aged 16 years and over with an incident colorectal cancer diagnosis between January 2010 and December 2011. Outcome items in the survey are made up of EuroQol 5-level (EQ-5D), Functional Assessment of Cancer Therapy (FACT), and Social Difficulties Inventory (SDI) items.
The source data are provided to organisations that hold CPRD multi-study licences to enable researchers to ascertain which patients are eligible for linkage and to clarify the coverage periods for each data source. The linkage eligibility file (linkage_eligibility.txt) only includes patients from practices that have consented to take part in the linkage process. The file contains flags to indicate whether the patient is eligible for each individual linked data source. Some patients will not be eligible for any of the linked data sources, whereas others may be eligible for some/all of them. These data are provided so that multi-study licence users can determine the appropriate population to include in their study. The linkage coverage file (linkage_coverage.txt) indicates the start and end of coverage for each individual linked data source.
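The use of the two source files can be sketched as below: select patients flagged as eligible for a given linked source, then check whether an event date falls inside that source's coverage period. The sample file contents, the tab delimiter, the column names (`patid`, `hes_apc_e`, `death_e`, `data_source`, `start`, `end`) and the ISO date format are all assumptions made for illustration; the real files should be read according to their accompanying documentation. The coverage dates reuse the set 19 periods quoted earlier, taking the first and last day of the month where only a month is given.

```python
import csv
import io
from datetime import date
from typing import Dict, List, Tuple

# Made-up sample content standing in for linkage_eligibility.txt
# and linkage_coverage.txt; real layouts may differ.
ELIGIBILITY_TXT = (
    "patid\thes_apc_e\tdeath_e\n"
    "1001\t1\t1\n"
    "1002\t0\t1\n"
    "1003\t0\t0\n"
)
COVERAGE_TXT = (
    "data_source\tstart\tend\n"
    "hes_apc\t1997-04-01\t2020-03-31\n"
    "ons_death\t1998-01-02\t2020-04-20\n"
)


def eligible_patients(text: str, flag_column: str) -> List[str]:
    """Return patient IDs whose eligibility flag for one source equals '1'."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row["patid"] for row in reader if row[flag_column] == "1"]


def coverage_periods(text: str) -> Dict[str, Tuple[date, date]]:
    """Return {data_source: (start, end)} parsed from the coverage table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        row["data_source"]: (
            date.fromisoformat(row["start"]),
            date.fromisoformat(row["end"]),
        )
        for row in reader
    }


def within_coverage(event: date, period: Tuple[date, date]) -> bool:
    """True when the event date lies inside the inclusive coverage period."""
    start, end = period
    return start <= event <= end
```

In this sample, patient 1003 is present in the file but eligible for neither source, which reflects the point above that some patients will not be eligible for any linked data source.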
Access to set 18 and set 19 source data for CPRD GOLD and/or CPRD Aurum is available to nominated users only; for access, please contact us at email@example.com.
If you are unsure about which linked dataset and/or source file should be used in your study, please contact us on firstname.lastname@example.org
CPRD has developed a probabilistic mother-baby link algorithm, based on data recorded in the primary care medical record. This links likely mother-baby pairs within the CPRD GOLD database, based on family number plus maternity information from the mother’s primary care record, and the month of birth of newly registered babies.
The Pregnancy Register is created by an algorithm that was developed jointly by CPRD and the London School of Hygiene and Tropical Medicine. The Pregnancy Register lists all pregnancies identified in the CPRD GOLD database and includes details of each one. A single record represents a unique pregnancy episode, and there may be more than one episode per woman. For pregnancies resulting in live births, de-identified information on the linked babies in the CPRD Mother Baby Link is also provided.
Publication: Minassian C, Williams R, Meeraus WH, Smeeth L, Campbell OMR, Thomas SL. Methods to generate and validate a Pregnancy Register in the UK Clinical Practice Research Datalink primary care database. Pharmacoepidemiol Drug Saf, Volume 28, Number 7, p.923-933 (2019)