Synthetic data

CPRD has generated a number of synthetic datasets that can be used for training purposes or to improve algorithms or machine learning workflows.

High-fidelity synthetic datasets
- CPRD cardiovascular disease synthetic dataset
- CPRD COVID-19 symptoms and risk factors synthetic dataset

Medium-fidelity synthetic datasets
- CPRD Aurum and CPRD GOLD sample datasets

Access to the datasets

Access to the synthetic datasets requires a data sharing agreement (DSA) with the applicant’s organisation, in line with advice received from the Information Commissioner’s Office (ICO) Innovation Hub in response to a formal query by the MHRA.

To request access to these datasets, please submit the CPRD synthetic data access request form to enquiries@cprd.com, including ‘Synthetic data access request’ in the email subject line. Applicants from organisations that are not existing CPRD clients will also need to submit a new client request form.

Existing multi-study licence (MSL) clients do not need a separate DSA to access the sample datasets, as access can be added to their MSL agreement upon request. MSL clients do not need to submit a synthetic data access request form and can apply for access to the sample datasets at no additional cost by emailing enquiries@cprd.com.

Download: CPRD synthetic data access request form v1.1 (Word, 410KB, 3 pages)

Pricing information is available on our web page or by contacting enquiries@cprd.com.

High-fidelity synthetic datasets

CPRD has generated high-fidelity synthetic datasets using a synthetic data generation and evaluation framework that was developed under a grant from the Regulators’ Pioneer Fund, launched by the Department for Business, Energy and Industrial Strategy (BEIS) and managed by Innovate UK. Both the framework and the synthetic datasets generated with it are owned by the Medicines and Healthcare products Regulatory Agency (MHRA).

A detailed technical description of the methodology used to generate the synthetic datasets is available in the publications by Wang et al. (2021) and Tucker et al. (2020).

These high-fidelity synthetic datasets replicate the complex clinical relationships in real primary care patient data while protecting patient privacy as they are wholly synthetic. They can be used instead of real patient data for complex statistical analyses as well as machine learning and artificial intelligence (AI) research applications.

The high-fidelity synthetic datasets are based on data derived from CPRD Aurum and, as such, are not suitable for understanding the raw structure of the CPRD Aurum database. For that purpose, please request the medium-fidelity CPRD Aurum sample dataset.

The high-fidelity synthetic datasets are available for a nominal administrative fee. An additional fee applies for an annual teaching licence.

CPRD cardiovascular disease synthetic dataset

This synthetic dataset is based on anonymised real primary care patient data extracted from the CPRD Aurum database. The dataset focuses on cardiovascular disease risk factors and was developed as a proof-of-concept dataset as part of a project funded by the Regulators’ Pioneer Fund, launched by the Department for Business, Energy and Industrial Strategy (BEIS) and managed by Innovate UK.

Download: CPRD Synthetic Cardiovascular Disease Data Specification (PDF, 194KB, 5 pages)

https://doi.org/10.11581/yk6n-b652

CPRD COVID-19 symptoms and risk factors synthetic dataset

This synthetic dataset is based on anonymised real primary care patient data extracted from the CPRD Aurum database. The dataset focuses on patients presenting to primary care with symptoms indicative of COVID-19 (confirmed/suspected COVID-19) and control patients with negative COVID-19 test results. The dataset includes data on sociodemographic and clinical risk factors.

The development of this dataset was funded by NHSX and used the synthetic data generation and evaluation framework developed under a grant from the Regulators’ Pioneer Fund, launched by the Department for Business, Energy and Industrial Strategy (BEIS) and managed by Innovate UK.

Download: CPRD COVID-19 Synthetic Data Specification V4 (PDF, 293KB, 8 pages)

https://doi.org/10.48329/yk2n-sz66

Further information and publications

Press release: New synthetic datasets to assist COVID-19 and cardiovascular research

Download: Tucker et al preprint

Publication: de Benedetti, J., Oues, N., Wang, Z., Myles, P., Tucker, A. (2020). Practical lessons from Generating Synthetic Healthcare Data with Bayesian Networks. In: Koprinska I. et al. (eds) ECML PKDD 2020 Workshops. ECML PKDD 2020. Communications in Computer and Information Science, vol 1323. Springer, Cham. https://doi.org/10.1007/978-3-030-65965-3_3

Publication: Tucker, A., Wang, Z., Rotalinti, Y. et al. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. npj Digit. Med. 3, 147 (2020). https://doi.org/10.1038/s41746-020-00353-9

Publication: Wang, Z., Myles, P., Tucker, A. Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy. Computational Intelligence. 2021; 1–33. https://doi.org/10.1111/coin.12427

Publication: Wang, Z et al. Evaluating a Longitudinal Data Generator using Real World Data. 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS). https://doi.org/10.1109/CBMS52027.2021.00074

Download: Wang et al preprint (PDF, 1MB, 6 pages)

Publication: Myles P, et al. Synthetic data and the innovation, assessment, and regulation of AI medical devices. RF Quarterly. 2022;2(4):20-26. Published online 9 December 2022. (A preprint version of this article can be found at Myles et al preprint (PDF, 197KB, 6 pages))

Accepted manuscript: The potential synergies between synthetic data and in silico trials in relation to generating representative virtual population cohorts. Myles et al 2023 Prog. Biomed. Eng. https://doi.org/10.1088/2516-1091/acafbf (A preprint version of this article can be found at Myles et al preprint (PDF, 159KB, 6 pages))

Our R code for generating high-fidelity synthetic data is available via GitHub (https://github.com/zhenchenwang/latent_model). The R package bnlearn (v4.6.1) is used for all Bayesian network inference. The R function FCI, part of the pcalg package (v2.6-11), is used to identify latent variables. Kmmd is implemented using the R package kernlab (v0.9-29). This code can be used for training purposes with the high-fidelity cardiovascular synthetic dataset as a ground truth dataset, and can be adapted for use with other similar datasets.
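
As a rough illustration of the general idea behind generating synthetic records from a Bayesian network, the following Python sketch samples wholly synthetic patients from a small hand-specified network. It is not the R code linked above: the variables, network structure and probabilities are hypothetical and chosen only for illustration, and the inference, latent variable identification and Kmmd steps performed by bnlearn, pcalg and kernlab are not reproduced here.

```python
import random

# Hypothetical toy network (invented for illustration only):
#   smoker -> high_bp -> cvd_event, and smoker -> cvd_event.
P_SMOKER = 0.2                              # P(smoker)
P_HIGH_BP = {True: 0.45, False: 0.25}       # P(high_bp | smoker)
P_CVD = {                                   # P(cvd_event | smoker, high_bp)
    (True, True): 0.30,
    (True, False): 0.15,
    (False, True): 0.12,
    (False, False): 0.04,
}


def sample_patient(rng):
    """Draw one synthetic record, sampling each node after its parents."""
    smoker = rng.random() < P_SMOKER
    high_bp = rng.random() < P_HIGH_BP[smoker]
    cvd_event = rng.random() < P_CVD[(smoker, high_bp)]
    return {"smoker": smoker, "high_bp": high_bp, "cvd_event": cvd_event}


def sample_cohort(n, seed=0):
    """Draw n synthetic records with a seeded generator for repeatability."""
    rng = random.Random(seed)
    return [sample_patient(rng) for _ in range(n)]


def rate(records, field, **given):
    """Empirical P(field | given) among the sampled records."""
    matching = [r for r in records if all(r[k] == v for k, v in given.items())]
    return sum(r[field] for r in matching) / len(matching)


cohort = sample_cohort(20000, seed=42)
print("P(smoker):", round(rate(cohort, "smoker"), 3))
print("P(cvd | smoker, high_bp):",
      round(rate(cohort, "cvd_event", smoker=True, high_bp=True), 3))
```

No sampled record corresponds to a real individual, yet the conditional relationships encoded in the network are preserved in the sampled cohort, which is the property a high-fidelity generator aims for at much larger scale.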

Report: Mitchell, C. and Redrup Hill, E. (2023). Are synthetic health data ‘personal data’? A PHG Foundation report independently commissioned by the MHRA to assess the status of synthetic health data in UK data protection law.

Download: Mitchell, C. and Redrup Hill, E. (PDF, 4508KB, 79 pages)

Publication: Draghi, B., Wang, Z., Myles, P., Tucker, A. (2024). Identifying and handling data bias within primary healthcare data using synthetic data generators. Heliyon. e24164. https://doi.org/10.1016/j.heliyon.2024.e24164

Publication: Wang, Z., Draghi, B., Rotalinti, Y., Lunn, D., Myles, P. (2024). High fidelity synthetic data applications for data augmentation. In: Damaševičius, R. and Domínguez-Morales, M. (eds.), Deep Learning - Recent Findings and Researchings. Rijeka: IntechOpen. https://doi.org/10.5772/intechopen.113884

Medium-fidelity synthetic datasets

CPRD Aurum and CPRD GOLD sample datasets

The CPRD Aurum and CPRD GOLD sample datasets are medium-fidelity synthetic datasets that resemble the real-world CPRD data with respect to data types, data values, data formats, data structure and table relationships.

The synthetic datasets can be used for multiple purposes: as sample datasets for understanding the structure and utility of the anonymised databases; as a data management teaching or training resource; to develop, validate and test analytics tools for use with CPRD data; to improve bespoke CPRD application interfaces or algorithms, e.g. a bespoke cohort selection tool; or to develop machine learning workflows that can be applied to anonymised CPRD data. The development of this dataset was funded by NHSX.

Download: CPRD Aurum sample dataset release notes (PDF, 128KB, 2 pages)

https://doi.org/10.48329/hm7t-qs28

Download: CPRD GOLD sample dataset release notes (PDF, 150KB, 2 pages)

https://doi.org/10.48329/y7q8-gr42

If you are interested in a synthetic version of a dataset that is not listed above, please contact us at enquiries@cprd.com with a summary of your requirements. The CPRD team has the expertise and capabilities to generate bespoke synthetic datasets, which may or may not involve CPRD data. Once we have received your requirements, we will be in touch to set up a call to discuss them in more detail and to explore whether we can support you.

Page last reviewed

12-03-2024