Synthetic data

CPRD has generated a number of synthetic datasets that can be used to understand the structure and utility of the anonymised CPRD Aurum database, for training purposes, or to improve algorithms or machine learning workflows.

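Two high-fidelity synthetic datasets, the CPRD cardiovascular disease synthetic dataset and the CPRD COVID-19 symptoms and risk factors synthetic dataset, are being made available for a nominal administrative fee.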

An additional fee will apply for an annual teaching licence for the high-fidelity synthetic datasets.

CPRD has also developed the CPRD Aurum sample dataset, a medium-fidelity synthetic dataset that resembles the real-world CPRD Aurum database. Pricing information is available from enquiries@cprd.com.

Access to any of the three synthetic datasets requires a data sharing agreement (DSA) with the applicant’s organisation, in line with advice received from the Information Commissioner’s Office (ICO) Innovation Hub in response to a formal query by the MHRA.

For access to these datasets please submit an application form to enquiries@cprd.com including ‘Synthetic data access request’ in the email subject header. Applicants from organisations that are not existing CPRD clients will also need to submit a new client request form.

Existing multi-study licence (MSL) clients will not need a separate DSA to access the CPRD Aurum sample dataset as this will be added to their MSL agreement. MSL clients can request the CPRD Aurum sample dataset at no additional cost by contacting enquiries@cprd.com.

Download:

(Word, 410KB, 3 pages)
 

CPRD has generated high-fidelity synthetic datasets using a synthetic data generation and evaluation framework that was developed under a grant from the Regulators’ Pioneer Fund, launched by the Department for Business, Energy and Industrial Strategy (BEIS) and managed by Innovate UK. Both the framework and the synthetic datasets generated with it are owned by the Medicines and Healthcare products Regulatory Agency (MHRA).

A detailed technical description of the methodology used to generate the synthetic datasets is available in the publications by Wang et al. (2021) and Tucker et al. (2020).

These high-fidelity synthetic datasets replicate the complex clinical relationships in real primary care patient data while protecting patient privacy, because they are wholly synthetic. They can be used instead of real patient data for complex statistical analyses as well as for machine learning and artificial intelligence (AI) research applications.

CPRD cardiovascular disease synthetic dataset

This synthetic dataset is based on anonymised real primary care patient data extracted from the CPRD Aurum database. The dataset focuses on cardiovascular disease risk factors and was developed as a proof of concept in a project funded by the Regulators’ Pioneer Fund, launched by the Department for Business, Energy and Industrial Strategy (BEIS) and managed by Innovate UK.

Download:

(PDF, 194KB, 5 pages)

https://doi.org/10.11581/yk6n-b652

CPRD COVID-19 symptoms and risk factors synthetic dataset

This synthetic dataset is based on anonymised real primary care patient data extracted from the CPRD Aurum database. The dataset focuses on patients presenting to primary care with symptoms indicative of COVID-19 (confirmed/suspected COVID-19) and control patients with negative COVID-19 test results. The dataset includes data on sociodemographic and clinical risk factors.

This dataset was generated using the synthetic data generation and evaluation framework developed under a grant from the Regulators’ Pioneer Fund, launched by the Department for Business, Energy and Industrial Strategy (BEIS) and managed by Innovate UK. The development of the dataset itself was funded by NHSX.

Download:

(PDF, 293KB, 8 pages)

https://doi.org/10.48329/yk2n-sz66

Further information and publications

Press release: New synthetic datasets to assist COVID-19 and cardiovascular research

Download:

(PDF, 912KB, 11 pages)
 

Publication: de Benedetti, J., Oues, N., Wang, Z., Myles, P., Tucker, A. (2020). Practical lessons from Generating Synthetic Healthcare Data with Bayesian Networks. In: Koprinska I. et al. (eds) ECML PKDD 2020 Workshops. ECML PKDD 2020. Communications in Computer and Information Science, vol 1323. Springer, Cham. https://doi.org/10.1007/978-3-030-65965-3_3

Publication: Tucker, A., Wang, Z., Rotalinti, Y. et al. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. npj Digit. Med. 3, 147 (2020). https://doi.org/10.1038/s41746-020-00353-9

Publication: Wang, Z., Myles, P., Tucker, A. Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy. Computational Intelligence. 2021; 1–33. https://doi.org/10.1111/coin.12427

Download:

(PDF, 303KB, 6 pages)
 

Publication: Myles, P. et al. Synthetic data and the innovation, assessment, and regulation of AI medical devices. RF Quarterly. 2021; 1(2): 48-53. © 2021. Regulatory Affairs Professionals Society.

CPRD Aurum sample dataset

The CPRD Aurum sample dataset is a medium-fidelity synthetic dataset that resembles the real-world CPRD Aurum database with respect to data types, data values, data formats, data structure and table relationships.

This synthetic dataset can be used for multiple purposes, including: as a sample dataset to understand the structure and utility of the anonymised CPRD Aurum database; as a data management teaching or training resource; to develop, validate or test analytics tools for use with CPRD Aurum data; to improve bespoke CPRD Aurum application interfaces or algorithms (for example, a bespoke cohort selection tool); or to develop machine learning workflows that can be applied to anonymised CPRD Aurum data. The development of this dataset was funded by NHSX.

For pricing please contact enquiries@cprd.com.

Download:

(PDF, 128KB, 2 pages)

https://doi.org/10.48329/hm7t-qs28
 

[Page last reviewed 2 December 2021]