
Ethnicity data resource in population-wide health records: completeness, coverage and granularity of diversity.

Published version


Pineda-Moncusí, Marta
Allery, Freya
Delmestri, Antonella
Bolton, Thomas
Nolan, John


Intersectional social determinants including ethnicity are vital in health research. We curated a population-wide data resource of self-identified ethnicity data from over 60 million individuals in primary care in England, linking it to hospital records. We assessed ethnicity data in terms of completeness, consistency, and granularity and found that one in ten individuals do not have ethnicity information recorded in primary care. By linking to hospital records, ethnicity data were completed for 94% of individuals. By reconciling SNOMED-CT concepts and census-level categories into a consistent hierarchy, we organised more than 250 ethnicity sub-groups including and beyond "White", "Black", "Asian", "Mixed" and "Other", and found them to be distributed in proportions similar to the general population. This large observational dataset presents an algorithmic hierarchy to represent self-identified ethnicity data collected across heterogeneous healthcare settings. Accurate and easily accessible ethnicity data can lead to a better understanding of population diversity, which is important to address disparities and influence policy recommendations that can translate into better, fairer health for all.


Acknowledgements: The British Heart Foundation Data Science Centre (grant No SP/19/3/34678, awarded to Health Data Research (HDR) UK) funded co-development (with NHS England) of the Secure Data Environment service for England, provision of linked datasets, data access, user software licences, computational usage, and data management and wrangling support, with additional contributions from the HDR UK Data and Connectivity component of the UK Government Chief Scientific Adviser’s National Core Studies programme to coordinate national COVID-19 priority research. Consortium partner organisations funded the time of contributing data analysts, biostatisticians, epidemiologists, and clinicians. The authors acknowledge English language editing by Dr Jennifer A de Beyer and Amelia M Doran, Centre for Statistics in Medicine, University of Oxford. This work was carried out with the support of the BHF Data Science Centre led by HDR UK (BHF Grant no. SP/19/3/34678). This study made use of de-identified data held in NHS England’s Secure Data Environment service for England and made available via the BHF Data Science Centre’s CVD-COVID-UK/COVID-IMPACT consortium. This work used data provided by patients and collected by the NHS as part of their care and support. We would like to acknowledge all data providers who make health-relevant data available for research. This research is part of the Data and Connectivity National Core Study, led by Health Data Research UK in partnership with the Office for National Statistics and funded by UK Research and Innovation (grant ref MC_PC_20058). This work was also supported by The Alan Turing Institute via ‘Towards Turing 2.0’ EPSRC Grant Funding. The funders had no role in the study design, data collection, data analysis, data interpretation, or report writing.

Funder: UKRI (UK Research and Innovation), grant ref MC_PC_20058

Funder: UCL UKRI Centre for Doctoral Training in AI-enabled Healthcare studentship, grant ref EP/S021612/1

Funder: Health Data Research UK (HDRUK), grant ref HDR-9006


Humans, Ethnicity, Population Health, England

Journal Title

Sci Data

Springer Science and Business Media LLC
Alan Turing Institute (Towards Turing 2.0)
British Heart Foundation (BHF) (SP/19/3/34678)
RCUK | Medical Research Council (MRC) (MR/V028367/1)
RCUK | Economic and Social Research Council (ESRC) (ES/S007393/1)