
De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation.



BACKGROUND: Epidemiological research may require linkage of information from multiple organizations. This can bring two problems: (1) the information governance desirability of linkage without sharing direct identifiers, and (2) a requirement to link databases without a common person-unique identifier.

METHODS: We develop a Bayesian matching technique to solve both. We provide an open-source software implementation capable of de-identified probabilistic matching despite discrepancies, via fuzzy representations and complete mismatches, plus de-identified deterministic matching if required. We validate the technique by testing linkage between multiple medical records systems in a UK National Health Service Trust, examining the effects of decision thresholds on linkage accuracy. We report demographic factors associated with correct linkage.

RESULTS: The system supports dates of birth (DOBs), forenames, surnames, three-state gender, and UK postcodes. Fuzzy representations are supported for all except gender, and there is support for additional transformations, such as accent misrepresentation, variation for multi-part surnames, and name re-ordering. Calculated log odds predicted a proband's presence in the sample database with an area under the receiver operating characteristic curve of 0.997-0.999 for non-self database comparisons. Log odds were converted to a decision via a consideration threshold θ and a leader advantage threshold δ. Defaults were chosen to penalize misidentification 20-fold versus linkage failure. By default, complete DOB mismatches were disallowed for computational efficiency. At these settings, for non-self database comparisons, the mean probability of a proband being correctly declared to be in the sample was 0.965 (range 0.931-0.994), and the misidentification rate was 0.00249 (range 0.00123-0.00429). Correct linkage was positively associated with male gender, Black or mixed ethnicity, and the presence of diagnostic codes for severe mental illnesses or other mental disorders, and negatively associated with birth year, unknown ethnicity, residential area deprivation, and presence of a pseudopostcode (e.g. indicating homelessness). Accuracy rates would be improved further if person-unique identifiers were also used, as supported by the software. Our two largest databases were linked in 44 min via an interpreted programming language.

CONCLUSIONS: Fully de-identified matching with high accuracy is feasible without a person-unique identifier and appropriate software is freely available.
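
The two-threshold decision step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the published software's code. It assumes that a match is declared only when the best candidate's log odds reach θ and exceed the runner-up's log odds by at least δ. The function name `decide_match`, the use of inclusive comparisons, and the threshold values in the usage lines are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple


def decide_match(
    candidates: List[Tuple[str, float]],
    theta: float,
    delta: float,
) -> Optional[str]:
    """Return the ID of the winning candidate, or None if no match is declared.

    candidates: (candidate_id, log_odds) pairs for one proband.
    theta: consideration threshold (assumed: minimum log odds to be considered).
    delta: leader advantage threshold (assumed: minimum lead over the runner-up).
    """
    # Rank candidates from highest to lowest log odds.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not ranked:
        return None

    best_id, best_log_odds = ranked[0]

    # The leader must reach the consideration threshold.
    if best_log_odds < theta:
        return None

    # The leader must beat the runner-up (if any) by at least delta.
    if len(ranked) > 1 and best_log_odds - ranked[1][1] < delta:
        return None

    return best_id


# Illustrative values only; these are not the paper's defaults.
clear_winner = decide_match([("a", 8.0), ("b", 1.0)], theta=5.0, delta=2.0)
too_close = decide_match([("a", 8.0), ("b", 7.5)], theta=5.0, delta=2.0)
```

Under these assumptions, raising θ or δ trades linkage failures for fewer misidentifications, which is the trade-off the abstract describes when it says the defaults penalize misidentification 20-fold versus linkage failure.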



Bayesian probabilistic linkage, De-identification, Electronic health records, Electronic patient records, Electronic medical records, Identity matching, Mental health, Open-source software, Privacy-preserving record linkage, Pseudonymisation, Psychiatry, Humans, Male, Privacy, Medical Record Linkage, Bayes Theorem, State Medicine, Software

Journal Title

BMC Med Inform Decis Mak

Springer Science and Business Media LLC
Medical Research Council (MC_PC_17213)
MRC (MR/T046430/1)
Medical Research Council (MR/W014386/1)
MRC (via Swansea University) (DATAMIND 106893)
National Institute for Health and Care Research (IS-BRC-1215-20014)
Medical Research Council (MC_PC_21025)
This work was funded by UK Research and Innovation (UKRI) via the UK Medical Research Council (MRC; Mental Health Data Pathfinder, grant MC_PC_17213; DATAMIND, grant MR/W014386/1) and the MRC, UK Arts and Humanities Research Council (AHRC), and UK Economic and Social Research Council (ESRC) (TIMELY, grant MR/T046430/1). For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising. This research was supported in part by the National Institute for Health and Care Research (NIHR) Cambridge Biomedical Research Centre (BRC-1215-20014, NIHR203312) and the NIHR Applied Research Collaboration East of England; the views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health and Social Care. The funders played no role in the design of the study, the collection, analysis, and interpretation of data, or the writing of the manuscript.