A pipeline to further enhance quality, integrity and reusability of the NCCID clinical data

Published version
Peer-reviewed

Abstract

The National COVID-19 Chest Imaging Database (NCCID) is a centralized UK database of thoracic imaging and corresponding clinical data. It is made available by the National Health Service Artificial Intelligence (NHS AI) Lab to support the development of machine learning tools focused on Coronavirus Disease 2019 (COVID-19). A bespoke cleaning pipeline for NCCID, developed by NHSx, was introduced in 2021. We present an extension to the original cleaning pipeline for the clinical data of the database. The extension corrects additional systematic inconsistencies in the raw data, such as those in patient sex, oxygen levels and date values. The most important changes are discussed in this paper, whilst the code and further explanations are made publicly available on GitHub. The suggested cleaning will allow users worldwide to work with more consistent data when developing machine learning tools, without needing to be experts. In addition, it highlights some of the challenges when working with clinical multi-center data and includes recommendations

Description

Acknowledgements: There is no direct funding for this study, but the authors are grateful for support from the EU/EFPIA Innovative Medicines Initiative project DRAGON (101005122) (A.B., I.S., M.R., J.B., E.G.-K., L.E.S., AIX-COVNET, J.W.M., E.S., C.-B.S.), FWF Austria (A.B.), the Trinity Challenge BloodCounts! project (M.R., C.-B.S.), the EPSRC Cambridge Mathematics of Information in Healthcare Hub EP/T017961/1 (M.R., J.H.F.R., J.A.D.A., C.-B.S.), the Cantab Capital Institute for the Mathematics of Information (C.-B.S.), NIHR Cambridge Biomedical Research Centre (BRC-1215-20014) (I.S., J.W.M., L.E.S., E.S.), Wellcome Trust (J.H.F.R.), British Heart Foundation (J.H.F.R.), the NIHR Cambridge Biomedical Research Centre (J.H.F.R.), the European Research Council under the European Union’s Horizon 2020 research and innovation programme grant agreement no. 777826 (C.-B.S.), the Alan Turing Institute (C.-B.S.), and Cancer Research UK Cambridge Centre (C9685/A25177) (C.-B.S.). In addition, C.-B.S. acknowledges support from the Leverhulme Trust project on ‘Breaking the non-convexity barrier’, the Philip Leverhulme Prize, the EPSRC grants EP/S026045/1 and EP/T003553/1 and the Wellcome Innovator Award RG98755. The AIX-COVNET collaboration is also grateful to Intel for financial support and to CRUK National Cancer Imaging Translational Accelerator (NCITA) (C22479/A28667) for use of their data repository. Lastly, we want to thank NHS AI Lab, the British Thoracic Society, Royal Surrey NHS Foundation Trust and Faculty for their great work on the NCCID and the original cleaning pipeline.

Journal Title

Scientific data

Journal ISSN

2052-4463

Volume Title

10

Publisher

Nature Portfolio

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International

Sponsorship

EPSRC (EP/T017961/1)
Engineering and Physical Sciences Research Council (EP/N014588/1)
European Commission Horizon 2020 (H2020) Marie Skłodowska-Curie actions (777826)
EPSRC (EP/S026045/1)
EPSRC (EP/T003553/1)
National Institute for Health and Care Research (IS-BRC-1215-20014)
Cancer Research UK (C96/A25177)
Cancer Research UK (C197/A28667)