dc.contributor.author: Trisovic, Ana
dc.date.accessioned: 2018-10-11T15:17:40Z
dc.date.available: 2018-10-11T15:17:40Z
dc.date.issued: 2018-10-10
dc.date.submitted: 2018-07-19
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/283607
dc.description.abstract: This dissertation presents the first study of data preservation and research reproducibility in data science at the Large Hadron Collider at CERN. In particular, provenance capture of the experimental data and the reproducibility of physics analyses at the LHCb experiment were studied. First, the preservation of the software and hardware dependencies of the LHCb experimental data and simulations was investigated. It was found that the links between the data processing information and the datasets themselves were obscure. To document these dependencies, a graph database was designed and implemented. The nodes in the graph represent the data with their processing information, software and computational environment, whilst the edges represent their dependence on the other nodes. The database provides a central place to preserve information that was previously scattered across the LHCb computing infrastructure. Using the developed database, a methodology to recreate the LHCb computational environment and to execute the data processing on the cloud was implemented with the use of virtual containers. It was found that the produced physics events were identical to the official LHCb data, meaning that the system can aid in data preservation. Furthermore, the developed method can be used for outreach purposes, providing a streamlined way for a person external to CERN to process and analyse the LHCb data. Following this, the reproducibility of data analyses was studied. A data provenance tracking service was implemented within the LHCb software framework Gaudi. The service allows analysts to capture, within a dataset itself, the data processing configuration needed to reproduce that dataset. Furthermore, to assess the current status of the reproducibility of LHCb physics analyses, the major parts of an analysis were reproduced by following methods described in publicly and internally available documentation. This study allowed the identification of barriers to reproducibility and specific points where documentation is lacking. With this knowledge, one can specifically target areas that need improvement and encourage practices that would improve reproducibility in the future. Finally, contributions were made to the CERN Analysis Preservation portal, which is a general knowledge preservation framework developed at CERN to be used across all the LHC experiments. In particular, the functionality to preserve source code from git repositories and Docker images in one central location was implemented.
dc.description.sponsorship: CERN Doctoral Student program; the Muir Wood studentship awarded by Newnham College; the Dositeja award by the Fund for Young Talents in Serbia
dc.language.iso: en
dc.rights: All rights reserved
dc.subject: software preservation
dc.subject: reproducible research
dc.subject: data preservation
dc.subject: Reproducibility
dc.title: Data preservation and reproducibility at the LHCb experiment at CERN
dc.type: Thesis
dc.type.qualificationlevel: Doctoral
dc.type.qualificationname: Doctor of Philosophy (PhD)
dc.publisher.institution: University of Cambridge
dc.publisher.department: Physics
dc.date.updated: 2018-10-10T14:39:05Z
dc.identifier.doi: 10.17863/CAM.30973
dc.contributor.orcid: Trisovic, Ana [0000-0003-1991-0533]
dc.publisher.college: Newnham
dc.type.qualificationtitle: PhD in Computing in High Energy Physics
cam.supervisor: Jones, Chris
cam.thesis.funding: false