NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding.
Authors
Stevens, Robert
Alachram, Halima
King, Ross
Ananiadou, Sophia
Schoene, Annika M
Christopoulou, Fenia
Matthew, Joel
Garg, Sahil
Hermjakob, Ulf
Marcu, Daniel
Sheng, Emily
Beißbarth, Tim
Wingender, Edgar
Galstyan, Aram
Chambers, Brendan
Pan, Weidi
Publication Date
2021-10-20
Journal Title
NPJ Syst Biol Appl
ISSN
2056-7189
Publisher
Springer Science and Business Media LLC
Volume
7
Issue
1
Language
eng
Type
Article
This Version
VoR
Citation
Wang, K., Stevens, R., Alachram, H., Li, Y., Soldatova, L., King, R., Ananiadou, S., et al. (2021). NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding. NPJ Syst Biol Appl, 7 (1). https://doi.org/10.1038/s41540-021-00200-x
Abstract
Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, automated named entity recognition (NER) extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.
Keywords
Networking and Information Technology R&D (NITRD)
Sponsorship
Biotechnology and Biological Sciences Research Council (BB/F008228/1)
Biotechnology and Biological Sciences Research Council (BB/G000662/1)
Biotechnology and Biological Sciences Research Council (BB/D00425X/1)
Biotechnology and Biological Sciences Research Council (BB/E018025/1)
Engineering and Physical Sciences Research Council (EP/S014128/1)
Engineering and Physical Sciences Research Council (EP/R022925/1)
Engineering and Physical Sciences Research Council (EP/M015688/1)
Engineering and Physical Sciences Research Council (EP/K030469/1)
EPSRC (EP/R022925/2)
Identifiers
PMCID: PMC8528865, PMID: 34671039
External DOI: https://doi.org/10.1038/s41540-021-00200-x
This record's URL: https://www.repository.cam.ac.uk/handle/1810/331092