Data-efficient classification of road inspection texts with a semantic similarity criterion

Published version
Peer-reviewed

Abstract

Road maintenance involves manually classifying a large volume of textual data necessary for downstream applications such as raising a maintenance job order. Automation can not only bring significant time and cost savings, but it can also facilitate digitalization efforts like the Road Digital Twin (DT). However, as is the case with many Architecture, Engineering and Construction (AEC) applications, annotated data availability is low, which demands exploration of specialized techniques for resource-constrained settings that have not been focused on in engineering. This work bridges this gap by proposing a data-efficient similarity-based text classifier that aims at effectively utilizing existing domain knowledge and pre-training knowledge of Large Language Models (LLMs) to enable rapid domain adaptation. It reformulates text classification as a similarity comparison task, using semantics directly as a classification criterion. Through a case study on classifying road inspection comments, the proposed classifier outperformed both traditionally fine-tuned and few-shot learning approaches. It attained an F1 score of 0.46 with just one example per class, equivalent to the value for Sentence Transformer Fine-Tuning (SetFit) with 4 examples and Llama3 with 10. Additionally, it is able to keep up with traditional fine-tuning methods when trained with more than 300,000 total examples, achieving an accuracy of more than 95% and an F1 score of around 0.9. These results indicate that the proposal is competitive against traditionally fine-tuned and few-shot models across all levels of data availability. This versatility significantly elevates the feasibility of deploying an automated text classification pipeline in a complex engineering field like road maintenance.
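
The reformulation of classification as a similarity comparison can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract describes a classifier built on LLM pre-training knowledge, whereas the sketch below substitutes a toy bag-of-words vector for the embedding so that it runs with the standard library alone. The class labels, class descriptions, and function names (`embed`, `cosine`, `classify`) are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence embedding (assumption): word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(comment, class_descriptions):
    """Return the label whose description is most similar to the comment."""
    query = embed(comment)
    scores = {
        label: cosine(query, embed(description))
        for label, description in class_descriptions.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical classes, each represented by one short textual description.
CLASS_DESCRIPTIONS = {
    "pothole": "pothole or hole in the carriageway surface",
    "drainage": "blocked drain or gully with standing water",
    "signage": "damaged or missing road sign",
}

label, scores = classify(
    "large pothole in carriageway near junction", CLASS_DESCRIPTIONS
)
```

In this sketch the predicted label is simply the class whose description scores highest against the inspection comment. A semantic embedding model in place of `embed` would let comments match a class without sharing exact words with its description, which is the property the abstract relies on.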

Journal Title

Advanced Engineering Informatics

Journal ISSN

1474-0346
1873-5320

Volume Title

67

Publisher

Elsevier

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International

Sponsorship

Engineering and Physical Sciences Research Council (EP/S02302X/1)
EPSRC (EP/V056441/1)
European Commission Horizon 2020 (H2020) Marie Skłodowska-Curie actions (101034337)