Data-efficient classification of road inspection texts with a semantic similarity criterion
Published version
Peer-reviewed
Abstract
Road maintenance involves manually classifying a large volume of textual data needed for downstream applications such as raising a maintenance job order. Automating this step can bring significant time and cost savings and can also support digitalization efforts such as the Road Digital Twin (DT). However, as in many Architecture, Engineering and Construction (AEC) applications, little annotated data is available, which calls for specialized techniques for resource-constrained settings that have so far received little attention in engineering. This work bridges this gap by proposing a data-efficient similarity-based text classifier that uses existing domain knowledge and the pre-training knowledge of Large Language Models (LLMs) to enable rapid domain adaptation. It reformulates text classification as a similarity comparison task, using semantics directly as the classification criterion. In a case study on classifying road inspection comments, the proposed classifier outperformed both traditionally fine-tuned and few-shot learning approaches. It attained an F1 score of 0.46 with just one example per class, equivalent to the value for Sentence Transformer Fine-Tuning (SetFit) with 4 examples and Llama3 with 10. It also keeps up with traditional fine-tuning methods when trained with more than 300,000 total examples, achieving an accuracy of more than 95% and an F1 score of around 0.9. These results indicate that the proposal is competitive against traditionally fine-tuned and few-shot models across all levels of data availability. This versatility makes deploying an automated text classification pipeline considerably more feasible in a complex engineering field like road maintenance.
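The reformulation of classification as a similarity comparison can be illustrated with a minimal sketch. This is not the paper's implementation: the class labels, the example texts, and the helper names `embed`, `cosine`, and `classify` are all hypothetical, and a bag-of-words count vector stands in for the LLM-based semantic representation so that the example runs with the standard library alone. It only shows the general idea of assigning a comment to the class whose single reference example it most resembles.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic embedding: lowercase bag-of-words counts.

    Assumption: a real system would use an LLM or sentence encoder here.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(comment, class_examples):
    """Return the label whose reference example is most similar to the comment."""
    query = embed(comment)
    return max(
        class_examples,
        key=lambda label: cosine(query, embed(class_examples[label])),
    )


# Hypothetical labels with one reference example per class (not from the paper).
CLASS_EXAMPLES = {
    "pothole": "deep pothole in the carriageway surface",
    "signage": "road sign damaged and leaning over",
    "drainage": "blocked drain causing standing water",
}

predicted = classify("large pothole near the junction", CLASS_EXAMPLES)
print(predicted)
```

With one reference text per class, the sketch mirrors the one-example-per-class setting mentioned above, although a word-overlap measure captures far less meaning than the semantic comparison the abstract describes.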
Journal ISSN
1873-5320
Sponsorship
EPSRC (EP/V056441/1)
European Commission Horizon 2020 (H2020) Marie Skłodowska-Curie actions (101034337)

