XHate-999: Analyzing and Detecting Abusive Language Across Domains and Languages

We present XHate-999, a multi-domain and multilingual evaluation data set for abusive language detection. By aligning test instances across six typologically diverse languages, XHate-999 for the first time allows for disentanglement of the domain transfer and language transfer effects in abusive language detection. We conduct a series of domain- and language-transfer experiments with state-of-the-art monolingual and multilingual transformer models, setting strong baseline results and profiling XHate-999 as a comprehensive evaluation resource for abusive language detection. Finally, we show that domain- and language-adaptation, via intermediate masked language modeling on abusive corpora in the target language, can lead to substantially improved abusive language detection in the target language in the zero-shot transfer setups.

Journal Title

Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020)

Conference Name

28th International Conference on Computational Linguistics (COLING 2020)

Publisher

International Committee on Computational Linguistics

Publisher DOI

https://doi.org/10.17863/CAM.62218

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International

Sponsorship

European Research Council (648909)

Collections

University of Cambridge Research Outputs (Articles and Conferences)