Please use this identifier to cite or link to this item:
|OLID-BR: offensive language identification dataset for Brazilian Portuguese
|Trajano, D., Bordini, R.H. & Vieira, R. OLID-BR: offensive language identification dataset for Brazilian Portuguese. Lang Resources & Evaluation (2023). https://doi.org/10.1007/s10579-023-09657-0
|Social media has revolutionized the manner in which our society is interconnected. While this extensive connectivity offers numerous benefits, it is also accompanied by significant drawbacks, particularly in terms of the proliferation of fake news and the vast dissemination of hate speech. Identifying offensive comments is a critical task for ensuring the safety of users, which is why industry and academia have been working on developing solutions to this problem. Prior research on hate speech detection has predominantly focused on the English language, with few studies devoted to other languages such as Portuguese. This paper introduces the Offensive Language Identification Dataset for Brazilian Portuguese (OLID-BR), a high-quality NLP dataset for offensive language detection, which we make publicly available. The dataset contains 6,354 (extendable to 13,538) comments labeled using a fine- grained three-layer annotation schema compatible with datasets in other languages, which allows the training of multilingual/cross-lingual models. The five NLP tasks available in OLID-BR allow the detection of offensive comments, the classification of the types of offenses such as racism, LGBTQphobia, sexism, xenophobia, and so on, the identification of the type and the target of offensive comments, and the extraction of toxic spans of offensive comments. All those tasks can enhance the capabilities of content moderation systems by providing deep contextual analysis or highlighting the spans that make a text toxic. We further experiment with and evaluate the dataset using state-of-the-art BERT-based and NER models, which demonstrates the usefulness of OLID-BR for the development of toxicity detection systems for Portuguese texts.
|Appears in Collections:
|CIDEHUS - Publicações - Artigos em Revistas Internacionais Com Arbitragem Científica
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.