Please use this identifier to cite or link to this item: http://hdl.handle.net/10174/35061

Title: OLID-BR: offensive language identification dataset for Brazilian Portuguese
Authors: Trajano, Douglas
Bordini, Rafael
Vieira, Renata
Keywords: Hate speech
Issue Date: May-2023
Publisher: Springer Nature
Citation: Trajano, D., Bordini, R.H. & Vieira, R. OLID-BR: offensive language identification dataset for Brazilian Portuguese. Lang Resources & Evaluation (2023). https://doi.org/10.1007/s10579-023-09657-0
Abstract: Social media has revolutionized the manner in which our society is interconnected. While this extensive connectivity offers numerous benefits, it is also accompanied by significant drawbacks, particularly in terms of the proliferation of fake news and the vast dissemination of hate speech. Identifying offensive comments is a critical task for ensuring the safety of users, which is why industry and academia have been working on developing solutions to this problem. Prior research on hate speech detection has predominantly focused on the English language, with few studies devoted to other languages such as Portuguese. This paper introduces the Offensive Language Identification Dataset for Brazilian Portuguese (OLID-BR), a high-quality NLP dataset for offensive language detection, which we make publicly available. The dataset contains 6,354 (extendable to 13,538) comments labeled using a fine-grained three-layer annotation schema compatible with datasets in other languages, which allows the training of multilingual/cross-lingual models. The five NLP tasks available in OLID-BR allow the detection of offensive comments, the classification of the types of offenses such as racism, LGBTQphobia, sexism, xenophobia, and so on, the identification of the type and the target of offensive comments, and the extraction of toxic spans of offensive comments. All those tasks can enhance the capabilities of content moderation systems by providing deep contextual analysis or highlighting the spans that make a text toxic. We further experiment with and evaluate the dataset using state-of-the-art BERT-based and NER models, which demonstrates the usefulness of OLID-BR for the development of toxicity detection systems for Portuguese texts.
URI: https://link.springer.com/article/10.1007/s10579-023-09657-0
http://hdl.handle.net/10174/35061
Type: article
Appears in Collections: CIDEHUS - Publicações - Artigos em Revistas Internacionais Com Arbitragem Científica

Files in This Item:

File: olidBR.pdf
Size: 1.47 MB
Format: Adobe PDF
Access: Restricted (a copy can be requested)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

 
