Repositório Digital de Publicações Científicas: An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing


Repositório Digital de Publicações Científicas da Universidade de Évora

Please use this identifier to cite or link to this item: http://hdl.handle.net/10174/34695

Title:	An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing
Authors:	Carnaz, Gonçalo; Antunes, Mário; Nogueira, Vitor Beires
Keywords:	crime-related documents; cybersecurity; criminal investigation; Portuguese language corpus
Issue Date:	26-Jun-2021
Citation:	Carnaz, G.; Antunes, M.; Nogueira, V.B. An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing. Data 2021, 6, 71. https://doi.org/10.3390/data6070071
Abstract:	Criminal investigations collect and analyze the facts related to a crime, from which investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other investigative methods and techniques. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities mentioned in the investigation. Computerized processing of these documents helps the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide range of dedicated tools exists, but they share a major limitation: they are unable to process criminal reports in the Portuguese language, because an annotated corpus for that purpose does not exist. This paper presents an annotated corpus composed of a collection of anonymized crime-related documents extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools that detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
URI:	http://hdl.handle.net/10174/34695
Type:	article
Appears in Collections:	INF - Publicações - Artigos em Revistas Internacionais Com Arbitragem Científica
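
Note on the reported metrics: the abstract gives mean precision, recall, and F1-score but does not say how they were averaged. One possible reading, assumed here and not confirmed by this record, is a macro-average over entity types, where each metric is computed per type and then averaged. Under that reading the mean F1 need not equal the harmonic mean of the mean precision and mean recall, which fits the reported figures (the harmonic mean of 0.808 and 0.722 is about 0.763, not 0.733). The sketch below illustrates the effect; the entity types and counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """True positives, false positives, false negatives for one entity type."""
    tp: int
    fp: int
    fn: int


def prf(c: Counts) -> tuple[float, float, float]:
    """Precision, recall, and F1 for a single entity type."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(per_type: dict[str, Counts]) -> tuple[float, float, float]:
    """Unweighted mean of per-type precision, recall, and F1."""
    scores = [prf(c) for c in per_type.values()]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Hypothetical counts for illustration only; not data from the paper.
HYPOTHETICAL_COUNTS = {
    "PERSON": Counts(tp=90, fp=10, fn=30),
    "LOCATION": Counts(tp=40, fp=20, fn=10),
    "LICENSE_PLATE": Counts(tp=15, fp=0, fn=15),
}

macro_p, macro_r, macro_f1 = macro_average(HYPOTHETICAL_COUNTS)
# Harmonic mean of the averaged precision and recall, for comparison.
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"macro precision = {macro_p:.3f}")
print(f"macro recall    = {macro_r:.3f}")
print(f"macro F1        = {macro_f1:.3f}")
print(f"harmonic mean of macro P and macro R = {harmonic:.3f}")
```

With these illustrative counts the macro F1 comes out lower than the harmonic mean of the macro precision and macro recall, the same direction as the gap in the reported values.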

Files in This Item:

File	Description	Size	Format
data-06-00071-v2.pdf		349.16 kB	Adobe PDF

Serviços de Ciência e Cooperação - Universidade de Évora