Please use this identifier to cite or link to this item:
http://hdl.handle.net/10174/40410
| Title: | Modelação e predição de eventos raros – um estudo comparativo |
| Authors: | Santos, Lorena Ventura |
| Advisors: | Santos, Paulo Infante; Afonso, Anabela Cristina Cavaco Ferreira; Jacinto, Gonçalo João Costa |
| Keywords: | Desequilíbrio de categorias; Eventos raros; Machine learning; Reamostragem; Class imbalance; Rare events; Firth logistic regression; Machine learning; Resampling |
| Issue Date: | 11-Dec-2025 |
| Publisher: | Universidade de Évora |
| Abstract (translated from Portuguese): | Modelling rare events is a central challenge in data science applied
to road safety. This study, centred on the district of Setúbal (2016–2023),
analysed road accidents recorded by the GNR, complemented with meteorological
and infrastructural variables. Statistical and machine learning models (Logistic
Regression, Firth, Random Forest, XGBoost, C5.0 and Naive Bayes) were tested and evaluated by PR-AUC,
ROC-AUC, F₁ and Brier score. To mitigate the extreme class imbalance (≈2% severe cases),
oversampling techniques (ROSE and SMOTENC) were applied to the training set only, avoiding
data leakage, and the cut-off point was set by maximising the F₂-score. XGBoost and
Firth logistic regression showed the best trade-off between sensitivity and calibration,
with AUC≈0.88. The study concludes that combining appropriate resampling with careful
calibration improves the prediction of severe accidents, supporting the definition of
evidence-based prevention policies.
Title (English): Modelling and Prediction of Rare Events – a comparative study
Abstract (English):
Modelling rare events remains a central challenge in data science applied to road safety.
This study focuses on severe road accidents in the district of Setúbal (2019–2023), using
data from the National Republican Guard (GNR), complemented with meteorological
and infrastructural information. Several statistical and machine learning models (Logistic
Regression, Firth, Random Forest, XGBoost, C5.0 and Naive Bayes) were evaluated
through PR-AUC, ROC-AUC, F₁ and Brier score metrics. To address the strong class
imbalance (≈2% severe accidents), oversampling techniques (ROSE and SMOTENC) were
applied only to the training set, avoiding data leakage, and thresholds were defined by
maximising the F₂-score. The XGBoost and Firth logistic models achieved the best
balance between sensitivity and calibration (AUC≈0.88). Results demonstrate that
combining appropriate resampling with careful calibration enhances the prediction of
severe road accidents, supporting evidence-based decision-making in road safety
policies. |
| URI: | http://hdl.handle.net/10174/40410 |
| Type: | masterThesis |
| Appears in Collections: | BIB - Formação Avançada - Teses de Mestrado |
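The abstract states that decision thresholds were defined by maximising the F₂-score, which weights recall more heavily than precision. The following is a minimal Python sketch of that idea, not the thesis's own code: the function names `f_beta` and `best_threshold`, the toy labels and the toy scores are all hypothetical, and the sketch assumes the threshold is tuned on held-out validation predictions rather than on the test set.

```python
from typing import Sequence, Tuple


def f_beta(y_true: Sequence[int], y_pred: Sequence[int], beta: float = 2.0) -> float:
    """F-beta score for binary labels (1 = severe accident, 0 = otherwise)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # By convention, return 0.0 when there are no positives and no positive predictions.
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(
    y_true: Sequence[int], scores: Sequence[float], beta: float = 2.0
) -> Tuple[float, float]:
    """Scan every observed score as a candidate cut-off; keep the one with the highest F-beta."""
    best_t, best_f = 0.5, -1.0
    for t in sorted(set(scores)):
        y_pred = [1 if s >= t else 0 for s in scores]
        f = f_beta(y_true, y_pred, beta)
        if f > best_f:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f = t, f
    return best_t, best_f


# Toy validation data (hypothetical): two severe cases among seven records.
y_val = [0, 0, 0, 0, 1, 0, 1]
p_val = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.80]

threshold, score = best_threshold(y_val, p_val, beta=2.0)
print(threshold, round(score, 3))
```

Because β = 2 penalises missed severe cases more than false alarms, the selected cut-off tends to sit well below 0.5 when the positive class is rare. In this toy example the chosen threshold accepts one false positive in order to catch both severe cases.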
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.