Outlier detection algorithms over fuzzy data with weighted least squares

Nikolova, Nataliya; Rodriguez, RM; Symes, Mark; Toneva, D; Kolev, K; Tenekedjiev, Kiril

File(s) under permanent embargo

journal contribution

posted on 2023-05-21, 02:12 authored by Nataliya Nikolova, Rodriguez, RM, Mark Symes, Toneva, D, Kolev, K, Kiril Tenekedjiev

In the classical leave-one-out procedure for outlier detection in regression analysis, we exclude an observation and then construct a model on the remaining data. If the difference between the predicted and the observed value is large, we declare this value an outlier. As a rule, such procedures use single comparison testing. The problem becomes much harder when each observation is associated with a degree of membership in an underlying population, and outlier detection must be generalized to operate over fuzzy data. We present a new approach for outlier detection that operates over fuzzy data using two inter-related algorithms. Because of the way outliers enter the observation sample, they may be of various orders of magnitude. To account for this, we divide the outlier detection procedure into cycles, each consisting of two phases. In Phase 1, we apply a leave-one-out procedure to each non-outlier in the dataset. In Phase 2, all previously declared outliers are subjected to the Benjamini–Hochberg step-up multiple testing procedure, which controls the false-discovery rate, and the non-confirmed outliers can return to the dataset. Finally, we construct a regression model over the resulting set of non-outliers. In this way, a reliable, high-quality regression model is obtained in Phase 1, because the leave-one-out procedure purges dubious observations comparatively easily owing to the single comparison testing. At the same time, confirming outlier status against the newly obtained high-quality regression model is much harder because of the multiple testing procedure applied; hence only the true outliers remain outside the data sample. The two phases in each cycle are a good trade-off between the desire to construct a high-quality model (i.e., over informative data points) and the desire to use as many data points as possible (thus leaving as many observations as possible in the data sample). The number of cycles is user-defined, but the procedures can finalize the analysis early if a cycle detects no new outliers. We offer one illustrative example and two practical case studies (from real-life thrombosis studies) that demonstrate the application and strengths of our algorithms. In the concluding section, we discuss several limitations of our approach and offer directions for future research.
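The two building blocks named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation, which is not available from this record. It assumes a straight-line model, degrees of membership in (0, 1] used directly as least-squares weights, and a normal approximation for the leave-one-out p-value that ignores parameter uncertainty. The function names `wls_line`, `loo_p_values`, and `benjamini_hochberg` are hypothetical.

```python
from statistics import NormalDist


def wls_line(x, y, w):
    """Fit y = a + b*x by weighted least squares with positive weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def loo_p_values(x, y, mu):
    """Leave each point out, refit on the rest, and score its prediction error.

    mu holds degrees of membership in (0, 1], used here as regression weights
    (an assumption of this sketch). Needs at least 4 points and a non-perfect fit.
    """
    n = len(x)
    p_values = []
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        xs = [x[j] for j in keep]
        ys = [y[j] for j in keep]
        ws = [mu[j] for j in keep]
        a, b = wls_line(xs, ys, ws)
        # Weighted residual variance of the reduced model (2 fitted parameters).
        rss = sum(w * (yj - a - b * xj) ** 2 for xj, yj, w in zip(xs, ys, ws))
        sigma2 = rss / (len(keep) - 2)
        # Simplification: the error variance of point i is taken as sigma2 / mu[i].
        z = (y[i] - (a + b * x[i])) / (sigma2 / mu[i]) ** 0.5
        p_values.append(2.0 * (1.0 - NormalDist().cdf(abs(z))))
    return p_values


def benjamini_hochberg(p_values, q=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k * q / m. Returns flags in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected
```

In a cycle of the kind the abstract describes, a Phase 1 step might flag points whose leave-one-out p-value falls below a single-test threshold, and a Phase 2 step might pass the p-values of all flagged points through `benjamini_hochberg`, returning the unconfirmed ones to the dataset. How the authors compute those p-values over fuzzy data is not stated in this record.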

History

Publication title

International Journal of Fuzzy Systems

Volume

23

Issue

5

Pagination

1234-1256

ISSN

1562-2479

Department/School

Australian Maritime College

Publisher

Springer

Place of publication

Germany

Rights statement

Copyright Taiwan Fuzzy Systems Association 2021

Repository Status

Restricted

Socio-economic Objectives

Artificial intelligence; Expanding knowledge in engineering

Keywords

regression analysis; leave-one-out method; degree of membership; multiple testing; Benjamini–Hochberg step-up multiple testing; false-discovery rate

Licence

In Copyright
