Detection of Rare Events: Cluster Based Preprocessing of the Training Set: The Case on Complaints for Invoice Time Series
Keywords:
Unbalanced data, majority class, hierarchical clustering, heuristics
Abstract
Detection of rare events is a major problem when dealing with unbalanced data. In applications of machine learning tools, the data is split into training and test samples, and preprocessing is applied to the training set with the aim of obtaining a more balanced sample. In this paper we discuss preprocessing methods applied to heterogeneous data that has been clustered with respect to expected anomaly types. We propose a method for deciding how much to oversample or undersample from each cluster, based on the variability of the items in that cluster as measured by Principal Component Analysis. The method is applied to the problem of detecting anomalies in a time series of invoices, where the average rate of complaints is on the order of 10^-4.
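The abstract does not state how variability is scored or how that score is turned into per-cluster sample sizes, so the following is only a minimal sketch under stated assumptions, not the authors' procedure. It assumes that the variability of a cluster is the sum of its PCA eigenvalues (its total variance), that a fixed sampling budget is split across clusters in proportion to that score, and that plain random resampling is used. The names pca_variability, allocate_samples, and resample_cluster, the 0.9 variance threshold, and the synthetic clusters are all hypothetical.

```python
import numpy as np

def pca_variability(X, var_threshold=0.9):
    """Return (total variance, number of principal components needed
    to explain var_threshold of the variance) for one cluster."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # Singular values of the centered data give the PCA eigenvalues.
    s = np.linalg.svd(Xc, compute_uv=False)
    eigvals = s**2 / max(len(X) - 1, 1)
    total = float(eigvals.sum())
    if np.isclose(total, 0.0):
        return 0.0, 0
    ratios = np.cumsum(eigvals) / total
    k = int(np.searchsorted(ratios, var_threshold)) + 1
    return total, min(k, len(eigvals))

def allocate_samples(clusters, budget):
    """Assumed rule: split the budget in proportion to total variance."""
    scores = np.array([pca_variability(X)[0] for X in clusters])
    if scores.sum() == 0:
        weights = np.full(len(clusters), 1.0 / len(clusters))
    else:
        weights = scores / scores.sum()
    return np.floor(weights * budget).astype(int)

def resample_cluster(X, target, rng):
    """Undersample (without replacement) if target is below the cluster
    size, otherwise oversample by drawing with replacement."""
    X = np.asarray(X)
    replace = target > len(X)
    idx = rng.choice(len(X), size=int(target), replace=replace)
    return X[idx]

# Synthetic stand-ins for two clusters of the training set.
rng = np.random.default_rng(0)
tight = rng.normal(0.0, 1.0, size=(200, 3))  # large, low variability
wide = rng.normal(0.0, 2.0, size=(50, 3))    # small, high variability
clusters = [tight, wide]

budget = 100
alloc = allocate_samples(clusters, budget)
balanced = [resample_cluster(X, n, rng) for X, n in zip(clusters, alloc)]
```

Under these assumptions the large, low-variability cluster is undersampled and the small, high-variability cluster is oversampled. A synthetic oversampler such as SMOTE could replace the draw with replacement in resample_cluster; that substitution is likewise an assumption and not something the abstract specifies.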
Copyright (c) 2024 American Scientific Research Journal for Engineering, Technology, and Sciences
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who submit papers with this journal agree to the following terms.