Testing Methods for Machine Learning Systems: From Data Validation to Model Evaluation

Authors

  • Kochetov Dmitrii

Keywords

machine learning, software testing, robustness, fairness, data validation, model evaluation, MLOps, responsible AI

Abstract

As machine-learning systems penetrate domains with tangible human and economic consequences, conventional specification-driven software testing proves inadequate for artefacts whose behaviour is stochastic and tightly coupled to data distributions. Quality therefore requires a multi-axis conception: not merely point estimates of predictive accuracy but an integrated appraisal that spans nominal performance, resilience to input degradation, and measures of group-level parity. This study employs a mixed-methods approach, combining a structured literature review with an empirical case analysis. The empirical case uses the UCI Adult dataset and a logistic-regression baseline (Python 3.10; scikit-learn 1.3) evaluated under five scenarios: Baseline; Typos (5% random character-replacement noise in categorical fields); Noise (numerical features perturbed with Gaussian noise, σ = 0.5); Drift (10% of test examples replaced with instances from another demographic subgroup); and Bias Mitigation (post-processing with Calibrated Equalized Odds, AIF360 0.5.0). Predictive quality is measured with Accuracy and ROC-AUC; fairness with two metrics, the Demographic Parity Gap (DPG) and the Equalized Odds Gap (EOG). Each scenario is run five times to average out sampling variation. Under clean conditions the model attains Accuracy = 0.835 and ROC-AUC = 0.918, yet a measurable group-level fairness deficit persists even while aggregate, discrimination-agnostic performance is high (DPG = 0.029, EOG = 0.040). Typographical noise leaves Accuracy unchanged at 0.835 while the small but consistent gap remains (EOG = 0.039), illustrating a ‘surface-metric’ failure mode in which fairness risk stays hidden behind unchanged headline metrics such as Accuracy. Gaussian noise and distributional drift reduce predictive performance (Accuracy = 0.781 and 0.801; ROC-AUC = 0.869 and 0.876, respectively), and drift additionally magnifies between-group error imbalances, so the degradation falls asymmetrically on protected groups (EOG rises to 0.065). Applying Calibrated Equalized Odds removes the measured equalized-odds gap (EOG returns to zero) at the cost of only a marginal reduction in accuracy relative to the baseline; however, it increases the demographic parity gap (DPG). The findings call for embedding multidimensional automated testing regimes that jointly gate correctness, robustness, and fairness within MLOps pipelines (CI/CD/CT). Calibrated Equalized Odds is effective at neutralizing imbalances in error rates, but it does so by reallocating selection rates and modestly reducing nominal accuracy, meaning that fairness targets and tolerances must be chosen explicitly in light of legal constraints, operational priorities, and stakeholder values.
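
For concreteness, the sketch below (plain Python with NumPy, not the study's own code) shows one common formulation of the two fairness metrics reported in the abstract, the Demographic Parity Gap (DPG) and the Equalized Odds Gap (EOG), for a binary classifier with a binary protected attribute; the function names and the synthetic demonstration data are assumptions introduced purely for illustration.

import numpy as np

def demographic_parity_gap(y_pred, group):
    # |P(y_hat = 1 | group = a) - P(y_hat = 1 | group = b)| for a binary protected attribute
    g0, g1 = np.unique(group)
    return abs(y_pred[group == g0].mean() - y_pred[group == g1].mean())

def equalized_odds_gap(y_true, y_pred, group):
    # Largest between-group gap in P(y_hat = 1 | y_true = y), taken over y in {0, 1},
    # i.e. the worse of the false-positive-rate gap and the true-positive-rate gap
    g0, g1 = np.unique(group)
    gaps = []
    for y in (0, 1):
        r0 = y_pred[(group == g0) & (y_true == y)].mean()
        r1 = y_pred[(group == g1) & (y_true == y)].mean()
        gaps.append(abs(r0 - r1))
    return max(gaps)

# Tiny synthetic demonstration (illustrative values only)
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)
group  = rng.integers(0, 2, size=1000)          # e.g. 0 = one subgroup, 1 = the other
y_pred = (rng.random(1000) < 0.3).astype(int)   # stand-in for model predictions
print(demographic_parity_gap(y_pred, group))
print(equalized_odds_gap(y_true, y_pred, group))

In the study itself these quantities would be computed from the model's predictions on the UCI Adult test set under each of the five scenarios; the bias-mitigation scenario additionally applies AIF360's Calibrated Equalized Odds post-processing before the metrics are evaluated.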

Author Biography

  • Kochetov Dmitrii

    Independent researcher, Moscow, Russia



Published

2025-11-01

Issue

Section

Articles

How to Cite

Kochetov Dmitrii. (2025). Testing Methods for Machine Learning Systems: From Data Validation to Model Evaluation. American Scientific Research Journal for Engineering, Technology, and Sciences, 103(1), 330-342. https://asrjetsjournal.org/American_Scientific_Journal/article/view/12080