Automation of Research Master Data Management for Dataset Consistency
Keywords:
master-data management, FAIR principles, dataset consistency, probabilistic record linkage, active learning, research reproducibility, cloud architectures
Abstract
The article addresses the automation of master-data management (MDM) in research organizations as a key prerequisite for dataset consistency and reproducible results. The problem is pressing because of the growing volume of heterogeneous data, the reproducibility crisis acknowledged by most biomedical researchers, and the considerable economic losses caused by manual cleansing and duplicated experiments. The study aims to justify and experimentally confirm the effectiveness of integrating FAIR principles with a multi-layer architecture for data intake, normalization, and golden-record creation. The novelty lies in a holistic method that combines cloud reference architectures (Data Mesh and Landing Zone), probabilistic record linkage, graph embeddings, and active learning for dynamic adjustment of confidence thresholds, which reduces the burden on experts while providing continuous quality metrics. Automated MDM eliminates 37% of data redundancy, reduces the time researchers spend on cleansing to 26%, and accelerates integration into machine-learning pipelines by nearly one third; its economic effect is already apparent in an estimated annual cost reduction of at least EUR 10.2 billion in the EU. Known limitations, namely the risk of erroneous merges, outdated records, and user resistance to automation, will guide further research into adaptive thresholds, remediation of legacy data, and better human-machine interaction. The paper is intended for data-management specialists, bioinformaticians, research project managers, and information-system developers.
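To make the idea of probabilistic record linkage with confidence thresholds more concrete, the following is a minimal sketch in the Fellegi-Sunter style, not the implementation evaluated in the paper. The field names, the m/u probabilities, and the two thresholds are hypothetical values chosen for illustration; in the approach described above, the thresholds would presumably be adjusted over time from expert feedback rather than fixed.

```python
import math

# Hypothetical per-field parameters (illustrative only, not from the paper):
# m = P(field agrees | records describe the same entity)
# u = P(field agrees | records describe different entities)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "orcid": (0.99, 0.001),
    "affiliation": (0.80, 0.10),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum log2 likelihood ratios over the fields present in both records."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a.strip().lower() == b.strip().lower():
            total += math.log2(m / u)  # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight, upper=8.0, lower=0.0):
    """Map a weight to a decision; the middle band goes to an expert."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "review"


a = {"name": "Jane Doe", "orcid": "0000-0001", "affiliation": "Univ. A"}
b = {"name": "jane doe", "orcid": "0000-0001", "affiliation": "Institute B"}
print(classify(match_weight(a, b)))
```

Pairs that fall between the two thresholds are the ones routed to a human reviewer; narrowing that band as labelled decisions accumulate is one plausible way to reduce expert workload.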
License
Copyright (c) 2025 American Scientific Research Journal for Engineering, Technology, and Sciences

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.