Data Quality Metrics: How to Measure and Improve Accuracy
Keywords:
Data Quality Metrics, Data Accuracy, Data Quality, Data Governance, Data Integrity

Abstract
The quality of data, and in particular its assessment, has been studied extensively in both research and practice. To support economically driven management of data quality and decision-making under uncertainty, the data quality level must be evaluated using well-founded metrics. If these metrics are poorly defined, however, they can lead to erroneous conclusions and financial loss. Within a decision-oriented framework, we therefore define five requirements for data quality metrics intended to support such economically driven data quality management and decision-making under uncertainty. We demonstrate the applicability and efficacy of these requirements by evaluating five data quality metrics across several data quality dimensions, and we discuss the practical implications of applying them.

Two of the most important data quality dimensions are consistency and accuracy. Inconsistencies and errors in a database often arise as violations of integrity constraints. To make a dirty database D consistent, automated methods are needed to find a repair D′ that satisfies the constraints and differs "minimally" from D. It is equally important to guarantee that the automatically generated repair D′ is accurate, i.e., that it deviates from the "correct" data by no more than a given bound. This study explores practical approaches to improving data consistency and accuracy. We use conditional functional dependencies (CFDs) from [6] to enforce consistency and to detect errors that traditional methods may miss. We present two techniques for improving consistency: one that automatically computes a repair D′ satisfying a given set of CFDs, and one that incrementally finds a repair in response to updates to a clean database. We show that both problems are intractable, and we empirically validate the effectiveness and efficiency of our heuristic algorithms.
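To make the notion of a CFD concrete, the following is a minimal sketch (not the paper's algorithm) of detecting CFD violations in a toy relation. The relation, its attributes (CC, ZIP, STREET), and the pattern tableau are illustrative assumptions: the CFD says that for UK rows (CC = '44'), ZIP functionally determines STREET, a constraint a plain functional dependency over the whole relation would not express.

```python
from collections import defaultdict

# Toy relation with attributes CC (country code), ZIP, STREET (illustrative data).
rows = [
    {"CC": "44", "ZIP": "EH4 1DT", "STREET": "Mayfield"},
    {"CC": "44", "ZIP": "EH4 1DT", "STREET": "Crichton"},      # conflicts with row 0
    {"CC": "01", "ZIP": "07974",   "STREET": "Mountain Ave"},  # pattern does not apply
]

def cfd_violations(rows, lhs, rhs, pattern):
    """Return pairs of row indices violating the CFD (lhs -> rhs, pattern).

    `pattern` maps attributes to required constants; '_' is a wildcard.
    Only rows matching the pattern on the lhs attributes are checked.
    """
    groups = defaultdict(list)
    for i, r in enumerate(rows):
        if all(pattern.get(a, "_") in ("_", r[a]) for a in lhs):
            groups[tuple(r[a] for a in lhs)].append(i)
    violations = []
    for idxs in groups.values():
        for i in idxs:
            for j in idxs:
                if i < j and rows[i][rhs] != rows[j][rhs]:
                    violations.append((i, j))
    return violations

# CFD: for rows with CC = '44', ZIP determines STREET.
print(cfd_violations(rows, lhs=["CC", "ZIP"], rhs="STREET",
                     pattern={"CC": "44"}))  # -> [(0, 1)]
```

Rows 0 and 1 agree on CC and ZIP under the pattern but disagree on STREET, so they are flagged; row 2 falls outside the pattern and is ignored. A repair would resolve such conflicts with minimal changes to D.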
We also developed a statistical method that guarantees the accuracy of the computed repairs above a predefined rate without incurring excessive human involvement.
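One way such a guarantee can work is to have a human verify a random sample of repaired tuples and stop as soon as a lower confidence bound on accuracy clears the preset rate. The sketch below uses a normal-approximation bound; the sampling scheme, function name, and threshold are illustrative assumptions, not the paper's exact procedure.

```python
import math

def accuracy_lower_bound(correct, sampled, z=1.96):
    """Approximate 95% lower confidence bound on repair accuracy,
    given `sampled` human-verified repairs of which `correct` were right.
    (Normal approximation; illustrative, not the paper's exact method.)"""
    p = correct / sampled
    margin = z * math.sqrt(p * (1 - p) / sampled)
    return p - margin

# Stop sampling once the bound clears a preset accuracy rate, e.g. 95%.
print(accuracy_lower_bound(980, 1000) >= 0.95)  # -> True
```

With 980 of 1000 sampled repairs verified correct, the bound is roughly 0.971, so a 95% accuracy target is met and no further human checking is needed.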