Preliminary analysis on data quality for ML applications
Citation Link: https://doi.org/10.15480/882.4693
Publication Type
Conference Paper
Date Issued
2022-09
Language
English
Editors
TORE-DOI
First published in
Number in series
33
Start Page
207
End Page
236
Citation
Hamburg International Conference of Logistics (HICL) 33: 207-236 (2022)
Contribution to Conference
Publisher
epubli
Peer Reviewed
true
Purpose: This publication investigates how preliminary data quality analyses can be used to estimate, as early as the data understanding phase of an implementation, the effort required and the results to be expected when using data sets for ML solutions. Knowing the necessary data cleaning effort and the achievable result quality allows the potential of a solution to be assessed early in the process.
Methodology: Through a literature review, the characteristics of time series and methods of data cleaning are analysed. Based on the results, a test environment is implemented in Python that allows individual methods to be evaluated on sample data sets from the process industry and compared using different error analyses.
Findings: The publication gives a detailed overview of data cleaning procedures and provides a first indication of a connection between the final achievable forecast quality and the degree of error in the original data set. It also yields insights into how the choice of preprocessing method influences the achievable quality of the AI-based forecast.
Originality: The publication establishes the link between data characteristics in time series and preprocessing methods, so that conclusions can be drawn in advance about the quality improvement to be expected from selected data cleaning methods and decision support can be provided for selecting a method.
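The kind of comparison described under Methodology can be illustrated with a minimal Python sketch. It is not the authors' test environment: the toy series, the two cleaning methods (forward fill and linear interpolation) and the error measures (MAE and RMSE) are assumptions chosen for illustration, and the sketch scores how well each method reconstructs a known series rather than the quality of a downstream forecast.

```python
import math


def forward_fill(series):
    """Replace each missing value (None) with the last observed value.

    Assumes the series starts with an observed value.
    """
    cleaned, last = [], None
    for value in series:
        if value is not None:
            last = value
        cleaned.append(last)
    return cleaned


def linear_interpolate(series):
    """Fill gaps linearly between the nearest observed neighbours.

    Assumes the series starts and ends with observed values.
    """
    cleaned = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            cleaned[i] = series[left] + step * (i - left)
    return cleaned


def mae(truth, estimate):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - e) for t, e in zip(truth, estimate)) / len(truth)


def rmse(truth, estimate):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth))


# Hypothetical toy data, not taken from the paper: a clean reference series
# and a corrupted copy in which some readings are missing.
truth = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
corrupted = [10.0, None, None, 16.0, None, 20.0]

methods = {"forward_fill": forward_fill, "linear_interpolate": linear_interpolate}

# Apply every cleaning method to the same corrupted input and score the result
# against the clean reference with each error measure.
results = {}
for name, method in methods.items():
    cleaned = method(corrupted)
    results[name] = {"mae": mae(truth, cleaned), "rmse": rmse(truth, cleaned)}

for name, scores in results.items():
    print(name, scores)
```

Because the toy series is exactly linear, interpolation reconstructs it without error while forward fill does not; on real process data the ranking of methods would depend on the characteristics of the series.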
Subjects
Artificial Intelligence
Blockchain
DDC Class
004: Computer Science
330: Economics
380: Commerce, Communications, Transport
Publication version
publishedVersion
Name
Kiebler et al. (2022) - Preliminary Analysis on Data Quality for ML Applications.pdf
Size
872.31 KB
Format
Adobe PDF