
A framework for detecting unnecessary industrial data in ETL processes


Authors

Jess, T 
Harrison, M 
Shah, A 

Abstract

Extract, transform and load (ETL) is a critical process used by industrial organisations to move data from one database to another, such as from an operational system to a data warehouse. With the increasing amount of data stored by industrial organisations, some ETL processes can take in excess of 12 hours to complete; this can leave decision makers stranded while they wait for the data needed to support their decisions. After an ETL process has been designed, data requirements inevitably change, and much of the data that passes through the process may never be used or needed. This paper therefore proposes a framework for dynamically detecting and predicting unnecessary data and preventing it from slowing down ETL processes, either by removing it entirely or by deprioritising it. Other advantages of the framework include the ability to prioritise data cleansing tasks and to determine which data should be processed first and placed into fast access memory. We show existing example algorithms that can be used for each component of the framework, and present some initial testing results as part of our research to determine whether the framework can help to reduce ETL time.

Keywords

Extract, transform and load (ETL), Data warehouse, reduce ETL, unnecessary data, data overload, detecting unnecessary data

Journal Title

Proceedings - 2014 12th IEEE International Conference on Industrial Informatics, INDIN 2014

Conference Name

2014 12th IEEE International Conference on Industrial Informatics (INDIN)

Journal ISSN

1935-4576

Publisher

IEEE