Automated Execution of Data Pipelines Based on Configuration Files

We are pleased to announce the release of a new paper written by Károly Bósa and Paul Heinzlreiter, our partners from RISC. The study was carried out within the Platform-ZERO project and refers to the project's solar cell quantum efficiency measurements as an application case.

Background

Data preparation is a fundamental aspect of data engineering, a prerequisite for later tasks such as data visualization, reporting, and training machine learning models. Despite the recurring patterns in data transformation processes, the specific steps often vary depending on the project context, data sources, and application domain.

Methods

To address these challenges, this paper presents a flexible and extensible framework that enables the coordinated execution of modular data processing steps defined in a configuration file. By adopting a declarative, configuration-driven approach, the framework promotes modular, step-by-step development while substantially improving code reuse, maintainability, and adaptability. The framework also supports basic iterative execution constructs, such as loops and limited recursion, within the data pipeline definitions to accommodate more complex workflows.
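To make the configuration-driven idea more concrete, the following is a minimal sketch of how such a framework could be organized: reusable processing steps are registered under names, and a declarative configuration lists which steps to run, in what order, and with which parameters, including a simple repeat count as a loop construct. The schema (pipeline, step, params, repeat), the step names, and the measurements.csv file are illustrative assumptions and are not taken from the paper.

```python
# Illustrative sketch only: step names, configuration schema, and the
# run_pipeline driver are assumptions, not the paper's actual API.
from typing import Any, Callable, Dict

# Registry mapping step names used in configurations to reusable functions.
STEP_REGISTRY: Dict[str, Callable[[Any, Dict[str, Any]], Any]] = {}

def register(name: str) -> Callable:
    """Decorator that makes a processing step available to configurations."""
    def wrapper(func: Callable) -> Callable:
        STEP_REGISTRY[name] = func
        return func
    return wrapper

@register("load_csv")
def load_csv(_, params):
    """Read a CSV file into a list of row dictionaries."""
    import csv
    with open(params["path"], newline="") as fh:
        return list(csv.DictReader(fh))

@register("drop_missing")
def drop_missing(rows, params):
    """Remove rows with an empty value in the given column."""
    col = params["column"]
    return [r for r in rows if r.get(col) not in ("", None)]

@register("scale")
def scale(rows, params):
    """Multiply a numeric column by a constant factor."""
    col, factor = params["column"], params["factor"]
    for r in rows:
        r[col] = float(r[col]) * factor
    return rows

def run_pipeline(config: Dict[str, Any]) -> Any:
    """Execute the configured steps in order.

    A step may carry a "repeat" count, illustrating a basic loop construct
    inside the pipeline definition.
    """
    data = None
    for step in config["pipeline"]:
        func = STEP_REGISTRY[step["step"]]
        for _ in range(step.get("repeat", 1)):
            data = func(data, step.get("params", {}))
    return data

# A declarative pipeline definition, e.g. parsed from a YAML or JSON file.
config = {
    "pipeline": [
        {"step": "load_csv", "params": {"path": "measurements.csv"}},
        {"step": "drop_missing", "params": {"column": "quantum_efficiency"}},
        {"step": "scale", "params": {"column": "quantum_efficiency", "factor": 100.0}},
    ]
}

if __name__ == "__main__":
    print(run_pipeline(config))
```

In a setup like this, adding a new project mostly means writing a new configuration file and, occasionally, registering a new step, which is where the reuse and maintainability benefits described above come from.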

Results

By enabling the reuse of existing code snippets, the framework shifts development effort toward enhancing and refining a shared code base, rather than repeatedly creating project-specific, disposable implementations. The long-term benefits of this approach become increasingly apparent as the system evolves: as more generalized modules and functions are developed, they reduce duplication and improve maintainability without sacrificing flexibility.

Conclusions

To assess the effectiveness of the framework, we apply cyclomatic complexity as a metric, demonstrating how the proposed approach affects the development effort across several relatively simple, real-world data engineering scenarios.
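As a reminder of what this metric captures: cyclomatic complexity counts the linearly independent paths through a piece of code, defined as V(G) = E − N + 2P over the control-flow graph, which for a single function reduces to the number of decision points plus one. The short sketch below is our own illustration of that count, not code or measurements from the paper.

```python
# Illustrative only: a hand-counted cyclomatic complexity example,
# not code or results taken from the paper.

def clean_value(raw, lower, upper):
    """Parse a raw string and clip the result to [lower, upper]."""
    if raw is None or raw == "":   # decision 1 (if) + decision 2 (or)
        return None
    value = float(raw)
    if value < lower:              # decision 3
        return lower
    if value > upper:              # decision 4
        return upper
    return value

# With the common convention that boolean operators also count as decisions,
# this function has 4 decision points, so its cyclomatic complexity is
# 4 + 1 = 5. Tools such as radon or lizard automate this count for Python
# code bases.
```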

Read the article!