The Case for a Reusable Data Pipeline Framework
- Author: Swathi Maheshkumar
If your IT team ingests multiple data sources, most likely you’ve built a data pipeline or two. These pipelines routinely and automatically transfer your data from its source to the data lake where it is stored or the data warehouse where it can be accessed for data analytics. Yet, every time you add a new data source, you may have to build a new data pipeline from scratch.
At CoStrategix, we developed a reusable data pipeline framework using Azure services that significantly reduces the time and effort it takes to ingest new data sources. Built on a best-practices approach, this framework helps us with both our own internal projects and our client projects. Our framework:
- Transfers data from various sources as quickly as possible so it can be available for data-driven projects
- Transforms data into a usable format as efficiently as possible
- Ensures the quality of data before it reaches the data warehouse
The Value Proposition for a Reusable Data Pipeline Framework
We combined data collection, data transformation, and data quality monitoring into a single, reproducible framework. Our generic data pipeline framework incorporates reusable components, so we minimize the configuration changes required to ingest new data sources. Using this standard framework also enables us to orchestrate the ingestion of data from multiple sources through a single pipeline.
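To make the idea of "reusable components plus minimal configuration" concrete, here is a minimal sketch in plain Python. It is not the Azure-based framework itself. The `SourceConfig` fields, the function names, and the two sample sources are all hypothetical and only illustrate how per-source differences can be pushed into configuration while a single pipeline does the work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


@dataclass(frozen=True)
class SourceConfig:
    """Everything that differs between sources lives here (hypothetical fields)."""
    name: str
    reader: Callable[[], Iterable[Row]]  # how to fetch rows from the source
    column_map: Dict[str, str]           # source column -> warehouse column
    required: List[str]                  # warehouse columns that must be non-empty


def run_pipeline(config: SourceConfig) -> List[Row]:
    """One generic pipeline: read, rename columns, drop rows missing required fields."""
    output = []
    for raw in config.reader():
        row = {target: raw.get(source, "") for source, target in config.column_map.items()}
        if all(row.get(col) for col in config.required):
            output.append(row)
    return output


def run_all(configs: Iterable[SourceConfig]) -> Dict[str, List[Row]]:
    """Orchestrate every configured source through the same pipeline."""
    return {cfg.name: run_pipeline(cfg) for cfg in configs}


# Two in-memory stand-ins for real sources (illustrative data only).
crm = SourceConfig(
    name="crm",
    reader=lambda: [
        {"CustID": "1", "Mail": "a@example.com"},
        {"CustID": "", "Mail": "b@example.com"},  # dropped: customer_id is empty
    ],
    column_map={"CustID": "customer_id", "Mail": "email"},
    required=["customer_id"],
)
orders = SourceConfig(
    name="orders",
    reader=lambda: [{"id": "10", "amt": "25.00"}],
    column_map={"id": "order_id", "amt": "amount"},
    required=["order_id", "amount"],
)

results = run_all([crm, orders])
```

In this sketch, adding a third source means writing one more `SourceConfig`; `run_pipeline` and `run_all` stay unchanged.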
Other benefits include:
- Reducing dependence on manual pipeline scripts
- Shortening the data cycle time while decreasing manipulation time
How much time and manual effort are we able to save by standardizing our process, you might ask? We have built many data pipelines on behalf of our clients, so our level of expertise in this type of work is high.
Yet still, our generic data pipeline framework:
- Reduced the time to ingest a new data source – 16x faster
- Even when copying the sources, reduced the time to create a new pipeline – 4x faster
Preventing Bad Data Ingestion
The biggest challenge that companies face when building a data pipeline is ingesting bad data. Fixing incorrect/invalid data after it has been ingested takes a lot of resources, and it’s painful to backtrack to the starting point to find the source of the incorrect data.
Having the right architecture in place with the proper checks and validation inside of the framework can provide gold data using fewer resources – while minimizing or avoiding corrections.
We employ a 3-zone architecture in our data warehouses in order to help manage data quality. In Zone 1, we dump “dirty data,” or everyday files with unmodified source data in the data tables. In Zone 2, we transform the data (append/upsert/replace) based on business requirements. By the time data reaches Zone 3, it should have undergone all data transformations and validations to become gold data – data ready for analysis by the business teams.
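The flow between the three zones can be sketched in a few lines of Python. This is an illustration under assumptions, not our warehouse code: tables are modeled as lists of dictionaries, and the function names, the `id` key, and the sample rows are hypothetical. It shows the three load modes named above (append, upsert, replace) and a validation step that gates promotion to Zone 3.

```python
from typing import Dict, List

Row = Dict[str, object]


def load_zone2(existing: List[Row], incoming: List[Row], mode: str, key: str = "id") -> List[Row]:
    """Apply incoming Zone 1 rows to a Zone 2 table using the requested load mode."""
    if mode == "replace":
        return list(incoming)
    if mode == "append":
        return existing + incoming
    if mode == "upsert":
        merged = {row[key]: row for row in existing}
        for row in incoming:
            merged[row[key]] = row  # update a matching key, insert a new one
        return list(merged.values())
    raise ValueError(f"unknown load mode: {mode}")


def promote_to_zone3(zone2: List[Row], required: List[str]) -> List[Row]:
    """Only rows with every required column populated become gold data."""
    return [row for row in zone2 if all(row.get(col) is not None for col in required)]


# Zone 1: raw rows exactly as received (illustrative data only).
zone1 = [{"id": 2, "price": 12.0}, {"id": 3, "price": None}]
zone2_before = [{"id": 1, "price": 9.5}, {"id": 2, "price": 10.0}]

zone2 = load_zone2(zone2_before, zone1, mode="upsert")
zone3 = promote_to_zone3(zone2, required=["id", "price"])
```

Here the upsert updates row 2, inserts row 3, and the row with a missing price stays in Zone 2 instead of reaching Zone 3.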
The data pipeline framework we developed:
- Employs checks on the application side (i.e. Azure Data Factory) to avoid ingestion of invalid/incorrect data or file formats
- Employs validation of data in each zone before moving data into the following zone of the data warehouse
- Captures logs for both successful runs and failures caused by invalid/incorrect data, with enough detail to find the root cause of an issue and correct it so that future pipeline runs do not fail
- Parameterizes the pipeline so that it can be rerun to ingest any previous day’s file
- Adds metadata as the data traverses the pipeline so users can see when it was received, validated, and made available to business users
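The checks listed above can be illustrated with a small Python sketch. It is only a stand-in for what the framework does in Azure Data Factory and the warehouse. The expected columns, the file naming convention, the metadata field names, and the sample rows are all assumptions made for the example.

```python
import logging
from datetime import date, datetime, timezone
from typing import Dict, List, Optional, Tuple

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

Row = Dict[str, object]
EXPECTED_COLUMNS = {"order_id", "amount"}  # hypothetical schema


def source_file_for(run_date: date) -> str:
    """Parameterizing by date lets a previous day's file be re-ingested."""
    return f"orders_{run_date:%Y%m%d}.csv"  # hypothetical naming convention


def validate(row: Row) -> Optional[str]:
    """Return a reason string for an invalid row, or None if it is valid."""
    missing = EXPECTED_COLUMNS - set(row)
    if missing:
        return f"missing columns: {sorted(missing)}"
    amount = row["amount"]
    if not isinstance(amount, (int, float)) or amount < 0:
        return f"invalid amount: {amount!r}"
    return None


def ingest(rows: List[Row], run_date: date) -> Tuple[List[Row], List[Row]]:
    """Validate rows for one run date, log the outcome, and tag accepted rows."""
    file_name = source_file_for(run_date)
    received_at = datetime.now(timezone.utc).isoformat()
    accepted, rejected = [], []
    for row in rows:
        reason = validate(row)
        if reason:
            # Log enough detail to trace the failure back to its source file.
            log.warning("rejected row from %s: %s", file_name, reason)
            rejected.append({**row, "_reject_reason": reason})
            continue
        accepted.append({
            **row,
            "_source_file": file_name,
            "_received_at": received_at,
            "_validated_at": datetime.now(timezone.utc).isoformat(),
        })
    log.info("%s: %d accepted, %d rejected", file_name, len(accepted), len(rejected))
    return accepted, rejected


# Rerun for an arbitrary earlier day by passing its date (illustrative data only).
accepted, rejected = ingest(
    [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": -5}, {"order_id": 3}],
    run_date=date(2024, 5, 21),
)
```

Keeping rejected rows alongside their reasons, rather than silently dropping them, is what makes root-cause analysis possible after a failed run.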
CoStrategix is a strategic consulting and engineering company that helps organizations realize value from digital and data. If you’re looking to harness data for decision-making, we can help you modernize your data platform infrastructure, drive insights, and advance data literacy throughout your organization.