Enabling Self-Serve Data Platform with Apache Beam & Cookiecutter

In this article, I talk about how we enabled Domain teams at Achievers to bootstrap Apache Beam pipelines by leveraging Cookiecutter from PyPI.

Archit Shah
Achievers Engineering

--

Introduction

At Achievers, we are building a self-serve data platform that enables our Domain teams to autonomously create, share, and use data assets. Every backend service on our platform generates data that is actively being used to inform our business decisions, identify usage patterns, measure performance, and train Machine Learning models. We are enabling our Domain teams to manage these data assets themselves, thus, decentralizing the ownership of data within our organization.

One of the key data assets we want to decentralize is data processing pipelines. Traditionally, a team of well-trained Data Engineers designs, builds, tests, deploys, and centrally maintains data processing pipelines. However, on a self-serve data platform like ours, we want the Domain teams to take on this responsibility and manage these pipelines as part of their data assets.

For instance, the Recognition service on our Achievers Experience Platform (AEXP) generates data for every single employee recognition that is posted on our platform. The Recognition Domain team is responsible for capturing the associated raw data from the application layer, transforming it as required with business logic, and loading it into a central location. From there, all external teams can access this data and use it for their own use cases.

Image of a sample recognition on AEXP

As you can imagine, there is an obvious challenge with this strategy. We are expecting our Domain teams to undertake niche tasks like building data pipelines when they do not necessarily have exposure to the underlying data processing technology stack. In this blog, we will talk about how the Data Architecture team at Achievers abstracted the creation of data processing pipelines using Apache Beam and Cookiecutter, enabling our Domain teams to leverage these abstractions and build their data pipelines with minimal prior knowledge of the underlying technology stack.

Our Toolbox — Apache Beam and PyPI Cookiecutter

Apache Beam is our tool of choice for defining data processing pipelines at Achievers. It is an open-source, unified programming model for building data processing pipelines that are portable across different execution environments like Apache Flink, Apache Spark, and Google Cloud Dataflow. It allows developers to write code that can be easily reused and scaled for different data processing needs, such as ETL (extract, transform, load), stream processing, and batch processing. Beam SDKs offer the flexibility of defining those pipelines in popular programming languages like Java, Python, Go, Scala, and SQL.
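To give a flavour of the programming model, here is a minimal, generic batch pipeline in Python. It is not generated by our tooling and the file names are placeholders; it simply reads a text file, drops empty lines, and writes the result back out:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The same pipeline code can run on Dataflow, Flink, or Spark by changing
# the runner passed in the pipeline options.
with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | 'Read lines' >> beam.io.ReadFromText('input.txt')
        | 'Drop empty lines' >> beam.Filter(lambda line: line.strip())
        | 'Write output' >> beam.io.WriteToText('output')
    )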

We were looking for a solution that could centralize the framework for defining Beam pipelines, encapsulate best practices, and still allow developers on different teams to define their own business logic in the pipeline. Cookiecutter is a popular project templating tool on PyPI that fits the bill. We already use Cookiecutter at Achievers to generate the starter code for our backend services. Cookiecutter is a command-line utility that can be easily installed with `pip` (`pip install cookiecutter`).

Apache Beam x PyPI Cookiecutter

Apache Beam Cookiecutter — Overview

Apache Beam Cookiecutter is our in-house templating tool for auto-generating Apache Beam pipelines. To generate an Apache Beam pipeline on the fly, we first need to break it down into different modules as shown below:

1. Pipeline Options — Configuration, Params, Batch/Streaming settings

2. Input — Sources from where we frequently import data

3. Transformation — Most frequently used transformations in our pipelines

4. Output — Sinks where we frequently export data

Architecture Diagram for Apache Beam Cookiecutter
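To make the first module concrete, here is a minimal sketch of how a generated pipeline might parse its options. The gcs_* argument names mirror the sample input code shown later in this article, while the bq_* arguments are purely illustrative:

import argparse
from apache_beam.options.pipeline_options import PipelineOptions

def parse_arguments(argv=None):
    """Illustrative option parsing for a generated batch pipeline."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--gcs_input_file_type', default='csv',
                        help='Input file format: csv or parquet.')
    parser.add_argument('--gcs_input_file_pattern_list', nargs='+',
                        help='One or more GCS file patterns to read.')
    parser.add_argument('--bq_output_table',
                        help='Output table, e.g. project:dataset.table.')
    parser.add_argument('--bq_output_schema',
                        help='Output schema, e.g. employee_id:STRING,points:INTEGER.')
    # Anything Beam itself understands (runner, project, region, ...) is
    # passed through as standard pipeline options.
    known_args, pipeline_args = parser.parse_known_args(argv)
    return known_args, PipelineOptions(pipeline_args)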

We abstract each module with different options. For example, we have created custom connectors to interact with external systems like Google Cloud Storage, Google BigQuery, and MySQL/Postgres. When a developer invokes Cookiecutter, it asks them for their choice of input, transform, and output. Based on those choices, Cookiecutter generates the Beam pipeline code by plugging in the corresponding connectors.
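For instance, if a developer picks Google BigQuery as the output, the generated output module could look roughly like the sketch below. In practice Cookiecutter plugs in our in-house connector; Beam's built-in WriteToBigQuery is used here only to convey the idea, and the bq_* option names are illustrative:

import apache_beam as beam

def get_output(pcollection, known_args):
    """Writes transformed records (dicts) to a BigQuery table."""
    return pcollection | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
        table=known_args.bq_output_table,      # e.g. 'project:dataset.table'
        schema=known_args.bq_output_schema,    # e.g. 'employee_id:STRING,points:INTEGER'
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )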

The core logic for these connectors resides in a centrally managed in-house Python package and is merely invoked from the Cookiecutter-generated code. By doing this, we ensure that the underlying code for common components, such as connectors, is centrally managed, which makes it easy to upgrade them in the future.

Tool Demo

Developers can invoke Apache Beam Cookiecutter by simply running the following command in a Bash terminal:

cookiecutter <git-clone-link>
Terminal screen when a developer invokes the Cookiecutter utility. Cookiecutter asks developers for their choice of input, transform, and output

Once the developer makes their choices in Cookiecutter, it generates the corresponding Beam modules as shown below. Once generated, developers are free to update the code in each of these modules according to their needs. Most developers will update the code inside the transform module to insert their business logic for data processing. We also generate a driver file that binds the entire Beam pipeline together and connects the four modules. As a bonus, we also auto-generate a custom README.md file for every data pipeline that shows developers how to test their pipeline with sample commands.

Folder structure for Beam pipeline, auto-generated from Cookiecutter
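To make the wiring concrete, the generated driver file follows a pattern roughly like the sketch below; the module and function names here are placeholders rather than the exact ones our template produces:

import apache_beam as beam

# Placeholder module names for the generated options, input, transform,
# and output files.
from pipeline_options import parse_arguments
from gcs_input import get_input
from transforms import apply_transforms
from bq_output import get_output

def run(argv=None):
    known_args, pipeline_options = parse_arguments(argv)
    with beam.Pipeline(options=pipeline_options) as pipeline:
        records = get_input(pipeline, known_args)
        transformed = apply_transforms(records, known_args)
        get_output(transformed, known_args)

if __name__ == '__main__':
    run()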

Here is sample Beam input code in Python for reading data from Google Cloud Storage, auto-generated by Cookiecutter when a developer selects “Google Cloud Storage” as the input. It supports both CSV and Parquet files.

import apache_beam as beam

# ReadCsvFiles is imported from our in-house connector package.
def get_input(pipeline: beam.pipeline.Pipeline, known_args):
    if known_args.gcs_input_file_type == 'csv':
        return pipeline | 'Read CSV files' >> ReadCsvFiles(
            file_patterns=known_args.gcs_input_file_pattern_list
        )
    elif known_args.gcs_input_file_type == 'parquet':
        # Read each pattern under a unique label, then merge the results
        # into a single PCollection.
        parquet_reads = tuple(
            pipeline | f'Read Parquet files {i}' >> beam.io.ReadFromParquet(pattern)
            for i, pattern in enumerate(known_args.gcs_input_file_pattern_list)
        )
        return parquet_reads | 'Flatten Parquet inputs' >> beam.Flatten()

ReadCsvFiles is a custom connector for reading CSV files from Google Cloud Storage, and we retrieve it from our in-house package. This means that if we ever have to update the logic for this connector, we don’t need to touch the existing data pipelines; Domain teams can simply bump the package version in their requirements.txt file. As is evident, we intentionally generate very little logic from Cookiecutter, and that is by design. We keep the generated code simple and flexible so developers can update it according to their needs.
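For reference, a reusable connector along these lines can be written as a composite PTransform. The sketch below only illustrates the pattern with Beam's built-in text source and the csv module; it is not our actual in-house implementation:

import csv
import apache_beam as beam

class ReadCsvFiles(beam.PTransform):
    """Sketch of a reusable connector that reads CSV rows from GCS."""

    def __init__(self, file_patterns):
        super().__init__()
        self._file_patterns = file_patterns

    def expand(self, pbegin):
        per_pattern = tuple(
            pbegin | f'Read pattern {i}' >> beam.io.ReadFromText(
                pattern, skip_header_lines=1)
            for i, pattern in enumerate(self._file_patterns)
        )
        return (
            per_pattern
            | 'Flatten CSV files' >> beam.Flatten()
            | 'Parse CSV rows' >> beam.Map(lambda line: next(csv.reader([line])))
        )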

Future Considerations

Adding More Connectors

We will add more connector options for developers to choose from. As our business grows, our Domain teams use more external systems to import and export data. We have also started using non-relational databases like MongoDB and Google Cloud Bigtable. We want our teams to be able to tap into these systems using our Cookiecutter and better manage their data assets in the future.

Adding Streaming Support

As of now, our Apache Beam Cookiecutter only supports batch pipelines. We have use cases for streaming near-real-time data from sources like Google Cloud Pub/Sub. We will add support for generating such streaming Beam pipelines from our Cookiecutter in the future.
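For context, a streaming input swaps the file-based source for an unbounded one. A minimal sketch using Beam's built-in Pub/Sub source (with a placeholder subscription path) looks like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming requires a runner with streaming support, such as Dataflow.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    messages = (
        pipeline
        | 'Read from Pub/Sub' >> beam.io.ReadFromPubSub(
            subscription='projects/<project>/subscriptions/<subscription>')
        | 'Decode' >> beam.Map(lambda message: message.decode('utf-8'))
    )
    # Downstream transforms and sinks would follow here.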

Continuing Support and Guidance for Domain Teams

As more Domain teams start using our Apache Beam Cookiecutter, we will encounter more edge cases that need to be handled in our pipelines. It is important to maintain a communication hotline (a dedicated Teams channel, in our case) where developers can raise their concerns and have them addressed promptly. It is equally important to document the Cookiecutter and its usage, and to make sure that the documentation is always up to date.

Conclusion

Apache Beam Cookiecutter has been adopted by Domain teams at Achievers to generate their Beam data processing pipelines for certain use cases. Cookiecutter allows developers on Domain teams to bootstrap Beam data processing pipelines with minimal prior knowledge of Apache Beam. Domain teams no longer rely entirely on a central team to create and manage custom data pipelines for them. This allows our Data Architecture team at Achievers to focus more on solving framework-level problems, rather than solving domain-level problems that require domain-specific subject matter expertise.

If you’d also like to work on such cool projects, check out our Achievers Careers page. We’re hiring!
