Joins in Google Cloud Dataflow using reusable Composite Transforms

Pavan Kumar Kattamuri
2 min readJul 27, 2019

--

Often in your Dataflow pipelines, you might want to implement a SQL-like join on two different PCollections. These PCollections can be the initial sources of your data or can be from somewhere in between the pipeline. There are no apache beam built-in transforms that can perform a SQL-like join. The closest available transform is CoGroupByKey which performs a relational join on two or more PCollections that have the same key type. The problem with CoGroupByKey is that the resulting PCollection contains the records in nested format which is not the case with a normal SQL join. So, a typical implementation of join in Dataflow will compose of multiple steps like,

1. Pre-processing the records to desired <K,V> value pairs, where key is a tuple of value(s) of column(s) you want to join and value being the entire record

2. CoGroupByKey transform

3. Transformation step to unnest the records depending on the type of join you are trying to achieve.

Hmm, looks like a use case for Composite Transform. A Composite transform comprises of multiple simpler PTransforms which makes code modular and reusable, enabling the developers to avoid writing the same logic multiple times.

This blog focusses on using a composite transform to perform different types of joins (left, right, inner, outer) on a single/multiple column(s). The source code can be found under this github. This has been inspired and extended from this original blog

Code

The Join class is a composite transform implementing the above discussed transformations. It takes both the PCollections &their names, column names to join on and the type of join as arguments.

Now you can import this Join PTransform into your Dataflow code and use it whenever required, just like an apache beam core transform. The test.py is a sample Dataflow code illustrating how to import this & perform a left join. Perform a different type of join like outer join just by passing the argument ‘outer’ instead of ‘left’ at line number 29. Yes, it’s as simple as that.

tat

And this is how the Dataflow DAG looks like on the GCP console when you expand the ‘left join’ composite transform.

Summary

In this blog, I have demonstrated how to achieve multiple join functionalities using the same composite transform. The bottom line is to use composite transforms to make code modular and reusable across your Dataflow pipelines.

--

--