Input source reading patterns in Google Cloud Dataflow

Pavan Kumar Kattamuri
Jul 21, 2019


A Dataflow pipeline encapsulates your entire data processing task, from start to finish: reading input data from a source, transforming that data, and writing the output to a sink. Google Cloud Dataflow supports multiple sources and sinks such as Cloud Storage, BigQuery, Bigtable, PubSub, and Datastore. The most common way of reading data into Dataflow is from a Cloud Storage source, and even then there are several different use cases you can come across.

Here I have listed a few of the most common source reading patterns, with an emphasis on Cloud Storage and PubSub. The focus is not on using different GCP services as sources/sinks, nor on doing complex transformations, but purely on how the input is read.

Using a wildcard character

Read multiple files from Cloud Storage whose filenames follow a similar pattern. A Cloud Storage URI supports the wildcard character * to match files that share a common prefix, and the other supported wildcard characters can be used in the same way.
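For example, a minimal sketch with the Beam Python SDK, assuming a hypothetical bucket and file prefix:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical bucket and prefix; the * wildcard matches every file
# whose name starts with sales_2019_.
INPUT_PATTERN = "gs://my-bucket/input/sales_2019_*.csv"

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "ReadMatchingFiles" >> beam.io.ReadFromText(INPUT_PATTERN)
        | "PrintLines" >> beam.Map(print)
    )
```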

List GCS & Create

For use cases where a wildcard is not enough because the filenames don't share a common prefix, you can fetch the list of files you want to process, create a PCollection of file names using beam.Create, and then process them however you like. Instead of providing the list yourself, you can also use the bucket.list_blobs method from the Cloud Storage Python client library to fetch the list of files, as in the sketch below.
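A minimal sketch of this pattern, assuming a hypothetical bucket, prefix, and CSV suffix filter; beam.io.ReadAllFromText reads the contents of every file in the PCollection:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage

# Build the list of files outside the pipeline using the Cloud Storage client.
# Bucket name and prefix are hypothetical.
client = storage.Client()
file_list = [
    "gs://my-bucket/" + blob.name
    for blob in client.bucket("my-bucket").list_blobs(prefix="input/")
    if blob.name.endswith(".csv")
]

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "CreateFileList" >> beam.Create(file_list)
        | "ReadEachFile" >> beam.io.ReadAllFromText()  # read the contents of each listed file
        | "PrintLines" >> beam.Map(print)
    )
```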

Multiple PubSub sources

It is well known that Dataflow can read from a PubSub topic in streaming pipelines. You can also read from multiple PubSub topics and then merge the resulting PCollections into a single PCollection using beam.Flatten.
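A minimal sketch, assuming two hypothetical topics in a hypothetical project:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    # Read from two PubSub topics; the topic paths are hypothetical.
    topic_a = p | "ReadTopicA" >> beam.io.ReadFromPubSub(
        topic="projects/my-project/topics/topic-a")
    topic_b = p | "ReadTopicB" >> beam.io.ReadFromPubSub(
        topic="projects/my-project/topics/topic-b")

    # Merge the two PCollections into one and process them together.
    merged = (topic_a, topic_b) | "MergeTopics" >> beam.Flatten()
    merged | "PrintMessages" >> beam.Map(print)
```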

Read from Cloud Storage and a PubSub topic

A Dataflow pipeline can also take a Cloud Storage URI and a PubSub topic as two sources. This is useful when you want a single pipeline that processes both the historical data already stored in GCS and the streaming data arriving on a PubSub topic.
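A minimal sketch combining both sources, with a hypothetical bucket and topic; the bounded GCS read and the unbounded PubSub read are merged into one PCollection with beam.Flatten:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    # Historical data already sitting in GCS (hypothetical path).
    historical = p | "ReadFromGCS" >> beam.io.ReadFromText(
        "gs://my-bucket/historical/*.csv")

    # Live data arriving on a PubSub topic (hypothetical topic path).
    live = (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/events")
        | "DecodeBytes" >> beam.Map(lambda msg: msg.decode("utf-8"))
    )

    # Merge both sources and apply the same downstream transforms.
    merged = (historical, live) | "MergeSources" >> beam.Flatten()
    merged | "Print" >> beam.Map(print)
```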

Streaming processing of GCS files

You can also build a streaming Dataflow pipeline that processes a file as soon as it lands in the storage bucket, which is useful when data files arrive in the bucket frequently. To achieve this, first enable PubSub notifications for your Cloud Storage bucket. Then, at the start of the Dataflow pipeline, read the notification messages from PubSub, extract the information about the file that landed (bucket name, object name, etc.), and process it however you like.
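A minimal sketch, assuming the bucket already publishes JSON-format notifications to a hypothetical topic (e.g. created with gsutil notification create -t gcs-files -f json gs://my-bucket) and that the pipeline reads them through a subscription:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def to_gcs_uri(message):
    # With the JSON notification format, the message payload carries the
    # object metadata, including the bucket and object names.
    payload = json.loads(message.decode("utf-8"))
    return "gs://{}/{}".format(payload["bucket"], payload["name"])


options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadNotifications" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/gcs-files-sub")
        | "ExtractFileURI" >> beam.Map(to_gcs_uri)
        | "ReadNewFiles" >> beam.io.ReadAllFromText()  # read each newly landed file
        | "Print" >> beam.Map(print)
    )
```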

Summary

And that's it for this first article of mine. Suggestions are encouraged. I'll meet you again down the pipeline.
