Input source reading patterns in Google Cloud Dataflow
A Dataflow pipeline encapsulates your entire data processing task, from start to finish: reading input data from a source, transforming that data, and writing the output to a sink. Google Cloud Dataflow supports multiple sources and sinks, such as Cloud Storage, BigQuery, Bigtable, Pub/Sub and Datastore. The most common way of reading data into Dataflow is from a Cloud Storage source, and even then there are several different use cases you can come across.
Here I have listed a few of the most common source reading patterns, with an emphasis on Cloud Storage and Pub/Sub. This article does not focus on using the various GCP services as sources and sinks, nor on doing complex transformations; I repeat, it is only about source reading patterns.
Using wildcard character
Read multiple files from Cloud Storage whose names follow a similar pattern. A Cloud Storage URI supports the wildcard character * to match files that share the same prefix, and the other supported wildcard characters work the same way, as in the sketch below.
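A minimal sketch, assuming a hypothetical bucket and a sales_*.csv naming scheme (both placeholders); ReadFromText expands the wildcard before reading:

```python
import apache_beam as beam
from apache_beam.io import ReadFromText

# Bucket and file prefix below are placeholders, not real resources.
with beam.Pipeline() as pipeline:
    (
        pipeline
        # The * wildcard matches every object whose name starts with "sales_".
        | "ReadMatchingFiles" >> ReadFromText("gs://my-bucket/sales_*.csv")
        | "Print" >> beam.Map(print)
    )
```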
List GCS & Create
For use cases where a wildcard is not enough because the files don't share a common prefix, you can build the list of files you want to process yourself, turn it into a PCollection using beam.Create, and then read and process each file. Instead of hard-coding the list, you can also use the bucket.list_blobs method from the Cloud Storage Python client library to fetch the list of files, as in the sketch below.
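A minimal sketch, assuming a hypothetical my-bucket and a filter on .csv objects; the listing happens at pipeline-construction time with the Cloud Storage client, and ReadAllFromText then reads each listed file:

```python
import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText
from google.cloud import storage

# Bucket name and the .csv filter are placeholders for illustration.
client = storage.Client()
bucket = client.bucket("my-bucket")
files = [
    f"gs://my-bucket/{blob.name}"
    for blob in bucket.list_blobs()       # list every object in the bucket
    if blob.name.endswith(".csv")         # keep only the files we care about
]

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "FileList" >> beam.Create(files)     # PCollection of file URIs
        | "ReadEachFile" >> ReadAllFromText()  # read every listed file
        | "Print" >> beam.Map(print)
    )
```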
Multiple PubSub sources
It is well known that Dataflow can read from a Pub/Sub topic in a streaming pipeline. You can also read from multiple Pub/Sub topics and then merge the resulting PCollections into a single PCollection using beam.Flatten, as in the sketch below.
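A minimal sketch, assuming two hypothetical topics, orders and refunds, in a project called my-project:

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Project and topic names are placeholders.
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as pipeline:
    orders = pipeline | "ReadOrders" >> ReadFromPubSub(
        topic="projects/my-project/topics/orders")
    refunds = pipeline | "ReadRefunds" >> ReadFromPubSub(
        topic="projects/my-project/topics/refunds")

    # Flatten merges the two PCollections into one for downstream transforms.
    merged = (orders, refunds) | "Merge" >> beam.Flatten()
    merged | "Print" >> beam.Map(print)
```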
Read from Cloud Storage and a PubSub topic
A Dataflow pipeline can also have a Cloud Storage URI and a Pub/Sub topic as two sources. This is useful when you want a single pipeline that processes both the historical data already stored in GCS and the streaming data arriving on a Pub/Sub topic, as in the sketch below.
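A minimal sketch, assuming a hypothetical history/ folder in GCS and an events topic; the bounded and unbounded reads are merged with beam.Flatten so the same downstream logic applies to both:

```python
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Bucket path, project and topic names are placeholders.
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as pipeline:
    # Bounded source: historical files already sitting in GCS.
    historical = pipeline | "ReadHistory" >> ReadFromText(
        "gs://my-bucket/history/*.csv")

    # Unbounded source: new events arriving on Pub/Sub, decoded to text.
    live = (
        pipeline
        | "ReadLive" >> ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
    )

    # Both collections flow into the same downstream logic.
    (historical, live) | "Merge" >> beam.Flatten() | "Print" >> beam.Map(print)
```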
Streaming processing of GCS files
You can also have a streaming Dataflow pipeline that processes a file as soon as it lands in the storage bucket, which is handy when data files arrive frequently. To achieve this, first enable Pub/Sub notifications for your Cloud Storage bucket. Then, at the start of the Dataflow pipeline, read the notification messages from Pub/Sub, extract the details of the file that landed (bucket name, object name, etc.) and process it, as in the sketch below.
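A minimal sketch, assuming the bucket's notifications are published to a hypothetical gcs-notifications subscription; the bucket and object names come from the attributes that Cloud Storage attaches to each notification message:

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.io.textio import ReadAllFromText
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

# Project and subscription names are placeholders; the bucket is assumed to
# have Pub/Sub notifications enabled and routed to this subscription.
options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

def to_gcs_uri(message):
    # GCS notifications carry the bucket and object names as message attributes.
    attrs = message.attributes
    return f"gs://{attrs['bucketId']}/{attrs['objectId']}"

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadNotifications" >> ReadFromPubSub(
            subscription="projects/my-project/subscriptions/gcs-notifications",
            with_attributes=True)
        # Keep only "new object" events; ignore deletes and metadata updates.
        | "FilterNewObjects" >> beam.Filter(
            lambda msg: msg.attributes.get("eventType") == "OBJECT_FINALIZE")
        | "ToUri" >> beam.Map(to_gcs_uri)
        | "ReadNewFile" >> ReadAllFromText()
        | "Print" >> beam.Map(print)
    )
```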
Summary
And that’s it for my first article. Suggestions are encouraged. I’ll meet you again down the pipeline.