Introduction

Google BigQuery provides storage & compute for all sizes of data.
For storing input data it stores it in columnar format which is called “capacitor” and stored in the “colossus” file system.
We can execute a data transformation query inside BigQuery and BigQuery will execute our query within seconds. But for some use cases, it makes sense to perform data transformation outside BigQuery using BigData tools such as Spark or Apache Beam which provides compute to the data stored inside BigQuery.

In this blog, we will see how can we perform data transformation outside BigQuery using Apache Beam SDK and Dataflow as an execution engine.

Use Case

Input Table

Data Transformation Query

What are our options?

Option 1: We can perform ETL i.e Extract From BigQuery, Transform Inside Dataflow, and Load the result again in the BigQuery destination Table.
Option 2: We can just execute data transformation Query inside BigQuery through dataflow and get the result and Load the result inside BigQuery Table.

Let’s see both options in action.

Option 1:

Extract
Let’s load the entire input BigQuery table in the dataflow.
Below code extract the Wikipedia dataset from BigQuery and Maps the input tabular record to Java POJO Wikipedia Object

Transform

Since we have input data loaded into Dataflow from BigQuery, We can perform required transformations on the input data using Apache Beam transformations.
In the below code, we are using inputRecords that we read from BigQuery and apply a transformation that converts PCollection<inputRecords> to KV of Key and corresponding WikiPedia object. We want to group our result by two column hence we create our Key as KV pair, this KV pair will store “wiki” and “title” column and form key for our result.
Once we created KV pair of key and record, now we can perform count operation by key. this will provide us KV of key and count for that key.
Now since we performed our transformation, we can map the record to TableRow so that we can append our results to the BigQuery destination table.

Load

In the transform step, we transformed input records and converted them into TableRow, now we can write load logic to append our result to BigQuery.

We need to provide a schema of our destination table.

Option 2:

In option 2 we perform extract and transform inside BigQuery since BigQuery provides compute and storage both. and after transformed data is finally appended to the destination table.

Extract and Transform

We are executing data extraction and transformation query using apache beam. Apache beam allows to executes SQL query on BigQuery datasets.
Our SQL query perform group by using wiki and title column and perform count operation.
Don’t forget that this query executes inside BigQuery and Dataflow doesn’t perform any transformations.

Load

Since our data is transformed, now the only job is left to append to the destination BigQuery table.

Confirm Record Count

Let’s confirm the count for the records that were appended by dataflow in the BigQuery destination table. Let’s first run the count query.

Now run the transformation using BigQuery and confirm the count on the result.

we can confirm that the record count for ETL with dataflow with transformation done in dataflow is the same as executing transformation query on BigQuery.

Conclusion

Using Apache beam SDK we can perform ETL either using Dataflow worker machine or Inside BigQuery. Basically, we can either do the transformation on the dataflow worker machine(Option 1) or executing in BigQuery itself(Option 2).
If we execute transform in BigQuery then dataflow’s job is to just run the execution query on BigQuery which transforms the input data, gets the result, and writes it back to the destination table(source).
but before deciding which option is good you should always do a performance due to diligence and what makes sense for the use case ( executing the transform in BigQuery or Outside BigQuery in data processing engine such as dataflow, Spark, etc. )

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

ETL with Dataflow & BigQuery

Introduction

Use Case

Input Table

Data Transformation Query

What are our options?

Option 1:

Option 2:

Confirm Record Count

Conclusion

Leave a Reply Cancel reply

Recent Posts

Introduction

Use Case

Input Table

Data Transformation Query

What are our options?

Option 1:

Option 2:

Confirm Record Count

Conclusion

Please Share This Share this content

You Might Also Like

Comprehensive List of Java Exercises For Practice ( Updated Frequently )

Switch Statement & Expression In Java (JEP 361)

How to Ingest API Data using Serverless in Google BigQuery : Part 1 (Introduction)

Leave a Reply Cancel reply

Share this content