One of the main reason to build ETL pipeline was to do data transformation on data before loading into the data warehouse. The only reason we were doing that because data warehouses were not capable to handle these data transformations due to several reasons such as performance and flexibility.
What is UDF?
From Google Cloud Documentations:
You can define a UDFs as either persistent or temporary. You can reuse persistent UDFs across multiple queries, while temporary UDFs only exist in the scope of a single query
The below function is a simple example of creating a temporary UDF.
CREATE TEMP FUNCTION multiplyInputs(x FLOAT64, y FLOAT64)RETURNS FLOAT64LANGUAGE js AS r""" return x*y;""";WITH numbers AS (SELECT 1 AS x, 5 as y UNION ALL SELECT 2 AS x, 10 as y UNION ALL SELECT 3 as x, 15 as y)SELECT x, y, multiplyInputs(x, y) as productFROM numbers;
Package Existing NPM library
- We will package credit card generator npm library using webpack.
- Create a card-number-generator folder and install the creditcard-generator library. Create webpack config file to create webpack file.
- We need to edit webpack.config.js and mention about entry point for the library and output path.
- Run below command will generate webpack file at card-number-generator/dist/main.js
- After packaging, we will upload the packaged file to the google cloud storage bucket.
Using Packaged NPM library In BigQuery UDF
- We will use one of the methods called GenCC(“Amex”)
- We will also create another UDF just to generate a random customer id.
- Now we can use the above two udfs to generate our table with customer_id and credit card number. Additionally, in order to generate N numbers of rows, we can use any public or authorized table with LIMIT.
- In the below query, we are using bigquery public dataset to generate 100 records.
- Let’s summarize all the information and our entire code looks like below
if you execute the above query it will generate 100 records as shown below.