Using SQL to generate bulk data for Postgres table
Introduction
- Generating sample data for database testing is one of the common steps.
- In the previous articles, we covered how to generate sample data in bulk using plain Java and using regex.
- In this article, we will learn how we can use SQL and Postgres functions to generate N number of sample records in seconds.
Schema
- Our schema is an Account table that looks like below which contains different columns such as username, password, email, etc.
Generating Single Account Record
Generating Random Usernames
- Generating a username is the concatenation of a few names and numbers which is getting selected using the random() function.
-- usernames
with usernames as (
select ((select (array['john', 'jane', 'jacky'])[floor(random() * 3 + 1)]) || (floor(1000+random()*1000)::text)) as username
),
Generating Random Passwords
- Returning md5 of random text as password.
-- password
passwords as (
SELECT md5(random()::text) as password
),
Generating Random Email Domain
- Picking email domains randomly from the list.
email_domain as (
select (array['@gmail.com', '@yahoo.com', '@outlook.com'])[floor(random() * 3 + 1)] as domain
),
Joining to Return Single Record
- Now that we have randomness to choose different values for different columns we can write SQL to return them as a single record each time we execute the SQL.
account_record as (
select username, password, (username || domain) as email
from usernames
join passwords
on 1=1
join email_domain
on 1=1
)
- Now selecting each field from the account record.
select username, password, email, now(), now(), 1 from account_record;
Convert to Postgres Function
- We know how to generate a single unique account record, we can convert that logic to the postgres function as below.
CREATE OR REPLACE FUNCTION sample_account_record()
RETURNS TABLE(username text, password text, email text, created_at timestamp, last_login timestamp, permissions_id int)
AS
$$
-- usernames
with usernames as (
select ((select (array['john', 'jane', 'jacky'])[floor(random() * 3 + 1)]) || (floor(1000+random()*1000)::text)) as username
),
-- password
passwords as (
SELECT md5(random()::text) as password
),
-- email
email_domain as (
select (array['@gmail.com', '@yahoo.com', '@outlook.com'])[floor(random() * 3 + 1)] as domain
),
account_record as (
select username, password, (username || domain) as email
from usernames
join passwords
on 1=1
join email_domain
on 1=1
)
select username, password, email, now(), now(), 1 from account_record;
$$
LANGUAGE sql;
- Our function sample_account_record got created in Postgres.
- Now that we have everything natively as postgres function, we can just query select to that function and we will get our output.
select * from sample_account_record()
Generating Record In Bulk
- So far we are only generating a single record, but then we can use the generate_series() function from Postgres to generate N number of records.
select
sample_account_record() as record
from generate_series(1,5)
- The output contains different sample records for account schema but its format is not what we want. We are looking for each record as a separate column instead of a single CSV record inside the bracket.
- Let’s do some data processing to split that into multiple columns.
- The very first thing we can do is to replace the bracket with nothing so that we can csv record without the bracket.
select
REGEXP_REPLACE( cast(sample_account_record() as text), '[\(\)]', '', 'g') as record
from generate_series(1,50)
- The bracket has been removed.
- Once we have csv record we can split them and assign them to different columns that they belong to.
select
split_part(cast(record as text), ',', 1) as username,
split_part(cast(record as text), ',', 2) as password,
split_part(cast(record as text), ',', 3) as email,
split_part(cast(record as text), ',', 4) as created_at,
split_part(cast(record as text), ',', 5) as last_login,
split_part(cast(record as text), ',', 6) as permission
from(
select
REGEXP_REPLACE( cast(sample_account_record() as text), '[\(\)]', '', 'g') as record
from generate_series(1,5)
) x
- Our records now contain multiple columns with different records.
One thing to know is that we should be afraid to use this logic on STG and PROD instances since generate_series and other used Postgres functions might take a lot of computing time that might impact database. We can use it for developmement and testing purposes.
Conclusion
- In this article, we learned how to generate sample records using SQL in postgres.
- We also learned how to convert our SQL logic to the Postgres function and scale it to generate bulk inserts.
Before You Leave
- Let me know if I can be of any help to your career, I would love to chat or jump on a call.
- Upgrade your Java skills with Grokking the Java Interview
- If you want to upskill your Java skills, you should definitely check out
[NEW] Master Spring Boot 3 & Spring Framework 6 with Java
[ 38 hrs content, 4.7/5 stars, 6+ students already enrolled]