Data Ingestion In Plain Java

  • Post last modified:November 1, 2022
  • Reading time:4 mins read

Introduction

  • In this article, we will build simple data ingestion design that will be using simple Java SDK
  • We will Read data from remote sources and write to Database.

Use Case

  • Our use cases is to ingest News Articles from RSS Sources.
  • We will build RSS Client that will ingest data and then write to RDBMS.

Source

  • As i mentioned our source is RSS Feed. We will be ingesting news article from Google cloud tech blogs.
  • If you want to see detailed blog about building RSS clients please visit here 
  • In the scope of this blog we will use dependency called Rome, which will help us connecting to RSS feed.
  • Below is complete logic which ingests data from google cloud blog and map it to NewsArticle POJO

NewsArticle POJO

  • NewsArticle basically consists of properties that we are expecting to get from RSS news articles.

RSS Mapper

  • This class is responsible for mapping RSS feed data to NewsArticle POJO.

Database Setup

  • This article requires setup of MySQL or similar RDBMS as sink , it’s pre-req to setup this database
  • Once we have the running database, we can create a table called NEWS_ARTICLE under any schema present in the Database.
  • If you are lazy like me you can just copy below code to get started quickly
  • Just make sure you replace User with your schema of choice.
CREATE TABLE `User`.`NEWS_ARTICLES` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `title` VARCHAR(300) NOT NULL,
  `link` VARCHAR(300) NULL,
  `image_url` VARCHAR(300) NULL,
  `category` VARCHAR(45) NULL,
  `published_date` DATETIME NOT NULL,
  `create_timestamp` DATETIME NOT NULL,
  `update_timestamp` DATETIME NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE INDEX `id_UNIQUE` (`id` ASC));

Connection to Database

  • Once table is ready we can focus on Java logic to connect to database.
  • We will need JDBC driver to connect to database, make sure to add the dependency in POM.xml
  • Now once we have the driver we can start coding our client.
    We user DriverManager class to get the connection object. We need to pass JDBC URL string along with user and password.
  • Note: I have put config value in the code but in actual scenario we should never do it. I believe reading from config file at runtime or KMS/Vault is recommended approach.

Insert Data into Database

  • Once we are connected to the database, we will write data insertion logic.
  • First lets get the prepared statement object that we will use to set our data and execute against database.
  • One thing to make sure is that we should insert our data as batch since inserting each record would not be good for performance when data size is large.
  • We set autoCommit as false so that we can commit once all the records has been added to the batch and we can execute batch.
  • We iterate over each newsarticle and add to preparedstamtent batch.
  • and once we finish adding each record to the batch we can execute batch and perform commit operation.
  • Below is the logic to set the preparedstatement with newsarticle data

Client Logic

  • Our client is pretty simple. At first it calls the logic to get the NewsArticle from RSS Feed and then write them as batch insert to our database.

Result

  • Lets execute our client code to see the output.
  • Before Execution : Our database table is empty 

After Execution

Log

MySQL

  • As we can see our database has successfully inserted 20 news articles and our code execution was successful.

Source Code

  • As usual source code is available on github:

Conclusion

  • In this article we built data ingestion logic using plain java. 
  • We read from RSS Feed and Inserted the record into MySQL Database

Further Reading 

Leave a Reply