Loading Data In Bulk: Multi Insert vs Copy

Comparing bulk data loading with multi-values INSERT and COPY

Introduction

  • When it comes to loading bulk data into a database table, we have two options: a multi-values INSERT or the COPY command.
  • In this article, we will discuss these two options and compare their execution times.

Table

  • We have a book_reviews table that records reviews from different users for different books.
postgres=# \d book_reviews
                                   Table "public.book_reviews"
   Column    |  Type   | Collation | Nullable |                     Default
-------------+---------+-----------+----------+--------------------------------------------------
 reviews_id  | integer |           | not null | nextval('book_reviews_reviews_id_seq'::regclass)
 user_id     | text    |           |          |
 isbn        | text    |           |          |
 book_rating | text    |           |          |
Indexes:
    "book_reviews_pkey" PRIMARY KEY, btree (reviews_id)

Multi-Values Insert

  • Data is typically inserted with an INSERT command, which inserts one row at a time.
INSERT INTO products (product_no, name, price) VALUES (1, 'Cheese', 9.99);
INSERT INTO products (name, price, product_no) VALUES ('Cheese', 9.99, 1);
  • But this is inefficient for loading bulk data, since the server has to parse and execute a separate statement for each row.
  • With a multi-values insert, we only need to parse and execute the query once. For example,
INSERT INTO products (product_no, name, price) VALUES
    (1, 'Cheese', 9.99),
    (2, 'Bread', 1.99),
    (3, 'Milk', 2.99);
  • In the example below, we construct a multi-values insert command and execute it with a prepared statement.
public static void insertAll() throws SQLException {
   // Generate one million synthetic reviews to insert.
   List<BookReview> bookReviews = IntStream.rangeClosed(1, 1000000)
                .mapToObj(i -> getBookReview())
                .toList();

   String query = "insert into book_reviews (user_id, isbn, book_rating) values ";

   // Build the values list: ('user','isbn','rating'),('user','isbn','rating'),...
   // Concatenating values directly into the SQL string is only safe here
   // because they are generated UUIDs that contain no quotes.
   String values = bookReviews.stream()
         .map(a -> "('" + a.getUserId() + "','" + a.getIsbn() + "','" + a.getBookRating() + "')")
         .collect(Collectors.joining(","));

   Connection connection = getConnection();
   PreparedStatement preparedStatement = connection.prepareStatement(query + values);
   long start = System.currentTimeMillis();
   // A single statement inserts all rows: one parse, one execution, one commit.
   preparedStatement.executeUpdate();
   connection.commit();
   connection.close();
   long end = System.currentTimeMillis();
   System.out.println("execution took " + (end - start) + " milliseconds");
}

private static BookReview getBookReview() {
   BookReview bookReview = new BookReview();
   String seed = UUID.randomUUID().toString();
   bookReview.setIsbn("isbn"+seed);
   bookReview.setBookRating("2.3");
   bookReview.setUserId(seed);
   return bookReview;
}

private static Connection getConnection() throws SQLException {
   Connection connection = DriverManager.getConnection("jdbc:postgresql://localhost/postgres",
                $user, $pass);
   if(connection != null){
            System.out.println("Connected to database");
   }
   // Disable autocommit so all rows are committed together after executeUpdate().
   connection.setAutoCommit(false);
   return connection;
}
  • It takes 20.9 seconds in total.
  • Although a multi-values insert is much faster than row-by-row inserts, we can do better with the COPY command.
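
  • A side note: concatenating values into the SQL string works here only because the generated UUIDs contain no quotes. For untrusted input, the same multi-values technique can be driven through bind parameters instead. Below is a minimal sketch; insertAllParameterized is our own helper, and because PostgreSQL's wire protocol caps a statement at 65,535 bind parameters, a million rows would have to be inserted in chunks of at most 21,845 rows.
public static void insertAllParameterized(Connection connection, List<BookReview> chunk)
        throws SQLException {
   // One (?,?,?) group per row: (?,?,?),(?,?,?),...
   String placeholders = chunk.stream()
         .map(r -> "(?,?,?)")
         .collect(Collectors.joining(","));
   String query = "insert into book_reviews (user_id, isbn, book_rating) values " + placeholders;

   try (PreparedStatement preparedStatement = connection.prepareStatement(query)) {
      int index = 1;
      for (BookReview r : chunk) {
         preparedStatement.setString(index++, r.getUserId());
         preparedStatement.setString(index++, r.getIsbn());
         preparedStatement.setString(index++, r.getBookRating());
      }
      preparedStatement.executeUpdate();
      connection.commit();
   }
}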

Copy

  • We can use a single COPY command to load bulk data into the table, which is more performant than both single-row and multi-values INSERT.
  • COPY has built-in optimizations that make it fast, such as lower commit overhead, fewer network round trips, and fewer context switches.
  • In the example below, we load bulk data using the COPY command.
public static void copy3() throws SQLException, IOException {
        // Generate the same one million synthetic reviews.
        List<BookReview> bookReviews = IntStream.rangeClosed(1, 1000000)
                .mapToObj(i -> getBookReview())
                .toList();

        Connection connection = getConnection();

        // One CSV line per review: user_id,isbn,book_rating
        String values = bookReviews.stream()
                .map(LoadDataWithCopy::toCsv)
                .collect(Collectors.joining("\n"));

        // CopyManager (org.postgresql.copy.CopyManager) is the JDBC driver's
        // entry point to the COPY protocol.
        CopyManager copyManager = ((PGConnection) connection).getCopyAPI();
        InputStream is = new ByteArrayInputStream(values.getBytes(StandardCharsets.UTF_8));
        long start = System.currentTimeMillis();
        // copyIn streams the CSV to the server and returns the number of rows copied.
        long rows = copyManager.copyIn("COPY book_reviews(user_id, isbn, book_rating) FROM STDIN delimiter ',' csv encoding 'UTF-8'",
                    is);
        long end = System.currentTimeMillis();
        System.out.println("execution took " + (end - start) + " milliseconds");
        connection.close();
}

public static Connection getConnection() throws SQLException {
        Connection connection = DriverManager.getConnection("jdbc:postgresql://localhost/postgres",
                $user, $pass);
        if(connection != null){
            System.out.println("Connected to database");
        }
        // Autocommit stays on here, so the COPY commits as soon as it completes.
        return connection;
}

private static BookReview getBookReview() {
        BookReview bookReview = new BookReview();
        String seed = UUID.randomUUID().toString();
        bookReview.setIsbn("isbn"+seed);
        bookReview.setBookRating("2.3");
        bookReview.setUserId(seed);
        return bookReview;
}

private static String toCsv(BookReview a) {
        // No space after the comma: COPY's delimiter is ',' and a leading
        // space would become part of the next field.
        return a.getUserId() + "," +
                a.getIsbn() + "," +
                a.getBookRating();
}
  • COPY takes 7.6 seconds in total for the same one million rows.
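
  • One more variation of our own, not part of the measurement above: building the whole payload as one String keeps all one million CSV lines in memory at once. The driver also exposes a lower-level CopyIn handle (org.postgresql.copy.CopyIn), so a sketch like the following writes the rows to the server incrementally, reusing getBookReview and toCsv from the listing above.
public static void copyStreaming(Connection connection, int rowCount)
        throws SQLException {
        CopyManager copyManager = ((PGConnection) connection).getCopyAPI();

        // copyIn(sql) returns a CopyIn handle we can write to piece by piece,
        // so the full CSV payload never has to exist in memory at once.
        CopyIn copyIn = copyManager.copyIn(
                "COPY book_reviews(user_id, isbn, book_rating) FROM STDIN delimiter ',' csv encoding 'UTF-8'");
        try {
            for (int i = 0; i < rowCount; i++) {
                byte[] line = (toCsv(getBookReview()) + "\n").getBytes(StandardCharsets.UTF_8);
                copyIn.writeToCopy(line, 0, line.length);
            }
            long rows = copyIn.endCopy(); // number of rows the server ingested
            System.out.println("copied " + rows + " rows");
        } catch (SQLException e) {
            copyIn.cancelCopy();
            throw e;
        }
}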

Conclusion

  • If we need to load bulk data into a table, we should consider using COPY over single-row or multi-values INSERT.
  • The COPY command has built-in optimizations that make it a good option for saving execution time.
