How to Read PDF Files in Java

  • Post last modified: April 15, 2023

Using the PDFBox Library to Read PDF Files in Java

Introduction

  • In this short article, we will use the Apache PDFBox library to read PDF files in Java.
  • The library is useful when we need to find text in PDF files. We will not cover how to read PDF files that contain images.

Use Case

  • Apple is a public company, so it releases its earnings every quarter.
  • The earnings information is shared with shareholders as a PDF file.
  • The layout of this file is consistent from quarter to quarter, so we can write automation code that reads the PDF and extracts the useful information.
  • Here is the target file.

Dependency

  • A few libraries can open PDF files and make it much easier to extract their content.
  • We will use one of them, PDFBox. Add it to the Maven project as follows:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.27</version>
</dependency>

Business Logic

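  • PDDocument loads the PDF file, and a PDFTextStripper instance extracts the text from it. The document holds resources, so we close it with try-with-resources.
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

private static String getContents(File file) throws IOException {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument doc = PDDocument.load(file)) {
        PDFTextStripper pdfTextStripper = new PDFTextStripper();
        // getText returns the text of the whole document as one String
        return pdfTextStripper.getText(doc);
    }
}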
  • The next step is to capture the target information from this text. Keep in mind that getText returns the content of the entire document as a single String.
  • To capture the target information, we use indexOf to find the positions of a start marker and an end marker, then use substring to extract the text between them. If either marker is missing, indexOf returns -1, so we check for that before calling substring.
  • Because the format is fairly consistent, the same code can be run each quarter on new earnings files.
private static String getNetSalesData(File file) throws IOException {
    String content = getContents(file);
    int start = content.indexOf("Net sales: ");
    // Look for the end marker only after the start marker
    int end = (start < 0) ? -1 : content.indexOf("Cost of sales: ", start);
    if (start < 0 || end < 0) {
        throw new IllegalStateException("Expected markers not found in " + file);
    }
    // Everything from "Net sales: " up to (not including) "Cost of sales: "
    return content.substring(start, end);
}
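The index-based extraction described above can be tried without a PDF at hand. The following is a minimal sketch, assuming the extracted text contains the same two marker strings; the SectionExtractor class, the between helper, and the sample text are hypothetical and exist only for illustration. Unlike getNetSalesData, this helper leaves the start marker out of the result and returns an empty Optional when a marker is missing.

```java
import java.util.Optional;

// Hypothetical helper, not part of PDFBox: extracts the text between two markers.
class SectionExtractor {

    // Returns the trimmed text between startMarker and endMarker (markers excluded),
    // or Optional.empty() if either marker is missing or they appear out of order.
    static Optional<String> between(String content, String startMarker, String endMarker) {
        int start = content.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        int from = start + startMarker.length();
        // Search for the end marker only after the start marker.
        int end = content.indexOf(endMarker, from);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(content.substring(from, end).trim());
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the output of getContents(file).
        String sample = "Net sales: Products 100 Services 20\nCost of sales: Products 60";
        System.out.println(between(sample, "Net sales: ", "Cost of sales: ").orElse("not found"));
    }
}
```

Returning Optional pushes the "marker not found" case onto the caller, which may be convenient if the layout of a future earnings file changes.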
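  • Another way to extract information is a regular expression. We compile it with Java's Pattern class and use the resulting Matcher to iterate over all the matches.
  • We can then process those matches further, depending on our business requirements. Note that every call to find() moves to the next match, so calling it once outside the loop would silently skip the first match.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static List<String> getByRegex(File file) throws IOException {
    String contents = getContents(file);
    // ".*sale" matches the text on a line up to and including its last "sale"
    Matcher matcher = Pattern.compile(".*sale").matcher(contents);
    List<String> matches = new ArrayList<>();
    // Each find() call advances to the next match, so call it only in the loop
    while (matcher.find()) {
        matches.add(matcher.group());
    }
    return matches;
}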
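To pull out individual figures rather than whole matched lines, a capturing group may help. This is a minimal sketch, assuming the text contains a label followed by whitespace and a number; the LabelledAmounts class, the amountsFor method, and the sample text are hypothetical, and the real earnings PDF may be laid out differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration: finds the number that follows a label.
class LabelledAmounts {

    // Collects every number (digits with optional commas) that directly
    // follows the given label, in the order the labels appear in the text.
    static List<String> amountsFor(String content, String label) {
        // Pattern.quote treats the label as literal text, not as a regex.
        Pattern pattern = Pattern.compile(Pattern.quote(label) + "\\s+([0-9][0-9,]*)");
        Matcher matcher = pattern.matcher(content);
        List<String> amounts = new ArrayList<>();
        while (matcher.find()) {
            // group(1) is the captured number without the label.
            amounts.add(matcher.group(1));
        }
        return amounts;
    }

    public static void main(String[] args) {
        // Made-up sample text standing in for the output of getContents(file).
        String sample = "Products 1,234 5,678\nServices 90\nProducts 42";
        System.out.println(amountsFor(sample, "Products"));
    }
}
```

Using \s+ between the label and the number tolerates runs of spaces, which can matter because text extracted from a PDF does not always have regular spacing.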

Conclusion

  • In this short article, we discussed how to use the PDFBox library to read PDF content as a String and parse the target content from it.
  • We did not cover how to handle images in a PDF; that will be a topic for next time.

