How to Read PDF Files in Java

Using PDFBox library to read PDF File in Java

Introduction

In this short article, we will use the Apache PDFBox library to read PDF files in Java.
This library is useful when we need to find text in PDF files. We will not cover PDF files whose content is stored as images.

Use Case

Apple is a public company, so it releases its earnings every quarter.
The earnings information is shared with shareholders as a PDF file.
Because the format is consistent from quarter to quarter, we can write automation code that reads this PDF file and extracts the useful information.
Here is the target file.

Dependency

Several libraries make it easier to work with PDF files and extract their content.
We will use one of them, PDFBox:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.27</version>
</dependency>

Business Logic

PDDocument loads the PDF file, and a PDFTextStripper instance extracts the text from it:

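import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

private static String getContents(File file) throws IOException {
    // try-with-resources closes the document even if text extraction fails
    try (PDDocument doc = PDDocument.load(file)) {
        PDFTextStripper pdfTextStripper = new PDFTextStripper();
        return pdfTextStripper.getText(doc);
    }
}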

The next step is to capture the target information from this text. Keep in mind that getText returns the content of the entire document as a single String.
To capture the target information, we use String.indexOf to find the start and end positions of the section we want, then extract it with substring.
Since the report format is fairly consistent, the same code can be run periodically against each new earnings release.

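private static String getNetSalesData(File file) throws IOException {
    String content = getContents(file);
    int start = content.indexOf("Net sales: ");
    // Look for the end marker only after the start marker
    int end = start < 0 ? -1 : content.indexOf("Cost of sales: ", start);
    if (start < 0 || end < 0) {
        throw new IllegalStateException("Expected markers not found in " + file);
    }
    return content.substring(start, end);
}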
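The marker-based approach can be tried out without a PDF at hand. The following is a minimal sketch, not part of the original code: the class name SectionExtractor, the helper between, and the sample text are all made up for illustration, and the sample string only stands in for what PDFTextStripper might return. It returns an empty Optional instead of failing when a marker is missing.

```java
import java.util.Optional;

class SectionExtractor {

    /**
     * Returns the text from startMarker (inclusive) up to endMarker (exclusive),
     * or an empty Optional when either marker cannot be found.
     */
    static Optional<String> between(String content, String startMarker, String endMarker) {
        int start = content.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        // Search for the end marker only after the start marker.
        int end = content.indexOf(endMarker, start + startMarker.length());
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(content.substring(start, end).trim());
    }

    public static void main(String[] args) {
        // Hypothetical sample text standing in for extracted PDF content.
        String sample = "Net sales: 100\nCost of sales: 60\nGross margin 40\n";
        System.out.println(between(sample, "Net sales: ", "Cost of sales: ").orElse("not found"));
        System.out.println(between(sample, "Operating income", "Net income").orElse("not found"));
    }
}
```

Returning an Optional lets the caller decide what to do when a quarterly report changes its layout and a marker disappears, rather than crashing inside substring.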

Another way to extract information is a regular expression. We compile the expression with the Pattern class and use the resulting Matcher to iterate over all the matches.
We can then process those matches further to extract more detailed information, depending on our business requirements.

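import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static List<String> getByRegex(File file) throws IOException {
    String contents = getContents(file);
    Matcher matcher = Pattern.compile(".*sale").matcher(contents);
    List<String> matches = new ArrayList<>();
    // Each find() call advances to the next match, so call it only in the loop
    while (matcher.find()) {
        matches.add(matcher.group());
    }
    return matches;
}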
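To illustrate how regex matches can be turned into more detailed data, here is a small sketch that uses capturing groups to pull a label and a number out of each line. It is an assumption-heavy example: the class name LineItemParser, the "label: amount" line shape, and the sample text are hypothetical, and a real earnings report will likely need a different pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineItemParser {

    // Assumed line shape: a label, a colon, then digits with optional thousands separators.
    // (?m) makes ^ and $ match at the start and end of every line.
    private static final Pattern LINE_ITEM =
            Pattern.compile("(?m)^(?<label>[A-Za-z ]+):[ \\t]*(?<amount>\\d[\\d,]*)[ \\t]*$");

    /** Maps each matching label to its amount, in the order the lines appear. */
    static Map<String, Long> parse(String content) {
        Map<String, Long> items = new LinkedHashMap<>();
        Matcher matcher = LINE_ITEM.matcher(content);
        while (matcher.find()) {
            String label = matcher.group("label").trim();
            // Drop thousands separators before converting to a number.
            long amount = Long.parseLong(matcher.group("amount").replace(",", ""));
            items.put(label, amount);
        }
        return items;
    }

    public static void main(String[] args) {
        // Hypothetical sample text standing in for extracted PDF content.
        String sample = "Net sales: 1,234\nCost of sales: 700\nSome heading\n";
        System.out.println(parse(sample));
    }
}
```

Lines that do not fit the assumed shape, such as headings without a colon, are simply skipped.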

Conclusion

In this short article, we discussed how to use the PDFBox library to read PDF content as a String and parse the target content from it.
We didn’t cover how to handle images in PDFs; that will be a topic for next time.

Before You Leave

Let me know if I can be of any help to your career. I would love to chat or jump on a call. You can connect with me on LinkedIn.

