Simple Web Crawling Logic In Java


Introduction

  • Crawling is a basic operation that almost every organization performs at some level for its data needs.
  • The crawling logic doesn’t have to be as sophisticated as Google’s or Yahoo’s; often, simple logic is all we need to crawl various sources.
  • These sources don’t necessarily have to be web pages; they can be any resource, such as a file directory.
  • In this blog, however, we will consider the case of web page crawling.

Source Web Pages

  • These are the source web pages we want to visit. Each web page contains a list of hrefs (links to other pages).
  • For example, https://abc.com is a source page, and the hrefs on that page are https://pqr.com and https://tuf.com.
  • To keep the example self-contained, the pages and their links are modeled as a map from a page URL to the list of URLs found on it, as shown below.
// map of source page -> list of links (hrefs) found on that page
static Map<String, List<String>> links = getTargetLinks();

/**
 * Returns the target links for each source page.
 *
 * @return map of source page URL to the list of URLs it links to
 */
private static Map<String, List<String>> getTargetLinks() {
    return Map.of(
            "https://abc.com", List.of("https://pqr.com", "https://tuf.com"),
            "https://pqr.com", List.of("https://mmm.com", "https://nnn.com"),
            "https://mmm.com", List.of("https://abc.com", "https://pqr.com"));
}

Crawling Source Web Pages

  • Crawling web pages is a recursive problem: once we have a source page and the list of links mentioned on it, we traverse each link and crawl it as well.
  • While doing that, we also keep track of visited links so that we don’t end up in a loop/cycle.
  • First, we check whether we have already visited the page; if yes, we simply return.
  • Otherwise, we visit/crawl the page and add it to the visited set.
  • If the page appears in the map of source web pages, we recursively crawl each link listed on that page.
// set of urls that have already been crawled
static Set<String> visited = new HashSet<>();

// crawl a url and, recursively, all the links found on it
private static void crawl(String url) {
    // if this link is already visited, return to avoid a cycle
    if (visited.contains(url)) {
        return;
    }

    // crawling this url
    System.out.println("crawling url: " + url);

    // marking this url visited since we have already crawled it
    visited.add(url);

    // if this url is in the map of source web pages,
    // then crawl all the urls listed on that page
    if (links.containsKey(url)) {
        for (String link : links.get(url)) { // make crawl call for each link
            crawl(link);
        }
    }
}

Client Code

  • Now we can iterate over all the source web pages and crawl each of them.
public static void main(String[] args) {
    // start crawling from every known source page
    for (Map.Entry<String, List<String>> link : links.entrySet()) {
        crawl(link.getKey());
    }
}

Entire Code
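
  • For convenience, here are the snippets above combined into a single runnable class; the class name SimpleCrawler is just a placeholder chosen for this sketch.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SimpleCrawler {

    // map of source page -> list of links (hrefs) found on that page
    static Map<String, List<String>> links = getTargetLinks();

    // set of urls that have already been crawled
    static Set<String> visited = new HashSet<>();

    private static Map<String, List<String>> getTargetLinks() {
        return Map.of(
                "https://abc.com", List.of("https://pqr.com", "https://tuf.com"),
                "https://pqr.com", List.of("https://mmm.com", "https://nnn.com"),
                "https://mmm.com", List.of("https://abc.com", "https://pqr.com"));
    }

    // crawl a url and, recursively, all the links found on it
    private static void crawl(String url) {
        if (visited.contains(url)) {
            return; // already visited, avoid a cycle
        }
        System.out.println("crawling url: " + url);
        visited.add(url);
        if (links.containsKey(url)) {
            for (String link : links.get(url)) {
                crawl(link);
            }
        }
    }

    public static void main(String[] args) {
        // start crawling from every known source page
        for (Map.Entry<String, List<String>> link : links.entrySet()) {
            crawl(link.getKey());
        }
    }
}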

Conclusion

  • In this article, our main goal was to give a high-level perspective on crawling logic using recursion.
  • We can also implement the same logic iteratively using a stack data structure, as sketched below.
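
  • A minimal sketch of that iterative variant, assuming the same static links map defined above; the method name crawlIteratively and the Deque-based stack are illustrative choices for this sketch.

// requires: import java.util.ArrayDeque; import java.util.Deque;
// iterative crawl using an explicit stack instead of recursion,
// reusing the same static 'links' map defined earlier
private static void crawlIteratively(String startUrl) {
    Set<String> visited = new HashSet<>();
    Deque<String> stack = new ArrayDeque<>();
    stack.push(startUrl);

    while (!stack.isEmpty()) {
        String url = stack.pop();
        if (!visited.add(url)) {
            continue; // already visited, avoid a cycle
        }
        System.out.println("crawling url: " + url);
        // push all the links listed on this page, if any
        for (String link : links.getOrDefault(url, List.of())) {
            stack.push(link);
        }
    }
}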
