Simple Web Crawling Logic In Java


Introduction

  • Crawling is a basic operation that almost every organization performs at some level for its data needs.
  • The crawling logic doesn’t have to be as sophisticated as Google’s or Yahoo’s; often, simple logic is all we need to crawl various sources.
  • These sources don’t necessarily have to be web pages; they can be any resource, such as a file directory.
  • In this blog, however, we will consider the case of web page crawling.

Source Web Pages

  • These are the source web pages we want to visit. Each web page contains a list of hrefs (links to other pages).
  • For example, https://abc.com is a source page, and the hrefs on that page are https://pqr.com and https://tuf.com.
  • To keep the example self-contained, the pages and their links are modeled as a map from a page URL to the list of URLs found on it, as shown below.
// map of source page -> list of links (hrefs) found on that page
static Map<String, List<String>> links = getTargetLinks();

/**
 * Returns the target links for each source page.
 *
 * @return map of source page URL to the list of URLs it links to
 */
private static Map<String, List<String>> getTargetLinks() {
    return Map.of(
            "https://abc.com", List.of("https://pqr.com", "https://tuf.com"),
            "https://pqr.com", List.of("https://mmm.com", "https://nnn.com"),
            "https://mmm.com", List.of("https://abc.com", "https://pqr.com"));
}

Crawling Source Web Pages

  • Crawling web pages is a recursive problem: once we have a source page and the list of links mentioned on it, we traverse each link and crawl it as well.
  • While doing that, we also keep track of visited links so that we don’t end up in a loop/cycle.
  • First, we check whether we have already visited the page; if yes, we simply return.
  • Otherwise, we visit/crawl the page and add it to the visited set.
  • If the page appears in the map of source web pages, we recursively crawl each link listed on that page.
// set of urls that have already been crawled
static Set<String> visited = new HashSet<>();

// crawl a url and, recursively, all the links found on it
private static void crawl(String url) {
    // if this link is already visited, return to avoid a cycle
    if (visited.contains(url)) {
        return;
    }

    // crawling this url
    System.out.println("crawling url: " + url);

    // marking this url visited since we have already crawled it
    visited.add(url);

    // if this url is in the map of source web pages,
    // then crawl all the urls listed on that page
    if (links.containsKey(url)) {
        for (String link : links.get(url)) { // make crawl call for each link
            crawl(link);
        }
    }
}

Client Code

  • Now we can iterate over all the source web pages and crawl each of them.
public static void main(String[] args) {
    // start crawling from every known source page
    for (Map.Entry<String, List<String>> link : links.entrySet()) {
        crawl(link.getKey());
    }
}

Entire Code
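
  • For convenience, here are the snippets above combined into a single runnable class; the class name SimpleCrawler is just a placeholder chosen for this sketch.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SimpleCrawler {

    // map of source page -> list of links (hrefs) found on that page
    static Map<String, List<String>> links = getTargetLinks();

    // set of urls that have already been crawled
    static Set<String> visited = new HashSet<>();

    private static Map<String, List<String>> getTargetLinks() {
        return Map.of(
                "https://abc.com", List.of("https://pqr.com", "https://tuf.com"),
                "https://pqr.com", List.of("https://mmm.com", "https://nnn.com"),
                "https://mmm.com", List.of("https://abc.com", "https://pqr.com"));
    }

    // crawl a url and, recursively, all the links found on it
    private static void crawl(String url) {
        if (visited.contains(url)) {
            return; // already visited, avoid a cycle
        }
        System.out.println("crawling url: " + url);
        visited.add(url);
        if (links.containsKey(url)) {
            for (String link : links.get(url)) {
                crawl(link);
            }
        }
    }

    public static void main(String[] args) {
        // start crawling from every known source page
        for (Map.Entry<String, List<String>> link : links.entrySet()) {
            crawl(link.getKey());
        }
    }
}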

Conclusion

  • In this article, our main goal was to give a high-level perspective on crawling logic using recursion.
  • We can also implement the same logic iteratively using a stack data structure, as sketched below.
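
  • A minimal sketch of that iterative variant, assuming the same static links map defined above; the method name crawlIteratively and the Deque-based stack are illustrative choices for this sketch.

// requires: import java.util.ArrayDeque; import java.util.Deque;
// iterative crawl using an explicit stack instead of recursion,
// reusing the same static 'links' map defined earlier
private static void crawlIteratively(String startUrl) {
    Set<String> visited = new HashSet<>();
    Deque<String> stack = new ArrayDeque<>();
    stack.push(startUrl);

    while (!stack.isEmpty()) {
        String url = stack.pop();
        if (!visited.add(url)) {
            continue; // already visited, avoid a cycle
        }
        System.out.println("crawling url: " + url);
        // push all the links listed on this page, if any
        for (String link : links.getOrDefault(url, List.of())) {
            stack.push(link);
        }
    }
}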
