Simple Web Crawling Logic In Java

  • Post last modified: December 15, 2022


  • Crawling is a basic operation that almost every organization performs at some level for its data needs.
  • The crawling logic doesn’t have to be as sophisticated as that of Google or Yahoo; sometimes simple logic is all we need to crawl various sources.
  • These sources need not be web pages; they can be any resource, such as a file directory.
  • In this blog, however, we will consider the case of webpage crawling.
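
Before crawling, we need a way to pull the hrefs out of a fetched page. A minimal sketch, using only the JDK and a deliberately naive regex (a production crawler would use a real HTML parser; the sample HTML and page names here are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // deliberately naive href extraction via regex;
    // a production crawler would use a proper HTML parser
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // made-up sample HTML standing in for a fetched page
        String html = "<a href=\"pageB\">B</a> <a href=\"pageC\">C</a>";
        System.out.println(extractHrefs(html)); // prints [pageB, pageC]
    }
}
```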

Source Web Pages

  • These are the source web pages we want to visit. Each web page has a list of hrefs mentioned on it.
  • For example, given a source page, the hrefs mentioned on that page might look like this:
 static Map<String, List<String>> links = getTargetLinks();

    /**
     * get target links: each source page mapped to the links listed on it
     * (the page names here are placeholders; substitute real URLs)
     * @return map of source page to its links
     */
    private static Map<String, List<String>> getTargetLinks(){
        return Map.of("pageA", List.of("pageB", "pageC")
                , "pageB", List.of("pageD", "pageE")
                , "pageC", List.of("pageA", "pageF"));
    }

Crawling Source Web Pages

  • Crawling web pages is a recursive problem. Once we have a source page and the list of links mentioned on it, we traverse each of those links and crawl them in turn.
  • While doing that, we also keep track of visited links so that we don’t end up in a loop/cycle.
  • First, we check whether we have already visited the page; if yes, we simply return.
  • Otherwise, we visit/crawl the page and add it to the visited set.
  • If the page is one of the source web pages, we recursively crawl each of the links listed on it.
 static Set<String> visited = new HashSet<>();

    // crawl links recursively
    private static void crawl(String url){
        // if this link is already visited then just return, to avoid a cycle
        if (visited.contains(url)) {
            return;
        }

        // crawling this url
        System.out.println("crawling url: " + url);

        // marking this url visited since we have already crawled it
        visited.add(url);

        // if this link is in the target list
        // then crawl all the urls listed on the page
        if (links.containsKey(url)) {
            for (String link : links.get(url)) { // make crawl call for each link
                crawl(link);
            }
        }
    }
Client Code

  • Now we can iterate over all the source web pages and crawl each of them.
 public static void main(String[] args) {
     for (Map.Entry<String, List<String>> link : links.entrySet()) {
         crawl(link.getKey());
     }
 }

The entire code is here


  • In this article, our main goal was to give a higher-level perspective on crawling logic using recursion.
  • We can also implement it iteratively using a stack data structure.
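
A minimal sketch of that iterative alternative, assuming a `links` map of the same shape as above (the page names are placeholders, not real URLs), with an explicit `Deque` serving as the stack:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class IterativeCrawler {
    // placeholder page names standing in for real source pages and their links
    static Map<String, List<String>> links = Map.of(
            "pageA", List.of("pageB", "pageC"),
            "pageB", List.of("pageC", "pageD"),
            "pageC", List.of("pageA")); // cycle back to pageA

    static Set<String> crawl(String startUrl) {
        Set<String> visited = new LinkedHashSet<>(); // keeps crawl order
        Deque<String> stack = new ArrayDeque<>();
        stack.push(startUrl);
        while (!stack.isEmpty()) {
            String url = stack.pop();
            if (!visited.add(url)) {
                continue; // already crawled; skip to avoid cycles
            }
            System.out.println("crawling url: " + url);
            // push the links listed on this page, if it is a source page
            for (String link : links.getOrDefault(url, List.of())) {
                stack.push(link);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        crawl("pageA");
    }
}
```

The explicit stack replaces the call stack of the recursive version, so very deep link chains cannot overflow it.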

