Crawling is a basic operation that many organizations perform at some level for their data needs.
Not every crawler has to be as sophisticated as Google’s or Yahoo’s; sometimes simple logic is all we need to crawl a set of sources.
These sources need not be web pages; they can be any resource, such as a file directory.
In this blog, however, we will consider the case of web page crawling.
Source Web Pages
These are the source web pages we want to visit. Each web page has a list of hrefs (links) that appear on it.
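One way to model these source pages is a map from each page URL to the list of links found on it; the crawl code in this post refers to such a map as `links`. The sketch below is only an illustration under stated assumptions: the class name `SourcePages`, the method `buildLinks`, and the example.com URLs are hypothetical placeholders, not the original data.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SourcePages {
    // Hypothetical sample data: page URL -> hrefs found on that page.
    static Map<String, List<String>> buildLinks() {
        Map<String, List<String>> links = new HashMap<>();
        links.put("https://example.com/a",
                List.of("https://example.com/b", "https://example.com/c"));
        // This page links back to /a, which creates a cycle.
        links.put("https://example.com/b", List.of("https://example.com/a"));
        // /d is linked to but is not itself a source page.
        links.put("https://example.com/c", List.of("https://example.com/d"));
        return links;
    }

    public static void main(String[] args) {
        // Print each source page with the links listed on it.
        buildLinks().forEach((page, hrefs) -> System.out.println(page + " -> " + hrefs));
    }
}
```

The link from /b back to /a is deliberate: it is the kind of cycle that the visited check has to guard against.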
Crawling web pages is a recursive problem. Once we have a source page and the list of links on that page, we traverse each link and crawl it.
While doing that, we also keep track of visited links so that we don’t end up in a loop/cycle.
First, we check whether we have already visited this page; if so, we simply return from the method.
Otherwise, we visit/crawl this page and add it to the visited set.
If the page is one of the source web pages, we recursively call crawl on each of the links listed on that page.
import java.util.HashSet;
import java.util.Set;

static Set<String> visited = new HashSet<>();

// crawl links recursively
private static void crawl(String url) {
    // if this link is already visited, return early to avoid a cycle
    if (visited.contains(url)) {
        return;
    }
    // crawling this url
    System.out.println("crawling url: " + url);
    // mark this url as visited now that we have crawled it
    visited.add(url);
    // if this url is one of the source pages (a key in the links map),
    // then crawl all the urls listed on that page
    if (links.containsKey(url)) {
        for (String link : links.get(url)) { // recurse into each listed link
            crawl(link);
        }
    }
}
Client Code
Now we can go through all the source web pages and crawl each of them.
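A minimal sketch of what such client code might look like is shown here, bundled with the crawl method so that it runs on its own. The class name `CrawlerDemo`, the helpers `loadSampleLinks` and `crawlAll`, and the example.com URLs are assumptions made for illustration; nothing is actually fetched over the network, and the page contents are simulated by the `links` map.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlerDemo {
    // Hypothetical source pages: page URL -> hrefs found on that page.
    static final Map<String, List<String>> links = new HashMap<>();
    static final Set<String> visited = new HashSet<>();

    // Fill the map with made-up sample data, including a cycle (a -> b -> a).
    static void loadSampleLinks() {
        links.put("https://example.com/a",
                List.of("https://example.com/b", "https://example.com/c"));
        links.put("https://example.com/b", List.of("https://example.com/a"));
        links.put("https://example.com/c", List.of("https://example.com/d"));
    }

    // Crawl links recursively, skipping anything already visited.
    private static void crawl(String url) {
        if (visited.contains(url)) {
            return;
        }
        System.out.println("crawling url: " + url);
        visited.add(url);
        if (links.containsKey(url)) {
            for (String link : links.get(url)) {
                crawl(link);
            }
        }
    }

    // Client code: start a crawl from every source page.
    static Set<String> crawlAll() {
        for (String source : links.keySet()) {
            crawl(source);
        }
        return visited;
    }

    public static void main(String[] args) {
        loadSampleLinks();
        Set<String> result = crawlAll();
        System.out.println("visited " + result.size() + " urls");
    }
}
```

Because `visited` is shared across all starting points, each URL is crawled once even though several source pages link to it. The order in which pages are printed depends on the iteration order of the `HashMap`, so it is not guaranteed.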