In the context of the World Wide Web, crawling refers to gathering web pages for further processing by following hyperlinks, starting from a small set of pages. For example, a Web search engine needs to gather as many pages as possible before it indexes them and makes them available for searching.
A program that performs crawling is variously known as a crawler, a spider, a robot, or simply a bot. The set of pages from which the crawler starts is known as the seed list.
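To make the idea concrete, here is a minimal sketch of the crawl loop in Python: start from the seed list, fetch a page, queue any links not seen before, and repeat. It is only an illustration under stated assumptions. The names `crawl`, `fetch_links`, `TOY_WEB` and `toy_fetch_links` are made up for this example, and the "web" is a small in-memory dictionary of made-up URLs rather than real HTTP fetching and HTML parsing.

```python
from collections import deque


def crawl(seed_list, fetch_links, max_pages=100):
    """Breadth-first crawl starting from seed_list.

    fetch_links is a hypothetical callable mapping a URL to the URLs it
    links to; a real crawler would download and parse the page here.
    """
    seeds = list(dict.fromkeys(seed_list))  # drop duplicate seeds, keep order
    frontier = deque(seeds)                 # queue of un-fetched pages
    seen = set(seeds)                       # every URL ever queued
    crawled = []                            # pages in the order they were fetched

    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:            # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return crawled


# Toy link graph standing in for the Web (assumed, made-up URLs).
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def toy_fetch_links(url):
    """Return the outgoing links of url in the toy graph."""
    return TOY_WEB.get(url, [])


if __name__ == "__main__":
    for page in crawl(["http://a.example/"], toy_fetch_links):
        print(page)
```

The `seen` set is what keeps the loop from going round in circles when pages link back to each other; the `max_pages` cap is a crude stand-in for the stopping rules a real crawler would need.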
Although it seems straightforward, writing a good web crawler is not. There are a good number of challenges, which vary subtly depending on whether it’s a large-scale web crawler or a crawler for a handful of websites. These challenges include: ensuring politeness to web servers (by observing the widely accepted robots exclusion protocol), URL normalization, duplicate detection, avoiding spider traps, maintaining a queue of un-fetched pages, maintaining a repository of crawled pages, re-crawling, and a few more. For large-scale crawlers, one of the most important challenges is increasing throughput by optimizing resource utilization, because throughput usually limits their coverage.
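As an illustration of the politeness point, the sketch below checks URLs against robots exclusion rules using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent, the `www.example.com` URLs and the helper name `make_policy` are all assumptions made up for this example; a real crawler would download each site's own robots.txt before fetching anything else from it.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_policy(robots_txt):
    """Build a parser from robots.txt text (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = make_policy(ROBOTS_TXT)

# Pages under /private/ are off limits; everything else may be fetched.
blocked = policy.can_fetch("ExampleBot", "http://www.example.com/private/page.html")
allowed = policy.can_fetch("ExampleBot", "http://www.example.com/index.html")

# Seconds the site asks crawlers to wait between requests (None if unset).
delay = policy.crawl_delay("ExampleBot")

if __name__ == "__main__":
    print(blocked, allowed, delay)
```

A crawler would typically consult `can_fetch` before adding a URL to its queue, and use the crawl delay (or a default of its own) to space out requests to the same host.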
5 Comments
Good to know you write crawlers. I was doing research in grid computing some time back when I dived into crawlers and fell in love with the whole concept and the math involved; it is a field to spend a lifetime in. Good post.
What is *URL Normalization*? Converting relative to absolute addresses or something like that?
@sanket,
Converting relative URLs to absolute ones is necessary, but I was referring to canonicalization of URLs: different URLs can refer to the same resource. For example, http://www.grok.in:80/ and http://www.grok.in/ refer to the same resource. Canonicalizing URLs ensures that these different URLs are treated as the same; in the previous example, both URLs would be normalized to one of the two forms.
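A rough sketch of this kind of canonicalization, using Python's standard `urllib.parse`, might look like the following. The function name `canonicalize` and the particular set of rules are assumptions chosen for illustration: lowercase the scheme and host, drop the default port, use "/" for an empty path, and discard the fragment. Real crawlers usually apply more rules than this, and this sketch also drops any user:password part and does not handle IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Return a normalized form of url (illustrative rules only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # hostname is case-insensitive
    port = parts.port

    # Keep the port only when it differs from the scheme's default.
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{port}"
    else:
        netloc = host

    path = parts.path or "/"                # "http://host" and "http://host/" match
    # The fragment is dropped: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


if __name__ == "__main__":
    print(canonicalize("http://www.grok.in:80/"))
    print(canonicalize("http://www.grok.in/"))
```

With these rules, both URLs from the example above come out as http://www.grok.in/, so a crawler comparing canonical forms would fetch the page only once.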
Check out this link about web crawling; it’s very interesting:
http://crawltheweb.blogspot.com/
The “read more…” link from *this* article goes to the full text of a *different* article