Introduction to Web Crawling

In the context of the World Wide Web, crawling refers to gathering web pages for further processing by following hyperlinks, starting from a small set of pages. For example, a web search engine needs to gather as many pages as possible before it can index them and make them available for searching.

A program that performs crawling is variously known as a crawler, a spider, a robot, or simply a bot. The set of pages from which the crawler starts is known as the seed list.
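
To make the idea concrete, here is a minimal sketch of a breadth-first crawl loop in Python. It is only an illustration, not how any particular crawler is built: the names `crawl`, `LinkExtractor`, `fetch` and `TOY_WEB` are hypothetical, and the "web" is a small in-memory dictionary so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_list, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_list.

    `fetch(url)` is an assumed callable that returns the page's HTML
    as a string, or None if the page cannot be retrieved.
    """
    frontier = deque(seed_list)  # queue of un-fetched URLs
    seen = set(seed_list)        # URLs that have already been queued
    repository = {}              # crawled pages, keyed by URL

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        repository[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return repository


# Toy in-memory "web" standing in for real HTTP requests (an assumption).
TOY_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "http://example.com/b": "no links here",
}

pages = crawl(["http://example.com/"], TOY_WEB.get)
```

In a real crawler, `fetch` would issue HTTP requests, and the loop would also need the politeness, normalization and trap-avoidance measures discussed in this post.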

Although it seems straightforward, writing a good web crawler is not. The challenges vary subtly depending on whether it is a large-scale web crawler or a crawler for a handful of websites. They include: ensuring politeness to web servers (by observing the widely accepted robots exclusion protocol), URL normalization, duplicate detection, avoiding spider traps, maintaining a queue of un-fetched pages, maintaining a repository of crawled pages, and re-crawling, among others. For large-scale crawlers, one of the most important challenges is increasing throughput by optimizing resource utilization, because throughput usually limits their coverage.
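
As an illustration of the politeness point, the sketch below checks URLs against robots exclusion rules using `urllib.robotparser` from the Python standard library. The robots.txt content, the `ToyCrawler` user agent and the helper `make_robots_checker` are made up for this example; a real crawler would download each site's own robots.txt before fetching from it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from the site itself.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_robots_checker(robots_txt, user_agent):
    """Return (allowed, delay) for one site's robots.txt.

    `allowed(url)` tells whether user_agent may fetch url.
    `delay` is the Crawl-delay value in seconds (a common but
    non-standard directive), or None if the file does not set one.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed, parser.crawl_delay(user_agent)


allowed, delay = make_robots_checker(ROBOTS_TXT, "ToyCrawler")
print(allowed("http://example.com/index.html"))      # True
print(allowed("http://example.com/private/x.html"))  # False
print(delay)                                         # 2
```

A crawler would typically call `allowed` before adding a URL to its queue and wait at least `delay` seconds between requests to the same host.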

Read on…


Comments

  1. Posted June 7, 2008 at 10:52 pm | Permalink

    Good to know you write crawlers. I was doing research in Grid Computing some time back when I dived into crawlers and fell in love with the whole concept and the math involved; it is a field to spend a lifetime in. Good post.

  2. sanket
    Posted June 9, 2008 at 9:25 am | Permalink

    What is *URL Normalization*? Converting relative to absolute addresses or something like that?

  3. Posted June 9, 2008 at 11:15 am | Permalink

    @sanket,

    Converting relative URLs to absolute ones is necessary, but I was referring to canonicalization of URLs: different URLs can refer to the same resource. For example, http://www.grok.in:80/ and http://www.grok.in/ refer to the same resource. Canonicalizing URLs ensures that such URLs are treated as the same; in this example, both URLs are normalized to one of the two forms.
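
A rough sketch of this kind of canonicalization in Python is shown below. The function `canonicalize` and the `DEFAULT_PORTS` table are illustrative names, and the rules covered (lowercasing the scheme and host, dropping the default port, using "/" for an empty path, removing the fragment) are only a subset of what a production crawler would apply. The sketch also ignores user info and IPv6 host literals.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Return a canonical form of an absolute http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it differs from the scheme's default.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"

    path = parts.path or "/"  # an empty path is equivalent to "/"
    # The fragment is dropped: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(canonicalize("http://www.grok.in:80/"))  # http://www.grok.in/
print(canonicalize("http://www.grok.in/"))     # http://www.grok.in/
```

Both URLs from the example map to the same string, so a crawler that stores canonical forms in its set of seen URLs will fetch the page only once.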

  4. Posted October 16, 2008 at 3:18 pm | Permalink

    Check out this link about web crawling; it’s very interesting:

    http://crawltheweb.blogspot.com/

  5. Michael Wolf
    Posted November 1, 2009 at 4:36 pm | Permalink

    The “read more…” link from *this* article goes to the full text of a *different* article
