In the context of the World Wide Web, crawling refers to gathering web pages by following hyperlinks, starting from a small set of pages, for the purpose of further processing. For example, a Web search engine needs to gather as many pages as possible before it can index them and make them available for searching.
A program that performs crawling is variously known as a crawler, a spider, a robot, or simply a bot. The set of pages from which the crawler starts is known as the seed list.
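To make the basic idea concrete, here is a minimal sketch of such a crawl loop in Python: it starts from a seed list, keeps a queue of un-fetched URLs, downloads each page, and follows the hyperlinks it finds. The URL https://example.com/ and the page limit are placeholders, and a real crawler would need much more (politeness, error handling, persistence), so treat this only as an illustration of the traversal.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=50):
    """Breadth-first crawl starting from the seed list."""
    frontier = deque(seeds)   # queue of un-fetched URLs
    seen = set(seeds)         # naive duplicate detection on URLs
    pages = {}                # repository of crawled pages: URL -> HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue          # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


if __name__ == "__main__":
    repo = crawl(["https://example.com/"], max_pages=5)
    print(f"Fetched {len(repo)} pages")
```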
Although it may seem straightforward, writing a good web crawler is not. There are a number of challenges, which vary subtly depending on whether it is a large-scale web crawler or a crawler for a handful of websites. These include: ensuring politeness to web servers (by observing the widely accepted robots exclusion protocol), URL normalization, duplicate detection, avoiding spider traps, maintaining a queue of un-fetched pages, maintaining a repository of crawled pages, re-crawling, and a few more. For large-scale crawlers, one of the most important challenges is to increase throughput by optimizing resource utilization, since resource limits are usually what bound their coverage.
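Two of these challenges, URL normalization and politeness via robots.txt, can be sketched in a few lines with Python's standard library. The normalization steps shown (lowercasing the scheme and host, dropping the fragment, trimming a trailing slash) are just a common subset, and the user-agent name "MyCrawler" is a hypothetical example; production crawlers typically apply more rules and cache robots.txt per host.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def normalize(url: str) -> str:
    """A few common URL-normalization steps: lowercase the scheme and
    host, drop the fragment, and trim a trailing slash from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


def allowed(url: str, user_agent: str = "MyCrawler") -> bool:
    """Check the site's robots.txt (robots exclusion protocol)
    before fetching a URL."""
    parts = urlsplit(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()   # may raise on network failure; a real crawler would handle that
    return rp.can_fetch(user_agent, url)


print(normalize("HTTP://Example.COM/a/b/#section"))  # http://example.com/a/b
print(allowed("https://example.com/some/page"))
```

Normalizing URLs before adding them to the queue keeps the duplicate-detection set from treating trivially different spellings of the same address as distinct pages.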