<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The art of Information Engineering &#187; Crawling</title>
	<atom:link href="http://www.grok.in/blog/cats/crawling/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.grok.in</link>
	<description>(ignorance killed the cat, curiosity was framed)</description>
	<lastBuildDate>Tue, 11 Aug 2009 06:30:18 +0000</lastBuildDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Web Crawling with Perl</title>
		<link>http://www.grok.in/blog/2008/06/09/web-crawling-with-perl/</link>
		<comments>http://www.grok.in/blog/2008/06/09/web-crawling-with-perl/#comments</comments>
		<pubDate>Sun, 08 Jun 2008 19:13:25 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Crawling]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=49</guid>
		<description><![CDATA[If you are looking to write a web crawler, Perl, with all its great CPAN modules, is one of the best platforms you can pick. There are CPAN modules for most of the common components of a web crawler. Here, I&#8217;ll point to some of the modules that you would want to start out with.
Read [...]]]></description>
			<content:encoded><![CDATA[<p>If you are looking to write a web crawler, Perl, with all its great CPAN modules, is one of the best platforms you can pick. There are CPAN modules for most of the common components of a web crawler. Here, I&#8217;ll point to some of the modules that you would want to start out with.</p>
<p><em><strong><a title="Web Crawling with Perl - Notes" href="http://www.grok.in/notes/web-crawling-with-perl/">Read on&#8230;</a></strong></em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/06/09/web-crawling-with-perl/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Introduction to Web Crawling</title>
		<link>http://www.grok.in/blog/2008/06/07/introduction-to-web-crawling/</link>
		<comments>http://www.grok.in/blog/2008/06/07/introduction-to-web-crawling/#comments</comments>
		<pubDate>Sat, 07 Jun 2008 16:40:17 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Crawling]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=35</guid>
		<description><![CDATA[In the context of the World Wide Web, crawling refers to gathering web pages, by following hyperlinks, starting from a small set of web pages, for the purposes of further processing. For example, a Web search engine needs to gather as many pages as possible before it indexes and makes them available for searching.
A program [...]]]></description>
			<content:encoded><![CDATA[<p>In the context of the World Wide Web, <em>crawling</em> refers to gathering web pages, by following hyperlinks, starting from a small set of web pages, for the purposes of further processing. For example, a Web search engine needs to gather as many pages as possible before it indexes and makes them available for searching.</p>
<p>A program which performs crawling is variously known as a <em>crawler</em>, a <em>spider</em>, a <em>robot</em> or simply a <em>bot</em>. The set of pages from which the crawler starts crawling is known as the <em>seed list</em>.</p>
<p>Although it seems straightforward, writing a good web crawler is not. There are a number of challenges, which vary subtly depending on whether it&#8217;s a large-scale web crawler or a crawler for a handful of websites. These challenges include: ensuring politeness to the web servers (by observing the widely accepted <em>robots exclusion protocol</em>), <em>URL normalization</em>, <em>duplicate detection</em>, avoiding <em>spider traps</em>, maintaining a <em>queue</em> of un-fetched pages, maintaining a <em>repository</em> of crawled pages, <em>re-crawling</em> and a few more. For large-scale crawlers, one of the most important challenges is to increase the <em>throughput</em> by optimizing the <em>resource utilization</em>, because throughput is usually what limits their coverage.</p>
<p><a title="Introduction to Web Crawling - Notes" href="http://www.grok.in/notes/full-text-search/">Read on&#8230;</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/06/07/introduction-to-web-crawling/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>
