Web Crawling with Perl

If you are looking to write a web crawler, Perl, with all its great CPAN modules, is one of the best platforms you can pick. There are CPAN modules for most of the common components of a web crawler. Here, I’ll point to some of the modules that you would want to start out with.

LWP (“Library for WWW in Perl”) is a set of Perl modules that implement various aspects of requesting pages from the Web and handling the responses.

LWP::UserAgent is an implementation of a web user agent and can be used to dispatch web requests.
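To give a feel for it, here is a minimal sketch of a fetch helper built on LWP::UserAgent. The agent string, the timeout value and the helper name fetch_page are placeholders of my own, not anything LWP prescribes.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Agent string and timeout are illustrative values; pick your own.
my $ua = LWP::UserAgent->new(
    agent   => 'ExampleCrawler/0.1',
    timeout => 10,                     # seconds
);

# Hypothetical helper: returns the decoded page body, or undef on failure.
sub fetch_page {
    my ($url) = @_;
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}
```

A call such as fetch_page('http://www.example.com/') would then return the page text, or undef if the request failed.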

WWW::RobotRules makes it easy to parse robots.txt files and determine the permissions the crawler has on a given website.
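As a rough illustration, the sketch below feeds WWW::RobotRules a robots.txt given inline as a string, so it runs without touching the network. The crawler name, the host and the rules themselves are made up for the example; a real crawler would fetch the file from the site first.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# The robot name is a placeholder for your crawler's name.
my $rules = WWW::RobotRules->new('ExampleCrawler/0.1');

# Invented robots.txt content for an invented host.
my $robots_url = 'http://www.example.com/robots.txt';
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

# parse() needs the URL the file came from, so that the rules
# are recorded against the right host.
$rules->parse($robots_url, $robots_txt);

for my $url ('http://www.example.com/index.html',
             'http://www.example.com/private/report.html') {
    printf "%s: %s\n", $url, $rules->allowed($url) ? 'allowed' : 'disallowed';
}
```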

LWP::RobotUA combines LWP::UserAgent and WWW::RobotRules to free you from worrying about the robots exclusion protocol entirely.
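Setting one up might look like the sketch below. The agent name, the contact address and the helper name polite_fetch are placeholders I made up; LWP::RobotUA does insist on both an agent name and a contact address.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Agent name and contact address are placeholders.
my $ua = LWP::RobotUA->new(
    agent => 'ExampleCrawler/0.1',
    from  => 'crawler@example.com',
);

# delay() is expressed in minutes: 0.5 means the agent waits at least
# 30 seconds between requests to the same server.
$ua->delay(0.5);

# Hypothetical helper: robots.txt is consulted by the agent itself
# before the request goes out.
sub polite_fetch {
    my ($url) = @_;
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}
```

Because the agent sleeps between requests by default, a call to polite_fetch can block for a while; that is the price of being polite.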

LWP::MediaTypes helps with determining the media type of a URL; this allows you to only deal with those types that you are interested in.
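One way to use it is a small filter like the sketch below. The helper name looks_like_html and the URLs are my own inventions. Keep in mind that the guess is based on the file extension alone, so a URL without a telling extension will not be recognised; the Content-Type header of the actual response is the more reliable signal.

```perl
use strict;
use warnings;
use LWP::MediaTypes qw(guess_media_type);
use URI;

# Hypothetical helper: true when the URL's extension suggests an HTML page.
sub looks_like_html {
    my ($url) = @_;
    # Passing a URI object makes the guess work on the path part only.
    my $type = guess_media_type(URI->new($url));   # scalar context: type only
    return $type eq 'text/html';
}

for my $url ('http://www.example.com/docs/index.html',
             'http://www.example.com/images/logo.png') {
    printf "%-45s %s\n", $url, looks_like_html($url) ? 'crawl' : 'skip';
}
```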

HTML::LinkExtor can extract links from the crawled pages. (You may also use HTML::Parser or HTML::TreeBuilder for this purpose.)
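Here is a small sketch of link extraction with HTML::LinkExtor, run on an inline HTML snippet rather than a fetched page. The base URL and the markup are invented for the example. Passing a base URL to the constructor makes the module return absolute links.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Invented base URL and markup, standing in for a crawled page.
my $base = 'http://www.example.com/docs/';
my $html = '<a href="intro.html">Intro</a> '
         . '<img src="/images/logo.png"> '
         . '<a href="http://www.example.org/">Elsewhere</a>';

# No callback: links are collected and read back with links().
my $extractor = HTML::LinkExtor->new(undef, $base);
$extractor->parse($html);
$extractor->eof;

my @hrefs;
for my $link ($extractor->links) {
    my ($tag, %attrs) = @$link;          # e.g. ('a', href => URI object)
    next unless $tag eq 'a' && $attrs{href};
    push @hrefs, "$attrs{href}";         # stringify the URI object
}

print "$_\n" for @hrefs;
```

The filter on the a tag is there because the module also reports other link-carrying elements, such as img, which a crawler interested only in pages would usually skip.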

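URI::URL is a great help in dealing with URLs (normalization, for example). Note that URI::URL is nowadays kept mainly for backwards compatibility; the URI module offers the same facilities and is the one to reach for in new code.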
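A sketch of the kind of normalization a crawler typically wants is shown below. It uses the plain URI module rather than URI::URL, which is a substitution on my part, and the helper name normalize_url and the URLs are made up. The idea is that different spellings of the same address should collapse to a single string before you check whether a page has already been visited.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: turn a link found on a page into one canonical string.
sub normalize_url {
    my ($link, $base) = @_;
    my $uri = URI->new_abs($link, $base);   # resolve relative links against the page URL
    $uri->fragment(undef);                  # "#section" parts never reach the server
    return $uri->canonical->as_string;      # lowercase scheme and host, drop default port
}

my $page = 'http://www.example.com/a/c/d.html';
print normalize_url('../b.html', $page), "\n";
print normalize_url('HTTP://WWW.Example.COM:80/index.html#top', $page), "\n";
```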

Encode brings in Perl’s powerful Unicode handling.
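The reason this matters for a crawler is that pages arrive as bytes, while most text processing wants characters. A tiny sketch of the round trip, using a hard-coded UTF-8 byte string as a stand-in for a downloaded page:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they might arrive over the wire: "café" encoded in UTF-8.
my $bytes = "caf\xC3\xA9";

# Bytes to characters: the two-byte sequence becomes a single character.
my $text = decode('UTF-8', $bytes);

# Characters back to bytes, for example before writing to disk.
my $round_trip = encode('UTF-8', $text);

printf "bytes: %d, characters: %d\n", length($bytes), length($text);
```

In practice the encoding name would come from the response headers or the page itself rather than being hard-coded as it is here.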

WWW::Robot is an interesting module that combines several of the modules listed above to get you up and running with a crawler without having to understand and deal with each component yourself. You might now question my intent in presenting all the above modules separately when this one module would have sufficed. Although WWW::Robot offers some convenience, it is not complete by any means; you’ll still need to handle a fair amount yourself. More importantly, it is definitely better to know how WWW::Robot accomplishes its job than to treat it as a black box. I’m planning a blog post on WWW::Robot itself where I’ll discuss these points in greater detail.

In addition to the above, WWW::Mechanize is a handy module that allows you to automate interactions with websites. You can use it to have your crawler “browse” a website in a particular way. This is useful in specialized cases, such as when you want to crawl particular parts of a website that can only be reached through some pattern of interactions. I’ll probably do a blog post demonstrating some of the power of this tremendously useful module.

In the coming weeks, as I explain the various concepts in web crawling, I’ll use some of these modules to demonstrate those concepts.
