Tools/Libraries

This page lists tools and libraries useful for information retrieval (IR) engineering. Wherever possible, free and open source tools are given preference.

Crawling

  • Nutch: Open source web-search software. It builds on Lucene Java, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.
  • Heritrix: Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project.
  • WebSPHINX: A Java class library and interactive development environment for web crawlers.
  • Perl WWW::Robot: A Perl class for implementing well-behaved parallel web robots.

Indexing

  • Lucene: An Open Source, high-performance, full-featured text search engine library written entirely in Java.
  • Sphinx: Free open-source SQL full-text search engine.
  • Egothor: An Open Source, high-performance, full-featured text search engine written entirely in Java.
  • ht://dig: A complete world wide web indexing and searching system for a small domain or intranet.

Clustering

Information Extraction

  • LingPipe: A suite of Java libraries for the linguistic analysis of human language. See the online demos.
  • ANNIE: A portable IE system that is part of the GATE (General Architecture for Text Engineering) project.