Tools and libraries useful in information retrieval (IR) engineering. Wherever possible, free and open source tools and libraries are given preference.
Crawling
- Nutch: Open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.
- Heritrix: Internet Archive’s open-source, extensible, web-scale, archival-quality web crawler project.
- WebSPHINX: A Java class library and interactive development environment for web crawlers.
- Perl WWW::Robot: A Perl class for implementing well-behaved parallel web robots.
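As a rough illustration of what "well-behaved" means for a crawler, the sketch below checks URLs against a site's robots.txt rules before fetching them. It uses Python's standard `urllib.robotparser` module rather than any of the tools listed above. The robots.txt content, the `example-bot` user agent, the `make_policy` helper, and the example.com URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# from the target site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object (helper name is illustrative)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = make_policy(ROBOTS_TXT)

# "example-bot" is an assumed user-agent string for this sketch.
allowed = policy.can_fetch("example-bot", "http://example.com/index.html")
blocked = policy.can_fetch("example-bot", "http://example.com/private/data.html")
delay = policy.crawl_delay("example-bot")  # seconds to wait between requests

print(allowed, blocked, delay)
```

A polite crawler would skip any URL for which `can_fetch` returns false and wait at least `delay` seconds between requests to the same host.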
Indexing
- Lucene: An Open Source, high-performance, full-featured text search engine library written entirely in Java.
- Sphinx: Free open-source SQL full-text search engine.
- Egothor: An Open Source, high-performance, full-featured text search engine written entirely in Java.
- ht://dig: A complete world wide web indexing and searching system for a small domain or intranet.
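The core data structure behind full-text indexing is, broadly, an inverted index that maps each term to the documents containing it. The sketch below is a toy version for illustration only. It is not the API or the internal design of Lucene or any other tool listed above, and the function names and sample documents are assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Sample documents, made up for the example.
docs = {
    1: "Open source web crawler",
    2: "Full-text search engine library",
    3: "Open source search engine",
}
index = build_index(docs)
hits = search(index, "open source search")
print(sorted(hits))
```

Production engines add term positions, ranking, compression, and on-disk storage on top of this basic mapping.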
Clustering
- Carrot2: An open source search results clustering engine. See online demo.
Information Extraction
- LingPipe: A suite of Java libraries for the linguistic analysis of human language. See the online demos.
- ANNIE: A portable information extraction (IE) system, part of the GATE (General Architecture for Text Engineering) project.