A lot of applications have a requirement to search the full-text of some content they have for some words it might contain. This kind of functionality is often referred to as full-text search. For example, a blogging software might need to provide a search functionality that searches the blog posts for the user entered query terms.
It is not possible to use the regular database indexes (usually B-Trees or Hashmaps) for this purpose because they require that you provide the full value of the column you are searching in; in essence they do an equality search. In the blogging software example, the user would then have to type in the entire blog post verbatim in order to find it; even if you could imagine the most patient of users, if s/he already knows the entire post by-heart, why would s/he be looking for it anyway?!
SQL does offer the LIKE operator which allows a query to look for words occuring anywhere within the field. The query to search for the keyword ‘database‘ would be something like the following:
SELECT * FROM posts WHERE body LIKE '%database%'
The LIKE operator, while seemingly very simple to use, presents us with several disadvantages:
- Its performance (in terms of time taken for the query to execute) is pretty bad. The bad performance stems from two main reasons: (a) The DBMS is unable to use any indexes for a query containing the LIKE operator due to it’s very nature (and the nature of the indexes). This implies that the DBMS will have to scan through all the records in the database and try to match each record with the LIKE conditions. (b) For each record, the DBMS has to scan through the entire text of the column to match it with the LIKE condition on it.
- It does literal matching, without doing any pre-processing of the query or the content the way more advanced search systems do. At a minimum, stemming is an essential pre-processing step (stemming involves pruning out the common endings of words such as ing, s, ed etc.). So the LIKE operator will not match ‘coding’ against ‘code’ (or vice versa). It is of course possible for us to do the pre-processing ourselves (pre-process the content and store it in a separate column, for example), but it is usually too much trouble.
- It does not provide any mechanism for for relevance ranking. So a record containing the query term once as well as a record containing it ten times will both be ranked equally. In fact, all the matching records would be equally ranked and the ordering could be arbitrary.
Most RDBMS’s include support for full-text search indexes to serve just these needs. MySQL, Microsoft SQL Server and Oracle all include such a functionality. I have not had the opportunity to use this functionality in Microsoft SQL Server or Oracle, but at least in MySQL this has a lot of limitations.
I’ll now discuss, in brief, the full-text functionality offered by MySQL and some alternative libraries/software that you can use instead.
MySQL Full-text Search
MySQL has pretty good support for full-text search (link to the MySQL documentation). The MySQL documentation for this feature is very good, it should get you up and running in no time at all. There are plenty of tutorials on this subject all over the Web that you might find useful; here are a couple to get you started: MySQL FULLTEXT Searching, Using MySQL Full-text Searching.
Pros:
- Easy—maybe even too easy—to setup and use. You add a bunch of indexes, modify your queries to use then, …well that’s all!
- Supports a good set of features: stop words, boolean searches, query expansion, relevance scoring (based on TF-IDF score, no less; if you don’t know what TF-IDF is, don’t bother, it does not matter).
- If you insert new data into the table, it is instantaneously indexed and available for searching. This is a ‘pro’ because the other systems cannot offer this trivially.
Cons:
- Does not scale very well with the number of records. In fact, the performance, in my experience, is sub-linear. I’ve had tremendous delays (in the order of minutes) even with under a million records.
- Many of the MySQL full-text options can only be changed globally for the entire server, not even for a table/database. These include: the stop-word list (list of common words to ignore; for example: a, an, the, of etc.), the minimum and maximum length of words that should be included in the index (the default minimum is 4; words of 3 charcters or less will not be included in the index), the set of characters that are considered as word characters (i.e. characters that’ll form a word; ‘-’ is not treated as a word character by default, for example, and so if the term ‘full-text’ occurs in the record, ‘full’ and ‘text’ will be indexed separately.) Some of these changes also require you to rebuild the index manually.
- Any word that occurs in more than 50% of the records is treated as a stop word and ignored. This might not be a problem for a lot of situations but I’ve certainly come across situations where it is. This behaviour can be changed but requires that you recompile MySQL after making some changes (yay! for open source) and even then greatly reduces the performance.
- MySQL full-text search is available only if you use the MyISAM storage engine. Using MyISAM is not an option for many transactional systems because of its table-locking behaviour; InnoDB is usually preferable in these situations (but this is all for another discussion).
Apache Lucene
From the Lucene website: [emphasis mine] “Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.”
Apache Lucene is one of the most well-known projects from the Apache Software Foundation. It is very powerful, offers an unrivalled set of features and extremely scalable.
Pros:
- It is a library and so extremely flexible. You can have to write your own code over it and get extactly what you want.
- Great support. A lot of people use and hence write about Lucene. The mailing lists are enormously helpful too. There is even a book on Lucene: Lucene in Action.
- It has very solid design and is very stable.
Cons:
- It is a library (yes, this is both a pro and a con) and hence requires you to write a considerable amount of code even to set up a basic system.
- The learning curve is pretty steep.
- It’s written in Java thus forcing you to write your code in Java too. While I have nothing against Java (no really, I don’t hate it any more) this may not suit your needs (like say, the rest of your application is not in Java.)
Solr
From the Solr website: [emphasis mine] “Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat.”
Solr is basically a bunch of code over Lucene that you’d anyway write if you were to use Lucene directly (okay, you wouldn’t write all of it but most of the code you would write is already written for you by Solr) so you can get on with just using it! Setting up an entire search system is a matter of adding a little configuration and pointing Solr to your data.
Pros:
- Very easy to set up and use.
- Uses Lucene in the backend and thus inherits all its great features, including scalability and performance.
- Has some great bonus features other than the expected ones (like stop words, stemming etc.): hit highlighing, faceted search (if you have multitple “properties” for your records, this is very useful; the options to narrow your search found on several shopping sites is an example of this feature), distributed setup (for even more scalability), caching (for better response times) and several other really desirable features like web administration interface.
- Solr can also be used as a library like Lucene if you desire a lot more control. Even in this mode, several of its advantages are retained (although some like web administration interface are lost).
- API access implies that you can write your main program in any language you desire. It also implies that you can have Solr deployed on a machine separate from where the rest of your code is running.
- Very good active support over the mailing lists.
Cons:
- The documenation might leave you wanting more a lot of times.
- Not the most mature solution in the market.
- Solr does not provide direct access to the underlying Lucene interfaces. In this way it takes away at least some of the flexibility that Lucene offers. But this is something that can be expected of any system built over a library—certain decisions taken by the system will cloud at least some of the library’s features.
Sphinx
From the Sphinx website: [emphasis mine] “Sphinx is a standalone search engine, meant to provide fast, size-efficient and relevant fulltext search functions to other applications. Sphinx was specially designed to integrate well with SQL databases and scripting languages. Currently built-in data sources support fetching data either via direct connection to MySQL or PostgreSQL, or using XML pipe mechanism (a pipe to indexer in special XML-based format which Sphinx recognizes).”
If you have simple needs but desire a blazingly fast, well-implemented and complete search system, look no further than Sphinx Search. Its speed of indexing is, quite literally, mind-boggling!
Pros:
- Pretty easy to set up, especially if your data is already in a MySQL or PostgreSQL database.
- Mind-boggling speed of indexing. Yeah, I can say it many more times.
- Very good search speeds too.
- Simple interface for the scripting languages (PHP, Perl, Ruby—as well as Rails, Python etc.)
Cons:
- As the number of parallel queries increase, the performance starts degrading.
- When it comes to the features offered, Sphinx trails Lucene/Solr.
Others
I’ve not used these other softwares/libraries, but in spite of that, they actually exist!
- Ferret – A Ruby search library inspired by Lucene. There is an acts_as_ferret Rails plugin which makes it very easy to use it in Rails projects.
- mnoGoSearch – Written in C.
- Microsoft SQL Server Full-text Search
- Oracle Text