Marketing your Search Engine

I recently came across an interesting post by Mark Johnson, senior product manager at Powerset:

While demoing Live Search at the Web 2.0 Expo, people continually asked the same questions: “What makes Live different?” or “Show me some features that will make me want to switch from my search engine” or the extremely confrontational “Why do you think you’re better than Google?”

If only I had a dollar for every time someone asked me one of these questions when demoing Zook (my company’s mobile search engine)… I would have been richer by at least a couple of hundred dollars!

How do you convince someone that your search engine is better than Google? How do you convince someone that Google is not the be-all and end-all of search engines? Don’t get me wrong, I love Google. But I do resent the fact that, with its enormous market share, Google has moulded into people’s minds an expectation of what a search engine is and is not; so much so that even if Google itself introduced anything drastically different, it would probably get rejected! How do you convince someone of something when they are already convinced of the contrary? If you run a Web search engine, Mark Johnson offers some advice:

So, after awhile, I started my demos with a caveat about the nature of a search engine: I implored my audience to try out Live Search for a week so that, in the words of the immortal Lavar Burton of Reading Rainbow, “But, you don’t have to take my word for it.”

This is a great tactic if you are a generic Web search engine. But if you are a more specialised search engine, built to answer only a subset of the queries that people take to Google but to answer them much better than Google does, you are out of luck: people are not keen on juggling different search engines for different needs; they want a single box into which they can type a query and get results. So what is the recourse?

If you have Deep Web content (data that others, read as: Google, do not have access to), you are all set. Can there be anything sweeter than a monopoly? This is usually achieved by search engines that own the data they are searching; most of this data is inaccessible to the regular search engines. Several local search engines are this way. YouTube, the second largest search engine in the US, also falls in this category, as do Amazon, eBay and Craigslist. But with the increasing domination of Google and its growing importance in driving traffic, many of these websites are resorting to Search Engine Optimization (SEO) techniques, resulting in large portions of their content being made available for indexing by Google.

If you are a vertical search engine, you can entice people with features that go well beyond mere listing of results. These features can include things like sorting (example: sort by price in product search), actionable results (example: book tickets in movie search), faceted search (example: narrow down by brands in product search — useful for cameras, laptops etc.), community inputs (example: user reviews & ratings in restaurant search) and so on. Often, several such perks are needed to get people to try out your search engine long enough for them to really experience it.

But if you are a drastically different search engine trying to bring in a whole new paradigm of search, you are facing a really tough battle. Wolfram|Alpha learned this the hard way after its recent launch. Powerset faced similar hurdles when trying to convince people that it was worthy of attention. Several other such attempts have been made, but none too successfully.

Developing a brand new search engine in this Google-dominated world is no longer just about coming up with great technical ideas. The technical superiorities need to be market driven. The ideas need to come from a marketing perspective. This does not imply that there is any less scope for leaps of technical improvement; it just means that without a marketing plan to go with them, such improvements will find themselves in obscurity in a hurry.

We came to this realisation at Zook a long time ago. When we did, and started developing our marketing strategy, it was mere good fortune that we found that most of our development until that time would align quite well with the strategy. We were lucky.

We like to think of Zook as a lateral search engine i.e. we specialise in some kinds of content unlike a horizontal search engine (most Web search engines) but unlike most vertical search engines we pan across several verticals without going too deep into any one of them. Being this way, we are able to offer several of the features/benefits that are normally the privilege only of vertical search engines — actionable results (examples: buy/download a song, reserve a table at a restaurant, subscribe to alerts), faceted search (examples: restaurant bangalore, ringtones, movies) etc.

Another thing we have going for Zook is that people seem to be more open to trying out alternative search engines on their mobiles than they are on their desktops; Google doesn’t yet have a stranglehold on mobile internet users. In most cases, the promise of “exact/precise information instead of a set of links that you have to navigate yourself to find the information” gets people excited enough that they are willing to give Zook a try.

Zook has a lot of Deep Web content that we source directly from the multitude of our partners. That helps too :).

Posted in Marketing, Search Engines | 2 Comments

Cloud Computing

A couple of weeks ago I participated in an interesting discussion on Cloud Computing at an unconference in Bangalore. Though the discussion was to be on “whether Cloud Computing is inevitable or not”, we hardly got past defining it! That just about demonstrates the confusion that surrounds Cloud Computing — it isn’t even clear what it’s supposed to be. It’s not for no reason that it has been referred to as Haze Computing!

I think everyone has a moral (:P) responsibility to add to the confusion. Only through such attempts can we achieve clarity. This post, an attempt to put into words my understanding of what Cloud Computing is and is not, is a contribution to that end.

Read on…

Posted in Cloud Computing | 15 Comments

Software/Libraries for Full-text Search

A lot of applications need to search the full text of some content they hold for words it might contain. This kind of functionality is often referred to as full-text search. For example, blogging software might need to provide a search function that searches the blog posts for the user-entered query terms.

It is not possible to use the regular database indexes (usually B-Trees or Hashmaps) for this purpose because they are built to match on the full value of the column you are searching in (B-Trees can also handle ranges and prefixes, but not words in the middle of a text); in essence they do an equality search. In the blogging software example, the user would then have to type in the entire blog post verbatim in order to find it; even if you could imagine the most patient of users, if s/he already knows the entire post by heart, why would s/he be looking for it anyway?!
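To make the contrast with an equality search concrete, here is a minimal sketch of the kind of structure full-text search typically relies on: an inverted index that maps each word to the posts containing it. This is only an illustration under my own assumptions; the names `tokenize`, `build_inverted_index` and `search`, and the sample posts, are made up for the example and do not come from any particular library.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(posts):
    """Map each word to the set of post ids whose text contains it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in tokenize(text):
            index[word].add(post_id)
    return index


def search(index, query):
    """Return ids of posts that contain every word of the query (AND semantics)."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the postings of the first word and intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data, purely for illustration.
posts = {
    1: "Web crawling with Perl and CPAN modules",
    2: "An introduction to web crawling",
    3: "Database normalization debates",
}
index = build_inverted_index(posts)
matches = search(index, "web crawling")  # ids of posts containing both words
```

The user only has to remember a word or two from the post, not the whole text. Real full-text engines add a lot on top of this (stemming, ranking, phrase queries), none of which is attempted here.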

Read on…

Posted in Information Retrieval | 3 Comments

The Mother of All Database Normalization Debates

Over at the “Coding Horror” blog, Jeff Atwood published an interesting article titled “Maybe Normalizing Isn’t Normal”.

But more than the article itself, the debate that ensued in the comments there is very interesting. The “High Scalability” blog published a compilation of some of the interesting quotes from the debate. This compilation provides a great overview of the (admittedly long) discussion.

I would recommend that you read the original article first and then the compilation of the quotes at the High Scalability blog.

Posted in Databases | 8 Comments

Web Crawling with Perl

If you are looking to write a web crawler, Perl, with all its great CPAN modules, is one of the best platforms you can pick. There are CPAN modules for most of the common components of a web crawler. Here, I’ll point to some of the modules that you would want to start out with.

Read on…

Posted in Crawling | 5 Comments

Introduction to Web Crawling

In the context of the World Wide Web, crawling refers to gathering web pages, by following hyperlinks, starting from a small set of web pages, for the purposes of further processing. For example, a Web search engine needs to gather as many pages as possible before it indexes and makes them available for searching.

A program which performs crawling is variously known as a crawler, a spider, a robot or simply a bot. The set of pages from which the crawler starts crawling is known as the seed list.
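The basic loop described above (start from a seed list, follow hyperlinks, never fetch a page twice) can be sketched in a few lines. To keep the sketch runnable without touching the network, the "web" here is a hypothetical in-memory dictionary, and `crawl`, `toy_get_links` and `TOY_WEB` are names invented for this example; a real crawler would fetch and parse pages inside `get_links`.

```python
from collections import deque


def crawl(seed_list, get_links, max_pages=100):
    """Breadth-first crawl; returns pages in the order they were visited."""
    frontier = deque()  # queue of pages discovered but not yet visited
    seen = set()        # every page ever queued, to avoid repeats
    for url in seed_list:
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical toy web: page name -> pages it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def toy_get_links(url):
    """Stand-in for fetching a page and extracting its hyperlinks."""
    return TOY_WEB.get(url, [])


order = crawl(["a"], toy_get_links)
```

Using a queue gives a breadth-first order; that is just one reasonable choice for a sketch, and real crawlers usually prioritise the frontier in more elaborate ways.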

Although it seems pretty straightforward, writing a good web crawler is anything but. There are a good number of challenges, which vary subtly depending on whether it’s a large-scale web crawler or a crawler for a handful of websites. These challenges include: ensuring politeness to the web servers (by observing the widely accepted robots exclusion protocol), URL normalization, duplicate detection, avoiding spider traps, maintaining a queue of un-fetched pages, maintaining a repository of crawled pages, re-crawling and a few more. For large-scale crawlers, one of the most important challenges is increasing throughput by optimizing resource utilization, because that is usually what limits their coverage.
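Two of the challenges listed above, politeness and URL normalization, are easy to illustrate with Python's standard library. The following is a rough sketch, not a complete treatment: `normalize_url` applies only a few of the possible normalization rules (which rules to apply is my assumption), and the robots.txt content and the `toy-crawler` user agent are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Lowercase scheme and host, drop default port and fragment, default path to '/'.

    Note: this simplified version also drops any user:password part of the URL.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def make_robots_checker(robots_txt):
    """Build a parser from the text of a robots.txt file (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""
checker = make_robots_checker(ROBOTS_TXT)

# Duplicate detection at the URL level: variants collapse to one key.
seen = {normalize_url("HTTP://Example.COM:80/a/b?x=1#top")}
is_duplicate = normalize_url("http://example.com/a/b?x=1") in seen
```

Normalizing before checking the `seen` set catches only duplicates that differ in URL spelling; detecting pages with identical content under different URLs needs content-based techniques, which this sketch does not cover.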

Read on…

Posted in Crawling | 6 Comments

More Bayes’ magic

Statistics can be quite bewildering. Consider the following problem:

It is given that if a person having a disease takes a diagnostic test for the disease, the test returns a positive result 99% of the time, or with a probability of 0.99. Now, for some person picked at random, if the test returns a positive result, what is the probability that s/he has the disease?

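You might think that the probability is of course 0.99. But that isn’t so: the answer also depends on how common the disease is and on how often the test returns a positive result for people who do not have it. If you did reach the naive conclusion, don’t worry: a lot of eminent scientists and doctors have been seen making the same mistake (try it with your doctor!)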
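To see how far from 0.99 the answer can be, here is a small sketch that applies Bayes' rule. The problem as stated gives only the 0.99 figure; the prevalence (1% of people have the disease) and the false positive rate (5%) below are numbers I am assuming purely for illustration, and `posterior_given_positive` is a made-up helper name.

```python
from fractions import Fraction


def posterior_given_positive(sensitivity, prevalence, false_positive_rate):
    """P(disease | positive test) via Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = false_positive_rate * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


sensitivity = Fraction(99, 100)          # given in the problem
prevalence = Fraction(1, 100)            # ASSUMED for illustration
false_positive_rate = Fraction(5, 100)   # ASSUMED for illustration

posterior = posterior_given_positive(sensitivity, prevalence, false_positive_rate)
print(posterior, float(posterior))
```

With these assumed numbers the probability works out to 1/6, nowhere near 0.99: most positive results come from the much larger group of healthy people. Different assumptions give different answers, which is exactly why the 0.99 figure alone is not enough.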

Read More »

Posted in Probability & Statistics | 3 Comments

Probabilities, huh!

sanket asked a very interesting question in the comments to my previous post on the Monty Hall Problem:

Assume that boys and girls are equally likely to be born. Let us say that a family has two children. Given that one of them is a boy, what is the probability that the other one is a boy too? (Source: one of Scott Aaronson’s lecture notes.)

Update: It turns out that after stating this problem this way here, I solved a different problem altogether. Thanks, Nikhil, for pointing it out in the comments. To keep my life simple, I’ll state a modified problem below — the one that I did solve.

Assume that boys and girls are equally likely to be born. Let us say that a family has two children. Given that one of them is a boy, what is the probability that the other one is a girl?

Most people would jump in with 1/2 as the answer. Of course, if the answer were that obvious the question wouldn’t exist. The answer is 2/3. Here I will describe two different ways of arriving at this, as well as the common mistake that leads people to 1/2.
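One quick way to convince yourself of the 2/3 is to enumerate the equally likely families by brute force. This sketch reads "one of them is a boy" as "at least one of the two children is a boy", which is the interpretation under which 2/3 is the answer; the variable names are mine.

```python
from fractions import Fraction
from itertools import product

# All equally likely (first child, second child) combinations: BB, BG, GB, GG.
families = list(product("BG", repeat=2))

# Condition: at least one of the two children is a boy.
with_a_boy = [f for f in families if "B" in f]

# Among those, the families where the other child is a girl.
boy_and_girl = [f for f in with_a_boy if "G" in f]

p_other_is_girl = Fraction(len(boy_and_girl), len(with_a_boy))
p_other_is_boy = 1 - p_other_is_girl
print(p_other_is_girl, p_other_is_boy)
```

Three families survive the condition (BB, BG, GB) and two of them contain a girl, so the answer is 2/3 for a girl and 1/3 for a boy.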

Read More »

Posted in Probability & Statistics | 2 Comments

To switch or not to switch, that is the question

I recently came across a very interesting problem known as “The Monty Hall Problem.” This is a statistical puzzle named after the host of an old television show “Let’s Make a Deal” which featured a similar problem albeit a little more involved than the basic version that mathematicians use. Here is a simple description of the problem from Wikipedia:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?
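If you would rather experiment than reason, the game is easy to simulate. The sketch below plays the basic version many times with and without switching; the function names, the number of trials and the fixed seed are arbitrary choices of mine, and the host is assumed to always open a goat door other than the one you picked.

```python
import random


def play(rng, switch):
    """Play one round; return True if the final pick is the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host, who knows where the car is, opens a goat door you did not pick.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=20000, seed=0):
    """Fraction of rounds won under the given strategy."""
    rng = random.Random(seed)
    wins = sum(play(rng, switch) for _ in range(trials))
    return wins / trials


print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

Running it, the switching strategy should win roughly twice as often as staying, which agrees with the well-known answer of 2/3 versus 1/3.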

Read More »

Posted in Probability & Statistics | 4 Comments

More data usually beats better algorithms?

Anand Rajaraman, who teaches a class on Machine Learning at Stanford, recently wrote an interesting blog post: More data usually beats better algorithms, he claimed. The post makes for an interesting read and so does the plethora of comments on it. He made a follow-up post, which is equally interesting.

I do agree with a good number of the points he brings up, but at the same time believe that such a blanket statement is not warranted. I believe that giving more data to a given algorithm does produce better results, especially if the new data is independent and the algorithm is capable of utilizing such data appropriately. But to say that more data is more important than better algorithms most of the time is far-fetched.

Read More »

Posted in Machine Learning | 3 Comments
  • About

    This is a blog primarily focussed on the subjects of Information Engineering—Retrieval, Extraction & Management, Machine Learning, Scalability and Cloud Computing.