Google Sets is a really fun experiment from Google Labs. It basically allows you to “automatically create sets of items from a few examples.” So you can enter “Sachin Tendulkar”, “Rahul Dravid” and “Sourav Ganguly” and be presented with a much larger set of players from the Indian cricket team. Or enter “Athens”, “Sydney”, “Atlanta”, “Barcelona” and “Tokyo” to get a much larger set of the cities that have hosted (or will be hosting) the Olympics. Neat.
If you have not come across this before, let me warn you: this can cause you to waste a whole lot of time and have a whole lot of fun (how many sex positions do you know?). But I remember wasting a lot of time in another way when Google Sets was first unveiled: discussing/arguing/fighting with my friends about how it was done. (Nah, not really; we didn’t fight over it, I just made that up. But it would have been fun to fight in Google Sets: instead of hitting your opponent directly, you would enter your moves in Google Sets, which would then expand them to a much larger set and throw them all at your opponent!) Of course we did search for existing literature on the subject, but that did not turn up anything interesting. We did come up with some pretty clever techniques ourselves, most of which I do not remember.
One technique I do remember is the one which I remember thinking of as the most practical. The basic idea is to gather words/phrases that co-occur (i.e. occur together in the same document/context) to form sets. But simple co-occurrence won’t do: it will pollute a set with words that are extremely relevant to the context of the set but do not belong to it. In the above examples, “Cricket” and “Olympics” could become part of those two sets respectively. It is quite easy to see why this could happen: “Sachin Tendulkar” would co-occur with “Cricket” much more often than with any of his teammates, and this is true for almost all the team members.
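To make the pollution problem concrete, here is a minimal sketch of plain document-level co-occurrence counting. The function name cooccurring_terms and the three toy "documents" are made up for illustration; real input would be terms extracted from crawled pages.

```python
from collections import Counter


def cooccurring_terms(seed, documents):
    """Count how many documents each term shares with the seed term.

    `documents` is an iterable of term collections (hypothetical toy input).
    """
    counts = Counter()
    for terms in documents:
        terms = set(terms)
        if seed in terms:
            counts.update(terms - {seed})
    return counts


# Toy documents, invented for illustration only.
documents = [
    {"Sachin Tendulkar", "Cricket", "Rahul Dravid"},
    {"Sachin Tendulkar", "Cricket"},
    {"Sachin Tendulkar", "Cricket", "Sourav Ganguly"},
]

counts = cooccurring_terms("Sachin Tendulkar", documents)
print(counts.most_common(1))
```

On this toy input the context word "Cricket" outranks every teammate, which is exactly the kind of term that should not end up in the set.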
Because simple co-occurrence will not suffice, we can get more specific: co-occurrence within a list. It could either be an HTML list (created with the ul, ol and li HTML elements) or a list specified in plain English (a sentence with the words/phrases separated by commas). By observing such lists across several web pages, we can easily construct sets of different kinds.
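The list idea can be sketched roughly as follows, using only Python's standard library. This is an illustration under assumptions, not how Google Sets actually works: the names ListExtractor, extract_lists and expand are hypothetical, the scoring rule (count seed overlap per list) is just one simple choice, and only HTML lists are handled, not comma-separated lists in prose.

```python
from collections import Counter
from html.parser import HTMLParser


class ListExtractor(HTMLParser):
    """Collect the text of <li> items, grouped by their enclosing <ul>/<ol>."""

    def __init__(self):
        super().__init__()
        self.lists = []    # finished lists, each a list of item strings
        self._stack = []   # lists that are currently open (handles nesting)
        self._item = None  # text fragments of the <li> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._flush_item()  # text before a nested list stays with the outer list
            self._stack.append([])
        elif tag == "li" and self._stack:
            self._flush_item()  # tolerate a missing </li>
            self._item = []

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush_item()
        elif tag in ("ul", "ol") and self._stack:
            self._flush_item()
            self.lists.append(self._stack.pop())

    def handle_data(self, data):
        if self._item is not None:
            self._item.append(data)

    def _flush_item(self):
        if self._item is not None and self._stack:
            text = " ".join("".join(self._item).split())  # normalize whitespace
            if text:
                self._stack[-1].append(text)
        self._item = None


def extract_lists(html):
    parser = ListExtractor()
    parser.feed(html)
    parser.close()
    return parser.lists


def expand(seeds, lists, top_n=5):
    """Rank non-seed items by how strongly their lists overlap with the seeds."""
    seeds = set(seeds)
    scores = Counter()
    for items in lists:
        members = set(items)
        overlap = len(members & seeds)
        if overlap:
            for item in members - seeds:
                scores[item] += overlap
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Toy page, invented for illustration only.
page = (
    "<ul><li>Sachin Tendulkar</li><li>Rahul Dravid</li><li>Anil Kumble</li></ul>"
    "<p>Sachin Tendulkar plays Cricket</p>"
    "<ol><li>Rahul Dravid<li>Sourav Ganguly<li>Anil Kumble</ol>"
)
lists = extract_lists(page)
print(expand({"Sachin Tendulkar", "Rahul Dravid"}, lists))
```

Because only list items are considered, the word "Cricket" in the surrounding paragraph never becomes a candidate, which is the point of restricting co-occurrence to lists.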
Although this technique might work, it is limited by the fact that we are using only a fraction of the web: only those pages which have a list. We can generalize this technique a little more to be able to use a lot more web pages and hence come up with more, better, and more comprehensive sets: we can form sets out of words/phrases that co-occur within a document with similar HTML markup around them. This generalization will allow us to bring in a lot more data that we can use to create the sets.
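One way to read "similar HTML markup" is "the same chain of enclosing tags and class attributes". The sketch below groups text fragments by that chain. This interpretation, and the names MarkupGrouper and candidate_sets, are assumptions made for illustration; a real system would need a fuzzier notion of similarity and a lot of noise filtering.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the path.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class MarkupGrouper(HTMLParser):
    """Group text fragments by the (tag, class) path that encloses them."""

    def __init__(self):
        super().__init__()
        self._path = []                  # currently open (tag, class) pairs
        self.groups = defaultdict(list)  # path tuple -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            css_class = dict(attrs).get("class") or ""
            self._path.append((tag, css_class))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        for i in range(len(self._path) - 1, -1, -1):
            if self._path[i][0] == tag:
                del self._path[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._path:
            self.groups[tuple(self._path)].append(text)


def candidate_sets(html, min_size=2):
    """Return groups of fragments that share identical surrounding markup."""
    parser = MarkupGrouper()
    parser.feed(html)
    parser.close()
    return [texts for texts in parser.groups.values() if len(texts) >= min_size]


# Toy page, invented for illustration only.
page = (
    '<table>'
    '<tr><td class="city">Athens</td><td class="year">2004</td></tr>'
    '<tr><td class="city">Sydney</td><td class="year">2000</td></tr>'
    '</table>'
    '<p>The <b>Olympics</b> are held every four years.</p>'
)
for group in candidate_sets(page):
    print(group)
```

On this toy page the cities and the years fall into separate groups even though no list markup is involved. The two loose fragments of the paragraph also form a (useless) group, which shows why the generalization brings in more noise along with more data.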
The technique outlined above is very similar to the one used by David Nadeau et al. in “Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity,” for a step they call “Generating Gazettes.” I was pleasantly surprised when I came across this paper and realized how similar their technique is to the one outlined above. So instead of trying to explain the technique in detail, I am just linking to the paper so that you can read it there.
When I did some searching for this post, I came across a couple of papers on this very topic; they are listed below. I have not gone through them myself.
1. Bayesian Sets
by Z. Ghahramani and K. A. Heller
Inspired by “Google Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with…
2. Related Word and Phrase Set Generation
by Sam Bradley
Problem: To develop a method that will take a small set of related words and use results from Google to find a larger set of words that are also related to the original set in a similar way.