<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The art of Information Engineering &#187; Information Retrieval</title>
	<atom:link href="http://www.grok.in/blog/cats/information-retrieval/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.grok.in</link>
	<description>(ignorance killed the cat, curiosity was framed)</description>
	<lastBuildDate>Tue, 11 Aug 2009 06:30:18 +0000</lastBuildDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Softwares/Libraries for Full-text Search</title>
		<link>http://www.grok.in/blog/2008/11/05/full-text-search/</link>
		<comments>http://www.grok.in/blog/2008/11/05/full-text-search/#comments</comments>
		<pubDate>Wed, 05 Nov 2008 13:47:15 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[full-text]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[sphinx]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=74</guid>
		<description><![CDATA[Many applications need to search the full text of their content for words it might contain. This kind of functionality is often referred to as full-text search. For example, blogging software might need to provide a search feature that searches the blog posts for the user-entered query [...]]]></description>
			<content:encoded><![CDATA[<p>Many applications need to search the full text of their content for words it might contain. This kind of functionality is often referred to as <em>full-text search</em>. For example, blogging software might need to provide a search feature that searches the blog posts for the user-entered query terms.</p>
<p>Regular database indexes (usually B-trees or hash maps) are a poor fit for this purpose because they look up the column value as a whole: a hash index requires the full value of the column you are searching in, and a B-tree can at best also match a prefix or a range of it. In the blogging software example, the user would then have to type in the entire blog post verbatim (or at least its opening) in order to find it; even if you could imagine the most patient of users, if s/he already knows the entire post by heart, why would s/he be looking for it anyway?!</p>
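<p>Full-text search engines typically get around this with an inverted index, which maps each word to the documents that contain it. The following is only a minimal sketch of that idea, not how any particular library implements it; the names <em>tokenize</em>, <em>build_inverted_index</em> and <em>search</em>, as well as the sample posts, are made up for illustration.</p>

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(posts):
    """Map each word to the set of post ids whose text contains it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in tokenize(text):
            index[word].add(post_id)
    return index


def search(index, query):
    """Return the ids of posts that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result = result.intersection(index.get(word, set()))
    return result


# Hypothetical sample data standing in for blog posts.
posts = {
    1: "Fun with Google Sets",
    2: "Full-text search with Lucene and Solr",
    3: "Searching text in MySQL",
}
index = build_inverted_index(posts)
print(sorted(search(index, "text search")))  # prints [2]
```

<p>Note that post 3 is not returned for the query because it contains &#8220;searching&#8221; rather than &#8220;search&#8221;; real engines usually add stemming, ranking and much more on top of this basic structure.</p>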
<p><em><strong><a title="Softwares/Libraries for Full-text Search - Notes" href="http://www.grok.in/notes/full-text-search/">Read on&#8230;</a></strong></em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/11/05/full-text-search/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Fun with Google Sets</title>
		<link>http://www.grok.in/blog/2008/03/21/fun-with-google-sets/</link>
		<comments>http://www.grok.in/blog/2008/03/21/fun-with-google-sets/#comments</comments>
		<pubDate>Fri, 21 Mar 2008 15:01:03 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Information Extraction]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[sets]]></category>

		<guid isPermaLink="false">http://www.grok.in/blog/2008/03/21/fun-with-google-sets/</guid>
		<description><![CDATA[Google Sets is a really fun experiment from Google Labs. It basically allows you to &#8220;automatically create sets of items from a few examples.&#8221; So you can enter &#8220;Sachin Tendulkar&#8221;, &#8220;Rahul Dravid&#8221; and &#8220;Sourav Ganguly&#8221; and be presented with a much larger set of the players of the Indian cricket team. Or enter &#8220;Athens&#8221;, [...]]]></description>
			<content:encoded><![CDATA[<p><a title="Google Sets" href="http://labs.google.com/sets">Google Sets</a> is a really fun experiment from <a title="Google Labs" href="http://labs.google.com/">Google Labs</a>. It basically allows you to &#8220;automatically create sets of items from a few examples.&#8221; So you can enter &#8220;Sachin Tendulkar&#8221;, &#8220;Rahul Dravid&#8221; and &#8220;Sourav Ganguly&#8221; and be presented with a much larger set of the players of the Indian cricket team. Or enter &#8220;Athens&#8221;, &#8220;Sydney&#8221;, &#8220;Atlanta&#8221;, &#8220;Barcelona&#8221; and &#8220;Tokyo&#8221; to get a much larger set of the cities that have hosted (or will be hosting) the Olympics. Neat.</p>
<p><span id="more-29"></span>If you have not come across this before, let me warn you: this can cause you to waste a whole lot of time and have a whole lot of fun (how many sex positions do you know? <img src='http://www.grok.in/wp/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> ). But I remember wasting a lot of time in another way when Google Sets was first unveiled: discussing/arguing/fighting with my friends about how it was done (nah, not really &#8212; we didn&#8217;t fight over it, I just made that up. But it would have been fun to fight <em>in</em> Google Sets &#8212; instead of hitting your opponent directly, you would enter your moves in Google Sets, which would then expand them to a much larger set and throw them all at your opponent!). Of course we did search for existing literature on the subject, but that did not turn up anything interesting. We did come up with some pretty clever techniques ourselves, most of which I do not remember.</p>
<p>One technique I do remember is the one which I remember thinking of as the most practical. The basic idea is to gather words/phrases that co-occur (i.e. occur together in the same document/context) to form sets. But simple co-occurrence won&#8217;t do &#8212; it will pollute a set with words that are extremely relevant to the context of the set but do not belong to it. In the above examples, &#8220;Cricket&#8221; and &#8220;Olympics&#8221; could become part of those two sets respectively. It is quite easy to see why this could happen: &#8220;Sachin Tendulkar&#8221; would co-occur with &#8220;Cricket&#8221; much more than with his teammates; this is true for almost all the team members.</p>
<p>Because simple co-occurrence will not suffice, we can get more specific: co-occurrence within a list. It could either be an HTML list (created with the <em>ul</em>/<em>ol</em> and <em>li</em> HTML elements) or a list specified in plain English (a sentence with the words/phrases separated by commas). By observing such lists across several web pages, we can easily construct sets of different kinds.</p>
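<p>The list co-occurrence idea can be sketched roughly as follows. This is only an illustration under simplifying assumptions, and certainly not how Google Sets itself works: the lists are assumed to have already been extracted from web pages, and the function name <em>expand_set</em>, the scoring rule (count the seeds that share a list with a candidate) and the sample data are all made up for this example.</p>

```python
from collections import Counter


def expand_set(seeds, lists, top_n=5):
    """Rank non-seed items by how many seeds they share lists with."""
    seed_set = {seed.lower() for seed in seeds}
    scores = Counter()
    for items in lists:
        normalized = {item.lower() for item in items}
        overlap = len(seed_set.intersection(normalized))
        if overlap == 0:
            continue  # this list says nothing about the seeds
        for item in normalized - seed_set:
            scores[item] += overlap
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical lists, as if extracted from three different web pages.
lists = [
    ["Sachin Tendulkar", "Rahul Dravid", "Sourav Ganguly", "Anil Kumble"],
    ["Rahul Dravid", "Anil Kumble", "VVS Laxman"],
    ["Athens", "Sydney", "Atlanta"],
]
seeds = ["Sachin Tendulkar", "Rahul Dravid"]
print(expand_set(seeds, lists))
# prints ['anil kumble', 'sourav ganguly', 'vvs laxman']
```

<p>Because only list members are counted, a context word such as &#8220;Cricket&#8221; would enter the set only if it appeared as an item in one of the lists.</p>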
<p>Although this technique might work, it is limited by the fact that we are using only a fraction of the web pages: only those pages that contain a list. We can generalize this technique a little to be able to use a lot more web pages and hence come up with more, better, and more comprehensive sets: we can form sets out of words/phrases that co-occur within a document with similar HTML markup around them. This generalization will allow us to bring in a lot more data that we can use to create the sets.</p>
<p>The technique outlined above is very similar to the one used by David Nadeau et al. in &#8220;<a title="Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity" href="http://iit-iti.nrc-cnrc.gc.ca/publications/nrc-48727_e.html">Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity</a>,&#8221; for a step they call &#8220;Generating Gazetteers.&#8221; I was pleasantly surprised when I came across this paper and realized how similar their technique is to the one outlined above. So instead of trying to explain the technique in detail, I am just linking to the paper so that you can read it there.</p>
<p>When I did some searching for this post, I came across a couple of papers on this very topic, which are listed below. I have not gone through them myself.</p>
<p>1. <a title="Bayesian Sets" href="http://citeseer.ist.psu.edu/737792.html">Bayesian Sets</a><br />
by: Z Ghahramani, KA Heller</p>
<p>Inspired by &#8220;Google Sets&#8221;, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with&#8230;</p>
<p>2. <a title="Related Word and Phrase Set Generation" href="http://www.google.com/url?sa=t&amp;ct=res&amp;cd=4&amp;url=http%3A%2F%2Fwww.d.umn.edu%2F~tpederse%2FCourses%2FCS5761-SPR04%2FProjects%2Fbrad0250.pdf&amp;ei=prjeR6TZMIHy7AOPmqTCCg&amp;usg=AFQjCNEms6FcKWLx_MMbA2-9re56m-jx2A&amp;sig2=JIEhYfAz4M4NypBWeVHdYA">Related Word and Phrase Set Generation</a><br />
by Sam Bradley<br />
Problem: To develop a method that will take a small set of related words and use results from Google to find a larger set of words that are also related to the original set in a similar way.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/03/21/fun-with-google-sets/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Information R/evolution</title>
		<link>http://www.grok.in/blog/2007/10/23/information-revolution/</link>
		<comments>http://www.grok.in/blog/2007/10/23/information-revolution/#comments</comments>
		<pubDate>Tue, 23 Oct 2007 02:21:20 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[information]]></category>
		<category><![CDATA[social]]></category>
		<category><![CDATA[videos]]></category>

		<guid isPermaLink="false">http://www.grok.in/blog/2007/10/23/information-revolution/</guid>
		<description><![CDATA[This is a must-watch video by Michael Wesch, Assistant Professor of Cultural Anthropology at Kansas State University.
This video explores the changes in the way we find, store, create, critique, and share information. This video was created as a conversation starter, and works especially well when brainstorming with people about the near future and [...]]]></description>
			<content:encoded><![CDATA[<p>This is a must-watch video by Michael Wesch, Assistant Professor of Cultural Anthropology at Kansas State University.</p>
<blockquote><p>This video explores the changes in the way we find, store, create, critique, and share information. This video was created as a conversation starter, and works especially well when brainstorming with people about the near future and the skills needed in order to harness, evaluate, and create information effectively.</p></blockquote>
<p><object height="355" width="425"><param name="movie" value="http://www.youtube.com/v/-4CV05HyAbM&amp;rel=1"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/-4CV05HyAbM&amp;rel=1" type="application/x-shockwave-flash" wmode="transparent" height="355" width="425"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2007/10/23/information-revolution/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Dude Experiment</title>
		<link>http://www.grok.in/blog/2007/05/12/the-dude-experiment/</link>
		<comments>http://www.grok.in/blog/2007/05/12/the-dude-experiment/#comments</comments>
		<pubDate>Sat, 12 May 2007 05:57:27 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[experiments]]></category>

		<guid isPermaLink="false">http://www.grok.in/blog/2007/05/12/the-dude-experiment/</guid>
		<description><![CDATA[A friend had once shown me something very interesting: no matter how many [u]s you use in [dude], Google always has results for it! (One of those results for some twenty-odd [u]s was a blog entry of his &#8212; that&#8217;s how he came up with this). I just decided to run this experiment using the [...]]]></description>
			<content:encoded><![CDATA[<p>A friend had once shown me something very interesting: no matter how many [<em>u</em>]s you use in [<em>dude</em>], Google always has results for it! (One of those results for some twenty-odd [<em>u</em>]s was a blog entry of his &#8212; that&#8217;s how he came up with this). I just decided to run this experiment using the Google Search API and record the results. You can see the results (from 1 [<em>u</em>] to 900; Google only allows 1000 searches per day per IP <img src='http://www.grok.in/wp/wp-includes/images/smilies/icon_sad.gif' alt=':(' class='wp-smiley' /> ) at <a href="http://spreadsheets.google.com/pub?key=p0O6ZaIrydVN03TRYGjvBPA" target="_blank" style="font-size: 9pt" class="aBlue">http://spreadsheets.google.com/pub?key=p0O6ZaIrydVN03TRYGjvBPA</a>.</p>
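<p>For anyone who wants to repeat the experiment, generating the queries is the easy part. The sketch below is not the script that produced the spreadsheet; <em>dude_queries</em> and <em>run_experiment</em> are names made up here, and <em>fetch_result_count</em> is a hypothetical function you would have to supply yourself to ask a search engine for its result count.</p>

```python
def dude_queries(max_us):
    """Yield (count, query) pairs: 'dude' spelled with 1..max_us letters u."""
    for count in range(1, max_us + 1):
        yield count, "d" + "u" * count + "de"


def run_experiment(max_us, fetch_result_count):
    """Record the result count reported for each query.

    fetch_result_count is a caller-supplied (hypothetical) function that takes
    a query string and returns the number of results a search engine reports.
    """
    return {
        count: fetch_result_count(query)
        for count, query in dude_queries(max_us)
    }


for count, query in dude_queries(3):
    print(count, query)
```

<p>Passing the lookup function in as a parameter keeps the query generation independent of whichever search API (and daily quota) you end up using.</p>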
<p><a href="http://www.grok.in/wordpress/wp-content/the_dudes.png" title="The Dudes"><img src="http://www.grok.in/wordpress/wp-content/the_dudes.png" alt="The Dudes" height="288" width="432" /></a></p>
<p>But what is this supposed to show? It&#8217;s not one thing, it&#8217;s many <img src='http://www.grok.in/wp/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> . Draw your own conclusions!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2007/05/12/the-dude-experiment/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
