<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The art of Information Engineering &#187; Information Extraction</title>
	<atom:link href="http://www.grok.in/blog/cats/information-extraction/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.grok.in</link>
	<description>(ignorance killed the cat, curiosity was framed)</description>
	<lastBuildDate>Tue, 11 Aug 2009 06:30:18 +0000</lastBuildDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Fun with Google Sets</title>
		<link>http://www.grok.in/blog/2008/03/21/fun-with-google-sets/</link>
		<comments>http://www.grok.in/blog/2008/03/21/fun-with-google-sets/#comments</comments>
		<pubDate>Fri, 21 Mar 2008 15:01:03 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Information Extraction]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[sets]]></category>

		<guid isPermaLink="false">http://www.grok.in/blog/2008/03/21/fun-with-google-sets/</guid>
		<description><![CDATA[Google Sets is a really fun experiment from Google Labs. It basically allows you to &#8220;automatically create sets of items from a few examples.&#8221; So you can enter &#8220;Sachin Tendulkar&#8221;, &#8220;Rahul Dravid&#8221; and &#8220;Sourav Ganguly&#8221; and be presented with a much larger set of players of the Indian cricket team. Or enter &#8220;Athens&#8221;, [...]]]></description>
			<content:encoded><![CDATA[<p><a title="Google Sets" href="http://labs.google.com/sets">Google Sets</a> is a really fun experiment from <a title="Google Labs" href="http://labs.google.com/">Google Labs</a>. It basically allows you to &#8220;automatically create sets of items from a few examples.&#8221; So you can enter &#8220;Sachin Tendulkar&#8221;, &#8220;Rahul Dravid&#8221; and &#8220;Sourav Ganguly&#8221; and be presented with a much larger set of players of the Indian cricket team. Or enter &#8220;Athens&#8221;, &#8220;Sydney&#8221;, &#8220;Atlanta&#8221;, &#8220;Barcelona&#8221; and &#8220;Tokyo&#8221; to get a much larger set of the cities that have hosted (or will be hosting) the Olympics. Neat.</p>
<p><span id="more-29"></span>If you have not come across this before, let me warn you: this can cause you to waste a whole lot of time and have a whole lot of fun (how many sex positions do you know? <img src='http://www.grok.in/wp/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />). But I remember wasting a lot of time in another way when Google Sets was first unveiled: discussing/arguing/fighting with my friends about how it was done (nah, not really &#8212; we didn&#8217;t fight over it, I just made that up. But it would have been fun to fight <em>in</em> Google Sets &#8212; instead of hitting your opponent directly, you would enter your moves in Google Sets, which would then expand them into a much larger set and throw them all at your opponent!). Of course, we did search for existing literature on the subject, but that did not turn up anything interesting. We did come up with some pretty clever techniques ourselves, most of which I do not remember.</p>
<p>One technique I do remember is the one that I remember thinking of as the most practical. The basic idea is to gather words/phrases that co-occur (i.e. occur together in the same document/context) to form sets. But simple co-occurrence won&#8217;t do &#8212; it will pollute a set with words that are extremely relevant to the context of the set but do not belong to it. In the above examples, &#8220;Cricket&#8221; and &#8220;Olympics&#8221; could become part of those two sets respectively. It is quite easy to see why this could happen: &#8220;Sachin Tendulkar&#8221; would co-occur with &#8220;Cricket&#8221; much more often than with any of his teammates; this is true for almost all the team members.</p>
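<p>To make the pollution problem concrete, here is a minimal sketch of document-level co-occurrence counting. The function name <code>cooccurrence_counts</code>, the plain substring matching and the idea of passing in a fixed vocabulary are all assumptions made for illustration; this is not how Google Sets works.</p>

```python
from collections import Counter


def cooccurrence_counts(seed, documents, vocabulary):
    """Count in how many documents each vocabulary term appears alongside the seed.

    Hypothetical helper: matching is naive, case-insensitive substring search.
    """
    seed = seed.lower()
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        if seed not in lowered:
            continue  # only documents that mention the seed are of interest
        for term in vocabulary:
            if term.lower() != seed and term.lower() in lowered:
                counts[term] += 1
    return counts
```

<p>On almost any collection of cricket articles, a context word such as &#8220;Cricket&#8221; will tend to collect more counts for the seed &#8220;Sachin Tendulkar&#8221; than any single teammate does, which is exactly the pollution described above.</p>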
<p>Because simple co-occurrence will not suffice, we can get more specific: co-occurrence within a list. It could either be an HTML list (created with the <em>ul</em>/<em>ol</em>/<em>li</em> HTML elements) or a list specified in plain English (a sentence with the words/phrases separated by commas). By observing such lists across several web pages, we can construct sets of different kinds.</p>
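<p>The list idea can be sketched with Python&#8217;s standard <code>html.parser</code> module. Everything named here (<code>ListItemCollector</code>, <code>lists_in_page</code>, <code>expand_set</code>, the <code>min_seed_hits</code> threshold) is hypothetical, and ranking by a simple vote count is my assumption, not a description of the real system. Only HTML lists are handled; comma-separated lists in running text are left out.</p>

```python
from collections import Counter
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collects the text of li items, grouped by their enclosing ul/ol list."""

    def __init__(self):
        super().__init__()
        self.lists = []    # finished lists, each a list of item strings
        self._stack = []   # lists that are currently open (handles nesting)
        self._item = None  # text pieces of the li being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._flush_item()  # text before a nested list still counts
            self._stack.append([])
        elif tag == "li" and self._stack:
            self._flush_item()  # tolerate li elements that were never closed
            self._item = []

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush_item()
        elif tag in ("ul", "ol") and self._stack:
            self._flush_item()
            self.lists.append(self._stack.pop())

    def handle_data(self, data):
        if self._item is not None:
            self._item.append(data)

    def _flush_item(self):
        if self._item is not None and self._stack:
            text = " ".join("".join(self._item).split())
            if text:
                self._stack[-1].append(text)
        self._item = None


def lists_in_page(html):
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    return parser.lists


def expand_set(seeds, pages, min_seed_hits=2):
    """Rank items that share a list with at least min_seed_hits of the seeds."""
    seeds = {s.lower() for s in seeds}
    votes = Counter()
    for html in pages:
        for items in lists_in_page(html):
            lowered = {item.lower() for item in items}
            if len(lowered.intersection(seeds)) >= min_seed_hits:
                votes.update(lowered.difference(seeds))
    return [item for item, _ in votes.most_common()]
```

<p>Requiring at least two seeds in the same list is a deliberately crude filter: a list that merely mentions one seed next to &#8220;Cricket&#8221; contributes nothing, while a squad list containing several seeds votes for every other name on it.</p>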
<p>Although this technique might work, it is limited because it uses only a fraction of the web: the pages that contain a list. We can generalize the technique to use many more web pages and hence come up with more, better and more comprehensive sets: we can form sets out of words/phrases that co-occur within a document with similar HTML markup around them. This generalization will allow us to bring in a lot more data that we can use to create the sets.</p>
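<p>One way to read &#8220;similar HTML markup&#8221; is &#8220;the same chain of enclosing tags and class attributes&#8221;. The sketch below groups the text snippets of a page by that chain. The names <code>MarkupGrouper</code> and <code>candidate_sets</code>, the tag-path signature and the <code>min_size</code> cut-off are assumptions of mine; a real system would need a much more forgiving notion of similarity.</p>

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the path.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class MarkupGrouper(HTMLParser):
    """Groups text snippets by the chain of tags (and classes) around them."""

    def __init__(self):
        super().__init__()
        self.groups = defaultdict(list)  # tag path -> snippets found under it
        self._path = []                  # currently open elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class") or ""
        self._path.append(tag + "." + css_class if css_class else tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        for i in range(len(self._path) - 1, -1, -1):
            if self._path[i].split(".")[0] == tag:
                del self._path[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._path:
            self.groups["/".join(self._path)].append(text)


def candidate_sets(html, min_size=3):
    """Return the groups of snippets that share a tag path, if large enough."""
    parser = MarkupGrouper()
    parser.feed(html)
    parser.close()
    return [texts for texts in parser.groups.values() if len(texts) >= min_size]
```

<p>A table of host cities, for example, would typically yield one candidate set per column, because every cell in a column sits under the same tag path. Note that this also produces sets that are not interesting (a column of years, say), so some later filtering against the seed examples would still be needed.</p>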
<p>The technique outlined above is very similar to the one used by David Nadeau et al. in &#8220;<a title="Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity" href="http://iit-iti.nrc-cnrc.gc.ca/publications/nrc-48727_e.html">Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity</a>,&#8221; for a step they call &#8220;Generating Gazettes.&#8221; I was pleasantly surprised when I came across this paper and realized how similar their technique is to the one outlined above. So instead of trying to explain the technique in detail, I am just linking to the paper so that you can read it there.</p>
<p>When I did some searching for this post, I came across a couple of papers on this very topic; they are listed below. I have not gone through them myself.</p>
<p>1. <a title="Bayesian Sets" href="http://citeseer.ist.psu.edu/737792.html">Bayesian Sets</a><br />
by Z Ghahramani, KA Heller</p>
<p>Inspired by &#8220;Google Sets&#8221;, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with&#8230;</p>
<p>2. <a title="Related Word and Phrase Set Generation" href="http://www.google.com/url?sa=t&amp;ct=res&amp;cd=4&amp;url=http%3A%2F%2Fwww.d.umn.edu%2F~tpederse%2FCourses%2FCS5761-SPR04%2FProjects%2Fbrad0250.pdf&amp;ei=prjeR6TZMIHy7AOPmqTCCg&amp;usg=AFQjCNEms6FcKWLx_MMbA2-9re56m-jx2A&amp;sig2=JIEhYfAz4M4NypBWeVHdYA">Related Word and Phrase Set Generation</a><br />
by Sam Bradley<br />
Problem: To develop a method that will take a small set of related words and use results from Google to find a larger set of words that are also related to the original set in a similar way.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/03/21/fun-with-google-sets/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tag Mirror</title>
		<link>http://www.grok.in/blog/2007/08/27/tag-mirror/</link>
		<comments>http://www.grok.in/blog/2007/08/27/tag-mirror/#comments</comments>
		<pubDate>Mon, 27 Aug 2007 18:15:09 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Information Extraction]]></category>
		<category><![CDATA[tagging]]></category>

		<guid isPermaLink="false">http://www.grok.in/blog/2007/08/27/tag-mirror/</guid>
		<description><![CDATA[LibraryThing (an online service to help people catalogue their books easily) recently launched a very useful feature that they call &#8220;Tag Mirror&#8221;. This is one of the more interesting things that has been done with tags. In fact, I would wager that this is one of the best things to happen to tagging since tag [...]]]></description>
			<content:encoded><![CDATA[<p align="left"><a href="http://www.librarything.com/" title="LibraryThing.com">LibraryThing</a> (an online service to help people catalogue their books easily) recently launched a very useful feature that they call &#8220;<a href="http://www.librarything.com/blog/2007/08/tag-mirror-see-your-books-way-others-do.php" title="LibraryThing Blog (Tag Mirror: See your books the way others do)">Tag Mirror</a>&#8221;. This is one of the more interesting things that has been done with tags. In fact, I would wager that this is one of the best things to happen to tagging since tag clouds came along.</p>
<blockquote><p>Tag Mirror &#8220;holds a mirror&#8221; up to your books and to you. Instead of showing what <span style="font-style: italic">you</span> think about your books&#8212;what a regular tag cloud shows&#8212;it shows you what <span style="font-style: italic">others</span> think of them.</p></blockquote>
<p>Compare my <a href="http://www.librarything.com/tagcloud.php?view=sids" title="sids's tag cloud | LibraryThing">tag cloud</a> with my <a href="http://www.librarything.com/profile_tagmirror.php?view=sids" title="sids's tag mirror | LibraryThing">tag mirror</a> and you&#8217;ll instantly see just how useful this is. My tag cloud shows my perspective on the books that I have, while the tag mirror shows the world&#8217;s (that is, the LibraryThing community&#8217;s) perspective on the same books.</p>
<p>Take all the entities that you have tagged (in this case, books), pull in the tags that others have used for them and you have your tag mirror. Simple. Yet very powerful. This seems like such an obvious thing to do that it is almost surprising that no one has done it before! (Or am I plain ignorant?)</p>
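<p>The recipe is short enough to sketch in a few lines. The representation below (a flat collection of <code>(user, item, tag)</code> triples) and the function names <code>tag_cloud</code> and <code>tag_mirror</code> are assumptions for illustration; I have no idea how LibraryThing actually stores or computes this.</p>

```python
from collections import Counter


def tag_cloud(user, taggings):
    """Tags the user applied themselves, with how often each was used."""
    return Counter(tag for who, _, tag in taggings if who == user)


def tag_mirror(user, taggings):
    """Tags that *other* users applied to the items this user has tagged.

    taggings is a collection of (user, item, tag) triples (hypothetical format).
    """
    taggings = list(taggings)
    my_items = {item for who, item, _ in taggings if who == user}
    return Counter(
        tag
        for who, item, tag in taggings
        if who != user and item in my_items
    )
```

<p>Both functions return a <code>Counter</code>, so the counts can be used directly to size the words in a cloud. Books the user has never tagged contribute nothing to the mirror, however popular their tags are with everyone else.</p>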
<p>Additional notes:</p>
<ol>
<li>Check out <a href="http://www.librarything.com/thingology/" title="Thingology (LibraryThing's ideas blog)">Thingology</a> &#8212; LibraryThing&#8217;s ideas blog, on the philosophy and methods of tags, libraries and suchnot.</li>
<li>LibraryThing has done some other fun things with tags (and other data in general). Their &#8220;tag merge&#8221; feature allows their users to group different tags that are in fact not that different. The usefulness is obvious.</li>
<li>LibraryThing is a rather cool site.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2007/08/27/tag-mirror/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Interesting Papers on Web Spam at AIRWeb 2007</title>
		<link>http://www.grok.in/blog/2007/04/16/interesting-papers-on-web-spam-at-airweb-2007/</link>
		<comments>http://www.grok.in/blog/2007/04/16/interesting-papers-on-web-spam-at-airweb-2007/#comments</comments>
		<pubDate>Mon, 16 Apr 2007 15:17:12 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Information Extraction]]></category>
		<category><![CDATA[papers]]></category>
		<category><![CDATA[webspam]]></category>

		<guid isPermaLink="false">http://www.grok.in/blog/2007/04/16/interesting-papers-on-web-spam-at-airweb-2007/</guid>
		<description><![CDATA[AIRWeb (Adversarial Information Retrieval on the Web) is a workshop on IR in the world of Web Spam. From the call for papers page:
Adversarial Information Retrieval addresses tasks such as gathering, indexing, filtering, retrieving and ranking information from collections wherein a subset has been manipulated maliciously. On the Web, the predominant form of such manipulation is [...]]]></description>
			<content:encoded><![CDATA[<p>AIRWeb (Adversarial Information Retrieval on the Web) is a workshop on IR in the world of Web Spam. From the <a href="http://airweb.cse.lehigh.edu/2007/cfp.html">call for papers page</a>:</p>
<blockquote><p>Adversarial Information Retrieval addresses tasks such as gathering, indexing, filtering, retrieving and ranking information from collections wherein a subset has been manipulated maliciously. On the Web, the predominant form of such manipulation is &#8220;search engine spamming&#8221; or spamdexing, i.e., malicious attempts to influence the outcome of ranking algorithms, aimed at getting an undeserved high ranking for some items in the collection. There is an economic incentive to rank higher in search engines, considering that a good ranking on them is strongly correlated with more traffic, which often translates to more revenue.</p></blockquote>
<p>The list of papers to be presented has been released, and there are some very interesting ones. Search Engine Land has posted a <a href="http://searchengineland.com/070415-133537.php">nice listing</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2007/04/16/interesting-papers-on-web-spam-at-airweb-2007/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
