<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The art of Information Engineering &#187; Machine Learning</title>
	<atom:link href="http://www.grok.in/blog/cats/machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.grok.in</link>
	<description>(ignorance killed the cat, curiosity was framed)</description>
	<lastBuildDate>Tue, 11 Aug 2009 06:30:18 +0000</lastBuildDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>More data usually beats better algorithms?</title>
		<link>http://www.grok.in/blog/2008/04/14/more-data-usually-beats-better-algorithms/</link>
		<comments>http://www.grok.in/blog/2008/04/14/more-data-usually-beats-better-algorithms/#comments</comments>
		<pubDate>Mon, 14 Apr 2008 12:13:41 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[data]]></category>

		<guid isPermaLink="false">http://www.grok.in/?p=44</guid>
		<description><![CDATA[Anand Rajaraman, who teaches a class on Machine Learning at Stanford, recently wrote an interesting blog post: More data usually beats better algorithms, he claimed. The post makes for an interesting read and so do the plethora of comments on it. He made a follow-up post, which is equally interesting.
I do agree with [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www-db.stanford.edu/%7Eanand/">Anand Rajaraman</a>, who teaches a class on Machine Learning at Stanford, recently wrote an interesting blog post: <a title="More data usually beats better algorithms" href="http://anand.typepad.com/datawocky/2008/03/more-data-usual.html">More data usually beats better algorithms</a>, he claimed. The post makes for an interesting read, and so does the plethora of comments on it. He made a <a href="http://anand.typepad.com/datawocky/2008/04/data-versus-alg.html">follow-up post</a>, which is equally interesting.</p>
<p>I do agree with a good number of the points he brings up, but at the same time I believe that such a blanket statement is not warranted. Adding more data to a given algorithm does give better results, especially if the new data is independent and the algorithm is capable of utilizing such data appropriately. But to say that more data is more important than better algorithms most of the time is far-fetched.</p>
<p><span id="more-44"></span>Such generalizations usually fall flat once they are separated from the unstated assumptions under which they were made. In the case of Dr. Rajaraman&#8217;s post, these assumptions include (but are not limited to): the data that is already available is not representative enough, so more data could add value; the algorithm is capable of utilizing the additional data; and the additional data is as good as, or even better than, the existing data.</p>
<p>There is data. There is information. There is a small <a title="Differences between data and information" href="http://jmcsweeney.co.uk/computing/m150/differences.php">&#8216;semantic&#8217; difference</a> between the two. Data has an &#8216;informational value&#8217;, which can be described as the information it brings to the system. Additional data that does not add any information to the system is, more often than not, useless. So, more data only helps if it adds to the information in the system. Even where data is available that can add information to the system, the value of that information needs to be considered. If a lot of additional data adds only marginal value, it might not be worth using that data. That is because processing the additional data requires additional resources in terms of time and machine power, and in machine learning applications we are usually trying to optimize resource utilization.</p>
<p>Most simplistic algorithms need to be modified before they can utilize more than one independent set of data. So in this case, the algorithm is being improved as well as more data being added. This is the case, for example, with Google&#8217;s PageRank: Google decided to use hyperlink information for ranking web pages, but the existing ranking algorithms could not utilize that information, so they improved on them to come up with PageRank. Now, this very example could have been stated in another way: Google came up with a better ranking algorithm that could utilize the social citations of web pages to rank them. This new algorithm needed new data, and hyperlink information happened to be that data; it might as well have been the bookmarks of all the people, if cloud computing had been invented before the Internet (sic).</p>
<p>At the end of the day, whether more (or better) data or a better algorithm matters more depends heavily on the particular application at hand.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/04/14/more-data-usually-beats-better-algorithms/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Machine Learning: Classification</title>
		<link>http://www.grok.in/blog/2008/03/27/machine-learning-classification/</link>
		<comments>http://www.grok.in/blog/2008/03/27/machine-learning-classification/#comments</comments>
		<pubDate>Thu, 27 Mar 2008 17:00:56 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[classification]]></category>

		<guid isPermaLink="false">http://www.grok.in/blog/2008/03/27/machine-learning-classification/</guid>
		<description><![CDATA[Machine Learning is a branch of Computer Science that is concerned with designing systems that can learn from the provided input. Usually the systems are designed to use this learned knowledge to better process similar input in the future. Machine learning can be considered as a subfield of Artificial Intelligence.
A very familiar example is the [...]]]></description>
			<content:encoded><![CDATA[<p>Machine Learning is a branch of Computer Science that is concerned with designing systems that can <em>learn</em> from the provided input. Usually the systems are designed to use this learned knowledge to better process similar input in the future. Machine learning can be considered as a subfield of Artificial Intelligence.</p>
<p>A very familiar example is the email spam-catching system: given a set of emails marked as spam and not-spam, it learns the characteristics of spam emails and is then able to process future email messages to mark them as spam or not-spam.</p>
<p><em><strong><a title="Machine Learning: Classification - Notes" href="http://www.grok.in/notes/machine-learning-classification/">Read on&#8230;</a></strong></em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/03/27/machine-learning-classification/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Fun with Google Sets</title>
		<link>http://www.grok.in/blog/2008/03/21/fun-with-google-sets/</link>
		<comments>http://www.grok.in/blog/2008/03/21/fun-with-google-sets/#comments</comments>
		<pubDate>Fri, 21 Mar 2008 15:01:03 +0000</pubDate>
		<dc:creator>Siddhartha Reddy</dc:creator>
				<category><![CDATA[Information Extraction]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[sets]]></category>

		<guid isPermaLink="false">http://www.grok.in/blog/2008/03/21/fun-with-google-sets/</guid>
		<description><![CDATA[Google Sets is a real fun experiment from the Google Labs. It basically allows you to &#8220;automatically create sets of items from a few examples.&#8221; So you can enter &#8220;Sachin Tendulkar&#8221;, &#8220;Rahul Dravid&#8221; and &#8220;Sourav Ganguly,&#8221; and, be presented with a much larger set of the players of the Indian Cricket Team. Or enter &#8220;Athens&#8221;, [...]]]></description>
			<content:encoded><![CDATA[<p><a title="Google Sets" href="http://labs.google.com/sets">Google Sets</a> is a really fun experiment from <a title="Google Labs" href="http://labs.google.com/">Google Labs</a>. It basically allows you to &#8220;automatically create sets of items from a few examples.&#8221; So you can enter &#8220;Sachin Tendulkar&#8221;, &#8220;Rahul Dravid&#8221; and &#8220;Sourav Ganguly&#8221; and be presented with a much larger set of the players of the Indian Cricket Team. Or enter &#8220;Athens&#8221;, &#8220;Sydney&#8221;, &#8220;Atlanta&#8221;, &#8220;Barcelona&#8221; and &#8220;Tokyo&#8221; to get a much larger set of the cities that have hosted (or will be hosting) the Olympics. Neat.</p>
<p><span id="more-29"></span>If you have not come across this before, let me warn you: this can cause you to waste a whole lot of time and have a whole lot of fun (how many sex positions do you know? <img src='http://www.grok.in/wp/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />). But I remember wasting a lot of time in another way when Google Sets was first unveiled: discussing/arguing/fighting with my friends about how it was done (nah, not really &#8212; we didn&#8217;t fight over it, I just made that up. But it would have been fun to fight <em>in</em> Google Sets &#8212; instead of hitting your opponent directly, you would enter your moves in Google Sets, which would then expand them to a much larger set and throw them all at your opponent!). Of course we did search for existing literature on the subject, but that did not turn up anything interesting. We did come up with some pretty clever techniques ourselves, most of which I do not remember.</p>
<p>One technique I do remember is the one which I remember thinking of as the most practical. The basic idea is to gather words/phrases that co-occur (i.e. occur together in the same document/context) to form sets. But simple co-occurrence won&#8217;t do &#8212; it will pollute a set with words that are extremely relevant to the context of the set but do not belong to it. In the above examples, &#8220;Cricket&#8221; and &#8220;Olympics&#8221; could become part of those two sets respectively. It is quite easy to see why this could happen: &#8220;Sachin Tendulkar&#8221; would co-occur with &#8220;Cricket&#8221; much more often than with any of his teammates; this is true for almost all the team members.</p>
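<p>The pollution problem can be seen in a minimal sketch. The toy corpus and the function name <em>cooccurrence_counts</em> below are made up purely for illustration; this is not how Google Sets is known to work.</p>

```python
from collections import Counter

# Hypothetical toy corpus: each document is the list of phrases found in it.
documents = [
    ["Sachin Tendulkar", "Cricket", "Rahul Dravid"],
    ["Sachin Tendulkar", "Cricket", "Sourav Ganguly"],
    ["Rahul Dravid", "Cricket"],
    ["Sourav Ganguly", "Cricket", "Sachin Tendulkar"],
]

def cooccurrence_counts(seed, docs):
    """Count how often every other phrase shares a document with the seed."""
    counts = Counter()
    for doc in docs:
        phrases = set(doc)
        if seed in phrases:
            counts.update(phrases - {seed})
    return counts

counts = cooccurrence_counts("Sachin Tendulkar", documents)
top_phrase, top_count = counts.most_common(1)[0]
print(top_phrase, top_count)  # Cricket 3
```

<p>In this toy corpus the context word &#8220;Cricket&#8221; outranks every teammate, which is exactly the kind of entry we do not want in the set.</p>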
<p>Because simple co-occurrence will not suffice, we can get more specific: co-occurrence within a list. It could either be an HTML list (created with the <em>ul</em>/<em>ol</em> and <em>li</em> HTML elements) or a list specified in plain English (a sentence with the words/phrases separated by commas). By observing such lists across several web pages, we can construct sets of different kinds.</p>
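<p>A rough sketch of the list-based idea follows. It assumes the lists have already been extracted from web pages; the sample lists, the function name <em>expand_set</em> and the <em>min_overlap</em> threshold are all assumptions made for illustration.</p>

```python
from collections import Counter

# Hypothetical lists already extracted from web pages (HTML lists or
# comma-separated sentences); the extraction step itself is not shown.
extracted_lists = [
    ["Sachin Tendulkar", "Rahul Dravid", "Sourav Ganguly", "Anil Kumble"],
    ["Rahul Dravid", "Sourav Ganguly", "VVS Laxman"],
    ["Athens", "Sydney", "Atlanta", "Barcelona"],
    ["Sydney", "Atlanta", "Tokyo", "Beijing"],
    ["Sachin Tendulkar", "Anil Kumble", "VVS Laxman"],
]

def expand_set(seeds, lists, min_overlap=2):
    """Rank phrases that share a list with at least min_overlap seed items."""
    seeds = set(seeds)
    scores = Counter()
    for items in lists:
        members = set(items)
        overlap = len(seeds.intersection(members))
        if overlap >= min_overlap:
            # Every non-seed item in a matching list is a candidate member.
            for candidate in members - seeds:
                scores[candidate] += overlap
    return [phrase for phrase, _ in scores.most_common()]

expanded = expand_set(
    ["Sachin Tendulkar", "Rahul Dravid", "Sourav Ganguly"], extracted_lists
)
print(expanded)  # ['Anil Kumble', 'VVS Laxman']
```

<p>Because &#8220;Cricket&#8221; rarely appears as an item in a list of players, it does not show up among the candidates here, and requiring an overlap of at least two seeds keeps loosely related lists out.</p>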
<p>Although this technique might work, it is limited by the fact that we are using only a fraction of the web pages: those which contain a list. We can generalize this technique a little to be able to use a lot more web pages and hence come up with more, and more comprehensive, sets: we can form sets out of words/phrases that co-occur within a document with similar HTML markup around them. This generalization will allow us to bring in a lot more data that we can use to create the sets.</p>
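<p>The generalized version could be sketched as follows, assuming each phrase on a page has already been paired with the path of HTML elements enclosing it. The sample page, the path notation and the function name <em>candidate_sets</em> are hypothetical.</p>

```python
from collections import defaultdict

# Hypothetical (markup_path, phrase) pairs extracted from a single page.
# The path is the chain of HTML elements that encloses the phrase.
page = [
    ("h1", "Olympic host cities"),
    ("table/tr/td/b", "Athens"),
    ("table/tr/td/b", "Sydney"),
    ("table/tr/td/b", "Atlanta"),
    ("p", "Olympics"),
    ("p/a", "Barcelona"),
]

def candidate_sets(tagged_phrases, min_size=2):
    """Group phrases that appear under identical markup within one page."""
    groups = defaultdict(list)
    for path, phrase in tagged_phrases:
        groups[path].append(phrase)
    # Only groups with several members are plausible sets.
    return [phrases for phrases in groups.values() if len(phrases) >= min_size]

sets_found = candidate_sets(page)
print(sets_found)  # [['Athens', 'Sydney', 'Atlanta']]
```

<p>Exact path equality is the simplest possible notion of &#8220;similar markup&#8221;; a real system would probably need something more forgiving.</p>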
<p>The technique outlined above is very similar to the one used by David Nadeau et al. in &#8220;<a title="Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity" href="http://iit-iti.nrc-cnrc.gc.ca/publications/nrc-48727_e.html">Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity</a>,&#8221; for a step they call &#8220;Generating Gazettes.&#8221; I was pleasantly surprised when I came across this paper and realized how similar their technique is to this one. So instead of trying to explain the technique in detail, I am just linking to the paper so that you can read it there.</p>
<p>When I did some searching for this post, I came across a couple of papers on this very topic; they are listed below. I have not gone through them myself.</p>
<p>1. <a title="Bayesian Sets" href="http://citeseer.ist.psu.edu/737792.html">Bayesian Sets</a><br />
by: Z Ghahramani, KA Heller</p>
<p>Inspired by &#8220;Google Sets&#8221;, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a model-based concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with&#8230;</p>
<p>2. <a title="Related Word and Phrase Set Generation" href="http://www.google.com/url?sa=t&amp;ct=res&amp;cd=4&amp;url=http%3A%2F%2Fwww.d.umn.edu%2F~tpederse%2FCourses%2FCS5761-SPR04%2FProjects%2Fbrad0250.pdf&amp;ei=prjeR6TZMIHy7AOPmqTCCg&amp;usg=AFQjCNEms6FcKWLx_MMbA2-9re56m-jx2A&amp;sig2=JIEhYfAz4M4NypBWeVHdYA">Related Word and Phrase Set Generation</a><br />
by Sam Bradley<br />
Problem: To develop a method that will take a small set of related words and use results from Google to find a larger set of words that are also related to the original set in a similar way.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.grok.in/blog/2008/03/21/fun-with-google-sets/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
