More data usually beats better algorithms?

Anand Rajaraman, who teaches a class on Machine Learning at Stanford, recently wrote an interesting blog post claiming that more data usually beats better algorithms. The post makes for an interesting read, as does the plethora of comments on it. He made a follow-up post, which is equally interesting.

I agree with a good number of the points he brings up, but at the same time believe that such a blanket statement is not warranted. I believe that adding more data to a given algorithm does yield better results, especially if the new data is independent and the algorithm is capable of utilizing it appropriately. But to say that more data is more important than better algorithms most of the time is far-fetched.

Such generalizations usually fall flat in the absence of the unstated assumptions under which they were made. In the case of Dr. Rajaraman’s post, these assumptions include (but are not limited to): the data that is already available is not representative enough, so that more data could add value; the algorithm is capable of utilizing the additional data; and the additional data is as good as, or even better than, the existing data.

There is data. There is information. There is a small ‘semantic’ difference between the two. Data has an ‘informational value’, which can be described as the information it brings to the system. Additional data that does not add any information to the system is more often than not useless. So, more data only helps if it adds to the information in the system. Even where data is available that could add information to the system, the value of that information needs to be considered. If a lot of additional data adds only marginal value, it might not be worth using: processing the additional data requires additional resources in terms of time and machine power, and in machine learning applications we are usually trying to optimize resource utilization.
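To make that distinction concrete, here is a toy sketch (mine, not from Dr. Rajaraman’s post): if the “more data” merely repeats what is already there, the empirical distribution, and hence its Shannon entropy, does not change at all, even though the processing cost grows tenfold.

```python
from collections import Counter
from math import log2

def empirical_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

original = ["a", "a", "b", "c"]
# Ten times "more data" that only duplicates what we already have...
redundant = original * 10

print(empirical_entropy(original))   # 1.5 bits
print(empirical_entropy(redundant))  # still 1.5 bits: no information added
```

More data helped only if it shifted or sharpened that distribution; here it did neither, so the extra volume buys nothing but processing cost.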

Most simplistic algorithms need to be modified before they can utilize more than one independent set of data. So in such cases, the algorithm is being improved as well as more data being added. This is the case, for example, with Google’s PageRank: Google decided to use hyperlink information for ranking web pages, but the existing ranking algorithms could not utilize that information, so Google improved on them to come up with PageRank. Now, this very example could be stated another way: Google came up with a better ranking algorithm that could utilize the social citations of web pages to rank them; this new algorithm needed new data, and hyperlink information happened to be that data. It might just as well have been everyone’s bookmarks, had cloud computing been invented before the Internet (sic).
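To show what the “better algorithm” half of that story looks like, here is a minimal power-iteration sketch of PageRank. The graph, damping factor, and iteration count are illustrative choices of mine, not Google’s production setup; the point is only that the algorithm exists specifically to exploit hyperlink data.

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank on a dict mapping page -> list of outbound links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page keeps a small "teleport" share of rank...
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # ...and passes the rest along its outbound links.
                share = rank[p] / len(outs)
                for q in outs:
                    new[q] += damping * share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

# Tiny toy web: two pages cite "hub", so "hub" comes out on top.
graph = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
ranks = pagerank(graph)
```

On this toy graph, `hub` ranks highest (two inbound links) and `a` beats `b` (one inbound link versus none), which is exactly the citation signal a link-blind ranking algorithm would be unable to use.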

At the end of the day, whether more/better data or a better algorithm matters more is highly dependent on the particular application at hand.

This entry was posted in Machine Learning.


  1. Harish
    Posted April 19, 2008 at 2:41 pm | Permalink

    hey sid,
    stumbled upon your blog.. interesting post.
    the importance of data over the algorithm is a useful concept even in the realms of speech engineering (coding, recognition and synthesis), in order to capture more variability. Of course, one might feel that a parametric model (like HMMs) can circumvent the need for a lot of data, but then the naturalness (which is very important in recognition/synthesis paradigms) is lost. But I suppose the resources you’re dealing with have a significant say in what model you adopt. So yes, it depends on the application whether to choose statistical or parametric models.

  2. Posted April 21, 2008 at 8:49 pm | Permalink

    Thanks for the comment, Harish. You are absolutely right about the importance of good data in Speech Engineering. In fact, I think this is a common theme in many, if not most, fields of Computer Science. Specifically in Speech Engineering, Google provides some very good support for the idea: Google runs a free 411 (directory inquiry) service in the US that works using speech recognition, and they have gone on record to say that the primary purpose of this service is to gather better data for their speech recognition algorithms.

