Anand Rajaraman, who teaches a class on Machine Learning at Stanford, recently wrote an interesting blog post in which he claimed that more data usually beats better algorithms. The post makes for an interesting read, as do the plethora of comments on it. He made a follow-up post, which is equally interesting.
I do agree with a good number of the points he brings up, but at the same time believe that such a blanket statement is not warranted. I believe that adding more data to a given algorithm does give better results, especially if the new data is independent and the algorithm is capable of utilizing such data appropriately. But to say that more data is more important than better algorithms most of the time is far-fetched.
Such generalizations usually fall flat in the absence of the unstated assumptions under which they have been made. In the case of Dr. Rajaraman’s post, these assumptions include (but are not limited to): the data already available is not representative enough, so more data could add value; the algorithm is capable of utilizing the additional data; and the additional data is as good as, or even better than, the existing data.
There is data. There is information. There is a small ‘semantic’ difference between the two. Data has an ‘informational value’, which can be described as the information it brings to the system. Additional data that does not add any information to the system is more often than not useless. So, more data only helps if it adds to the information in the system. Even where there is data available that can add information to the system, the value of that information needs to be considered. If a lot of additional data adds only marginal value, it might not be worth using that data. That is because processing the additional data requires additional resources in terms of time and machine power, and in machine learning applications we are usually trying to optimize resource utilization.
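To make the data-versus-information distinction concrete, here is a small sketch (my own illustrative setup, not from the post): estimating an unknown mean from noisy samples. Duplicating the existing data makes the dataset ten times larger without adding any information, while independent fresh samples actually improve the estimate.

```python
import random
import statistics

random.seed(0)

# The unknown quantity we are trying to estimate from noisy observations.
TRUE_MEAN = 5.0

def sample(n):
    """Return n independent noisy observations of TRUE_MEAN."""
    return [random.gauss(TRUE_MEAN, 2.0) for _ in range(n)]

base = sample(100)

# "More data" as verbatim duplicates: 1000 points, zero added information.
duplicated = base * 10
# "More data" as fresh, independent samples: 1000 points, 900 of them new.
fresh = base + sample(900)

err_base = abs(statistics.mean(base) - TRUE_MEAN)
err_dup = abs(statistics.mean(duplicated) - TRUE_MEAN)
err_fresh = abs(statistics.mean(fresh) - TRUE_MEAN)

# Duplicates leave the estimate exactly where it was; independent
# samples typically tighten it.
assert err_dup == err_base
print(err_base, err_dup, err_fresh)
```

The point of the sketch is that dataset size alone is the wrong metric: only the independent portion of the data carries informational value, and past a certain point even independent samples yield diminishing returns relative to their processing cost.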
Most simplistic algorithms need to be modified before they can utilize more than one independent set of data. So in this case, the algorithm is being improved as well as more data being added. This is the case, for example, with Google’s PageRank: Google decided to use the hyperlink information for ranking web pages, but the existing ranking algorithms could not utilize that information, so they improved on them to come up with PageRank. Now, this very example could have been stated another way: Google came up with a better ranking algorithm that could utilize the social citations of web pages to rank them; this new algorithm needed new data, and hyperlink information happened to be that data. It might as well have been the bookmarks of all the people, had cloud computing been invented before the Internet (sic).
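To show what "an algorithm built to utilize hyperlink data" looks like, here is a minimal power-iteration sketch of the PageRank idea. The tiny graph and the damping factor are illustrative assumptions of mine, not Google's actual setup; the only input the algorithm consumes is the link structure itself.

```python
# Damping factor: the probability of following a link rather than
# jumping to a random page (0.85 is the commonly cited value).
DAMPING = 0.85

def pagerank(links, iterations=50):
    """Compute ranks by power iteration.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Base rank every page gets from random jumps.
        new = {p: (1.0 - DAMPING) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                # Dangling page: spread its rank over all pages.
                for q in pages:
                    new[q] += DAMPING * rank[p] / n
            else:
                # Each page splits its rank evenly among its out-links.
                for q in outs:
                    new[q] += DAMPING * rank[p] / len(outs)
        rank = new
    return rank

# Tiny illustrative web: b, c and d all link to a, so a ranks highest.
web = {"a": ["b"], "b": ["a"], "c": ["a"], "d": ["a"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # prints "a"
```

The sketch makes the post's point visible: the ranking quality comes from the interplay of the new data (the link graph) and an algorithm designed to consume it; neither alone suffices.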
At the end of the day, whether more (or better) data matters more than a better algorithm depends heavily on the particular application at hand.