Machine learning is a branch of computer science concerned with designing systems that can learn from the input they are given. Such systems usually use what they have learned to better process similar input in the future. Machine learning can be considered a subfield of artificial intelligence.
A very familiar example is the email spam-catching system: given a set of emails marked as spam and not-spam, it learns the characteristics of spam emails and is then able to process future email messages to mark them as spam or not-spam.
3 Comments
But what is the training data in this case? Words? Sets of words?
@sanket,
The training data is made up of entities that need to be classified. So, in the case of the spam classifier, the training data is emails labeled as ‘spam’ and ‘not-spam’.
Words/sets-of-words are used as ‘features’ for the classification process. What this means is that the classification is made based on the occurrence of words.
Feature selection is an important and usually non-trivial step in classification. I’ll be posting about this later on.
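To make "classification based on the occurrence of words" a bit more concrete, here is a minimal sketch in Python. It uses a simplified naive-Bayes-style score, which is one common choice and not necessarily what any particular spam filter does. The function names `train` and `classify`, the whitespace tokenization, and the four toy emails are all assumptions made up for illustration.

```python
import math
from collections import Counter

def train(labeled_emails):
    """Count, per label, how many emails contain each word.

    labeled_emails: list of (text, label) pairs, label is 'spam' or 'not-spam'.
    """
    doc_counts = Counter()                      # emails seen per label
    word_counts = {}                            # label -> Counter of words
    for text, label in labeled_emails:
        doc_counts[label] += 1
        # set() so a word counts once per email (occurrence, not frequency)
        word_counts.setdefault(label, Counter()).update(set(text.lower().split()))
    return doc_counts, word_counts

def classify(text, model):
    """Return the label whose training emails best match the words in text."""
    doc_counts, word_counts = model
    total = sum(doc_counts.values())
    words = set(text.lower().split())
    scores = {}
    for label, n_docs in doc_counts.items():
        # Start from the log prior: how common this label was in training.
        score = math.log(n_docs / total)
        for word in words:
            # Add-one smoothing so an unseen word does not zero out the score.
            p = (word_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training set (made up for this example).
emails = [
    ("buy cheap viagra now", "spam"),
    ("cheap pills buy now", "spam"),
    ("meeting at noon tomorrow", "not-spam"),
    ("lunch tomorrow at noon", "not-spam"),
]
model = train(emails)
print(classify("cheap viagra", model))
print(classify("meeting tomorrow", model))
```

Note that this sketch only scores the words that are present in the message. A fuller Bernoulli naive Bayes model would also account for vocabulary words that are absent.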
In the particular case of spam detectors, one very common input data representation is called “bag of words”, which is just a list of the distinct words in the message, or a 0/1 vector representing the same. Some pre-processing will likely precede the construction of a bag of words, such as removal of very common words (“the”, “of”), stemming (“walking”, “walks” and “walked” become “walk”) and selection of words with high predictive value (“Viagra”).
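The pre-processing and bag-of-words steps described above could be sketched roughly as follows. Everything named here is an assumption for illustration: the tiny `STOP_WORDS` set, the `crude_stem` suffix stripper (a stand-in for a real stemmer such as Porter's, and far less accurate), and the three-word vocabulary.

```python
import re

# Hypothetical stop-word list; real systems use a much longer one.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "is"}

def crude_stem(word):
    """Strip a few common English suffixes, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(message):
    """Return the set of distinct stemmed words, with stop words removed."""
    tokens = re.findall(r"[a-z]+", message.lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}

def to_vector(bag, vocabulary):
    """Encode a bag as a 0/1 vector over a fixed, ordered vocabulary."""
    return [1 if word in bag else 0 for word in vocabulary]

bag = bag_of_words("Walking the walks of Viagra")
vector = to_vector(bag, ["free", "viagra", "walk"])
print(sorted(bag))   # ['viagra', 'walk']
print(vector)        # [0, 1, 1]
```

Choosing the vocabulary passed to `to_vector` is where the selection of highly predictive words would come in; here it is simply hard-coded.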