Machine Learning: Classification

Machine Learning is a branch of Computer Science concerned with designing systems that learn from the input they are given. Such systems typically use what they have learned to process similar input better in the future. Machine Learning is generally considered a subfield of Artificial Intelligence.

A very familiar example is the email spam-catching system: given a set of emails marked as spam and not-spam, it learns the characteristics of spam emails and is then able to process future email messages to mark them as spam or not-spam.
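To make the idea concrete, here is a minimal sketch of such a learner. It assumes a naive Bayes classifier with add-one smoothing, which is one common choice for spam filtering and not necessarily the method this post has in mind. The training emails and the function names (`tokenize`, `train`, `classify`) are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Simplest possible tokenizer: lowercase, split on whitespace.
    return text.lower().split()

def train(labeled_emails):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of emails
    for text, label in labeled_emails:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-score."""
    vocabulary = set().union(*word_counts.values())
    total_emails = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: fraction of training emails carrying this label.
        score = math.log(label_counts[label] / total_emails)
        words_in_label = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log(
                (word_counts[label][word] + 1)
                / (words_in_label + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set, invented for this example.
TRAINING = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "not-spam"),
    ("lunch on monday", "not-spam"),
]

model = train(TRAINING)
print(classify("buy cheap pills", *model))
print(classify("agenda for lunch", *model))
```

With only four training emails this is a toy, but it shows the two phases described above: learning word statistics from labeled examples, then using them to label a message the system has not seen before.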



3 Comments

  1. Posted March 30, 2008 at 12:11 am

    But what is the training data in this case? Words? Sets of words?

  2. Posted March 30, 2008 at 9:52 am

    @sanket,

    The training data is made up of entities that need to be classified. So, in the case of the spam classifier, the training data is emails labeled as ‘spam’ and ‘not-spam’.

    Words/sets-of-words are used as ‘features’ for the classification process. That is, the classifier makes its decision based on which words occur in an email.

    Feature selection is an important and usually non-trivial step in classification. I’ll be posting about this later on.

  3. Posted April 5, 2008 at 3:00 pm

    In the particular case of spam detectors, one very common input data representation is called “bag of words”, which is just the set of distinct words in the message, or a 0/1 vector representing the same. Some pre-processing will likely precede the construction of a bag of words, such as removal of very common words (“the”, “of”), stemming (“walking”, “walks” and “walked” become “walk”) and selection of words with high predictive value (“Viagra”).
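The pre-processing steps described in the last comment can be sketched roughly as follows. This is an illustration under stated assumptions, not a production pipeline: the stop-word list, the suffix-stripping rule in `crude_stem`, and all function names are hypothetical, and a real system would more likely use an established stemmer such as Porter's. Selection of words with high predictive value is left out, since it needs labeled data.

```python
import re

# Assumed, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "of", "a", "and", "to", "in", "is"}

# Suffixes removed by the crude stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(message):
    """Return the set of distinct stemmed words, minus stop words."""
    tokens = re.findall(r"[a-z]+", message.lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}

def to_vector(bag, vocabulary):
    """Encode a bag as a 0/1 vector over an ordered vocabulary."""
    return [1 if word in bag else 0 for word in vocabulary]

bag = bag_of_words("Walking, walks and walked")
print(bag)  # {'walk'}

vocabulary = ["price", "viagra", "walk"]
print(to_vector(bag_of_words("The price of Viagra"), vocabulary))  # [1, 1, 0]
```

The set and the 0/1 vector carry the same information; the vector form is convenient because every message maps to a fixed-length input regardless of how long the message is.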
