CS340  Lab 7 - Naive Bayes Learning

1. We are going to use Naive Bayes to write a simple spam filter.  We will start by using the Natural Language Toolkit (NLTK) and Python.  More than you need to know can be found at www.nltk.org/book; chapter 6 deals specifically with text classification.

2. Copy the file ~jillz/cs340/spamClassifier.py and the directories ~jillz/cs340/spam and ~jillz/cs340/ham.  The spam classifier uses the NaiveBayesClassifier that is already provided by nltk.  The "feature set" that is created consists of the most common words found in the spam and ham documents.  The directories spam and ham contain emails that have already been labeled as either spam or ham.  We split those files into a training set and a test set.  Once we train the learner on the training set we will see how well it performs on the test set.

3. Try out the classifier by starting python and then, at the interpreter prompt, importing the file with the command
from spamClassifier import *

4. You will get the accuracy on the test set as well as the features that were most informative in determining whether a test file was spam or ham.

5. Now it is time to work out the details that are under the hood!  Copy the incomplete Python file ~jillz/cs340/myClassifier.py.  We are defining a class, and I have already provided the class constructor, which collects the featureWords from both the spam and ham directories and creates a list of the 2000 most frequent words.  The constructor also breaks up the documents into randomly selected training documents and test documents.

6. Write the train method.  This method needs to compute the following probabilities:

• The probability of a document being spam in the training set

• The probability of a document being ham in the training set

• For each word in featureWords, the frequency of that word occurring in a spam document and the frequency of that word occurring in a ham document.  This will allow you to determine, for a word w, P(w|spam) and P(w|ham).
I recommend that you store these values in a dictionary with the word being the key and the frequency being the value.
We don't want a probability being zero in our multiplication later on so if you get a zero frequency, set the probability to be a suitably small epsilon.  (What would be a reasonable value to choose?)
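The quantities in step 6 can be sketched on toy data as follows.  This is only an illustration under stated assumptions, not the layout of myClassifier.py: the function name train_sketch, the representation of each document as a (set of words, label) pair, the reading of "frequency" as the fraction of documents of a class that contain the word, and the epsilon value 1e-6 are all hypothetical choices made for the example.

```python
from collections import Counter

def train_sketch(training_docs, feature_words, epsilon=1e-6):
    """training_docs: list of (set_of_words, label) pairs, label "spam" or "ham".

    Hypothetical helper for illustration; the lab's class stores its data differently.
    """
    # Class priors: fraction of training documents carrying each label.
    label_counts = Counter(label for _, label in training_docs)
    total = len(training_docs)
    priors = {label: label_counts[label] / total for label in ("spam", "ham")}

    # P(w | label): fraction of documents with that label that contain w.
    likelihoods = {"spam": {}, "ham": {}}
    for label in ("spam", "ham"):
        docs = [words for words, doc_label in training_docs if doc_label == label]
        for w in feature_words:
            containing = sum(1 for words in docs if w in words)
            p = containing / len(docs) if docs else 0.0
            # Replace a zero estimate with epsilon so later products never vanish.
            likelihoods[label][w] = p if p > 0 else epsilon
    return priors, likelihoods

toy_docs = [
    ({"free", "money", "now"}, "spam"),
    ({"free", "offer"}, "spam"),
    ({"meeting", "now"}, "ham"),
    ({"lunch", "meeting"}, "ham"),
]
priors, likelihoods = train_sketch(toy_docs, ["free", "meeting", "now"])
print(priors)
print(likelihoods["spam"])
```

In this toy set "meeting" never appears in spam, so its spam probability falls back to epsilon instead of zero.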

7. Write the classify method, which for each document in the test set will calculate the following:

• Calculate the product P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam), where the wi's are the words in the document that are also in the featureWords.

• Do the same for ham

• Append to the classified list the pair (doc,"spam") or (doc,"ham"), depending on which of the two values is larger.
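A minimal sketch of the comparison in step 7, again on toy data.  The function name classify_sketch, the file names, and the hard-coded priors and likelihoods are hypothetical and stand in for whatever your train method stored.

```python
def classify_sketch(doc_words, priors, likelihoods, feature_words):
    """Return "spam" or "ham" for one document given as a set of words."""
    scores = {}
    for label in ("spam", "ham"):
        # Start from P(label) and multiply in P(w | label) for each
        # feature word that actually occurs in the document.
        score = priors[label]
        for w in feature_words:
            if w in doc_words:
                score *= likelihoods[label][w]
        scores[label] = score
    return "spam" if scores["spam"] > scores["ham"] else "ham"

# Hypothetical values of the kind a train step might produce.
priors = {"spam": 0.5, "ham": 0.5}
likelihoods = {
    "spam": {"free": 1.0, "meeting": 1e-6, "now": 0.5},
    "ham": {"free": 1e-6, "meeting": 1.0, "now": 0.5},
}
feature_words = ["free", "meeting", "now"]

# Hypothetical test documents: file name -> set of words in the file.
test_docs = {
    "msg1.txt": {"free", "now", "prize"},
    "msg2.txt": {"meeting", "tomorrow"},
}

classified = []
for doc, words in test_docs.items():
    classified.append((doc, classify_sketch(words, priors, likelihoods, feature_words)))
print(classified)
```

Words such as "prize" that are not in the feature list are simply ignored.  With many feature words the product of small probabilities can underflow to zero; summing the logarithms of the same terms gives the same comparison and avoids that problem.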

8. Write the accuracy method, which uses the classified list to determine the percentage of documents that were correctly classified.
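The accuracy computation can be sketched like this.  It assumes, hypothetically, that the true label of each test document is available in a dictionary keyed by file name; in your class the true labels may be stored some other way.

```python
def accuracy_sketch(classified, true_labels):
    """Percentage of (doc, predicted_label) pairs whose prediction is correct."""
    if not classified:
        return 0.0
    correct = sum(1 for doc, predicted in classified if true_labels[doc] == predicted)
    return 100.0 * correct / len(classified)

# Hypothetical predictions and ground truth for four test files.
classified = [
    ("msg1.txt", "spam"),
    ("msg2.txt", "ham"),
    ("msg3.txt", "spam"),
    ("msg4.txt", "ham"),
]
true_labels = {
    "msg1.txt": "spam",
    "msg2.txt": "ham",
    "msg3.txt": "ham",
    "msg4.txt": "ham",
}
result = accuracy_sketch(classified, true_labels)
print(result)
```

Here three of the four predictions match the true labels, so the sketch reports 75 percent.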

9. You can test your class by creating an object  c = mySpamClassifier("spam","ham")

10. You can then apply the training method, c.train()

11. You can then classify the test documents, c.classify()

12. Finally, check the accuracy of your classification, c.accuracy()