CS340 Lab 7 - Naive Bayes Learning
We are going to use Naive Bayes to write a simple spam
filter. We will start by using the Natural Language Toolkit (NLTK) and
Python. More than you need to know can be found at
www.nltk.org/book; chapter 6
deals specifically with text classification.
Copy the file ~jillz/cs340/spamClassifier.py and the
directories ~jillz/cs340/spam and ~jillz/cs340/ham. The spam
classifier uses the NaiveBayesClassifier that is already provided by nltk.
The "feature set" it builds consists of the most common words found
in the spam and ham documents. The directories spam and ham contain
emails that have already been labeled as either spam or ham. We
split those files into a training set and a test set. Once we train
the classifier on the training set, we will see how well it performs on the
test set.
Try out the classifier by starting Python and then, at the
interpreter prompt, importing the file with the command
from spamClassifier import *
You will get the accuracy on the test set as well as the
features that were most informative in determining whether a test file was
spam or ham.
Now it is time to work out the details that are under the
hood! Copy the incomplete python file ~jillz/cs340/myClassifier.py.
We are defining a class, and I have already provided the class constructor,
which collects the featureWords from both the spam and ham directories and
creates a list of the 2000 most frequent words. The constructor also
splits the documents into randomly selected training documents and test
documents.
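To make the constructor's job concrete, here is a minimal sketch of the two steps it performs: picking the most frequent words and splitting the documents at random. It is not the code in myClassifier.py. The function names, the 25% test fraction, the fixed seed, and the toy documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def most_frequent_words(documents, limit=2000):
    """Return up to `limit` of the most common words across tokenized documents."""
    counts = Counter(word for doc in documents for word in doc)
    return [word for word, _ in counts.most_common(limit)]


def split_documents(labeled_docs, test_fraction=0.25, seed=0):
    """Shuffle (words, label) pairs and return (training, test) lists.

    test_fraction and seed are hypothetical defaults, not values from the lab.
    """
    shuffled = list(labeled_docs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-ins for the contents of the spam and ham directories.
spam_docs = [["win", "cash", "now"], ["cash", "prize", "now"]]
ham_docs = [["lunch", "at", "noon"], ["meeting", "at", "noon", "now"]]

feature_words = most_frequent_words(spam_docs + ham_docs, limit=3)
labeled = [(d, "spam") for d in spam_docs] + [(d, "ham") for d in ham_docs]
train_docs, test_docs = split_documents(labeled)
```

With real data the limit would be 2000, as described above; the toy example uses 3 only so the result is easy to inspect.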
Write the train method. This method needs to compute the following probabilities:
The probability of a document being spam in the training set
The probability of a document being ham in the training set
For each word in featureWords, the
frequency of that word occurring in a spam document and the frequency of
that word occurring in a ham document. This will allow you to
determine, for a word w, P(w|spam) and P(w|ham).
I recommend that you store these values in a dictionary with the word being the key and the frequency being the value.
We don't want any probability to be zero in our multiplication later on, since a single zero factor makes the whole product zero. If you get a zero frequency, set the probability to a suitably small epsilon. (What would be a reasonable value to choose?)
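The training step described above could be sketched as follows. This is one possible reading, not the required solution: it assumes "frequency" means the fraction of documents in a class that contain the word, it is written as a standalone function rather than a method, and the epsilon of 1e-6 is an arbitrary placeholder (choosing a sensible value is still part of the exercise). The toy documents are made up.

```python
def train(training_docs, feature_words, epsilon=1e-6):
    """Estimate P(spam), P(ham), P(w|spam) and P(w|ham).

    training_docs is a list of (words, label) pairs, label "spam" or "ham".
    Assumption: P(w|label) is the fraction of `label` documents containing w.
    """
    labels = ("spam", "ham")
    doc_counts = {label: 0 for label in labels}
    word_counts = {label: {w: 0 for w in feature_words} for label in labels}

    for words, label in training_docs:
        doc_counts[label] += 1
        present = set(words)
        for w in feature_words:
            if w in present:
                word_counts[label][w] += 1

    total = len(training_docs)
    priors = {label: doc_counts[label] / total for label in labels}

    # Dictionary keyed by word, as recommended above, one per class.
    likelihoods = {label: {} for label in labels}
    for label in labels:
        for w in feature_words:
            if doc_counts[label] > 0:
                freq = word_counts[label][w] / doc_counts[label]
            else:
                freq = 0.0
            # Replace zero frequencies so later products never collapse to zero.
            likelihoods[label][w] = freq if freq > 0 else epsilon
    return priors, likelihoods


# Toy training data, invented for illustration.
toy_training = [
    (["win", "cash"], "spam"),
    (["cash", "now"], "spam"),
    (["lunch", "now"], "ham"),
]
priors, likelihoods = train(toy_training, ["cash", "now", "lunch"])
```

Here "lunch" never appears in a spam document, so its spam probability falls back to epsilon instead of zero.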
Write the classify method, which for each document in the test set will do the following:
Calculate the product P(w1|spam) P(w2|spam) ... P(wn|spam) P(spam), where the wi's are the words in the document that are also in featureWords.
Do the same for ham.
Append to the classified list the pair (doc,"spam")
or (doc,"ham"), depending on which of the two values is larger.
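As a rough illustration of the classification step, the sketch below multiplies the class prior by the word probabilities and keeps the label with the larger product. It assumes each distinct word in a document is counted once, breaks ties in favor of ham, and uses made-up probability tables; the function names and the tuple representation of a document are also hypothetical.

```python
def class_score(words, label, priors, likelihoods):
    """Return P(label) times the product of P(w|label) over the feature words in the document."""
    result = priors[label]
    for w in set(words):  # assumption: each distinct word counts once
        if w in likelihoods[label]:  # skip words that are not feature words
            result *= likelihoods[label][w]
    return result


def classify(test_docs, priors, likelihoods):
    """Return a list of (doc, label) pairs, one per test document."""
    classified = []
    for doc in test_docs:
        spam_score = class_score(doc, "spam", priors, likelihoods)
        ham_score = class_score(doc, "ham", priors, likelihoods)
        label = "spam" if spam_score > ham_score else "ham"
        classified.append((doc, label))
    return classified


# Invented probability tables and test documents for illustration only.
toy_priors = {"spam": 0.5, "ham": 0.5}
toy_likelihoods = {
    "spam": {"cash": 0.8, "noon": 0.1},
    "ham": {"cash": 0.1, "noon": 0.7},
}
toy_docs = [("cash", "now"), ("noon", "lunch")]
toy_classified = classify(toy_docs, toy_priors, toy_likelihoods)
```

With 2000 feature words the product of many small probabilities can underflow to zero; summing the logarithms of the same factors and comparing the sums gives the same decision and is a common way around that.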
Write the accuracy method that uses the classified list
and determines the percent of the documents that were correctly classified.
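A minimal sketch of the accuracy computation follows. It assumes the true label of each test document is available from a lookup keyed by the document; the true_labels dictionary, the file names, and the standalone function form are hypothetical, since how the class stores the true labels is not specified here.

```python
def accuracy(classified, true_labels):
    """Return the percentage of (doc, predicted) pairs whose prediction is correct."""
    if not classified:
        return 0.0
    correct = sum(
        1 for doc, predicted in classified if true_labels[doc] == predicted
    )
    return 100.0 * correct / len(classified)


# Invented example: three of the four predictions match the true labels.
toy_results = [("a.txt", "spam"), ("b.txt", "ham"), ("c.txt", "spam"), ("d.txt", "ham")]
toy_truth = {"a.txt": "spam", "b.txt": "ham", "c.txt": "ham", "d.txt": "ham"}
toy_accuracy = accuracy(toy_results, toy_truth)
```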
You can test your class by creating an object c
You can then apply the training method, c.train()
You can then classify the test documents, c.classify()
Finally, check the accuracy of your classification
Submit your mySpamClassifier.py code