emailData = read.csv('email.txt',sep='\t')
head(emailData)
dim(emailData)
The data above come from a large sample of e-mails. We would like to predict whether any given e-mail is spam, so spam is our response variable. Of the remaining variables, we will use the following 10 as predictors in our model (the table lists the response, spam, first):
| variable | description |
|---|---|
| spam | Specifies whether the message was spam. |
| to_multiple | An indicator variable for if more than one person was listed in the To field of the email. |
| cc | An indicator for if someone was CCed on the email. |
| attach | An indicator for if there was an attachment, such as a document or image. |
| dollar | An indicator for if the word “dollar” or dollar symbol ($) appeared in the email. |
| winner | An indicator for if the word “winner” appeared in the email message. |
| inherit | An indicator for if the word “inherit” (or a variation, like “inheritance”) appeared in the email. |
| password | An indicator for if the word “password” was present in the email. |
| format | Indicates if the email contained special formatting, such as bolding, tables, or links. |
| re_subj | Indicates whether “Re:” was included at the start of the email subject. |
| exclaim_subj | Indicates whether any exclamation point was included in the email subject. |
theModel = glm(spam~to_multiple+cc+attach+dollar+winner+inherit+password+format+re_subj+exclaim_subj,family='binomial',data=emailData)
summary(theModel)
reducedModel = glm(spam~to_multiple+attach+dollar+winner+inherit+password+format+re_subj,family='binomial',data=emailData)
summary(reducedModel)
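The move from the full model to the reduced one is a backward-elimination step: look at the p-values in the summary table and drop a predictor that is not significant. Below is a minimal sketch of one such step, run on simulated data rather than the e-mail data so that it works on its own. The names `toy`, `a`, `b`, `noise`, `fullToy`, and `reducedToy` are made up for illustration, and the 0.05 cutoff is the one used in this analysis.

```r
set.seed(2)
n = 400

# Simulated stand-in data (NOT the e-mail data): three 0/1 indicators,
# of which only a and b actually influence the response
toy = data.frame(a = rbinom(n, 1, 0.5),
                 b = rbinom(n, 1, 0.5),
                 noise = rbinom(n, 1, 0.5))
toy$y = rbinom(n, 1, plogis(-0.5 + 1.5*toy$a - 1.2*toy$b))

fullToy = glm(y ~ a + b + noise, family='binomial', data=toy)

# p-values for the slopes (row 1 is the intercept, so drop it)
pvals = coef(summary(fullToy))[-1, 'Pr(>|z|)']
worst = names(which.max(pvals))

# One elimination step: remove the least significant predictor
# only if its p-value exceeds 0.05
if (pvals[[worst]] > 0.05) {
  reducedToy = update(fullToy, as.formula(paste('. ~ . -', worst)))
} else {
  reducedToy = fullToy
}
print(formula(reducedToy))
```

Strictly, backward elimination removes one variable at a time and refits before checking the p-values again, which is why the sketch performs a single step.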
Using backward elimination to remove variables with p-values greater than 0.05, we get a model that depends on 8 explanatory variables. The form of the model, with $\hat\beta_0$ denoting the fitted intercept reported in the summary output, is:

$$\log \left( \frac{p_\text{spam}}{1-p_\text{spam}} \right) = \hat\beta_0-2.767\text{to_multiple}+0.204\text{attach}-0.697\text{dollar}+1.867\text{winner}+0.336\text{inherit}-0.760\text{password}-1.518\text{format}-3.113\text{re_subj}$$

We can use the predicted values of $p_\text{spam}$ to try to decide whether an e-mail is spam or not. The e-mails that I sent out with your project grades make a good test case.
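Because the model is linear on the log-odds scale, each slope can be read as a multiplicative change in the odds of spam, and any log-odds value maps back to a probability through the inverse logit. Here is a minimal sketch using only the slope estimates printed in the equation. The names `slopeEstimates`, `oddsRatios`, and `exampleLogOdds` are made up for illustration, and the example log-odds values are arbitrary. They are not predictions for any actual e-mail, since the intercept is not used here.

```r
# Slope estimates copied from the fitted equation above
slopeEstimates = c(to_multiple=-2.767, attach=0.204, dollar=-0.697,
                   winner=1.867, inherit=0.336, password=-0.760,
                   format=-1.518, re_subj=-3.113)

# exp(slope) is the factor by which the odds of spam are multiplied
# when that variable goes up by one and the others are held fixed
oddsRatios = exp(slopeEstimates)
print(round(oddsRatios, 3))

# Inverse logit: convert log-odds to a probability.
# plogis(x) computes 1 / (1 + exp(-x)).
exampleLogOdds = c(-3, 0, 3)   # arbitrary illustrative values
print(plogis(exampleLogOdds))
```

Ratios below 1 (for example `format` and `re_subj`) correspond to features that make spam less likely, and ratios above 1 (for example `winner`) to features that make it more likely.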
# type='response' returns the probability p_spam; the default returns the log-odds
predict(reducedModel,data.frame(to_multiple=0,attach=0,dollar=0,winner='no',inherit=0,password=0,format=1,re_subj=0),type='response')
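To turn a fitted probability into a spam / not-spam decision, we need a cutoff. The sketch below shows the idea on simulated data rather than the e-mail data, so that it runs on its own. The names `toyData`, `x1`, `x2`, and `toyModel` are hypothetical, and the 0.5 cutoff is an assumption chosen for illustration, not a value taken from this analysis.

```r
set.seed(1)
n = 500

# Simulated stand-in data (NOT the e-mail data): two 0/1 indicators
toyData = data.frame(x1 = rbinom(n, 1, 0.3), x2 = rbinom(n, 1, 0.5))
# Log-odds used only to generate the toy response
toyData$y = rbinom(n, 1, plogis(-1 + 2*toyData$x1 - 1.5*toyData$x2))

toyModel = glm(y ~ x1 + x2, family='binomial', data=toyData)

# type='response' gives probabilities; the default (type='link') gives log-odds
pHat = predict(toyModel, toyData, type='response')

cutoff = 0.5   # assumed threshold, for illustration only
predictedClass = as.integer(pHat > cutoff)

# Cross-tabulate actual versus predicted classes
print(table(actual=toyData$y, predicted=predictedClass))
```

A lower cutoff flags more messages as spam at the cost of more false alarms; which trade-off is acceptable depends on how costly it is to lose a legitimate e-mail.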