Spam Filter

The data below is from a sample of 3921 e-mails.

emailData = read.csv('http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/email.txt',sep='\t')
dim(emailData)
## [1] 3921   21

We would like to predict whether a given e-mail is spam, so spam is our response variable. Of the remaining variables, we will focus on the following 10 in our model (a quick check of how these columns are stored follows the list):

  1. to_multiple An indicator for whether more than one person was listed in the To field of the email.
  2. cc An indicator for whether someone was CC’ed on the email.
  3. attach An indicator for whether there was an attachment, such as a document or image.
  4. dollar An indicator for whether the word “dollar” or a dollar symbol ($) appeared in the email.
  5. winner An indicator for whether the word “winner” appeared in the email message.
  6. inherit An indicator for whether the word “inherit” (or a variation, like “inheritance”) appeared in the email.
  7. password An indicator for whether the word “password” was present in the email.
  8. format An indicator for whether the email contained special formatting, such as bolding, tables, or links.
  9. re_subj An indicator for whether “Re:” was included at the start of the email subject.
  10. exclaim_subj An indicator for whether an exclamation point was included in the email subject.
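
Before fitting a model, it can help to check how these columns are stored in the data frame. Here is a quick look (the column names are taken from the list above):

str(emailData[, c("spam", "to_multiple", "cc", "attach", "dollar", "winner",
                  "inherit", "password", "format", "re_subj", "exclaim_subj")])
table(emailData$spam)   # how many e-mails are flagged as spam vs. not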

Logistic Model

fullModel = glm(spam~to_multiple+cc+attach+dollar+winner+inherit+password+format+re_subj+exclaim_subj,family='binomial',data=emailData)
summary(fullModel)
## 
## Call:
## glm(formula = spam ~ to_multiple + cc + attach + dollar + winner + 
##     inherit + password + format + re_subj + exclaim_subj, family = "binomial", 
##     data = emailData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6348  -0.4325  -0.2566  -0.0945   3.8846  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.79976    0.08935  -8.950  < 2e-16 ***
## to_multiple  -2.84097    0.31158  -9.118  < 2e-16 ***
## cc            0.03134    0.01895   1.654 0.098058 .  
## attach        0.20351    0.05851   3.478 0.000505 ***
## dollar       -0.07304    0.02306  -3.168 0.001535 ** 
## winneryes     1.83103    0.33641   5.443 5.24e-08 ***
## inherit       0.32999    0.15223   2.168 0.030184 *  
## password     -0.75953    0.29597  -2.566 0.010280 *  
## format       -1.52284    0.12270 -12.411  < 2e-16 ***
## re_subj      -3.11857    0.36522  -8.539  < 2e-16 ***
## exclaim_subj  0.24399    0.22502   1.084 0.278221    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2437.2  on 3920  degrees of freedom
## Residual deviance: 1936.2  on 3910  degrees of freedom
## AIC: 1958.2
## 
## Number of Fisher Scoring iterations: 7

In a logistic model, you can use the predict() function to compute the predicted log-odds that an e-mail is spam. Recall that predict() takes two arguments: a model, and a data frame (the newdata argument) with a value for each explanatory variable in the model. Here is an example.

predict(fullModel,data.frame(to_multiple=0,attach=0,cc=0,dollar=0,winner="no",inherit=0,password=0,format=0,re_subj=0,exclaim_subj=0))
##          1 
## -0.7997563
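
Notice that this matches the intercept of the model, since every explanatory variable was set to 0 (or “no” for winner). You can confirm this directly:

coef(fullModel)["(Intercept)"]   # the predicted log-odds when every predictor is 0 or "no"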

The result is the log-odds, A. To convert the log-odds into a probability, use the formula:

\[p = \frac{e^A}{1+e^A}\]

So, for example, an e-mail with none of the 10 features listed above would have a probability of being spam equal to:

exp(-0.7997563)/(1+exp(-0.7997563))
## [1] 0.3100777
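
As an optional check, R can do this conversion for you: plogis() computes exp(A)/(1+exp(A)), and predict() returns the probability directly if you add the argument type = "response".

plogis(-0.7997563)   # the logistic function exp(A)/(1+exp(A))
predict(fullModel,
        data.frame(to_multiple=0, attach=0, cc=0, dollar=0, winner="no",
                   inherit=0, password=0, format=0, re_subj=0, exclaim_subj=0),
        type="response")   # predicted probability instead of log-odds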

Questions

  1. Which variable would you remove first from the full model using backwards elimination? (A sketch of some R commands that may help with Questions 1 and 2 appears after this list.)

  2. Find the reduced model.

  3. What kind of e-mail would be most likely to be flagged as spam by the reduced model?

  4. Use the predict() command to find the log-odds of such an e-mail being spam, according to the reduced model.

  5. Convert the log-odds to a probability.
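
For Questions 1 and 2, the drop1() and update() functions may be helpful. The sketch below uses a likelihood-ratio test to compare the full model against each model with one term removed, which is one possible way (not necessarily the only one) to decide what to drop; some_variable is a hypothetical placeholder for whatever term you choose in Question 1.

drop1(fullModel, test = "Chisq")   # test each single-term deletion from the full model

# Once you choose a variable, update() refits the model without it:
# reducedModel = update(fullModel, . ~ . - some_variable)   # some_variable is a placeholder
# summary(reducedModel)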