The data below is from a sample of 3921 e-mails.
emailData = read.csv('http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/email.txt',sep='\t')
dim(emailData)
## [1] 3921 21
We would like to predict whether any given e-mail is spam or not. Therefore spam is our response variable. Of all the other variables, we will focus on the following 10 for our model:
to_multiple
An indicator variable for if more than one person was listed in the To field of the email.
cc
An indicator for if someone was CC’ed on the email.
attach
An indicator for if there was an attachment, such as a document or image.
dollar
An indicator for if the word “dollar” or a dollar symbol ($) appeared in the email.
winner
An indicator for if the word “winner” appeared in the email message.
inherit
An indicator for if the word “inherit” (or a variation, like “inheritance”) appeared in the email.
password
An indicator for if the word “password” was present in the email.
format
Indicates if the email contained special formatting, such as bolding, tables, or links.
re_subj
Indicates whether “Re:” was included at the start of the email subject.
exclaim_subj
Indicates whether any exclamation point was included in the email subject.
fullModel = glm(spam~to_multiple+cc+attach+dollar+winner+inherit+password+format+re_subj+exclaim_subj,family='binomial',data=emailData)
summary(fullModel)
##
## Call:
## glm(formula = spam ~ to_multiple + cc + attach + dollar + winner +
## inherit + password + format + re_subj + exclaim_subj, family = "binomial",
## data = emailData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6348 -0.4325 -0.2566 -0.0945 3.8846
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.79976 0.08935 -8.950 < 2e-16 ***
## to_multiple -2.84097 0.31158 -9.118 < 2e-16 ***
## cc 0.03134 0.01895 1.654 0.098058 .
## attach 0.20351 0.05851 3.478 0.000505 ***
## dollar -0.07304 0.02306 -3.168 0.001535 **
## winneryes 1.83103 0.33641 5.443 5.24e-08 ***
## inherit 0.32999 0.15223 2.168 0.030184 *
## password -0.75953 0.29597 -2.566 0.010280 *
## format -1.52284 0.12270 -12.411 < 2e-16 ***
## re_subj -3.11857 0.36522 -8.539 < 2e-16 ***
## exclaim_subj 0.24399 0.22502 1.084 0.278221
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2437.2 on 3920 degrees of freedom
## Residual deviance: 1936.2 on 3910 degrees of freedom
## AIC: 1958.2
##
## Number of Fisher Scoring iterations: 7
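One way to read the coefficient table, as a side note: exponentiating a logistic-regression coefficient gives an odds ratio, the multiplicative change in the odds of spam for a one-unit change in that variable. For example, using the winneryes estimate printed above (an illustration, not part of the original output):

```r
# Exponentiating a coefficient converts it from a log-odds scale to an odds ratio.
exp(1.83103)          # roughly 6.2: the odds of spam are about 6 times
                      # higher when the word "winner" appears
exp(coef(fullModel))  # odds ratios for every term in the fitted model at once
```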
In a logistic model, you can use the predict() function to compute the predicted log-odds for an e-mail to be spam. Recall that predict() takes two arguments: a model and a data frame with values for each of the explanatory variables in the model. Here is an example.
predict(fullModel,data.frame(to_multiple=0,attach=0,cc=0,dollar=0,winner="no",inherit=0,password=0,format=0,re_subj=0,exclaim_subj=0))
## 1
## -0.7997563
The result is the log-odds. To convert a log-odds value \(A\) to the actual probability, use the formula:
\[p = \frac{e^A}{1+e^A}\]
So, for example, an e-mail with none of the features in the 10 variables above would have a probability of being spam equal to:
exp(-0.7997563)/(1+exp(-0.7997563))
## [1] 0.3100777
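As a shortcut, predict() can return the probability directly if you give it the argument type="response", which applies this conversion for you. A quick sketch using the same fullModel and data frame as above (the result should match the hand computation):

```r
# type="response" returns exp(A)/(1+exp(A)) instead of the log-odds A
predict(fullModel,
        data.frame(to_multiple=0,attach=0,cc=0,dollar=0,winner="no",inherit=0,
                   password=0,format=0,re_subj=0,exclaim_subj=0),
        type="response")
```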
Which variable would you remove first from the full model using backwards elimination?
Find the reduced model.
What kind of e-mail would be most likely to be flagged as spam by the reduced model?
Use the predict() command to find the log-odds of such an e-mail being spam, according to the reduced model.
Convert the log-odds to a probability.
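For carrying out the backwards elimination step in R, one convenient pattern is update(), which refits a model with a term removed. This is only a sketch; some_variable is a placeholder, and deciding which predictor to drop (based on the p-values in the summary above) is left to you:

```r
# Refit the model with one term removed; replace some_variable with
# whichever predictor you choose to eliminate (hypothetical placeholder).
reducedModel = update(fullModel, . ~ . - some_variable)
summary(reducedModel)
```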