Multiple Logistic Regression Example

In [1]:
# Read the tab-separated e-mail data, then preview the first rows and the dimensions.
emailData = read.csv('email.txt', sep='\t')
head(emailData)
dim(emailData)
Out[1]:
  spam to_multiple from cc sent_email                time image attach dollar winner
1    0           0    1  0          0 2011-12-31 22:16:41     0      0      0     no
2    0           0    1  0          0 2011-12-31 23:03:59     0      0      0     no
3    0           0    1  0          0 2012-01-01 08:00:32     0      0      4     no
4    0           0    1  0          0 2012-01-01 01:09:49     0      0      0     no
5    0           0    1  0          0 2012-01-01 02:00:01     0      0      0     no
6    0           0    1  0          0 2012-01-01 02:04:46     0      0      0     no
  viagra password num_char line_breaks format re_subj exclaim_subj urgent_subj exclaim_mess number
1      0        0    11.37         202      1       0            0           0            0    big
2      0        0   10.504         202      1       0            0           0            1  small
3      0        0    7.773         192      1       0            0           0            6  small
4      0        0   13.256         255      1       0            0           0           48  small
5      0        2    1.231          29      0       0            0           0            1   none
6      0        2    1.091          25      0       0            0           0            1   none
Out[1]:
  1. 3921
  2. 21

The data above are from a large sample of e-mails (3921 messages, 21 variables). We would like to predict whether a given e-mail is spam, so spam is our response variable. Of the other variables, we will use 10 as predictors. The response and the 10 predictors are described below:

variable      description
spam          Specifies whether the message was spam.
to_multiple   An indicator for whether more than one person was listed in the To field of the email.
cc            An indicator for whether someone was CCed on the email.
attach        An indicator for whether there was an attachment, such as a document or image.
dollar        An indicator for whether the word “dollar” or a dollar symbol ($) appeared in the email.
winner        An indicator for whether the word “winner” appeared in the email message.
inherit       An indicator for whether the word “inherit” (or a variation, like “inheritance”) appeared in the email.
password      An indicator for whether the word “password” was present in the email.
format        Indicates whether the email contained special formatting, such as bolding, tables, or links.
re_subj       Indicates whether “Re:” was included at the start of the email subject.
exclaim_subj  Indicates whether any exclamation point was included in the email subject.
In [2]:
# Full model: logistic regression of spam on all 10 predictors.
theModel = glm(spam ~ to_multiple + cc + attach + dollar + winner + inherit + password + format + re_subj + exclaim_subj, family='binomial', data=emailData)
In [3]:
summary(theModel)
Out[3]:
Call:
glm(formula = spam ~ to_multiple + cc + attach + dollar + winner + 
    inherit + password + format + re_subj + exclaim_subj, family = "binomial", 
    data = emailData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6348  -0.4325  -0.2566  -0.0945   3.8846  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.79976    0.08935  -8.950  < 2e-16 ***
to_multiple  -2.84097    0.31158  -9.118  < 2e-16 ***
cc            0.03134    0.01895   1.654 0.098058 .  
attach        0.20351    0.05851   3.478 0.000505 ***
dollar       -0.07304    0.02306  -3.168 0.001535 ** 
winneryes     1.83103    0.33641   5.443 5.24e-08 ***
inherit       0.32999    0.15223   2.168 0.030184 *  
password     -0.75953    0.29597  -2.566 0.010280 *  
format       -1.52284    0.12270 -12.411  < 2e-16 ***
re_subj      -3.11857    0.36522  -8.539  < 2e-16 ***
exclaim_subj  0.24399    0.22502   1.084 0.278221    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2437.2  on 3920  degrees of freedom
Residual deviance: 1936.2  on 3910  degrees of freedom
AIC: 1958.2

Number of Fisher Scoring iterations: 7
In [4]:
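# Reduced model: drop cc and exclaim_subj, whose p-values exceed 0.05 in the full model.
reducedModel = glm(spam ~ to_multiple + attach + dollar + winner + inherit + password + format + re_subj, family='binomial', data=emailData)
summary(reducedModel)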
Out[4]:
Call:
glm(formula = spam ~ to_multiple + attach + dollar + winner + 
    inherit + password + format + re_subj, family = "binomial", 
    data = emailData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6591  -0.4373  -0.2544  -0.0944   3.8707  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.78138    0.08860  -8.820  < 2e-16 ***
to_multiple -2.77682    0.30752  -9.030  < 2e-16 ***
attach       0.20419    0.05789   3.527  0.00042 ***
dollar      -0.06970    0.02239  -3.113  0.00185 ** 
winneryes    1.86675    0.33652   5.547  2.9e-08 ***
inherit      0.33614    0.15073   2.230  0.02575 *  
password    -0.76035    0.29680  -2.562  0.01041 *  
format      -1.51770    0.12226 -12.414  < 2e-16 ***
re_subj     -3.11329    0.36519  -8.525  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2437.2  on 3920  degrees of freedom
Residual deviance: 1939.6  on 3912  degrees of freedom
AIC: 1957.6

Number of Fisher Scoring iterations: 7
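
The two summaries above can also be compared with a likelihood-ratio (deviance) test. The sketch below uses only the residual deviances and degrees of freedom printed in the output; the names dev_full, dev_reduced, and so on are labels chosen for this illustration, and the inputs are the rounded values shown, so the result is approximate.

```r
# Residual deviances and degrees of freedom copied from the two summaries above.
dev_full    = 1936.2; df_full    = 3910
dev_reduced = 1939.6; df_reduced = 3912

# Likelihood-ratio statistic: the increase in deviance from dropping cc and exclaim_subj.
lr_stat = dev_reduced - dev_full
lr_df   = df_reduced - df_full

# Upper-tail chi-squared probability; a large p-value suggests the dropped terms add little.
p_value = pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p.value = p_value))
```

The statistic is about 3.4 on 2 degrees of freedom, giving a p-value of roughly 0.18, so the reduced model does not fit significantly worse. With the fitted objects available, anova(reducedModel, theModel, test='Chisq') should report the same comparison.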

Reduced Model

Using backward elimination to remove variables with p-values greater than 5%, we drop cc and exclaim_subj and get a model that depends on 8 explanatory variables. The form of the model is:

$$\log \left( \frac{p_\text{spam}}{1-p_\text{spam}} \right) = -0.781-2.777\,\text{to_multiple}+0.204\,\text{attach}-0.070\,\text{dollar}+1.867\,\text{winner}+0.336\,\text{inherit}-0.760\,\text{password}-1.518\,\text{format}-3.113\,\text{re_subj}$$

Here winner is 1 when the winner field is 'yes' and 0 otherwise. We can use the predicted values of $p_\text{spam}$ to try to determine whether an e-mail is spam or not. The e-mails that I sent out with your project grades make a good test case.

In [6]:
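# Predicted log-odds (the default type='link') for a formatted e-mail with no other flags set.
predict(reducedModel, data.frame(to_multiple=0, attach=0, dollar=0, winner='no', inherit=0, password=0, format=1, re_subj=0))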
Out[6]:
1: -2.2990853567772
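
The value above is on the log-odds scale, not a probability. As a rough check, the sketch below rebuilds it by hand from the printed reducedModel coefficients and converts it to $p_\text{spam}$ with the inverse logit; the names intercept, beta_format, log_odds, and p_spam are chosen for this illustration.

```r
# Coefficients copied from the reducedModel summary. Only the intercept and
# format matter here, since every other predictor is 0 (or 'no') for this e-mail.
intercept   = -0.78138
beta_format = -1.51770

# Linear predictor (log-odds), which is what predict() returns by default.
log_odds = intercept + beta_format * 1

# Inverse logit: p = exp(eta) / (1 + exp(eta)); plogis() computes the same thing.
p_spam = plogis(log_odds)
print(c(log_odds = log_odds, p_spam = p_spam))
```

The log-odds of about -2.299 correspond to a spam probability of roughly 0.09, so the model rates this e-mail as unlikely to be spam. Calling predict() with type='response' should return the probability directly.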