Multiple Logistic Regression Example

emailData = read.csv('email.txt',sep='\t')
head(emailData)
dim(emailData)

The data above come from a large sample of e-mails. We would like to predict whether a given e-mail is spam, so spam is our response variable. Of the remaining variables, we will use 10 as predictors. The table below describes the response and those 10 predictors:

variable	description
spam	Specifies whether the message was spam.
to_multiple	An indicator variable for if more than one person was listed in the To field of the email.
cc	An indicator for if someone was CCed on the email.
attach	An indicator for if there was an attachment, such as a document or image.
dollar	An indicator for if the word “dollar” or dollar symbol ($) appeared in the email.
winner	An indicator for if the word “winner” appeared in the email message.
inherit	An indicator for if the word “inherit” (or a variation, like “inheritance”) appeared in the email.
password	An indicator for if the word “password” was present in the email.
format	Indicates if the email contained special formatting, such as bolding, tables, or links.
re_subj	Indicates whether “Re:” was included at the start of the email subject.
exclaim_subj	Indicates whether any exclamation point was included in the email subject.

theModel = glm(spam~to_multiple+cc+attach+dollar+winner+inherit+password+format+re_subj+exclaim_subj,family='binomial',data=emailData)

summary(theModel)

Call:
glm(formula = spam ~ to_multiple + cc + attach + dollar + winner + 
    inherit + password + format + re_subj + exclaim_subj, family = "binomial", 
    data = emailData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6348  -0.4325  -0.2566  -0.0945   3.8846  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.79976    0.08935  -8.950  < 2e-16 ***
to_multiple  -2.84097    0.31158  -9.118  < 2e-16 ***
cc            0.03134    0.01895   1.654 0.098058 .  
attach        0.20351    0.05851   3.478 0.000505 ***
dollar       -0.07304    0.02306  -3.168 0.001535 ** 
winneryes     1.83103    0.33641   5.443 5.24e-08 ***
inherit       0.32999    0.15223   2.168 0.030184 *  
password     -0.75953    0.29597  -2.566 0.010280 *  
format       -1.52284    0.12270 -12.411  < 2e-16 ***
re_subj      -3.11857    0.36522  -8.539  < 2e-16 ***
exclaim_subj  0.24399    0.22502   1.084 0.278221    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2437.2  on 3920  degrees of freedom
Residual deviance: 1936.2  on 3910  degrees of freedom
AIC: 1958.2

Number of Fisher Scoring iterations: 7
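
The z value and Pr(>|z|) columns in the summary follow from the first two columns: each z value is the estimate divided by its standard error, and the p-value is the two-sided tail area under a standard normal curve. Here is a minimal sketch that recomputes both for the two weakest predictors, using the rounded figures printed above. The names est, se, z and p are illustrative and not part of the original analysis.

```r
# Estimates and standard errors copied (rounded) from the summary output.
est <- c(cc = 0.03134, exclaim_subj = 0.24399)
se  <- c(cc = 0.01895, exclaim_subj = 0.22502)

# Wald statistic: estimate divided by its standard error.
z <- est / se

# Two-sided p-value from the standard normal distribution.
p <- 2 * pnorm(-abs(z))

print(round(z, 3))
print(round(p, 3))
```

Both p-values come out above 0.05, which is why these two variables are the candidates for removal. Small differences from the printed table are due to rounding of the inputs.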

reducedModel = glm(spam~to_multiple+attach+dollar+winner+inherit+password+format+re_subj,family='binomial',data=emailData)
summary(reducedModel)

Call:
glm(formula = spam ~ to_multiple + attach + dollar + winner + 
    inherit + password + format + re_subj, family = "binomial", 
    data = emailData)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6591  -0.4373  -0.2544  -0.0944   3.8707  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.78138    0.08860  -8.820  < 2e-16 ***
to_multiple -2.77682    0.30752  -9.030  < 2e-16 ***
attach       0.20419    0.05789   3.527  0.00042 ***
dollar      -0.06970    0.02239  -3.113  0.00185 ** 
winneryes    1.86675    0.33652   5.547  2.9e-08 ***
inherit      0.33614    0.15073   2.230  0.02575 *  
password    -0.76035    0.29680  -2.562  0.01041 *  
format      -1.51770    0.12226 -12.414  < 2e-16 ***
re_subj     -3.11329    0.36519  -8.525  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2437.2  on 3920  degrees of freedom
Residual deviance: 1939.6  on 3912  degrees of freedom
AIC: 1957.6

Number of Fisher Scoring iterations: 7
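
One way to check that dropping cc and exclaim_subj costs little is to compare the residual deviances of the two fits. The sketch below does this by hand from the numbers reported in the two summaries. It assumes the usual chi-squared approximation for the difference in deviance between nested models, and the variable names are illustrative.

```r
# Deviances and residual degrees of freedom copied from the two summaries.
dev_full    <- 1936.2; df_full    <- 3910
dev_reduced <- 1939.6; df_reduced <- 3912

# Dropping predictors can only increase the deviance; the increase is
# compared against a chi-squared distribution with df equal to the
# number of predictors removed.
lr_stat <- dev_reduced - dev_full
lr_df   <- df_reduced - df_full
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(statistic = lr_stat, df = lr_df, p.value = p_value))
```

The p-value is roughly 0.18, so the reported deviances give no evidence that the two dropped predictors improve the fit. This agrees with the slightly lower AIC of the reduced model. With the fitted objects available, anova(reducedModel, theModel, test='Chisq') performs the same comparison.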

Reduced Model

Using backward elimination to remove variables with p-values greater than 5%, we drop cc and exclaim_subj and arrive at a model that depends on 8 explanatory variables. The fitted model is: $$\log \left( \frac{p_\text{spam}}{1-p_\text{spam}} \right) = -0.781-2.777\text{to_multiple}+0.204\text{attach}-0.070\text{dollar}+1.867\text{winner}+0.336\text{inherit}-0.760\text{password}-1.518\text{format}-3.113\text{re_subj}$$ Here winner equals 1 when the word “winner” appears in the message and 0 otherwise. We can use the predicted values of $p_\text{spam}$ to judge whether an e-mail is spam or not. For example, consider the e-mails that I sent out with your project grades: each had a single recipient, used special formatting, and had no attachment and none of the flagged words.
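
As a rough check on the arithmetic, the fitted equation can be evaluated by hand for such an e-mail. The sketch below hard-codes the rounded estimates from the summary(reducedModel) output. The names coefs and grade_email are introduced here for illustration only, and the winner factor is encoded as 0 for 'no'.

```r
# Rounded coefficients copied from summary(reducedModel).
coefs <- c(intercept = -0.78138, to_multiple = -2.77682, attach = 0.20419,
           dollar = -0.06970, winner = 1.86675, inherit = 0.33614,
           password = -0.76035, format = -1.51770, re_subj = -3.11329)

# Hypothetical e-mail: formatted, single recipient, nothing else flagged.
# The leading 1 multiplies the intercept; winner = 0 encodes 'no'.
grade_email <- c(intercept = 1, to_multiple = 0, attach = 0, dollar = 0,
                 winner = 0, inherit = 0, password = 0, format = 1,
                 re_subj = 0)

log_odds <- sum(coefs * grade_email[names(coefs)])  # linear predictor
p_spam   <- plogis(log_odds)                        # inverse logit

print(round(c(log_odds = log_odds, p_spam = p_spam), 4))
```

The log-odds are about -2.3, which corresponds to a spam probability of roughly 0.09, so this model would not flag such a message.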

# type='response' returns the probability p_spam; the default returns the log-odds
predict(reducedModel,data.frame(to_multiple=0,attach=0,dollar=0,winner='no',inherit=0,password=0,format=1,re_subj=0),type='response')

	from	time	dollar	winner	⋯	password	num_char	line_breaks	format	exclaim_mess	number
1	1	2011-12-31 22:16:41	0	no	⋯	0	11.37	202	1	0	big
2	1	2011-12-31 23:03:59	0	no	⋯	0	10.504	202	1	1	small
3	1	2012-01-01 08:00:32	4	no	⋯	0	7.773	192	1	6	small
4	1	2012-01-01 01:09:49	0	no	⋯	0	13.256	255	1	48	small
5	1	2012-01-01 02:00:01	0	no	⋯	2	1.231	29	0	1	none
6	1	2012-01-01 02:04:46	0	no	⋯	2	1.091	25	0	1	none