House Price vs. Square Footage

A group of students wanted to investigate which factors influence the price of a house (Koester, Davis, and Ross, 2003). They used http://www.househunt.com, limiting their search to single-family homes in California. They collected a stratified sample by dividing California into three regions (northern, southern, and central) and then randomly selecting homes from within each stratum. They focused on homes that were smaller than 5000 square feet and sold for less than $1.5 million.

houseData = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/housing.txt")
head(houseData)

##   sqft  price       City bedrooms baths
## 1 3392 339000     Dublin        3   2.1
## 2 4100 899900 pleasanton        4   3.0
## 3 3200 448641    Clayton        5   4.0
## 4 1436 239999     Moraga        4   3.0
## 5 1944 377500    Antioch        3   2.0
## 6 1500 299900   Danville        3   2.5

myLM = lm(price~sqft,data=houseData)
plot(houseData$sqft,houseData$price,ylab='Price ($)',xlab="Area (sq. ft.)")
abline(myLM)

par(mfrow=c(1,2))
plot(houseData$sqft,resid(myLM),main='Residual Plot',xlab='Area (sq. ft.)',ylab='Residual')
abline(0,0)

qqnorm(resid(myLM))
qqline(resid(myLM))

Transforming the Data

As you can see, a straight-line model is not a good fit for these data. The variance of the residuals appears to increase as houses get bigger, and the residuals are not normal (they are skewed right). One option is to apply a log transformation to one or both of the variables. Here we take the base-10 logarithm of both price and square footage.

myLM2 = lm(log10(price)~log10(sqft),data=houseData)
plot(log10(houseData$sqft),log10(houseData$price),ylab='log Price ($)',xlab="log Area (sq. ft.)")
abline(myLM2)

par(mfrow=c(1,2))
plot(houseData$sqft,resid(myLM2),main='Residual Plot',xlab='Area (sq. ft.)',ylab='Residual')
abline(0,0)

qqnorm(resid(myLM2))
qqline(resid(myLM2))
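
One way to see why a log transformation can even out the residual spread is to try it on simulated data. The sketch below does not use the housing data: the names simSqft and simPrice and all parameter values are made up for illustration. Prices are generated with multiplicative noise, so the spread grows with size on the raw scale but should be roughly constant on the log scale.

```r
# Simulated example (NOT the housing data; all values are illustrative).
# Multiplicative noise makes the spread of price grow with sqft.
set.seed(1)
simSqft  = runif(500, 800, 5000)
simPrice = 150 * simSqft * 10^rnorm(500, 0, 0.15)

rawLM = lm(simPrice ~ simSqft)                 # raw-scale model
logLM = lm(log10(simPrice) ~ log10(simSqft))   # log-log model

# Compare residual spread for the larger half vs. the smaller half of houses
big = simSqft > median(simSqft)
rawRatio = sd(resid(rawLM)[big]) / sd(resid(rawLM)[!big])
logRatio = sd(resid(logLM)[big]) / sd(resid(logLM)[!big])

rawRatio  # well above 1: residuals fan out on the raw scale
logRatio  # near 1: roughly constant spread on the log scale
```

A ratio near 1 for the log-log model is what a residual plot with even vertical spread looks like numerically.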

Using this new model, we can construct a 95% confidence interval for the mean log10 price of houses that are 2000 sq. ft.

predict(myLM2,data.frame(sqft=2000),interval='confidence')

##        fit      lwr      upr
## 1 5.635178 5.592034 5.678322
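
A confidence interval describes the mean log price of all houses of a given size, not the price of one particular house. For a single house, predict() also accepts interval='prediction'. The sketch below illustrates the difference on simulated data, not the housing data; simData, simLM, and the parameter values are hypothetical.

```r
# Simulated example (NOT the housing data; all values are illustrative).
set.seed(2)
simData = data.frame(sqft = runif(200, 800, 5000))
simData$logPrice = 2.2 + 1.0 * log10(simData$sqft) + rnorm(200, 0, 0.15)

simLM = lm(logPrice ~ log10(sqft), data = simData)
newHouse = data.frame(sqft = 2000)

# Interval for the MEAN log price of all 2000 sq. ft. houses
confInt = predict(simLM, newHouse, interval = 'confidence')
# Interval for the log price of ONE new 2000 sq. ft. house
predInt = predict(simLM, newHouse, interval = 'prediction')

confInt
predInt
```

Both intervals are centered on the same fitted value, but the prediction interval is wider because it also accounts for the variation of individual houses around the mean.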

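Keep in mind that we are estimating the mean of the log price. If we back-transform to actual prices, the result is no longer a 95% confidence interval for the average price. However, if the log prices are approximately normal (and therefore symmetric), the mean and median of the log prices coincide, so the back-transformed interval can be read as an interval for the median price of 2000 sq. ft. houses.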

10^predict(myLM2,data.frame(sqft=2000),interval='confidence')

##        fit      lwr      upr
## 1 431696.1 390871.5 476784.6
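
The claim that back-transforming targets the median rather than the mean can be checked with a short simulation. This is a sketch on simulated log-normal prices, not the housing data; the mean of 5.6 and standard deviation of 0.2 on the log10 scale are arbitrary choices for illustration.

```r
# Simulated example (NOT the housing data; all values are illustrative).
set.seed(3)
simLogPrices = rnorm(100000, mean = 5.6, sd = 0.2)  # normal on the log10 scale
simPrices    = 10^simLogPrices                      # right-skewed on the dollar scale

backTransformed = 10^mean(simLogPrices)

# Back-transformed mean of the logs, compared with the median and mean price
c(backTransformed = backTransformed,
  medianPrice     = median(simPrices),
  meanPrice       = mean(simPrices))
```

The back-transformed value should land close to the median price, while the mean price is noticeably larger because the right tail pulls it upward.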