HMDA Case Study Series. Non-linear models–Part 2. Generalized Additive Models and Splines

This second part of the exploration of non-linear models looks at two ways to account for non-linearities: splines and generalized additive models.

Instead of fitting a single polynomial across the entire range of a predictor, one can fit separate polynomials in regions delimited by knots, with the constraint that the pieces join continuously at the knots. A cubic spline goes further: the function and its first and second derivatives are all continuous at the knots.

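library(splines)
# Assumed fit (not shown in the original): cubic spline with knots at 2, 4 and 6, matching the abline() call below
fit=lm(logloanamt~bs(logincome,knots=c(2,4,6)),data=ver3)
incomelims=range(ver3$logincome)
# a fine grid gives a smooth curve; the default step of 1 leaves only a handful of points
income.grid=seq(from=incomelims[1],to=incomelims[2],length.out=100)
plot(ver3$logincome,ver3$logloanamt,col="darkgrey")
lines(income.grid,predict(fit,list(logincome=income.grid)),col="darkgreen",lwd=2)
abline(v=c(2,4,6),lty=2,col="darkgreen") # dashed lines mark the knots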

[Figure: logloanamt plotted against logincome, with the fitted spline in green and dashed lines at the knots]
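
Since the ver3 data set is not reproduced here, the following is a minimal sketch of the same idea on simulated data. The variables x and y, the data frame sim, and the choice of knots are made up for illustration; only bs() from the splines package and base R functions are used. It also checks that a cubic spline with K interior knots uses K + 3 basis functions in addition to the intercept.

```r
library(splines)  # ships with base R

set.seed(1)
# Simulated stand-in data; x and y are hypothetical, not the HMDA variables
sim <- data.frame(x = runif(500, 0, 8))
sim$y <- sin(sim$x) + 0.1 * sim$x^2 + rnorm(500, sd = 0.3)

knots <- c(2, 4, 6)

# bs() builds a cubic B-spline basis by default (degree = 3)
fit_bs <- lm(y ~ bs(x, knots = knots), data = sim)

# Number of basis columns: K interior knots + 3 (the intercept is added by lm)
n_basis <- ncol(bs(sim$x, knots = knots))

# Evaluate the fitted curve on a fine grid inside the observed range
x_grid <- seq(min(sim$x), max(sim$x), length.out = 200)
pred <- predict(fit_bs, newdata = data.frame(x = x_grid), se.fit = TRUE)

plot(sim$x, sim$y, col = "darkgrey")
lines(x_grid, pred$fit, col = "darkgreen", lwd = 2)
abline(v = knots, lty = 2, col = "darkgreen")  # knot locations
```

Moving or adding knots changes where the curve is allowed to bend; between knots the fit is an ordinary cubic polynomial.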

With a smoothing spline, the amount of smoothing can be set either by specifying the effective degrees of freedom or by letting cross-validation choose the tuning parameter lambda. Let’s build a smoothing spline model using leave-one-out cross-validation.

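#using loocv to fit a smoothing spline
# Fit reconstructed from the Call shown in the output below. Note that it passes
# logloanamt as x and logincome as y, the reverse of the plot axes; to smooth
# logloanamt against logincome, swap the two arguments.
fit=smooth.spline(x=ver3$logloanamt,y=ver3$logincome,cv=TRUE)
plot(ver3$logincome,ver3$logloanamt,col="darkgrey")
lines(fit,col="purple",lwd=2)
fit # print the fitted model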

[Figure: logloanamt plotted against logincome, with the smoothing spline in purple]

## Call:
## smooth.spline(x = ver3$logloanamt, y = ver3$logincome, cv = TRUE)
## 
## Smoothing Parameter  spar= 0.9116  lambda= 0.01145 (10 iterations)
## Equivalent Degrees of Freedom (Df): 11.11
## Penalized Criterion: 407.6
## PRESS: 0.354
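
To make the two ways of choosing the amount of smoothing concrete, here is a small sketch on simulated data (x and y are made up, not the HMDA variables). One fit fixes the effective degrees of freedom at an arbitrary value of 10; the other lets leave-one-out cross-validation pick lambda. Both use smooth.spline() from base R.

```r
set.seed(2)
# Simulated stand-in data; continuous x, so there are no tied values
x <- runif(300, 0, 8)
y <- sin(x) + rnorm(300, sd = 0.3)

# Option 1: fix the effective degrees of freedom directly
fit_df <- smooth.spline(x, y, df = 10)

# Option 2: let leave-one-out cross-validation choose lambda
fit_cv <- smooth.spline(x, y, cv = TRUE)

fit_cv$df       # effective degrees of freedom implied by the chosen lambda
fit_cv$lambda   # the chosen penalty
fit_cv$cv.crit  # the cross-validation score (shown as PRESS when printed)

plot(x, y, col = "darkgrey")
lines(fit_df, col = "red", lwd = 2)     # fixed degrees of freedom
lines(fit_cv, col = "purple", lwd = 2)  # cross-validated
```

A larger lambda gives a smoother curve with fewer effective degrees of freedom; a smaller lambda lets the curve follow the data more closely.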

Generalized additive models are useful for combining several terms, some of them non-linear, in one model. In this model, we use a smoothed term for the continuous variable logincome, a linear term for population, and a term for the factor variable agency_abbr (the agency name). The plot shows the fitted contribution of each term along with its standard-error bands.

    
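library(gam) # assumed package; formula taken from Model 1 in the deviance table below
gam1=gam(logloanamt~s(logincome,df=4)+population+agency_abbr,data=ver3)
par(mfrow=c(1,3))
plot(gam1,se=TRUE)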

[Figure: fitted term plots for gam1 with standard-error bands]
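
To see what each panel of such a plot represents, here is a sketch that uses only base R. It swaps in a natural-spline term inside lm() as a stand-in for the gam smoothing term, so it is not the same estimator as above, and the data frame sim with columns x1, x2, grp and y is simulated for illustration. Each panel is one term's contribution to the prediction, which predict(type = "terms") returns column by column.

```r
library(splines)  # ships with base R

set.seed(3)
n <- 400
# Simulated stand-in data: two numeric predictors and one factor
sim <- data.frame(
  x1  = runif(n, 0, 8),
  x2  = runif(n, 0, 5),
  grp = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
sim$y <- sin(sim$x1) + 0.5 * sim$x2 + 0.3 * as.numeric(sim$grp) + rnorm(n, sd = 0.3)

# Additive model: spline term for x1, linear term for x2, factor term for grp
fit_add <- lm(y ~ ns(x1, df = 4) + x2 + grp, data = sim)

# One column of fitted contributions (and standard errors) per term
terms_fit <- predict(fit_add, type = "terms", se.fit = TRUE)
colnames(terms_fit$fit)

# Base R analogue of the term plots: one panel per term with standard-error bands
par(mfrow = c(1, 3))
termplot(fit_add, se = TRUE)
```

Because the model is additive, each panel can be read on its own: it shows how the prediction changes with that variable while the others are held fixed.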

Now, let’s build a generalized additive model using a smoothing term for the population variable and compare it to the previous model. We’ll use anova to check whether population needs to be smoothed. The p-value is 0.004, which indicates that the smoothed (non-linear) term for population fits significantly better than the linear term.


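gam3=gam(logloanamt~s(logincome,df=4)+s(population,df=4)+agency_abbr,data=ver3) # Model 2 in the table below
par(mfrow=c(1,3))
plot(gam3,se=TRUE)
anova(gam1,gam3,test="Chisq") # linear vs smoothed population term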

[Figure: fitted term plots for gam3 with standard-error bands]

## Analysis of Deviance Table
## 
## Model 1: logloanamt ~ s(logincome, df = 4) + population + agency_abbr
## Model 2: logloanamt ~ s(logincome, df = 4) + s(population, df = 4) + agency_abbr
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1     34562      13764                        
## 2     34559      13759  3     5.31    0.004 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
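
The same kind of nested comparison can be sketched with base R alone. As an assumption for illustration, natural-spline terms inside lm() stand in for the gam smoothing terms, and the data are simulated so that x2 has a truly non-linear effect; the names sim, x1, x2, m_linear and m_smooth are made up. The model with a linear x2 is nested in the one with a spline in x2, so anova() can test whether the extra flexibility is needed.

```r
library(splines)  # ships with base R

set.seed(4)
n <- 500
sim <- data.frame(x1 = runif(n, 0, 8), x2 = runif(n, 0, 5))
# x2 enters quadratically by construction, so a linear term should be inadequate
sim$y <- sin(sim$x1) + (sim$x2 - 2.5)^2 + rnorm(n, sd = 0.5)

# Model 1: x2 enters linearly; Model 2: x2 gets a natural spline with 4 df
m_linear <- lm(y ~ ns(x1, df = 4) + x2, data = sim)
m_smooth <- lm(y ~ ns(x1, df = 4) + ns(x2, df = 4), data = sim)

# Nested-model test: a small p-value favours the non-linear term for x2
cmp <- anova(m_linear, m_smooth, test = "Chisq")
print(cmp)
p_value <- cmp[2, "Pr(>Chi)"]
```

The Df column reports 3 because a 4-df spline replaces a 1-df linear term, the same difference that appears in the table above.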