Let’s account for non-linearities by exploring several algorithms. This second part of the exploration of non-linear models looks at splines and generalized additive models.
Instead of a single polynomial across the entire space, one can fit different polynomials in regions separated by knots, with the constraint that the fitted curve must be continuous at the knots. A cubic spline goes further: its first and second derivatives are also continuous at the knots.
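library(splines)
#assumption: fit is a cubic regression spline with knots at 2, 4 and 6, the positions marked by abline below
fit=lm(logloanamt~bs(logincome,knots=c(2,4,6)),data=ver3)
incomelims=range(ver3$logincome)
#use a fine grid so the fitted curve is drawn smoothly
income.grid=seq(from=incomelims[1],to=incomelims[2],length.out=100)
#par(mfrow=c(1,3))
plot(ver3$logincome,ver3$logloanamt,col="darkgrey")
lines(income.grid,predict(fit,list(logincome=income.grid)),col="darkgreen",lwd=2)
abline(v=c(2,4,6),lty=2,col="darkgreen")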
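To make the knot idea concrete, here is a minimal sketch on simulated data. The variables `x` and `y` and the knot locations are made up for illustration and are not taken from `ver3`. It fits a cubic regression spline with `bs()` from the splines package and records the number of basis columns, which for a cubic spline is the number of knots plus three.

```r
library(splines)

set.seed(1)
# Simulated data: hypothetical stand-ins for a predictor and a response
x <- runif(300, 0, 8)
y <- sin(x) + 0.1 * x + rnorm(300, sd = 0.3)

# Cubic B-spline basis with three interior knots (degree = 3 is the default)
knot_locs <- c(2, 4, 6)
basis <- bs(x, knots = knot_locs)
n_basis <- ncol(basis)  # number of knots + degree

# Fit the spline by ordinary least squares
spline_fit <- lm(y ~ bs(x, knots = knot_locs))

# Predict on a fine grid inside the observed range so the curve looks smooth
x_grid <- seq(min(x), max(x), length.out = 200)
y_hat <- predict(spline_fit, newdata = data.frame(x = x_grid))

plot(x, y, col = "darkgrey")
lines(x_grid, y_hat, col = "darkgreen", lwd = 2)
abline(v = knot_locs, lty = 2, col = "darkgreen")
```

The fitted curve passes through each dashed knot line without a visible break or kink, which is what the continuity constraints on the function and its first two derivatives buy.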
With a smoothing spline, one can control the amount of smoothing either through the effective degrees of freedom or through the tuning parameter lambda, which can be selected by cross-validation. Let’s build a smoothing spline model using cross-validation.
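#using loocv to fit a smoothing spline (call reproduced from the printed output below)
#note: the call passes logloanamt as x and logincome as y, the reverse of the plot axes
fit=smooth.spline(ver3$logloanamt,ver3$logincome,cv=TRUE)
fit
#par(mfrow=c(3,1))
plot(ver3$logincome,ver3$logloanamt,col="darkgrey")
lines(fit,col="purple",lwd=2)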
## Call:
## smooth.spline(x = ver3$logloanamt, y = ver3$logincome, cv = TRUE)
##
## Smoothing Parameter spar= 0.9116 lambda= 0.01145 (10 iterations)
## Equivalent Degrees of Freedom (Df): 11.11
## Penalized Criterion: 407.6
## PRESS: 0.354
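The two ways of setting the amount of smoothing can be compared side by side. The following is a sketch on simulated data (the data and the choice of `df = 6` are assumptions for illustration only): one fit lets leave-one-out cross-validation pick lambda, the other fixes the effective degrees of freedom.

```r
set.seed(2)
# Simulated data (hypothetical); unique x values keep leave-one-out CV well defined
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.4)

# Option 1: choose lambda by leave-one-out cross-validation
fit_cv <- smooth.spline(x, y, cv = TRUE)

# Option 2: fix the effective degrees of freedom directly
fit_df <- smooth.spline(x, y, df = 6)

# Compare what each approach settled on
cat("CV fit: df =", round(fit_cv$df, 2), " lambda =", signif(fit_cv$lambda, 3), "\n")
cat("df fit: df =", round(fit_df$df, 2), " lambda =", signif(fit_df$lambda, 3), "\n")

plot(x, y, col = "darkgrey")
lines(fit_cv, col = "purple", lwd = 2)
lines(fit_df, col = "orange", lwd = 2, lty = 2)
```

A larger lambda means a smoother curve and fewer effective degrees of freedom, so the two parameters are two views of the same trade-off.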
Generalized additive models are useful for modeling several non-linear terms at once. In this model, we use a smoothed term for the continuous independent variable logincome, a linear term for population, and a linear term for the factor variable agency name (agency_abbr). The plot shows the fitted contribution of each variable along with its standard error.
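library(gam)
gam1=gam(logloanamt~s(logincome,df=4)+population+agency_abbr,data=ver3) #formula taken from Model 1 in the deviance table
par(mfrow=c(1,3))
plot(gam1,se=TRUE)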
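The per-term view that the plot gives can also be computed by hand. The sketch below is not the `gam` package workflow: as a stand-in it uses natural splines (`ns()`) inside `lm()`, which is also an additive model, on simulated data with hypothetical variables `x1`, `x2` and `g`. It extracts each term's contribution and standard error with `predict(type = "terms")`.

```r
library(splines)

set.seed(3)
n <- 500
# Simulated predictors: hypothetical stand-ins for a smooth, a linear and a factor term
d <- data.frame(
  x1 = runif(n, 0, 10),
  x2 = runif(n, 0, 5),
  g  = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
d$y <- sin(d$x1) + 0.5 * d$x2 + 0.3 * as.numeric(d$g) + rnorm(n, sd = 0.3)

# Additive fit: smooth term for x1, linear terms for x2 and the factor g
add_fit <- lm(y ~ ns(x1, df = 4) + x2 + g, data = d)

# One column of fitted contributions (and standard errors) per model term
pt <- predict(add_fit, type = "terms", se.fit = TRUE)

# Plot the smooth term's contribution with approximate +/- 2 SE bands
ord <- order(d$x1)
f1  <- pt$fit[ord, 1]
se1 <- pt$se.fit[ord, 1]
plot(d$x1[ord], f1, type = "l", xlab = "x1", ylab = "partial effect of x1",
     ylim = range(f1 - 2 * se1, f1 + 2 * se1))
lines(d$x1[ord], f1 + 2 * se1, lty = 2)
lines(d$x1[ord], f1 - 2 * se1, lty = 2)
```

Because the model is additive, each term can be inspected on its own while the others are held fixed, which is what makes these plots easy to read.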
Now, let’s build a generalized additive model using a smoothing term for the population variable and compare it to the previous model. We’ll use anova to check whether the variable needs to be smoothed. The p-value is 0.004, which suggests that a non-linear (smoothing) term for population is needed.
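gam3=gam(logloanamt~s(logincome,df=4)+s(population,df=4)+agency_abbr,data=ver3) #formula taken from Model 2 in the deviance table
par(mfrow=c(1,3))
plot(gam3,se=TRUE)
anova(gam1,gam3,test="Chisq")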
## Analysis of Deviance Table
##
## Model 1: logloanamt ~ s(logincome, df = 4) + population + agency_abbr
## Model 2: logloanamt ~ s(logincome, df = 4) + s(population, df = 4) + agency_abbr
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     34562      13764
## 2     34559      13759  3     5.31    0.004 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
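The logic of this nested-model comparison can be reproduced on simulated data. The sketch below again swaps in natural splines inside `lm()` rather than `gam()` with `s()`, so the test reported is an F-test rather than the chi-squared test above. The variables are hypothetical, and `x2` is given a non-linear effect by construction so that the test should favour the smoothed model.

```r
library(splines)

set.seed(4)
n <- 500
d <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 5))
# x2 enters the response non-linearly by construction
d$y <- sin(d$x1) + (d$x2 - 2.5)^2 + rnorm(n, sd = 0.5)

# Model 1: x2 enters linearly; Model 2: x2 gets a 4-df natural spline
m_linear <- lm(y ~ ns(x1, df = 4) + x2, data = d)
m_smooth <- lm(y ~ ns(x1, df = 4) + ns(x2, df = 4), data = d)

# The linear model is nested in the spline model, so anova() gives a valid test
cmp <- anova(m_linear, m_smooth)
p_value <- cmp[["Pr(>F)"]][2]
print(cmp)
```

The smoothed model spends three extra degrees of freedom on `x2`, the same difference as in the deviance table above; a small p-value says those extra degrees of freedom reduce the residual error by more than chance would explain.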