Machine Learning with R

HMDA Case Study Series. Non-linear Models, Part 1: Data Transformations and Polynomial Terms

In this analysis, we account for non-linearities by exploring several models, including polynomial regression. Splines and generalized additive models will be examined in part two of this exploration.

In previous posts, logarithmic transformations were applied to skewed data to improve the interpretability of the models. Here, we work with the raw data, illustrate the use of polynomial terms, and compare the results to a baseline model.

This first model introduces polynomial terms to account for non-linearities. The output indicates that all four polynomial terms of applicant income are significant, and the R-squared is roughly 28%.
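The summary below appears to come from a fit like the following. The `ver4` data frame and column names are taken from the `Call:` line of the output; since the HMDA extract itself is not included in this excerpt, the sketch simulates stand-in data when `ver4` is not already loaded.

```r
# ver4 is the HMDA extract used in this series; if it isn't already
# in the workspace, simulate illustrative stand-in data so this runs.
if (!exists("ver4")) {
  set.seed(1)
  ver4 <- data.frame(applicant_income_000s = runif(2000, 10, 500))
  ver4$loan_amount_000s <- 50 + 0.6 * ver4$applicant_income_000s +
    rnorm(2000, sd = 40)
}

# Degree-4 orthogonal polynomial of applicant income (poly() defaults
# to orthogonal terms, which is why the coefficients below are not on
# the raw income scale)
fit <- lm(loan_amount_000s ~ poly(applicant_income_000s, 4), data = ver4)
summary(fit)
```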

## 
## Call:
## lm(formula = loan_amount_000s ~ poly(applicant_income_000s, 4), 
##     data = ver4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -637.6  -56.4   -5.2   47.4 2285.3 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       178.623      0.532   335.6   <2e-16 ***
## poly(applicant_income_000s, 4)1  9790.582     98.971    98.9   <2e-16 ***
## poly(applicant_income_000s, 4)2 -4486.413     98.971   -45.3   <2e-16 ***
## poly(applicant_income_000s, 4)3  2191.142     98.971    22.1   <2e-16 ***
## poly(applicant_income_000s, 4)4 -3073.708     98.971   -31.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99 on 34568 degrees of freedom
## Multiple R-squared:  0.278,  Adjusted R-squared:  0.278 
## F-statistic: 3.32e+03 on 4 and 34568 DF,  p-value: <2e-16

Adding polynomial terms improves on the 20% R-squared of the simple, non-polynomial regression model.
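For reference, the baseline can be fit as follows (same `ver4` assumption as above; `fit2` is a hypothetical name for the baseline fit).

```r
# Simulate stand-in data if the HMDA extract isn't loaded
if (!exists("ver4")) {
  set.seed(1)
  ver4 <- data.frame(applicant_income_000s = runif(2000, 10, 500))
  ver4$loan_amount_000s <- 50 + 0.6 * ver4$applicant_income_000s +
    rnorm(2000, sd = 40)
}

# Baseline: simple linear regression on raw income
fit2 <- lm(loan_amount_000s ~ applicant_income_000s, data = ver4)
summary(fit2)
```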

## 
## Call:
## lm(formula = loan_amount_000s ~ applicant_income_000s, data = ver4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3032.7   -61.1   -10.4    47.6  2164.2 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.42e+02   6.81e-01   208.4   <2e-16 ***
## applicant_income_000s 3.31e-01   3.51e-03    94.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 104 on 34571 degrees of freedom
## Multiple R-squared:  0.204,  Adjusted R-squared:  0.204 
## F-statistic: 8.88e+03 on 1 and 34571 DF,  p-value: <2e-16

The ANOVA also indicates that the polynomial fit is a more appropriate model in this case than the simple linear model on the raw numeric data.
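The comparison below can be reproduced with `anova()` on the two nested fits; both models are refit here so the snippet stands alone (stand-in data is simulated if the HMDA extract is absent).

```r
if (!exists("ver4")) {
  set.seed(1)
  ver4 <- data.frame(applicant_income_000s = runif(2000, 10, 500))
  ver4$loan_amount_000s <- 50 + 0.6 * ver4$applicant_income_000s +
    rnorm(2000, sd = 40)
}

fit  <- lm(loan_amount_000s ~ poly(applicant_income_000s, 4), data = ver4)
fit2 <- lm(loan_amount_000s ~ applicant_income_000s, data = ver4)

# F-test comparing the nested models; the order matches the
# table below (Model 1 = polynomial, Model 2 = simple linear)
anova(fit, fit2)
```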

## Analysis of Variance Table
## 
## Model 1: loan_amount_000s ~ poly(applicant_income_000s, 4)
## Model 2: loan_amount_000s ~ applicant_income_000s
##   Res.Df      RSS Df Sum of Sq    F Pr(>F)    
## 1  34568 3.39e+08                             
## 2  34571 3.73e+08 -3 -34376689 1170 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

On the log-transformed data, the non-polynomial model yields an R-squared of 18.1%, whereas the polynomial model yields 18.6%. So, although adding polynomial terms to the log-transformed data barely moves that metric, it still captures non-linearities in the relationship between income and loan amount. The point is to illustrate the importance of being aware of how data transformations are applied and how they affect the resulting models.
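The two log-scale fits summarized below can be sketched as follows. The `ver3` data frame and its `logincome`/`logloanamt` columns come from the `Call:` lines; the synthetic fallback is purely illustrative.

```r
# ver3 holds the log-transformed variables; simulate a stand-in if absent
if (!exists("ver3")) {
  set.seed(2)
  ver3 <- data.frame(logincome = rnorm(2000, 4, 0.6))
  ver3$logloanamt <- 3 + 0.45 * ver3$logincome + rnorm(2000, sd = 0.65)
}

fit_log  <- lm(logloanamt ~ logincome, data = ver3)           # simple
fit_log4 <- lm(logloanamt ~ poly(logincome, 4), data = ver3)  # polynomial

# Compare the two R-squared values
c(simple = summary(fit_log)$r.squared,
  poly   = summary(fit_log4)$r.squared)
```

Note that the original call referenced the column as `ver3$logincome` inside the formula; naming the column directly, as here, is cleaner and lets `predict()` work correctly on new data.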

## 
## Call:
## lm(formula = logloanamt ~ logincome, data = ver3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.180 -0.245  0.131  0.408  2.980 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.03421    0.02251   134.8   <2e-16 ***
## logincome    0.44085    0.00504    87.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.653 on 34571 degrees of freedom
## Multiple R-squared:  0.181,  Adjusted R-squared:  0.181 
## F-statistic: 7.66e+03 on 1 and 34571 DF,  p-value: <2e-16
## 
## Call:
## lm(formula = logloanamt ~ poly(ver3$logincome, 4), data = ver3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.205 -0.243  0.132  0.407  3.039 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.9799     0.0035 1421.20  < 2e-16 ***
## poly(ver3$logincome, 4)1  57.1699     0.6515   87.75  < 2e-16 ***
## poly(ver3$logincome, 4)2  -0.8644     0.6515   -1.33     0.18    
## poly(ver3$logincome, 4)3  -7.5381     0.6515  -11.57  < 2e-16 ***
## poly(ver3$logincome, 4)4   5.3418     0.6515    8.20  2.5e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.652 on 34568 degrees of freedom
## Multiple R-squared:  0.186,  Adjusted R-squared:  0.186 
## F-statistic: 1.98e+03 on 4 and 34568 DF,  p-value: <2e-16

Although the models with polynomial terms may seem “better” because of their higher R-squared, that is not necessarily the case. The plots below indicate that the log-transformed data gives a better picture of the nature of the data: it removes the skewness introduced by outliers, and there are fewer extreme income values. The raw-data plot would lead one to believe there is a sharp drop in loan amount at higher incomes, but that pattern does not account for outliers, possible data errors, and the sparsity of observations in the fourth quartile. Using a variety of metrics and visuals leads to a better understanding of the data than relying on a single number.
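The grids `income.grid1` and `income.grid` used in the plotting code below are not defined in this excerpt; a plausible construction, keeping those names, is an evenly spaced sequence spanning the observed range of each income variable.

```r
# ver4/ver3 are assumed loaded; simulate stand-ins if absent so this runs
if (!exists("ver4")) {
  set.seed(1)
  ver4 <- data.frame(applicant_income_000s = runif(2000, 10, 500))
}
if (!exists("ver3")) {
  set.seed(2)
  ver3 <- data.frame(logincome = rnorm(2000, 4, 0.6))
}

# Evenly spaced grids over the observed ranges, for plotting fitted curves
income.grid1 <- seq(min(ver4$applicant_income_000s, na.rm = TRUE),
                    max(ver4$applicant_income_000s, na.rm = TRUE),
                    length.out = 100)
income.grid  <- seq(min(ver3$logincome, na.rm = TRUE),
                    max(ver3$logincome, na.rm = TRUE),
                    length.out = 100)
```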


# Predict over the income grid and build approximate ±2 SE bands
preds <- predict(fit, newdata = list(applicant_income_000s = income.grid1),
                 se.fit = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)

plot(ver4$applicant_income_000s, ver4$loan_amount_000s, col = "darkgrey")
lines(income.grid1, preds$fit, lwd = 2, col = "blue")
matlines(income.grid1, se.bands, col = "blue", lty = 2)

[Plot: raw loan amount vs. applicant income, with the degree-4 polynomial fit (solid blue) and ±2 SE bands (dashed)]


# Same plot on the log scale; here fit is the model on the log-transformed data
preds <- predict(fit, newdata = list(logincome = income.grid), se.fit = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)

plot(ver3$logincome, ver3$logloanamt, col = "darkgrey")
lines(income.grid, preds$fit, lwd = 2, col = "blue")
matlines(income.grid, se.bands, col = "blue", lty = 2)

[Plot: log loan amount vs. log income, with the polynomial fit (solid blue) and ±2 SE bands (dashed)]