Machine Learning with R

HMDA Case Study Series: Model Selection with Ensemble Methods

In a previous post, we analyzed models for a binary response variable. Continuing our exploration of the 2012 Vermont loan data, we are now interested in the predictors of loan amount. Let’s use ensemble methods to improve the accuracy of our regression models.

We will build gradient boosted tree and random forest models for a continuous dependent variable and use their results to iterate on and improve our initial regression models.

Boosting can be used with many machine learning algorithms and is often applied to decision trees. Trees are grown sequentially, each one using information from the trees grown before it. The idea of boosting is to learn slowly: instead of accepting one large tree fit in a single pass, each tree’s contribution is shrunk by a learning rate so the model improves gradually.
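The relative influence output below is the kind produced by the gbm package. Here is a minimal sketch of such a fit; the tuning values (n.trees, interaction.depth, shrinkage) are illustrative assumptions, not necessarily the settings behind the output shown.

```r
library(gbm)

# Boosted regression trees for a continuous response (squared-error loss).
# shrinkage is the learning rate: each tree's contribution is scaled back
# so the ensemble learns slowly over many trees.
set.seed(1)
boost_ver <- gbm(loan_amount_000s ~ ., data = ver,
                 distribution = "gaussian",
                 n.trees = 5000,          # illustrative
                 interaction.depth = 4,   # illustrative
                 shrinkage = 0.01)        # illustrative learning rate

# Relative influence of each predictor, as in the table below
summary(boost_ver)
```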


##                                                           var   rel.inf
## lien_status_name                             lien_status_name 49.859217
## logincome                                           logincome 26.268678
## loan_purpose_name                           loan_purpose_name  4.912682
## county_name                                       county_name  3.545654
## agency_abbr                                       agency_abbr  2.923222
## logmsamd_income                               logmsamd_income  2.574236
## purchaser_type_name                       purchaser_type_name  2.514350
## property_type_name                         property_type_name  2.352586
## action_taken_name                           action_taken_name  1.903195
## owner_occupancy_name                     owner_occupancy_name  1.041718
## census_tract_number                       census_tract_number  0.645231
## loan_type_name                                 loan_type_name  0.558067
## number_of_1_to_4_family_units   number_of_1_to_4_family_units  0.304160
## number_of_owner_occupied_units number_of_owner_occupied_units  0.187416
## minority_population                       minority_population  0.161291
## applicant_sex_name                         applicant_sex_name  0.086749
## co_applicant_race_name_1             co_applicant_race_name_1  0.050009
## co_applicant_ethnicity_name       co_applicant_ethnicity_name  0.034711
## population                                         population  0.024882
## co_applicant_sex_name                   co_applicant_sex_name  0.023282
## applicant_ethnicity_name             applicant_ethnicity_name  0.022449
## applicant_race_name_1                   applicant_race_name_1  0.004311
## preapproval_name                             preapproval_name  0.001902
## hoepa_status_name                           hoepa_status_name  0.000000
## loghud_income                                   loghud_income  0.000000

(Figure: variable importance plot from the boosted model)

The variable importance plot shows that lien status and income are the top predictors of loan amount. We will build a regression model with these two independent variables as polynomial terms. R-squared increases from 28% in the model with income alone to 34% in the new model, and the higher-order lien status terms are statistically significant. In this way, various methods can be used together to improve our models.
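The two model summaries that follow can be reproduced with calls like these. The formulas are taken from the Call lines in the output (poly() assumes lien_status is stored as a numeric code); the anova() comparison is our own addition rather than part of the original output.

```r
# Income-only baseline and the model augmented with lien status,
# both using orthogonal polynomial terms via poly()
fit_income <- lm(loan_amount_000s ~ poly(applicant_income_000s, 4),
                 data = ver)
fit_both   <- lm(loan_amount_000s ~ poly(applicant_income_000s, 4) +
                   poly(lien_status, 3),
                 data = ver)

summary(fit_both)    # R-squared rises to about 0.34
summary(fit_income)  # about 0.28 with income alone

# Added suggestion: a formal test of the extra lien status terms
anova(fit_income, fit_both)
```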

## 
## Call:
## lm(formula = loan_amount_000s ~ poly(applicant_income_000s, 4) + 
##     poly(lien_status, 3), data = ver)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -647.2  -53.4   -6.7   41.7 2284.7 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       178.623      0.509  351.03   <2e-16 ***
## poly(applicant_income_000s, 4)1  9614.720     94.670  101.56   <2e-16 ***
## poly(applicant_income_000s, 4)2 -4360.689     94.646  -46.07   <2e-16 ***
## poly(applicant_income_000s, 4)3  2063.586     94.651   21.80   <2e-16 ***
## poly(applicant_income_000s, 4)4 -2960.623     94.648  -31.28   <2e-16 ***
## poly(lien_status, 3)1            -105.541     94.624   -1.12    0.265    
## poly(lien_status, 3)2            5399.512     94.734   57.00   <2e-16 ***
## poly(lien_status, 3)3             280.478     94.640    2.96    0.003 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94.6 on 34565 degrees of freedom
## Multiple R-squared:  0.34,   Adjusted R-squared:  0.34 
## F-statistic: 2.54e+03 on 7 and 34565 DF,  p-value: <2e-16
## 
## Call:
## lm(formula = loan_amount_000s ~ poly(applicant_income_000s, 4), 
##     data = ver)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -637.6  -56.4   -5.2   47.4 2285.3 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       178.623      0.532   335.6   <2e-16 ***
## poly(applicant_income_000s, 4)1  9790.582     98.971    98.9   <2e-16 ***
## poly(applicant_income_000s, 4)2 -4486.413     98.971   -45.3   <2e-16 ***
## poly(applicant_income_000s, 4)3  2191.142     98.971    22.1   <2e-16 ***
## poly(applicant_income_000s, 4)4 -3073.708     98.971   -31.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99 on 34568 degrees of freedom
## Multiple R-squared:  0.278,  Adjusted R-squared:  0.278 
## F-statistic: 3.32e+03 on 4 and 34568 DF,  p-value: <2e-16

Now, we’ll construct a random forest model and see whether it returns the same variable importances. The top two variables are again income and lien status: logincome has the largest %IncMSE, while lien_status_name dominates node purity. This agreement between the two ensembles, together with the regression results above, supports income and lien status as the key predictors of loan amount.
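A minimal sketch of such a fit, assuming the randomForest package; importance = TRUE is required to obtain the permutation importance (%IncMSE) reported below, and the settings are illustrative.

```r
library(randomForest)

# Random forest regression; importance = TRUE adds the permutation
# measure (%IncMSE) to the default node-purity measure.
set.seed(1)
rf_ver <- randomForest(loan_amount_000s ~ ., data = ver,
                       ntree = 500,        # illustrative
                       importance = TRUE)

importance(rf_ver)  # %IncMSE and IncNodePurity, as in the table below
varImpPlot(rf_ver)  # the importance plot
```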

## [1] 0.2071
##                                %IncMSE IncNodePurity
## action_taken_name                8.300      348.2948
## agency_abbr                     25.169      540.1711
## applicant_ethnicity_name         6.528       70.9537
## applicant_race_name_1            7.638       84.2079
## applicant_sex_name              14.369      153.4181
## census_tract_number             21.906      379.6956
## co_applicant_ethnicity_name      7.987      114.3612
## co_applicant_race_name_1         7.566      128.6287
## co_applicant_sex_name            5.815      160.2106
## county_name                     28.458      483.5115
## hoepa_status_name                0.000        0.8029
## lien_status_name                44.811     2793.7072
## loan_purpose_name               22.414      946.7152
## loan_type_name                  13.499       76.3697
## owner_occupancy_name            19.548      134.4479
## preapproval_name                 5.699       55.1664
## property_type_name              20.443      219.8124
## purchaser_type_name             19.468      498.3647
## number_of_1_to_4_family_units   19.287      321.7416
## number_of_owner_occupied_units  21.443      320.1651
## minority_population             15.304      302.5329
## population                      22.119      302.8017
## logincome                       60.369     2080.9986
## logmsamd_income                 22.526      527.2664
## loghud_income                    8.346       56.4375

(Figure: random forest variable importance plot)

In bagging, repeated bootstrap samples are drawn from the training set, a tree is grown on each bootstrapped sample, and the trees’ predictions are averaged. In a random forest, each split of each tree considers only a random subset of the predictors, whereas bagging considers the full set at every split. Random forests improve on bagging by decorrelating the trees, which reduces the variance of the averaged prediction. The sketch below shows how small the difference is in code.
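Because bagging is simply a random forest that considers every predictor at each split, the difference comes down to a single argument, mtry. This sketch assumes ver contains the response plus the predictors:

```r
# Bagging: all predictors are eligible at every split (mtry = p)
bag_ver <- randomForest(loan_amount_000s ~ ., data = ver,
                        mtry = ncol(ver) - 1)

# Random forest: the default mtry for regression is p/3, which
# decorrelates the trees before their predictions are averaged
rf_ver  <- randomForest(loan_amount_000s ~ ., data = ver)
```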