In a previous post, models for a binary response variable were analyzed. In continuing our exploration of the Vermont loan data for 2012, we are interested in looking at predictors of loan amount. Let’s use ensemble methods to improve the accuracy of our regression models.
We will build gradient boosted trees and random forest models for a continuous dependent variable and use them together to iterate on and improve the initial models.
Boosting can be combined with many machine learning algorithms and is most often used with decision trees. Trees are grown sequentially, each one fit using information from the trees grown before it. The idea of boosting is to learn slowly: rather than accepting one large tree, each new tree is shrunk by a small learning rate before being added to the ensemble.
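A boosted model along these lines can be fit with the `gbm` package. The sketch below is illustrative, not the exact call behind the output that follows: the data frame `ver`, the shrinkage value, and the tree settings are assumptions.

```r
library(gbm)

# Boosted regression trees; distribution = "gaussian" for a continuous
# response. A small shrinkage value makes the model learn slowly.
set.seed(1)
boost_ver <- gbm(loan_amount_000s ~ ., data = ver,
                 distribution = "gaussian",
                 n.trees = 5000,
                 interaction.depth = 4,
                 shrinkage = 0.01)

# Relative influence of each predictor, as in the table below
summary(boost_ver)
```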
## var rel.inf
## lien_status_name lien_status_name 49.859217
## logincome logincome 26.268678
## loan_purpose_name loan_purpose_name 4.912682
## county_name county_name 3.545654
## agency_abbr agency_abbr 2.923222
## logmsamd_income logmsamd_income 2.574236
## purchaser_type_name purchaser_type_name 2.514350
## property_type_name property_type_name 2.352586
## action_taken_name action_taken_name 1.903195
## owner_occupancy_name owner_occupancy_name 1.041718
## census_tract_number census_tract_number 0.645231
## loan_type_name loan_type_name 0.558067
## number_of_1_to_4_family_units number_of_1_to_4_family_units 0.304160
## number_of_owner_occupied_units number_of_owner_occupied_units 0.187416
## minority_population minority_population 0.161291
## applicant_sex_name applicant_sex_name 0.086749
## co_applicant_race_name_1 co_applicant_race_name_1 0.050009
## co_applicant_ethnicity_name co_applicant_ethnicity_name 0.034711
## population population 0.024882
## co_applicant_sex_name co_applicant_sex_name 0.023282
## applicant_ethnicity_name applicant_ethnicity_name 0.022449
## applicant_race_name_1 applicant_race_name_1 0.004311
## preapproval_name preapproval_name 0.001902
## hoepa_status_name hoepa_status_name 0.000000
## loghud_income loghud_income 0.000000
The variable importance output shows that lien status and income are the top predictors of loan amount. We will build a regression model with these two independent variables as polynomial terms. The R-squared increases to 34% in the new model from 28% in the income-only model, and the output indicates that the higher-order lien status terms are statistically significant. In this way, an ensemble method can guide which predictors to carry into a simpler parametric model.
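The two regressions summarized below can be reproduced roughly as follows (a sketch; it assumes `lien_status` is stored as a numeric code so that `poly()` applies to it):

```r
# Polynomial regression of loan amount on income and lien status
fit_both   <- lm(loan_amount_000s ~ poly(applicant_income_000s, 4) +
                   poly(lien_status, 3), data = ver)

# Baseline model with income only, for comparison of R-squared
fit_income <- lm(loan_amount_000s ~ poly(applicant_income_000s, 4),
                 data = ver)

summary(fit_both)
summary(fit_income)
```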
##
## Call:
## lm(formula = loan_amount_000s ~ poly(applicant_income_000s, 4) +
## poly(lien_status, 3), data = ver)
##
## Residuals:
## Min 1Q Median 3Q Max
## -647.2 -53.4 -6.7 41.7 2284.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 178.623 0.509 351.03 <2e-16 ***
## poly(applicant_income_000s, 4)1 9614.720 94.670 101.56 <2e-16 ***
## poly(applicant_income_000s, 4)2 -4360.689 94.646 -46.07 <2e-16 ***
## poly(applicant_income_000s, 4)3 2063.586 94.651 21.80 <2e-16 ***
## poly(applicant_income_000s, 4)4 -2960.623 94.648 -31.28 <2e-16 ***
## poly(lien_status, 3)1 -105.541 94.624 -1.12 0.265
## poly(lien_status, 3)2 5399.512 94.734 57.00 <2e-16 ***
## poly(lien_status, 3)3 280.478 94.640 2.96 0.003 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 94.6 on 34565 degrees of freedom
## Multiple R-squared: 0.34, Adjusted R-squared: 0.34
## F-statistic: 2.54e+03 on 7 and 34565 DF, p-value: <2e-16
##
## Call:
## lm(formula = loan_amount_000s ~ poly(applicant_income_000s, 4),
## data = ver)
##
## Residuals:
## Min 1Q Median 3Q Max
## -637.6 -56.4 -5.2 47.4 2285.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 178.623 0.532 335.6 <2e-16 ***
## poly(applicant_income_000s, 4)1 9790.582 98.971 98.9 <2e-16 ***
## poly(applicant_income_000s, 4)2 -4486.413 98.971 -45.3 <2e-16 ***
## poly(applicant_income_000s, 4)3 2191.142 98.971 22.1 <2e-16 ***
## poly(applicant_income_000s, 4)4 -3073.708 98.971 -31.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99 on 34568 degrees of freedom
## Multiple R-squared: 0.278, Adjusted R-squared: 0.278
## F-statistic: 3.32e+03 on 4 and 34568 DF, p-value: <2e-16
Now we'll construct a random forest model and see whether it returns the same variable importances. It does: income and lien status are again the two most important predictors, with income ranked first by %IncMSE. The two ensemble methods agree on which variables drive loan amount.
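A random forest along these lines can be fit with the `randomForest` package. This is a sketch; the `ntree` and `mtry` values used for the output below are not shown in the original:

```r
library(randomForest)

# importance = TRUE computes permutation importance (%IncMSE)
# in addition to the default node-purity measure
set.seed(1)
rf_ver <- randomForest(loan_amount_000s ~ ., data = ver,
                       importance = TRUE)

importance(rf_ver)   # %IncMSE and IncNodePurity, as in the table below
varImpPlot(rf_ver)   # graphical version of the same importances
```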
## [1] 0.2071
## %IncMSE IncNodePurity
## action_taken_name 8.300 348.2948
## agency_abbr 25.169 540.1711
## applicant_ethnicity_name 6.528 70.9537
## applicant_race_name_1 7.638 84.2079
## applicant_sex_name 14.369 153.4181
## census_tract_number 21.906 379.6956
## co_applicant_ethnicity_name 7.987 114.3612
## co_applicant_race_name_1 7.566 128.6287
## co_applicant_sex_name 5.815 160.2106
## county_name 28.458 483.5115
## hoepa_status_name 0.000 0.8029
## lien_status_name 44.811 2793.7072
## loan_purpose_name 22.414 946.7152
## loan_type_name 13.499 76.3697
## owner_occupancy_name 19.548 134.4479
## preapproval_name 5.699 55.1664
## property_type_name 20.443 219.8124
## purchaser_type_name 19.468 498.3647
## number_of_1_to_4_family_units 19.287 321.7416
## number_of_owner_occupied_units 21.443 320.1651
## minority_population 15.304 302.5329
## population 22.119 302.8017
## logincome 60.369 2080.9986
## logmsamd_income 22.526 527.2664
## loghud_income 8.346 56.4375
In bagging, bootstrapped training sets are generated by sampling with replacement from the training data; a tree is grown on each one and their predictions are averaged. A random forest differs in that each split of each tree considers only a random subset of the predictors, whereas bagging considers the full set. This decorrelates the trees, which reduces the variance of the averaged prediction, so random forests typically improve over bagging.
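In the `randomForest` package the only difference between the two is `mtry`, the number of predictors tried at each split: setting it to the full number of predictors gives bagging, while the regression default is about one third of them. A sketch (assuming the response is the only non-predictor column in `ver`):

```r
library(randomForest)

p <- ncol(ver) - 1   # number of predictors

# Bagging: every split considers all p predictors
bag_ver <- randomForest(loan_amount_000s ~ ., data = ver, mtry = p)

# Random forest: each split considers a random subset
# (default mtry is roughly p/3 for regression)
rf_ver <- randomForest(loan_amount_000s ~ ., data = ver)
```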