This is the first in a series of posts on applied statistical methods and predictive machine learning on home mortgage disclosure data from Vermont for 2012. The aim is to explore various models and gain insights from the data.
Data on loan applications and originations has been collected under the Home Mortgage Disclosure Act (HMDA) since 1975. Recently, this data has been made available by the Consumer Financial Protection Bureau. The database provides the detailed loan information that HMDA requires lenders to report. In this analysis, we use statistical methods to answer research questions and explore hypotheses. We are interested in applications that were originated versus those that were denied, and in comparing several modeling methods: logistic regression, classification trees, and random forests.
The first research question looks at applications that were originated versus those that were denied by the financial institution. In the total dataset, 63% of loan applications were originated. The following table shows the breakdown of action taken.
##
##       Application approved but not accepted   1421
## Application denied by financial institution   4894
##          Application withdrawn by applicant   2759
##              File closed for incompleteness    976
##                             Loan originated  21852
##           Loan purchased by the institution   2671
Of the subset of applications that were either originated or denied, 82% of applications were originated.
## [1] 0.817
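As a quick check, both percentages can be recomputed from the counts in the table. The snippet below is a minimal sketch in Python, not the code used for the post; the variable names are illustrative.

```python
# Counts copied from the action-taken table above.
action_counts = {
    "Application approved but not accepted": 1421,
    "Application denied by financial institution": 4894,
    "Application withdrawn by applicant": 2759,
    "File closed for incompleteness": 976,
    "Loan originated": 21852,
    "Loan purchased by the institution": 2671,
}

originated = action_counts["Loan originated"]
denied = action_counts["Application denied by financial institution"]

# Share of all applications that were originated (about 63%).
share_of_all = originated / sum(action_counts.values())

# Share among applications that were either originated or denied (about 82%).
share_of_decided = originated / (originated + denied)

print(f"{share_of_all:.3f}")
print(f"{share_of_decided:.3f}")
```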
Since the research question concerns the binary outcome of whether a loan was originated or denied, logistic regression is a model that may yield interesting insight.
Let’s build a logistic regression model where action taken is the binary dependent variable (1 = originated, 0 = denied).
##
## Call:
## glm(formula = action_taken ~ ., family = binomial, data = vermontml2_action)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.972 0.000 0.000 0.564 2.463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.24e+03 4.69e+04 0.03 0.9789
## agency_code -1.61e-01 7.55e-03 -21.27 < 2e-16 ***
## applicant_ethnicity -3.29e-01 8.16e-02 -4.03 5.5e-05 ***
## applicant_race_1 2.04e-01 4.82e-02 4.23 2.3e-05 ***
## applicant_sex 3.08e-02 3.94e-02 0.78 0.4349
## census_tract_number -2.08e-03 3.69e-04 -5.65 1.6e-08 ***
## co_applicant_ethnicity -1.02e-01 8.82e-02 -1.16 0.2472
## co_applicant_race_1 -1.40e-01 6.70e-02 -2.09 0.0367 *
## co_applicant_sex 1.63e-01 5.70e-02 2.87 0.0041 **
## county_code -3.09e-03 2.82e-03 -1.10 0.2720
## hoepa_status -2.19e+01 2.34e+04 0.00 0.9993
## lien_status -6.31e-01 7.65e-02 -8.24 < 2e-16 ***
## loan_purpose -1.58e-01 3.12e-02 -5.07 4.1e-07 ***
## loan_type -6.13e-01 5.13e-02 -11.95 < 2e-16 ***
## owner_occupancy -2.20e-01 4.74e-02 -4.63 3.6e-06 ***
## preapproval -1.34e-01 7.29e-02 -1.83 0.0672 .
## property_type -1.93e-01 9.95e-02 -1.94 0.0525 .
## purchaser_type 1.88e+01 1.28e+02 0.15 0.8834
## number_of_1_to_4_family_units -1.30e-04 5.83e-05 -2.23 0.0261 *
## number_of_owner_occupied_units 1.13e-05 1.20e-04 0.09 0.9247
## minority_population 2.14e-02 9.30e-03 2.30 0.0215 *
## population 3.98e-05 3.30e-05 1.20 0.2284
## logloanamt -7.49e-01 3.71e-02 -20.19 < 2e-16 ***
## logincome 8.61e-01 3.63e-02 23.73 < 2e-16 ***
## logmsamd_income 7.08e-01 1.08e-01 6.54 6.1e-11 ***
## loghud_income -1.06e+02 1.91e+01 -5.57 2.6e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 25456 on 26745 degrees of freedom
## Residual deviance: 14715 on 26720 degrees of freedom
## AIC: 14767
##
## Number of Fisher Scoring iterations: 21
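One way to read the coefficients is to exponentiate them into odds ratios. The sketch below does this for three of the estimates reported above. It assumes the `log` prefix denotes a natural logarithm, which the output does not confirm, and the selection of coefficients is only for illustration.

```python
import math

# Estimates copied from the full-data coefficient table above.
coefficients = {
    "logincome": 0.861,
    "logloanamt": -0.749,
    "loan_type": -0.613,
}

# exp(beta) is the multiplicative change in the odds of origination
# for a one-unit increase in the predictor, holding the others fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.2f}")
```

A ratio above 1 (as for `logincome`) means higher values go with higher odds of origination, and a ratio below 1 (as for `logloanamt`) means lower odds.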
Using a threshold of 0.5, the accuracy of the logistic regression model on the full data is 86% (0.8554).
##
## FALSE TRUE
## 0 2275 2619
## 1 1249 20603
## [1] 0.8554
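The accuracy figure follows directly from the confusion matrix. The sketch below recomputes it in Python and adds sensitivity and specificity, which the post does not report, to show how the errors are distributed between the two classes.

```python
# Rows are the actual class (0 = denied, 1 = originated); columns are
# whether the predicted probability exceeded 0.5. Counts copied from above.
true_negative, false_positive = 2275, 2619
false_negative, true_positive = 1249, 20603

total = true_negative + false_positive + false_negative + true_positive

accuracy = (true_negative + true_positive) / total
sensitivity = true_positive / (true_positive + false_negative)  # originated loans predicted as originated
specificity = true_negative / (true_negative + false_positive)  # denials predicted as denied

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```

With these counts the model identifies most originated loans but fewer than half of the denials, which overall accuracy alone does not show.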
Now, because overfitting can occur when predictions are run on the same data used to build the model, the data will be split into a training and testing set and the accuracy of the models will be compared to a baseline.
## action_taken agency_code applicant_ethnicity applicant_race_1
## Min. :0.000 Min. :1.00 Min. :1.00 Min. :1.00
## 1st Qu.:1.000 1st Qu.:3.00 1st Qu.:2.00 1st Qu.:5.00
## Median :1.000 Median :5.00 Median :2.00 Median :5.00
## Mean :0.817 Mean :5.61 Mean :2.11 Mean :5.06
## 3rd Qu.:1.000 3rd Qu.:9.00 3rd Qu.:2.00 3rd Qu.:5.00
## Max. :1.000 Max. :9.00 Max. :4.00 Max. :7.00
## applicant_sex census_tract_number co_applicant_ethnicity
## Min. :1.00 Min. : 1 Min. :1.00
## 1st Qu.:1.00 1st Qu.: 33 1st Qu.:2.00
## Median :1.00 Median :9532 Median :2.00
## Mean :1.45 Mean :5422 Mean :3.21
## 3rd Qu.:2.00 3rd Qu.:9607 3rd Qu.:5.00
## Max. :4.00 Max. :9713 Max. :5.00
## co_applicant_race_1 co_applicant_sex county_code hoepa_status
## Min. :1.00 Min. :1.00 Min. : 1 Min. :1
## 1st Qu.:5.00 1st Qu.:2.00 1st Qu.: 7 1st Qu.:2
## Median :5.00 Median :2.00 Median :11 Median :2
## Mean :6.18 Mean :3.05 Mean :13 Mean :2
## 3rd Qu.:8.00 3rd Qu.:5.00 3rd Qu.:21 3rd Qu.:2
## Max. :8.00 Max. :5.00 Max. :27 Max. :2
## lien_status loan_purpose loan_type owner_occupancy
## Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
## 1st Qu.:1.00 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:1.00
## Median :1.00 Median :3.00 Median :1.00 Median :1.00
## Mean :1.06 Mean :2.48 Mean :1.16 Mean :1.18
## 3rd Qu.:1.00 3rd Qu.:3.00 3rd Qu.:1.00 3rd Qu.:1.00
## Max. :3.00 Max. :3.00 Max. :4.00 Max. :3.00
## preapproval property_type purchaser_type number_of_1_to_4_family_units
## Min. :1.0 Min. :1.00 Min. :0.00 Min. : 140
## 1st Qu.:3.0 1st Qu.:1.00 1st Qu.:0.00 1st Qu.:1370
## Median :3.0 Median :1.00 Median :1.00 Median :1714
## Mean :2.9 Mean :1.03 Mean :1.93 Mean :1819
## 3rd Qu.:3.0 3rd Qu.:1.00 3rd Qu.:3.00 3rd Qu.:2341
## Max. :3.0 Max. :2.00 Max. :9.00 Max. :3311
## number_of_owner_occupied_units minority_population population
## Min. : 53 Min. : 2.11 Min. : 299
## 1st Qu.: 857 1st Qu.: 3.53 1st Qu.:2855
## Median :1204 Median : 4.38 Median :4054
## Mean :1250 Mean : 5.50 Mean :4154
## 3rd Qu.:1532 3rd Qu.: 6.59 3rd Qu.:5094
## Max. :2874 Max. :23.92 Max. :8698
## logloanamt logincome logmsamd_income loghud_income
## Min. :0.00 Min. :0.00 Min. :3.56 Min. :11.1
## 1st Qu.:4.65 1st Qu.:3.97 1st Qu.:4.52 1st Qu.:11.1
## Median :5.06 Median :4.37 Median :4.64 Median :11.1
## Mean :4.95 Mean :4.40 Mean :4.63 Mean :11.1
## 3rd Qu.:5.40 3rd Qu.:4.78 3rd Qu.:4.77 3rd Qu.:11.2
## Max. :7.97 Max. :9.21 Max. :5.32 Max. :11.2
The baseline accuracy is 82%: always predicting the most frequent outcome, that the loan is originated, is correct for 82% of applications.
##
## 0 1
## 0.183 0.817
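The post does not show how the split was made. The row counts in the output (18,722 training rows implied by the null degrees of freedom of the training model, and 8,024 test rows) are consistent with a 70/30 split stratified on the outcome, so the sketch below assumes that. The function name `stratified_split` and the seed are hypothetical.

```python
import random


def stratified_split(labels, train_fraction, seed):
    """Return (train_idx, test_idx), sampling train_fraction of each class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx


# Class counts from the post: 4894 denied (0) and 21852 originated (1).
labels = [0] * 4894 + [1] * 21852
train_idx, test_idx = stratified_split(labels, train_fraction=0.7, seed=2012)

# Baseline: always predict the most frequent class in the test set.
test_labels = [labels[i] for i in test_idx]
baseline = max(test_labels.count(0), test_labels.count(1)) / len(test_labels)
print(len(train_idx), len(test_idx), round(baseline, 3))
```

Stratifying keeps the share of originated loans the same in both sets, so the baseline on the test set matches the 0.817 seen in the full data.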
On the test set, logistic regression has an accuracy of 85%, which is an improvement over the baseline.
##
## Call:
## glm(formula = action_taken ~ ., family = binomial, data = Train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.672 0.000 0.000 0.555 2.385
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.29e+03 5.48e+04 0.02 0.98115
## agency_code -1.70e-01 9.08e-03 -18.67 < 2e-16 ***
## applicant_ethnicity -2.69e-01 9.61e-02 -2.80 0.00506 **
## applicant_race_1 1.76e-01 5.50e-02 3.20 0.00139 **
## applicant_sex 1.23e-03 4.72e-02 0.03 0.97922
## census_tract_number -2.18e-03 4.45e-04 -4.90 9.8e-07 ***
## co_applicant_ethnicity -1.93e-01 1.03e-01 -1.88 0.06071 .
## co_applicant_race_1 -3.21e-02 7.66e-02 -0.42 0.67494
## co_applicant_sex 1.48e-01 6.86e-02 2.16 0.03066 *
## county_code -1.37e-03 3.38e-03 -0.41 0.68455
## hoepa_status -2.20e+01 2.74e+04 0.00 0.99936
## lien_status -6.65e-01 9.23e-02 -7.21 5.5e-13 ***
## loan_purpose -1.80e-01 3.75e-02 -4.79 1.7e-06 ***
## loan_type -6.28e-01 6.17e-02 -10.18 < 2e-16 ***
## owner_occupancy -2.11e-01 5.76e-02 -3.67 0.00024 ***
## preapproval -2.55e-02 8.68e-02 -0.29 0.76905
## property_type -2.52e-01 1.20e-01 -2.09 0.03625 *
## purchaser_type 1.89e+01 1.53e+02 0.12 0.90195
## number_of_1_to_4_family_units -1.39e-04 7.01e-05 -1.98 0.04785 *
## number_of_owner_occupied_units 1.98e-04 1.43e-04 1.39 0.16587
## minority_population 3.15e-02 1.13e-02 2.78 0.00549 **
## population -1.66e-05 3.95e-05 -0.42 0.67379
## logloanamt -7.76e-01 4.46e-02 -17.41 < 2e-16 ***
## logincome 8.55e-01 4.38e-02 19.51 < 2e-16 ***
## logmsamd_income 6.71e-01 1.29e-01 5.18 2.2e-07 ***
## loghud_income -1.11e+02 2.30e+01 -4.83 1.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17820 on 18721 degrees of freedom
## Residual deviance: 10197 on 18696 degrees of freedom
## AIC: 10249
##
## Number of Fisher Scoring iterations: 21
##
## FALSE TRUE
## 0 684 784
## 1 412 6144
Let’s fit a classification tree (CART) to the same data to see if accuracy can be improved over the logistic regression model. CART is more interpretable than logistic regression, which is an advantage.
The accuracy of the CART model on the test set is 86%, which is good. It seems that the CART model slightly improves both accuracy and interpretability in this case.
## PredictCART
## 0 1
## 0 846 622
## 1 517 6039
ROCR provides useful metrics on our classification tree model. The AUC is 90% for the CART model, which is very good. The ROC curve plots the false positive rate against the true positive rate, showing the trade-off between the two across the range of thresholds.
The plot indicates that accuracy is maximized at a threshold between 0.5 and 0.6; the accuracy is 0.858 at both of these thresholds. The accuracy decreases when a threshold above or below these values is used.
## [1] 0.8969
ROCR plot:
##
## FALSE TRUE
## 0 846 622
## 1 517 6039
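To make the two quantities concrete, the sketch below computes AUC and threshold-dependent accuracy by hand in Python. It does not use ROCR, and the labels and scores are made up for illustration; they are not the model's predictions.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


def accuracy_at(labels, scores, threshold):
    """Accuracy when scores above the threshold are classified as 1."""
    predictions = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Made-up labels and predicted probabilities, for illustration only.
toy_labels = [0, 0, 1, 0, 1, 1, 1, 1]
toy_scores = [0.10, 0.35, 0.40, 0.55, 0.60, 0.70, 0.85, 0.90]

print(auc(toy_labels, toy_scores))
for threshold in (0.3, 0.5, 0.7):
    print(threshold, accuracy_at(toy_labels, toy_scores, threshold))
```

AUC is a single number that does not depend on any threshold, while accuracy has to be evaluated at a chosen threshold, which is why the two are reported separately.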
Next, we build a random forest model on the training data using two predictor variables and evaluate it on the test set. The accuracy of the model is 82%, which is slightly less than the CART and logistic regression models and about the same as the baseline.
## PredictForest
## 0 1
## 0 38 1430
## 1 23 6533
## [1] 0.8189
Now, we’ll build a random forest model similar to the classification tree and logistic regression models. This model has an accuracy rate of 86%, which is on par with the CART model.
## PredictForestBrf
## 0 1
## 0 763 705
## 1 444 6112
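The four test-set confusion matrices can be put side by side. The sketch below recomputes accuracy for each model and adds specificity, which the post does not report. The model labels are just descriptive names chosen here.

```python
# Test-set confusion matrices copied from the outputs above, as
# (true_negative, false_positive, false_negative, true_positive).
models = {
    "logistic regression": (684, 784, 412, 6144),
    "CART": (846, 622, 517, 6039),
    "random forest, two predictors": (38, 1430, 23, 6533),
    "random forest, second model": (763, 705, 444, 6112),
}

results = {}
for name, (tn, fp, fn, tp) in models.items():
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    specificity = tn / (tn + fp)  # share of actual denials predicted as denied
    results[name] = (accuracy, specificity)
    print(f"{name:32s} accuracy={accuracy:.3f} specificity={specificity:.3f}")
```

The two-predictor forest predicts almost every application as originated, which is why its accuracy sits so close to the 82% baseline.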
Now, we’ll construct some metrics to produce a variable importance plot for the random forest model. The results confirm that applicant income is an important predictor of whether the loan is originated or denied.