Machine Learning with R

HMDA Case Study Series: Modeling a Binary Response Variable

This is the first in a series of posts on applied statistical methods and predictive machine learning on home mortgage disclosure data from Vermont for 2012. The aim is to explore various models and gain insights from the data.

Data on loan applications and originations has been collected under the Home Mortgage Disclosure Act (HMDA) since 1975. Recently, this data has been made available by the Consumer Financial Protection Bureau. The database provides detailed information on the loan data that HMDA requires lenders to report. In this analysis, statistical methods will be used to answer research questions and explore hypotheses. We are interested in applications that were originated versus those that were denied, and in comparing several modeling methods: logistic regression, classification tree, and random forest models.

The first research question looks at applications that were originated versus those that were denied by the financial institution. In the total dataset, 63% of loan applications were originated. The following table shows the breakdown of action taken.

## 
##       Application approved but not accepted 
##                                        1421 
## Application denied by financial institution 
##                                        4894 
##          Application withdrawn by applicant 
##                                        2759 
##              File closed for incompleteness 
##                                         976 
##                             Loan originated 
##                                       21852 
##           Loan purchased by the institution 
##                                        2671

Of the subset of applications that were either originated or denied, 82% of applications were originated.

## [1] 0.817
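Both percentages follow directly from the action-taken counts. As a minimal sketch in base R, the counts from the table can be re-entered by hand and the two shares recomputed (the vector name `action_counts` is only for illustration and is not from the original analysis):

```r
# Counts copied from the action-taken table above
action_counts <- c(
  "Application approved but not accepted"       = 1421,
  "Application denied by financial institution" = 4894,
  "Application withdrawn by applicant"          = 2759,
  "File closed for incompleteness"              = 976,
  "Loan originated"                             = 21852,
  "Loan purchased by the institution"           = 2671
)

# Share of all applications that were originated
share_all <- action_counts[["Loan originated"]] / sum(action_counts)

# Share originated among applications that were either originated or denied
kept <- action_counts[c("Loan originated",
                        "Application denied by financial institution")]
share_subset <- kept[["Loan originated"]] / sum(kept)

round(c(all = share_all, originated_or_denied = share_subset), 3)
```

The second share is the mean of the binary response used in the models that follow.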

Since the research question concerns a binary outcome (whether the loan was originated or denied), logistic regression is a model that may yield interesting insight.

Let’s build a logistic regression model where action taken is the binary dependent variable.

## 
## Call:
## glm(formula = action_taken ~ ., family = binomial, data = vermontml2_action)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.972   0.000   0.000   0.564   2.463  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     1.24e+03   4.69e+04    0.03   0.9789    
## agency_code                    -1.61e-01   7.55e-03  -21.27  < 2e-16 ***
## applicant_ethnicity            -3.29e-01   8.16e-02   -4.03  5.5e-05 ***
## applicant_race_1                2.04e-01   4.82e-02    4.23  2.3e-05 ***
## applicant_sex                   3.08e-02   3.94e-02    0.78   0.4349    
## census_tract_number            -2.08e-03   3.69e-04   -5.65  1.6e-08 ***
## co_applicant_ethnicity         -1.02e-01   8.82e-02   -1.16   0.2472    
## co_applicant_race_1            -1.40e-01   6.70e-02   -2.09   0.0367 *  
## co_applicant_sex                1.63e-01   5.70e-02    2.87   0.0041 ** 
## county_code                    -3.09e-03   2.82e-03   -1.10   0.2720    
## hoepa_status                   -2.19e+01   2.34e+04    0.00   0.9993    
## lien_status                    -6.31e-01   7.65e-02   -8.24  < 2e-16 ***
## loan_purpose                   -1.58e-01   3.12e-02   -5.07  4.1e-07 ***
## loan_type                      -6.13e-01   5.13e-02  -11.95  < 2e-16 ***
## owner_occupancy                -2.20e-01   4.74e-02   -4.63  3.6e-06 ***
## preapproval                    -1.34e-01   7.29e-02   -1.83   0.0672 .  
## property_type                  -1.93e-01   9.95e-02   -1.94   0.0525 .  
## purchaser_type                  1.88e+01   1.28e+02    0.15   0.8834    
## number_of_1_to_4_family_units  -1.30e-04   5.83e-05   -2.23   0.0261 *  
## number_of_owner_occupied_units  1.13e-05   1.20e-04    0.09   0.9247    
## minority_population             2.14e-02   9.30e-03    2.30   0.0215 *  
## population                      3.98e-05   3.30e-05    1.20   0.2284    
## logloanamt                     -7.49e-01   3.71e-02  -20.19  < 2e-16 ***
## logincome                       8.61e-01   3.63e-02   23.73  < 2e-16 ***
## logmsamd_income                 7.08e-01   1.08e-01    6.54  6.1e-11 ***
## loghud_income                  -1.06e+02   1.91e+01   -5.57  2.6e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25456  on 26745  degrees of freedom
## Residual deviance: 14715  on 26720  degrees of freedom
## AIC: 14767
## 
## Number of Fisher Scoring iterations: 21

Using a threshold of 0.5, the accuracy of the logistic regression model on the data used to fit it is 86%.

##    
##     FALSE  TRUE
##   0  2275  2619
##   1  1249 20603

## [1] 0.8554
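The accuracy figure is the share of cases on the diagonal of the confusion matrix. A short sketch in base R that re-enters the table above by hand and also breaks the result down by class (the names `conf_mat`, `sensitivity`, and `specificity` are illustrative, not from the original code):

```r
# Confusion matrix copied from the output above.
# Rows: actual action_taken (0 = denied, 1 = originated).
# Columns: whether the predicted probability exceeded 0.5.
# matrix() fills column by column, so the first two numbers are the FALSE column.
conf_mat <- matrix(
  c(2275, 1249, 2619, 20603),
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)            # correct / all
sensitivity <- conf_mat["1", "TRUE"]  / sum(conf_mat["1", ])  # originated loans identified
specificity <- conf_mat["0", "FALSE"] / sum(conf_mat["0", ])  # denials identified

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```

Splitting accuracy this way shows that the model identifies fewer than half of the denied applications at this threshold, which the single accuracy number hides.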

Because overfitting can go undetected when predictions are run on the same data used to build the model, the data will be split into training and testing sets, and the accuracy of each model will be compared to a baseline.

##   action_taken    agency_code   applicant_ethnicity applicant_race_1
##  Min.   :0.000   Min.   :1.00   Min.   :1.00        Min.   :1.00    
##  1st Qu.:1.000   1st Qu.:3.00   1st Qu.:2.00        1st Qu.:5.00    
##  Median :1.000   Median :5.00   Median :2.00        Median :5.00    
##  Mean   :0.817   Mean   :5.61   Mean   :2.11        Mean   :5.06    
##  3rd Qu.:1.000   3rd Qu.:9.00   3rd Qu.:2.00        3rd Qu.:5.00    
##  Max.   :1.000   Max.   :9.00   Max.   :4.00        Max.   :7.00    
##  applicant_sex  census_tract_number co_applicant_ethnicity
##  Min.   :1.00   Min.   :   1        Min.   :1.00          
##  1st Qu.:1.00   1st Qu.:  33        1st Qu.:2.00          
##  Median :1.00   Median :9532        Median :2.00          
##  Mean   :1.45   Mean   :5422        Mean   :3.21          
##  3rd Qu.:2.00   3rd Qu.:9607        3rd Qu.:5.00          
##  Max.   :4.00   Max.   :9713        Max.   :5.00          
##  co_applicant_race_1 co_applicant_sex  county_code  hoepa_status
##  Min.   :1.00        Min.   :1.00     Min.   : 1   Min.   :1    
##  1st Qu.:5.00        1st Qu.:2.00     1st Qu.: 7   1st Qu.:2    
##  Median :5.00        Median :2.00     Median :11   Median :2    
##  Mean   :6.18        Mean   :3.05     Mean   :13   Mean   :2    
##  3rd Qu.:8.00        3rd Qu.:5.00     3rd Qu.:21   3rd Qu.:2    
##  Max.   :8.00        Max.   :5.00     Max.   :27   Max.   :2    
##   lien_status    loan_purpose    loan_type    owner_occupancy
##  Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00   
##  1st Qu.:1.00   1st Qu.:2.00   1st Qu.:1.00   1st Qu.:1.00   
##  Median :1.00   Median :3.00   Median :1.00   Median :1.00   
##  Mean   :1.06   Mean   :2.48   Mean   :1.16   Mean   :1.18   
##  3rd Qu.:1.00   3rd Qu.:3.00   3rd Qu.:1.00   3rd Qu.:1.00   
##  Max.   :3.00   Max.   :3.00   Max.   :4.00   Max.   :3.00   
##   preapproval  property_type  purchaser_type number_of_1_to_4_family_units
##  Min.   :1.0   Min.   :1.00   Min.   :0.00   Min.   : 140                 
##  1st Qu.:3.0   1st Qu.:1.00   1st Qu.:0.00   1st Qu.:1370                 
##  Median :3.0   Median :1.00   Median :1.00   Median :1714                 
##  Mean   :2.9   Mean   :1.03   Mean   :1.93   Mean   :1819                 
##  3rd Qu.:3.0   3rd Qu.:1.00   3rd Qu.:3.00   3rd Qu.:2341                 
##  Max.   :3.0   Max.   :2.00   Max.   :9.00   Max.   :3311                 
##  number_of_owner_occupied_units minority_population   population  
##  Min.   :  53                   Min.   : 2.11       Min.   : 299  
##  1st Qu.: 857                   1st Qu.: 3.53       1st Qu.:2855  
##  Median :1204                   Median : 4.38       Median :4054  
##  Mean   :1250                   Mean   : 5.50       Mean   :4154  
##  3rd Qu.:1532                   3rd Qu.: 6.59       3rd Qu.:5094  
##  Max.   :2874                   Max.   :23.92       Max.   :8698  
##    logloanamt     logincome    logmsamd_income loghud_income 
##  Min.   :0.00   Min.   :0.00   Min.   :3.56    Min.   :11.1  
##  1st Qu.:4.65   1st Qu.:3.97   1st Qu.:4.52    1st Qu.:11.1  
##  Median :5.06   Median :4.37   Median :4.64    Median :11.1  
##  Mean   :4.95   Mean   :4.40   Mean   :4.63    Mean   :11.1  
##  3rd Qu.:5.40   3rd Qu.:4.78   3rd Qu.:4.77    3rd Qu.:11.2  
##  Max.   :7.97   Max.   :9.21   Max.   :5.32    Max.   :11.2

The baseline model always predicts the most frequent action taken, which is that the loan is originated. Its accuracy is 82%.

## 
##     0     1 
## 0.183 0.817
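The split itself is not shown in the post. The sketch below assumes a 70/30 split stratified on the outcome, written in base R; the split ratio, the seed, and all variable names are assumptions. The outcome vector is rebuilt from the originated and denied counts reported earlier, and the baseline is then scored on the held-out part.

```r
set.seed(2012)  # arbitrary seed, used only to make this sketch reproducible

# Outcome vector rebuilt from the counts above: 4894 denied (0), 21852 originated (1)
action_taken <- rep(c(0, 1), times = c(4894, 21852))

# Stratified 70/30 split (assumed): sample 70% of the row indices within each class
rows_by_class <- split(seq_along(action_taken), action_taken)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

train_y <- action_taken[train_idx]
test_y  <- action_taken[-train_idx]

# Baseline: always predict the most frequent class seen in training,
# then score that constant prediction on the test set
majority_class    <- as.numeric(names(which.max(table(train_y))))
baseline_accuracy <- mean(test_y == majority_class)

c(train = length(train_y), test = length(test_y))
round(baseline_accuracy, 3)
```

Under these assumptions the split yields 18,722 training rows and 8,024 test rows, which agrees with the degrees of freedom reported for the training-set model and with the totals of the test-set confusion matrices.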

On the test set, logistic regression has an accuracy of 85%, which is an improvement over the baseline.

## 
## Call:
## glm(formula = action_taken ~ ., family = binomial, data = Train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.672   0.000   0.000   0.555   2.385  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     1.29e+03   5.48e+04    0.02  0.98115    
## agency_code                    -1.70e-01   9.08e-03  -18.67  < 2e-16 ***
## applicant_ethnicity            -2.69e-01   9.61e-02   -2.80  0.00506 ** 
## applicant_race_1                1.76e-01   5.50e-02    3.20  0.00139 ** 
## applicant_sex                   1.23e-03   4.72e-02    0.03  0.97922    
## census_tract_number            -2.18e-03   4.45e-04   -4.90  9.8e-07 ***
## co_applicant_ethnicity         -1.93e-01   1.03e-01   -1.88  0.06071 .  
## co_applicant_race_1            -3.21e-02   7.66e-02   -0.42  0.67494    
## co_applicant_sex                1.48e-01   6.86e-02    2.16  0.03066 *  
## county_code                    -1.37e-03   3.38e-03   -0.41  0.68455    
## hoepa_status                   -2.20e+01   2.74e+04    0.00  0.99936    
## lien_status                    -6.65e-01   9.23e-02   -7.21  5.5e-13 ***
## loan_purpose                   -1.80e-01   3.75e-02   -4.79  1.7e-06 ***
## loan_type                      -6.28e-01   6.17e-02  -10.18  < 2e-16 ***
## owner_occupancy                -2.11e-01   5.76e-02   -3.67  0.00024 ***
## preapproval                    -2.55e-02   8.68e-02   -0.29  0.76905    
## property_type                  -2.52e-01   1.20e-01   -2.09  0.03625 *  
## purchaser_type                  1.89e+01   1.53e+02    0.12  0.90195    
## number_of_1_to_4_family_units  -1.39e-04   7.01e-05   -1.98  0.04785 *  
## number_of_owner_occupied_units  1.98e-04   1.43e-04    1.39  0.16587    
## minority_population             3.15e-02   1.13e-02    2.78  0.00549 ** 
## population                     -1.66e-05   3.95e-05   -0.42  0.67379    
## logloanamt                     -7.76e-01   4.46e-02  -17.41  < 2e-16 ***
## logincome                       8.55e-01   4.38e-02   19.51  < 2e-16 ***
## logmsamd_income                 6.71e-01   1.29e-01    5.18  2.2e-07 ***
## loghud_income                  -1.11e+02   2.30e+01   -4.83  1.4e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17820  on 18721  degrees of freedom
## Residual deviance: 10197  on 18696  degrees of freedom
## AIC: 10249
## 
## Number of Fisher Scoring iterations: 21

##    
##     FALSE TRUE
##   0   684  784
##   1   412 6144
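The code behind the training-set model and the test-set table is not shown, so the following is only a sketch of the usual fit-then-predict workflow with `glm` and `predict`. It runs on simulated stand-in data, not the HMDA data: the data frame `toy`, its two predictors, the coefficients used to simulate the outcome, and the seed are all assumptions for illustration. The names `Train` and `action_taken` mirror the model call above.

```r
set.seed(1)

# Simulated stand-in data (NOT the HMDA data): two log-scale predictors
n <- 2000
toy <- data.frame(
  logincome  = rnorm(n, mean = 4.4, sd = 0.6),
  logloanamt = rnorm(n, mean = 5.0, sd = 0.7)
)

# Simulated binary outcome: higher income raises the odds of origination,
# a larger loan amount lowers them (coefficients are made up)
true_logit <- 1.5 + 0.9 * (toy$logincome - 4.4) - 0.7 * (toy$logloanamt - 5.0)
toy$action_taken <- rbinom(n, size = 1, prob = plogis(true_logit))

# 70/30 split into training and testing sets
train_rows <- sample(n, size = round(0.7 * n))
Train <- toy[train_rows, ]
Test  <- toy[-train_rows, ]

# Fit on the training set only
fit <- glm(action_taken ~ ., family = binomial, data = Train)

# Predicted probabilities of origination for the held-out rows
test_prob <- predict(fit, newdata = Test, type = "response")

# Classify at a 0.5 threshold; fixing the factor levels keeps both
# columns in the table even if one class is never predicted
test_pred <- factor(test_prob > 0.5, levels = c(FALSE, TRUE))
conf_mat  <- table(actual = Test$action_taken, predicted = test_pred)

test_accuracy <- mean((test_prob > 0.5) == (Test$action_taken == 1))

conf_mat
test_accuracy
```

The important detail is `newdata = Test`: the model never sees the test rows during fitting, so the resulting accuracy is an out-of-sample estimate.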

Let’s run a classification tree on this data to see if accuracy can be improved over the regression model. A CART tree is often easier to interpret than a table of logistic regression coefficients, which is an advantage.

The accuracy of the CART model on the test set is 86%, slightly higher than the 85% of the logistic regression model. In this case, CART appears to slightly improve both accuracy and interpretability.

##    PredictCART
##        0    1
##   0  846  622
##   1  517 6039
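A table like the one above is typically produced by fitting a tree with `rpart` and predicting classes on the test set. The sketch below shows that pattern on simulated stand-in data, not the HMDA data. The data frame `toy`, the `minbucket` value, and the seed are assumptions; `Train`, `Test`, and `PredictCART` follow the names visible in the outputs of this post.

```r
library(rpart)  # recommended package that ships with standard R distributions

set.seed(1)

# Simulated stand-in data (NOT the HMDA data)
n <- 2000
toy <- data.frame(
  logincome  = rnorm(n, mean = 4.4, sd = 0.6),
  logloanamt = rnorm(n, mean = 5.0, sd = 0.7)
)
true_logit <- 1.5 + 0.9 * (toy$logincome - 4.4) - 0.7 * (toy$logloanamt - 5.0)

# A factor response tells rpart to grow a classification tree
toy$action_taken <- factor(rbinom(n, size = 1, prob = plogis(true_logit)),
                           levels = c(0, 1))

train_rows <- sample(n, size = round(0.7 * n))
Train <- toy[train_rows, ]
Test  <- toy[-train_rows, ]

# minbucket = 25 is an arbitrary choice for this sketch; it sets the
# minimum number of observations allowed in a terminal node
tree <- rpart(action_taken ~ ., data = Train, method = "class", minbucket = 25)

# Predicted class labels for the held-out rows
PredictCART <- predict(tree, newdata = Test, type = "class")

table(Test$action_taken, PredictCART)
mean(PredictCART == Test$action_taken)  # test-set accuracy
```

Printing `tree` lists the split rules, which is where the interpretability of CART comes from.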

ROCR provides useful metrics on our classification tree model. The AUC is 0.90 for the CART model, which is very good. The ROC curve plots the true positive rate against the false positive rate across all classification thresholds, which helps in choosing a threshold.

The plot indicates that accuracy is maximized at a threshold between 0.5 and 0.6; the accuracy is 0.858 at both of these values. The accuracy decreases when a threshold above or below this range is used.

## [1] 0.8969
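AUC summarizes how well the predicted probabilities rank originated loans above denied ones across all thresholds, whereas accuracy is computed at one threshold at a time. The sketch below is not the ROCR code used for this post. It computes the AUC in base R with the rank-based (Mann-Whitney) formula and tabulates accuracy over a grid of thresholds, using simulated scores. The function `auc_rank`, the simulated `score` and `actual` vectors, and the threshold grid are all assumptions for illustration.

```r
set.seed(1)

# Simulated labels and predicted probabilities (NOT model output from this post)
n <- 1000
actual <- rbinom(n, size = 1, prob = 0.8)
score  <- plogis(rnorm(n, mean = ifelse(actual == 1, 2, 0.5)))

# AUC as the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case
auc_rank <- function(score, actual) {
  r     <- rank(score)            # midranks, so ties are handled
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Accuracy when classifying as "originated" above each threshold
thresholds <- seq(0.1, 0.9, by = 0.1)
accuracy_by_threshold <- sapply(thresholds, function(t) {
  mean((score > t) == (actual == 1))
})

auc_rank(score, actual)
data.frame(threshold = thresholds, accuracy = accuracy_by_threshold)
```

Scanning the second table for its maximum is the numeric counterpart of reading the best threshold off a plot.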

[Figure: ROCR plot for the CART model]

##    
##     FALSE TRUE
##   0   846  622
##   1   517 6039

Next, build a random forest model on the training data using only two independent variables and evaluate it on the test set. The accuracy of the model is 82%, which is lower than that of the CART and logistic regression models and essentially the same as the baseline.

##    PredictForest
##        0    1
##   0   38 1430
##   1   23 6533

## [1] 0.8189

Now, we’ll build a random forest comparable to the classification tree and logistic regression models. This model has an accuracy of 86% on the test set, in line with the 86% for CART and 85% for logistic regression.

##    PredictForestBrf
##        0    1
##   0  763  705
##   1  444 6112
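With all four test-set confusion matrices available, the models can be compared side by side. The sketch below re-enters the tables from the outputs above by hand in base R; the helper functions and the labels `forest_two_var` and `forest_second` are illustrative names, not from the original code.

```r
# Build a 2x2 matrix from counts given column by column
# (rows = actual 0/1, columns = predicted 0/1)
make_cm <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
}

# Test-set confusion matrices copied from the outputs above
models <- list(
  logistic       = make_cm(c(684, 412, 784, 6144)),
  cart           = make_cm(c(846, 517, 622, 6039)),
  forest_two_var = make_cm(c(38, 23, 1430, 6533)),
  forest_second  = make_cm(c(763, 444, 705, 6112))
)

summarise_cm <- function(cm) {
  c(accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm["1", ]),   # originated loans identified
    specificity = cm["0", "0"] / sum(cm["0", ]))   # denials identified
}

comparison <- t(sapply(models, summarise_cm))

# Baseline on the test set: always predict "originated"
baseline <- sum(models$logistic["1", ]) / sum(models$logistic)

round(comparison, 4)
round(baseline, 4)
```

The two-variable forest has a specificity under 3%: it labels almost every application as originated, which is why its accuracy sits so close to the baseline.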

Now, we’ll construct some metrics to produce a variable importance plot for the random forest model. The results confirm that applicant income is an important predictor of whether the loan is originated or denied.

[Figures: variable importance plots for the random forest model]