Machine Learning with R

Ridge and Lasso Regression Models

In this post, we’ll explore ridge and lasso regression models. The idea is that by shrinking, or regularizing, the coefficients we can improve prediction accuracy, reduce variance, and improve model interpretability.

In ridge regression, we add a penalty controlled by a tuning parameter called lambda, which is chosen using cross-validation. The idea is to keep the fit small by minimizing the residual sum of squares plus a shrinkage penalty. The shrinkage penalty is lambda times the sum of squares of the coefficients, so coefficients that grow too large are penalized. As lambda gets larger, the variance drops but the bias increases. The drawback of ridge is that it doesn’t select variables: it includes all of the variables in the final model.
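To make the penalty concrete, here is a minimal sketch (not from the original post) of the quantity ridge regression minimizes for a fixed lambda; glmnet fits this much more efficiently, but the objective is the same idea:

# Hypothetical illustration: the ridge objective for a given lambda
# (residual sum of squares plus lambda times the sum of squared coefficients)
ridge_objective = function(beta, X, y, lambda) {
  rss = sum((y - X %*% beta)^2)
  penalty = lambda * sum(beta^2)
  rss + penalty
}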

In lasso, the penalty is lambda times the sum of the absolute values of the coefficients. Lasso shrinks the coefficient estimates towards zero, and when lambda is large enough it has the effect of setting some coefficients exactly equal to zero, which ridge does not. Hence, much like best subset selection, lasso performs variable selection. The tuning parameter lambda is chosen by cross-validation. When lambda is small, the result is essentially the least squares estimates. As lambda increases, shrinkage occurs and variables whose coefficients reach zero can be thrown away. So a major advantage of lasso is that it combines shrinkage with variable selection. In cases with a very large number of features, lasso allows us to efficiently find a sparse model that involves only a small subset of the features.
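The only change relative to the ridge objective sketched above is the penalty term; again a hypothetical illustration, not code from the post:

# Hypothetical illustration: the lasso objective swaps the squared-coefficient
# penalty for lambda times the sum of absolute values of the coefficients
lasso_objective = function(beta, X, y, lambda) {
  sum((y - X %*% beta)^2) + lambda * sum(abs(beta))
}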

Let’s build lasso and ridge regression models on a continuous dependent variable. We’ll need to construct a model matrix of the predictors. In this example, four predictors were selected.

x1=model.matrix(logloanamt~logincome+applicant_race_1+applicant_sex+loan_purpose-1,data=vermontml2)
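The post references fit.ridge, cv.ridge, fit.lasso and cv.lasso below without showing the calls that create them. A minimal sketch of those fits, assuming the glmnet package and a response vector (here named y1, a hypothetical name) taken from the same data frame:

library(glmnet)
y1 = vermontml2$logloanamt
fit.ridge = glmnet(x1, y1, alpha=0)     # alpha=0 gives the ridge penalty
cv.ridge = cv.glmnet(x1, y1, alpha=0)   # cross-validation to choose lambda
fit.lasso = glmnet(x1, y1, alpha=1)     # alpha=1 (the default) gives the lasso penalty
cv.lasso = cv.glmnet(x1, y1, alpha=1)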

Ridge keeps all variables and shrinks the coefficients towards zero. In the plot, small values of lambda correspond to an essentially unregularized fit. Using cross-validation to pick the best value for lambda, the resulting plot indicates that the unregularized full model does pretty well in this case.

plot(fit.ridge,xvar="lambda",label=TRUE)


plot(cv.ridge)


Now, let’s fit the lasso model. Lasso regularization does both shrinkage and variable selection.

plot(fit.lasso,xvar="lambda",label=TRUE) 


This plot shows how much of the deviance, which is analogous to R-squared, has been explained by the model.

plot(fit.lasso,xvar="dev",label=TRUE)


Cross-validation selects lambda, which determines which variables remain in the model, and the coefficients of that selected model can then be extracted.

plot(cv.lasso)


coef(cv.lasso)
## 5 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)      3.4734
## logincome        0.3413
## applicant_race_1 .     
## applicant_sex    .     
## loan_purpose     .
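Only logincome survives at the selected lambda. A note on glmnet behaviour, added here rather than taken from the original post: coef() on a cv.glmnet object uses the one-standard-error lambda (lambda.1se) by default; to see the coefficients at the lambda with minimum cross-validated error, pass s explicitly:

coef(cv.lasso, s="lambda.min")   # coefficients at the minimum-error lambda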

To summarize, when lambda is zero, then the lasso model simply gives the least squares fit. Lasso can produce a model involving any number of variables. In contrast, ridge regression will always include all of the variables in the model.

Now, let’s construct a full model including all the variables.

x2=model.matrix(logloanamt~.-1,data=vermontml2)
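Again the fitting calls are not shown in the post; a sketch of re-fitting both models on the full predictor matrix, re-using the hypothetical y1 response defined above:

fit.ridge = glmnet(x2, y1, alpha=0)
cv.ridge = cv.glmnet(x2, y1, alpha=0)
fit.lasso = glmnet(x2, y1, alpha=1)
cv.lasso = cv.glmnet(x2, y1, alpha=1)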

Ridge minimizes the residual sum of squares plus a shrinkage penalty of lambda multiplied by the sum of squares of the coefficients. As lambda increases, the coefficients approach zero. The coefficients are unregularized when lambda is zero. The plot shows the whole path of variables as they shrink towards zero.

The value of lambda will be chosen by cross-validation. The plot shows the cross-validated mean squared error. As lambda decreases, the mean squared error decreases. Ridge includes all the variables in the model, and the values of lambda selected (the minimum-error choice and the one-standard-error choice) are indicated by the vertical lines.

plot(fit.ridge,xvar="lambda",label=TRUE)


plot(cv.ridge)
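The specific lambda values behind those vertical lines can be pulled directly from the cv.glmnet object (a usage sketch, not shown in the original post):

cv.ridge$lambda.min   # lambda with the lowest cross-validated error
cv.ridge$lambda.1se   # largest lambda within one standard error of that minimum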


Lasso minimizes the residual sum of squares plus a shrinkage penalty of lambda multiplied by the sum of absolute values of the coefficients. This model performs variable selection in that it restricts some of the coefficients to be exactly zero. The numbers across the top of the plot show how many variables have non-zero coefficients at each value of lambda. So at a log(lambda) of -4, the model has fourteen variables.

plot(fit.lasso,xvar="lambda",label=TRUE) 


This plot indicates that about 20% of the deviance, which is analogous to R-squared, is explained by six variables, whereas the full model explains about 35% of the deviance.

plot(fit.lasso,xvar="dev",label=TRUE)


Using cross-validation to select lambda indicates that the unregularized model does a good job, and that the model within one standard error of the minimum, with seventeen variables, is also a good choice.

Indeed, the coefficients of that seventeen-variable model can be extracted.

plot(cv.lasso)


coef(cv.lasso)
## 26 x 1 sparse Matrix of class "dgCMatrix"
##                                         1
## (Intercept)                    -6.380e-01
## action_taken                    1.309e-01
## agency_code                     1.287e-02
## applicant_ethnicity             .        
## applicant_race_1                .        
## applicant_sex                  -1.090e-02
## census_tract_number            -6.741e-06
## co_applicant_ethnicity          .        
## co_applicant_race_1             .        
## co_applicant_sex                .        
## county_code                    -3.221e-04
## hoepa_status                    .        
## lien_status                    -2.429e-01
## loan_purpose                    1.170e-02
## loan_type                       8.186e-02
## owner_occupancy                -3.246e-02
## preapproval                    -2.051e-02
## property_type                  -4.917e-01
## purchaser_type                  5.728e-02
## number_of_1_to_4_family_units   2.977e-06
## number_of_owner_occupied_units  .        
## minority_population             7.353e-03
## population                      .        
## logincome                       3.997e-01
## logmsamd_income                 3.866e-01
## loghud_income                   2.172e-01
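As a quick check on the seventeen-variable claim, the non-zero coefficients can be counted directly from the sparse matrix returned by coef() (a small sketch, not in the original post):

sel = coef(cv.lasso)
sum(as.matrix(sel) != 0) - 1   # number of non-zero coefficients, excluding the intercept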