Yogesh Chauhan's Blog

The Lending Club Analysis using Logistic Regression and Random Forest in RStudio

in Miscellaneous on September 25, 2019

Introduction

About Lending Club

The Lending Club is a peer-to-peer lending company that matches borrowers with investors through a web platform. It is the world’s biggest online marketplace that connects borrowers and investors.

The Lending Club aims to transform the banking system to make credit more affordable and investing more rewarding. It operates at a lower cost than traditional bank lending programs and passes the savings on to borrowers in the form of lower rates and to investors in the form of solid risk-adjusted returns.

How does it work?

The Lending Club serves people who need personal loans between $1,000 and $40,000. Borrowers receive the full amount of the issued loan minus the initial fee, which is paid to the company. Investors buy notes backed by the personal loans and pay Lending Club a service fee. The company publishes data about the loans issued through its platform for set periods, such as a quarter, a year or a couple of years.

About The Data

The Lending Club releases data every quarter. So far, the published data runs from 2007 to 2018 Q3. It comes in 4 files, which are updated every quarter on the same day the company publishes its quarterly results.

The files contain information on nearly all the loans issued by LC. The only loans missing are the few where the Lending Club was not authorized to release the details of the transactions publicly.

The information available for each loan includes all the details of the loan at the time of issuance, along with its latest status: for example, how much principal and interest has been paid so far, and whether the loan was fully paid, has defaulted, or the borrower is late on payments.

Modeling credit risk for both personal and business loans is of utmost importance for financial institutions. The probability that a borrower will default is a key input to a credit risk measure.

This analysis will focus on the Loan Data from 2007 to 2011.

The data regarding this project can be accessed here.

Obtain the data


loan_data1a <- data.frame(read.csv("LoanStats2007_11.csv", stringsAsFactors = F, skip=1))

Scrub the data

I’ve installed all the packages required for the analysis and I’m going to load the libraries in the code chunk below:


library(zoo)
library(ggplot2)
library(scales)
library(extrafont)
library(mapproj)
library(randomForest)
library(lubridate)
library(dplyr)
library(tidyr)

I’ve already skipped the first row in the previous code because it was just a source url for the data.
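To see what skip = 1 does, here is a minimal sketch on a tiny in-memory file. The contents of raw_text are made up for illustration; only the layout (a source line first, the header second) follows the description of the real file.

```r
# Hypothetical two-loan file; the first line mimics the source-url row.
raw_text <- paste(
  "Notes offered by Prospectus (https://example.com/prospectus)",
  "loan_amnt,term,int_rate",
  "5000, 36 months,10.65%",
  "2500, 60 months,15.27%",
  sep = "\n"
)

# skip = 1 drops the first line, so the second line becomes the header.
toy <- read.csv(text = raw_text, skip = 1, stringsAsFactors = FALSE)

str(toy)
```

read.csv() already returns a data.frame, so the sketch does not wrap it in data.frame().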

There are more than 140 columns with more than 125 variables. We don’t need all those variables for this analysis.

I’m going to drop the unnecessary data with the select() function and keep the columns which are important for our analysis.

Using the dplyr package, we select the columns that we need:


loan_data <- dplyr::select(loan_data1a, 3:4, 6:9, 11:18, 21, 23, 26:30, 33:37)

Alternatively, we can select the same columns by name:


loan_data <- select(loan_data1a, loan_amnt, funded_amnt, term, int_rate, installment, grade, emp_title, emp_length, home_ownership, annual_inc, verification_status, issue_d, loan_status, borrower_score, payment_inc_ratio, purpose, addr_state, dti, delinq_2yrs, earliest_cr_line, inq_last_6mths, open_acc, pub_rec, revol_bal, revol_util, total_acc)
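Selecting by position is compact but easy to get wrong, so it can help to check which names sit at the positions you picked. A small base R sketch follows; the raw data frame and its columns are invented for illustration.

```r
# Made-up frame standing in for the raw file.
raw <- data.frame(id = 1:2, member_id = 3:4,
                  loan_amnt = c(500, 700), funded_amnt = c(500, 700),
                  url = c("a", "b"), stringsAsFactors = FALSE)

# Which names sit at positions 3 and 4?
names(raw)[3:4]

# Selecting by name returns the same columns and survives column reordering.
by_position <- raw[, 3:4]
by_name     <- raw[, c("loan_amnt", "funded_amnt")]
identical(by_position, by_name)
```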

The following code chunk will remove the missing values from loan_data:


loan_data <- na.omit(loan_data)
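# OR, with tidyr (assign the result, otherwise loan_data is left unchanged):
# loan_data <- loan_data %>% drop_na()

# Column-wise maximum that skips the NA's (it does not remove any rows):
apply(loan_data, 2, max, na.rm = TRUE)

Either of the first two options removes the rows that contain NA’s. The apply() call only ignores the NA’s while it computes the maximum of each column.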

NOTE: The lines starting with ## are the output of a code chunk.


##           loan_amnt         funded_amnt                term
##             " 9975"             " 9975"        " 60 months"
##            int_rate         installment               grade
##             "9.99%"           "  99.99"                 "G"
##           emp_title          emp_length      home_ownership
##             "Zynga"               "n/a"              "RENT"
##          annual_inc verification_status             issue_d
##        "  99999.00"          "Verified"            "Sep-11"
##         loan_status      borrower_score   payment_inc_ratio
##        "Fully Paid"              "1.00"        " 9.9992000"
##             purpose          addr_state                 dti
##           "wedding"                "WY"             " 9.99"
##         delinq_2yrs    earliest_cr_line      inq_last_6mths
##                " 9"            "Sep-99"                " 8"
##            open_acc             pub_rec           revol_bal
##                " 9"                " 4"            "  9998"
##          revol_util           total_acc
##            "99.90%"                "90"
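The difference between dropping incomplete rows and merely skipping NA's inside a calculation is easy to check on a toy data frame. This is only a sketch with made-up values, not the Lending Club data.

```r
# Toy data frame with one missing value (made up for illustration).
toy <- data.frame(loan_amnt = c(500, NA, 900),
                  dti       = c(22.2, 3.0, 14.2))

# na.omit() returns a copy without the incomplete rows; assign it to keep it.
toy_clean <- na.omit(toy)
nrow(toy_clean)

# apply(..., max, na.rm = TRUE) only skips NA's while computing each maximum;
# toy itself keeps all of its rows and its NA afterwards.
col_max <- apply(toy, 2, max, na.rm = TRUE)
col_max
sum(is.na(toy))
```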

Explore the Data

This section describes each column of the data, including its type (e.g., number, string, character) and allowable values (e.g., minimums and maximums), using functions such as class(), str(), and summary().


class(loan_data)

## [1] "data.frame"

The code above returns the class of an object (in this case loan_data), and as we can see, the result is data.frame.


str(loan_data)


## 'data.frame':    39789 obs. of  26 variables:
##  $ loan_amnt          : int  500 500 500 500 500 700 725 750 800 900 ...
##  $ funded_amnt        : int  500 500 500 500 500 700 725 750 800 900 ...
##  $ term               : chr  " 36 months" " 36 months" " 36 months" " 36 months" ...
##  $ int_rate           : chr  "9.76%" "10.71%" "10.46%" "11.41%" ...
##  $ installment        : num  16.1 16.3 16.2 16.5 15.7 ...
##  $ grade              : chr  "B" "B" "B" "C" ...
##  $ emp_title          : chr  "Hughes, Hubbard & Reed LLP" "" "THe University of Illinois" "Global Travel International -and- Global Domains International" ...
##  $ emp_length         : chr  "7 years" "< 1 year" "3 years" "< 1 year" ...
##  $ home_ownership     : chr  "MORTGAGE" "MORTGAGE" "MORTGAGE" "RENT" ...
##  $ annual_inc         : num  59000 7904 26000 19500 18000 ...
##  $ verification_status: chr  "Not Verified" "Not Verified" "Not Verified" "Not Verified" ...
##  $ issue_d            : chr  "Mar-08" "Jan-08" "Jan-08" "Jan-08" ...
##  $ loan_status        : chr  "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
##  $ borrower_score     : num  0.35 0.65 0.55 0.75 0.5 0.4 0.7 0.45 0.4 0.45 ...
##  $ payment_inc_ratio  : num  9.546 1.046 0.189 0.573 1.172 ...
##  $ purpose            : chr  "other" "vacation" "small_business" "other" ...
##  $ addr_state         : chr  "NY" "CA" "IL" "VA" ...
##  $ dti                : num  22.17 3.04 14.17 3.69 4.27 ...
##  $ delinq_2yrs        : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ earliest_cr_line   : chr  "Aug-95" "Feb-89" "Jul-94" "Nov-83" ...
##  $ inq_last_6mths     : int  0 2 0 0 0 1 0 0 0 2 ...
##  $ open_acc           : int  9 3 8 8 4 4 4 8 8 2 ...
##  $ pub_rec            : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ revol_bal          : int  65414 44 5643 12229 0 0 1814 12220 19901 167 ...
##  $ revol_util         : chr  "47.80%" "3.70%" "60.70%" "90.60%" ...
##  $ total_acc          : int  26 6 28 15 4 4 10 8 15 4 ...

This is perhaps the most useful function in R.

The str() function provides valuable information about the structure of the loan_data object.

It gives the same information that the Variable View tab shows in SPSS.

It shows the class of the object as well as the number of observations (39789 in our example) and the number of variables (26 in our selected data.frame).

For each variable, it prints one row with the type (integer, character, numeric or something else) followed by the first few observations (or values).

For numeric analysis, we will convert some of the chr columns to numeric. For example, int_rate is stored as text such as ‘10.5%’ and needs to become the numeric value 0.105.


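# Keep the leading number of months: " 36 months" -> 36
loan_data$term <- as.numeric(substr(loan_data$term, 1, 3))
# Keep the first two characters; "< 1 year" and "n/a" are not numbers and become NA
loan_data$emp_length <- as.numeric(substr(loan_data$emp_length, 1, 2))
# Strip the percent sign and rescale: "9.76%" -> 0.0976
loan_data$int_rate <- as.numeric(gsub("%", "", loan_data$int_rate)) / 100
loan_data$revol_util <- as.numeric(gsub("%", "", loan_data$revol_util)) / 100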
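As a sketch of the same conversions on a few made-up values, the snippet below extracts the digits with a regular expression instead of relying on character positions. Coding "< 1 year" as 0 follows the data dictionary definition of emp_length; it is an assumption of this sketch, not what the code above does.

```r
# Made-up values in the same text format as the raw columns.
int_rate   <- c("9.76%", "10.71%")
term       <- c(" 36 months", " 60 months")
emp_length <- c("7 years", "< 1 year", "10+ years", "n/a")

# "9.76%" -> 0.0976
rate <- as.numeric(sub("%", "", int_rate, fixed = TRUE)) / 100

# Remove every non-digit character, then convert.
months <- as.numeric(gsub("\\D", "", term))

# Assumption: "< 1 year" is coded as 0 and "n/a" stays missing.
years <- as.numeric(gsub("\\D", "", emp_length))
years[emp_length == "< 1 year"] <- 0
years[emp_length == "n/a"] <- NA

rate
months
years
```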

Let’s check again:


str(loan_data)

## 'data.frame':    39789 obs. of  26 variables:
##  $ loan_amnt          : int  500 500 500 500 500 700 725 750 800 900 ...
##  $ funded_amnt        : int  500 500 500 500 500 700 725 750 800 900 ...
##  $ term               : num  36 36 36 36 36 36 36 36 36 36 ...
##  $ int_rate           : num  0.0976 0.1071 0.1046 0.1141 0.0807 ...
##  $ installment        : num  16.1 16.3 16.2 16.5 15.7 ...
##  $ grade              : chr  "B" "B" "B" "C" ...
##  $ emp_title          : chr  "Hughes, Hubbard & Reed LLP" "" "THe University of Illinois" "Global Travel International -and- Global Domains International" ...
##  $ emp_length         : num  7 NA 3 NA NA NA 1 NA 2 4 ...
##  $ home_ownership     : chr  "MORTGAGE" "MORTGAGE" "MORTGAGE" "RENT" ...
##  $ annual_inc         : num  59000 7904 26000 19500 18000 ...
##  $ verification_status: chr  "Not Verified" "Not Verified" "Not Verified" "Not Verified" ...
##  $ issue_d            : chr  "Mar-08" "Jan-08" "Jan-08" "Jan-08" ...
##  $ loan_status        : chr  "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
##  $ borrower_score     : num  0.35 0.65 0.55 0.75 0.5 0.4 0.7 0.45 0.4 0.45 ...
##  $ payment_inc_ratio  : num  9.546 1.046 0.189 0.573 1.172 ...
##  $ purpose            : chr  "other" "vacation" "small_business" "other" ...
##  $ addr_state         : chr  "NY" "CA" "IL" "VA" ...
##  $ dti                : num  22.17 3.04 14.17 3.69 4.27 ...
##  $ delinq_2yrs        : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ earliest_cr_line   : chr  "Aug-95" "Feb-89" "Jul-94" "Nov-83" ...
##  $ inq_last_6mths     : int  0 2 0 0 0 1 0 0 0 2 ...
##  $ open_acc           : int  9 3 8 8 4 4 4 8 8 2 ...
##  $ pub_rec            : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ revol_bal          : int  65414 44 5643 12229 0 0 1814 12220 19901 167 ...
##  $ revol_util         : num  0.478 0.037 0.607 0.906 0 NA 0.07 0.849 0.298 0.033 ...
##  $ total_acc          : int  26 6 28 15 4 4 10 8 15 4 ...

The following are the variables in the data we selected.
The definitions come from the Lending Club Data Dictionary, which can be accessed online from the same loan data source here.

Dependent Variable

“loan_status”: Current status of the loan, e.g. fully paid or charged off (the borrower defaulted)

Independent Variables

“loan_amnt”: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

funded_amnt: The total amount committed to that loan at that point in time.

“term”: The number of payments on the loan. Values are in months and can be either 36 or 60.

int_rate: Interest Rate on the loan

installment: The monthly payment owed by the borrower if the loan originates.

emp_title: The job title supplied by the Borrower when applying for the loan.

“emp_length”: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

“home_ownership”: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER

“annual_inc”: The self-reported annual income provided by the borrower during registration.

grade: LC assigned loan grade.

verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified

issue_d: The month which the loan was funded

“borrower_score”: Credit score of the borrower.

payment_inc_ratio: Borrower’s payment to income ratio

“purpose”: A category provided by the borrower for the loan request.

addr_state: The state provided by the borrower in the loan application

“dti”: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

“delinq_2yrs”: The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years

earliest_cr_line: The month the borrower’s earliest reported credit line was opened

inq_last_6mths: The number of inquiries in past 6 months (excluding auto and mortgage inquiries)

“open_acc”: The number of open credit lines in the borrower’s credit file.

“pub_rec”: The borrower’s number of critical public records (bankruptcy filings or tax liens)

“revol_bal”: Total credit revolving balance

“revol_util”: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.

The output of the call below still reports NA’s for several columns, so this data frame does contain missing values:

summary(loan_data)


##    loan_amnt      funded_amnt         term          int_rate    
##  Min.   :  500   Min.   :  500   Min.   :36.00   Min.   :0.0542 
##  1st Qu.: 5500   1st Qu.: 5400   1st Qu.:36.00   1st Qu.:0.0925 
##  Median :10000   Median : 9650   Median :36.00   Median :0.1186 
##  Mean   :11231   Mean   :10959   Mean   :42.45   Mean   :0.1203 
##  3rd Qu.:15000   3rd Qu.:15000   3rd Qu.:60.00   3rd Qu.:0.1459 
##  Max.   :35000   Max.   :35000   Max.   :60.00   Max.   :0.2459 
##  NA's   :3       NA's   :3       NA's   :3       NA's   :3      
##   installment         grade            emp_title           emp_length   
##  Min.   :  15.69   Length:39789       Length:39789       Min.   : 1.000 
##  1st Qu.: 167.08   Class :character   Class :character   1st Qu.: 3.000 
##  Median : 280.61   Mode  :character   Mode  :character   Median : 5.000 
##  Mean   : 324.73                                         Mean   : 5.644 
##  3rd Qu.: 430.78                                         3rd Qu.:10.000 
##  Max.   :1305.19                                         Max.   :10.000 
##  NA's   :3                                               NA's   :5671   
##  home_ownership       annual_inc      verification_status
##  Length:39789       Min.   :   4000   Length:39789      
##  Class :character   1st Qu.:  40500   Class :character  
##  Mode  :character   Median :  59000   Mode  :character  
##                     Mean   :  68979                     
##                     3rd Qu.:  82342                     
##                     Max.   :6000000                     
##                     NA's   :3                           
##    issue_d          loan_status        borrower_score   payment_inc_ratio
##  Length:39789       Length:39789       Min.   :0.0500   Min.   : 0.04889 
##  Class :character   Class :character   1st Qu.:0.4000   1st Qu.: 4.36452 
##  Mode  :character   Mode  :character   Median :0.5000   Median : 7.01757 
##                                        Mean   :0.4991   Mean   : 7.61661 
##                                        3rd Qu.:0.6000   3rd Qu.:10.33960 
##                                        Max.   :1.0000   Max.   :43.54560 
##                                        NA's   :3        NA's   :3        
##    purpose           addr_state             dti         delinq_2yrs    
##  Length:39789       Length:39789       Min.   : 0.00   Min.   : 0.0000 
##  Class :character   Class :character   1st Qu.: 8.18   1st Qu.: 0.0000 
##  Mode  :character   Mode  :character   Median :13.41   Median : 0.0000 
##                                        Mean   :13.32   Mean   : 0.1465 
##                                        3rd Qu.:18.60   3rd Qu.: 0.0000 
##                                        Max.   :29.99   Max.   :11.0000 
##                                        NA's   :3       NA's   :3       
##  earliest_cr_line   inq_last_6mths     open_acc         pub_rec      
##  Length:39789       Min.   :0.000   Min.   : 2.000   Min.   :0.00000 
##  Class :character   1st Qu.:0.000   1st Qu.: 6.000   1st Qu.:0.00000 
##  Mode  :character   Median :1.000   Median : 9.000   Median :0.00000 
##                     Mean   :0.869   Mean   : 9.294   Mean   :0.05514 
##                     3rd Qu.:1.000   3rd Qu.:12.000   3rd Qu.:0.00000 
##                     Max.   :8.000   Max.   :44.000   Max.   :4.00000 
##                     NA's   :3       NA's   :3        NA's   :3       
##    revol_bal        revol_util       total_acc   
##  Min.   :     0   Min.   :0.0000   Min.   : 2.00 
##  1st Qu.:  3704   1st Qu.:0.2540   1st Qu.:13.00 
##  Median :  8860   Median :0.4930   Median :20.00 
##  Mean   : 13392   Mean   :0.4886   Mean   :22.09 
##  3rd Qu.: 17065   3rd Qu.:0.7240   3rd Qu.:29.00 
##  Max.   :149588   Max.   :0.9990   Max.   :90.00 
##  NA's   :3        NA's   :53       NA's   :3

Summary (or descriptive) statistics include the mean, the range and the percentiles of each variable.

For each numerical variable, the code above prints

Range (minimum and maximum values),

1st and 3rd quartiles (quartiles are the three cut points that divide a dataset into four equal-sized groups),

Mean (average) as well as

Median (the middle value separating the higher half of the data from the lower half), as we can see in the output.

For example, for the first variable, loan_amnt, the loan amounts range from 500 to 35000, as we can see in the Min and Max values. The Mean was 11231 and the Median was 10000. The 1st Qu. was 5500 and the 3rd Qu. was 15000. (Note: All the values are dollar amounts.)
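The individual numbers that summary() reports can also be computed one by one. A short sketch, using the ten loan_amnt values shown at the head of the str() output:

```r
# The first ten loan amounts from the str() output.
amounts <- c(500, 500, 500, 500, 500, 700, 725, 750, 800, 900)

range(amounts)                          # minimum and maximum
quantile(amounts, c(0.25, 0.5, 0.75))   # 1st quartile, median, 3rd quartile
mean(amounts)                           # average

# summary() bundles the same numbers into one call.
summary(amounts)
```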

Scatter plot with ggplot2 (including graphs)

In which month and year did people apply for loans the most?

ggplot 1: Loans issued over the years [2007 – 2011]


library(lubridate)

loan_data$issue_d <- dmy(paste0("01-",loan_data$issue_d))
loan_amnt_by_month <- aggregate(loan_amnt ~ issue_d, data = loan_data, sum)

ggplot(loan_amnt_by_month, aes(issue_d, loan_amnt)) + geom_bar(stat = "identity") + labs(title = 'ggplot 1: Loans issued over the years')
[Figure: loans issued over the years]

In the 2007 to 2011 data, most of the loans were issued in 2011. The company was founded in 2006, so it took almost 5 years to gain momentum.
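The bar chart answers the question visually; the aggregated table can also be queried directly. Here is a minimal sketch on made-up loans (the toy data frame is hypothetical and issue_d is already a Date in it):

```r
# Made-up loans; issue_d is the first day of the issue month.
toy <- data.frame(
  issue_d   = as.Date(c("2010-12-01", "2011-01-01", "2011-01-01", "2011-02-01")),
  loan_amnt = c(5000, 12000, 8000, 3000)
)

# Same aggregate() call as above: total amount per issue month.
by_month <- aggregate(loan_amnt ~ issue_d, data = toy, sum)

# Month with the largest total.
busiest <- by_month[which.max(by_month$loan_amnt), ]
busiest

# Totals per calendar year.
by_year <- tapply(toy$loan_amnt, format(toy$issue_d, "%Y"), sum)
by_year
```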

Is there any relationship between loan amount requested and the loan amount granted?


ggplot 2: Loan Amount vs Funded Amount
ggplot(loan_data, aes(loan_amnt, funded_amnt)) + 
  geom_point(aes(colour = borrower_score)) +
  labs(title = 'ggplot 2: Loan Amount vs. Funded Amount') +
  geom_smooth()
[Figure: loan amount vs. funded amount]

In this plot, we see what the scatter plot looks like for the loan amount applied for vs. the funded amount.

I’ve used the borrower_score variable as the colour aesthetic to see how borrowers’ scores relate to the funded amounts.

I’ve used two variables, loan_amnt and funded_amnt.

As we can see in the plot above, most of the time the loan amount requested by the borrower was funded in full.

In a few cases, though, the amount was reduced, and there could be multiple reasons for that.

Most loans went to people with higher borrower scores, but in many cases loans were also approved for people with lower borrower scores.

What were the installments for the funded loans?


ggplot 3: Installment vs Funded Amount
ggplot(loan_data, aes(installment, funded_amnt)) + 
  geom_point(aes(colour = borrower_score)) +
  labs(title = 'ggplot 3: Installment vs. Funded Amount') +
  geom_smooth()
[Figure: installment vs. funded amount]

In this plot, we see what the scatter plot looks like for the installments vs. the funded amount.

I’ve used two variables, installment and funded_amnt.

I’ve used the borrower_score variable as the colour aesthetic to see how borrowers’ scores relate to the funded amounts.

As we can clearly see, the installments for smaller loans are lower and the installments for larger loans are higher. The smooth curve supports this conclusion.

One interesting fact we can discover from this plot is that there is a range of installments for the same loan amount. That means some people paid higher installments and some paid lower installments on the same amount. Interest rates and terms (36 or 60 months) differ between loans, and the borrower score could be one of the reasons behind that.

Also, we see that most of the higher installments come from people with lower borrower scores, but we still don’t see any huge difference between the installments of people with lower and higher borrower scores.
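One way to see why the same amount can carry different installments is the standard fixed-payment (annuity) formula. The sketch below assumes Lending Club installments follow that formula; the function name monthly_installment() is invented here.

```r
# Standard fixed-payment formula; an assumption about how installments are set.
monthly_installment <- function(principal, annual_rate, n_months) {
  r <- annual_rate / 12                       # monthly interest rate
  principal * r / (1 - (1 + r)^(-n_months))
}

# First loan in the str() output: $500 at 9.76% over 36 months.
monthly_installment(500, 0.0976, 36)

# The same $10,000 at different rates and terms.
monthly_installment(10000, 0.06, 36)
monthly_installment(10000, 0.20, 36)
monthly_installment(10000, 0.20, 60)
```

The first call comes out close to the installment of 16.1 that the str() output lists for that loan, which suggests the formula is a reasonable approximation. A higher rate raises the installment, and a longer term lowers it.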

I want to call this “A Post Office Example”

Is there any relationship between the funded amount and the annual income?


ggplot 4: Annual Income vs Funded Amount
ggplot(loan_data, aes(annual_inc, funded_amnt)) + 
  geom_point(aes(colour = borrower_score)) +
  labs(title = 'ggplot 4: Annual Income vs. Funded Amount') +
  geom_smooth()
[Figure: annual income vs. funded amount]

In this plot, we see what the scatter plot looks like for the funded amount vs. annual income.

I’ve used two variables, annual_inc and funded_amnt.

I’ve used the borrower_score variable as the colour aesthetic to see how borrowers’ scores relate to the funded amounts.

It looks like there are some really high income borrowers still applying for loans and borrowing money!

And by that I am talking about super rich people! One person has an annual income of


max(loan_data$annual_inc,na.rm = TRUE)

## [1] 6e+06

That is a whopping 6000000!

It’s unbelievable! An annual income of $6 million. That is like a tier 3 position in a top 10 public company… who is this person?


loan_data[which(loan_data$annual_inc == max(loan_data$annual_inc,na.rm = TRUE)),]$emp_title

## [1] "post office"

What? A post office guy??? I don’t believe this is valid data.

Apparently the Lending Club didn’t do a very good job of maintaining this information.

Let’s see what the loan data of some other high income people looks like.


loan_data[which(loan_data$annual_inc > 1000000),]$emp_title

##  [1] "Montgomery ISD"                    
##  [2] "St. John Lutheran Church"          
##  [3] "post office"                       
##  [4] "Dept of army"                      
##  [5] "Hewlett Packard"                   
##  [6] "Lockheed Martin"                   
##  [7] "at&t wireless"                     
##  [8] "Convent of the Sacred Heart"       
##  [9] "WCP"                               
## [10] "TelSource Corp"                    
## [11] "NYCDOE"                            
## [12] "Stryker Instruments"               
## [13] "Avis Budget Group"                 
## [14] "Lea Regional Hospital/Pecos Valley"

Well, most people earning a million dollars have job titles such as managing director, portfolio manager or managing partner, but here we have post office? NYCDOE? WCP?

I hope these are only a handful of cases of bad data, so to go forward with my analysis, I’m going to get rid of people with an annual income of $500k or more.


loan_data2 <- filter(loan_data,annual_inc<500000)
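One side effect worth knowing: dplyr::filter() also drops rows where the condition evaluates to NA, so loans with a missing income disappear together with the outliers. A base R sketch on made-up rows shows the same behaviour with subset():

```r
# Made-up rows; the last one has a missing income.
toy <- data.frame(
  emp_title  = c("post office", "NYCDOE", "Zynga", ""),
  annual_inc = c(6000000, 1200000, 59000, NA),
  stringsAsFactors = FALSE
)

# Who is above the cutoff? which() ignores the NA comparison.
toy$emp_title[which(toy$annual_inc > 1000000)]

# Keep incomes below $500k. Like dplyr::filter(), subset() drops rows
# where the condition is NA, so the row with a missing income goes too.
toy2 <- subset(toy, annual_inc < 500000)
nrow(toy2)
```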

Let’s replot the same chart:


ggplot(loan_data2, aes(annual_inc, funded_amnt)) + 
  geom_point(aes(colour = borrower_score)) +
  labs(title = 'ggplot 5: Annual Income vs. Funded Amount') +
  geom_smooth()
[Figure: annual income vs. funded amount, replotted]

Much better! The Lending Club caps loans at $40k (the largest loan in this 2007 to 2011 data is $35,000), so we don’t see any funded amount above that.

Overall, we can see a roughly linear relationship for annual incomes below $100k.

Beyond that, the regression line flattens out because of the hard cap on the loan amount.

What was the distribution for the loan?

ggplot 6: Distribution plot: Loan Amount vs Purpose for Loan Status

In this box plot, we’re going to see the distribution of the loan amounts for different purposes and their loan statuses as well.

We’re going to use three variables, loan_amnt, loan_status and purpose.


ggplot(loan_data, aes(purpose, loan_amnt, fill = loan_status)) + geom_boxplot() + labs(title = 'ggplot 6: Loan Amount vs Purpose for Loan Status') + theme(axis.text.x=element_text(size=8, angle = 90))
[Figure: loan amount vs. purpose]

As we can see in the plot above, the most stated reason for a loan was small business, followed by credit card and debt consolidation.

For house loans, more people have the status fully paid than charged off.

House loans are generally applied for by a family, so this might indicate that the chances of a house loan being fully paid are higher than for other loans.

In general, small business loans are larger and riskier loans.

But in this case (if we compare charged off vs. fully paid loans), wedding, debt consolidation and credit card loans seem to be the riskiest.


summary(loan_data$funded_amnt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     500    5400    9650   10959   15000   35000       3

Also with the code above, we can easily check the stats for a single variable too.

What was the most common loan term?


table(loan_data$term)

## 
##    36    60 
## 29096 10690

The vast majority are 36 months.
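To put a number on "vast majority", the counts can be turned into shares. A small sketch using the two counts from the table() output above:

```r
# Counts copied from the table() output.
term_counts <- c("36" = 29096, "60" = 10690)

# Share of each term among the loans with a known term.
term_share <- term_counts / sum(term_counts)
round(100 * term_share, 1)
```

On the full data frame, prop.table(table(loan_data$term)) would give the same shares in one step.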

How many people had 10 or more years of employment?


table(loan_data$emp_length)

## 
##    1    2    3    4    5    6    7    8    9   10 
## 3247 4394 4098 3444 3286 2231 1775 1485 1259 8899

8899 people had 10 or more years of employment.

Did they consider granting loans to unemployed people? Did they consider number of years of employment?

ggplot 7: Let’s plot all the people with their employment status via ggplot and geom_bar.


ggplot(loan_data, aes(emp_length)) + geom_bar(fill = "dodgerblue") +   ggtitle("ggplot 7: Employment Status Listed In The Loan Application")
[Figure: employment]

Let’s make that bar chart more meaningful.
 


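library(ggplot2)
library(scales)
# NOTE: the comparisons below expect the original text values of emp_length
# (e.g. "10+ years"), i.e. the column as it was before the as.numeric() conversion.
loan_data$employ <- NA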
loan_data$employ[loan_data$emp_length == "n/a"] <- "Unemployed"
loan_data$employ[loan_data$emp_length == "< 1 year"] <- "Less than 2 years"
loan_data$employ[loan_data$emp_length == "1 year"] <- "Less than 2 years"
loan_data$employ[loan_data$emp_length == "2 years"] <- "2-4 years"
loan_data$employ[loan_data$emp_length == "3 years"] <- "2-4 years"
loan_data$employ[loan_data$emp_length == "4 years"] <- "2-4 years"
loan_data$employ[loan_data$emp_length == "5 years"] <- "5-9 years"
loan_data$employ[loan_data$emp_length == "6 years"] <- "5-9 years"
loan_data$employ[loan_data$emp_length == "7 years"] <- "5-9 years"
loan_data$employ[loan_data$emp_length == "8 years"] <- "5-9 years"
loan_data$employ[loan_data$emp_length == "9 years"] <- "5-9 years"
loan_data$employ[loan_data$emp_length == "10+ years"] <- "10+ years"
loan_data$employ <- factor(loan_data$employ,
                      levels = c("Unemployed",
                                 "Less than 2 years",
                                 "2-4 years",
                                 "5-9 years",
                                 "10+ years"))

ggplot(loan_data, aes(x = factor(employ))) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue") + 
  scale_y_continuous(labels = percent) +
  labs(x = "Employment Status", y = "") +
  ggtitle("ggplot 8: Employment Status Listed In The Loan Application")
[Figure: employment (2)]

We’ve collapsed the employment lengths into fewer categories just for this analysis.

After that, we’ve plotted them with ggplot, using geom_bar to make the bar chart.

scale_y_continuous(labels = percent) formats the continuous y axis as percentages, while the x axis shows the Employment Status.

From the bar chart above, we can see that most of the people had 2-4 years of experience.

Also, fewer loans were approved for unemployed people.
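The twelve assignments used for the recoding can also be written as a single lookup vector. This is a sketch of that alternative on a few made-up raw values; the mapping is the same as in the code above, and employ_map and emp_length_raw are names invented here.

```r
# Lookup table: raw emp_length text -> coarser group.
employ_map <- c(
  "n/a" = "Unemployed",
  "< 1 year" = "Less than 2 years", "1 year" = "Less than 2 years",
  "2 years" = "2-4 years", "3 years" = "2-4 years", "4 years" = "2-4 years",
  "5 years" = "5-9 years", "6 years" = "5-9 years", "7 years" = "5-9 years",
  "8 years" = "5-9 years", "9 years" = "5-9 years",
  "10+ years" = "10+ years"
)

# Made-up sample of raw values.
emp_length_raw <- c("7 years", "< 1 year", "n/a", "10+ years", "3 years")

# Index the lookup by the raw text; unique() keeps the groups in order.
employ <- factor(unname(employ_map[emp_length_raw]),
                 levels = unique(employ_map))
table(employ)
```

Any raw value that is missing from the lookup becomes NA, which makes unexpected categories easy to spot.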

How many defaulters overall?

Let’s do the same type of analysis for Loan Status.

First of all, let’s see the different loan statuses and the number of people with each status.


table(loan_data$loan_status)

## 
##             Charged Off  Fully Paid 
##           3        5670       34116

There are just 2 statuses: Charged Off (the person is a defaulter) and Fully Paid. The 3 entries without a label are rows with an empty status.

Let’s plot it with the code below:


ggplot(loan_data, aes(loan_status)) + geom_bar(fill = "dodgerblue") +   ggtitle("ggplot 9: Loan Status Bar Chart")
[Figure: loan status bar chart]

Let’s make that bar chart more meaningful.


# Copy loan_status as text so the original column stays untouched
loan_data$status <- as.character(loan_data$loan_status)

loan_data$status[loan_data$status == "Charged Off"] <- "Default"
loan_data$status[loan_data$status == "Fully Paid"] <- "Good Credit"

ggplot(loan_data, aes(x = factor(status))) +
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue") +
  scale_y_continuous(labels = percent) +
  labs(x = "Loan Status", y = "") +
  ggtitle("ggplot 10: Loan Status Bar Chart")
[Figure: loan status bar chart (2)]

So, we’ve renamed the two loan statuses just for this analysis.

After that, we’ve plotted them with ggplot, using geom_bar to make the bar chart.

scale_y_continuous(labels = percent) formats the continuous y axis as percentages, while the x axis shows the Loan Status.

We have matured loan data and that’s why we see only two statuses (Charged Off and Fully Paid). If we used more recent data, we might see other statuses, such as loans that are still being paid.

As we can see in the bar chart above, more than 10% of the borrowers were defaulters (5670 of the 39786 loans with a status, or about 14%).

Did more people own their home or rent?

Home Ownership In The Loan Application


ggplot(loan_data, aes(home_ownership)) + geom_bar(fill = "dodgerblue") +   ggtitle("ggplot 11: Home Ownership In The Loan Application")
[Figure: home ownership]

We’ve plotted the home_ownership variable using ggplot and geom_bar.

The graph shows that most people rented their home, followed by those with a mortgage. Few people owned their home.

One explanation is that if you have the money to own a house (especially in the USA), then you’d have money for other purposes. In other words, you won’t need loans. But that’s not always true!

Logistic Regression on Loan Data

Logistic regression is useful for qualitative (discrete) responses, also referred to as categorical.
It models the probability that the response belongs to a category rather than predicting the outcome directly as default or not.

The logistic function ensures that the probabilities are within the range 0 to 1. We are going to use stats::glm().

We need to keep in mind that a call to stats::glm() will not print all of the model statistics by default. We can review the object’s attributes with the function base::attributes().

Also, we’ll use the function base::summary(). That is a generic function that produces summaries of the results of different model fitting functions. For a glm object, it returns model statistics like

  • z statistic
  • deviance residuals
  • p-value
  • coefficients
  • AIC
  • standard error and some more.
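Before fitting the real data, here is a minimal sketch of the same workflow on simulated data. Everything in it (the toy data frame, the score variable and the coefficients -1 and 4) is made up for illustration and has nothing to do with the Lending Club results.

```r
set.seed(42)

# Simulated data: a higher score makes "Fully Paid" more likely.
n      <- 500
score  <- runif(n)
p      <- plogis(-1 + 4 * score)     # logistic function keeps p between 0 and 1
status <- factor(ifelse(runif(n) < p, "Fully Paid", "Charged Off"),
                 levels = c("Charged Off", "Fully Paid"))
toy <- data.frame(status, score)

# With a factor response, glm() treats the first level as failure,
# so this models the probability of "Fully Paid".
fit <- glm(status ~ score, family = "binomial", data = toy)

coef(summary(fit))    # estimates, standard errors, z values, p-values
AIC(fit)
range(fitted(fit))    # fitted probabilities
```

coef(summary(fit)) is a plain matrix, so single statistics such as the p-value of score can be picked out by row and column name.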

glm() with a binomial family needs the response to be a factor (or coded as 0/1), so we set stringsAsFactors = T when reading the data.

So, I am going to read the main csv file again and make a new data frame loan_data_LR in the following code:


loan_data_LR <- data.frame(read.csv("LoanStats2007_11.csv", stringsAsFactors = T, skip=1))

Now let’s apply the glm function on the new data frame:


logistic_model <- glm(formula = loan_status ~ payment_inc_ratio + purpose + home_ownership + emp_length + borrower_score, family = "binomial", data = loan_data_LR)
class(logistic_model)

## [1] "glm" "lm"

Of course, the class of the model is glm.

This dataset has a binary response (the outcome, or dependent variable) called loan_status.

There are five predictor variables: payment_inc_ratio, purpose, home_ownership, emp_length and borrower_score.

We treated the variables payment_inc_ratio and borrower_score as continuous. emp_length takes values from unemployed to more than 10 years depending on the job experience. purpose and home_ownership take discrete values.

Let’s check the attributes for this model:


attributes(logistic_model)

## $names
##  [1] "coefficients"      "residuals"         "fitted.values"   
##  [4] "effects"           "R"                 "rank"            
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"   
## [13] "iter"              "weights"           "prior.weights"   
## [16] "df.residual"       "df.null"           "y"               
## [19] "converged"         "boundary"          "model"           
## [22] "na.action"         "call"              "formula"         
## [25] "terms"             "data"              "offset"          
## [28] "control"           "method"            "contrasts"       
## [31] "xlevels"         
##
## $class
## [1] "glm" "lm"

The model object has 31 named components, plus its class attribute.

Let’s review the summary for this model:


summary(logistic_model)

##
## Call:
## glm(formula = loan_status ~ payment_inc_ratio + purpose + home_ownership +
##     emp_length + borrower_score, family = "binomial", data = loan_data_LR)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max 
## -2.2458   0.4649   0.5350   0.5771   1.0293 
##
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                2.119427   0.109311  19.389  < 2e-16 ***
## payment_inc_ratio         -0.006445   0.003562  -1.809 0.070381 . 
## purposecredit_card        -0.017888   0.095320  -0.188 0.851141   
## purposedebt_consolidation -0.401779   0.086436  -4.648 3.35e-06 ***
## purposeeducational        -0.580027   0.169411  -3.424 0.000618 ***
## purposehome_improvement   -0.178803   0.101711  -1.758 0.078757 . 
## purposehouse              -0.432073   0.164749  -2.623 0.008726 **
## purposemajor_purchase      0.026757   0.109600   0.244 0.807130   
## purposemedical            -0.428025   0.134763  -3.176 0.001493 **
## purposemoving             -0.448911   0.141654  -3.169 0.001529 **
## purposeother              -0.473573   0.094202  -5.027 4.98e-07 ***
## purposerenewable_energy   -0.640996   0.268155  -2.390 0.016830 * 
## purposesmall_business     -1.124679   0.099390 -11.316  < 2e-16 ***
## purposevacation           -0.287136   0.170535  -1.684 0.092232 . 
## purposewedding             0.033955   0.136667   0.248 0.803786   
## home_ownershipNONE         8.740098  68.973744   0.127 0.899165   
## home_ownershipOTHER       -0.352006   0.264271  -1.332 0.182863   
## home_ownershipOWN         -0.104654   0.056545  -1.851 0.064198 . 
## home_ownershipRENT        -0.179675   0.032176  -5.584 2.35e-08 ***
## emp_length< 1 year         0.163174   0.053389   3.056 0.002241 **
## emp_length1 year           0.146157   0.059725   2.447 0.014398 * 
## emp_length2 years          0.235513   0.054922   4.288 1.80e-05 ***
## emp_length3 years          0.171877   0.055238   3.112 0.001861 **
## emp_length4 years          0.175489   0.058713   2.989 0.002800 **
## emp_length5 years          0.121583   0.058993   2.061 0.039306 * 
## emp_length6 years          0.126263   0.068585   1.841 0.065625 . 
## emp_length7 years          0.054891   0.073607   0.746 0.455833   
## emp_length8 years          0.124219   0.081139   1.531 0.125784   
## emp_length9 years          0.221098   0.090289   2.449 0.014334 * 
## emp_lengthn/a             -0.403023   0.081132  -4.968 6.78e-07 ***
## borrower_score             0.135297   0.112703   1.200 0.229954   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 32585  on 39785  degrees of freedom
## Residual deviance: 32144  on 39755  degrees of freedom
##   (3 observations deleted due to missingness)
## AIC: 32206
##
## Number of Fisher Scoring iterations: 9

The summary output overlaps with the printed model output below, so I am going to discuss both outputs together.

Let’s look at the model itself:


logistic_model

##
## Call:  glm(formula = loan_status ~ payment_inc_ratio + purpose + home_ownership +
##     emp_length + borrower_score, family = "binomial", data = loan_data_LR)
##
## Coefficients:
##               (Intercept)          payment_inc_ratio 
##                  2.119427                  -0.006445 
##        purposecredit_card  purposedebt_consolidation 
##                 -0.017888                  -0.401779 
##        purposeeducational    purposehome_improvement 
##                 -0.580027                  -0.178803 
##              purposehouse      purposemajor_purchase 
##                 -0.432073                   0.026757 
##            purposemedical              purposemoving 
##                 -0.428025                  -0.448911 
##              purposeother    purposerenewable_energy 
##                 -0.473573                  -0.640996 
##     purposesmall_business            purposevacation 
##                 -1.124679                  -0.287136 
##            purposewedding         home_ownershipNONE 
##                  0.033955                   8.740098 
##       home_ownershipOTHER          home_ownershipOWN 
##                 -0.352006                  -0.104654 
##        home_ownershipRENT         emp_length< 1 year 
##                 -0.179675                   0.163174 
##          emp_length1 year          emp_length2 years 
##                  0.146157                   0.235513 
##         emp_length3 years          emp_length4 years 
##                  0.171877                   0.175489 
##         emp_length5 years          emp_length6 years 
##                  0.121583                   0.126263 
##         emp_length7 years          emp_length8 years 
##                  0.054891                   0.124219 
##         emp_length9 years              emp_lengthn/a 
##                  0.221098                  -0.403023 
##            borrower_score 
##                  0.135297 
##
## Degrees of Freedom: 39785 Total (i.e. Null);  39755 Residual
##   (3 observations deleted due to missingness)
## Null Deviance:       32580
## Residual Deviance: 32140     AIC: 32210

In the output above, the first thing we see is the call. That is just R reminding us which model we ran, which options we specified, and so on.

Next we see the deviance residuals, which are a measure of model fit. That part of the output summarizes the distribution of the deviance residuals for the individual cases used in the model. The null and residual deviance at the bottom of the summary can also be used to assess the model fit.

The next portion of the output shows the coefficients, their standard errors, the z-statistics and the associated p-values.

Neither payment_inc_ratio (p ≈ 0.07) nor borrower_score (p ≈ 0.23) is statistically significant at the 5% level; payment_inc_ratio is significant only at the 10% level.

The regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable.

For every one-unit increase in payment_inc_ratio, the log odds of fully paid (versus charged off) decrease by 0.006445, so the relationship is negative.

For a one-unit increase in borrower_score, the log odds of the loan being fully paid increase by 0.135.

All loan purposes have negative coefficients other than wedding and major purchase. That seems a bit odd, because earlier we concluded that wedding loans looked riskier, but guess what, they are not!

For borrowers with emp_length n/a (treated here as unemployed), the log odds of full repayment are lower by 0.403023, so these loans are riskier.
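Log odds are hard to read directly, so it can help to exponentiate the coefficients into odds ratios. The sketch below copies a few coefficients by hand from the summary output above; the vector name log_odds is an assumption for illustration.

```r
# Coefficients copied from the summary output above (log-odds scale)
log_odds <- c(payment_inc_ratio     = -0.006445,
              purposesmall_business = -1.124679,
              "emp_lengthn/a"       = -0.403023,
              borrower_score        =  0.135297)

# exp() turns a log-odds coefficient into a multiplicative effect on the odds
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 3))
```

A value below 1 means the odds of a loan being fully paid shrink, and a value above 1 means they grow. With the fitted model available, exp(coef(logistic_model)) would give the odds ratios for all terms at once.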

Predicted Values

Let’s look at the predictions from the model logistic_model:


pred <- predict(logistic_model)
summary(pred)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.3589  1.6867  1.7936  1.8253  2.0496 10.5782

By default, predict() returns values on the log-odds scale. Converting these values to probabilities is a simple transform:


prob <- 1/(1 + exp(-pred))
summary(prob)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.5888  0.8438  0.8574  0.8575  0.8859  1.0000

These are on a scale from 0 to 1 and don’t yet declare whether the predicted value is default or paid off.

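Since the model estimates the probability of a loan being fully paid, we could declare any value greater than 0.5 as paid off and anything else as default. Note that the smallest predicted probability here is 0.5888, so with a 0.5 cut-off every loan would be classed as paid off.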
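Here is a hedged sketch of turning predicted probabilities into class labels. It uses simulated data (the toy data frame, the score variable and the 0.5 cut-off are all assumptions) and asks predict() for probabilities directly with type = "response".

```r
set.seed(1)
n <- 500
score <- runif(n)                                # hypothetical predictor
paid  <- rbinom(n, 1, plogis(-0.5 + 3 * score))  # 1 = fully paid, 0 = charged off
toy   <- data.frame(paid = paid, score = score)

fit <- glm(paid ~ score, family = binomial, data = toy)

# type = "response" returns probabilities, so no manual transform is needed
prob_paid <- predict(fit, type = "response")

# Apply the cut-off to get a predicted class for each row
pred_class <- ifelse(prob_paid > 0.5, "Fully Paid", "Charged Off")
print(table(actual = toy$paid, predicted = pred_class))
```

The probabilities from type = "response" are the same values we get by applying 1/(1 + exp(-pred)) to the default log-odds predictions.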

Model fit

For the model fit in “Logistic Regression and the GLM”, the regression coefficient for purpose_small_business is 1.21526. This means that a loan to a small business, compared to a loan to pay off credit card debt, increases the odds of defaulting versus being paid off by a factor of exp(1.21526)≈3.4.

Clearly, loans for the purpose of creating or expanding a small business are considerably riskier than other types of loans.

The variable borrower_score is a score on the borrowers’ creditworthiness and ranges from 0 (low) to 1 (high).

The odds of the best borrowers relative to the worst borrowers defaulting on their loans is smaller by a factor of exp(−4.61264)≈0.01.

In other words, the default risk from the borrowers with the poorest creditworthiness is 100 times greater than that of the best borrowers!
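The arithmetic behind these two statements can be checked in a couple of lines. The variable names are illustrative; the coefficients are the ones quoted in the text.

```r
# Odds ratio for small business loans (coefficient quoted in the text)
small_business_or <- exp(1.21526)
print(small_business_or)

# Odds ratio of default for the best (score 1) vs the worst (score 0) borrowers
score_or <- exp(-4.61264)
print(score_or)

# Inverting it gives how many times higher the odds are for the worst borrowers
print(1 / score_or)
```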

Bagging and the Random Forest

Bagging is like the basic algorithm for ensembles, except that, instead of fitting the various models to the same data, each new model is fit to a bootstrap resample.
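To make the idea concrete, here is a minimal bagging sketch on simulated data. It uses simple linear models instead of trees purely to stay dependency-free, and the data, the number of models and the variable names are all assumptions.

```r
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)            # simulated data, not the loan data
toy <- data.frame(x = x, y = y)
new_x <- data.frame(x = c(2, 5, 8))    # points to predict

n_models <- 25
preds <- sapply(seq_len(n_models), function(i) {
  # Bootstrap resample: same size as the data, drawn with replacement
  idx <- sample(nrow(toy), replace = TRUE)
  fit <- lm(y ~ x, data = toy[idx, ])
  predict(fit, newdata = new_x)
})

# The bagged prediction is the average over all resampled models
bagged <- rowMeans(preds)
print(bagged)
```

A random forest follows the same pattern with decision trees as the base model, and a majority vote instead of an average for classification.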

Random forests are an ensemble learning method for classification, regression and other tasks. They build a collection of decision trees at training time and output the class that is the mode of the individual trees’ classes (classification) or the mean of their predictions (regression).

The random forest is based on applying bagging to decision trees with one important extension: in addition to sampling the records, the algorithm also samples the variables.

In traditional decision trees, to determine how to create a subpartition of a partition A, the algorithm makes the choice of variable and split point by minimizing a criterion such as Gini impurity. With random forests, at each stage of the algorithm, the choice of variable is limited to a random subset of variables.
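As an illustration of the splitting criterion, the sketch below computes Gini impurity and picks the best split point for one variable. The function names gini and split_gini and the tiny data set are hypothetical; randomForest does this internally in compiled code.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class shares
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini impurity after splitting on x <= cut
split_gini <- function(x, labels, cut) {
  left  <- labels[x <= cut]
  right <- labels[x > cut]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

labels <- c("paid", "paid", "paid", "off", "off", "paid")
x      <- c(1, 2, 3, 4, 5, 6)
cuts   <- c(1.5, 2.5, 3.5, 4.5, 5.5)     # candidate split points

impurity <- sapply(cuts, function(k) split_gini(x, labels, k))
best_cut <- cuts[which.min(impurity)]
print(data.frame(cut = cuts, impurity = round(impurity, 4)))
print(best_cut)
```

A random forest repeats this search at every node, but only over the randomly chosen subset of variables.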

Let’s make the new data frame for the Random Forest analysis only. We’ll read the same data file that we’re using since the beginning.


library(dplyr)
loan_RF1 <- data.frame(read.csv("LoanStats2007_11.csv", skip=1))
loan_RF <- select(loan_RF1, 6,8,13,15,17,18,21,27)
loan_RF<-loan_RF[complete.cases(loan_RF),]
dim(loan_RF)

## [1] 39786     8

Random Forest handles both continuous and categorical predictors. We’ve picked the eight columns we need by position with the select function from the dplyr library.

We’ve cleaned the data again here and kept only the complete cases, meaning the rows without missing values.


set.seed(2)
s=sample(1:nrow(loan_RF),15000)
loan_train=loan_RF[s,]
loan_test=loan_RF[-s,]

We’ve set the seed so that the random sample, and therefore the split, is reproducible.

Then with the sample we’ve created 2 datasets, loan_train and loan_test. The 15,000 sampled rows (about 38% of the data) go to loan_train and the remaining 24,786 rows go to loan_test.
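If a fixed proportion is wanted instead of a fixed row count, the sample size can be derived from the number of rows. This is a sketch on a toy data frame; the 0.5 fraction and the names are assumptions.

```r
set.seed(2)
toy <- data.frame(id = 1:1000)          # stand-in for the loan data

train_frac <- 0.5                       # assumed share of rows for training
s <- sample(seq_len(nrow(toy)), size = floor(train_frac * nrow(toy)))

train <- toy[s, , drop = FALSE]
test  <- toy[-s, , drop = FALSE]

print(c(train = nrow(train), test = nrow(test)))
```

Negative indexing with -s guarantees that no row ends up in both sets.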

Dropping the empty level in the factor


loan_train$loan_status <- factor(loan_train$loan_status)

To avoid problems when using randomForest, we’ve used the code above to re-create loan_status as a factor. This drops the unused empty level so that the modeling function (Random Forest) treats the target correctly.
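The effect of re-creating the factor is easy to see on a small example. The vector below is made up for illustration.

```r
status <- factor(c("Fully Paid", "Charged Off", "", "Fully Paid"))
print(levels(status))            # includes the empty level ""

# Filtering rows does not remove the level itself
kept <- status[status != ""]
print(levels(kept))

# Re-creating the factor (or calling droplevels) drops unused levels
print(levels(factor(kept)))
print(levels(droplevels(kept)))
```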


library(randomForest)

rf.loan=randomForest(loan_train$loan_status~.,data = loan_train) # optionally add do.trace=T to print progress
rf.loan

##
## Call:
##  randomForest(formula = loan_train$loan_status ~ ., data = loan_train)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 14.55%
## Confusion matrix:
##             Charged Off Fully Paid  class.error
## Charged Off           5       2173 0.9977043159
## Fully Paid           10      12812 0.0007799095

Printing rf.loan shows a summary of the fitted random forest model.

Let’s talk about the formula in the output.

The Call line echoes how we invoked randomForest. We passed the formula as the first argument without naming it, and that works in R.

loan_train$loan_status (the loan_status variable from the loan_train dataset) is the target variable, which is the field we want to predict.

The fields after the ~ are the predictor variables and are used to predict the target variable. The . is shorthand for the remaining columns.
data = loan_train is the data frame containing the variables in the model.

The type is classification because the target is a factor with two classes.

By default, 500 trees are trained. There are seven variables in the predictor set, and at each split the algorithm considers a random subset of two of them.
Number of variables tried at each split: the number of predictor variables taken into consideration at each split within a tree. Because each split only sees a subset of the variables, the most influential variables cannot dominate every tree.
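For classification, randomForest defaults to trying roughly the square root of the number of predictors at each split. The helper below is a hypothetical re-statement of that rule, not a function from the package.

```r
# Default number of variables tried per split for classification forests
default_mtry <- function(p) max(floor(sqrt(p)), 1)

print(default_mtry(7))   # seven predictors, as in rf.loan
print(default_mtry(2))   # two predictors
```

The value can be overridden with the mtry argument of randomForest().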

The out-of-bag (OOB) estimate of error is the error rate for the trained models, applied to the data left out of the training set for that tree.

The OOB estimated error is 14.55%, which can be explained as


(10+2173)/15000

## [1] 0.1455333
  • where the 15000 is the value we used while creating sample s
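The same numbers, including the per-class errors, can be recomputed from the confusion matrix. The matrix below is typed in by hand from the output above; the object names are assumptions.

```r
# OOB confusion matrix from the rf.loan output (rows = actual, columns = predicted)
cm <- matrix(c(5, 2173,
               10, 12812),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("Charged Off", "Fully Paid"),
                             predicted = c("Charged Off", "Fully Paid")))

# Overall error: share of loans off the diagonal
oob_error <- 1 - sum(diag(cm)) / sum(cm)

# Class error: share of each actual class that was misclassified
class_error <- 1 - diag(cm) / rowSums(cm)

print(oob_error)
print(class_error)
```

The class errors show that the overall error hides a large gap between the two classes.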

The predicted values can be obtained from the predict function as follows:


rf.loan.pred=predict(rf.loan,newdata = loan_test)
t=table(loan_test$loan_status,rf.loan.pred)
t

##              rf.loan.pred
##               Charged Off Fully Paid
##                         0          0
##   Charged Off           3       3489
##   Fully Paid            9      21285

From the table above, the test error is the share of misclassified loans among the 24,786 test rows:


(3489+9)/24786

This gives an error of about 14.1% for the random forest model on the test data, close to the OOB estimate.

importance(rf.loan)

##                     MeanDecreaseGini
## term                        78.88750
## installment                628.20467
## home_ownership              75.20087
## verification_status         73.62028
## borrower_score             275.31055
## payment_inc_ratio          625.54102
## dti                        612.21564

head(getTree(rf.loan, 1))

##   left daughter right daughter split var split point status prediction
## 1             2              3         1      2.0000      1          0
## 2             4              5         6     12.0855      1          0
## 3             6              7         3      2.0000      1          0
## 4             8              9         4      6.0000      1          0
## 5            10             11         5      0.3750      1          0
## 6             0              0         0      0.0000     -1          2

The variable importance plot given below shows how important each variable was when classifying the data.

The Y-axis shows the predictor variables. The X-axis shows the mean decrease in Gini, a measure of how much each variable contributes to node purity across the trees.

Dropping the least important variables from the model can reduce computation time, often with little loss of accuracy.

However, it is very risky to have too few predictor variables, because then the model won’t be able to separate the classes accurately.


varImpPlot(rf.loan)
[Figure: variable importance plot for rf.loan]

As we can see in the plot above,

the variables installment, payment_inc_ratio and dti are the most important variables contributing to the output.

Our primary objective was to predict and avoid defaults, so sensitivity for the Charged Off class matters most. Here the model falls short: it classifies almost every loan as Fully Paid, and the class error for Charged Off is above 99%.
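Sensitivity for the Charged Off class can be checked against the test confusion matrix shown earlier. In this sketch the counts are copied by hand from that table, and the helper sens_spec is a hypothetical name.

```r
# Test confusion matrix for rf.loan (rows = actual, columns = predicted)
cm_test <- matrix(c(3, 3489,
                    9, 21285),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(actual = c("Charged Off", "Fully Paid"),
                                  predicted = c("Charged Off", "Fully Paid")))

# Sensitivity and specificity, treating `positive` as the class of interest
sens_spec <- function(cm, positive) {
  negative <- setdiff(rownames(cm), positive)
  c(sensitivity = cm[positive, positive] / sum(cm[positive, ]),
    specificity = cm[negative, negative] / sum(cm[negative, ]))
}

res <- sens_spec(cm_test, positive = "Charged Off")
print(round(res, 5))
```

Sensitivity here is the share of actual charged-off loans that the model flags as Charged Off.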

If we select only 3 columns, dti, payment_inc_ratio and loan_status, the results are somewhat different (no wonder, as there are just 3 variables!).


library(dplyr)
x <- data.frame(read.csv("LoanStats2007_11.csv", skip=1))
xRF <- select(x, 17,21,27)
xRF<-xRF[complete.cases(xRF),]
dim(xRF)

## [1] 39786     3
#xRF

Now we have the same number of rows, with just 3 columns.


set.seed(2)
s=sample(1:nrow(xRF),15000)
xtrain=xRF[s,]
xtest=xRF[-s,]

xtrain$loan_status <- factor(xtrain$loan_status)
library(randomForest)
rf.xloan=randomForest(xtrain$loan_status~.,data = xtrain)#U CAN USE ARGUMENT do.trace=T
rf.xloan

##
## Call:
##  randomForest(formula = xtrain$loan_status ~ ., data = xtrain)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
##
##         OOB estimate of  error rate: 16.28%
## Confusion matrix:
##             Charged Off Fully Paid class.error
## Charged Off          54       2124  0.97520661
## Fully Paid          318      12504  0.02480112

In this output, the class totals are the same as before (2,178 charged off and 12,822 fully paid), but the model now predicts Charged Off more often: 372 loans instead of 15.

The OOB estimate of error also increases, from 14.55% to 16.28%.


rf.xloan.pred=predict(rf.xloan,newdata = xtest)
xt=table(xtest$loan_status,rf.xloan.pred)
xt

##              rf.xloan.pred
##               Charged Off Fully Paid
##                         0          0
##   Charged Off          82       3410
##   Fully Paid          489      20805

The same pattern shows up on the test data: the model now catches 82 charged-off loans instead of 3, at the cost of misclassifying 489 fully paid loans instead of 9.

The beauty of this model probably!


importance(rf.xloan)

##                   MeanDecreaseGini
## payment_inc_ratio         1947.624
## dti                       1664.802

The MeanDecreaseGini values are a lot higher because no other variables share the contribution to node purity. These two do all the work by themselves!


head(getTree(rf.xloan, 1))

##   left daughter right daughter split var split point status prediction
## 1             2              3         1    0.706160      1          0
## 2             4              5         2   15.030000      1          0
## 3             6              7         1    1.259445      1          0
## 4             8              9         2    6.790000      1          0
## 5            10             11         2   15.825000      1          0
## 6            12             13         2    8.595000      1          0

varImpPlot(rf.xloan)
[Figure: variable importance plot for rf.xloan]

We can say this is a slightly closer view. In the previous graph, we saw payment_inc_ratio and dti as the most important variables contributing to the output along with installment, but in this one we can see the difference between the two. It’s like zooming into a picture!

Discussion and Conclusion

Finally, after working on this data for almost 2 weeks I feel like I have got the answers for my questions.

In this analysis, I have developed 2 models, using logistic regression and random forest, to predict whether a borrower will pay back the loan based on past data from Lending Club, and to help investors plan which investment approach to go for.

Most of the classification problems in the world are not balanced. Also, data sets almost always have NAs / missing values (otherwise they wouldn’t create a whole function for that!). In this analysis, I tried to deal with both missing values and imbalanced data sets.

After model building, R code, testing, plotting and data analysis, I’ve made the following conclusions:

In ggplot 1: Loans issued over the years, covering 2007 to 2011, the most loans were applied for in 2011. They were founded in 2006, so it took them almost 5 years to gain momentum. It also tells us they were gaining popularity in the market. They can be tough competition for traditional banks.

In ggplot 2: Loan Amount vs Funded Amount, we concluded that in a few cases they approved a lower amount than the amount applied for, and there could be multiple reasons for that. Mostly they approved loans for people with higher borrower scores, but in many cases they also approved loans for people with lower borrower scores.

In ggplot 3: Installment vs Funded Amount, the one interesting fact we discovered was that there was a range of installments for the same loan amount. That means some people paid higher installments and some people paid lower installments for the same amount of loan. Their borrower score could have been one of the reasons for that.

In many classification analyses, False Negatives can cost a lot more than False Positives, and that’s why we can lower the cut-off point to reduce the number of False Negatives.

The cut-off value (also known as the discrimination threshold) in logistic regression is a really important factor when investors decide which loans to invest in. Even if they decide to invest in a badly rated borrower, they can raise the installment amount based on this model, and the cut-off value will then help determine who gets a loan.
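The trade-off behind the cut-off can be sketched with simulated default probabilities. Everything here (the 15% default rate, the score distribution, the cut-offs) is an assumption for illustration, with a default treated as the positive class.

```r
set.seed(7)
n <- 1000
default <- rbinom(n, 1, 0.15)    # 1 = default, 0 = paid off (simulated)

# Hypothetical predicted default probabilities, higher on average for defaulters
p_default <- plogis(rnorm(n, mean = ifelse(default == 1, -1, -2.5)))

cutoffs <- c(0.5, 0.3, 0.2, 0.1)

# False negative: an actual default predicted as paid off
fn <- sapply(cutoffs, function(k) sum(default == 1 & p_default < k))
# False positive: a paid-off loan predicted as default
fp <- sapply(cutoffs, function(k) sum(default == 0 & p_default >= k))

print(data.frame(cutoff = cutoffs, false_negatives = fn, false_positives = fp))
```

Lowering the cut-off can only reduce (or keep) the number of false negatives, and it can only raise (or keep) the number of false positives, so the choice depends on which mistake is more costly.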

In ggplot 4: Annual Income vs Funded Amount, we tried to dig into the annual income of the borrowers and the amount funded to them. There were outliers. It is obvious that these models are not flawless, and that both showed similarly weak performance. That might be because of bad data, as I’ve pointed out in the ‘Post Office Example’.

After finding the outliers in ggplot 4, I removed them and created ggplot 5: Annual Income vs. Funded Amount. There were some extreme values in the dataset. We saw a linear relationship between a person’s annual income and the funded amount.

In ggplot 6: Distribution plot: Loan Amount vs Purpose for Loan Status, we saw the distribution of loans and their purpose by status. For house loans, more borrowers have the status fully paid than charged off. House loans are generally applied for by a family, so this might indicate that a house loan is more likely to be fully paid than other loans. In general, small business loans are larger and riskier. But in this case (if we compare charged off vs fully paid loans) wedding, debt consolidation and credit card loans seem like the riskiest loans.

In ggplot 7: Employment Status Listed In The Loan Application, we discovered that people with jobs got loans approved far more often than people without jobs. They did approve loans for unemployed people, but there could be multiple reasons for that. For example, the person might be a student, or might have just lost a job after working for several years.

From ggplot 10: Loan Status Bar Chart, we reached a conclusion about the percentage of defaulters. More than 10% of the borrowers were defaulters.

From ggplot 11: Home Ownership In The Loan Application, we concluded that most borrowers rent their home, followed by those with a mortgage. Few own their home outright. One explanation is that someone with enough money to own a house (especially in the USA) is less likely to need a loan. But that’s not always true, as we saw in the post office example!

From logistic regression we concluded that classification accuracy is a misleading measure when the classes are extremely imbalanced, which is a very common situation in the banking sector and can lead to wrong conclusions. Take the “Post Office Example”: what if that is real data? It must be counted as an outlier. So it is wise to evaluate these situations with a ROC curve, in which we can see the true positive rate against the false positive rate.
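A ROC curve can be computed in base R without extra packages. This is a minimal sketch under the assumption of continuous scores without ties; the helper names roc_points and auc and the simulated scores are illustrative.

```r
# ROC points: sort by score, then accumulate true and false positives
roc_points <- function(score, positive) {
  ord <- order(score, decreasing = TRUE)
  pos <- positive[ord]
  data.frame(fpr = c(0, cumsum(!pos) / sum(!pos)),
             tpr = c(0, cumsum(pos) / sum(pos)))
}

# Area under the curve by the trapezoidal rule
auc <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

set.seed(11)
positive <- rep(c(TRUE, FALSE), times = c(150, 850))   # imbalanced classes
score <- rnorm(1000, mean = ifelse(positive, 1, 0))    # simulated model scores

roc <- roc_points(score, positive)
print(auc(roc))
plot(roc$fpr, roc$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # reference line for a random classifier
```

Unlike plain accuracy, the curve and its area do not depend on a single cut-off, which makes it a better fit for imbalanced data like this.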

For a one-unit increase in borrower_score, the log odds of the loan being fully paid increase by 0.135. This is useful for investors: they can decide what bad rate they are willing to accept in their portfolio and, based on that, easily decide what percentage of the loans they are willing to finance.

Borrowers with emp_length n/a (treated here as unemployed) have lower log odds of full repayment, by 0.403023. That fits the trend we saw in ggplot 7: Employment Status Listed In The Loan Application, where people with jobs got loans approved far more often than people without jobs.

From model fit, we discovered that loans for the purpose of creating or expanding a small business are considerably riskier than other types of loans.

In the Random Forest model we concluded that installment, payment_inc_ratio and dti are the most important variables contributing to the output. There will always be value in any model that can help anticipate future investment outcomes. LC members who use this model to make financial decisions are by no means guaranteed certain results, but a model like this will hopefully help them sift through the different loans they might be examining so that they can lend to the most promising ones.

This is a very important conclusion because it applies to our daily life too. We can easily make negative decisions in life. In the same way, loan risk models demonstrate outstanding performance in predicting true negative cases, but not so outstanding performance in predicting true positive cases, due to the extreme class imbalance. But this can be improved by applying over- or under-sampling or synthetic data generation.
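As one example of rebalancing, here is a sketch of random under-sampling in base R. The toy data and names are assumptions, and this was not part of the analysis above.

```r
set.seed(3)
toy <- data.frame(
  status = factor(c(rep("Charged Off", 150), rep("Fully Paid", 850))),
  x = rnorm(1000)
)
print(table(toy$status))

# Size of the smallest class
n_min <- min(table(toy$status))

# Draw n_min row indices from each class, without replacement
idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$status),
                     function(i) sample(i, n_min)))

balanced <- toy[idx, ]
print(table(balanced$status))
```

Under-sampling throws away majority-class rows, so it should only be applied to the training data, never to the test set.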

I also want to point out the very popular method ‘Survival Analysis’, where we can model how the probability of charge-off or default changes over time. That would be very interesting because people change and their habits change. They might not have been a good borrower in the past, but they can be in the future. So we need to include that change over time, and we need to add more variables about their behavior too, as they can help us make a better decision about the investment.


