Introduction
About Lending Club
Lending Club is a peer-to-peer lending company that matches borrowers with investors through a web platform. It is the world's largest online marketplace connecting borrowers and investors.
Lending Club aims to transform the banking system by making credit more affordable and investing more rewarding. It operates at a lower cost than traditional bank lending programs and passes the savings on to borrowers in the form of lower rates and to investors in the form of solid risk-adjusted returns.
How does it work?
Lending Club serves people who need personal loans between $1,000 and $40,000. Borrowers receive the full amount of the issued loan minus an origination fee, which is paid to the company. Investors buy notes backed by the personal loans and pay Lending Club a service fee. The company publishes data about the loans issued through its platform for specific periods, such as a quarter or a year.
About The Data
The Lending Club releases data every quarter. The published data currently runs through 2018 Q3 and dates back to 2007. It comes in four files updated every quarter, released on the same day as the company's quarterly results.
The files cover nearly all the loans issued by LC; the only loans missing are the few for which Lending Club was not authorized to release the details publicly.
The information available for each loan includes all the details at the time of issuance, along with the loan's updated status: how much principal has been paid so far, the interest, whether the loan was fully paid or charged off, whether the borrower is late on payments, and so on.
Modeling credit risk for both personal and business loans is of utmost importance for financial institutions. The probability that a borrower will default is a key input to any credit risk measure.
This analysis will focus on the Loan Data from 2007 to 2011.
The data regarding this project can be accessed here.
Obtain the data
loan_data1a <- data.frame(read.csv("LoanStats2007_11.csv", stringsAsFactors = F, skip=1))
Scrub the data
I've already installed all the packages required for the analysis, and I'm going to load the libraries in the code chunk below:
library(zoo)
library(ggplot2)
library(scales)
library(extrafont)
library(mapproj)
library(randomForest)
library(lubridate)
library(dplyr)
library(tidyr)
I've already skipped the first row in the previous code chunk because it was just a source URL for the data.
The raw file has more than 140 columns (variables), and we don't need all of them for this analysis.
I'm going to drop the unnecessary columns with the select() function and keep only the ones that matter for our analysis.
Using the dplyr package, I select the columns we need by position:
loan_data <- dplyr::select(loan_data1a, 3:4, 6:9, 11:18, 21, 23, 26:30, 33:37)
Alternatively, we can select the same columns by name:
loan_data <- select(loan_data1a, loan_amnt, funded_amnt, term, int_rate, installment, emp_title, emp_length, home_ownership, annual_inc, verification_status, issue_d, borrower_score, payment_inc_ratio, purpose, addr_state, dti, delinq_2yrs, earliest_cr_line, inq_last_6mths, open_acc, pub_rec, revol_bal, revol_util, total_acc)
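Before deciding how to handle missing values, it can help to count the NAs in each selected column; a quick sketch (not part of the original output):
colSums(is.na(loan_data))  # number of missing values per column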
If we wanted to remove the rows with missing values from loan_data, we could use either of the following:
loan_data <- na.omit(loan_data)
OR
loan_data <- loan_data %>% drop_na()
Both drop every row that contains an NA. In this analysis I keep those rows (the later summaries still show a few NA's, and the models deal with them explicitly). A related sanity check is to look at the maximum of each column while ignoring the NA's:
apply(loan_data, 2, max, na.rm = TRUE)
This does not remove anything; it simply skips the NA's in each column and returns the column maxima shown below.
NOTE: The lines starting with ## are the results of a code chunk.
## loan_amnt funded_amnt term
## " 9975" " 9975" " 60 months"
## int_rate installment grade
## "9.99%" " 99.99" "G"
## emp_title emp_length home_ownership
## "Zynga" "n/a" "RENT"
## annual_inc verification_status issue_d
## " 99999.00" "Verified" "Sep-11"
## loan_status borrower_score payment_inc_ratio
## "Fully Paid" "1.00" " 9.9992000"
## purpose addr_state dti
## "wedding" "WY" " 9.99"
## delinq_2yrs earliest_cr_line inq_last_6mths
## " 9" "Sep-99" " 8"
## open_acc pub_rec revol_bal
## " 9" " 4" " 9998"
## revol_util total_acc
## "99.90%" "90"
Explore the Data
This section defines each column and describes the data, including the data type (e.g., numeric, character) and the range of values (e.g., minimums and maximums), using functions such as class(), str(), and summary().
class(loan_data)
## [1] "data.frame"
The code above returns the class of an object (in this case loan_data), and as we can see the result is data.frame.
str(loan_data)
## 'data.frame': 39789 obs. of 26 variables:
## $ loan_amnt : int 500 500 500 500 500 700 725 750 800 900 ...
## $ funded_amnt : int 500 500 500 500 500 700 725 750 800 900 ...
## $ term : chr " 36 months" " 36 months" " 36 months" " 36 months" ...
## $ int_rate : chr "9.76%" "10.71%" "10.46%" "11.41%" ...
## $ installment : num 16.1 16.3 16.2 16.5 15.7 ...
## $ grade : chr "B" "B" "B" "C" ...
## $ emp_title : chr "Hughes, Hubbard & Reed LLP" "" "THe University of Illinois" "Global Travel International -and- Global Domains International" ...
## $ emp_length : chr "7 years" "< 1 year" "3 years" "< 1 year" ...
## $ home_ownership : chr "MORTGAGE" "MORTGAGE" "MORTGAGE" "RENT" ...
## $ annual_inc : num 59000 7904 26000 19500 18000 ...
## $ verification_status: chr "Not Verified" "Not Verified" "Not Verified" "Not Verified" ...
## $ issue_d : chr "Mar-08" "Jan-08" "Jan-08" "Jan-08" ...
## $ loan_status : chr "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
## $ borrower_score : num 0.35 0.65 0.55 0.75 0.5 0.4 0.7 0.45 0.4 0.45 ...
## $ payment_inc_ratio : num 9.546 1.046 0.189 0.573 1.172 ...
## $ purpose : chr "other" "vacation" "small_business" "other" ...
## $ addr_state : chr "NY" "CA" "IL" "VA" ...
## $ dti : num 22.17 3.04 14.17 3.69 4.27 ...
## $ delinq_2yrs : int 0 1 0 0 0 0 0 0 0 0 ...
## $ earliest_cr_line : chr "Aug-95" "Feb-89" "Jul-94" "Nov-83" ...
## $ inq_last_6mths : int 0 2 0 0 0 1 0 0 0 2 ...
## $ open_acc : int 9 3 8 8 4 4 4 8 8 2 ...
## $ pub_rec : int 0 0 0 1 0 0 0 0 0 0 ...
## $ revol_bal : int 65414 44 5643 12229 0 0 1814 12220 19901 167 ...
## $ revol_util : chr "47.80%" "3.70%" "60.70%" "90.60%" ...
## $ total_acc : int 26 6 28 15 4 4 10 8 15 4 ...
This is perhaps the most useful function in R.
The str() function provides valuable information about the structure of the loan_data object, much like the Variable View tab in SPSS.
It shows the class of the object as well as the number of observations (39,789 in our example) and the number of variables (26 in our selected data.frame).
For each variable it reports the type (integer, character, numeric, and so on) and prints the first few values.
For numeric analysis, we will convert some of the character columns to numeric; for example, int_rate is stored as a character such as "10.5%" and needs to become the numeric value 0.105.
loan_data$term <- as.numeric(substr(loan_data$term, 1,3))
loan_data$emp_length <- as.numeric(substr(loan_data$emp_length, 1,2))
loan_data$int_rate <- as.numeric(gsub("%", "", loan_data$int_rate)) / 100
loan_data$revol_util <- as.numeric(gsub("%", "", loan_data$revol_util)) / 100
Let’s check again:
str(loan_data)
## 'data.frame': 39789 obs. of 26 variables:
## $ loan_amnt : int 500 500 500 500 500 700 725 750 800 900 ...
## $ funded_amnt : int 500 500 500 500 500 700 725 750 800 900 ...
## $ term : num 36 36 36 36 36 36 36 36 36 36 ...
## $ int_rate : num 0.0976 0.1071 0.1046 0.1141 0.0807 ...
## $ installment : num 16.1 16.3 16.2 16.5 15.7 ...
## $ grade : chr "B" "B" "B" "C" ...
## $ emp_title : chr "Hughes, Hubbard & Reed LLP" "" "THe University of Illinois" "Global Travel International -and- Global Domains International" ...
## $ emp_length : num 7 NA 3 NA NA NA 1 NA 2 4 ...
## $ home_ownership : chr "MORTGAGE" "MORTGAGE" "MORTGAGE" "RENT" ...
## $ annual_inc : num 59000 7904 26000 19500 18000 ...
## $ verification_status: chr "Not Verified" "Not Verified" "Not Verified" "Not Verified" ...
## $ issue_d : chr "Mar-08" "Jan-08" "Jan-08" "Jan-08" ...
## $ loan_status : chr "Fully Paid" "Fully Paid" "Fully Paid" "Fully Paid" ...
## $ borrower_score : num 0.35 0.65 0.55 0.75 0.5 0.4 0.7 0.45 0.4 0.45 ...
## $ payment_inc_ratio : num 9.546 1.046 0.189 0.573 1.172 ...
## $ purpose : chr "other" "vacation" "small_business" "other" ...
## $ addr_state : chr "NY" "CA" "IL" "VA" ...
## $ dti : num 22.17 3.04 14.17 3.69 4.27 ...
## $ delinq_2yrs : int 0 1 0 0 0 0 0 0 0 0 ...
## $ earliest_cr_line : chr "Aug-95" "Feb-89" "Jul-94" "Nov-83" ...
## $ inq_last_6mths : int 0 2 0 0 0 1 0 0 0 2 ...
## $ open_acc : int 9 3 8 8 4 4 4 8 8 2 ...
## $ pub_rec : int 0 0 0 1 0 0 0 0 0 0 ...
## $ revol_bal : int 65414 44 5643 12229 0 0 1814 12220 19901 167 ...
## $ revol_util : num 0.478 0.037 0.607 0.906 0 NA 0.07 0.849 0.298 0.033 ...
## $ total_acc : int 26 6 28 15 4 4 10 8 15 4 ...
The following are the variables in the data we selected.
The definitions are given from Lending Club Data Dictionary which can be accessed online from the same loan data source here.
Dependent Variable
“loan_status”: Current status of the loan, e.g. the loan was fully paid or charged off (the borrower defaulted)
Independent Variables
“loan_amnt”: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
funded_amnt: The total amount committed to that loan at that point in time.
“term”: The number of payments on the loan. Values are in months and can be either 36 or 60.
int_rate: Interest Rate on the loan
installment: The monthly payment owed by the borrower if the loan originates.
emp_title: The job title supplied by the Borrower when applying for the loan.
“emp_length”: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
“home_ownership”: The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER
“annual_inc”: The self-reported annual income provided by the borrower during registration.
grade: LC assigned loan grade.
verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified
issue_d: The month which the loan was funded
“borrower_score”: Credit score of the borrower.
payment_inc_ratio: Borrower’s payment to income ratio
“purpose”: A category provided by the borrower for the loan request.
addr_state: The state provided by the borrower in the loan application
“dti”: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
“delinq_2yrs”: The number of 30+ days past-due incidences of delinquency in the borrower’s credit file for the past 2 years
earliest_cr_line: The month the borrower’s earliest reported credit line was opened
inq_last_6mths: The number of inquiries in the past 6 months (excluding auto and mortgage inquiries)
“open_acc”: The number of open credit lines in the borrower’s credit file.
“pub_rec”: The borrower’s number of critical public records (bankruptcy filings or tax liens)
“revol_bal”: Total credit revolving balance
“revol_util”: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
Now let's look at summary statistics for the whole data frame (note that a few NA's remain):
summary(loan_data)
## loan_amnt funded_amnt term int_rate
## Min. : 500 Min. : 500 Min. :36.00 Min. :0.0542
## 1st Qu.: 5500 1st Qu.: 5400 1st Qu.:36.00 1st Qu.:0.0925
## Median :10000 Median : 9650 Median :36.00 Median :0.1186
## Mean :11231 Mean :10959 Mean :42.45 Mean :0.1203
## 3rd Qu.:15000 3rd Qu.:15000 3rd Qu.:60.00 3rd Qu.:0.1459
## Max. :35000 Max. :35000 Max. :60.00 Max. :0.2459
## NA's :3 NA's :3 NA's :3 NA's :3
## installment grade emp_title emp_length
## Min. : 15.69 Length:39789 Length:39789 Min. : 1.000
## 1st Qu.: 167.08 Class :character Class :character 1st Qu.: 3.000
## Median : 280.61 Mode :character Mode :character Median : 5.000
## Mean : 324.73 Mean : 5.644
## 3rd Qu.: 430.78 3rd Qu.:10.000
## Max. :1305.19 Max. :10.000
## NA's :3 NA's :5671
## home_ownership annual_inc verification_status
## Length:39789 Min. : 4000 Length:39789
## Class :character 1st Qu.: 40500 Class :character
## Mode :character Median : 59000 Mode :character
## Mean : 68979
## 3rd Qu.: 82342
## Max. :6000000
## NA's :3
## issue_d loan_status borrower_score payment_inc_ratio
## Length:39789 Length:39789 Min. :0.0500 Min. : 0.04889
## Class :character Class :character 1st Qu.:0.4000 1st Qu.: 4.36452
## Mode :character Mode :character Median :0.5000 Median : 7.01757
## Mean :0.4991 Mean : 7.61661
## 3rd Qu.:0.6000 3rd Qu.:10.33960
## Max. :1.0000 Max. :43.54560
## NA's :3 NA's :3
## purpose addr_state dti delinq_2yrs
## Length:39789 Length:39789 Min. : 0.00 Min. : 0.0000
## Class :character Class :character 1st Qu.: 8.18 1st Qu.: 0.0000
## Mode :character Mode :character Median :13.41 Median : 0.0000
## Mean :13.32 Mean : 0.1465
## 3rd Qu.:18.60 3rd Qu.: 0.0000
## Max. :29.99 Max. :11.0000
## NA's :3 NA's :3
## earliest_cr_line inq_last_6mths open_acc pub_rec
## Length:39789 Min. :0.000 Min. : 2.000 Min. :0.00000
## Class :character 1st Qu.:0.000 1st Qu.: 6.000 1st Qu.:0.00000
## Mode :character Median :1.000 Median : 9.000 Median :0.00000
## Mean :0.869 Mean : 9.294 Mean :0.05514
## 3rd Qu.:1.000 3rd Qu.:12.000 3rd Qu.:0.00000
## Max. :8.000 Max. :44.000 Max. :4.00000
## NA's :3 NA's :3 NA's :3
## revol_bal revol_util total_acc
## Min. : 0 Min. :0.0000 Min. : 2.00
## 1st Qu.: 3704 1st Qu.:0.2540 1st Qu.:13.00
## Median : 8860 Median :0.4930 Median :20.00
## Mean : 13392 Mean :0.4886 Mean :22.09
## 3rd Qu.: 17065 3rd Qu.:0.7240 3rd Qu.:29.00
## Max. :149588 Max. :0.9990 Max. :90.00
## NA's :3 NA's :53 NA's :3
Descriptive (summary) statistics include measures such as the mean, standard deviation, range, and percentiles; summary() reports the range, quartiles, mean and median (but not the standard deviation).
For each numerical variable, the code above prints:
- the range (minimum and maximum values),
- the 1st and 3rd quartiles (quartiles are the three cut points that divide a dataset into four equal-sized groups),
- the mean (average), and
- the median (the middle value separating the higher half from the lower half).
For example, for the first variable, loan_amnt, Lending Club funded loans ranging from 500 to 35,000, as the Min and Max values show. The mean was 11,231 and the median 10,000; the 1st quartile was 5,500 and the 3rd quartile 15,000. (Note: all values are in dollars.)
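summary() does not report the standard deviation; a small sketch of how to get it (and the quartiles) for a single column such as loan_amnt, with the exact values not shown in the original output:
sd(loan_data$loan_amnt, na.rm = TRUE)        # standard deviation of the loan amount
quantile(loan_data$loan_amnt, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)  # the three quartiles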
Scatter plot with ggplot2 (including graphs)
In which month and year did people apply for loans the most?
ggplot 1: Loans issued over the years [2007 – 2011]
library(lubridate)
loan_data$issue_d <- dmy(paste0("01-",loan_data$issue_d))
loan_amnt_by_month <- aggregate(loan_amnt ~ issue_d, data = loan_data, sum)
ggplot(loan_amnt_by_month, aes(issue_d, loan_amnt)) + geom_bar(stat = "identity") + labs(title = 'ggplot 1: Loans issued over the years')

In the 2007-2011 data, most loans were issued in 2011. The company was founded in 2006, so it took almost five years to gain momentum.
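To answer the "which month" part of the question directly, we can look up the month with the largest issued amount in the aggregated data; a quick sketch (the exact month is not shown in the original output):
loan_amnt_by_month[which.max(loan_amnt_by_month$loan_amnt), ]  # month with the largest total issued amount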
Is there any relationship between loan amount requested and the loan amount granted?
ggplot 2: Loan Amount vs Funded Amount
ggplot(loan_data, aes(loan_amnt, funded_amnt)) +
geom_point(aes(colour = borrower_score)) +
labs(title = 'ggplot 2: Loan Amount vs. Funded Amount') +
geom_smooth()

This scatter plot shows the loan amount applied for versus the funded amount.
I've used two variables, loan_amnt and funded_amnt, and mapped borrower_score to the colour aesthetic to see how borrowers with different scores fall across the funded amounts.
As the plot shows, most of the time the full amount applied for was approved.
In a few cases the approved amount was reduced, and there could be multiple reasons for that.
Loans were mostly approved for people with higher borrower scores, but plenty of loans also went to people with lower scores.
What were the installments for the funded loans?
ggplot 3: Installment vs Funded Amount
ggplot(loan_data, aes(installment, funded_amnt)) +
geom_point(aes(colour = borrower_score)) +
labs(title = 'ggplot 3: Installment vs. Funded Amount') +
geom_smooth()

This scatter plot shows the installment against the funded amount.
I've used two variables, installment and funded_amnt, with borrower_score again mapped to the colour aesthetic.
As we can clearly see, installments are lower for smaller loans and higher for larger loans, and the smooth curve supports this.
One interesting fact is that there is a range of installments for the same loan amount: some people paid higher installments and some paid lower installments on the same amount. Their borrower score could be one of the reasons for that.
Also, most of the higher installments belong to people with lower borrower scores, but overall we don't see a huge difference between the installments of low-score and high-score borrowers.
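Much of that spread is explained by the term and the interest rate: personal-loan installments typically follow the standard amortization (annuity) formula. A small sketch, assuming Lending Club uses this standard formula (for the first loan, $500 at 9.76% over 36 months, it gives about $16, matching the recorded installment of 16.1):
# amortized monthly payment for principal p, annual rate r and n monthly payments
monthly_payment <- function(p, r, n) p * (r / 12) / (1 - (1 + r / 12)^(-n))
# compare the recorded installments with the computed ones for the first few loans
head(data.frame(recorded = loan_data$installment,
                computed = monthly_payment(loan_data$funded_amnt, loan_data$int_rate, loan_data$term)))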
I want to call this “A Post Office Example”
Is there any relationship between the funded amount and the annual income?
ggplot 4: Annual Income vs Funded Amount
ggplot(loan_data, aes(annual_inc, funded_amnt)) +
geom_point(aes(colour = borrower_score)) +
labs(title = 'ggplot 4: Annual Income vs. Funded Amount') +
geom_smooth()

This scatter plot shows the funded amount against the borrower's annual income.
I've used two variables, annual_inc and funded_amnt, with borrower_score mapped to the colour aesthetic as before.
It looks like there are some really high income borrowers still applying for loans and borrowing money!
And by that I am talking about super rich people! This one person has an annual income of
max(loan_data$annual_inc,na.rm = TRUE)
## [1] 6e+06
That is a whopping $6,000,000!
It’s Unbelievable! An annual income of $6 million. That is like tier 3 position in top 10 public companies… who is this person?
loan_data[which(loan_data$annual_inc == max(loan_data$annual_inc,na.rm = TRUE)),]$emp_title
## [1] "post office"
What? A post office guy??? I don't believe this is valid data.
Apparently Lending Club didn't do a very good job of maintaining this information.
Let's see what some other high-income borrowers' employer entries look like.
loan_data[which(loan_data$annual_inc > 1000000),]$emp_title
## [1] "Montgomery ISD"
## [2] "St. John Lutheran Church"
## [3] "post office"
## [4] "Dept of army"
## [5] "Hewlett Packard"
## [6] "Lockheed Martin"
## [7] "at&t wireless"
## [8] "Convent of the Sacred Heart"
## [9] "WCP"
## [10] "TelSource Corp"
## [11] "NYCDOE"
## [12] "Stryker Instruments"
## [13] "Avis Budget Group"
## [14] "Lea Regional Hospital/Pecos Valley"
Well, most people earning a million dollars have job titles you would expect for that income, such as managing director, portfolio manager or managing partner, but here we see a post office? NYCDOE? WCP?
I hope there are only a handful of cases of bad data, so to move forward with the analysis I'm going to drop borrowers with an annual income greater than $500k.
loan_data2 <- filter(loan_data,annual_inc<500000)
Let’s replot the same chart:
ggplot(loan_data2, aes(annual_inc, funded_amnt)) +
geom_point(aes(colour = borrower_score)) +
labs(title = 'ggplot 5: Annual Income vs. Funded Amount') +
geom_smooth()

Much better! Lending Club caps the loan amount, so we don't see any funded amount above the cap (the maximum in this 2007-2011 data is $35k).
Overall there is a roughly linear relationship between annual income and funded amount for incomes below $100k; above that, the regression line flattens out because of the hard cap on the loan amount.
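As a quick numeric check of that relationship (not run in the original analysis), we can compute the correlation between income and funded amount for borrowers earning under $100k:
with(subset(loan_data2, annual_inc < 100000),
     cor(annual_inc, funded_amnt, use = "complete.obs"))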
How were the loan amounts distributed?
ggplot 6: Distribution plot: Loan Amount vs Purpose for Loan Status
In this box plot, we're going to look at the distribution of loan amounts for the different purposes, split by loan status.
We’re going to use three variables, loan_amnt, loan_status and purpose.
ggplot(loan_data, aes(purpose, loan_amnt, fill = loan_status)) + geom_boxplot() + labs(title = 'ggplot 6: Loan Amount vs Purpose for Loan Status') + theme(axis.text.x=element_text(size=8, angle = 90))

As the plot shows, the largest loan amounts were requested for small business, followed by credit card and debt consolidation.
More house loans appear with the status fully paid than charged off.
House loans are generally taken out by a family, so this might indicate that the chances of a house loan being fully paid are higher than for other loans.
In general, small business loans are larger and riskier loans.
But in this case (comparing the charged off and fully paid distributions), wedding, debt consolidation and credit card loans look like the riskiest loans.
summary(loan_data$funded_amnt)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 500 5400 9650 10959 15000 35000 3
With the code above, we can also easily check the stats for a single variable.
What was the most common loan term?
table(loan_data$term)
##
## 36 60
## 29096 10690
The vast majority are 36 months.
How many borrowers had 10 or more years of employment?
table(loan_data$emp_length)
##
## 1 2 3 4 5 6 7 8 9 10
## 3247 4394 4098 3444 3286 2231 1775 1485 1259 8899
8,899 borrowers had 10 or more years of employment.
Did they consider granting loans to unemployed people? Did they consider number of years of employment?
ggplot 7: Let's plot all borrowers by employment length using ggplot and geom_bar.
ggplot(loan_data, aes(emp_length)) + geom_bar(fill = "dodgerblue") + ggtitle("ggplot 7: Employment Status Listed In The Loan Application")

Let’s make that bar chart more meaningful.
library(ggplot2)
library(scales)
# emp_length was converted to numeric earlier, so recode from the original character
# values in loan_data1a (the data still has all 39,789 rows, so they line up)
emp_chr <- loan_data1a$emp_length
loan_data$employ <- NA
loan_data$employ[emp_chr == "n/a"] <- "Unemployed"
loan_data$employ[emp_chr %in% c("< 1 year", "1 year")] <- "Less than 2 years"
loan_data$employ[emp_chr %in% c("2 years", "3 years", "4 years")] <- "2-4 years"
loan_data$employ[emp_chr %in% c("5 years", "6 years", "7 years", "8 years", "9 years")] <- "5-9 years"
loan_data$employ[emp_chr == "10+ years"] <- "10+ years"
loan_data$employ <- factor(loan_data$employ,
levels = c("Unemployed",
"Less than 2 years",
"2-4 years",
"5-9 years",
"10+ years"))
ggplot(loan_data, aes(x = factor(employ))) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue") +
scale_y_continuous(labels = percent) +
labs(x = "Employment Status", y = "") +
ggtitle("ggplot 8: Employment Status Listed In The Loan Application")

We've collapsed the employment-length values into a handful of categories just for this analysis.
Then we've plotted it with ggplot, using geom_bar to make the bar chart.
scale_y_continuous(labels = percent) formats the continuous y axis as percentages, while the x axis shows the employment status.
From the bar chart, we can see that the largest group of borrowers had 2-4 years of experience.
Also, loans were approved for comparatively few unemployed people.
How many defaulters were there overall?
Let’s see the same type of analysis for Loan Status
First of all, let's look at the different loan statuses and the number of people with each status.
table(loan_data$loan_status)
##
## Charged Off Fully Paid
## 3 5670 34116
There are just two statuses: Charged Off (the borrower defaulted) and Fully Paid; the three values without a label are blank entries.
Let’s plot it with the code below:
ggplot(loan_data, aes(loan_status)) + geom_bar(fill = "dodgerblue") + ggtitle("ggplot 9: Loan Status Bar Chart")

Let’s make that bar chart more meaningful.
loan_data$status <- loan_data$loan_status  # copy the status so we can relabel it
loan_data$status[loan_data$status == "Charged Off"] <- "Default"
loan_data$status[loan_data$status == "Fully Paid"] <- "Good Credit"
ggplot(loan_data, aes(x = factor(status))) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill = "dodgerblue") +
scale_y_continuous(labels = percent) +
labs(x = "Loan Status", y = "") +
ggtitle("ggplot 10: Loan Status Bar Chart")

So, we've renamed the two loan statuses just for this analysis.
Then we've plotted it with ggplot, using geom_bar to make the bar chart.
scale_y_continuous(labels = percent) formats the continuous y axis as percentages, while the x axis shows the loan status.
This is matured loan data, which is why we only see two statuses (Charged Off and Fully Paid). If we used current data, we would also see other statuses such as Current.
As we can see in the bar chart above, more than 10% of the borrowers were defaulters.
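Using the counts from the table shown earlier, the exact share works out as follows:
5670 / (5670 + 34116)  # share of Charged Off loans: about 0.1425, i.e. roughly 14%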
Did more people own their home or rent?
Home Ownership In The Loan Application
ggplot(loan_data, aes(home_ownership)) + geom_bar(fill = "dodgerblue") + ggtitle("ggplot 11: Home Ownership In The Loan Application")

We've plotted the home_ownership variable using ggplot and geom_bar.
It is obvious from the graph that most borrowers rented their home, followed by those with a mortgage; few owned their home outright.
And it seems obvious that if you have the money to own a house (especially in the USA), you'd have money for other purposes too; in other words, you wouldn't need loans. But that's not always true!
Logistic Regression on Loan Data
Logistic regression is used for discrete (qualitative) responses, also referred to as categorical variables.
It models the probability that the response belongs to a category rather than predicting the outcome (default or not) directly.
The logistic function ensures that those probabilities stay between 0 and 1. We are going to use stats::glm().
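As a minimal sketch of that transform, the logistic (sigmoid) function maps any real-valued linear predictor onto a probability between 0 and 1:
logistic <- function(z) 1 / (1 + exp(-z))
logistic(c(-5, 0, 5))  # approximately 0.007, 0.500, 0.993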
We need to keep in mind that a call to stats::glm() will not print all of the model statistics by default. We can review the object's attributes with base::attributes().
We'll also use base::summary(), a generic function that produces summaries of the results of different model-fitting functions. For a glm object it returns model statistics such as:
- z statistic
- deviance residuals
- p-value
- coefficients
- AIC
- standard error and some more.
To use the glm() function we need the loan status and the categorical predictors stored as factors, so I am going to read the main CSV file again with stringsAsFactors = TRUE and create a new data frame, loan_data_LR:
loan_data_LR <- data.frame(read.csv("LoanStats2007_11.csv", stringsAsFactors = T, skip=1))
Now let’s apply the glm function on the new data frame:
logistic_model <- glm(formula = loan_status ~ payment_inc_ratio + purpose + home_ownership + emp_length + borrower_score, family = "binomial", data = loan_data_LR)
class(logistic_model)
## [1] "glm" "lm"
Of course the class for the model would be glm.
This dataset has a binary response (dependent) variable called loan_status.
There are five predictor variables: payment_inc_ratio, purpose, home_ownership, emp_length and borrower_score.
We treat payment_inc_ratio and borrower_score as continuous; emp_length ranges from unemployed to more than 10 years of job experience, and purpose and home_ownership take discrete values.
Let’s check the attributes for this model:
attributes(logistic_model)
## $names
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "na.action" "call" "formula"
## [25] "terms" "data" "offset"
## [28] "control" "method" "contrasts"
## [31] "xlevels"
##
## $class
## [1] "glm" "lm"
There are 31 attributes.
Let’s review the summary for this model:
summary(logistic_model)
##
## Call:
## glm(formula = loan_status ~ payment_inc_ratio + purpose + home_ownership +
## emp_length + borrower_score, family = "binomial", data = loan_data_LR)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2458 0.4649 0.5350 0.5771 1.0293
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.119427 0.109311 19.389 < 2e-16 ***
## payment_inc_ratio -0.006445 0.003562 -1.809 0.070381 .
## purposecredit_card -0.017888 0.095320 -0.188 0.851141
## purposedebt_consolidation -0.401779 0.086436 -4.648 3.35e-06 ***
## purposeeducational -0.580027 0.169411 -3.424 0.000618 ***
## purposehome_improvement -0.178803 0.101711 -1.758 0.078757 .
## purposehouse -0.432073 0.164749 -2.623 0.008726 **
## purposemajor_purchase 0.026757 0.109600 0.244 0.807130
## purposemedical -0.428025 0.134763 -3.176 0.001493 **
## purposemoving -0.448911 0.141654 -3.169 0.001529 **
## purposeother -0.473573 0.094202 -5.027 4.98e-07 ***
## purposerenewable_energy -0.640996 0.268155 -2.390 0.016830 *
## purposesmall_business -1.124679 0.099390 -11.316 < 2e-16 ***
## purposevacation -0.287136 0.170535 -1.684 0.092232 .
## purposewedding 0.033955 0.136667 0.248 0.803786
## home_ownershipNONE 8.740098 68.973744 0.127 0.899165
## home_ownershipOTHER -0.352006 0.264271 -1.332 0.182863
## home_ownershipOWN -0.104654 0.056545 -1.851 0.064198 .
## home_ownershipRENT -0.179675 0.032176 -5.584 2.35e-08 ***
## emp_length< 1 year 0.163174 0.053389 3.056 0.002241 **
## emp_length1 year 0.146157 0.059725 2.447 0.014398 *
## emp_length2 years 0.235513 0.054922 4.288 1.80e-05 ***
## emp_length3 years 0.171877 0.055238 3.112 0.001861 **
## emp_length4 years 0.175489 0.058713 2.989 0.002800 **
## emp_length5 years 0.121583 0.058993 2.061 0.039306 *
## emp_length6 years 0.126263 0.068585 1.841 0.065625 .
## emp_length7 years 0.054891 0.073607 0.746 0.455833
## emp_length8 years 0.124219 0.081139 1.531 0.125784
## emp_length9 years 0.221098 0.090289 2.449 0.014334 *
## emp_lengthn/a -0.403023 0.081132 -4.968 6.78e-07 ***
## borrower_score 0.135297 0.112703 1.200 0.229954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 32585 on 39785 degrees of freedom
## Residual deviance: 32144 on 39755 degrees of freedom
## (3 observations deleted due to missingness)
## AIC: 32206
##
## Number of Fisher Scoring iterations: 9
The summary output is similar to the main model output below, so I am going to discuss both outputs together.
Let’s look at the model itself:
logistic_model
##
## Call: glm(formula = loan_status ~ payment_inc_ratio + purpose + home_ownership +
## emp_length + borrower_score, family = "binomial", data = loan_data_LR)
##
## Coefficients:
## (Intercept) payment_inc_ratio
## 2.119427 -0.006445
## purposecredit_card purposedebt_consolidation
## -0.017888 -0.401779
## purposeeducational purposehome_improvement
## -0.580027 -0.178803
## purposehouse purposemajor_purchase
## -0.432073 0.026757
## purposemedical purposemoving
## -0.428025 -0.448911
## purposeother purposerenewable_energy
## -0.473573 -0.640996
## purposesmall_business purposevacation
## -1.124679 -0.287136
## purposewedding home_ownershipNONE
## 0.033955 8.740098
## home_ownershipOTHER home_ownershipOWN
## -0.352006 -0.104654
## home_ownershipRENT emp_length< 1 year
## -0.179675 0.163174
## emp_length1 year emp_length2 years
## 0.146157 0.235513
## emp_length3 years emp_length4 years
## 0.171877 0.175489
## emp_length5 years emp_length6 years
## 0.121583 0.126263
## emp_length7 years emp_length8 years
## 0.054891 0.124219
## emp_length9 years emp_lengthn/a
## 0.221098 -0.403023
## borrower_score
## 0.135297
##
## Degrees of Freedom: 39785 Total (i.e. Null); 39755 Residual
## (3 observations deleted due to missingness)
## Null Deviance: 32580
## Residual Deviance: 32140 AIC: 32210
In the output above, the first thing we see is the call; that is just R reminding us what model we ran and what options we specified.
Next we see the deviance residuals, a measure of model fit. This part of the output summarizes the distribution of the deviance residuals across all the cases used in the model; summaries of the deviance statistics can be used to assess how well the model fits.
The next portion of the output shows the coefficients, their standard errors, the z-statistics and the associated p-values.
In this fit, payment_inc_ratio is only marginally significant (p ≈ 0.07) and borrower_score is not statistically significant (p ≈ 0.23); the strongest effects come from several purpose categories, renting (home_ownershipRENT), and the employment-length categories.
The regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor.
For every one-unit increase in payment_inc_ratio, the log odds of the loan being fully paid (versus charged off) decrease by 0.006445, so the relationship is negative.
For a one-unit increase in borrower_score, the log odds of the loan being fully paid increase by 0.135.
All of the loan purposes have negative coefficients relative to the reference purpose, except wedding and major purchase. That seems a bit odd, because earlier we concluded that wedding loans looked riskier; according to the model, they are not.
The emp_length value n/a (unemployed) has a coefficient of -0.403023, so having no listed employment lowers the log odds of a loan being fully paid, i.e. it increases the risk.
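To read the coefficients as odds ratios rather than log odds, we can simply exponentiate them; a quick sketch (for example, exp(-0.403) ≈ 0.67 for emp_lengthn/a, i.e. the odds of being fully paid are about a third lower for borrowers with no listed employment):
round(exp(coef(logistic_model)), 3)  # odds ratios for every term in the model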
Predicted Values
Let’s look at the predictions from the model logistic_model:
pred <- predict(logistic_model)
summary(pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3589 1.6867 1.7936 1.8253 2.0496 10.5782
Converting these values to probabilities is a simple transform:
prob <- 1/(1 + exp(-pred))
summary(prob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5888 0.8438 0.8574 0.8575 0.8859 1.0000
These are on a scale from 0 to 1 and don’t yet declare whether the predicted value is default or paid off.
These probabilities are the model's estimate that a loan is fully paid (note that the mean, 0.8575, matches the share of fully paid loans), so we could declare any value below 0.5 as a predicted default.
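A minimal sketch of that rule, assuming (as the summary above suggests) that the fitted probability is the probability of the loan being fully paid:
pred_class <- ifelse(prob > 0.5, "Fully Paid", "Charged Off")  # apply the 0.5 cut-off
table(pred_class)                                              # how many loans fall on each side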
Model fit
In the model fitted above, the regression coefficient for purposesmall_business is -1.1247. Relative to the reference purpose (car loans, the omitted category), a small-business loan lowers the odds of being fully paid versus charged off by a factor of exp(-1.1247) ≈ 0.32, i.e. it roughly triples the odds of default.
Clearly, loans for the purpose of creating or expanding a small business are considerably riskier than other types of loans.
The variable borrower_score is a score on the borrowers’ creditworthiness and ranges from 0 (low) to 1 (high).
In this fit the borrower_score coefficient is only 0.135 (and not statistically significant), so moving from the worst score (0) to the best score (1) multiplies the odds of a loan being fully paid by just exp(0.135) ≈ 1.14.
In other words, once the other predictors are taken into account, this model attributes surprisingly little of the default risk to the credit score alone.
Bagging and the Random Forest
Bagging is like the basic ensemble algorithm, except that instead of fitting the various models to the same data, each new model is fit to a bootstrap resample.
Random forests are an ensemble learning method for classification, regression and other tasks that builds a collection of decision trees at training time and outputs the class that is the mode of the individual trees' classes (or, for regression, the mean of their predictions).
The random forest is based on applying bagging to decision trees with one important extension: in addition to sampling the records, the algorithm also samples the variables.
In traditional decision trees, to determine how to create a subpartition of a partition A, the algorithm makes the choice of variable and split point by minimizing a criterion such as Gini impurity. With random forests, at each stage of the algorithm, the choice of variable is limited to a random subset of variables.
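To make the bootstrap idea concrete, here is a small self-contained illustration of a single bootstrap resample, the building block of bagging (a toy example, not tied to the loan data):
set.seed(1)
n <- 10
boot_idx <- sample(n, replace = TRUE)  # draw n row indices with replacement
boot_idx                               # some indices repeat, others never appear
setdiff(1:n, boot_idx)                 # the "out-of-bag" rows for this resample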
Let’s make the new data frame for the Random Forest analysis only. We’ll read the same data file that we’re using since the beginning.
library(dplyr)
loan_RF1 <- data.frame(read.csv("LoanStats2007_11.csv", skip=1))
loan_RF <- select(loan_RF1, 6,8,13,15,17,18,21,27)
loan_RF<-loan_RF[complete.cases(loan_RF),]
dim(loan_RF)
## [1] 39786 8
For the random forest we've kept a small set of predictor columns (numeric and categorical) plus the outcome, selected by position with the select() function from the dplyr library.
We've scrubbed the data again here and kept only complete cases, i.e. rows without missing values.
set.seed(2)
s=sample(1:nrow(loan_RF),15000)
loan_train=loan_RF[s,]
loan_test=loan_RF[-s,]
We've set a seed so the random sample is reproducible.
Using that sample we've split the data into two sets, loan_train (15,000 rows) and loan_test (the remaining 24,786 rows), roughly a 38:62 split.
Dropping the empty level in the factor
loan_train$loan_status <- factor(loan_train$loan_status)
Re-creating the factor drops the unused (empty) level, so randomForest treats the outcome correctly.
library(randomForest)
rf.loan = randomForest(loan_train$loan_status ~ ., data = loan_train)  # you can add do.trace = TRUE to watch progress
rf.loan
##
## Call:
## randomForest(formula = loan_train$loan_status ~ ., data = loan_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 14.55%
## Confusion matrix:
## Charged Off Fully Paid class.error
## Charged Off 5 2173 0.9977043159
## Fully Paid 10 12812 0.0007799095
rf.loan above prints the fitted random forest model.
Let's walk through the output.
The Call line echoes the formula we used; note that we referred to the response as loan_train$loan_status rather than just the column name, and R still handles this.
loan_train$loan_status (the loan_status variable from the loan_train dataset) is the target variable, the field we want to predict.
The fields after the ~ are the predictor variables used to predict the target; the dot means the remaining columns of the data frame are used as predictors.
data = loan_train specifies the data frame containing the variables in the model.
The type is classification because the response is a categorical (here binary) variable.
By default, 500 trees are trained. With the seven predictors used here, the algorithm considers a random subset of two variables at each split, as the output shows.
Number of variables tried at each split is the number of predictor variables considered at each split within a tree. Using only a subset at each split reduces the dominance of a few influential variables and de-correlates the trees.
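For classification, randomForest's default for that number is the square root of the number of predictors (rounded down), which appears to be what was used here; a quick check with our seven predictors:
floor(sqrt(7))  # = 2, matching "No. of variables tried at each split" in the output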
The out-of-bag (OOB) estimate of error is the error rate for the trained models, applied to the data left out of the training set for that tree.
The OOB estimated error is 14.55%, which can be reproduced from the confusion matrix as
(10+2173)/15000
## [1] 0.1455333
- where 15,000 is the size of the training sample s and 10 + 2173 are the misclassified loans in the OOB confusion matrix.
The predicted values can be obtained from the predict function as follows:
rf.loan.pred=predict(rf.loan,newdata = loan_test)
t=table(loan_test$loan_status,rf.loan.pred)
t
## rf.loan.pred
## Charged Off Fully Paid
## 0 0
## Charged Off 3 3489
## Fully Paid 9 21285
From the confusion matrix above, the test-set error rate is the number of misclassified loans divided by the number of test cases:
(3489 + 9) / (3 + 3489 + 9 + 21285)
That works out to roughly 0.141, i.e. an error of about 14.1% for the random forest model on held-out data.
importance(rf.loan)
## MeanDecreaseGini
## term 78.88750
## installment 628.20467
## home_ownership 75.20087
## verification_status 73.62028
## borrower_score 275.31055
## payment_inc_ratio 625.54102
## dti 612.21564
head(getTree(rf.loan, 1))
## left daughter right daughter split var split point status prediction
## 1 2 3 1 2.0000 1 0
## 2 4 5 6 12.0855 1 0
## 3 6 7 3 2.0000 1 0
## 4 8 9 4 6.0000 1 0
## 5 10 11 5 0.3750 1 0
## 6 0 0 0 0.0000 -1 2
The variable importance plot below shows how important each variable was when classifying the data.
The y axis lists the predictor variables; the x axis shows the mean decrease in Gini, a measure of how much each variable contributes to node purity across the trees.
Reducing the number of trees in the forest can cut computation time, often without much loss of accuracy.
However, it is risky to have too few predictor variables, because then the model won't be able to separate the classes accurately.
varImpPlot(rf.loan)

As we can see in the plot above, installment, payment_inc_ratio and dti are the most important variables contributing to the output.
Random forests can also be tuned to improve sensitivity, and since our primary objective was to predict and avoid defaults, sensitivity to the Charged Off class is the key measure for judging this model.
If we select only three columns, dti, payment_inc_ratio and loan_status, the results change noticeably (no wonder, as there are only two predictors left!).
library(dplyr)
x <- data.frame(read.csv("LoanStats2007_11.csv", skip=1))
xRF <- select(x, 17,21,27)
xRF<-xRF[complete.cases(xRF),]
dim(xRF)
## [1] 39786 3
#xRF
Now we have the same 39,786 complete rows, but with just 3 columns.
set.seed(2)
s=sample(1:nrow(xRF),15000)
xtrain=xRF[s,]
xtest=xRF[-s,]
xtrain$loan_status <- factor(xtrain$loan_status)
library(randomForest)
rf.xloan = randomForest(xtrain$loan_status ~ ., data = xtrain)  # you can add do.trace = TRUE to watch progress
rf.xloan
##
## Call:
## randomForest(formula = xtrain$loan_status ~ ., data = xtrain)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 16.28%
## Confusion matrix:
## Charged Off Fully Paid class.error
## Charged Off 54 2124 0.97520661
## Fully Paid 318 12504 0.02480112
In this output, more training loans are now correctly predicted as Charged Off (54 versus 5 before), but more Fully Paid loans are misclassified too (318 versus 10).
The OOB estimate of error also increases, from 14.55% to 16.28%.
rf.xloan.pred=predict(rf.xloan,newdata = xtest)
xt=table(xtest$loan_status,rf.xloan.pred)
xt
## rf.xloan.pred
## Charged Off Fully Paid
## 0 0
## Charged Off 82 3410
## Fully Paid 489 20805
The test-set confusion matrix shows the same pattern: more loans are predicted as Charged Off (82 correct versus 3 before), at the cost of more misclassified Fully Paid loans (489 versus 9).
That is probably the beauty of this simpler model!
importance(rf.xloan)
## MeanDecreaseGini
## payment_inc_ratio 1947.624
## dti 1664.802
The MeanDecreaseGini values are much higher here because only two predictors now share all of the splits, so each one contributes far more to node purity on its own.
head(getTree(rf.xloan, 1))
## left daughter right daughter split var split point status prediction
## 1 2 3 1 0.706160 1 0
## 2 4 5 2 15.030000 1 0
## 3 6 7 1 1.259445 1 0
## 4 8 9 2 6.790000 1 0
## 5 10 11 2 15.825000 1 0
## 6 12 13 2 8.595000 1 0
varImpPlot(rf.xloan)

This gives, you could say, a slightly bigger picture: in the previous graph we saw payment_inc_ratio and dti among the most important variables alongside installment, but here we can see the difference between just the two of them much more clearly. It's like zooming into a picture!
Discussion and Conclusion
Finally, after working on this data for almost two weeks, I feel like I have the answers to my questions.
In this analysis I developed two models, logistic regression and a random forest, to predict whether a borrower will pay back a loan based on past Lending Club data, and to help investors decide what investment approach to take.
Most classification problems in the real world are not balanced, and almost every dataset has NAs / missing values (otherwise they wouldn't build whole functions for handling them!). In this analysis I tried to deal with both missing values and imbalanced data.
After model building, r code, testing, plotting and data analysis, I’ve made the following conclusions:
In ggplot 1: Loans issued over the years, the 2007 to 2011 data shows that most loans were issued in 2011. The company was founded in 2006, so it took almost five years to gain momentum, which also says something about its growing popularity in the market. It could become tough competition for traditional banks.
In ggplot 2: Loan Amount vs Funded Amount, we concluded that in a few cases the approved amount was lower than the amount applied for, and there could be multiple reasons for that. Loans were mostly approved for people with higher borrower scores, but many loans also went to people with lower scores.
In ggplot 3: Installment vs Funded Amount, the interesting fact we discovered was that there is a range of installments for the same loan amount: some people paid higher installments and some paid lower installments for the same amount, and their borrower score could have been one of the reasons.
In many classification analyses, false negatives can cost a lot more than false positives, which is why we may lower the cut-off point to reduce the number of false negatives.
The choice of cut-off value (also known as the discrimination threshold) in logistic regression is a really important factor when investors want to decide which loans to invest in. Even if they decide to invest in a poorly rated borrower, they can raise the installment amount based on this model, and the cut-off value then helps determine who gets a loan.
In ggplot 4: Annual Income vs Funded Amount, we dug into the borrowers' annual income and the amounts funded to them. There were outliers, and these models are obviously not flawless; part of that may be due to bad data, as pointed out in the 'Post Office Example'.
After finding the outliers in ggplot 4, I removed them and created ggplot 5: Annual Income vs. Funded Amount. There were some extreme values in the dataset, and after removing them we saw a roughly linear relationship between annual income and funded amount.
In ggplot 6: Distribution plot: Loan Amount vs Purpose for Loan Status, we saw the distribution of loan amounts by purpose and status. More house loans appear with the status fully paid than charged off. House loans are generally taken out by a family, so this might indicate that the chances of a house loan being fully paid were higher than for other loans. In general, small business loans are larger and riskier loans, but in this case (comparing charged off and fully paid), wedding, debt consolidation and credit card loans look like the riskiest.
In ggplot 7: Employment Status Listed In The Loan Application, we discovered that people with jobs got loans approved far more often than people without jobs. Loans were still approved for some unemployed applicants, and there could be several reasons for that: for example, the person might be a student, or might have just lost a job after working for several years.
From ggplot 10: Loan Status Bar Chart we concluded that more than 10% of the borrowers were defaulters.
From ggplot 11: Home Ownership In The Loan Application, we concluded that most borrowers rented their home, followed by those with a mortgage; few owned their home outright. And it seems obvious that if you have the money to own a house (especially in the USA), you'd have money for other purposes and wouldn't need loans. But that's not always true, as we saw in the post office example!
From the logistic regression we concluded that classification accuracy is a tricky measure under extreme class imbalance, which is a very common situation in the banking sector, and it can lead to wrong conclusions. Take the 'Post Office Example': even if that really is valid data, it would still have to be treated as an outlier. So it is wise to evaluate these situations with an ROC curve, where we can see the true positive rate against the false positive rate.
For a one-unit increase in borrower_score, the log odds of a loan being fully paid increase by 0.135. This is useful because investors can decide what bad rate they are willing to accept in their portfolio and, based on that, easily decide what share of the loans they are willing to finance.
The emp_length 'n/a' (unemployed) coefficient is -0.403, which lowers the log odds of a loan being fully paid, i.e. it increases the risk. That is consistent with the trend in ggplot 7: Employment Status Listed In The Loan Application, where people with jobs got loans approved far more often than people without jobs.
From the model fit, we found that loans for the purpose of creating or expanding a small business are considerably riskier than other types of loans.
From the random forest model we concluded that installment, payment_inc_ratio and dti are the most important variables contributing to the output. There will always be value in any model that can help determine future investment outcomes. LC members who use this model to make investment decisions are by no means guaranteed results, but a model like this will hopefully help them screen the loans and regions they are examining so that they can lend to the right borrowers.
This is an important conclusion because it applies to everyday decisions too: it is easy to make negative calls. In the same way, loan risk models show excellent performance in predicting the majority (fully paid) cases, but much weaker performance in catching the true positives (defaults) because of the extreme class imbalance. This can be improved by over- or under-sampling or by synthetic data generation.
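As a minimal sketch of one of those fixes (not part of the original analysis), we could down-sample the majority class in the training set before refitting the forest; the code below reuses the earlier loan_train data:
set.seed(2)
paid    <- which(loan_train$loan_status == "Fully Paid")
charged <- which(loan_train$loan_status == "Charged Off")
balanced_train <- loan_train[c(charged, sample(paid, length(charged))), ]  # equal class sizes
rf.balanced <- randomForest(loan_status ~ ., data = balanced_train)        # refit on balanced data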
I also want to point out the increasingly popular method of survival analysis, in which the probability of charge-off or default can change over time. That would be very interesting because people change and their habits change: someone may not have been a good borrower in the past but can be in the future. So we need to model that change over time, and we should also add more behavioural variables, as they can help us make a better decision about the investment.