Introduction¶
I have chosen to work with data from a personal finance company called Prosper. This dataset is comprised of 113,937 rows of customer listing data, with different variables describing each listing, such as loan amount, borrower rate (or interest rate), current loan status, and borrower income. I believe this will be an enlightening exploration, as I have little experience working with financial data.
More information on the variables can be found here.
Preliminary Wrangling¶
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
loans = pd.read_csv('data/prosperLoanData.csv')
loans.shape
(113937, 81)
loans.head()
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | ... | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | ... | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | ... | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | ... | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | ... | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | ... | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
5 rows × 81 columns
loan_df = loans[['Term', 'BorrowerAPR', 'LoanOriginalAmount',
'ProsperRating (Alpha)','ListingCategory (numeric)',
'EmploymentStatus','CreditScoreRangeLower',
'CreditScoreRangeUpper', 'DebtToIncomeRatio',
'StatedMonthlyIncome', 'MonthlyLoanPayment']].copy()
loan_df.head()
| Term | BorrowerAPR | LoanOriginalAmount | ProsperRating (Alpha) | ListingCategory (numeric) | EmploymentStatus | CreditScoreRangeLower | CreditScoreRangeUpper | DebtToIncomeRatio | StatedMonthlyIncome | MonthlyLoanPayment | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 0.16516 | 9425 | NaN | 0 | Self-employed | 640.0 | 659.0 | 0.17 | 3083.333333 | 330.43 |
| 1 | 36 | 0.12016 | 10000 | A | 2 | Employed | 680.0 | 699.0 | 0.18 | 6125.000000 | 318.93 |
| 2 | 36 | 0.28269 | 3001 | NaN | 0 | Not available | 480.0 | 499.0 | 0.06 | 2083.333333 | 123.32 |
| 3 | 36 | 0.12528 | 10000 | A | 16 | Employed | 800.0 | 819.0 | 0.15 | 2875.000000 | 321.45 |
| 4 | 36 | 0.24614 | 15000 | D | 2 | Employed | 680.0 | 699.0 | 0.26 | 9583.333333 | 563.97 |
What is the structure of your dataset?¶
The original dataset contains 113,937 rows detailing loan listings from Prosper with 81 different features. I chose a subset of 11 of those features to create a new dataframe, as seen above. I kept Term, BorrowerAPR, LoanOriginalAmount, ProsperRating (Alpha), ListingCategory (numeric), EmploymentStatus, CreditScoreRangeLower, CreditScoreRangeUpper, DebtToIncomeRatio, StatedMonthlyIncome, and MonthlyLoanPayment, as I am most interested in exploring these variables.
What is/are the main feature(s) of interest in your dataset?¶
I am primarily interested in exploring which features are the greatest predictors of BorrowerAPR.
What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶
There are a number of features that will help support my investigation, and I believe the most impactful will be Credit Score, Monthly Loan Payment, and Term.
Univariate Exploration¶
Starting with my primary feature of interest, Borrower APR, I want to take a look at the distribution.
#Look at descriptive statistics for Borrower APR
loan_df['BorrowerAPR'].describe()
count 113912.000000 mean 0.218828 std 0.080364 min 0.006530 25% 0.156290 50% 0.209760 75% 0.283810 max 0.512290 Name: BorrowerAPR, dtype: float64
bins = np.arange(0, loan_df['BorrowerAPR'].max()+0.00625, 0.00625)
plt.figure(figsize=[8, 5])
sns.histplot(data=loan_df, x='BorrowerAPR', bins=bins)
plt.xlabel('Borrower APR');
Borrower APR has a multimodal distribution. There is a peak around 0.1, 0.2, 0.3, and around 0.35. This last peak has a noticeably higher frequency than the others, showing as a steep spike on the plot.
Next, I want to examine borrowers' credit score range. I assume that most users will have at least a 'good' score (mid-600s).
#fill na values with 0
loan_df['CreditScoreRangeLower'].fillna(0, inplace=True)
loan_df['CreditScoreRangeUpper'].fillna(0, inplace=True)
#create new variable to hold midpoint of range
loan_df['CreditScoreRange'] = (loan_df['CreditScoreRangeLower']+loan_df['CreditScoreRangeUpper'])/2
#create bins for each credit range
bins = [300, 580, 670, 740, 800, 851, np.inf]
credit_labels = ['<300', '300 - 579', '580 - 669', '670 - 739', '740 - 799', '800 - 850']
# use pd.cut() to apply the bins to the CreditScoreRange variable
loan_df['CreditScoreRange'] = pd.cut(loan_df['CreditScoreRange'], bins=bins, labels=credit_labels, right=False)
#convert to ordered categorical data type
loan_df['CreditScoreRange'] = pd.Categorical(loan_df['CreditScoreRange'], categories=credit_labels, ordered=True)
loan_df.drop(columns={'CreditScoreRangeLower', 'CreditScoreRangeUpper'}, inplace=True)
plt.figure(figsize=[8, 5])
sns.countplot(data=loan_df, x='CreditScoreRange')
plt.xlabel('Credit Score Range')
plt.xticks(rotation=15);
The trend for credit score range is lower than I had expected. It looks like most borrowers fit into the 580-669 range, which is fair. The next largest group is 300-579, which is poor, and there are noticeably less in the good-excellent ranges.
Let's investigate the distribution of monthly loan payment.
bins = np.arange(0, loan_df['MonthlyLoanPayment'].max()+25, 25)
plt.figure(figsize=[8, 5])
sns.histplot(data=loan_df, x='MonthlyLoanPayment', bins=bins)
plt.xlabel('Monthly Loan Payment ($)');
# result has a long tail in the distribution, so I'm going to use a log scale
log_binsize = 0.0125
bins = 10 ** np.arange(2.4, np.log10(loan_df['MonthlyLoanPayment'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 5])
sns.histplot(data = loan_df, x = 'MonthlyLoanPayment', bins = bins)
plt.xscale('log')
plt.xticks([500, 1e3, 2.5e3], [500, '1k', '2.5k'])
plt.xlabel('Monthly Loan Payment ($)');
Monthly loan payment has a long-tailed distribution, with most borrowers' monthly payments being on the lower end. After transforming the plot with a log scale, the distribution appears to be trimodal. One peak is seen between 0 and 500, a second is right around 500, and a third is just below 1000.
I'm curious as to what the loan term looks like, and if it is also trimodal.
bins = np.arange(10, loan_df['Term'].max()+1, 1)
plt.figure(figsize=[8, 5])
sns.histplot(data=loan_df, x='Term', bins=bins)
plt.xlabel('Loan Term (in months)');
When looking at the distribution of loan term, there are three very distinct groups, one at around 12, one at around 36, and one at around 60. This likely indicates that Prosper offers loans with 1-year, 3-year, and 5-year terms, the most popular of which is the 3-year term.
I've included a number of other interesting variables in my data subset, so let's take a look at the following:
Continuous data types: LoanOriginalAmount, DebtToIncomeRatio, StatedMonthlyIncome
Categorical data types: ProsperRating (Alpha), ListingCategory (numeric), EmploymentStatus
Starting with the continuous features.
#create plots for each variable
fig, ax = plt.subplots(nrows=3, figsize = [8,10])
sns.histplot(data=loan_df, x='LoanOriginalAmount', ax=ax[0])
sns.histplot(data=loan_df, x='DebtToIncomeRatio', ax=ax[1])
sns.histplot(data=loan_df, x='StatedMonthlyIncome', ax=ax[2]);
A pattern is instantly clear from the Loan Original Amount plot. There are large steep peaks at the first thousand counts (1-5 thousand), followed by 10000, 15000, 20000, and 25000. These are likely representative of the more standard or popular loan amounts.
Debt To Income Ratio and Stated Monthly Income have long-tailed distributions, so let's replot with a log scale.
#replot debt to income ratio with log scale
#create bins
log_binsize = 0.05
bins = 10 ** np.arange(-2.4, np.log10(loan_df['DebtToIncomeRatio'].max())+log_binsize, log_binsize)
sns.histplot(data=loan_df, x='DebtToIncomeRatio', bins=bins)
plt.xscale('log')
plt.xticks([0.01, 0.05, 0.1, 0.3, 0.5, 1, 3, 5, 10], [0.01, 0.05, 0.1, 0.3, 0.5, 1, 3, 5, 10])
plt.xlabel('Debt to Income Ratio');
Once plotted with a log scale, we can see this variable a little bit clearer. It looks like most borrowers have a debt-to-income ratio around or less than 0.3 (or 30%).
#replot stated monthly income with log scale
#create bins
log_binsize = 0.05
bins = 10 ** np.arange(2, np.log10(loan_df['StatedMonthlyIncome'].max())+log_binsize, log_binsize)
sns.histplot(data=loan_df, x='StatedMonthlyIncome', bins=bins)
plt.xscale('log')
plt.xticks([100, 500, 1000, 5000, 10000, 50000, 100000], [100, 500, '1k', '5k', '10k', '50k', '100k'])
plt.xlabel('Stated Monthly Income ($)');
Using a log scale, we can see a roughly normal distribution, with a peak right around 5k. This makes sense, as this is right around the average monthly income in the US.
Let's move on to the categorical variables.
#fix datatypes of categorical variables
#first: ProsperRating -- should be in order of the rating with AA being the highest
loan_df['ProsperRating (Alpha)'].fillna('N/A', inplace=True)
ratings = ['N/A', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA']
loan_df['ProsperRating (Alpha)'] = pd.Categorical(loan_df['ProsperRating (Alpha)'], categories=ratings, ordered=True)
#ListingCategory changed to string type
#replaced numeric values with what they represent for easier readability
loan_df['ListingCategory (numeric)'] = loan_df['ListingCategory (numeric)'].astype(str)
category_map = {'0':'Not Available', '1':'Debt Consolidation',
'2':'Home Improvement', '3':'Business', '4':'Personal Loan',
'5':'Student Use', '6':'Auto', '7':'Other', '8':'Baby & Adoption',
'9':'Boat', '10':'Cosmetic Procedure', '11':'Engagement Ring',
'12':'Green Loans', '13':'Household Expenses', '14':'Large Purchases',
'15':'Medical/Dental', '16':'Motorcycle', '17':'RV', '18':'Taxes',
'19':'Vacation', '20':'Wedding Loans'}
loan_df['ListingCategory (numeric)'] = loan_df['ListingCategory (numeric)'].map(category_map)
#Renaming this column, as the listing categories are no longer numeric
loan_df = loan_df.rename(columns={'ListingCategory (numeric)': 'ListingCategory'})
#filling null values in EmploymentStatus to match the corresponding label
loan_df['EmploymentStatus'].fillna('Not Available', inplace=True)
#create plots for each variable
fig, ax = plt.subplots(nrows=3, figsize = [5,15])
sns.countplot(data=loan_df, x='ProsperRating (Alpha)', ax=ax[0])
sns.countplot(data=loan_df, y='ListingCategory', ax=ax[1])
ax[1].tick_params(axis='y', labelsize=8)
sns.countplot(data=loan_df, y='EmploymentStatus', ax=ax[2]);
Omitting the N/A values, the distribution of Prosper Rating is fairly normal, with most borrowers having a 'C' (or middle) rating.
The Listing Categories plot shows a very obvious leader, with most listings being "Debt Consolidation" loans. Upon further investigation on their website, it looks like this is one of Prosper's specialties.
The final plot in this group is that showing the counts of different Employment Statuses. From this, we can see that most borrowers are listed as "Employed" or "Full-Time", which makes sense as these are the types of employment that pose the least risk.
Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶
Borrower APR has a multimodal distribution, with peaks at 0.1, 0.2, 0.3, and 0.35. The final peak is marked by a very sharp jump in frequency.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶
In order to properly view the credit score range of borrowers, I created a new variable using the CreditScoreRangeLower and CreditScoreRangeUpper features. Both this (new) credit score range and the (original) prosper rating features are ordinal, so I had to first convert these to the appropriate ordered categorical data type before plotting.
In both the loan term and loan original amount features, I saw very obvious patterns in the frequency. This told me that there are standards set (perhaps by Prosper) in terms of how much and how long a loan can be taken out for.
Finally, there were a couple of features that I plotted using a log scale in order to better examine. These include monthly loan payment, debt to income ratio, and stated monthly income. Once transformed, I was better able to identify the trends in these plots.
Bivariate Exploration¶
I'll start by creating a heatmap to show the correlation between numeric variables.
#assigning numeric vars and categoric vars to lists that I can use later on
numeric_vars = ['Term', 'BorrowerAPR', 'LoanOriginalAmount',
'DebtToIncomeRatio', 'StatedMonthlyIncome',
'MonthlyLoanPayment']
categoric_vars = ['CreditScoreRange', 'ProsperRating (Alpha)',
'ListingCategory', 'EmploymentStatus']
# correlation plot
plt.figure(figsize = [5, 5])
sns.heatmap(loan_df[numeric_vars].corr(), annot = True, fmt = '.3f',
cmap = 'coolwarm', center = 0, vmin=-1);
In the plot above, we can see mostly mild correlations with the exception of that between monthly loan payment and loan original amount. These variables are positively correlated, with a value of 0.932, which means that as monthly loan payment amounts increase, so will the loan original amount.
We can use a pair grid to examine these relationships further.
g = sns.PairGrid(data = loan_df, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter);
The pair grid confirms what we saw in the previous plot, with there being a strong linear relationship between monthly loan payment and loan original amount. However, these scatterplots allow us to identify a new pattern-- There are a few variables that seem to have non-linear relationships. Looking at debt to income ratio and loan original amount / monthly loan payment, we can see the data follows a non-linear trend.
The plots for Borrower APR vs loan original amount and monthly loan payment do not have a strong linear correlation, but the wedge shape indicated there may be something more there. I plan on investigating this in my multivariate analysis.
Next, I want to start looking at how categorical features relate to our variable of interest, BorrowerAPR.
# plot BorrowerAPR against categorical features
plt.figure(figsize = [10, 10])
g = sns.PairGrid(loan_df, x_vars=categoric_vars, y_vars=['BorrowerAPR'], height=4)
g.map(sns.boxplot)
for ax in g.axes.flat:
ax.tick_params(axis='x', labelrotation=90)
<Figure size 1000x1000 with 0 Axes>
From this pair grid, we can immediately see that something interesting is happening in the first two plots: credit score range and prosper rating. Below, I will use violin plots to further examine them. The last two plots, listing category and employer status, seem to have no strong relationship with borrower APR.
# getting a closer look at BorrowerAPR vs credit score range and prosper rating
plt.figure()
g = sns.PairGrid(loan_df, x_vars=categoric_vars[:2], y_vars=['BorrowerAPR'], height=5)
g.map(sns.violinplot)
for ax in g.axes.flat:
ax.tick_params(axis='x', labelrotation=90)
<Figure size 640x480 with 0 Axes>
There is a very similar pattern found in each of these plots: As the borrower's Credit Score (or Prosper Rating) goes up, their APR goes down. It would appear that these two features (Credit Score Range and Prosper Rating) have a strong impact on Borrower APR, likely because they reflect the borrower's risk level.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶
Looking at the feature of interest-- Borrower APR-- there were two very apparent relationships with Credit Score Range and Prosper Rating. In each case, the 'better' or higher a person's score/rating, the lower their APR is.
While there did not seem to be any relationships betweeen Borrower APR and other numeric variables, the plots for Loan Original Amount and Monthly Loan Payment are wedge-shaped. I wonder if this is due to other features in play, and if the wedge shape is due to trend lines from multiple groups.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶
I found that Monthly Loan Payment and Loan Original Amount are highly correlated, which was expected; the higher a person's loan amount is, the higher their monthly payments will be, and vice versa. Since the Loan Original Amount is usually decided upon before calculating Monthly Loan Payment, I think it is safe to assume in this case that the latter is dependent upon the former.
There also appeared to be a non-linear relationship between Debt to Income Ratio and Monthly Loan Payment/Loan Original Amount. I'm curious if this has anything to do with the wedge-shaped plots mentioned above.
Multivariate Exploration¶
I will now further explore the wedge-shaped plot seen with BorrowerAPR vs LoanOriginalAmount. While another wedge-shape was seen in the plot of BorrowerAPR vs MonthlyLoanPayment, I don't think this addition would be especially useful. I previously established that Monthly Payments are likely dependent upon the Loan Amount, so using LoanOriginalAmount to further my investigation will provide more reliable results. I believe that there may be some underlying trends within the wedge-shape, which will perhaps be revealed through multivariate analysis.
# faceted heatmap plotting BorrowerAPR against *LoanOriginalAmount, broken out by CreditScoreRange
g = sns.FacetGrid(loan_df, col='CreditScoreRange', col_wrap=3)
g.map_dataframe(sns.histplot, x='BorrowerAPR', y='LoanOriginalAmount', bins=15,cmap='inferno_r');
I started here with a faceted heatmap, breaking out Borrower APR vs Loan Amounts/Monthly Payments by Credit Score Range. With these plots, there are some trends that I can see: there is generally a high APR and low Loan Amount/Monthly Payment for the lower Credit Score Ranges, and vice versa for the highest Credit Score Ranges. Mid-range is where a more linear relationship appears, with APR increasing as the Loan Amount/Monthly Payment does.
Next, I want to take a look at what different third variables affect in a graph of borrower APR vs Loan Amount.
# filtering out rows with no prosper rating to remove irrelevant data from the plot
filtered_df = loan_df[loan_df['ProsperRating (Alpha)'] != 'N/A'].copy()
filtered_df['ProsperRating (Alpha)'] = pd.Categorical(filtered_df['ProsperRating (Alpha)'], categories = ['HR', 'E', 'D', 'C', 'B', 'A', 'AA'])
# subplots to examine APR vs Loan Amount, using varying third features
fig, ax = plt.subplots(3, 1, figsize = [8,15])
# Credit Score Range
sns.scatterplot(loan_df, x='BorrowerAPR', y='LoanOriginalAmount', hue='CreditScoreRange', s=15,
edgecolor=None, palette='viridis_r', alpha=0.5, ax=ax[0])
# Prosper Rating
sns.scatterplot(filtered_df, x='BorrowerAPR', y='LoanOriginalAmount', hue='ProsperRating (Alpha)', s=15,
edgecolor=None, palette='viridis_r', alpha=0.5, ax=ax[1])
# Debt to Income Ratio
sns.scatterplot(loan_df, x='BorrowerAPR', y='LoanOriginalAmount', hue='DebtToIncomeRatio', s=15,
edgecolor=None, palette='viridis_r', alpha=0.5, ax=ax[2]);
These resulting plots reveal some very interesting relationships. Starting at the top, the plots with Credit Score Range and Prosper Rating appear very similar. Both show a lower Credit Score Range/Prosper Rating as being associated with high APR and a low Loan Original Amount, and higher 'score's associated with low APR and high Loan Original Amounts. However, something interesting is happening in the ProsperRating plot: Here, we can see very clear cutoffs for the groups, at about every 0.05 along the Borrower APR (x) axis. This perhaps suggests that Prosper assigns set ranges of APR based on the user's given Prosper Rating.
Looking at the last plot, the addition of Debt to Income Ratio as a third variable has not helped to identify any new relationships.
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶
In order to learn more about the relationship between Borrower APR and Loan Amounts, I looked at the impact of using Credit Score Range, Prosper Rating, and Debt to Income Ratio as a third variable. I was particularly interested in seeing if these features would provide further insights about the wedge-shaped plots seen in the previous section. My multivariate exploration showed that this was indeed the case. While it is diffult to identify any one trend from the Borrower APR vs Loan Amounts, the addition of Credit Score Range and Prosper Rating (individually) reveals a relationship; Users with a lower credit score or Prosper Rating are likely to see low Loan Original Amounts and high Borrower APR, while those with high scores see the opposite. Mid-range scores are where a more linear relationship emerges, with increasing Loan Original Amounts resulting in increased APR. My investigation proved the effect of Debt to Income Ratio on Borrower APR vs Loan Original Amount to be minimal.
Were there any interesting or surprising interactions between features?¶
I suppose I wasn't expecting the Prosper Rating plot to have such clear boundaries between Borrower APR for each rating. This led me to believe that the Prosper Ratings are highly impactful in determining Borrower APR, as the company may use them to determine a specific range of APR for the user. I was also surprised at the lack of patterns present in the plot with Debt to Income Ratio; My interpretation of this plot is that most users on the site have a low ratio, making it irrelevant in determining the most impactful features.
# save modified dataset to data folder
loan_df.to_csv('data/loanDataModified.csv', index=False)
Conclusions¶
In this exploration, I wanted to investigate how different features affect Borrower APR. One of the first things I did was transform the original upper and lower credit score range limits into a categorical variable titled 'CreditScoreRange'. This allowed me to sort through the data much easier, according to a more standard model of credit score.
I found that Credit Score Range and Prosper Rating had the largest impact, where higher scores are associated with a lower APR and vice versa. When I looked at a plot of Borrower APR vs Loan Original Amount, I couldn't identify a clear relationship. The wedge-shaped distribution of points led me to believe that further information was needed, so I created new plots with Credit Score Range and Prosper Rating as a third variable. This revealed that individual trends exist within the Borrower APR vs Loan Original Amount plot, according to the user's scores.
Aside from the primary feature of interest, there was one relationship that stood out: I found that Loan Original Amount and Monthly Loan Payment have a high positive correlation. This was not too surprising, as total loan amount is typically used to calculate monthly payments.