Investigation Overview
In this investigation, I wanted to find out how different loan listing features affect a borrower's APR. I focused on the variables I found most interesting: credit score range, Prosper rating, original loan amount, and monthly loan payment.
Dataset Overview and Executive Summary
The original dataset comes from a personal finance company called Prosper and includes 113,937 rows of customer loan listing data. Because the original dataset has 81 variables, I chose a smaller subset to work with: Term, BorrowerAPR, LoanOriginalAmount, ProsperRating (Alpha), ListingCategory, EmploymentStatus, CreditScoreRange, DebtToIncomeRatio, StatedMonthlyIncome, and MonthlyLoanPayment. Some key insights from my exploration are:
- Borrower APR has a multimodal distribution, with most values falling between 0.05 and 0.4
- Borrower APR is inversely related to Credit Score Range and Prosper Rating
- The relationship between Borrower APR and Loan Original Amount changes as Credit Score Range / Prosper Rating change.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
# load in the dataset into a pandas dataframe
loan_df = pd.read_csv('data/loanDataModified.csv')
Distribution of Borrower APR
In this dataset, most values for Borrower APR fall between 0.05 and 0.4. The distribution is multimodal, with peaks around 0.1, 0.2, 0.3, and 0.35. These peaks suggest that the loans fall into roughly four clusters with similar rates rather than forming a single smooth population.
bins = np.arange(0, loan_df['BorrowerAPR'].max()+0.00625, 0.00625)
plt.figure(figsize=[8, 5])
sns.histplot(data=loan_df, x='BorrowerAPR', bins=bins, edgecolor=None)
plt.title('Distribution of Borrower APR')
plt.xlabel('Borrower APR');
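As a rough cross-check on the peaks read off the histogram, the sketch below finds local maxima in binned counts. The helper name `histogram_peaks`, its `min_count` parameter, and the toy values are all my own assumptions for illustration; none of this comes from the Prosper data.

```python
import numpy as np

def histogram_peaks(values, bin_edges, min_count=1):
    """Return the centers of bins whose count exceeds both neighboring bins.

    Hypothetical helper for illustration; not part of the original analysis.
    """
    counts, edges = np.histogram(values, bins=bin_edges)
    centers = (edges[:-1] + edges[1:]) / 2
    # Pad with zeros so the first and last bins can also qualify as peaks.
    padded = np.concatenate(([0], counts, [0]))
    is_peak = (
        (padded[1:-1] > padded[:-2])
        & (padded[1:-1] > padded[2:])
        & (counts >= min_count)
    )
    return centers[is_peak]

# Made-up toy values (NOT the Prosper data): five APR levels with uneven counts.
toy_apr = np.repeat([0.10, 0.15, 0.20, 0.25, 0.30], [5, 2, 6, 1, 4])
toy_edges = np.linspace(0.075, 0.325, 6)  # five bins, each 0.05 wide

peaks = histogram_peaks(toy_apr, toy_edges)
print(peaks)
```

On real data with bins as narrow as 0.00625, the counts are noisy, so a simple neighbor comparison like this will likely flag many spurious peaks unless the bins are widened or `min_count` is raised.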
Borrower APR by Credit Score Range and Prosper Rating
There is a clear relationship between Borrower APR and both Credit Score Range and Prosper Rating. In both cases, as the borrower's score or rating goes up, their APR goes down. Both features appear to be strongly associated with Borrower APR, likely because they reflect the borrower's risk level. The range of Borrower APR within each Prosper Rating is also much more clearly defined (tighter, with less overlap between adjacent categories) than the ranges for the different Credit Score Ranges.
#assigning numeric vars and categoric vars to lists
numeric_vars = ['Term', 'BorrowerAPR', 'LoanOriginalAmount',
                'DebtToIncomeRatio', 'StatedMonthlyIncome',
                'MonthlyLoanPayment']
categoric_vars = ['CreditScoreRange', 'ProsperRating (Alpha)',
                  'ListingCategory', 'EmploymentStatus']
#setting up categorical variable types
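# assign the result back: calling fillna(inplace=True) on a selected column is
# deprecated in recent pandas and may not update loan_df
loan_df['ProsperRating (Alpha)'] = loan_df['ProsperRating (Alpha)'].fillna('N/A')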
ratings = ['N/A', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA']
credit_labels = ['<300', '300 - 579', '580 - 669', '670 - 739', '740 - 799', '800 - 850']
loan_df['ProsperRating (Alpha)'] = pd.Categorical(loan_df['ProsperRating (Alpha)'], categories=ratings, ordered=True)
loan_df['CreditScoreRange'] = pd.Categorical(loan_df['CreditScoreRange'], categories=credit_labels, ordered=True)
fig, ax = plt.subplots(1, 2, figsize = [12,5])
sns.violinplot(data=loan_df, x='CreditScoreRange', y='BorrowerAPR', ax=ax[0])
ax[0].tick_params(axis='x', labelrotation=90)
ax[0].set_xlabel('Credit Score Range')
ax[0].set_ylabel('Borrower APR')
ax[0].set_title('Borrower APR by Credit Score Range')
sns.violinplot(data=loan_df, x='ProsperRating (Alpha)', y='BorrowerAPR', ax=ax[1])
ax[1].tick_params(axis='x', labelrotation=90)
ax[1].set_xlabel('Prosper Rating')
ax[1].set_ylabel('Borrower APR')
ax[1].set_title('Borrower APR by Prosper Rating');
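To put a number on the inverse relationship seen in the violin plots, one option is to compare the median APR per rating and compute a rank correlation. The sketch below does this on a small made-up table; the frame `toy_loans`, its column names, and its values are assumptions for illustration, not rows from the Prosper data. Spearman's rho is computed here as the Pearson correlation of the ranks, which avoids needing SciPy.

```python
import pandas as pd

ratings = ['HR', 'E', 'D', 'C', 'B', 'A', 'AA']

# Made-up toy rows (NOT the Prosper data), two loans for each of four ratings.
toy_loans = pd.DataFrame({
    'rating': ['HR', 'HR', 'D', 'D', 'B', 'B', 'AA', 'AA'],
    'apr': [0.35, 0.33, 0.27, 0.25, 0.17, 0.15, 0.08, 0.07],
})
toy_loans['rating'] = pd.Categorical(toy_loans['rating'],
                                     categories=ratings, ordered=True)

# Median APR per rating; observed=True drops ratings with no rows.
median_apr = toy_loans.groupby('rating', observed=True)['apr'].median()

# Spearman's rho as the Pearson correlation of the two rank series.
rating_rank = toy_loans['rating'].cat.codes.rank()
apr_rank = toy_loans['apr'].rank()
spearman_rho = rating_rank.corr(apr_rank)

print(median_apr)
print(round(spearman_rho, 3))
```

Because the rating is an ordered categorical, `cat.codes` follows the order HR to AA, so a negative rho means that APR falls as the rating improves.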
Borrower APR by Loan Original Amount for Varying Credit Score Ranges and Prosper Ratings
Adding another dimension to these plots, we can see how Loan Original Amount and Borrower APR interact at each Credit Score Range and Prosper Rating. In both cases, low scores/ratings are associated with high APR and low Loan Original Amounts, high scores/ratings are associated with low APR and high Loan Original Amounts, and mid-range scores/ratings fall in between, showing a more linear relationship between the two. Once again, the plot for Prosper Ratings has much clearer boundaries along the Borrower APR axis, with a new band starting roughly every 0.05. This suggests that Prosper Rating is more tightly tied to Borrower APR than Credit Score Range is.
# filtering out rows with no prosper rating to remove irrelevant data from the plot
filtered_df = loan_df[loan_df['ProsperRating (Alpha)'] != 'N/A'].copy()
filtered_df['ProsperRating (Alpha)'] = pd.Categorical(filtered_df['ProsperRating (Alpha)'], categories = ['HR', 'E', 'D', 'C', 'B', 'A', 'AA'])
fig, ax = plt.subplots(2, 1, figsize = [10,12])
plt.subplots_adjust(hspace=0.4)
# Credit Score Range
sns.scatterplot(data=loan_df, x='BorrowerAPR', y='LoanOriginalAmount', hue='CreditScoreRange', s=15,
                edgecolor=None, palette='viridis_r', alpha=0.5, ax=ax[0])
ax[0].set_xlabel('Borrower APR')
ax[0].set_ylabel('Loan Original Amount')
ax[0].set_title('Borrower APR by Loan Original Amount for Varying Credit Score Ranges')
ax[0].legend(title='Credit Score Range', markerscale=1.5)
# Prosper Rating
sns.scatterplot(data=filtered_df, x='BorrowerAPR', y='LoanOriginalAmount', hue='ProsperRating (Alpha)', s=15,
                edgecolor=None, palette='viridis_r', alpha=0.5, ax=ax[1])
ax[1].set_xlabel('Borrower APR')
ax[1].set_ylabel('Loan Original Amount')
ax[1].set_title('Borrower APR by Loan Original Amount for Varying Prosper Ratings')
ax[1].legend(title='Prosper Rating', markerscale=1.5);
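One way to check whether the amount/APR relationship really changes from one rating to the next is to compute the correlation separately within each group. This is a minimal sketch on made-up numbers; the frame `toy_scatter`, its columns, and its values are assumptions chosen only to show the mechanics, and they say nothing about the actual Prosper figures.

```python
import pandas as pd

# Made-up toy rows (NOT the Prosper data), three loans for each of three ratings.
toy_scatter = pd.DataFrame({
    'rating': ['HR'] * 3 + ['C'] * 3 + ['AA'] * 3,
    'amount': [2000, 3000, 4000, 5000, 10000, 15000, 20000, 25000, 30000],
    'apr': [0.35, 0.34, 0.36, 0.22, 0.20, 0.18, 0.08, 0.07, 0.08],
})
toy_scatter['rating'] = pd.Categorical(toy_scatter['rating'],
                                       categories=['HR', 'C', 'AA'], ordered=True)

# Pearson correlation between loan amount and APR within each rating.
corr_by_rating = (
    toy_scatter.groupby('rating', observed=True)[['amount', 'apr']]
    .apply(lambda g: g['amount'].corr(g['apr']))
)

print(corr_by_rating.round(2))
```

Applied to `filtered_df` with `'ProsperRating (Alpha)'`, `'LoanOriginalAmount'`, and `'BorrowerAPR'`, the same pattern would give one coefficient per rating. A coefficient near -1 or 1 would point to a roughly linear relationship within that rating, and one near 0 to little linear relationship.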