Predicting Loan Default Rates using Machine Learning

Project Proposal: DS 2500 Spring 2023 Section 3
Rohan Chopra

Description/Motivation

This proposal addresses loan default, a major problem for banks and other financial institutions. For borrowers, defaulting on a loan can have severe financial consequences, including damage to their credit scores, additional fees and interest charges, and potential legal action. For financial institutions, loan defaults can lead to decreased profitability, instability, and an increased likelihood of insolvency. The high rates of loan default in the US, as evidenced by the Federal Reserve report, underscore the need for effective risk management by financial institutions. According to the report, the overall student loan default rate in the US was 9.7% in 2019, and the mortgage delinquency rate was 4.9% in the second quarter of 2020. The motivation for this project is to develop a reliable and accurate model that predicts the likelihood of loan default. By predicting which loans are more likely to default, financial institutions can take proactive measures to mitigate their risk, such as adjusting interest rates or denying loans to high-risk borrowers.

Citations

  • Household Debt and Credit Report
  • Deutsche Bank sees U.S. leveraged loan defaults near record highs in 2024
  • Leveraged Loan Default Volume In The U.S. Has Tripled This Year
  • Auto loan delinquencies are rising. Here’s what to do if you’re struggling with payments
  • Defaults on US junk loans expected to climb as rate rises squeeze earnings
  • What Happens If You Default on a Loan?
  • What Does it Mean to Default on a Loan? What Happens When You Default?
  • Default: What It Means, What Happens When You Default, Examples
  • Student Loan Delinquency and Default

Relevant Academic Articles

  • Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4), 589-609.
    This paper introduces the Altman Z-Score, a statistical tool for predicting the likelihood of a firm going bankrupt based on financial ratios. While the paper focuses on corporate bankruptcy, the approach could be adapted for predicting loan default rates.

  • Davis, L., & Tschirhart, J. (1981). Predicting bank failures. Federal Reserve Bank of Richmond Economic Review, 67(3), 3-11.
    This paper explores the use of logistic regression models for predicting bank failures based on financial variables. The approach could be applied to predict loan default rates.

  • Kim, J. H., & Lee, J. H. (2016). Loan delinquency prediction using Bayesian networks in peer-to-peer lending. Expert Systems with Applications, 51.
    This paper uses Bayesian networks to predict loan delinquency in peer-to-peer lending. While the study focuses on a specific type of lending, the approach could be applied to other contexts.

  • Xie, Q., & Zhang, B. (2019). A novel deep learning model for credit risk prediction. Journal of Intelligent & Fuzzy Systems, 36(4), 3403-3414.
    This paper proposes a deep learning model for credit risk prediction, which could be adapted for predicting loan default rates. The model is based on a convolutional neural network architecture.

  • Chen, J., Zhang, B., & Xie, Q. (2020). Predicting credit default risk with deep learning. Journal of Intelligent & Fuzzy Systems, 39(2), 1527-1536.
    This paper also proposes a deep learning approach for predicting credit default risk, which could be adapted for predicting loan default rates. The model is based on a long short-term memory (LSTM) network.

Dataset

The project will use the "Loan Prediction Problem Dataset" from Kaggle, which contains historical data on loan applications from a financial institution. The dataset has 13 variables, including the borrower's credit history, loan amount, income, and other relevant factors. The target variable is "Loan_Status", which indicates whether the loan was approved or not. The dataset includes important variables known to be correlated with loan default rates and is comprehensive enough to provide insight into the borrower's financial situation. As the preview below shows, several columns contain missing values (NaN), so the data will need cleaning before it is used for machine learning. It can then be used to build models that predict the likelihood of a borrower defaulting on a loan, which can help financial institutions manage risk and make more informed decisions about loan approvals.

Data Dictionary

| Header | Description |
|---|---|
| Loan_ID | Unique identifier for each loan application |
| Gender | Gender of the borrower (Male/Female) |
| Married | Marital status of the borrower (Yes/No) |
| Dependents | Number of dependents of the borrower (0/1/2/3+) |
| Education | Education level of the borrower (Graduate/Not Graduate) |
| Self_Employed | Whether the borrower is self-employed (Yes/No) |
| ApplicantIncome | Income of the borrower |
| CoapplicantIncome | Income of the co-applicant |
| LoanAmount | Loan amount requested by the borrower |
| Loan_Amount_Term | Term of the loan in months |
| Credit_History | Credit history of the borrower (1 = Good, 0 = Bad) |
| Property_Area | Property location of the borrower (Urban/Semiurban/Rural) |
| Loan_Status | Whether the loan was approved (Y/N) |

Abbreviated Dataset

In [1]:
import pandas as pd

# Load the CSV file into a pandas dataframe
df = pd.read_csv('/Users/rohanchopra/Desktop/DS2500/Project Proposal/Loan Prediction Problem Dataset.csv')

# Show the first 21 rows of the dataframe (indices 0 through 20)
display(df.head(21))
| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
| 5 | LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Urban | Y |
| 6 | LP001013 | Male | Yes | 0 | Not Graduate | No | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | Urban | Y |
| 7 | LP001014 | Male | Yes | 3+ | Graduate | No | 3036 | 2504.0 | 158.0 | 360.0 | 0.0 | Semiurban | N |
| 8 | LP001018 | Male | Yes | 2 | Graduate | No | 4006 | 1526.0 | 168.0 | 360.0 | 1.0 | Urban | Y |
| 9 | LP001020 | Male | Yes | 1 | Graduate | No | 12841 | 10968.0 | 349.0 | 360.0 | 1.0 | Semiurban | N |
| 10 | LP001024 | Male | Yes | 2 | Graduate | No | 3200 | 700.0 | 70.0 | 360.0 | 1.0 | Urban | Y |
| 11 | LP001027 | Male | Yes | 2 | Graduate | NaN | 2500 | 1840.0 | 109.0 | 360.0 | 1.0 | Urban | Y |
| 12 | LP001028 | Male | Yes | 2 | Graduate | No | 3073 | 8106.0 | 200.0 | 360.0 | 1.0 | Urban | Y |
| 13 | LP001029 | Male | No | 0 | Graduate | No | 1853 | 2840.0 | 114.0 | 360.0 | 1.0 | Rural | N |
| 14 | LP001030 | Male | Yes | 2 | Graduate | No | 1299 | 1086.0 | 17.0 | 120.0 | 1.0 | Urban | Y |
| 15 | LP001032 | Male | No | 0 | Graduate | No | 4950 | 0.0 | 125.0 | 360.0 | 1.0 | Urban | Y |
| 16 | LP001034 | Male | No | 1 | Not Graduate | No | 3596 | 0.0 | 100.0 | 240.0 | NaN | Urban | Y |
| 17 | LP001036 | Female | No | 0 | Graduate | No | 3510 | 0.0 | 76.0 | 360.0 | 0.0 | Urban | N |
| 18 | LP001038 | Male | Yes | 0 | Not Graduate | No | 4887 | 0.0 | 133.0 | 360.0 | 1.0 | Rural | N |
| 19 | LP001041 | Male | Yes | 0 | Graduate | NaN | 2600 | 3500.0 | 115.0 | NaN | 1.0 | Urban | Y |
| 20 | LP001043 | Male | Yes | 0 | Not Graduate | No | 7660 | 0.0 | 104.0 | 360.0 | 0.0 | Urban | N |
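
The preview above contains NaN entries in LoanAmount, Self_Employed, Loan_Amount_Term, and Credit_History, so a missing-value check is a natural first cleaning step. The sketch below is one possible approach rather than the project's settled method: it rebuilds five rows from the preview, counts missing values per column, and fills the gaps. The name `sample` and the median/mode imputation rule are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Five rows copied from the preview above (indices 0, 1, 11, 16, 19),
# restricted to the columns where missing values appear.
sample = pd.DataFrame(
    {
        "Loan_ID": ["LP001002", "LP001003", "LP001027", "LP001034", "LP001041"],
        "Self_Employed": ["No", "No", np.nan, "No", np.nan],
        "LoanAmount": [np.nan, 128.0, 109.0, 100.0, 115.0],
        "Loan_Amount_Term": [360.0, 360.0, 360.0, 240.0, np.nan],
        "Credit_History": [1.0, 1.0, 1.0, np.nan, 1.0],
    }
)

# Count missing values in each column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Assumed imputation rule: median for numeric columns, mode for text columns.
cleaned = sample.copy()
for column in cleaned.columns:
    if pd.api.types.is_numeric_dtype(cleaned[column]):
        cleaned[column] = cleaned[column].fillna(cleaned[column].median())
    else:
        cleaned[column] = cleaned[column].fillna(cleaned[column].mode().iloc[0])

print(cleaned)
```

On the full dataset the same loop would run over `df`; whether median and mode are appropriate for every column (Credit_History in particular) is a judgment to revisit during exploratory analysis.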

Methodologies and their potential problems

First, exploratory data analysis will be performed to identify relationships between variables and any patterns or trends in the data. Next, the data will be preprocessed through feature selection, data cleaning, and data transformation. Finally, several machine learning models, including logistic regression, decision trees, and random forests, will be built to predict loan default rates. Model performance will be evaluated using appropriate metrics such as accuracy, precision, recall, and F1-score. Here is how each of the three machine learning methods could be implemented:

Logistic Regression
The goal is to predict whether a borrower will default on a loan, which is a binary outcome. Logistic regression can model the relationship between the likelihood of default and the borrower's credit history, loan amount, employment status, and other relevant factors. The model would be trained on the historical loan applications in this dataset and would learn which factors contribute most to loan defaults. Once trained, it can be used to predict the likelihood of default for new loan applications: given a set of predictor variables for a new application, the model outputs a probability of default, which financial institutions can use to manage risk and make informed lending decisions. Additionally, logistic regression provides a coefficient for each predictor variable, which indicates the direction and strength of that variable's effect on the log-odds of default. This can provide insight into the factors that matter most for predicting loan default.
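
As a minimal sketch of this workflow, the block below fits a logistic regression with scikit-learn on the 17 preview rows that have no missing values. Because the dataset's label is Loan_Status rather than an observed default, the sketch predicts Loan_Status. The choice of four numeric features, the scaling step, and the `new_application` values are assumptions made for illustration, and 17 rows are far too few to draw conclusions from.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

feature_names = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Credit_History"]

# The 17 complete rows from the dataset preview:
# (ApplicantIncome, CoapplicantIncome, LoanAmount, Credit_History, Loan_Status)
rows = [
    (4583, 1508.0, 128.0, 1.0, "N"), (3000, 0.0, 66.0, 1.0, "Y"),
    (2583, 2358.0, 120.0, 1.0, "Y"), (6000, 0.0, 141.0, 1.0, "Y"),
    (5417, 4196.0, 267.0, 1.0, "Y"), (2333, 1516.0, 95.0, 1.0, "Y"),
    (3036, 2504.0, 158.0, 0.0, "N"), (4006, 1526.0, 168.0, 1.0, "Y"),
    (12841, 10968.0, 349.0, 1.0, "N"), (3200, 700.0, 70.0, 1.0, "Y"),
    (3073, 8106.0, 200.0, 1.0, "Y"), (1853, 2840.0, 114.0, 1.0, "N"),
    (1299, 1086.0, 17.0, 1.0, "Y"), (4950, 0.0, 125.0, 1.0, "Y"),
    (3510, 0.0, 76.0, 0.0, "N"), (4887, 0.0, 133.0, 1.0, "N"),
    (7660, 0.0, 104.0, 0.0, "N"),
]

X = np.array([row[:4] for row in rows], dtype=float)
y = np.array([1 if row[4] == "Y" else 0 for row in rows])  # 1 = Loan_Status "Y"

# Standardize the features so the coefficients are on a comparable scale.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Hypothetical new application (values invented for illustration only).
new_application = np.array([[4000, 1500.0, 120.0, 1.0]])
prob_y = model.predict_proba(new_application)[0, 1]
print(f"Estimated probability of Loan_Status = Y: {prob_y:.2f}")

# Each coefficient's sign and size describe that feature's effect on the log-odds.
coefficients = model.named_steps["logisticregression"].coef_[0]
for name, coef in zip(feature_names, coefficients):
    print(f"{name}: {coef:+.3f}")
```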

Decision Trees
Decision trees could be used to identify the features most strongly associated with loan defaults. For example, a tree might split the data on the borrower's credit history, income, or loan amount. Once the most important features have been identified, the tree can be used to predict the likelihood of loan default from the values of those features. Decision trees are easy to interpret and visualize, which is useful for identifying patterns in the data. However, they are prone to overfitting: the tree becomes too complex and fits the training data too closely, which leads to poor generalization to new data.
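
The overfitting risk can be illustrated with a small sketch, again using scikit-learn on the 17 complete preview rows with Loan_Status as the label. The feature subset and the depth cap of 2 are assumptions chosen for illustration. An unconstrained tree keeps splitting until it memorizes the training rows, while a depth-limited tree stays small enough to read.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Credit_History"]

# The 17 complete rows from the dataset preview:
# (ApplicantIncome, CoapplicantIncome, LoanAmount, Credit_History, Loan_Status)
rows = [
    (4583, 1508.0, 128.0, 1.0, "N"), (3000, 0.0, 66.0, 1.0, "Y"),
    (2583, 2358.0, 120.0, 1.0, "Y"), (6000, 0.0, 141.0, 1.0, "Y"),
    (5417, 4196.0, 267.0, 1.0, "Y"), (2333, 1516.0, 95.0, 1.0, "Y"),
    (3036, 2504.0, 158.0, 0.0, "N"), (4006, 1526.0, 168.0, 1.0, "Y"),
    (12841, 10968.0, 349.0, 1.0, "N"), (3200, 700.0, 70.0, 1.0, "Y"),
    (3073, 8106.0, 200.0, 1.0, "Y"), (1853, 2840.0, 114.0, 1.0, "N"),
    (1299, 1086.0, 17.0, 1.0, "Y"), (4950, 0.0, 125.0, 1.0, "Y"),
    (3510, 0.0, 76.0, 0.0, "N"), (4887, 0.0, 133.0, 1.0, "N"),
    (7660, 0.0, 104.0, 0.0, "N"),
]

X = np.array([row[:4] for row in rows], dtype=float)
y = np.array([1 if row[4] == "Y" else 0 for row in rows])  # 1 = Loan_Status "Y"

# Unconstrained tree: keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# Depth-limited tree: an assumed cap of 2 levels to curb overfitting.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print("Leaves (unconstrained):", full_tree.get_n_leaves())
print("Leaves (max_depth=2):  ", shallow_tree.get_n_leaves())
print("Training accuracy (unconstrained):", full_tree.score(X, y))
print("Training accuracy (max_depth=2):  ", shallow_tree.score(X, y))

# Print the shallow tree's decision rules as readable text.
print(export_text(shallow_tree, feature_names=feature_names))
```

Training accuracy alone says nothing about generalization, so in the project itself the two trees would be compared on held-out data.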

Random Forest
Random Forest can be used to identify the features that contribute most to the likelihood of loan default. The algorithm builds many decision trees, each trained on a different bootstrap sample of the original data and each considering only a random subset of features at every split. This introduces variability among the trees, and the final prediction is made by aggregating the predictions of all the trees (a majority vote for classification). The algorithm also provides a measure of variable importance, which can be used to identify the most critical factors behind loan default. Financial institutions can use this information to make better decisions about which loan applications to approve and how to manage risk more effectively.
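
A sketch of how this might look with scikit-learn follows, using the 17 complete preview rows and Loan_Status as the label. The number of trees, the feature subset, and the 3-fold cross-validation are assumptions for illustration. The block prints the impurity-based feature importances and then scores out-of-fold predictions with accuracy, precision, recall, and F1-score. With so few rows the numbers are not meaningful, so the sketch only demonstrates the mechanics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict

feature_names = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Credit_History"]

# The 17 complete rows from the dataset preview:
# (ApplicantIncome, CoapplicantIncome, LoanAmount, Credit_History, Loan_Status)
rows = [
    (4583, 1508.0, 128.0, 1.0, "N"), (3000, 0.0, 66.0, 1.0, "Y"),
    (2583, 2358.0, 120.0, 1.0, "Y"), (6000, 0.0, 141.0, 1.0, "Y"),
    (5417, 4196.0, 267.0, 1.0, "Y"), (2333, 1516.0, 95.0, 1.0, "Y"),
    (3036, 2504.0, 158.0, 0.0, "N"), (4006, 1526.0, 168.0, 1.0, "Y"),
    (12841, 10968.0, 349.0, 1.0, "N"), (3200, 700.0, 70.0, 1.0, "Y"),
    (3073, 8106.0, 200.0, 1.0, "Y"), (1853, 2840.0, 114.0, 1.0, "N"),
    (1299, 1086.0, 17.0, 1.0, "Y"), (4950, 0.0, 125.0, 1.0, "Y"),
    (3510, 0.0, 76.0, 0.0, "N"), (4887, 0.0, 133.0, 1.0, "N"),
    (7660, 0.0, 104.0, 0.0, "N"),
]

X = np.array([row[:4] for row in rows], dtype=float)
y = np.array([1 if row[4] == "Y" else 0 for row in rows])  # 1 = Loan_Status "Y"

# Each of the 200 trees is trained on its own bootstrap sample of the rows.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1 across features.
importances = forest.feature_importances_
for name, value in sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True):
    print(f"{name}: {value:.3f}")

# Out-of-fold predictions: each row is predicted by a forest that did not train on it.
predicted = cross_val_predict(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=3
)
scores = {
    "accuracy": accuracy_score(y, predicted),
    "precision": precision_score(y, predicted, zero_division=0),
    "recall": recall_score(y, predicted, zero_division=0),
    "f1": f1_score(y, predicted, zero_division=0),
}
for metric, value in scores.items():
    print(f"{metric}: {value:.2f}")
```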

Expected Outcomes

The expected outcome of this project is a reliable and accurate model for predicting loan default that financial institutions can use to improve their risk management strategies. The project also aims to identify the key factors that contribute to loan defaults and to provide insights that can be used to mitigate the risks associated with lending. By applying machine learning techniques, this project could help lenders reduce loan defaults in the US and improve the financial stability of both borrowers and financial institutions.