The COVID-19 Pandemic is one of the defining moments of the 21st century, upending most people's day-to-day lives for nearly three years, with lasting consequences towards society as a whole. Even as the federal government moves to end the public health emergency this May, there seems to have been little reflection on the four trillion dollars that the federal government has spent on various programs to try to minimize the impact of the pandemic.
The Paycheck Protection Program (PPP) was one of the largest pandemic releif programs, with over 10 million loans totaling over $800 billion.
PPP loans were meant to help carry businesses that wouldn't have stayed open otherwise through the pandemic, with the loans being forgiven if the business met certain criteria. However, the program was riddled with fraud, with an estimated $64 Billion in loans going to businesses that were ineligible for the program, or that were not in fact businesses at all. Additionally, PPP loans were distributed in a highley unequal manner, leaving many minority-owned businesses outmaneuvered by larger, more established businesses.
Despite the programs challenges, PPP loans have saved millions of jobs, arguably leaving America in a better economic position than many other developed nations.
The PPP program was a massive undertaking, and the sheer volume of data involved makes it difficult to analyze without using machine learning. Using clustering, we can group PPP loans into categories, finding patterns in the data that could signify fraudulent activity that wouldn't traditionally appear as fraud. Alternatively, we could use regression as to find out how much underrepresented groups were left out of the PPP program.
The dataset that the government released due to a FOIA request is a 13-part CSV file containing information on all PPP loans over 150 thousand dollars. It's somewhat terrifying. All can be found using SBA's website.
The dataset contains the following columns, with a description of each column provided by the SBA:
Field Name | Field Description |
---|---|
LoanNumber | Loan Number (unique identifier) |
DateApproved | Loan Funded Date |
SBAOfficeCode | SBA Origination Office Code |
ProcessingMethod | Loan Delivery Method (PPP for first draw; PPS for second draw) |
BorrowerName | Borrower Name |
BorrowerAddress | Borrower Street Address |
BorrowerCity | Borrower City |
BorrowerState | Borrower State |
BorrowerZip | Borrower Zip Code |
LoanStatusDate | Loan Status Date - Loan Status Date is blank when the loan is disbursed but not Paid In Full or Charged Off |
LoanStatus | Loan Status Description - Loan Status is replaced by 'Exemption 4' when the loan is disbursed but not Paid in Full or Charged Off |
Term | Loan Maturity in Months |
SBAGuarantyPercentage | SBA Guaranty Percentage |
InitialApprovalAmount | Loan Approval Amount(at origination) |
CurrentApprovalAmount | Loan Approval Amount (current) |
UndisbursedAmount | Undisbursed Amount |
FranchiseName | Franchise Name |
ServicingLenderLocationID | Lender Location ID (unique identifier) |
ServicingLenderName | Servicing Lender Name |
ServicingLenderAddress | Servicing Lender Street Address |
ServicingLenderCity | Servicing Lender City |
ServicingLenderState | Servicing Lender State |
ServicingLenderZip | Servicing Lender Zip Code |
RuralUrbanIndicator | Rural or Urban Indicator (R/U) |
HubzoneIndicator | Hubzone Indicator (Y/N) |
LMIIndicator | LMI Indicator (Y/N) |
BusinessAgeDescription | Business Age Description |
ProjectCity | Project City |
ProjectCountyName | Project County Name |
ProjectState | Project State |
ProjectZip | Project Zip Code |
CD | Project Congressional District |
JobsReported | Number of Employees |
NAICSCode | NAICS 6 digit code |
Race | Borrower Race Description |
Ethnicity | Borrower Ethnicity Description |
UTILITIES_PROCEED | Note: Proceed data is lender reported at origination. On the PPP application the proceeds fields were check boxes. |
PAYROLL_PROCEED | |
MORTGAGE_INTEREST_PROCEED | |
RENT_PROCEED | |
REFINANCE_EIDL_PROCEED | |
HEALTH_CARE_PROCEED | |
DEBT_INTEREST_PROCEED | |
BusinessType | Business Type Description |
OriginatingLenderLocationID | Originating Lender ID (unique identifier) |
OriginatingLender | Originating Lender Name |
OriginatingLenderCity | Originating Lender City |
OriginatingLenderState | Originating Lender State |
Gender | Gender Indicator |
Veteran | Veteran Indicator |
NonProfit | 'Yes' if Business Type = Non-Profit Organization or Non-Profit Childcare Center or 501(c) Non Profit |
ForgivenessAmount | Forgiveness Amount |
ForgivenessDate | Forgiveness Paid Date |
The data itself needs some work to be fully useable. Some columns are a mix of numeric and string data, and many are simply missing data. Despite this, there are some interesting things to be found in the data.
If building a regression model, it would be interesting to use gender, veteran status, race, and non-profit status as predictors for the amount of money a business asked for, received, and was forgiven. While it's harder to find the potential datapoints for clustering, I would imagine that the lender could play a big role.
import pandas as pd
#importing the first of 13 datasets. all the datasets are in the same website as the first.
df_0 = pd.read_csv('https://data.sba.gov/dataset/8aa276e2-6cab-4f86-aca4-a7dde42adf24/resource/2b55e11d-7e75-4bbb-b526-69a06c0c4731/download/public_150k_plus_230101.csv')
df_0.head(5)
LoanNumber | DateApproved | SBAOfficeCode | ProcessingMethod | BorrowerName | BorrowerAddress | BorrowerCity | BorrowerState | BorrowerZip | LoanStatusDate | ... | BusinessType | OriginatingLenderLocationID | OriginatingLender | OriginatingLenderCity | OriginatingLenderState | Gender | Veteran | NonProfit | ForgivenessAmount | ForgivenessDate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9547507704 | 05/01/2020 | 464 | PPP | SUMTER COATINGS, INC. | 2410 Highway 15 South | Sumter | NaN | 29150-9662 | 12/18/2020 | ... | Corporation | 19248 | Synovus Bank | COLUMBUS | GA | Unanswered | Unanswered | NaN | 773553.37 | 11/20/2020 |
1 | 9777677704 | 05/01/2020 | 464 | PPP | PLEASANT PLACES, INC. | 7684 Southrail Road | North Charleston | NaN | 29420-9000 | 09/28/2021 | ... | Sole Proprietorship | 19248 | Synovus Bank | COLUMBUS | GA | Male Owned | Non-Veteran | NaN | 746336.24 | 08/12/2021 |
2 | 5791407702 | 05/01/2020 | 1013 | PPP | BOYER CHILDREN'S CLINIC | 1850 BOYER AVE E | SEATTLE | NaN | 98112-2922 | 03/17/2021 | ... | Non-Profit Organization | 9551 | Bank of America, National Association | CHARLOTTE | NC | Unanswered | Unanswered | Y | 696677.49 | 02/10/2021 |
3 | 6223567700 | 05/01/2020 | 920 | PPP | KIRTLEY CONSTRUCTION INC | 1661 MARTIN RANCH RD | SAN BERNARDINO | NaN | 92407-1740 | 10/16/2021 | ... | Corporation | 9551 | Bank of America, National Association | CHARLOTTE | NC | Male Owned | Non-Veteran | NaN | 395264.11 | 09/10/2021 |
4 | 9662437702 | 05/01/2020 | 101 | PPP | AERO BOX LLC | NaN | NaN | NaN | NaN | 08/17/2021 | ... | NaN | 57328 | The Huntington National Bank | COLUMBUS | OH | Unanswered | Unanswered | NaN | 370819.35 | 04/08/2021 |
5 rows × 53 columns