March Madness Data

1. Description

March Madness is one of the most exciting times of the year for sports fans: 64 of the best college basketball teams compete for the national title. It's also a lot of fun to fill out a bracket and try to predict the outcome of every game in the tournament. Fans can enter various contests for a chance to win millions of dollars with their bracket, and Warren Buffett has in the past offered 1 billion dollars to anyone who submits a perfect one. Unfortunately, no one has come close: if every game is treated as a coin flip, the odds of a perfect bracket are about 1 in 9.2 quintillion. Of the millions of people who try each year, the closest anyone has come was Gregg Nigl of Columbus, Ohio, in 2019. He predicted the first 49 games correctly, but his streak ended in the Sweet 16. Data science and machine learning could give bracket makers meaningful insights, and might even help someone win a lot of money.

Source - Has anyone ever had a perfect bracket for March Madness?
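
The 1 in 9.2 quintillion figure quoted above matches the number of ways to fill out a 64-team bracket if each of its 63 games is treated as an independent coin flip. This is a quick arithmetic check, not a claim about real odds (the name `GAMES_IN_BRACKET` is just for illustration); picking with any basketball knowledge should do better than random guessing.

```python
# A 64-team single-elimination bracket has 32 + 16 + 8 + 4 + 2 + 1 games.
GAMES_IN_BRACKET = 32 + 16 + 8 + 4 + 2 + 1

# Each game has two possible winners, so there are 2**63 distinct brackets.
possible_brackets = 2 ** GAMES_IN_BRACKET

print(f"{GAMES_IN_BRACKET} games, {possible_brackets:,} possible brackets")
```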

2. Data

Kaggle link (Tournament Team Data.csv)

In [3]:

import pandas as pd

# load the Kaggle tournament team data into a dataframe
df = pd.read_csv('tournament_team_data.csv', low_memory=False)
df = df.drop(columns=['TEAM.1'])  # don't need 2 team name columns
df = df.dropna()  # drop any row that is missing at least one value
display(df.head(20))

	YEAR	SEED	TEAM	ROUND	KENPOM ADJUSTED EFFICIENCY	KENPOM ADJUSTED OFFENSE	KENPOM ADJUSTED DEFENSE	KENPOM ADJUSTED TEMPO	BARTTORVIK ADJUSTED EFFICIENCY	BARTTORVIK ADJUSTED OFFENSE	...	3PT RATE DEFENSE	OP ASSIST %	OP O REB %	OP D REB %	BLOCKED %	TURNOVER % DEFENSE	WINS ABOVE BUBBLE	WIN %	POINTS PER POSSESSION OFFENSE	POINTS PER POSSESSION DEFENSE
0	2022	1	Kansas	1	25.5	119.4	93.9	69.1	27.2	120.1	...	34.2	47.5	28.9	66.6	7.8	18.4	10.4	82.35	1.119	0.970
1	2022	1	Arizona	16	27.2	119.6	92.4	72.2	25.6	117.4	...	34.5	46.8	28.3	65.5	7.0	17.7	8.8	91.18	1.155	0.922
2	2022	1	Gonzaga	16	33.0	121.8	88.8	72.5	31.8	120.2	...	33.9	40.6	23.0	71.0	6.6	17.0	6.7	89.66	1.190	0.885
3	2022	1	Baylor	32	26.3	117.9	91.6	67.2	26.3	116.6	...	35.9	55.5	28.4	63.7	7.3	22.9	8.9	81.25	1.112	0.925
4	2022	2	Duke	4	23.7	119.4	95.7	67.4	25.8	119.9	...	33.8	51.8	28.5	68.1	8.1	16.1	7.2	82.35	1.169	0.979
5	2022	2	Villanova	4	24.1	118.0	93.8	62.6	24.5	117.7	...	43.0	50.0	28.0	69.1	10.9	18.8	7.4	78.79	1.127	0.979
6	2022	2	Auburn	32	24.5	113.6	89.1	70.0	22.9	112.3	...	36.3	48.8	29.3	66.8	10.1	20.7	7.4	84.38	1.085	0.924
7	2022	2	Kentucky	64	26.6	120.2	93.6	67.3	25.1	118.6	...	36.0	46.2	24.9	62.2	6.4	17.4	6.9	78.79	1.142	0.948
8	2022	3	Texas Tech	16	24.6	109.7	85.1	66.5	22.9	109.3	...	45.6	53.8	26.1	66.7	7.2	23.6	6.5	73.53	1.051	0.884
9	2022	3	Purdue	16	22.3	121.3	99.0	65.8	25.5	122.5	...	41.0	53.1	23.6	64.8	6.7	14.1	7.1	79.41	1.185	1.022
10	2022	3	Tennessee	32	25.2	111.4	86.2	67.2	23.5	110.5	...	39.4	51.3	27.4	67.2	7.8	22.9	8.0	78.79	1.058	0.907
11	2022	3	Wisconsin	32	15.6	110.4	94.8	66.5	16.8	110.0	...	33.7	45.9	24.0	74.2	9.0	16.9	6.5	77.42	1.053	0.991
12	2022	4	Arkansas	8	19.0	111.1	92.1	70.6	17.2	109.2	...	39.2	54.4	25.8	69.1	8.7	20.6	4.5	75.76	1.060	0.945
13	2022	4	Providence	16	13.9	111.9	98.0	65.2	14.7	111.1	...	36.1	46.3	28.0	69.5	9.9	15.8	6.0	83.33	1.066	0.995
14	2022	4	UCLA	16	24.8	116.1	91.2	65.5	23.6	115.3	...	37.8	51.1	24.8	68.1	7.9	19.7	5.1	78.13	1.117	0.948
15	2022	4	Illinois	32	19.6	113.7	94.1	67.1	21.1	114.1	...	29.6	41.7	25.8	66.6	8.6	15.6	4.2	70.97	1.102	0.984
16	2022	5	Houston	8	26.5	117.3	90.9	63.8	28.6	117.0	...	42.9	55.4	27.3	62.2	6.4	21.7	6.2	85.29	1.147	0.890
17	2022	5	Saint Mary's	32	19.8	109.8	90.0	63.5	17.9	108.3	...	28.8	35.9	21.9	72.7	8.8	19.5	3.9	77.42	1.060	0.933
18	2022	5	Connecticut	64	19.3	113.9	94.6	64.9	18.6	113.0	...	32.6	44.2	27.5	62.0	8.3	18.0	3.2	71.88	1.099	0.953
19	2022	5	Iowa	64	23.5	121.5	98.0	69.6	23.3	120.9	...	36.5	54.5	30.2	67.8	7.3	19.3	4.6	74.29	1.183	1.007

20 rows × 40 columns

This dataset includes every team that competed in March Madness from 2008 to 2022 (excluding 2020, when there was no tournament due to COVID-19). For each team in each year's tournament, it provides "the average stats of the team from the entire season (including their conference tournament and not including the March Madness tournament)." It also includes each team's seed and the round in which the team was eliminated that year.
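
Since the description above says the data covers 2008 to 2022 without 2020, one quick sanity check is to compare the YEAR column against that expected list of seasons. The sketch below uses a tiny stand-in frame (`sample`, made up for illustration) instead of the real `df`, so the names and values here are assumptions.

```python
import pandas as pd

# Seasons the description says should be present: 2008-2022 without 2020.
expected_years = [y for y in range(2008, 2023) if y != 2020]

# Stand-in for df["YEAR"]; in the notebook this would be the real dataframe.
sample = pd.DataFrame({"YEAR": [2008, 2019, 2021, 2022, 2022]})
years_present = set(sample["YEAR"].tolist())

# Years that should not be there at all (for example, 2020).
unexpected = sorted(years_present - set(expected_years))
# Years that are expected but absent from this (deliberately tiny) sample.
missing = sorted(set(expected_years) - years_present)

print(f"{len(expected_years)} expected seasons, unexpected: {unexpected}")
```

On the real dataframe, both `unexpected` and `missing` should come back empty if the description is accurate.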

Data Dictionary

YEAR: year of the March Madness tournament
SEED: the seed the team was given
TEAM: the NCAA college basketball team
ROUND: the round in which the team's tournament run ended
- 64 = round of 64 (the first round)
- 32 = round of 32 (the second round)
- 16 = the Sweet 16 (third round)
- 8 = the Elite 8 (fourth round)
- 4 = the Final 4 (fifth round)
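- 1 = the championship game (in the sample above, Kansas, which won the 2022 title, has ROUND = 1, so this code appears to mark the champion)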
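
One thing to watch with the ROUND codes listed above is that they get smaller as a team advances further, so treating ROUND as a plain number would put the teams in reverse order. A minimal sketch of one way to handle this, assuming only the six codes listed above occur, is an ordered categorical. `ROUND_ORDER` and `ROUND_DEPTH` are names made up for this example, and the four rows are copied from the sample output earlier in the notebook.

```python
import pandas as pd

# Codes from the data dictionary, ordered from earliest exit to deepest run.
ROUND_ORDER = [64, 32, 16, 8, 4, 1]

# Four rows taken from the df.head(20) output shown earlier.
sample = pd.DataFrame({
    "TEAM": ["Kansas", "Arizona", "Kentucky", "Duke"],
    "ROUND": [1, 16, 64, 4],
})

# An ordered categorical keeps the original codes but gives them a rank.
rounds = pd.Categorical(sample["ROUND"], categories=ROUND_ORDER, ordered=True)

# 0 = out in the round of 64 ... 5 = deepest listed code.
# Any code not in ROUND_ORDER would show up as -1, which makes it easy to spot.
sample["ROUND_DEPTH"] = rounds.codes

print(sample)
```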

The remaining features are each team's season-average stats entering March Madness in a given year. I was able to find definitions for a few of them, but more research is needed to define them all. If this project goes forward, we would probably need to remove some of the redundant stats. For example, KenPom adjusted efficiency is KenPom adjusted offense minus KenPom adjusted defense, so we would only need adjusted efficiency and not the other two.
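
The redundancy mentioned above can be checked directly before dropping anything. The sketch below uses four rows copied from the sample output earlier in the notebook and tests whether offense minus defense reproduces the efficiency column. The `TOLERANCE` value is an assumption to absorb rounding, since each stat is reported to one decimal place.

```python
import pandas as pd

# Rows copied from the df.head(20) output shown earlier.
sample = pd.DataFrame({
    "TEAM": ["Kansas", "Arizona", "Gonzaga", "UCLA"],
    "KENPOM ADJUSTED EFFICIENCY": [25.5, 27.2, 33.0, 24.8],
    "KENPOM ADJUSTED OFFENSE": [119.4, 119.6, 121.8, 116.1],
    "KENPOM ADJUSTED DEFENSE": [93.9, 92.4, 88.8, 91.2],
})

# Recompute efficiency from its two components.
recomputed = sample["KENPOM ADJUSTED OFFENSE"] - sample["KENPOM ADJUSTED DEFENSE"]

# Absolute gap between the recomputed value and the reported one.
gap = (recomputed - sample["KENPOM ADJUSTED EFFICIENCY"]).abs()

TOLERANCE = 0.15  # assumed allowance for one-decimal rounding
is_redundant = bool((gap <= TOLERANCE).all())

print(gap.round(2).tolist(), is_redundant)
```

Running the same comparison on the full dataframe would show whether the relationship holds for every team and year, not just these four rows.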

KENPOM ADJUSTED EFFICIENCY: The metric KenPom uses to rank teams overall. The more positive, the better. It is the adjusted offensive efficiency minus the adjusted defensive efficiency, which gives the number of points by which a team would outscore the “average” Division I program per 100 possessions.
KENPOM ADJUSTED OFFENSE: The number of points a team would score per 100 possessions (trips down the floor with the basketball) against an average Division I opponent.
KENPOM ADJUSTED DEFENSE: The number of points a team would allow per 100 possessions against an average Division I opponent.
KENPOM ADJUSTED TEMPO: The number of possessions a team has per 40 minutes (the length of one regulation game).
BARTTORVIK ADJUSTED EFFICIENCY:
BARTTORVIK ADJUSTED OFFENSE:
BARTTORVIK ADJUSTED DEFENSE:
BARTHAG:
ELITE SOS:
BARTTORVIK ADJUSTED TEMPO:
2PT %: The percentage of the team's 2-point field goal attempts that are made
3PT %: The percentage of the team's 3-point field goal attempts that are made
FREE THROW %: The percentage of its free throws that the team makes
EFG %: Field goal percentage adjusted for a made 3-point field goal being worth 1.5 times as much as a made 2-point field goal
FREE THROW RATE: How often the team gets to the free throw line, measured as free throw attempts relative to field goal attempts
3PT RATE: The share of the team's field goal attempts that are 3-pointers
ASSIST %:
OFFENSIVE REBOUND %:
DEFENSIVE REBOUND %:
BLOCK %: The percentage of opposing teams' field goal attempts that this team blocks
TURNOVER %:
2PT % DEFENSE:
3PT % DEFENSE:
FREE THROW % DEFENSE:
EFG % DEFENSE:
FREE THROW RATE DEFENSE:
3PT RATE DEFENSE:
OP ASSIST %:
OP O REB %:
OP D REB %:
BLOCKED %: The percentage of this team's field goal attempts that get blocked
TURNOVER % DEFENSE:
WINS ABOVE BUBBLE:
WIN %: Percentage of games won
POINTS PER POSSESSION OFFENSE: The number of points a player or team scores per possession
POINTS PER POSSESSION DEFENSE: The number of points opponents score per possession against this team
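
To make the EFG % definition above concrete, here is a small sketch of the usual effective field goal formula, (FGM + 0.5 × 3PM) / FGA. The function name and the shot counts are made up for illustration; the dataset only stores the finished percentage, not these raw counts.

```python
def effective_fg_pct(fg_made, three_made, fg_attempted):
    """Effective field goal percentage, on a 0-100 scale.

    fg_made counts all made field goals (2s and 3s together);
    three_made is the subset of those that were 3-pointers.
    """
    return 100.0 * (fg_made + 0.5 * three_made) / fg_attempted


# Hypothetical team A: 10 made shots on 20 attempts, all of them 3-pointers.
team_a = effective_fg_pct(fg_made=10, three_made=10, fg_attempted=20)

# Hypothetical team B: 15 made shots on 20 attempts, all of them 2-pointers.
team_b = effective_fg_pct(fg_made=15, three_made=0, fg_attempted=20)

# Both teams scored 30 points on 20 shots, so their EFG % is the same.
print(team_a, team_b)
```

The two made-up teams score the same points on the same number of shots, which is exactly the situation EFG % is meant to treat as equal even though their raw field goal percentages (50% and 75%) differ.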

Sources for stat descriptions

KenPom rankings explained & how to better evaluate Rutgers basketball

Stat Glossary - NBA

3. Proposal

We could build a classification model whose classes (y/output) are the rounds that teams reached in the tournament. The model would use all of the stats (x/input) provided for each team to predict how far that team will advance. This would help bracket makers decide how far to advance each team based on its pre-tournament season stats.
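
As a rough sketch of what such a model could look like, the example below trains a scikit-learn random forest to predict ROUND and evaluates it on a held-out season. Everything about the data here is an assumption: the rows are randomly generated stand-ins (not real teams), only two feature columns are used, and the choice of a random forest and of holding out the latest year are illustrative options, not a settled design.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
ROUND_ORDER = [64, 32, 16, 8, 4, 1]  # earliest exit to deepest run


def make_fake_season(year, n_teams=64):
    """Generate synthetic rows where better seeds tend to go further."""
    seed = np.tile(np.arange(1, 17), n_teams // 16)
    efficiency = 30 - 1.8 * seed + rng.normal(0, 3, n_teams)
    noisy_depth = (efficiency + 5) / 7 + rng.normal(0, 0.7, n_teams)
    depth = np.clip(np.round(noisy_depth), 0, 5).astype(int)
    return pd.DataFrame({
        "YEAR": year,
        "SEED": seed,
        "KENPOM ADJUSTED EFFICIENCY": efficiency,
        "ROUND": [ROUND_ORDER[d] for d in depth],
    })


fake_df = pd.concat(
    [make_fake_season(y) for y in (2019, 2021, 2022)], ignore_index=True
)

features = ["SEED", "KENPOM ADJUSTED EFFICIENCY"]

# Hold out the most recent season so the model is scored on a tournament
# it has never seen, which mirrors how a bracket would actually be filled in.
train = fake_df[fake_df["YEAR"] < 2022]
test = fake_df[fake_df["YEAR"] == 2022]

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(train[features], train["ROUND"])

predicted = model.predict(test[features])
accuracy = accuracy_score(test["ROUND"], predicted)

print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

With the real dataframe, `features` would be the full list of stat columns, and the accuracy would only be meaningful once compared against a simple baseline such as picking by seed alone.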