March Madness is one of the most exciting times of the year for sports fans: 64 of the best college basketball teams compete for the national title. Much of the fun is filling out a bracket that tries to predict the outcome of every game in the tournament. Fans can enter various contests with the chance to win millions of dollars for their bracket, and Warren Buffett once offered 1 billion dollars to anyone who filled out a perfect one. No one has come close, which is unsurprising given that the odds of picking all 63 games correctly by chance are about 1 in 9.2 quintillion (2^63 possible brackets). Of the millions of people who try each year, the closest anyone has come was Gregg Nigl of Columbus, Ohio, in 2019: he predicted the first 49 games correctly, but his streak ended in the Sweet 16. Data science and machine learning could give bracket makers meaningful insights, and might even help someone win a lot of money.
Source - Has anyone ever had a perfect bracket for March Madness?
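import pandas as pd
from IPython.display import display  # explicit import so the cell also runs outside a notebook

# Load the tournament team data into a dataframe
df = pd.read_csv('tournament_team_data.csv', low_memory=False)
df = df.drop(columns=['TEAM.1'])  # duplicate of the TEAM column
df = df.dropna()  # discard rows with any missing stat
display(df.head(20))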
| | YEAR | SEED | TEAM | ROUND | KENPOM ADJUSTED EFFICIENCY | KENPOM ADJUSTED OFFENSE | KENPOM ADJUSTED DEFENSE | KENPOM ADJUSTED TEMPO | BARTTORVIK ADJUSTED EFFICIENCY | BARTTORVIK ADJUSTED OFFENSE | ... | 3PT RATE DEFENSE | OP ASSIST % | OP O REB % | OP D REB % | BLOCKED % | TURNOVER % DEFENSE | WINS ABOVE BUBBLE | WIN % | POINTS PER POSSESSION OFFENSE | POINTS PER POSSESSION DEFENSE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022 | 1 | Kansas | 1 | 25.5 | 119.4 | 93.9 | 69.1 | 27.2 | 120.1 | ... | 34.2 | 47.5 | 28.9 | 66.6 | 7.8 | 18.4 | 10.4 | 82.35 | 1.119 | 0.970 |
1 | 2022 | 1 | Arizona | 16 | 27.2 | 119.6 | 92.4 | 72.2 | 25.6 | 117.4 | ... | 34.5 | 46.8 | 28.3 | 65.5 | 7.0 | 17.7 | 8.8 | 91.18 | 1.155 | 0.922 |
2 | 2022 | 1 | Gonzaga | 16 | 33.0 | 121.8 | 88.8 | 72.5 | 31.8 | 120.2 | ... | 33.9 | 40.6 | 23.0 | 71.0 | 6.6 | 17.0 | 6.7 | 89.66 | 1.190 | 0.885 |
3 | 2022 | 1 | Baylor | 32 | 26.3 | 117.9 | 91.6 | 67.2 | 26.3 | 116.6 | ... | 35.9 | 55.5 | 28.4 | 63.7 | 7.3 | 22.9 | 8.9 | 81.25 | 1.112 | 0.925 |
4 | 2022 | 2 | Duke | 4 | 23.7 | 119.4 | 95.7 | 67.4 | 25.8 | 119.9 | ... | 33.8 | 51.8 | 28.5 | 68.1 | 8.1 | 16.1 | 7.2 | 82.35 | 1.169 | 0.979 |
5 | 2022 | 2 | Villanova | 4 | 24.1 | 118.0 | 93.8 | 62.6 | 24.5 | 117.7 | ... | 43.0 | 50.0 | 28.0 | 69.1 | 10.9 | 18.8 | 7.4 | 78.79 | 1.127 | 0.979 |
6 | 2022 | 2 | Auburn | 32 | 24.5 | 113.6 | 89.1 | 70.0 | 22.9 | 112.3 | ... | 36.3 | 48.8 | 29.3 | 66.8 | 10.1 | 20.7 | 7.4 | 84.38 | 1.085 | 0.924 |
7 | 2022 | 2 | Kentucky | 64 | 26.6 | 120.2 | 93.6 | 67.3 | 25.1 | 118.6 | ... | 36.0 | 46.2 | 24.9 | 62.2 | 6.4 | 17.4 | 6.9 | 78.79 | 1.142 | 0.948 |
8 | 2022 | 3 | Texas Tech | 16 | 24.6 | 109.7 | 85.1 | 66.5 | 22.9 | 109.3 | ... | 45.6 | 53.8 | 26.1 | 66.7 | 7.2 | 23.6 | 6.5 | 73.53 | 1.051 | 0.884 |
9 | 2022 | 3 | Purdue | 16 | 22.3 | 121.3 | 99.0 | 65.8 | 25.5 | 122.5 | ... | 41.0 | 53.1 | 23.6 | 64.8 | 6.7 | 14.1 | 7.1 | 79.41 | 1.185 | 1.022 |
10 | 2022 | 3 | Tennessee | 32 | 25.2 | 111.4 | 86.2 | 67.2 | 23.5 | 110.5 | ... | 39.4 | 51.3 | 27.4 | 67.2 | 7.8 | 22.9 | 8.0 | 78.79 | 1.058 | 0.907 |
11 | 2022 | 3 | Wisconsin | 32 | 15.6 | 110.4 | 94.8 | 66.5 | 16.8 | 110.0 | ... | 33.7 | 45.9 | 24.0 | 74.2 | 9.0 | 16.9 | 6.5 | 77.42 | 1.053 | 0.991 |
12 | 2022 | 4 | Arkansas | 8 | 19.0 | 111.1 | 92.1 | 70.6 | 17.2 | 109.2 | ... | 39.2 | 54.4 | 25.8 | 69.1 | 8.7 | 20.6 | 4.5 | 75.76 | 1.060 | 0.945 |
13 | 2022 | 4 | Providence | 16 | 13.9 | 111.9 | 98.0 | 65.2 | 14.7 | 111.1 | ... | 36.1 | 46.3 | 28.0 | 69.5 | 9.9 | 15.8 | 6.0 | 83.33 | 1.066 | 0.995 |
14 | 2022 | 4 | UCLA | 16 | 24.8 | 116.1 | 91.2 | 65.5 | 23.6 | 115.3 | ... | 37.8 | 51.1 | 24.8 | 68.1 | 7.9 | 19.7 | 5.1 | 78.13 | 1.117 | 0.948 |
15 | 2022 | 4 | Illinois | 32 | 19.6 | 113.7 | 94.1 | 67.1 | 21.1 | 114.1 | ... | 29.6 | 41.7 | 25.8 | 66.6 | 8.6 | 15.6 | 4.2 | 70.97 | 1.102 | 0.984 |
16 | 2022 | 5 | Houston | 8 | 26.5 | 117.3 | 90.9 | 63.8 | 28.6 | 117.0 | ... | 42.9 | 55.4 | 27.3 | 62.2 | 6.4 | 21.7 | 6.2 | 85.29 | 1.147 | 0.890 |
17 | 2022 | 5 | Saint Mary's | 32 | 19.8 | 109.8 | 90.0 | 63.5 | 17.9 | 108.3 | ... | 28.8 | 35.9 | 21.9 | 72.7 | 8.8 | 19.5 | 3.9 | 77.42 | 1.060 | 0.933 |
18 | 2022 | 5 | Connecticut | 64 | 19.3 | 113.9 | 94.6 | 64.9 | 18.6 | 113.0 | ... | 32.6 | 44.2 | 27.5 | 62.0 | 8.3 | 18.0 | 3.2 | 71.88 | 1.099 | 0.953 |
19 | 2022 | 5 | Iowa | 64 | 23.5 | 121.5 | 98.0 | 69.6 | 23.3 | 120.9 | ... | 36.5 | 54.5 | 30.2 | 67.8 | 7.3 | 19.3 | 4.6 | 74.29 | 1.183 | 1.007 |
20 rows × 40 columns
This dataset includes every team that competed in March Madness from 2008 to 2022 (excluding 2020, when the tournament was cancelled due to COVID-19). For each team in each year's tournament, it provides "the average stats of the team from the entire season (including their conference tournament and not including the March Madness tournament)." It also includes each team's seed and the round in which the team was eliminated that year.
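The sample rows above suggest that ROUND records the size of the field in the round where a team's run ended: 64 for a first-round loss, down to 1 for the champion (Kansas in 2022). A minimal sketch of turning those codes into readable labels is shown here. The `ROUND_LABELS` name and the label text are illustrative assumptions, and the codes 2 and 68 are assumed by analogy since they do not appear in the rows shown.

```python
import pandas as pd

# Hypothetical labels. Codes 1, 4, 8, 16, 32 and 64 appear in the sample rows;
# 2 and 68 are assumptions made by analogy.
ROUND_LABELS = {
    68: "First Four",
    64: "Round of 64",
    32: "Round of 32",
    16: "Sweet 16",
    8: "Elite Eight",
    4: "Final Four",
    2: "Runner-up",
    1: "Champion",
}

# A few (TEAM, ROUND) pairs copied from the table above.
sample = pd.DataFrame({
    "TEAM": ["Kansas", "Duke", "Arkansas", "Purdue", "Auburn", "Kentucky"],
    "ROUND": [1, 4, 8, 16, 32, 64],
})

# Map each numeric code to its label; unknown codes would become NaN.
sample["EXIT"] = sample["ROUND"].map(ROUND_LABELS)
print(sample)
```

On the full dataframe, `df["ROUND"].value_counts()` would be a quick way to confirm which codes actually occur before relying on this mapping.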
The rest of the features are each team's season-average stats before March Madness in a given year. I was able to find definitions for a few, but would need to keep researching to define them all. If this project goes ahead, we would probably need to remove some of the redundant stats (e.g., KenPom adjusted efficiency is KenPom adjusted offense minus KenPom adjusted defense, so we only need adjusted efficiency and not the other two).
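The redundancy mentioned above can be checked directly before dropping anything. The sketch below uses six rows copied from the table shown earlier; the 0.15 tolerance is an assumption chosen to absorb rounding in the published figures.

```python
import pandas as pd

# Values copied from the first rows of the table above.
sample = pd.DataFrame({
    "TEAM": ["Kansas", "Arizona", "Gonzaga", "Baylor", "Duke", "Villanova"],
    "KENPOM ADJUSTED EFFICIENCY": [25.5, 27.2, 33.0, 26.3, 23.7, 24.1],
    "KENPOM ADJUSTED OFFENSE": [119.4, 119.6, 121.8, 117.9, 119.4, 118.0],
    "KENPOM ADJUSTED DEFENSE": [93.9, 92.4, 88.8, 91.6, 95.7, 93.8],
})

# Efficiency should equal offense minus defense, up to rounding.
derived = sample["KENPOM ADJUSTED OFFENSE"] - sample["KENPOM ADJUSTED DEFENSE"]
gap = (derived - sample["KENPOM ADJUSTED EFFICIENCY"]).abs()

is_redundant = bool((gap <= 0.15).all())  # 0.15 is an assumed rounding tolerance
print(gap.round(2).tolist(), is_redundant)
```

If the same check holds on the full dataframe, the columns to remove could be passed to `df.drop(columns=[...])`. Strictly speaking, only one of the three columns is redundant given the other two; dropping both offense and defense also discards how the margin is split between the two ends of the floor.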
KENPOM ADJUSTED EFFICIENCY: This is how KenPom determines the overall ranking of teams; the more positive, the better. It is the adjusted offensive efficiency minus the adjusted defensive efficiency, i.e. how many points per 100 possessions a team would outscore the “average” Division I program by.
KENPOM ADJUSTED OFFENSE: The number of points a team would score per 100 possessions (trips down the floor with the basketball) against an average Division I opponent.
KENPOM ADJUSTED DEFENSE: The number of points a team would allow per 100 possessions against an average Division I opponent.
KENPOM ADJUSTED TEMPO: The number of possessions a team has per 40 minutes (the length of one regulation game).
BARTTORVIK ADJUSTED EFFICIENCY:
BARTTORVIK ADJUSTED OFFENSE:
BARTTORVIK ADJUSTED DEFENSE:
BARTHAG:
ELITE SOS:
BARTTORVIK ADJUSTED TEMPO:
2PT %: The percentage of 2-point field goal attempts that a team makes
3PT %: The percentage of 3-point field goal attempts that a team makes
FREE THROW %: The percentage of free throw attempts that a team makes
EFG %: Field goal percentage adjusted for a made 3-point field goal being worth 1.5 times as much as a made 2-point field goal
FREE THROW RATE: How often a team gets to the free throw line, typically measured as free throw attempts relative to field goal attempts
3PT RATE: The share of a team's field goal attempts that are 3-pointers
ASSIST %:
OFFENSIVE REBOUND %:
DEFENSIVE REBOUND %:
BLOCK %: The percentage of opponents' field goal attempts that the team blocks
TURNOVER %:
2PT % DEFENSE:
3PT % DEFENSE:
FREE THROW % DEFENSE:
EFG % DEFENSE:
FREE THROW RATE DEFENSE:
3PT RATE DEFENSE:
OP ASSIST %:
OP O REB %:
OP D REB %:
BLOCKED %: The percentage of the team's own field goal attempts that get blocked
TURNOVER % DEFENSE:
WINS ABOVE BUBBLE:
WIN %: Percentage of games won
POINTS PER POSSESSION OFFENSE: The number of points a team scores per possession
POINTS PER POSSESSION DEFENSE: The number of points opponents score per possession against this team
Source - KenPom rankings explained & how to better evaluate Rutgers basketball
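The EFG % definition in the list above corresponds to the standard formula (FGM + 0.5 × 3PM) / FGA, where FGM is made field goals, 3PM is made 3-pointers, and FGA is field goal attempts. A small sketch of the arithmetic follows; the function name and the box-score totals are made up for illustration and do not come from the dataset.

```python
def effective_fg_pct(fgm: int, three_pm: int, fga: int) -> float:
    """Effective field goal percentage: (FGM + 0.5 * 3PM) / FGA, as a percentage."""
    if fga <= 0:
        raise ValueError("fga must be positive")
    return 100.0 * (fgm + 0.5 * three_pm) / fga


# Hypothetical totals: 30 made field goals, 10 of them threes, on 60 attempts.
efg = effective_fg_pct(fgm=30, three_pm=10, fga=60)
plain_fg = 100.0 * 30 / 60

print(f"FG% = {plain_fg:.1f}, eFG% = {efg:.1f}")
# prints: FG% = 50.0, eFG% = 58.3
```

The extra half credit per made three is what makes a 3-pointer count 1.5 times as much as a 2-pointer in this measure.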
We could create a classification model in which the classes (y/output) are the rounds teams reached in the tournament. The model would use all of the stats (x/input) provided for each team to predict how far it will advance. This would help bracket makers decide how far to advance each team based on its regular-season stats.
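As a rough illustration of that setup, the sketch below trains a classifier on synthetic data. Everything in it is an assumption made for the example: the proposal does not name a model, so scikit-learn's `RandomForestClassifier` is swapped in as one possible choice, the three feature columns are a small subset picked for brevity, and the values and thresholds are randomly generated rather than taken from the real CSV.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real table: 640 fake teams, three made-up features.
n = 640
X = pd.DataFrame({
    "SEED": rng.integers(1, 17, size=n),
    "KENPOM ADJUSTED EFFICIENCY": rng.normal(15, 8, size=n),
    "WIN %": rng.uniform(50, 95, size=n),
})

# Fake target: stronger teams tend to last longer. Classes reuse the ROUND
# coding (64 = first-round exit ... 1 = champion); the cut points are arbitrary.
strength = X["KENPOM ADJUSTED EFFICIENCY"] + rng.normal(0, 6, size=n)
y = pd.cut(
    strength,
    bins=[-np.inf, 12, 20, 26, 31, 35, 38, np.inf],
    labels=[64, 32, 16, 8, 4, 2, 1],
).astype(int)

# Hold out a quarter of the rows to estimate accuracy on unseen teams.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

With the real data, `X` would be the stat columns of `df` and `y` would be `df["ROUND"]`. Holding out entire tournament years (splitting on `YEAR`) would likely be a fairer test than a random split, since the goal is to predict a future bracket.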