March Madness Data

1. Description

March Madness is one of the most exciting times of the year for sports fans. Sixty-four of the best college basketball teams compete for the national title. It's also a lot of fun to fill out a bracket and try to predict the outcome of every game in the tournament. There are various contests fans can enter for a chance to win millions of dollars with their bracket, and Warren Buffett once offered 1 billion dollars to anyone who submitted a perfect bracket. No one has come close to a perfect bracket, which is unsurprising given that the odds are about 1 in 9.2 quintillion if every game is treated as a coin flip. Of the millions of people who try each year, the closest anyone has come is Gregg Nigl of Columbus, Ohio, in 2019. He predicted the first 49 games correctly, but his streak ended in the Sweet 16. Data science and machine learning may be able to provide meaningful insights for bracket makers, and could even help someone win a lot of money.

Source - Has anyone ever had a perfect bracket for March Madness?
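
The 9.2 quintillion figure can be reproduced with a quick calculation. This is only a back-of-the-envelope sketch: it assumes a 64-team bracket and treats every game as an independent 50/50 guess, which real picks are not.

```python
# A 64-team single-elimination bracket has 63 games,
# because every game eliminates exactly one team.
games = 64 - 1

# Assumption: each game is an independent coin flip,
# so every bracket is one of 2**63 equally likely outcomes.
possible_brackets = 2 ** games

print(f"{possible_brackets:,}")  # 9,223,372,036,854,775,808
```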

2. Data

Kaggle link (Tournament Team Data.csv)

In [3]:
import pandas as pd

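# Load the Kaggle CSV into a DataFrame
df = pd.read_csv('tournament_team_data.csv', low_memory=False)

# 'TEAM.1' duplicates the team name, so keep only 'TEAM'
df = df.drop(columns=['TEAM.1'])

# Drop every row that has at least one missing value
df = df.dropna()

# display() is provided by the Jupyter/IPython environment
display(df.head(20))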
|  | YEAR | SEED | TEAM | ROUND | KENPOM ADJUSTED EFFICIENCY | KENPOM ADJUSTED OFFENSE | KENPOM ADJUSTED DEFENSE | KENPOM ADJUSTED TEMPO | BARTTORVIK ADJUSTED EFFICIENCY | BARTTORVIK ADJUSTED OFFENSE | ... | 3PT RATE DEFENSE | OP ASSIST % | OP O REB % | OP D REB % | BLOCKED % | TURNOVER % DEFENSE | WINS ABOVE BUBBLE | WIN % | POINTS PER POSSESSION OFFENSE | POINTS PER POSSESSION DEFENSE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 1 | Kansas | 1 | 25.5 | 119.4 | 93.9 | 69.1 | 27.2 | 120.1 | ... | 34.2 | 47.5 | 28.9 | 66.6 | 7.8 | 18.4 | 10.4 | 82.35 | 1.119 | 0.970 |
| 1 | 2022 | 1 | Arizona | 16 | 27.2 | 119.6 | 92.4 | 72.2 | 25.6 | 117.4 | ... | 34.5 | 46.8 | 28.3 | 65.5 | 7.0 | 17.7 | 8.8 | 91.18 | 1.155 | 0.922 |
| 2 | 2022 | 1 | Gonzaga | 16 | 33.0 | 121.8 | 88.8 | 72.5 | 31.8 | 120.2 | ... | 33.9 | 40.6 | 23.0 | 71.0 | 6.6 | 17.0 | 6.7 | 89.66 | 1.190 | 0.885 |
| 3 | 2022 | 1 | Baylor | 32 | 26.3 | 117.9 | 91.6 | 67.2 | 26.3 | 116.6 | ... | 35.9 | 55.5 | 28.4 | 63.7 | 7.3 | 22.9 | 8.9 | 81.25 | 1.112 | 0.925 |
| 4 | 2022 | 2 | Duke | 4 | 23.7 | 119.4 | 95.7 | 67.4 | 25.8 | 119.9 | ... | 33.8 | 51.8 | 28.5 | 68.1 | 8.1 | 16.1 | 7.2 | 82.35 | 1.169 | 0.979 |
| 5 | 2022 | 2 | Villanova | 4 | 24.1 | 118.0 | 93.8 | 62.6 | 24.5 | 117.7 | ... | 43.0 | 50.0 | 28.0 | 69.1 | 10.9 | 18.8 | 7.4 | 78.79 | 1.127 | 0.979 |
| 6 | 2022 | 2 | Auburn | 32 | 24.5 | 113.6 | 89.1 | 70.0 | 22.9 | 112.3 | ... | 36.3 | 48.8 | 29.3 | 66.8 | 10.1 | 20.7 | 7.4 | 84.38 | 1.085 | 0.924 |
| 7 | 2022 | 2 | Kentucky | 64 | 26.6 | 120.2 | 93.6 | 67.3 | 25.1 | 118.6 | ... | 36.0 | 46.2 | 24.9 | 62.2 | 6.4 | 17.4 | 6.9 | 78.79 | 1.142 | 0.948 |
| 8 | 2022 | 3 | Texas Tech | 16 | 24.6 | 109.7 | 85.1 | 66.5 | 22.9 | 109.3 | ... | 45.6 | 53.8 | 26.1 | 66.7 | 7.2 | 23.6 | 6.5 | 73.53 | 1.051 | 0.884 |
| 9 | 2022 | 3 | Purdue | 16 | 22.3 | 121.3 | 99.0 | 65.8 | 25.5 | 122.5 | ... | 41.0 | 53.1 | 23.6 | 64.8 | 6.7 | 14.1 | 7.1 | 79.41 | 1.185 | 1.022 |
| 10 | 2022 | 3 | Tennessee | 32 | 25.2 | 111.4 | 86.2 | 67.2 | 23.5 | 110.5 | ... | 39.4 | 51.3 | 27.4 | 67.2 | 7.8 | 22.9 | 8.0 | 78.79 | 1.058 | 0.907 |
| 11 | 2022 | 3 | Wisconsin | 32 | 15.6 | 110.4 | 94.8 | 66.5 | 16.8 | 110.0 | ... | 33.7 | 45.9 | 24.0 | 74.2 | 9.0 | 16.9 | 6.5 | 77.42 | 1.053 | 0.991 |
| 12 | 2022 | 4 | Arkansas | 8 | 19.0 | 111.1 | 92.1 | 70.6 | 17.2 | 109.2 | ... | 39.2 | 54.4 | 25.8 | 69.1 | 8.7 | 20.6 | 4.5 | 75.76 | 1.060 | 0.945 |
| 13 | 2022 | 4 | Providence | 16 | 13.9 | 111.9 | 98.0 | 65.2 | 14.7 | 111.1 | ... | 36.1 | 46.3 | 28.0 | 69.5 | 9.9 | 15.8 | 6.0 | 83.33 | 1.066 | 0.995 |
| 14 | 2022 | 4 | UCLA | 16 | 24.8 | 116.1 | 91.2 | 65.5 | 23.6 | 115.3 | ... | 37.8 | 51.1 | 24.8 | 68.1 | 7.9 | 19.7 | 5.1 | 78.13 | 1.117 | 0.948 |
| 15 | 2022 | 4 | Illinois | 32 | 19.6 | 113.7 | 94.1 | 67.1 | 21.1 | 114.1 | ... | 29.6 | 41.7 | 25.8 | 66.6 | 8.6 | 15.6 | 4.2 | 70.97 | 1.102 | 0.984 |
| 16 | 2022 | 5 | Houston | 8 | 26.5 | 117.3 | 90.9 | 63.8 | 28.6 | 117.0 | ... | 42.9 | 55.4 | 27.3 | 62.2 | 6.4 | 21.7 | 6.2 | 85.29 | 1.147 | 0.890 |
| 17 | 2022 | 5 | Saint Mary's | 32 | 19.8 | 109.8 | 90.0 | 63.5 | 17.9 | 108.3 | ... | 28.8 | 35.9 | 21.9 | 72.7 | 8.8 | 19.5 | 3.9 | 77.42 | 1.060 | 0.933 |
| 18 | 2022 | 5 | Connecticut | 64 | 19.3 | 113.9 | 94.6 | 64.9 | 18.6 | 113.0 | ... | 32.6 | 44.2 | 27.5 | 62.0 | 8.3 | 18.0 | 3.2 | 71.88 | 1.099 | 0.953 |
| 19 | 2022 | 5 | Iowa | 64 | 23.5 | 121.5 | 98.0 | 69.6 | 23.3 | 120.9 | ... | 36.5 | 54.5 | 30.2 | 67.8 | 7.3 | 19.3 | 4.6 | 74.29 | 1.183 | 1.007 |

20 rows × 40 columns

This dataset includes every team that competed in March Madness from 2008 to 2022 (excluding 2020, when the tournament was canceled due to COVID-19). For each year, and for each team in the tournament that year, the dataset provides "the average stats of the team from the entire season (including their conference tournament and not including the March Madness tournament)." It also includes each team's seed and the round in which it was eliminated that year.

Data Dictionary

  • YEAR: year of the March Madness tournament
  • SEED: the seed the team was given
  • TEAM: the NCAA college basketball team
  • ROUND: the round in which the team was eliminated from the tournament
    • 64 = round of 64 (the first round)
    • 32 = round of 32 (the second round)
    • 16 = the Sweet 16 (third round)
    • 8 = the Elite 8 (fourth round)
    • 4 = the Final 4 (fifth round)
    • 1 = the championship game (this appears to mark the champion rather than a loss: Kansas, which won the 2022 title, has ROUND = 1 in the preview above)
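
Because the ROUND codes are numbers that shrink as a team advances, it can help to attach readable labels before exploring or modeling. Below is a minimal sketch using a few rows copied from the preview; the `ROUND_LABELS` dictionary and the `ROUND_LABEL` column are hypothetical names, and only the codes listed above are mapped.

```python
import pandas as pd

# Labels for the codes documented in the data dictionary. Any code that is
# not listed here would map to NaN rather than being guessed.
ROUND_LABELS = {
    64: "Round of 64",
    32: "Round of 32",
    16: "Sweet 16",
    8: "Elite 8",
    4: "Final 4",
    1: "Championship",
}

# A few rows copied from the preview table.
sample = pd.DataFrame({
    "TEAM": ["Kansas", "Arizona", "Baylor", "Duke", "Kentucky", "Arkansas"],
    "ROUND": [1, 16, 32, 4, 64, 8],
})

# Hypothetical helper column with the readable label for each code.
sample["ROUND_LABEL"] = sample["ROUND"].map(ROUND_LABELS)
print(sample)
```

Checking `sample["ROUND_LABEL"].isna().sum()` on the full dataset would show whether any undocumented codes are present.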

The rest of the features are each team's season-average stats before March Madness in a given year. I was able to find definitions for some of them, but the rest still need to be researched. If this project goes ahead, we would probably want to remove stats that are redundant (e.g., KenPom adjusted efficiency is KenPom adjusted offense minus KenPom adjusted defense, so we would only need adjusted efficiency and not the other two).
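
The redundancy claim can be checked directly on the data. The sketch below uses six rows copied from the preview; the 0.15 tolerance is an assumption that allows for each of the three columns being rounded to one decimal place.

```python
import pandas as pd

# Six rows copied from the preview table.
sample = pd.DataFrame({
    "TEAM": ["Kansas", "Arizona", "Gonzaga", "Baylor", "Duke", "Villanova"],
    "KENPOM ADJUSTED EFFICIENCY": [25.5, 27.2, 33.0, 26.3, 23.7, 24.1],
    "KENPOM ADJUSTED OFFENSE": [119.4, 119.6, 121.8, 117.9, 119.4, 118.0],
    "KENPOM ADJUSTED DEFENSE": [93.9, 92.4, 88.8, 91.6, 95.7, 93.8],
})

# Offense minus defense should reproduce the efficiency column.
diff = sample["KENPOM ADJUSTED OFFENSE"] - sample["KENPOM ADJUSTED DEFENSE"]
max_gap = (diff - sample["KENPOM ADJUSTED EFFICIENCY"]).abs().max()

# Assumed tolerance: three values rounded to one decimal can disagree by 0.15.
is_redundant = max_gap <= 0.15
print(round(max_gap, 2), is_redundant)
```

The check only confirms the arithmetic relationship. Whether to drop the offense and defense columns is still a modeling choice, since two teams with the same margin can get there in different ways.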

  • KENPOM ADJUSTED EFFICIENCY: This is how KenPom determines the overall ranking of teams. The more positive, the better. It is the adjusted offensive efficiency minus the adjusted defensive efficiency, i.e. how many points per 100 possessions a team would outscore the "average" Division I program by.

  • KENPOM ADJUSTED OFFENSE: The number of points a team would score per 100 possessions (trips down the floor with the basketball) against an average Division I opponent.

  • KENPOM ADJUSTED DEFENSE: The number of points a team would allow per 100 possessions against an average Division I opponent.

  • KENPOM ADJUSTED TEMPO: The number of possessions a team has per 40 minutes (over the course of one game).

  • BARTTORVIK ADJUSTED EFFICIENCY:

  • BARTTORVIK ADJUSTED OFFENSE:

  • BARTTORVIK ADJUSTED DEFENSE:

  • BARTHAG:

  • ELITE SOS:

  • BARTTORVIK ADJUSTED TEMPO:

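  • 2PT %: Most likely the percentage of 2-point field goal attempts that a team makes. (The NBA glossary definition found for this entry, the percentage of field goals attempted that are 2 pointers, describes shot selection rather than accuracy.)

  • 3PT %: Most likely the percentage of 3-point field goal attempts that a team makes. (As above, the NBA glossary definition found for this entry describes the share of attempts that are 3 pointers, which is closer to 3PT RATE.)

  • FREE THROW %: The percentage of free throws the team makes.

  • EFG %: Effective field goal percentage. Measures field goal percentage while adjusting for a made 3-point field goal being worth 1.5 times as much as a made 2-point field goal.

  • FREE THROW RATE: How often a team gets to the free throw line. It is commonly computed as free throw attempts divided by field goal attempts; the exact formula used in this dataset still needs to be confirmed.

  • 3PT RATE: Probably the percentage of a team's field goal attempts that are 3 pointers, rather than the percentage of points that come from 3 pointers. This reading is an assumption based on values as high as 45.6 in the 3PT RATE DEFENSE column of the preview.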

  • ASSIST %:

  • OFFENSIVE REBOUND %:

  • DEFENSIVE REBOUND %:

  • BLOCK %: The percentage of the opposing team's field goal attempts that the team blocks

  • TURNOVER %:

  • 2PT % DEFENSE:

  • 3PT % DEFENSE:

  • FREE THROW % DEFENSE:

  • EFG % DEFENSE:

  • FREE THROW RATE DEFENSE:

  • 3PT RATE DEFENSE:

  • OP ASSIST %:

  • OP O REB %:

  • OP D REB %:

  • BLOCKED %: The percentage of the team's own field goal attempts that get blocked

  • TURNOVER % DEFENSE:

  • WINS ABOVE BUBBLE:

  • WIN %: Percentage of games won

  • POINTS PER POSSESSION OFFENSE: The number of points a player or team scores per possession

  • POINTS PER POSSESSION DEFENSE: The number of points opponents score per possession against this team
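
Two of the definitions above are easier to follow with numbers. The sketch below works through the standard effective field goal formula, eFG% = (FGM + 0.5 × 3PM) / FGA, using made-up shooting totals, and then rescales the Kansas points-per-possession values from the preview to the per-100-possessions unit used by the KenPom columns. The function name `effective_fg_pct` is hypothetical.

```python
def effective_fg_pct(fg_made, threes_made, fg_attempted):
    """Return eFG% = 100 * (FGM + 0.5 * 3PM) / FGA."""
    return 100 * (fg_made + 0.5 * threes_made) / fg_attempted


# Hypothetical game totals, made up for illustration (not from the dataset).
fg_made, threes_made, fg_attempted = 25, 8, 60

plain_fg_pct = 100 * fg_made / fg_attempted
efg_pct = effective_fg_pct(fg_made, threes_made, fg_attempted)
print(round(plain_fg_pct, 1), round(efg_pct, 1))  # 41.7 48.3

# Points per possession from the Kansas row of the preview, rescaled to
# points per 100 possessions.
ppp_offense, ppp_defense = 1.119, 0.970
raw_margin_per_100 = round((ppp_offense - ppp_defense) * 100, 1)
print(raw_margin_per_100)  # 14.9
```

The raw margin of 14.9 is well below Kansas's KenPom adjusted efficiency of 25.5 in the same row, presumably because the KenPom figures are adjusted to an average Division I opponent while points per possession are not.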

Sources for stat descriptions

KenPom rankings explained & how to better evaluate Rutgers basketball

Stat Glossary - NBA

3. Proposal

We could create a classification model in which the classes (y, the output) are the rounds teams reached in the tournament. The model would use all of the stats provided for each team (X, the input) to predict the round that team will reach. This would help bracket makers decide how far to advance each team based on its pre-tournament season stats.
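
As a starting point, the input/output split could look like the sketch below. It uses ten rows and two stat columns copied from the preview, and it stops short of fitting a real classifier: it only computes a most-frequent-class baseline, which is a reference score that any actual model would need to beat. The variable names are hypothetical.

```python
import pandas as pd

# Ten rows copied from the preview, keeping only two of the stat columns.
sample = pd.DataFrame({
    "YEAR": [2022] * 10,
    "SEED": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    "TEAM": ["Kansas", "Arizona", "Gonzaga", "Baylor", "Duke",
             "Villanova", "Auburn", "Kentucky", "Texas Tech", "Purdue"],
    "ROUND": [1, 16, 16, 32, 4, 4, 32, 64, 16, 16],
    "KENPOM ADJUSTED EFFICIENCY": [25.5, 27.2, 33.0, 26.3, 23.7,
                                   24.1, 24.5, 26.6, 24.6, 22.3],
    "WIN %": [82.35, 91.18, 89.66, 81.25, 82.35,
              78.79, 84.38, 78.79, 73.53, 79.41],
})

# y is the class to predict; X keeps the numeric stats and drops identifiers.
y = sample["ROUND"]
X = sample.drop(columns=["ROUND", "TEAM", "YEAR"])

# Baseline: always predict the most common round in the data.
most_common_round = y.mode().iloc[0]
baseline_accuracy = (y == most_common_round).mean()
print(most_common_round, baseline_accuracy)
```

On the full dataset the classes are heavily imbalanced: roughly half of a 64-team field loses in the round of 64, so always predicting 64 would already be right about half the time. A sensible evaluation would probably also train on earlier years and test on a later one, so the model is never scored on a tournament it has seen.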