Every season, MLB hosts an action-packed series of games, bringing players up to the plate to face the pitcher on the mound. The pitcher has a crucial role in the game and can drastically alter its outcome. Bleacher Report notes that because the starting pitcher spends the most time with the ball, they play a role in the team's success. Moreover, The Complete Pitcher explains that the pitcher controls the pace of the game and sets the tone for future innings.
Using Elo data, a system for rating teams, together with each game's starting pitchers, we'd like to measure how much a pitcher matters to the outcome of a game. If a team with a high Elo loses, it could be helpful to determine whether the cause was its starting lineup or its pitcher. Conversely, if a home team with a lower Elo than the away team pulls out a win, it would be helpful to see whether the pitcher carried the win and set a positive tone.
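For reference, the textbook Elo model turns a rating gap into a win probability with a logistic curve on a 400-point scale. The sketch below shows only that generic formula; the function name and the example ratings are illustrative assumptions, and the dataset's own probability columns may include further adjustments (home field, for example), so they will not necessarily match this formula applied to the raw ratings.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score for side A against side B (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings chosen for illustration, not taken from the dataset.
favorite = elo_win_probability(1600.0, 1500.0)
underdog = elo_win_probability(1500.0, 1600.0)

# The two probabilities are complementary: they always sum to 1.
print(f"favorite: {favorite:.3f}, underdog: {underdog:.3f}")
```

Two evenly rated teams get 0.5 each under this formula, and the edge grows as the rating gap widens.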
This would let teams choose starting pitchers based on their features and ratings, pay closer attention to Elo, or judge whether a certain game is a good chance to test a new pitcher and see which choice is likely to produce the best outcome.
I will be using the MLB Elo dataset that FiveThirtyEight publishes on GitHub. It contains information for every season dating back to 1871, so there is an expansive set of data to work with.
Column | Definition |
---|---|
season | Year of season (more recent games at the top) |
playoff | Whether it was a playoff game and, if so, which round (for example, w in the preview rows below) |
team1 | Home team |
team2 | Away team |
elo1_pre | Home team's Elo rating before the game |
elo2_pre | Away team's Elo rating before the game |
elo_prob1 | Home team's probability of winning according to Elo ratings |
elo_prob2 | Away team's probability of winning according to Elo ratings |
elo1_post | Home team's Elo rating after the game |
elo2_post | Away team's Elo rating after the game |
pitcher1 | Name of home starting pitcher |
pitcher2 | Name of away starting pitcher |
pitcher1_rgs | Home starting pitcher's rolling game score before the game |
pitcher2_rgs | Away starting pitcher's rolling game score before the game |
score1 | Home team's score |
score2 | Away team's score |
Because the project goal is still somewhat open, we do not intend to use every column. The post-game Elo ratings may not be necessary since we're mainly looking at how the outcome depends on the pitcher, but we'll keep them as an option in case they prove useful.
import pandas as pd
mlb_info = pd.read_csv("mlb_elo.csv")
# preview the first six rows
mlb_info.head(6)
| | date | season | neutral | playoff | team1 | team2 | elo1_pre | elo2_pre | elo_prob1 | elo_prob2 | ... | pitcher1_rgs | pitcher2_rgs | pitcher1_adj | pitcher2_adj | rating_prob1 | rating_prob2 | rating1_post | rating2_post | score1 | score2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022-11-05 | 2022 | 0 | w | HOU | PHI | 1600.129119 | 1546.433821 | 0.644817 | 0.355183 | ... | 58.740666 | 61.632981 | 5.365908 | 32.200438 | 0.576373 | 0.423627 | 1591.472412 | 1544.450289 | 4 | 1 |
1 | 2022-11-03 | 2022 | 0 | w | PHI | HOU | 1548.259208 | 1598.303732 | 0.450191 | 0.549809 | ... | 51.397915 | 62.448418 | -16.017602 | 22.722308 | 0.395291 | 0.604709 | 1546.485993 | 1589.436708 | 2 | 3 |
2 | 2022-11-02 | 2022 | 0 | w | PHI | HOU | 1552.321240 | 1594.241701 | 0.465668 | 0.534332 | ... | 59.422109 | 60.192341 | 21.461511 | 13.320248 | 0.496353 | 0.503647 | 1547.636916 | 1588.285786 | 0 | 5 |
3 | 2022-11-01 | 2022 | 0 | w | PHI | HOU | 1546.100348 | 1600.462592 | 0.442003 | 0.557997 | ... | 53.260588 | 57.130727 | -6.800360 | -2.837861 | 0.455888 | 0.544112 | 1550.940292 | 1584.982409 | 7 | 0 |
4 | 2022-10-29 | 2022 | 0 | w | HOU | PHI | 1598.279125 | 1548.283816 | 0.638287 | 0.361713 | ... | 58.140886 | 62.863435 | 2.498670 | 37.551911 | 0.552880 | 0.447120 | 1589.504732 | 1546.417970 | 5 | 2 |
5 | 2022-10-28 | 2022 | 0 | w | HOU | PHI | 1601.186024 | 1545.376917 | 0.648524 | 0.351476 | ... | 63.309735 | 60.702676 | 26.245071 | 26.457182 | 0.625046 | 0.374954 | 1587.319065 | 1548.603636 | 5 | 6 |
6 rows × 26 columns
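To make the upset question concrete, here is a minimal sketch that derives a win label, an upset flag, and a starter comparison from the columns shown above. It uses two rows copied from the preview so that it runs on its own; the derived column names (`home_win`, `home_favored`, `upset`, `rgs_diff`) are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Two rows copied from the preview above; the real frame would be mlb_info.
games = pd.DataFrame(
    {
        "team1": ["HOU", "HOU"],
        "team2": ["PHI", "PHI"],
        "elo_prob1": [0.644817, 0.648524],
        "pitcher1_rgs": [58.740666, 63.309735],
        "pitcher2_rgs": [61.632981, 60.702676],
        "score1": [4, 5],
        "score2": [1, 6],
    }
)

# 1 if the home team won, 0 otherwise.
games["home_win"] = (games["score1"] > games["score2"]).astype(int)
# Did Elo favor the home team before the game?
games["home_favored"] = games["elo_prob1"] > 0.5
# An upset is a game in which the pre-game favorite lost.
games["upset"] = games["home_favored"] != games["home_win"].astype(bool)
# Positive values mean the home starter had the better rolling game score.
games["rgs_diff"] = games["pitcher1_rgs"] - games["pitcher2_rgs"]

print(games[["team1", "team2", "home_win", "upset", "rgs_diff"]])
```

Applied to the full frame, the same columns would let us compare `rgs_diff` between upsets and expected results, which is one way to look at whether the starter was associated with the surprise.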
One caveat: the rating system changed at some point, so I will use the data that has the newer Elo rating. This may limit how much data we can use, but the original rating system was confusing and unclear. Additionally, because the dataset spans so many seasons, pitchers' skills and overall skill levels have changed over time; this should not have too large an impact, and there should still be sufficient information for an algorithm.
We can use a machine learning algorithm to predict the outcome of a game when a team puts different pitchers in its lineup, using each pitcher's rolling game score to gauge how they might perform. This could take the form of a regression that estimates the outcome of the game given the features and the choice of pitcher.
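As one possible reading of "some form of regression", the sketch below fits an ordinary least squares model of the home team's run margin on the Elo gap and the starters' rolling game score gap. Everything here is an assumption made for illustration: the data is synthetic, the "true" coefficients are made up, and names such as `elo_diff`, `rgs_diff`, `run_margin`, and `predict_margin` are hypothetical. A real analysis would build these features from the dataset's columns and might prefer a different model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_games = 2000

# Synthetic stand-ins for the real columns (assumed, not real data):
# elo_diff ~ elo1_pre - elo2_pre, rgs_diff ~ pitcher1_rgs - pitcher2_rgs.
elo_diff = rng.normal(0.0, 40.0, n_games)
rgs_diff = rng.normal(0.0, 5.0, n_games)

# Made-up effects used only to generate the toy run margins:
# intercept, runs per Elo point, runs per rolling-game-score point.
true_coefs = np.array([0.2, 0.01, 0.1])
noise = rng.normal(0.0, 3.0, n_games)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n_games), elo_diff, rgs_diff])
run_margin = X @ true_coefs + noise  # stands in for score1 - score2

# Ordinary least squares fit.
coefs, *_ = np.linalg.lstsq(X, run_margin, rcond=None)


def predict_margin(elo_gap: float, rgs_gap: float) -> float:
    """Predicted home run margin for a given Elo gap and starter RGS gap."""
    return float(coefs @ np.array([1.0, elo_gap, rgs_gap]))


# Compare two hypothetical starters for the same matchup.
print(predict_margin(20.0, 5.0), predict_margin(20.0, -5.0))
```

Swapping the `rgs_gap` argument while holding the Elo gap fixed mirrors the "what if we started a different pitcher" question; a classifier on win/loss would be a natural alternative if only the winner matters.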