MLB Elo Prediction

Every season, Major League Baseball (MLB) stages an action-packed slate of games in which batters step up to the plate to face the pitcher on the mound. The pitcher plays a crucial role and can drastically alter the outcome of a game. Bleacher Report notes that because the starting pitcher spends the most time with the ball, they play a large part in the team's success. Moreover, The Complete Pitcher explains that the pitcher controls the pace of the game and sets the tone for the innings that follow.

Using Elo data, a system for rating teams, along with data on each team's starting pitcher, we would like to quantify how important a pitcher is to the outcome of a game. If teams with high Elo ratings lose, it could be helpful to determine whether the loss is attributable to their starting lineup or their pitcher. Conversely, if a home team with a lower Elo rating than the away team pulls out a win, it would be helpful to see whether the pitcher carried the win and set a positive tone.

This would allow teams to choose starting pitchers based on features and ratings, to pay closer attention to their Elo rating, and to judge whether a given game is a good opportunity to try out a new pitcher.

Dataset

I will be using FiveThirtyEight's MLB Elo dataset, which is hosted on GitHub. It contains information for every season dating back to 1871, so there is an expansive set of data to work with.

Column Definition
season Year of season (more recent games at the top)
playoff Whether it was a playoff game
team1 Home team
team2 Away team
elo1_pre Home team's Elo rating before the game
elo2_pre Away team's Elo rating before the game
elo_prob1 Home team's probability of winning according to Elo ratings
elo_prob2 Away team's probability of winning according to Elo ratings
elo1_post Home team's Elo rating after the game
elo2_post Away team's Elo rating after the game
pitcher1 Name of home starting pitcher
pitcher2 Name of away starting pitcher
pitcher1_rgs Home starting pitcher's rolling game score before the game
pitcher2_rgs Away starting pitcher's rolling game score before the game
score1 Home team's score
score2 Away team's score
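
To make the relationship between the pre-game ratings and the win probabilities more concrete, here is a minimal sketch of the textbook Elo expected-score formula. It is not necessarily the exact calculation behind `elo_prob1`; the function name and the `home_adjustment` parameter are assumptions added for illustration.

```python
def elo_win_probability(elo_home: float, elo_away: float,
                        home_adjustment: float = 0.0) -> float:
    """Textbook Elo expected score for the home team.

    home_adjustment is a hypothetical knob (in rating points) for
    home-field advantage; it is not taken from the dataset's documentation.
    """
    rating_diff = (elo_home + home_adjustment) - elo_away
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


# Pre-game ratings from the first row of the dataset preview (HOU at home vs. PHI).
p_home = elo_win_probability(1600.129119, 1546.433821)
p_away = 1.0 - p_home
print(f"home: {p_home:.3f}, away: {p_away:.3f}")
```

For this game the plain formula gives a lower home probability than the 0.644817 listed in `elo_prob1`, which suggests the published probabilities include adjustments beyond the raw rating difference.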

Because the project goal is still loosely defined, we do not intend to use every column. The post-game Elo ratings may not be necessary, since we are mainly looking at the outcome as a function of the pitcher, but we will keep them as an option in case they prove useful.
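
One way to narrow the data down is sketched below: keep the columns from the table above and derive a win label from the two scores. The names `CORE_COLUMNS`, `prepare_games`, and `home_win` are hypothetical. The demo rows reuse values from the dataset preview, with placeholder strings for the pitcher names because the preview does not show them.

```python
import pandas as pd

# Columns kept for the pitcher analysis (names as in the column table).
CORE_COLUMNS = [
    "season", "playoff", "team1", "team2",
    "elo1_pre", "elo2_pre", "elo_prob1", "elo_prob2",
    "pitcher1", "pitcher2", "pitcher1_rgs", "pitcher2_rgs",
    "score1", "score2",
]


def prepare_games(games: pd.DataFrame, extra_columns=()) -> pd.DataFrame:
    """Keep the core columns (plus any extras) and add a home_win label."""
    subset = games[CORE_COLUMNS + list(extra_columns)].copy()
    # 1 if the home team (team1) outscored the away team, else 0.
    subset["home_win"] = (subset["score1"] > subset["score2"]).astype(int)
    return subset


# Two games from the preview; pitcher names are placeholders, not real data.
demo = pd.DataFrame({
    "season": [2022, 2022],
    "neutral": [0, 0],
    "playoff": ["w", "w"],
    "team1": ["HOU", "HOU"],
    "team2": ["PHI", "PHI"],
    "elo1_pre": [1600.129119, 1601.186024],
    "elo2_pre": [1546.433821, 1545.376917],
    "elo_prob1": [0.644817, 0.648524],
    "elo_prob2": [0.355183, 0.351476],
    "pitcher1": ["home starter (placeholder)"] * 2,
    "pitcher2": ["away starter (placeholder)"] * 2,
    "pitcher1_rgs": [58.740666, 63.309735],
    "pitcher2_rgs": [61.632981, 60.702676],
    "score1": [4, 5],
    "score2": [1, 6],
})

prepared = prepare_games(demo)
print(prepared[["team1", "team2", "score1", "score2", "home_win"]])
```

Passing `extra_columns=["elo1_post", "elo2_post"]` would bring the post-game ratings back in if they turn out to be useful.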

In [4]:
import pandas as pd

mlb_info = pd.read_csv("mlb_elo.csv")

# preview the first six rows
mlb_info.head(6)
Out[4]:
date season neutral playoff team1 team2 elo1_pre elo2_pre elo_prob1 elo_prob2 ... pitcher1_rgs pitcher2_rgs pitcher1_adj pitcher2_adj rating_prob1 rating_prob2 rating1_post rating2_post score1 score2
0 2022-11-05 2022 0 w HOU PHI 1600.129119 1546.433821 0.644817 0.355183 ... 58.740666 61.632981 5.365908 32.200438 0.576373 0.423627 1591.472412 1544.450289 4 1
1 2022-11-03 2022 0 w PHI HOU 1548.259208 1598.303732 0.450191 0.549809 ... 51.397915 62.448418 -16.017602 22.722308 0.395291 0.604709 1546.485993 1589.436708 2 3
2 2022-11-02 2022 0 w PHI HOU 1552.321240 1594.241701 0.465668 0.534332 ... 59.422109 60.192341 21.461511 13.320248 0.496353 0.503647 1547.636916 1588.285786 0 5
3 2022-11-01 2022 0 w PHI HOU 1546.100348 1600.462592 0.442003 0.557997 ... 53.260588 57.130727 -6.800360 -2.837861 0.455888 0.544112 1550.940292 1584.982409 7 0
4 2022-10-29 2022 0 w HOU PHI 1598.279125 1548.283816 0.638287 0.361713 ... 58.140886 62.863435 2.498670 37.551911 0.552880 0.447120 1589.504732 1546.417970 5 2
5 2022-10-28 2022 0 w HOU PHI 1601.186024 1545.376917 0.648524 0.351476 ... 63.309735 60.702676 26.245071 26.457182 0.625046 0.374954 1587.319065 1548.603636 5 6

6 rows × 26 columns

Potential Problems

One complication is that the rating system used in the dataset changed over time, so I will use the data based on the newer Elo rating. This may limit how much data we can use, but the original rating system was confusing and unclear. Additionally, because the data spans so many seasons, pitchers' skill levels have changed over time; this should not have too large an impact, and there should still be sufficient information to train an algorithm.
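
One possible reading of this restriction, sketched under assumptions: keep only games from a chosen season onward in which both starters have a rolling game score. The function name `filter_games` and the default `min_season=2000` are placeholders, not a cutoff stated in this project, and the small frame below is synthetic.

```python
import pandas as pd


def filter_games(games: pd.DataFrame, min_season: int = 2000) -> pd.DataFrame:
    """Keep games from min_season onward with both starters' rolling game scores.

    min_season=2000 is an arbitrary placeholder, not a project decision.
    """
    recent = games["season"] >= min_season
    has_pitchers = games[["pitcher1_rgs", "pitcher2_rgs"]].notna().all(axis=1)
    return games.loc[recent & has_pitchers].reset_index(drop=True)


# Synthetic rows covering each case: too old, missing scores, and usable.
nan = float("nan")
sample = pd.DataFrame({
    "season": [1871, 1995, 2022, 2022],
    "pitcher1_rgs": [nan, 50.0, 58.7, nan],
    "pitcher2_rgs": [nan, 48.0, 61.6, 60.0],
})

usable = filter_games(sample)
print(usable)
```

Printing `len(mlb_info)` before and after such a filter would show how much data the restriction actually costs.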

Method

We can use a machine learning algorithm to predict the outcome of a game when different pitchers are placed in the starting lineup, using each pitcher's rolling game score as an indication of how they are likely to perform. This could be framed as a regression problem: estimating the outcome of the game given the features and the choice of pitchers.
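
As one possible form of regression, the sketch below fits a logistic regression that maps the rating difference and the starters' rolling-game-score difference to a home-win probability. It uses plain gradient descent in NumPy to keep dependencies light; a library such as scikit-learn would be the more usual choice. The data is synthetic stand-in data, not the MLB dataset, and the helper names and feature choices are assumptions.

```python
import numpy as np


def _sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X: np.ndarray, y: np.ndarray,
                 lr: float = 0.1, steps: int = 2000) -> np.ndarray:
    """Fit logistic-regression weights (intercept first) by batch gradient descent."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = _sigmoid(Xb @ w)                    # predicted home-win probability
        w -= lr * (Xb.T @ (p - y)) / len(y)     # gradient of the mean log loss
    return w


def predict_proba(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Home-win probability for each row of X under weights w."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return _sigmoid(Xb @ w)


# Synthetic stand-in features (standardized), NOT real game data:
#   elo_diff ~ elo1_pre - elo2_pre, rgs_diff ~ pitcher1_rgs - pitcher2_rgs
rng = np.random.default_rng(0)
n_games = 2000
elo_diff = rng.normal(0.0, 1.0, n_games)
rgs_diff = rng.normal(0.0, 1.0, n_games)
true_logit = 0.5 * elo_diff + 1.0 * rgs_diff
home_win = (rng.random(n_games) < _sigmoid(true_logit)).astype(float)

X = np.column_stack([elo_diff, rgs_diff])
weights = fit_logistic(X, home_win)

# Same team matchup, two hypothetical starters: a weaker one and a stronger one.
same_matchup = np.array([[0.2, -1.0], [0.2, 1.0]])
print(predict_proba(same_matchup, weights))
```

Comparing the two predicted probabilities for the same matchup is the kind of "what if we started a different pitcher" question the method is meant to answer.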