MLB ELO Prediction¶

Every season, the MLB hosts an action packed series of games, bringing players up to the plate to try and rival the pitcher at the mound. A pitcher has a crucial role in the game and can drastically alter the outcome. Bleacher Report mentions that since the starting pitcher has the most time with the ball, they have a role to play in the success of the team. Moreover, The Complete Pitcher explains that the pitcher controls the pace of the game and sets the tone for future innings.

Using ELO data, which is a system of rating teams, along with their starting pitchers, we'd like to see how numerically important a pitcher is to the game. If teams with high ELOs lose, it could be helpful to determine if it is a cause of their starting lineup or pitcher. On the adverse, if a team with a lower ELO than the away team pulls out a win, it would be helpful to see if the pitcher carried their win and set a positive tone.

This will allow teams to pick pitchers for their starting lineup based on features and ratings, and pay more attention to their ELO, or whether they can test out a new pitcher in a certain game and see what could have the best outcome.

Dataset¶

I will be using a dataset from GitHub dataset of MLB Elo Data from FiveThirtyEight. The dataset contains information for every season dating back to 1871, which means that there is an expansive set of data.

Column	Definition
season	Year of season (more recent games at the top)
playoff	Whether it was a playoff game
team1	Home team
team2	Away team
elo1_pre	Home team's Elo rating before the game
elo2_pre	Away team's Elo rating before the game
elo_prob1	Home team's probability of winning according to Elo ratings
elo_prob2	Away team's probability of winning according to Elo ratings
elo1_post	Home team's Elo rating after the game
elo2_post	Away team's Elo rating after the game
pitcher1	Name of home starting pitcher
pitcher2	Name of away starting pitcher
pitcher1_rgs	Home starting pitcher's rolling game score before the game
pitcher2_rgs	Away starting pitcher's rolling game score before the game
score1	Home team's score
score2	Away team's score

Because the project goal is a little unclear, we do not intend on using every part. The post-game ELO ratings may not be necessary since we're mainly looking at the outcome based on the pitcher, but we'd want to keep it as a potential option and it could be useful.

In [4]:

import pandas as pd

mlb_info = pd.read_csv("mlb_elo.csv")

# import some of the rows
mlb_info[:6]

Out[4]:

	date	season	playoff	team1	team2	elo1_pre	elo2_pre	elo_prob1	elo_prob2	...	pitcher1_rgs	pitcher2_rgs	pitcher1_adj	pitcher2_adj	rating_prob1	rating_prob2	rating1_post	rating2_post	score1	score2
0	2022-11-05	2022	w	HOU	PHI	1600.129119	1546.433821	0.644817	0.355183	...	58.740666	61.632981	5.365908	32.200438	0.576373	0.423627	1591.472412	1544.450289	4	1
1	2022-11-03	2022	w	PHI	HOU	1548.259208	1598.303732	0.450191	0.549809	...	51.397915	62.448418	-16.017602	22.722308	0.395291	0.604709	1546.485993	1589.436708	2	3
2	2022-11-02	2022	w	PHI	HOU	1552.321240	1594.241701	0.465668	0.534332	...	59.422109	60.192341	21.461511	13.320248	0.496353	0.503647	1547.636916	1588.285786	0	5
3	2022-11-01	2022	w	PHI	HOU	1546.100348	1600.462592	0.442003	0.557997	...	53.260588	57.130727	-6.800360	-2.837861	0.455888	0.544112	1550.940292	1584.982409	7	0
4	2022-10-29	2022	w	HOU	PHI	1598.279125	1548.283816	0.638287	0.361713	...	58.140886	62.863435	2.498670	37.551911	0.552880	0.447120	1589.504732	1546.417970	5	2
5	2022-10-28	2022	w	HOU	PHI	1601.186024	1545.376917	0.648524	0.351476	...	63.309735	60.702676	26.245071	26.457182	0.625046	0.374954	1587.319065	1548.603636	5	6

6 rows × 26 columns

Potential Problems¶

As a discrepency, there was a change in the system of ratings they used, so I will be using the data that has a newer ELO rating. This may limit the potential data that we can use, but the original rating system was confusing and unclear. Additionally, because of the expansive set of data, pitcher's skills have morphed and there have been changes in skill levels, but it should not impact too much and would still allow the team to have sufficient information for an algorithm

Method¶

We can use a machine learning algorithm to predict the outcomes of the game if they put different pitchers in their lineup and use their rolling game score to see what their potential gameplay could be. This could be used for some form of regression, estimating to seek the outcome of the game given the features and different pitchers.