A constant problem for professional baseball hitters is that they rarely know what pitch is coming next. Every MLB pitcher throws several different types of pitches (fastball, curveball, slider, etc.) that all move and break differently. To succeed at the highest levels, hitters must either determine what pitch type is being thrown in the fraction of a second while the ball is in the air, or simply guess. It's no coincidence that "sign stealing" (deciphering the catcher's signs to determine what pitch will come next) is one of the most common methods of cheating in MLB, since it tells hitters how the next pitch will behave. Teams that have used such methods (the 2017 Houston Astros, for instance) gained an edge over teams without that information, showing how important it is for hitters to have.
While explicitly stealing a team's signs is against the rules, there's nothing wrong with predicting what type of pitch will be thrown next. Using past pitch-by-pitch data, we can analyze how each pitcher pitch-sequences (in what order do they throw different types of pitches?). From this analysis, we might start to observe certain trends that would allow us to hypothesize what pitch they might throw next, given a certain situation (count, game state, previous pitch sequence). The goal of this project is to identify for each pitcher a relationship between situation and pitch type thrown, thus allowing us to predict what their next pitch might be in any given circumstance.
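As a minimal sketch of what "analyzing how a pitcher pitch-sequences" could look like, the snippet below counts how often each pitch type follows another within an at-bat and turns those counts into empirical frequencies. The helper names (`transition_counts`, `transition_probs`) and the sample at-bats are hypothetical and made up for illustration; this is one simple way to look at sequencing, not necessarily the analysis used later in the project.

```python
from collections import Counter, defaultdict


def transition_counts(at_bats):
    """Count how often each pitch type is followed by each other pitch type.

    `at_bats` is a list of at-bats, each a list of pitch-type codes in the
    order they were thrown. Transitions never cross at-bat boundaries.
    """
    counts = defaultdict(Counter)
    for pitches in at_bats:
        for prev_pitch, next_pitch in zip(pitches, pitches[1:]):
            counts[prev_pitch][next_pitch] += 1
    return counts


def transition_probs(counts):
    """Convert raw transition counts into empirical frequencies."""
    probs = {}
    for prev_pitch, followers in counts.items():
        total = sum(followers.values())
        probs[prev_pitch] = {p: n / total for p, n in followers.items()}
    return probs


# Made-up at-bats for illustration only
# (FF = four-seam fastball, SL = slider, SI = sinker).
sample_at_bats = [
    ["FF", "SL", "SL"],
    ["SI", "SL", "FF"],
    ["FF", "SL"],
]

counts = transition_counts(sample_at_bats)
probs = transition_probs(counts)
print(probs["SL"])  # how often each pitch type followed a slider
```

With real data, the same counts could be broken down further by count or game state to see whether a pitcher's tendencies shift with the situation.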
As mentioned before, a reliable way to predict what types of pitches a hitter might see in an at-bat would have significant competitive benefits. Cutting down on how much guesswork a hitter has to do at the plate would give them an edge over their competitors and, over the long run, should result in more favorable outcomes for them and their team.
A potential downside of such a predictor, however, is that teams and players might value it too much. No predictor will work 100% of the time, so there will always be cases where a pitcher throws a different pitch than the one predicted. If a hitter goes to bat fully expecting a fastball and instead receives a curveball, they stand even less of a chance than if they had gone up there guessing. A successful predictor would not provide any guarantees, so hitters and teams should take its advice with a grain of salt and not over-rely on its input.
pybaseball package
Ever since the mid-2000s, MLB has been using a collection of technologies (PitchFX, Statcast, etc.) to gather countless pieces of pitch-by-pitch information about each game, such as pitch type, pitch velocity, and pitch result. Over time, this information has been packaged in a number of different formats, making it easy for the public to access and interact with this data.
For our project, we will be using a package called pybaseball, which provides access to the Statcast data mentioned above, courtesy of BaseballSavant.com.
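A minimal sketch of pulling one pitcher's pitch-by-pitch data follows. It assumes that pybaseball is installed, that network access is available, and that `statcast_pitcher(start_dt, end_dt, player_id)` behaves as described in the pybaseball documentation (returning a pandas DataFrame with one row per pitch). The wrapper name `load_pitcher_pitches` is hypothetical; the pitcher ID and date in the commented usage are taken from the sample table in this section.

```python
from datetime import date


def load_pitcher_pitches(pitcher_id, start_dt, end_dt):
    """Fetch Statcast pitch-by-pitch rows for one pitcher.

    Dates are 'YYYY-MM-DD' strings. They are validated locally first so
    that obvious mistakes fail before any network request is made.
    """
    start = date.fromisoformat(start_dt)  # raises ValueError if malformed
    end = date.fromisoformat(end_dt)
    if start > end:
        raise ValueError("start_dt must not be after end_dt")

    # Third-party dependency, imported lazily so the validation above
    # works even where pybaseball is not installed.
    from pybaseball import statcast_pitcher

    return statcast_pitcher(start_dt, end_dt, pitcher_id)


# Example usage (requires pybaseball and network access):
# pitches = load_pitcher_pitches(572193, "2022-10-01", "2022-10-01")
# print(pitches[["pitch_type", "release_speed", "description"]].head())
```

Restricting the query to a single pitcher and a bounded date range keeps the download small, which fits the goal of modeling each pitcher separately.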
The Statcast data available through the pybaseball package consists of 92 different features for each pitch in a specified time frame. Some (though not all) of the most important of these features for our purposes are:
In a tabular format, the data looks as follows.
pitch_type | game_date | release_speed | release_pos_x | release_pos_z | player_name | batter | pitcher | events | description | ... | fld_score | post_away_score | post_home_score | post_bat_score | post_fld_score | if_fielding_alignment | of_fielding_alignment | spin_axis | delta_home_win_exp | delta_run_exp |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SL | 10/1/22 | 80 | -1.78 | 5.67 | Tepera, Ryan | 608369 | 572193 | field_out | hit_into_play | ... | 3 | 2 | 3 | 2 | 3 | Infield shift | Standard | 74 | 0.035 | -0.114 |
FF | 10/1/22 | 93.1 | -1.6 | 5.63 | Tepera, Ryan | 608369 | 572193 | NaN | ball | ... | 3 | 2 | 3 | 2 | 3 | Infield shift | Standard | 212 | 0 | 0.015 |
SL | 10/1/22 | 84.5 | -1.59 | 5.78 | Tepera, Ryan | 543760 | 572193 | strikeout | foul_tip | ... | 3 | 2 | 3 | 2 | 3 | Infield shift | Standard | 146 | 0.052 | -0.112 |
SL | 10/1/22 | 85.2 | -1.66 | 5.75 | Tepera, Ryan | 543760 | 572193 | NaN | swinging_strike | ... | 3 | 2 | 3 | 2 | 3 | Infield shift | Standard | 126 | 0 | -0.036 |
SI | 10/1/22 | 93.9 | -1.64 | 5.69 | Tepera, Ryan | 543760 | 572193 | NaN | ball | ... | 3 | 2 | 3 | 2 | 3 | Infield shift | Standard | 219 | 0 | 0.016 |
The features provided generally fall into one of two categories: pre-pitch information and post-pitch information. Pre-pitch information is known before the pitch is thrown (count, baserunners, etc.), while post-pitch information is only available after the pitch (pitch type, pitch velocity, pitch result, etc.). This project aims to use all the relevant pre-pitch features to estimate what pitch a certain pitcher may choose given a certain situation.
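The split between pre-pitch and post-pitch information can be sketched as follows: keep only pre-pitch columns as model inputs, use `pitch_type` as the label, and derive a "previous pitch" feature from earlier pitches in the same at-bat. The column names `balls`, `strikes`, `outs_when_up`, `inning`, `game_pk`, `at_bat_number`, and `pitch_number` are assumed to match the Statcast columns and should be checked against the actual DataFrame; `build_examples`, the `"NONE"` placeholder, and the sample rows are hypothetical. Rows are sorted explicitly because the sample table appears to list pitches newest first.

```python
# Columns assumed to be known before the pitch is thrown.
PRE_PITCH = ["balls", "strikes", "outs_when_up", "inning"]
TARGET = "pitch_type"  # post-pitch value we want to predict


def build_examples(rows):
    """Turn pitch rows (dicts) into (features, label) pairs.

    Only pre-pitch columns are copied into the features, so post-pitch
    values such as release_speed cannot leak into the model inputs.
    """
    ordered = sorted(
        rows,
        key=lambda r: (r["game_pk"], r["at_bat_number"], r["pitch_number"]),
    )
    examples = []
    prev_at_bat, prev_pitch = None, "NONE"
    for row in ordered:
        at_bat = (row["game_pk"], row["at_bat_number"])
        if at_bat != prev_at_bat:
            prev_pitch = "NONE"  # first pitch of a new at-bat
        features = {name: row[name] for name in PRE_PITCH}
        features["prev_pitch"] = prev_pitch
        examples.append((features, row[TARGET]))
        prev_at_bat, prev_pitch = at_bat, row[TARGET]
    return examples


# Made-up rows for illustration, deliberately out of order.
rows = [
    {"game_pk": 1, "at_bat_number": 5, "pitch_number": 2, "balls": 1,
     "strikes": 0, "outs_when_up": 1, "inning": 8,
     "pitch_type": "SL", "release_speed": 85.2},
    {"game_pk": 1, "at_bat_number": 5, "pitch_number": 1, "balls": 0,
     "strikes": 0, "outs_when_up": 1, "inning": 8,
     "pitch_type": "SI", "release_speed": 93.9},
    {"game_pk": 1, "at_bat_number": 6, "pitch_number": 1, "balls": 0,
     "strikes": 0, "outs_when_up": 2, "inning": 8,
     "pitch_type": "FF", "release_speed": 93.1},
]

examples = build_examples(rows)
for features, label in examples:
    print(features, "->", label)
```

Note that the previous pitch type is a legitimate pre-pitch feature for the current pitch even though it was post-pitch information for the pitch before it.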
While the available Statcast database is very thorough, one potential issue with the data is that it treats each hitter equally. A pitcher may change how they attack a hitter depending on whether that hitter is considered "good" or "bad," or is more of a fly-ball hitter than a ground-ball hitter. This dataset, however, is blind to these inter-hitter differences.
While this issue may cause the resulting product to be more generalized than specifically tailored to each hitter, we do not anticipate hitter quality to be an especially important component of what pitch a pitcher may choose, and therefore this issue should not invalidate the results. Future versions of this predictor could incorporate hitter-specific data as an improvement.
This project is framed as a classification problem. Given several qualities of a situation (count, score, previous pitches, etc.), we want a prediction of what type of pitch that pitcher would throw in such a situation. This framing makes the most sense because the results can be easily interpreted and understood by teams and hitters alike.
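To make the classification framing concrete, here is a deliberately simple baseline, not the model this project ultimately uses: for each ball-strike count, predict the pitch type that the pitcher has thrown most often in that count, falling back to the overall most common pitch for counts not seen in training. The class name `CountBaseline` and the training data are hypothetical. A baseline like this gives any more sophisticated classifier a reference accuracy to beat.

```python
from collections import Counter, defaultdict


class CountBaseline:
    """Predict the most frequent pitch type for a given (balls, strikes)."""

    def __init__(self):
        self.by_count = defaultdict(Counter)
        self.overall = Counter()

    def fit(self, situations, pitch_types):
        """situations: iterable of (balls, strikes); pitch_types: labels."""
        for (balls, strikes), pitch in zip(situations, pitch_types):
            self.by_count[(balls, strikes)][pitch] += 1
            self.overall[pitch] += 1
        return self

    def _source(self, balls, strikes):
        if not self.overall:
            raise ValueError("fit() must be called with data first")
        seen = self.by_count.get((balls, strikes))
        # Fall back to the pitcher's overall mix for unseen counts.
        return seen if seen else self.overall

    def predict(self, balls, strikes):
        """Return the single most likely pitch type."""
        return self._source(balls, strikes).most_common(1)[0][0]

    def predict_proba(self, balls, strikes):
        """Return empirical frequencies instead of a single guess."""
        source = self._source(balls, strikes)
        total = sum(source.values())
        return {pitch: n / total for pitch, n in source.items()}


# Made-up training data for illustration only.
situations = [(0, 0), (0, 0), (0, 0), (0, 2), (0, 2), (3, 1), (1, 0)]
pitch_types = ["FF", "FF", "SL", "SL", "SL", "FF", "FF"]

model = CountBaseline().fit(situations, pitch_types)
print(model.predict(0, 2))        # most common pitch in an 0-2 count
print(model.predict_proba(0, 0))  # pitch mix on the first pitch
```

Reporting frequencies through `predict_proba` rather than a single label also speaks to the over-reliance concern raised earlier: a hitter can see how confident the prediction is instead of treating it as a guarantee.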