MLB Pitch Type Predictor

Problem

A constant problem for professional baseball hitters is that they rarely know what pitch is coming next. Every MLB pitcher throws several types of pitches (fastball, curveball, slider, etc.) that all move and break differently. To succeed at the highest level, hitters must either identify the pitch type in the fraction of a second that the ball is in the air, or simply guess. It is no coincidence that "sign stealing" (deciphering the catcher's signs to learn what pitch will come next) is one of the most common methods of cheating in MLB, since it tells hitters how the next pitch will behave. The 2017 Houston Astros, for instance, used such methods in the season they won the World Series, which illustrates how valuable this information is to hitters.

Solution

While using technology to steal a team's signs is against the rules, there is nothing wrong with predicting what type of pitch will be thrown next. Using past pitch-by-pitch data, we can analyze how each pitcher sequences their pitches (in what order do they throw different types of pitches?). From this analysis, we may observe trends that let us hypothesize what pitch they will throw next in a given situation (count, game state, previous pitch sequence). The goal of this project is to identify, for each pitcher, a relationship between situation and pitch type thrown, allowing us to predict their next pitch in any given circumstance.
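One simple way to look at pitch sequencing is to count how often each pitch type follows another. The sketch below is only an illustration of that idea, not the method used in this project: the function name `transition_counts` and the toy sequence are hypothetical, and the codes FF, SL, and CU stand for four-seam fastball, slider, and curveball.

```python
from collections import Counter, defaultdict


def transition_counts(sequence):
    """Count how often each pitch type is followed by each other pitch type."""
    counts = defaultdict(Counter)
    # Pair every pitch with the pitch thrown immediately after it.
    for prev_pitch, next_pitch in zip(sequence, sequence[1:]):
        counts[prev_pitch][next_pitch] += 1
    return counts


# Hypothetical sequence for one pitcher, oldest pitch first (not real data).
pitches = ["FF", "SL", "FF", "SL", "CU", "FF", "SL"]
counts = transition_counts(pitches)

# Most frequent follow-up to a fastball in this toy sequence.
print(counts["FF"].most_common(1))
```

Counts like these only capture the previous pitch. The count and game state described above would need to be added as further conditioning variables.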

Impact

As mentioned before, a reliable way to predict the pitches a hitter will see in an at-bat would have clear competitive benefits. Reducing the guesswork a hitter has to do at the plate would give them an advantage over their competitors and, over the long run, should result in more favorable outcomes for them and their team.

A potential downside of such a predictor, however, is that teams and players might value it too much. No predictor will work 100% of the time, so there will always be cases where a pitcher throws a different pitch than predicted. If a hitter goes to bat fully expecting a fastball and instead receives a curveball, they stand even less of a chance than if they had simply guessed. A successful predictor would not provide any guarantees, so hitters and teams should take its advice with a grain of salt and not over-rely on it.

Dataset

pybaseball package

Since the mid-2000s, MLB has used a series of tracking technologies (PITCHf/x, Statcast, etc.) to record pitch-by-pitch information about each game, such as pitch type, pitch velocity, and pitch result. Over time, this information has been packaged in a number of formats that make it easy for the public to access and work with.

For our project, we will use a Python package called pybaseball, which provides access to the Statcast data mentioned above, courtesy of BaseballSavant.com.

The Statcast data available through the pybaseball package has 92 features for each pitch in a specified time frame. Some of the most important of these features for our purposes are:

  • pitch_type
  • pitcher
  • balls, strikes (a.k.a. count)
  • on_3b, on_2b, on_1b (where are baserunners?)
  • bat_score, fld_score (what is the score of the game?)
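As a rough sketch of how these columns might be pulled, the snippet below uses pybaseball's `statcast` function to download a date range and keep only the features listed above. The helper name `load_pitches` and the split into `FEATURE_COLUMNS` and `TARGET_COLUMN` are assumptions made for illustration. The download itself needs pybaseball installed and network access, so the import is deferred until the function is called.

```python
# Columns named in the list above. Treating pitch_type as the target and the
# rest as inputs is an assumption made for this sketch.
TARGET_COLUMN = "pitch_type"
FEATURE_COLUMNS = [
    "pitcher",
    "balls", "strikes",
    "on_3b", "on_2b", "on_1b",
    "bat_score", "fld_score",
]


def load_pitches(start_dt, end_dt):
    """Download Statcast pitches for a date range and keep the columns of interest.

    Dates are strings in YYYY-MM-DD format.
    """
    # Imported here so the module can be loaded without pybaseball or a network.
    from pybaseball import statcast

    df = statcast(start_dt=start_dt, end_dt=end_dt)
    return df[[TARGET_COLUMN] + FEATURE_COLUMNS]
```

For example, `load_pitches("2022-10-01", "2022-10-01")` would request the pitches thrown on October 1, 2022.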

In a tabular format, the data looks as follows.

| pitch_type | game_date | release_speed | release_pos_x | release_pos_z | player_name | batter | pitcher | events | description | ... | fld_score | post_away_score | post_home_score | post_bat_score | post_fld_score | if_fielding_alignment | of_fielding_alignment | spin_axis | delta_home_win_exp | delta_run_exp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SL | 10/1/22 | 80 | -1.78 | 5.67 | Tepera, Ryan | 608369 | 572193 | field_out | hit_into_play | ... | 3 | 2 | 3 | 2 | 3 | Infield shift | Standard | 74 | 0.035 | -0.114 |
| FF | 10/1/22 | 93.1 | -1.6 | 5.63 | Tepera, Ryan | 608369 | 572193 | NaN | ball | ... | 3 | 2 | 3 | 2 | 3 | Infield shift | Standard | 212 | 0 | 0.015 |
| SL | 10/1/22 | 84.5 | -1.59 | 5.78 | Tepera, Ryan | 543760 | 572193 | strikeout | foul_tip | ... | 3 | 2 | 3 | 2 | 3 | Infield shift | Standard | 146 | 0.052 | -0.112 |
| SL | 10/1/22 | 85.2 | -1.66 | 5.75 | Tepera, Ryan | 543760 | 572193 | NaN | swinging_strike | ... | 3 | 2 | 3 | 2 | 3 | Infield shift | Standard | 126 | 0 | -0.036 |
| SI | 10/1/22 | 93.9 | -1.64 | 5.69 | Tepera, Ryan | 543760 | 572193 | NaN | ball | ... | 3 | 2 | 3 | 2 | 3 | Infield shift | Standard | 219 | 0 | 0.016 |

The features provided generally fall into one of two categories: pre-pitch information and post-pitch information. Pre-pitch information is known before the pitch is thrown (count, baserunners, etc.), while post-pitch information is only known after the pitch (pitch type, pitch velocity, pitch result, etc.). This project aims to use all the relevant pre-pitch features to estimate what pitch a given pitcher will choose in a given situation.
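To make the pre-pitch/post-pitch split concrete, the sketch below builds a feature dictionary from pre-pitch columns only and keeps `pitch_type` aside as the label. The function names, the derived `score_diff` field, and the example row are hypothetical. The sketch also assumes that `on_1b`, `on_2b`, and `on_3b` hold a runner ID when the base is occupied and are missing (None or NaN) otherwise.

```python
import math


def is_missing(value):
    """True for None or a float NaN, the two ways an empty cell is assumed to appear."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def pre_pitch_features(row):
    """Build a feature dict using only information known before the pitch is thrown."""
    return {
        "pitcher": row["pitcher"],
        "balls": row["balls"],
        "strikes": row["strikes"],
        # Assumption: a base column is missing when the base is empty.
        "runner_on_1b": not is_missing(row["on_1b"]),
        "runner_on_2b": not is_missing(row["on_2b"]),
        "runner_on_3b": not is_missing(row["on_3b"]),
        # Positive when the batting team is ahead.
        "score_diff": row["bat_score"] - row["fld_score"],
    }


# Hypothetical row (not taken from the dataset).
row = {
    "pitch_type": "SL", "release_speed": 84.5,   # post-pitch: never used as a feature
    "pitcher": 572193, "balls": 1, "strikes": 2,
    "on_1b": float("nan"), "on_2b": 123456, "on_3b": None,
    "bat_score": 2, "fld_score": 3,
}

features = pre_pitch_features(row)
label = row["pitch_type"]  # the value the model is asked to predict
print(features, label)
```

Keeping post-pitch columns such as `release_speed` out of the feature dictionary prevents the model from seeing information it would not have at prediction time.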

Potential Problems

While the available Statcast database is very thorough, one potential issue is that it says little about who the hitter is. A pitcher may change how they attack a hitter depending on whether that hitter is considered "good" or "bad," or is more of a fly-ball hitter than a ground-ball hitter. The dataset identifies each batter only by an ID, so it is blind to these differences between hitters and effectively treats every hitter equally.

While this issue may make the resulting predictor more general than one tailored to each hitter, we do not expect hitter quality to be a major factor in what pitch a pitcher chooses, so it should not invalidate the results. Future versions of this predictor could incorporate hitter data as a refinement.

Method

This project is framed as a classification problem. Given several features of a situation (count, score, previous pitches, etc.), the model should return a guess as to what type of pitch the pitcher will throw in that situation. This framing makes the most sense because the output, a single predicted pitch type, is easy for teams and hitters alike to interpret.
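To illustrate what the classification framing looks like in practice, here is a minimal baseline sketch: for each pitcher and count, predict the pitch type that pitcher has thrown most often in that count. This is not the model used in this project, only a simple reference point that any trained classifier should beat. The class name `CountBaseline` and the training rows are hypothetical.

```python
from collections import Counter, defaultdict


class CountBaseline:
    """Predict a pitcher's most frequent pitch type for a given (balls, strikes) count."""

    def __init__(self):
        self.by_situation = defaultdict(Counter)  # (pitcher, balls, strikes) -> pitch counts
        self.by_pitcher = defaultdict(Counter)    # pitcher -> pitch counts (fallback)

    def fit(self, rows):
        for row in rows:
            key = (row["pitcher"], row["balls"], row["strikes"])
            self.by_situation[key][row["pitch_type"]] += 1
            self.by_pitcher[row["pitcher"]][row["pitch_type"]] += 1
        return self

    def predict(self, pitcher, balls, strikes):
        seen = self.by_situation.get((pitcher, balls, strikes))
        if not seen:
            # Count never observed for this pitcher: fall back to their overall mix.
            seen = self.by_pitcher.get(pitcher)
        if not seen:
            return None  # pitcher never observed
        return seen.most_common(1)[0][0]


# Hypothetical training rows for a made-up pitcher ID (not real data).
rows = [
    {"pitcher": 1, "balls": 0, "strikes": 0, "pitch_type": "FF"},
    {"pitcher": 1, "balls": 0, "strikes": 0, "pitch_type": "FF"},
    {"pitcher": 1, "balls": 0, "strikes": 0, "pitch_type": "SL"},
    {"pitcher": 1, "balls": 0, "strikes": 2, "pitch_type": "SL"},
    {"pitcher": 1, "balls": 0, "strikes": 2, "pitch_type": "SL"},
]

model = CountBaseline().fit(rows)
first_pitch = model.predict(1, 0, 0)     # count seen in training
two_strikes = model.predict(1, 0, 2)     # count seen in training
unseen_count = model.predict(1, 3, 1)    # falls back to the pitcher's overall mix
unknown_pitcher = model.predict(2, 0, 0)
print(first_pitch, two_strikes, unseen_count, unknown_pitcher)
```

A fuller classifier would take more of the situation into account (baserunners, score, previous pitches), but the input and output have the same shape: situation features in, one pitch type out.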