# The English Premier League (EPL)
The English Premier League (EPL) is one of the most popular and highly regarded soccer leagues in the world. Based in England, it has 20 teams competing each season for the title and, as in any competitive league, it is very difficult to guess who the winner will be. Top players from all over the world also gather in the EPL, which makes guessing the season's best player interesting as well. The overall goal of this project is to build a model that can predict the winner of the 2022-23 season, but I am also open to ideas such as predicting the outcome of every game or the full table at the end of the season.
https://www.premierleague.com/premier-league-explained
https://worldsoccertalk.com/beginners-guide-premier-league/
import pandas as pd
df = pd.read_csv('matches.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1389 entries, 0 to 1388
Data columns (total 28 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Unnamed: 0    1389 non-null   int64
 1   date          1389 non-null   object
 2   time          1389 non-null   object
 3   comp          1389 non-null   object
 4   round         1389 non-null   object
 5   day           1389 non-null   object
 6   venue         1389 non-null   object
 7   result        1389 non-null   object
 8   gf            1389 non-null   float64
 9   ga            1389 non-null   float64
 10  opponent      1389 non-null   object
 11  xg            1389 non-null   float64
 12  xga           1389 non-null   float64
 13  poss          1389 non-null   float64
 14  attendance    693 non-null    float64
 15  captain       1389 non-null   object
 16  formation     1389 non-null   object
 17  referee       1389 non-null   object
 18  match report  1389 non-null   object
 19  notes         0 non-null      float64
 20  sh            1389 non-null   float64
 21  sot           1389 non-null   float64
 22  dist          1388 non-null   float64
 23  fk            1389 non-null   float64
 24  pk            1389 non-null   float64
 25  pkatt         1389 non-null   float64
 26  season        1389 non-null   int64
 27  team          1389 non-null   object
dtypes: float64(13), int64(2), object(13)
memory usage: 304.0+ KB
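The summary above suggests a few cleanup steps before modelling: `Unnamed: 0` looks like a saved index, `notes` has no non-null values, and `date` is read as plain text. Below is a minimal sketch of that cleanup. The helper name `clean_matches` and the tiny `toy` frame are made up for illustration; they are not part of the dataset.

```python
import pandas as pd


def clean_matches(df: pd.DataFrame) -> pd.DataFrame:
    """Return a tidied copy of the match table (hypothetical helper)."""
    out = df.copy()
    # "Unnamed: 0" looks like a saved index and "notes" has no non-null values.
    out = out.drop(columns=["Unnamed: 0", "notes"], errors="ignore")
    # "date" is read as text (object dtype); parse it so it sorts chronologically.
    out["date"] = pd.to_datetime(out["date"])
    # These counting columns have no missing values in df.info(), so an int cast is safe.
    for col in ["gf", "ga", "sh", "sot", "pk", "pkatt"]:
        if col in out.columns:
            out[col] = out[col].astype(int)
    return out.sort_values(["team", "date"]).reset_index(drop=True)


# Made-up rows with a few of the columns listed above, for illustration only.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "date": ["2021-08-21", "2021-08-15"],
    "gf": [5.0, 1.0],
    "ga": [0.0, 0.0],
    "notes": [float("nan"), float("nan")],
    "team": ["Team A", "Team A"],
})
cleaned = clean_matches(toy)
print(cleaned.dtypes)
```

On the real data this would be called as `df = clean_matches(df)`. The `attendance` column is left untouched here because roughly half of its values are missing and how to handle them is still an open choice.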
Squad: Squad's name
team: Name of the team
date: Date of the match
time: Time of the match
W: Wins
D: Draws
L: Losses
GF: Goals for
GA: Goals against
GD: Goal difference
result: Result of the match
venue: Home or away
xG: Expected goals
xGA: Expected goals allowed
Attendance: Attendance per game during this season, only for home matches
captain: Captain
formation: Formation
referee: Referee
match report
sh: Shots taken
sot: Shots on target
dist: Average shot distance
fk: Shots from free kicks
pk: Number of penalty kicks made
pkatt: Number of penalty kicks attempted
season: Season
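Several of the fields above (points, GD) can be derived from the per-match columns rather than stored. The sketch below assumes `result` holds "W", "D" or "L" from the perspective of `team`, which has not been checked against the data here. It uses the league's scoring of 3 points for a win and 1 for a draw, with ties ordered by goal difference and then goals scored. The function name and the `toy` rows are made up for illustration.

```python
import pandas as pd

# Assumption: `result` is "W", "D" or "L" from the point of view of `team`.
POINTS = {"W": 3, "D": 1, "L": 0}


def add_points_and_gd(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-match points and goal difference (hypothetical helper)."""
    out = df.copy()
    out["points"] = out["result"].map(POINTS)
    out["gd"] = out["gf"] - out["ga"]
    return out


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "team": ["Team A", "Team A", "Team B", "Team B"],
    "season": [2022, 2022, 2022, 2022],
    "result": ["W", "D", "L", "D"],
    "gf": [2.0, 1.0, 0.0, 1.0],
    "ga": [0.0, 1.0, 2.0, 1.0],
})

# Sum per team and season, then order like a league table.
table = (
    add_points_and_gd(toy)
    .groupby(["season", "team"])[["points", "gd", "gf"]]
    .sum()
    .reset_index()
    .sort_values(["season", "points", "gd", "gf"],
                 ascending=[True, False, False, False])
)
print(table)
```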
I've brought in a dataset from Kaggle. `df` contains match records from 2020 to 2022. More datasets can easily be found online.
The initial plan is to learn from previous seasons how a team's final position relates to its mid-season performance. We can ask questions such as: 'Did teams with a high GD continue their good performance?', 'Does the team leading at mid-season usually carry it all the way?', and 'What formation works best for GF?'
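One way to start on the mid-season question is to build the table twice, once from each team's first 19 matches (half of a 38-match season) and once from all matches, and compare the ranks. This is only a sketch under assumptions: `result` is taken to be "W"/"D"/"L" per team, `date` is assumed to sort chronologically (ISO strings or parsed dates), and the helper names and `toy` rows are made up for illustration.

```python
import pandas as pd

POINTS = {"W": 3, "D": 1, "L": 0}  # assumed encoding of `result`


def standings(df, max_games=None):
    """Rank teams within each season using at most `max_games` matches per team."""
    d = df.sort_values("date").copy()
    d["points"] = d["result"].map(POINTS)
    d["gd"] = d["gf"] - d["ga"]
    if max_games is not None:
        # Keep only each team's first `max_games` matches of the season.
        d = d[d.groupby(["season", "team"]).cumcount() < max_games]
    table = d.groupby(["season", "team"], as_index=False)[["points", "gd"]].sum()
    table = table.sort_values(["season", "points", "gd"],
                              ascending=[True, False, False])
    # Rank 1 is the leader; ties on points are ordered by goal difference.
    table["rank"] = table.groupby("season").cumcount() + 1
    return table


def compare_mid_to_final(df, mid_games=19):
    """Put mid-season and final standings side by side for each team."""
    mid = standings(df, max_games=mid_games)
    final = standings(df)
    return mid.merge(final, on=["season", "team"], suffixes=("_mid", "_final"))


# Made-up two-team season of four matches, for illustration only.
toy = pd.DataFrame({
    "date": ["2021-08-14", "2021-08-14", "2021-09-11", "2021-09-11",
             "2022-01-15", "2022-01-15", "2022-04-02", "2022-04-02"],
    "team": ["Team A", "Team B"] * 4,
    "season": [2022] * 8,
    "result": ["W", "L", "W", "L", "L", "W", "L", "W"],
    "gf": [2, 0, 1, 0, 0, 3, 0, 2],
    "ga": [0, 2, 0, 1, 3, 0, 2, 0],
})
comparison = compare_mid_to_final(toy, mid_games=2)
print(comparison[["team", "rank_mid", "rank_final"]])
```

In the toy season the mid-season leader does not finish first, which is exactly the kind of case the question is about. On the real data, the share of seasons where `rank_mid == 1` and `rank_final == 1` coincide would be a first rough answer.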