In [1]:
# The English Premier League (EPL)

Describes and motivates a real-world problem where data science may provide helpful insights. Your description should be easily understood by a casual reader and include citations to motivating sources or relevant information (e.g. news articles, further reading links … Wikipedia makes for a poor reference but the links it cites are usually promising).

The English Premier League (EPL) is one of the most popular and highly regarded soccer leagues in the world. Based in England, it has 20 teams competing every season for the title, and as in any strong league, it is always very difficult to guess who the winner will be. Top players from all over the world also gather in the EPL, which makes guessing the best player very interesting. The overall goal of this project is to build a model that can successfully predict the winner of the 2022-23 season, but I am also open to related ideas, such as predicting the outcome of every game or the whole table at the end of the season.

https://www.premierleague.com/premier-league-explained
https://worldsoccertalk.com/beginners-guide-premier-league/

Explicitly load and show your dataset. Provide a data dictionary which explains the meaning of each feature present. Demonstrate that this data is sufficient to make progress on your real-world problem described above.

In [3]:
import pandas as pd
df = pd.read_csv('matches.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1389 entries, 0 to 1388
Data columns (total 28 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    1389 non-null   int64  
 1   date          1389 non-null   object 
 2   time          1389 non-null   object 
 3   comp          1389 non-null   object 
 4   round         1389 non-null   object 
 5   day           1389 non-null   object 
 6   venue         1389 non-null   object 
 7   result        1389 non-null   object 
 8   gf            1389 non-null   float64
 9   ga            1389 non-null   float64
 10  opponent      1389 non-null   object 
 11  xg            1389 non-null   float64
 12  xga           1389 non-null   float64
 13  poss          1389 non-null   float64
 14  attendance    693 non-null    float64
 15  captain       1389 non-null   object 
 16  formation     1389 non-null   object 
 17  referee       1389 non-null   object 
 18  match report  1389 non-null   object 
 19  notes         0 non-null      float64
 20  sh            1389 non-null   float64
 21  sot           1389 non-null   float64
 22  dist          1388 non-null   float64
 23  fk            1389 non-null   float64
 24  pk            1389 non-null   float64
 25  pkatt         1389 non-null   float64
 26  season        1389 non-null   int64  
 27  team          1389 non-null   object 
dtypes: float64(13), int64(2), object(13)
memory usage: 304.0+ KB
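
The `info()` output above shows that `notes` has no non-null values, that `Unnamed: 0` looks like a leftover row index, and that `date` is stored as text. A minimal cleaning sketch follows, assuming those two columns carry no information and that the date strings parse with `pd.to_datetime`. The function name `clean_matches` and the toy rows are hypothetical and only illustrate the idea; they are not values from `matches.csv`.

```python
import pandas as pd


def clean_matches(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that carry no information and parse the match date."""
    # errors="ignore" keeps this safe if a column is already absent
    df = raw.drop(columns=["Unnamed: 0", "notes"], errors="ignore")
    # Assumption: the date strings are in a format pandas can parse
    return df.assign(date=pd.to_datetime(df["date"]))


# Toy rows (made up) that mimic the shape of the real file
toy_raw = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "date": ["2021-08-14", "2021-08-21"],
        "notes": [float("nan"), float("nan")],
        "gf": [1.0, 2.0],
    }
)
cleaned = clean_matches(toy_raw)
print(cleaned.dtypes)
```

On the real data the same call would be `clean_matches(df)`.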

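Unnamed: 0 : row index left over from the CSV export
date : date of the match
time : time of the match
comp : competition
round : round of the competition
day : day of the week
venue : Home or Away
result : result of the match
gf : goals for
ga : goals against
opponent : name of the opposing team
xg : expected goals
xga : expected goals allowed
poss : possession
attendance : attendance at the match, only for home matches (693 of 1389 rows are non-null)
captain : captain
formation : formation
referee : referee
match report : match report
notes : empty in this dataset (0 non-null values)
sh : shots taken
sot : shots on target
dist : average distance of shots from goal
fk : shots from free kicks
pk : number of penalty kicks made
pkatt : number of penalty kicks attempted
season : season
team : name of the team (squad)

W, D, L and GD (wins, draws, losses and goal difference) are not columns in this file; they can be computed per team from result, gf and ga.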
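
The dictionary mentions W, D, L and GD, which do not appear among the 28 columns printed by `df.info()`. One way to obtain them is to aggregate the per-match rows into a season table. The sketch below assumes that `result` is coded as the strings `"W"`, `"D"` and `"L"` and that each row describes one match from one team's point of view; neither is confirmed by the output above. `league_table`, `POINTS` and the toy rows are hypothetical names and made-up values.

```python
import pandas as pd

# Assumption: `result` holds "W", "D" or "L"; points follow the usual 3/1/0 rule
POINTS = {"W": 3, "D": 1, "L": 0}


def league_table(matches: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-team match rows into one row per (season, team)."""
    m = matches.assign(
        W=(matches["result"] == "W").astype(int),
        D=(matches["result"] == "D").astype(int),
        L=(matches["result"] == "L").astype(int),
        points=matches["result"].map(POINTS),
    )
    table = m.groupby(["season", "team"], as_index=False).agg(
        W=("W", "sum"),
        D=("D", "sum"),
        L=("L", "sum"),
        GF=("gf", "sum"),
        GA=("ga", "sum"),
        points=("points", "sum"),
    )
    table["GD"] = table["GF"] - table["GA"]
    # Order by points, then goal difference, then goals scored
    return table.sort_values(
        ["season", "points", "GD", "GF"],
        ascending=[True, False, False, False],
    ).reset_index(drop=True)


# Toy rows (made up): two teams playing each other twice
toy = pd.DataFrame(
    {
        "season": [2022, 2022, 2022, 2022],
        "team": ["Team A", "Team B", "Team A", "Team B"],
        "result": ["W", "L", "D", "D"],
        "gf": [2.0, 0.0, 1.0, 1.0],
        "ga": [0.0, 2.0, 1.0, 1.0],
    }
)
table = league_table(toy)
print(table)
```

A table built this way from `df` gives the season winner directly as the first row of each season, which is the quantity the project wants to predict.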

I took this dataset from Kaggle. df holds match records from 2020 to 2022. More datasets of this kind can easily be found online.

Write one or two sentences about how the data will be used to solve the problem

The initial plan is to learn from previous seasons' data how teams' final positions relate to their mid-season performance. We can ask questions such as: 'Did teams with a high GD continue their good performance?', 'Does the team leading at mid-season usually carry it all the way?', and 'What formation works best for GF?'
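
The question about mid-season leaders can be checked by ranking teams on points after a cutoff number of games and comparing that with their rank at the end of the season. The sketch below makes several assumptions: `result` is coded as `"W"`, `"D"`, `"L"`; `date` parses with `pd.to_datetime`; and the halfway point is 19 games, since each team plays 38 in a 20-team league. Ties are broken only by shared rank, not by goal difference. `points_rank`, `midseason_vs_final` and the toy rows are hypothetical names and made-up values.

```python
import pandas as pd

# Assumption: `result` holds "W", "D" or "L"
POINTS = {"W": 3, "D": 1, "L": 0}


def points_rank(matches: pd.DataFrame) -> pd.Series:
    """Rank teams within each season by total points (1 = most points)."""
    totals = (
        matches.assign(points=matches["result"].map(POINTS))
        .groupby(["season", "team"])["points"]
        .sum()
    )
    return totals.groupby(level="season").rank(method="min", ascending=False)


def midseason_vs_final(matches: pd.DataFrame, cutoff: int = 19) -> pd.DataFrame:
    """Compare each team's rank after `cutoff` games with its final rank."""
    m = matches.assign(date=pd.to_datetime(matches["date"])).sort_values("date")
    # Number each team's games within a season in date order
    m["game_no"] = m.groupby(["season", "team"]).cumcount() + 1
    mid = points_rank(m[m["game_no"] <= cutoff]).rename("mid_rank")
    final = points_rank(m).rename("final_rank")
    return pd.concat([mid, final], axis=1).reset_index()


# Toy rows (made up): two teams, four games each, cutoff after two games
toy = pd.DataFrame(
    {
        "season": [2022] * 8,
        "date": [
            "2022-01-01", "2022-01-01", "2022-01-08", "2022-01-08",
            "2022-01-15", "2022-01-15", "2022-01-22", "2022-01-22",
        ],
        "team": ["Team A", "Team B"] * 4,
        "result": ["W", "L", "D", "D", "L", "W", "L", "W"],
    }
)
comparison = midseason_vs_final(toy, cutoff=2)
print(comparison)
```

In the toy data the mid-season leader does not finish first; running `midseason_vs_final(df)` on the real records would show how often that happens in practice.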
