Project: Which club manager will win a World Cup?
Motivation:
Every four years, a world champion is crowned at the FIFA World Cup. National pride is not the only reason for a country to pursue this extraordinary accomplishment: a win can also act as a highly beneficial shock to the national economy. (See the articles and studies below on the effect of Argentina's most recent World Cup win in 2022 and of previous World Cup wins.) This economic incentive raises a question for many national football associations: how do you bring home a World Cup?
You can't buy or trade players from other nations, but you can, and certainly should, appoint the right manager.
Articles:
https://www.bloomberg.com/news/articles/2022-12-16/winning-2022-fifa-world-cup-could-bring-gdp-boost-to-france-or-argentina
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4226179
https://www.weforum.org/agenda/2022/12/world-cup-football-financial-windfall/
- What key metrics translate to wins in tournament football?
- Do the same metrics translate to wins in club competition?
- Train and test these metrics on tournament football to find their relationship with "winning" and identify the metrics that lead to World Cup wins, then find the club managers whose teams match those metrics most closely.
- Which club managers would be a good fit for international football?
- Conventional metrics used in football usually focus on shots, goals and assists.
- Fewer than 1% of on-ball actions in football are shots.
- Traditional statistics fail to consider context and position on the pitch (e.g., passing accuracy).
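The "find the most similar club managers" step in the plan above could be approached in several ways. One minimal sketch, assuming each side is summarised as a vector of per-match metric averages, is cosine similarity. The metric choice, manager labels, and numbers below are hypothetical placeholders, not values from the dataset.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical profile of a tournament-winning side:
# (smart passes, crosses, ground duels) per match, illustrative numbers only
winning_profile = [10.0, 20.0, 30.0]

# Hypothetical club-manager profiles on the same three metrics
club_profiles = {
    'manager_a': [5.0, 10.0, 15.0],   # same emphasis as the winning profile, half the volume
    'manager_b': [30.0, 20.0, 10.0],  # reversed emphasis
}

# Rank managers from most to least similar to the winning profile
ranked = sorted(
    club_profiles,
    key=lambda name: cosine_similarity(club_profiles[name], winning_profile),
    reverse=True,
)
best_match = ranked[0]
```

Cosine similarity compares the mix of actions and ignores overall volume. If volume should count as well, standardising the metrics and using a distance measure would be an alternative.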
Metrics or 'SubEventNames' in football analytics:
Source of the data: https://figshare.com/collections/Soccer_match_event_dataset/4415000/5
The dataset is converted from JSON into dataframes. The variables used are briefly explained below.
df_teams (142 rows x 6 columns)
(variables: name (str) - name of the team,
wyId (int) - team id
)
df_events (101759 rows x 12 columns)
(variables: subEventName (str) - match action event metric,
positions (list) - start and end (x, y) coordinates of the action on the pitch,
eventName (str) - broader classification of the match action event,
matchPeriod (str) - phase of the game,
eventSec (float) - time of the event, in seconds from the start of the current match period
)
df_matches (115 rows x 15 columns)
(variables: winner (int) - team id of the winner of the match,
label (str) - result of the match,
duration (str) - whether the match ended after 90 minutes or went to extra time
)
# Reading in the data files
import pandas as pd
from io import BytesIO
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen, urlretrieve
from zipfile import ZipFile, is_zipfile
from tqdm.notebook import tqdm
# Links to data files
data_files = {
    'events': 'https://ndownloader.figshare.com/files/14464685',   # ZIP file containing one JSON file for each competition
    'matches': 'https://ndownloader.figshare.com/files/14464622',  # ZIP file containing one JSON file for each competition
    'teams': 'https://ndownloader.figshare.com/files/15073697'     # JSON file
}
The loop below iterates over the data_files dict, downloads each listed file, and stores it on the local file system.
If a downloaded file is a ZIP archive, the JSON files it contains are extracted and stored on the local file system as well.
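for url in tqdm(data_files.values()):
    # Follow any redirect to obtain the final URL, whose path carries the file name
    url_s3 = urlopen(url).geturl()
    path = Path(urlparse(url_s3).path)
    file_name = path.name
    # Download the file into the current working directory
    file_local, _ = urlretrieve(url_s3, file_name)
    # Extract ZIP archives next to the downloaded file
    if is_zipfile(file_local):
        with ZipFile(file_local) as zip_file:
            zip_file.extractall()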
The read_json_file function reads and returns the content of a given JSON file. The function handles the encoding of special characters (e.g., accents in names of teams) that the pd.read_json function cannot handle properly.
def read_json_file(filename):
    with open(filename, 'rb') as json_file:
        return BytesIO(json_file.read()).getvalue().decode('unicode_escape')
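To make the decoding step concrete, here is a minimal sketch of what decode('unicode_escape') does, assuming the files store accented characters as \uXXXX escape sequences. The sample record is illustrative and is not taken from the data files.

```python
import json

# Raw bytes as they might appear in a file that stores accents as \uXXXX escapes
raw = b'{"name": "Atl\\u00e9tico Madrid"}'

# unicode_escape turns the six-character escape sequence into the real character
decoded = raw.decode('unicode_escape')
record = json.loads(decoded)

# Caveat: bytes that are not escapes are read as Latin-1, so text that is
# already UTF-8 encoded gets mangled by this codec
mangled = 'é'.encode('utf-8').decode('unicode_escape')
```

Here decoded contains the accented character directly, while mangled shows why this codec only suits files whose non-ASCII characters are all written as escapes.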
# Tournament competitions, listed separately so they can be isolated from club competitions later
competitions = ['European Championship', 'World Cup']
# Reading the required competitions
for competition in competitions:
    competition_name = competition.replace(' ', '_')
    file_events = f'events_{competition_name}.json'
    json_events = read_json_file(file_events)
    # Note: df_events is reassigned on every iteration, so after the loop
    # it holds only the last competition in the list (World Cup)
    df_events = pd.read_json(json_events)
df_events.shape
(101759, 12)
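Because the loop above assigns df_events afresh on each pass, the resulting frame holds a single competition. If events from both tournaments are wanted in one frame, one possible pattern is to collect the per-competition frames in a list and concatenate them, as is done for df_matches further down. In this sketch the toy frames stand in for the pd.read_json output, and the competition column and the df_events_all name are additions for illustration.

```python
import pandas as pd

competitions = ['European Championship', 'World Cup']

# Toy stand-ins for the frames pd.read_json would return (hypothetical rows)
toy_frames = {
    'European Championship': pd.DataFrame(
        {'matchId': [1, 1], 'subEventName': ['Simple pass', 'Shot']}
    ),
    'World Cup': pd.DataFrame(
        {'matchId': [2, 2, 2], 'subEventName': ['Cross', 'Launch', 'Touch']}
    ),
}

dfs_events = []
for competition in competitions:
    df = toy_frames[competition].copy()
    df['competition'] = competition  # remember which tournament each row came from
    dfs_events.append(df)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
df_events_all = pd.concat(dfs_events, ignore_index=True)
```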
df_events.tail(3)
| | eventId | subEventName | tags | playerId | positions | matchId | eventName | teamId | matchPeriod | eventSec | subEventId | id |
---|---|---|---|---|---|---|---|---|---|---|---|---|
101756 | 8 | Cross | [{'id': 401}, {'id': 801}, {'id': 1802}] | 14812 | [{'y': 2, 'x': 82}, {'y': 100, 'x': 100}] | 2058017 | Pass | 9598 | 2H | 2983.448628 | 80 | 263885654 |
101757 | 4 | Goalkeeper leaving line | [] | 25381 | [{'y': 0, 'x': 0}, {'y': 98, 'x': 18}] | 2058017 | Goalkeeper leaving line | 4418 | 2H | 2985.869275 | 40 | 263885613 |
101758 | 8 | Launch | [{'id': 1802}] | 25381 | [{'y': 43, 'x': 14}, {'y': 0, 'x': 0}] | 2058017 | Pass | 4418 | 2H | 3002.148765 | 84 | 263885618 |
df_events.columns
Index(['eventId', 'subEventName', 'tags', 'playerId', 'positions', 'matchId', 'eventName', 'teamId', 'matchPeriod', 'eventSec', 'subEventId', 'id'], dtype='object')
df_events.subEventName.unique()
array(['Simple pass', 'High pass', 'Air duel', 'Ground attacking duel', 'Ground defending duel', 'Throw in', 'Foul', 'Free Kick', 'Clearance', 'Touch', 'Ground loose ball duel', 'Corner', 'Head pass', 'Launch', 'Acceleration', 'Shot', 'Smart pass', 'Cross', 'Reflexes', '', 'Hand pass', 'Goal kick', 'Free kick cross', 'Hand foul', 'Save attempt', 'Free kick shot', 'Goalkeeper leaving line', 'Penalty', 'Late card foul', 'Violent Foul', 'Protest', 'Out of game foul', 'Time lost foul', 'Simulation'], dtype=object)
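The earlier point that shots make up only a small share of on-ball actions can be checked against subEventName. A minimal sketch on a toy frame follows. The rows are hypothetical and do not reflect real proportions, and which sub-event labels count as shots is an assumption. On the real data, df_events would take the place of toy_events.

```python
import pandas as pd

# Hypothetical events, using labels that appear in subEventName
toy_events = pd.DataFrame({
    'subEventName': [
        'Simple pass', 'Simple pass', 'Cross', 'Shot', 'Touch',
        'Free kick shot', 'High pass', 'Simple pass', 'Launch', 'Clearance',
    ],
})

# Assumption: these sub-event labels are treated as shots
shot_labels = ['Shot', 'Free kick shot', 'Penalty']

is_shot = toy_events['subEventName'].isin(shot_labels)
shot_share = is_shot.mean()  # fraction of rows flagged as shots

# Relative frequency of every sub-event, most common first
event_shares = toy_events['subEventName'].value_counts(normalize=True)
```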
json_teams = read_json_file('teams.json')
df_teams = pd.read_json(json_teams)
df_teams.shape
(142, 6)
df_teams.head(3)
| | city | name | wyId | officialName | area | type |
---|---|---|---|---|---|---|
0 | Newcastle upon Tyne | Newcastle United | 1613 | Newcastle United FC | {'name': 'England', 'id': '0', 'alpha3code': '... | club |
1 | Vigo | Celta de Vigo | 692 | Real Club Celta de Vigo | {'name': 'Spain', 'id': '724', 'alpha3code': '... | club |
2 | Barcelona | Espanyol | 691 | Reial Club Deportiu Espanyol | {'name': 'Spain', 'id': '724', 'alpha3code': '... | club |
# Flag national teams (True) versus club teams (False)
df_teams.type == "national"
0      False
1      False
2      False
3      False
4      False
       ...
137     True
138     True
139     True
140     True
141     True
Name: type, Length: 142, dtype: bool
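The boolean Series above can serve as a mask to keep only national teams. A minimal sketch on a toy frame follows. The two club rows reuse names and ids from the df_teams preview, while the national-team names and ids are hypothetical.

```python
import pandas as pd

toy_teams = pd.DataFrame({
    'name': ['Newcastle United', 'Celta de Vigo', 'National team A', 'National team B'],
    'wyId': [1613, 692, 9001, 9002],  # 9001 and 9002 are hypothetical ids
    'type': ['club', 'club', 'national', 'national'],
})

# Boolean mask: True where the team is a national side
national_mask = toy_teams['type'] == 'national'

# Keep only the national teams and collect their ids for later filtering
df_national = toy_teams[national_mask]
national_ids = set(df_national['wyId'])

# Quick check of how many teams fall into each category
type_counts = toy_teams['type'].value_counts()
```

Bracket access (toy_teams['type']) is used here instead of attribute access because column names can collide with DataFrame attributes and methods.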
# Read the matches of each competition and combine them into one dataframe
dfs_matches = []
for competition in competitions:
    competition_name = competition.replace(' ', '_')
    file_matches = f'matches_{competition_name}.json'
    json_matches = read_json_file(file_matches)
    df_matches = pd.read_json(json_matches)
    dfs_matches.append(df_matches)
# Stack the per-competition frames into a single df_matches
df_matches = pd.concat(dfs_matches)
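To relate match results to teams, the winner column can be mapped onto team names. The sketch below assumes that winner stores the team's wyId and uses 0 for a draw, which should be verified against the real df_matches. All ids and names are hypothetical.

```python
import pandas as pd

toy_matches = pd.DataFrame({
    'wyId': [101, 102, 103],    # hypothetical match ids
    'winner': [9001, 0, 9002],  # assumption: winning team's id, 0 for a draw
})
toy_teams = pd.DataFrame({
    'wyId': [9001, 9002],       # hypothetical team ids
    'name': ['National team A', 'National team B'],
})

# Lookup Series: team id -> team name
id_to_name = toy_teams.set_index('wyId')['name']

# Ids with no entry in the lookup (here the 0 used for draws) become NaN, then 'Draw'
toy_matches['winner_name'] = toy_matches['winner'].map(id_to_name).fillna('Draw')
```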
df_events.tail(5)
| | eventId | subEventName | tags | playerId | positions | matchId | eventName | teamId | matchPeriod | eventSec | subEventId | id |
---|---|---|---|---|---|---|---|---|---|---|---|---|
101754 | 8 | Simple pass | [{'id': 1801}] | 3476 | [{'y': 20, 'x': 46}, {'y': 6, 'x': 64}] | 2058017 | Pass | 9598 | 2H | 2978.301867 | 85 | 263885652 |
101755 | 7 | Touch | [] | 14812 | [{'y': 6, 'x': 64}, {'y': 2, 'x': 82}] | 2058017 | Others on the ball | 9598 | 2H | 2979.084611 | 72 | 263885653 |
101756 | 8 | Cross | [{'id': 401}, {'id': 801}, {'id': 1802}] | 14812 | [{'y': 2, 'x': 82}, {'y': 100, 'x': 100}] | 2058017 | Pass | 9598 | 2H | 2983.448628 | 80 | 263885654 |
101757 | 4 | Goalkeeper leaving line | [] | 25381 | [{'y': 0, 'x': 0}, {'y': 98, 'x': 18}] | 2058017 | Goalkeeper leaving line | 4418 | 2H | 2985.869275 | 40 | 263885613 |
101758 | 8 | Launch | [{'id': 1802}] | 25381 | [{'y': 43, 'x': 14}, {'y': 0, 'x': 0}] | 2058017 | Pass | 4418 | 2H | 3002.148765 | 84 | 263885618 |
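Since positions holds a list of {'x', 'y'} dicts, working with pitch location usually means unpacking it into plain columns first. A minimal sketch follows, using two rows copied from the output above. The helper name start_end and the new column names are hypothetical.

```python
import pandas as pd

toy_events = pd.DataFrame({
    'subEventName': ['Simple pass', 'Touch'],
    'positions': [
        [{'y': 20, 'x': 46}, {'y': 6, 'x': 64}],
        [{'y': 6, 'x': 64}, {'y': 2, 'x': 82}],
    ],
})

def start_end(positions):
    """Return (x_start, y_start, x_end, y_end); missing points become None."""
    start = positions[0] if len(positions) > 0 else {}
    end = positions[1] if len(positions) > 1 else {}
    return start.get('x'), start.get('y'), end.get('x'), end.get('y')

# Build one column per coordinate, aligned on the original index
coords = pd.DataFrame(
    toy_events['positions'].apply(start_end).tolist(),
    columns=['x_start', 'y_start', 'x_end', 'y_end'],
    index=toy_events.index,
)
toy_events = toy_events.join(coords)
```

Some rows in the output above carry (0, 0) or (100, 100) points, which may be placeholders instead of true coordinates. That is worth checking before computing distances or directions.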