Which club manager will win a World Cup?¶

Describes and motivates a real-world problem where data science may provide helpful insights.¶

  • Project: Which club manager will win a World Cup?

  • Motivation: Every four years, a world champion is crowned at the FIFA World Cup. National pride in this extraordinary accomplishment is not the only reason countries should want to win: a World Cup win can also give the winner's economy a significant boost. (See the articles and studies below on the effect of Argentina's 2022 World Cup win and of previous World Cup wins.) This economic incentive raises a question for many national football associations: how do you bring home a World Cup?

    You can't buy or trade players from other nations, but you can and certainly should appoint the right manager.

Articles:
https://www.bloomberg.com/news/articles/2022-12-16/winning-2022-fifa-world-cup-could-bring-gdp-boost-to-france-or-argentina
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4226179
https://www.weforum.org/agenda/2022/12/world-cup-football-financial-windfall/

How the data will be used to predict which current club managers are best suited to win a World Cup.¶

  • Here is an article on why even the most successful club managers fail to achieve the same results at international level:
    https://www.theguardian.com/football/the-set-pieces-blog/2022/jul/20/art-international-management-football-luiz-felipe-scolari-ottmar-hitzfeld-carlos-alberto-parreira-world-cup
  • Tournament football, in contrast to club competitions, places more emphasis on corners, free kicks, and counter-attacks, among other key metrics. It also levels the playing field: a team that does not concede can take a knockout match to penalties, which is basically a lottery. You don't have to play to win; you can play not to lose and give yourself a chance on penalties. Because of this, many successful club managers fail at the international level.

The dataset is used to find potential candidates for international teams by answering the following questions:

- Which key metrics translate to wins in tournament football?
- Do the same metrics translate to wins in club competitions?
- Train and test these metrics on tournament football to find their relationship with "winning", identify the metrics that lead to World Cup wins, then find the most similar profiles among club managers.
- Which club managers would be a good fit for international football?
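
The "find the most similar" step in the third question could be approached in several ways. The sketch below uses cosine similarity between per-match metric averages, which is only one possible choice and not necessarily the method this project ends up using. The profile values, the team names, and the helper names `cosine_similarity` and `rank_by_similarity` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(target_profile, club_profiles):
    """Return (team, similarity) pairs, most similar first."""
    scores = {
        team: cosine_similarity(target_profile, profile)
        for team, profile in club_profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-match averages: [free kicks, fouls, ground defending duels]
tournament_winner_profile = [12.0, 14.0, 60.0]
club_profiles = {
    "Club A": [11.0, 15.0, 58.0],
    "Club B": [20.0, 5.0, 30.0],
}

ranking = rank_by_similarity(tournament_winner_profile, club_profiles)
```

Cosine similarity compares the mix of actions rather than their absolute volume, so in practice the metrics would probably need scaling first so that high-count actions do not dominate.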

Why this dataset?¶

- Conventional metrics used in football usually focus on shots, goals and assists. 
- Fewer than 1% of on-ball actions in football are shots. 
- Traditional statistics fail to consider context and position on the pitch (e.g. passing accuracy).
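
The share of shots among all recorded actions can be checked directly on an events table. A minimal sketch on a toy frame follows; the rows are made up for illustration, and only the `subEventName` column name and the `'Shot'` label come from the dataset.

```python
import pandas as pd

# Toy stand-in for df_events; the rows are illustrative, not real data.
toy_events = pd.DataFrame({
    "subEventName": ["Simple pass", "Cross", "Air duel", "Shot", "High pass"],
})

# The mean of a boolean mask is the fraction of rows where it is True.
shot_share = float((toy_events["subEventName"] == "Shot").mean())
```

Run against the real `df_events`, the same expression gives the fraction of recorded events that are shots.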

What could some of those predictive metrics be?¶

Metrics or 'SubEventNames' in football analytics:

  • 'Free Kick'
  • 'Free kick cross'
  • 'Free kick shot'
  • 'Foul'
  • 'Late card foul'
  • 'Violent Foul'
  • 'Ground attacking duel'
  • 'Ground defending duel'
  • 'Shot'
  • 'Goal kick'
  • 'Goalkeeper leaving line'
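
One way to turn these labels into model features is to count how often each one occurs per team and match. The sketch below uses `pd.crosstab` on a toy frame; the rows and the shortened `selected` list are illustrative, while the column names `matchId`, `teamId` and `subEventName` are those of the events table.

```python
import pandas as pd

# A subset of the metrics listed above, kept short for the example.
selected = ["Free Kick", "Foul", "Shot"]

# Toy stand-in for df_events; the rows are illustrative, not real data.
toy_events = pd.DataFrame({
    "matchId": [1, 1, 1, 1, 1, 1],
    "teamId": [10, 10, 20, 20, 20, 10],
    "subEventName": ["Foul", "Shot", "Free Kick", "Simple pass", "Shot", "Foul"],
})

# Keep only the selected actions, then count them per (match, team).
subset = toy_events[toy_events["subEventName"].isin(selected)]
counts = pd.crosstab([subset["matchId"], subset["teamId"]], subset["subEventName"])

# Guarantee one column per selected metric, even if it never occurred.
counts = counts.reindex(columns=selected, fill_value=0)
```

Each row of `counts` is then one team's profile for one match, which can be joined with the match result.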

Explicitly load and show your dataset. Provide a data dictionary which explains the meaning of each feature present. Demonstrate that this data is sufficient to make progress on your real-world problem described above.¶

Source of the data (https://figshare.com/collections/Soccer_match_event_dataset/4415000/5)

The dataset is converted from JSON to DataFrames; the variables to be used are briefly explained below.

  1. df_teams (142 rows x 6 columns)
    (variables: city (str) - location of the team,
                name (str) - name of the team,
                wyId (int) - team id
                )
  • This is relevant because the data source does not include coaches, so I will use teams to find the managers, since each team has only one coach. (141 candidates = 142 rows)
  2. df_events (101759 rows x 12 columns)
    (variables: subEventName (str) - match action event metric,
                positions (list of dicts) - x/y coordinates of the action on the pitch (origin and destination, on a 0-100 scale),
                eventName (str) - broader classification of the match action event,
                matchPeriod (str) - phase of the game (e.g. 2H for the second half),
                eventSec (float) - time of the event, in seconds since the start of the current match period
                )
  3. df_matches (115 rows x 15 columns)
    (variables: winner (str) - winner of the match,
                label (str) - result of the match,
                duration (str) - whether the game was 90 minutes or overtime
                )
In [1]:
# Reading in the data files
import pandas as pd
from io import BytesIO
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen, urlretrieve
from zipfile import ZipFile, is_zipfile
from tqdm.notebook import tqdm
In [2]:
# links to data files
data_files = {
    'events': 'https://ndownloader.figshare.com/files/14464685',  # ZIP file containing one JSON file for each competition
    'matches': 'https://ndownloader.figshare.com/files/14464622',  # ZIP file containing one JSON file for each competition
    'teams': 'https://ndownloader.figshare.com/files/15073697'  # JSON file
}

Loops through the data_files dict, downloads each listed data file, and stores each downloaded data file to the local file system.

If the downloaded data file is a ZIP archive, the included JSON files are extracted from the ZIP archive and stored to the local file system.

In [ ]:
for url in tqdm(data_files.values()):
    url_s3 = urlopen(url).geturl()
    path = Path(urlparse(url_s3).path)
    file_name = path.name
    file_local, _ = urlretrieve(url_s3, file_name)
    if is_zipfile(file_local):
        with ZipFile(file_local) as zip_file:
            zip_file.extractall()

The read_json_file function reads and returns the content of a given JSON file. The function handles the encoding of special characters (e.g., accents in names of teams) that the pd.read_json function cannot handle properly.

In [3]:
def read_json_file(filename):
    # Read the raw bytes and decode escape sequences (e.g. \u00e9) into characters
    with open(filename, 'rb') as json_file:
        return json_file.read().decode('unicode_escape')
In [4]:
# Tournament competitions, kept separate from club competitions that may be added later
competitions = ['European Championship', 'World Cup']
In [5]:
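# Reading the required competitions
# Note: df_events is reassigned on every pass, so after the loop it holds
# only the events of the last competition in the list ('World Cup')
for competition in competitions:
    competition_name = competition.replace(' ', '_')
    file_events = f'events_{competition_name}.json'
    json_events = read_json_file(file_events)
    df_events = pd.read_json(json_events)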
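
To keep the events of every competition rather than only the last one read, the per-competition frames can be collected and concatenated, in the same way as is done for the matches further down. A minimal sketch on toy frames follows; the helper name `combine_competitions` and the added `competition` column are assumptions, not part of the dataset.

```python
import pandas as pd


def combine_competitions(frames_by_competition):
    """Stack per-competition frames and record where each row came from."""
    tagged = [
        frame.assign(competition=name)
        for name, frame in frames_by_competition.items()
    ]
    return pd.concat(tagged, ignore_index=True)


# Toy stand-ins for the frames returned by pd.read_json; rows are illustrative.
toy_frames = {
    "European Championship": pd.DataFrame(
        {"matchId": [1, 1], "subEventName": ["Foul", "Shot"]}
    ),
    "World Cup": pd.DataFrame({"matchId": [2], "subEventName": ["Corner"]}),
}

all_events = combine_competitions(toy_frames)
```

Applied to the real files, the combined frame would have more rows than the shape reported below, which reflects a single competition.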
In [6]:
df_events.shape
Out[6]:
(101759, 12)
In [7]:
df_events.tail(3)
Out[7]:
        eventId  subEventName             tags                                      playerId  positions                                  matchId  eventName                teamId  matchPeriod  eventSec     subEventId  id
101756  8        Cross                    [{'id': 401}, {'id': 801}, {'id': 1802}]  14812     [{'y': 2, 'x': 82}, {'y': 100, 'x': 100}]  2058017  Pass                     9598    2H           2983.448628  80          263885654
101757  4        Goalkeeper leaving line  []                                        25381     [{'y': 0, 'x': 0}, {'y': 98, 'x': 18}]     2058017  Goalkeeper leaving line  4418    2H           2985.869275  40          263885613
101758  8        Launch                   [{'id': 1802}]                            25381     [{'y': 43, 'x': 14}, {'y': 0, 'x': 0}]     2058017  Pass                     4418    2H           3002.148765  84          263885618
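
Each cell of the positions column holds a list of x/y dictionaries, which is awkward to filter or model on directly. The sketch below unpacks it into four numeric columns; the helper name `unpack_positions` and the new column names are assumptions, and the two sample rows copy the 'Cross' and 'Launch' events shown above.

```python
import pandas as pd


def unpack_positions(positions):
    """Return (start_x, start_y, end_x, end_y); missing points become None."""
    start = positions[0] if len(positions) > 0 else {}
    end = positions[1] if len(positions) > 1 else {}
    return start.get("x"), start.get("y"), end.get("x"), end.get("y")


# Two rows copied from the df_events output above.
toy_events = pd.DataFrame({
    "subEventName": ["Cross", "Launch"],
    "positions": [
        [{"y": 2, "x": 82}, {"y": 100, "x": 100}],
        [{"y": 43, "x": 14}, {"y": 0, "x": 0}],
    ],
})

# Build one row of coordinates per event and attach it to the events.
coords = pd.DataFrame(
    toy_events["positions"].map(unpack_positions).tolist(),
    columns=["start_x", "start_y", "end_x", "end_y"],
    index=toy_events.index,
)
toy_events = toy_events.join(coords)
```

With separate coordinate columns, location-based filters (for example, actions starting in the final third) become simple comparisons.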
In [8]:
df_events.columns
Out[8]:
Index(['eventId', 'subEventName', 'tags', 'playerId', 'positions', 'matchId',
       'eventName', 'teamId', 'matchPeriod', 'eventSec', 'subEventId', 'id'],
      dtype='object')
In [9]:
df_events.subEventName.unique()
Out[9]:
array(['Simple pass', 'High pass', 'Air duel', 'Ground attacking duel',
       'Ground defending duel', 'Throw in', 'Foul', 'Free Kick',
       'Clearance', 'Touch', 'Ground loose ball duel', 'Corner',
       'Head pass', 'Launch', 'Acceleration', 'Shot', 'Smart pass',
       'Cross', 'Reflexes', '', 'Hand pass', 'Goal kick',
       'Free kick cross', 'Hand foul', 'Save attempt', 'Free kick shot',
       'Goalkeeper leaving line', 'Penalty', 'Late card foul',
       'Violent Foul', 'Protest', 'Out of game foul', 'Time lost foul',
       'Simulation'], dtype=object)
In [10]:
json_teams = read_json_file('teams.json')
df_teams = pd.read_json(json_teams)
In [11]:
df_teams.shape
Out[11]:
(142, 6)
In [12]:
df_teams.head(3)
Out[12]:
   city                 name              wyId  officialName                  area                                               type
0  Newcastle upon Tyne  Newcastle United  1613  Newcastle United FC           {'name': 'England', 'id': '0', 'alpha3code': '...  club
1  Vigo                 Celta de Vigo     692   Real Club Celta de Vigo       {'name': 'Spain', 'id': '724', 'alpha3code': '...  club
2  Barcelona            Espanyol          691   Reial Club Deportiu Espanyol  {'name': 'Spain', 'id': '724', 'alpha3code': '...  club
In [13]:
# Boolean mask: True for national teams, False for club teams
df_teams.type == "national"
Out[13]:
0      False
1      False
2      False
3      False
4      False
       ...  
137     True
138     True
139     True
140     True
141     True
Name: type, Length: 142, dtype: bool
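
The boolean mask above can be used to split the teams table into national and club teams. A minimal sketch on a toy frame follows; the third row, including its name and id, is hypothetical.

```python
import pandas as pd

# Toy stand-in for df_teams; the national-team row is hypothetical.
toy_teams = pd.DataFrame({
    "name": ["Newcastle United", "Espanyol", "Hypothetical National Team"],
    "wyId": [1613, 691, 99999],
    "type": ["club", "club", "national"],
})

is_national = toy_teams["type"] == "national"
national_teams = toy_teams[is_national]   # rows where the mask is True
club_teams = toy_teams[~is_national]      # ~ inverts the mask
```

The bracket form `toy_teams["type"]` is used here instead of attribute access because it cannot be confused with a DataFrame attribute or method of the same name.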
In [14]:
dfs_matches = []
for competition in competitions:
    competition_name = competition.replace(' ', '_')
    file_matches = f'matches_{competition_name}.json'
    json_matches = read_json_file(file_matches)
    df_matches = pd.read_json(json_matches)
    dfs_matches.append(df_matches)
df_matches = pd.concat(dfs_matches)
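
To relate match results to teams, the matches table can be joined with the teams table. The sketch below assumes that `winner` holds the winning team's `wyId` and that a value of 0 means no winner; neither is confirmed by the outputs in this notebook, so both should be checked against the data first. The toy rows and the `matchId` and `winner_name` column names are hypothetical.

```python
import pandas as pd

# Toy stand-ins; ids and names are hypothetical.
toy_matches = pd.DataFrame({
    "matchId": [100, 101],
    "winner": [1, 0],  # assumption: team id of the winner, 0 for no winner
    "duration": ["Regular", "Penalties"],
})
toy_teams = pd.DataFrame({"wyId": [1, 2], "name": ["Team A", "Team B"]})

# Left join keeps every match; matches without a winner get a missing name.
winners = toy_matches.merge(
    toy_teams.rename(columns={"wyId": "winner", "name": "winner_name"}),
    on="winner",
    how="left",
)
```

A left join is used so that drawn matches stay in the table instead of being silently dropped.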
In [15]:
df_events.tail(5)
Out[15]:
        eventId  subEventName             tags                                      playerId  positions                                  matchId  eventName                teamId  matchPeriod  eventSec     subEventId  id
101754  8        Simple pass              [{'id': 1801}]                            3476      [{'y': 20, 'x': 46}, {'y': 6, 'x': 64}]    2058017  Pass                     9598    2H           2978.301867  85          263885652
101755  7        Touch                    []                                        14812     [{'y': 6, 'x': 64}, {'y': 2, 'x': 82}]     2058017  Others on the ball       9598    2H           2979.084611  72          263885653
101756  8        Cross                    [{'id': 401}, {'id': 801}, {'id': 1802}]  14812     [{'y': 2, 'x': 82}, {'y': 100, 'x': 100}]  2058017  Pass                     9598    2H           2983.448628  80          263885654
101757  4        Goalkeeper leaving line  []                                        25381     [{'y': 0, 'x': 0}, {'y': 98, 'x': 18}]     2058017  Goalkeeper leaving line  4418    2H           2985.869275  40          263885613
101758  8        Launch                   [{'id': 1802}]                            25381     [{'y': 43, 'x': 14}, {'y': 0, 'x': 0}]     2058017  Pass                     4418    2H           3002.148765  84          263885618