Analyzing Trends and Predicting Sales in the Video Game Industry

Introduction

The video game industry is fast-growing and dynamic, generating billions of dollars in revenue each year. As it continues to evolve and expand, it becomes increasingly important to understand the factors that contribute to a game's success or failure. In this project, we analyze video game sales and review data to identify trends, predict sales figures, evaluate reviews, and compare games.

Objectives

  1. To analyze video game sales trends over time and across different regions and platforms.
  2. To predict how well a new game is likely to sell based on factors such as platform, release date, publisher, and developer.
  3. To evaluate how user and critic scores compare and how they relate to sales figures.
  4. To identify popular genres and analyze their sales and review scores.
  5. To compare sales figures and review scores for different games.

Data

In [1]:
import pandas as pd

# Import the CSV file as a pandas DataFrame
df = pd.read_csv("game_statistics.csv")

# Display the first five rows of the DataFrame
print(df.head())
                             title total_sales total_shipped  \
0                 Professor Layton         NaN        18.00m   
1      Need for Speed: Most Wanted         NaN        17.80m   
2  Pokémon Diamond / Pearl Version         NaN        17.67m   
3                       Elden Ring         NaN        17.50m   
4      Grand Theft Auto: Vice City         NaN        17.50m   

                    publisher       developer release_date platform  \
0                    Nintendo         Level-5  10th Feb 08   Series   
1             Electronic Arts       EA Canada  15th Nov 05      All   
2                    Nintendo      Game Freak  28th Apr 07       DS   
3  Bandai Namco Entertainment   From Software  25th Feb 22      All   
4              Rockstar Games  Rockstar North  29th Oct 02      All   

  japan_sales na_sales other_sales pal_sales  pos  user_score  vgchartz_score  \
0         NaN      NaN         NaN       NaN  201         NaN             NaN   
1         NaN      NaN         NaN       NaN  202         NaN             NaN   
2         NaN      NaN         NaN       NaN  203         NaN             NaN   
3         NaN      NaN         NaN       NaN  204         NaN             NaN   
4         NaN      NaN         NaN       NaN  205         NaN             NaN   

   critic_score  last_update  
0           NaN  04th Feb 20  
1           NaN  20th Oct 20  
2           8.6          NaN  
3           NaN  28th Feb 22  
4           NaN  14th Oct 20  
/var/folders/1f/8fvyhtpn77qb8yf3jr3d8fg80000gn/T/ipykernel_17235/475495006.py:4: DtypeWarning: Columns (1,2,7,8,9,10) have mixed types. Specify dtype option on import or set low_memory=False.
  df = pd.read_csv("game_statistics.csv")
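The preview above shows shipped units as strings such as `18.00m`, dates such as `10th Feb 08`, and missing values read as `NaN`, which is a likely reason pandas reports mixed column types. Below is a minimal sketch of converting those fields. It assumes the `m` suffix means millions of units, that month abbreviations are English, and that two-digit years follow Python's `%y` convention. The helper names `parse_millions` and `parse_release_date` are hypothetical and not part of the original notebook.

```python
import re
from datetime import date, datetime
from typing import Optional


def parse_millions(value: object) -> Optional[float]:
    """Convert a string such as '18.00m' to 18.0; return None for missing values."""
    if not isinstance(value, str):
        return None  # e.g. the float NaN that pandas uses for 'N/A'
    text = value.strip().lower()
    if not text.endswith("m"):
        return None  # assumption: every valid figure carries the 'm' suffix
    try:
        return float(text[:-1])
    except ValueError:
        return None


def parse_release_date(value: object) -> Optional[date]:
    """Convert a string such as '10th Feb 08' to a date; return None on failure."""
    if not isinstance(value, str):
        return None
    # Drop the ordinal suffix (st, nd, rd, th) that follows the day number
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", value.strip())
    try:
        return datetime.strptime(cleaned, "%d %b %y").date()
    except ValueError:
        return None
```

Applied to the DataFrame, this could look like `df["total_shipped"].map(parse_millions)` and `df["release_date"].map(parse_release_date)`. Passing `low_memory=False` to `pd.read_csv` should also silence the `DtypeWarning`, at the cost of higher memory use while reading.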
In [6]:
import csv

# Open the CSV file in read mode (the csv module recommends newline="")
with open("game_statistics.csv", mode="r", newline="", encoding="utf-8") as file:

    # DictReader maps each data row to a dict keyed by the header line
    reader = csv.DictReader(file)

    # Read only the first data row; its keys are the column headers
    first_row = next(reader)

# Print the first row as a mapping of header -> raw string value
print(first_row)
{'title': 'Professor Layton', 'total_sales': 'N/A', 'total_shipped': '18.00m', 'publisher': 'Nintendo', 'developer': 'Level-5', 'release_date': '10th Feb 08', 'platform': 'Series', 'japan_sales': 'N/A', 'na_sales': 'N/A', 'other_sales': 'N/A', 'pal_sales': 'N/A', 'pos': '201', 'user_score': 'N/A', 'vgchartz_score': 'N/A', 'critic_score': 'N/A', 'last_update': '04th Feb 20'}

Methods

We will use publicly available data from sources such as VGChartz and Metacritic to gather video game sales figures and review scores. We will preprocess the data by removing duplicates, filling in missing values, and normalizing the features. We will then use scatter plots and heatmaps to visualize the relationships between variables such as sales figures and review scores. We will also use Matplotlib to create interactive visualizations for exploring the data in more depth.
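The preprocessing steps named above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the project's final pipeline: the function name `preprocess` is hypothetical, median imputation and min-max scaling are one choice among several, and the `toy` frame holds made-up values rather than real sales data.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Drop duplicate rows, fill missing numeric values, and min-max scale them."""
    out = df.drop_duplicates().copy()
    for col in numeric_cols:
        values = pd.to_numeric(out[col], errors="coerce")
        values = values.fillna(values.median())  # median imputation is an assumption
        span = values.max() - values.min()
        # Min-max scaling to [0, 1]; a constant column is mapped to 0.0
        out[col] = (values - values.min()) / span if span > 0 else 0.0
    return out


# Tiny made-up frame for illustration only (not real sales data)
toy = pd.DataFrame(
    {
        "title": ["A", "A", "B", "C"],
        "critic_score": [8.0, 8.0, None, 6.0],
        "total_shipped": [2.0, 2.0, 4.0, 6.0],
    }
)
clean = preprocess(toy, ["critic_score", "total_shipped"])

# Pairwise correlations; this matrix is the usual input for a heatmap
corr = clean[["critic_score", "total_shipped"]].corr()
```

The correlation matrix `corr` is what a heatmap would display, and the two scaled columns could be passed directly to a Matplotlib scatter plot.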

To predict the sales figures of new games, we will use the k-nearest neighbors (KNN) algorithm. Because sales figures are continuous, this is a regression task. We will split the data into training and testing sets, fit the model on the training set, and predict the sales figures of the testing set from its features. We will tune the hyperparameters of the KNN algorithm, chiefly the number of neighbors k, using cross-validation to achieve the best performance.
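A minimal sketch of this workflow, assuming scikit-learn is the library used (the notebook does not name one). The data here is synthetic stand-in data, not real sales figures. Categorical features such as platform or publisher would first need to be encoded numerically, which this sketch leaves out.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two numeric features and a noisy target (not real sales)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Hold out 20% of the rows for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_grid = {"kneighborsregressor__n_neighbors": [3, 5, 7, 9, 11]}

# 5-fold cross-validation over the candidate values of k
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsregressor__n_neighbors"]
# R^2 of the refitted best pipeline on the held-out test set
test_r2 = search.best_estimator_.score(X_test, y_test)
```

Keeping the scaler inside the pipeline matters for KNN because distances are sensitive to feature scale, and fitting the scaler on the full training set before cross-validation would leak information between folds.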

To analyze the performance of different games, we will use clustering techniques such as K-means to group games by their sales figures and review scores. We will visualize the clusters using Matplotlib to identify patterns and trends in the data, and compare the clusters to identify the factors that contribute to a game's success or failure.
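The clustering step could look like the sketch below, again assuming scikit-learn. The two synthetic groups stand in for (sales in millions, review score) pairs and are not real data. The choice of two clusters is an assumption made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (sales in millions, review score); not real data
rng = np.random.default_rng(0)
low = rng.normal(loc=[1.0, 6.0], scale=0.3, size=(50, 2))
high = rng.normal(loc=[15.0, 9.0], scale=0.3, size=(50, 2))
features = np.vstack([low, high])

# Standardize so that sales and scores contribute on a comparable scale
scaled = StandardScaler().fit_transform(features)

# n_clusters=2 is an assumption; n_init and random_state make the run repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)
```

To visualize the result with Matplotlib, `plt.scatter(features[:, 0], features[:, 1], c=labels)` colors each game by its assigned cluster. On real data, the number of clusters would need to be chosen, for example by comparing inertia or silhouette scores across several values.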

Expected Results

We expect to identify trends in video game sales across different platforms and regions, and to predict sales figures for new games from key factors using the KNN algorithm. We also expect data visualization to give insight into the relationship between review scores and sales figures. Finally, we expect K-means clustering, visualized with Matplotlib, to group games by sales figures and review scores in ways that reveal patterns and trends in the data.

Conclusion

This project will provide insights into the video game industry and the factors that contribute to a game's success or failure. The results could be useful for game developers, publishers, and investors who are looking to create or invest in new games. Combining data visualization with Matplotlib, KNN prediction, and K-means clustering will allow us to explore the data more deeply and gain new insights into the industry.
