The music industry is a massive market, with its global size estimated at $26 billion in 2021. Even so, creating a hit song remains extremely difficult, and predicting popularity can often feel like a guessing game.
Spotify is the largest music streaming service in the world, with 30%+ of the total market share in 2022. Its platform lets users and developers access a variety of metadata on its songs, ranging from popularity to details like key signature. The goal of this project is to identify how Spotify features (e.g. duration, danceability, energy) of top 100 songs have changed over time, and whether that change follows a clear relationship.
If this project identifies such a relationship, it will be possible to build a model that maps out the ideal hit song in 2023 (or even further in the future) based on details such as whether the song is explicit or how long it is. This would give artists a better understanding of what type of music is likely to be commercially viable or highly successful, and could guide A&R staff in the music industry toward better decisions on which artists to sign or which songs to release.
Potential downsides of such a model include A) it would play into existing cliches and trends in music ('dumbing down art') and B) it would project from past data, while future hit songs could differ significantly in their features from past hit songs.
We will use a Kaggle dataset of Spotify top hits from 2000 to 2019 and analyze the following features for each song:
import pandas as pd
df_spotify = pd.read_csv('spotify_top_songs.csv')
df_spotify.head()
| | artist | song | duration_ms | explicit | year | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Britney Spears | Oops!...I Did It Again | 211160 | False | 2000 | 77 | 0.751 | 0.834 | 1 | -5.444 | 0 | 0.0437 | 0.3000 | 0.000018 | 0.3550 | 0.894 | 95.053 | pop |
| 1 | blink-182 | All The Small Things | 167066 | False | 1999 | 79 | 0.434 | 0.897 | 0 | -4.918 | 1 | 0.0488 | 0.0103 | 0.000000 | 0.6120 | 0.684 | 148.726 | rock, pop |
| 2 | Faith Hill | Breathe | 250546 | False | 1999 | 66 | 0.529 | 0.496 | 7 | -9.007 | 1 | 0.0290 | 0.1730 | 0.000000 | 0.2510 | 0.278 | 136.859 | pop, country |
| 3 | Bon Jovi | It's My Life | 224493 | False | 2000 | 78 | 0.551 | 0.913 | 0 | -4.063 | 0 | 0.0466 | 0.0263 | 0.000013 | 0.3470 | 0.544 | 119.992 | rock, metal |
| 4 | *NSYNC | Bye Bye Bye | 200560 | False | 2000 | 65 | 0.614 | 0.928 | 8 | -4.806 | 0 | 0.0516 | 0.0408 | 0.001040 | 0.0845 | 0.879 | 172.656 | pop |
We will group the songs by release year (2000-2019). Averaging each feature within a year gives us a profile of the typical popular song for that year. We can then fit a line of best fit to each feature's year-over-year change and extend it to estimate what an ideal hit would look like in 2023.
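As a rough illustration of this approach, the sketch below averages features per year and extends a straight-line fit to 2023. It runs on a tiny made-up table rather than the real dataset: the values, the `features` list, and the helper `project_feature` are all assumptions for illustration, and a least-squares line via `np.polyfit` is only one possible choice of trend model.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_spotify; the numbers are illustrative, not real data.
df_sample = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2002, 2002],
    "danceability": [0.60, 0.62, 0.61, 0.63, 0.62, 0.64],
    "duration_ms": [232000, 228000, 230000, 226000, 228000, 224000],
})

features = ["danceability", "duration_ms"]  # hypothetical subset of columns

# Average profile of a popular song for each release year.
yearly = df_sample.groupby("year")[features].mean()


def project_feature(yearly_means: pd.Series, target_year: int) -> float:
    """Fit a least-squares line to one feature's yearly means and evaluate it at target_year."""
    years = yearly_means.index.to_numpy(dtype=float)
    center = years.mean()  # center the years to keep the fit numerically stable
    slope, intercept = np.polyfit(years - center, yearly_means.to_numpy(dtype=float), deg=1)
    return float(slope * (target_year - center) + intercept)


projection_2023 = {f: project_feature(yearly[f], 2023) for f in features}
print(projection_2023)
```

With the real data, `df_sample` would be replaced by `df_spotify` and `features` by the numeric columns of interest. Extending a line several years past the last observed year can push bounded features such as danceability outside their valid range, so the projected values should be checked against each feature's limits.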
A big concern is the scope of the proposal: the dataset covers many different metadata features and a wide range of songs spanning two decades. As a result, there could be a lot of 'noise' in the predictive outputs, and it might be hard to chart a generalized trendline. To mitigate this, we could limit the scope of the proposal to a smaller set of features (e.g. predict popularity based on change in danceability / energy / valence / tempo, predict genre based on duration / danceability, etc.).
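One possible way to choose that smaller set of features is to measure how well a straight line explains each feature's yearly averages, and keep only the features with a clear trend. The sketch below does this with R² on made-up yearly means. The helper `trend_r_squared`, the sample values, and the 0.5 cutoff are assumptions for illustration, not part of the proposal.

```python
import numpy as np
import pandas as pd


def trend_r_squared(years, values) -> float:
    """R^2 of a least-squares line fitted to values as a function of year."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    centered = years - years.mean()  # center the years for numerical stability
    slope, intercept = np.polyfit(centered, values, deg=1)
    fitted = slope * centered + intercept
    ss_res = np.sum((values - fitted) ** 2)
    ss_tot = np.sum((values - values.mean()) ** 2)
    return float(1 - ss_res / ss_tot) if ss_tot > 0 else 0.0


# Made-up yearly means; in practice these would come from grouping the dataset by year.
yearly_sample = pd.DataFrame(
    {
        "danceability": [0.60, 0.62, 0.64, 0.66],  # steady upward trend
        "tempo": [120.0, 110.0, 125.0, 115.0],     # no clear direction
    },
    index=pd.Index([2000, 2001, 2002, 2003], name="year"),
)

scores = {col: trend_r_squared(yearly_sample.index, yearly_sample[col])
          for col in yearly_sample.columns}

R2_CUTOFF = 0.5  # arbitrary illustrative threshold
keep = [col for col, r2 in scores.items() if r2 >= R2_CUTOFF]
print(scores, keep)
```

A high R² only says the yearly averages move in a roughly straight line; it does not say the trend will continue, so this is a screening step rather than a validation of the forecast.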