Spotify Hit Song Prediction¶

Motivation:¶

Problem¶

The music industry is a massive market, with its global size estimated at $26 billion in 2021. Despite that, creating a hit song remains extremely difficult, and predicting popularity can often feel like a guessing game.

Objective¶

Spotify is the largest music streaming service in the world, with 30%+ of the total market share in 2022. Its platform lets users and developers access a variety of metadata on its songs, ranging from popularity to details such as key signature. The goal of this project is to identify how the Spotify features (e.g. duration, danceability, energy) of top 100 songs have changed over time.

Impact¶

If this project successfully identifies such a relationship, it will be possible to build a model that maps out the ideal hit song in 2023 (or even further in the future) based on details such as whether the song is explicit or how long it is. This would give artists a better understanding of what type of music is likely to be commercially successful, and could guide A&R staff in the music industry toward better decisions on which artists to sign or which songs to release.

Potential downsides of such a model include A) that it would play into existing cliches and trends in music ('dumbing down art') and B) that it would project from past data, while future hit songs could differ significantly in their features from past hit songs.

Dataset¶

Detail¶

We will use a Kaggle dataset on Spotify Top Hits from 2000-2019 to analyze the following features for each song:

artist: Name of the Artist.
song: Name of the Track.
duration_ms: Duration of the track in milliseconds.
explicit: Whether the lyrics or content of the song or its music video contain material that could be considered offensive or unsuitable for children.
year: Release Year of the track.
popularity: The higher the value the more popular the song is.
danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Values typically range between -60 and 0 dB.
mode: Mode indicates the modality (major or minor) of a track. Major is represented by 1 and minor by 0.
speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness: Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.
liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
tempo: The overall estimated tempo of a track in beats per minute (BPM).
genre: Genre of the track.
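
Several of the features above are encoded numerically, so a small decoding step can make the data easier to read. The sketch below is illustrative only: the helper `add_readable_columns` and the new column names are hypothetical, the two sample rows are copied from the dataset preview, and the mode mapping (1 = major, 0 = minor) assumes Spotify's usual audio-feature convention, which the dataset description does not spell out.

```python
import pandas as pd

# Pitch Class notation: index 0 = C, 1 = C#/Db, ..., 11 = B
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def add_readable_columns(df):
    """Return a copy of df with human-readable duration, key and mode columns."""
    out = df.copy()
    # Milliseconds to minutes
    out["duration_min"] = out["duration_ms"] / 60000
    # A key of -1 means no key was detected
    out["key_name"] = out["key"].map(
        lambda k: PITCH_CLASSES[k] if 0 <= k <= 11 else "unknown"
    )
    # Assumption: 1 = major, 0 = minor
    out["mode_name"] = out["mode"].map({1: "major", 0: "minor"})
    return out


# Two rows taken from the preview of the dataset
sample = pd.DataFrame({
    "song": ["Oops!...I Did It Again", "Breathe"],
    "duration_ms": [211160, 250546],
    "key": [1, 7],
    "mode": [0, 1],
})
readable = add_readable_columns(sample)
print(readable[["song", "duration_min", "key_name", "mode_name"]])
```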

Preview¶

In [1]:

import pandas as pd

df_spotify = pd.read_csv('spotify_top_songs.csv')
df_spotify.head()

Out[1]:

	artist	song	duration_ms	explicit	year	popularity	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	genre
0	Britney Spears	Oops!...I Did It Again	211160	False	2000	77	0.751	0.834	1	-5.444	0	0.0437	0.3000	0.000018	0.3550	0.894	95.053	pop
1	blink-182	All The Small Things	167066	False	1999	79	0.434	0.897	0	-4.918	1	0.0488	0.0103	0.000000	0.6120	0.684	148.726	rock, pop
2	Faith Hill	Breathe	250546	False	1999	66	0.529	0.496	7	-9.007	1	0.0290	0.1730	0.000000	0.2510	0.278	136.859	pop, country
3	Bon Jovi	It's My Life	224493	False	2000	78	0.551	0.913	0	-4.063	0	0.0466	0.0263	0.000013	0.3470	0.544	119.992	rock, metal
4	*NSYNC	Bye Bye Bye	200560	False	2000	65	0.614	0.928	8	-4.806	0	0.0516	0.0408	0.001040	0.0845	0.879	172.656	pop
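
The preview shows two things worth handling before any per-year analysis: some rows have a release year of 1999, outside the stated 2000-2019 range, and the genre column can hold several comma-separated genres. A minimal cleaning sketch, rebuilt from the five preview rows so that it runs on its own (the variable names `in_range` and `genres_long` are hypothetical):

```python
import pandas as pd

# The five preview rows, reduced to the columns needed here
preview = pd.DataFrame({
    "song": ["Oops!...I Did It Again", "All The Small Things", "Breathe",
             "It's My Life", "Bye Bye Bye"],
    "year": [2000, 1999, 1999, 2000, 2000],
    "genre": ["pop", "rock, pop", "pop, country", "rock, metal", "pop"],
})

# Keep only songs released in the stated range (both bounds inclusive)
in_range = preview[preview["year"].between(2000, 2019)]

# Split multi-genre strings so that each row carries exactly one genre
genres_long = in_range.assign(
    genre=in_range["genre"].str.split(", ")
).explode("genre")

print(genres_long)
```

Whether to drop the out-of-range years or keep them as extra data points is a design choice; the sketch drops them only to match the 2000-2019 scope stated above.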

Method¶

Solution¶

We will group the songs by release year (2000-2019). By doing so, we can compute the 'schema' of the average popular song in each year (the mean of each feature), fit a line of best fit to how each feature changes from year to year, and extrapolate that line to estimate what an ideal hit would look like in 2023.
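
One way this step could be sketched is shown below. The helper names `yearly_profile` and `extrapolate_linear` are hypothetical, and the `toy` DataFrame contains made-up numbers chosen to be easy to check by hand, not values from the dataset. The fit is an ordinary least-squares straight line from `np.polyfit`, computed on centered years to keep the fit numerically well behaved.

```python
import numpy as np
import pandas as pd


def yearly_profile(df, features):
    """Mean of each feature per release year (the yearly 'schema')."""
    return df.groupby("year")[features].mean()


def extrapolate_linear(profile, target_year):
    """Fit a straight line per feature and evaluate it at target_year."""
    years = profile.index.to_numpy(dtype=float)
    center = years.mean()  # centering the years improves conditioning
    preds = {}
    for col in profile.columns:
        slope, intercept = np.polyfit(
            years - center, profile[col].to_numpy(dtype=float), deg=1
        )
        preds[col] = slope * (target_year - center) + intercept
    return pd.Series(preds, name=target_year)


# Toy data for illustration only (NOT taken from the Spotify dataset)
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2002, 2002],
    "danceability": [0.609, 0.611, 0.611, 0.613, 0.613, 0.615],
    "duration_ms": [240000, 236000, 232000, 236000, 228000, 232000],
})

profile = yearly_profile(toy, ["danceability", "duration_ms"])
prediction = extrapolate_linear(profile, 2023)
print(profile)
print(prediction)
```

A straight line is the simplest possible choice here. Features bounded between 0.0 and 1.0 can be extrapolated outside that range, so predicted values should be checked against each feature's valid range.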

Concerns¶

A big concern is the scope of the proposal: the dataset covers many different metadata features and a wide range of songs spanning two decades. As such, there could be a lot of 'noise' in the predictive outputs, and it might be hard to chart a generalized trendline. To mitigate this, we could limit the scope of the proposal to a smaller set of features (e.g. predict popularity based on change in danceability / energy / valence / tempo, predict genre based on duration / danceability, etc.).
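
One possible way to choose that smaller feature set is to measure how closely each feature's yearly mean follows a straight line over time and keep only the features with a clear trend. The sketch below is an assumption about how this could be done, not part of the proposal itself: `trend_strength` is a hypothetical helper, the 0.5 cutoff is arbitrary, and the `toy_profile` values are made up for illustration.

```python
import pandas as pd


def trend_strength(profile):
    """Squared correlation (R^2) between each feature's yearly mean and the year."""
    years = pd.Series(profile.index, index=profile.index, dtype=float)
    return profile.corrwith(years) ** 2


# Toy yearly means for illustration only (NOT taken from the Spotify dataset)
toy_profile = pd.DataFrame(
    {
        "tempo": [120.0, 121.0, 122.0, 123.0, 124.0],  # perfectly linear
        "valence": [0.5, 0.7, 0.4, 0.6, 0.5],          # noisy, no clear trend
    },
    index=pd.Index([2000, 2001, 2002, 2003, 2004], name="year"),
)

strength = trend_strength(toy_profile)
# Keep features whose linear trend explains at least half of the variance
# (the 0.5 threshold is an arbitrary, hypothetical cutoff)
selected = strength[strength >= 0.5].index.tolist()
print(strength)
print(selected)
```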