Spotify Hit Song Success Prediction¶

Motivation:¶

Problem¶

The music industry is a massive market: its global size was estimated at $26 billion in 2021. Even so, creating a hit song remains extremely difficult, and predicting popularity can often feel like a guessing game.

Objective¶

Spotify is the largest music streaming service in the world, with over 30% of the total market share in 2022. Its platform lets users and developers access a variety of metadata on its songs, ranging from popularity to details such as key signature. The goal of this project is to identify how the Spotify features (e.g. duration, danceability, energy) of top 100 songs have changed over time.

Impact¶

If this project successfully identifies such a relationship, it will be possible to build a model that maps the ideal hit song in 2023 (or even further in the future) in terms of details such as whether the song is explicit or how long it is. This would give artists a better understanding of what type of music is likely to be commercially successful, and could guide A&R staff in the music industry toward better decisions on which artists to sign or songs to release.

Potential downsides of such a model include A) playing into existing cliches and trends in music ('dumbing down art') and B) projecting from past data, when future hit songs could differ significantly in their features from past ones.

Dataset¶

Detail¶

We will use a Kaggle dataset on Spotify Top Hits from 2000-2019 to analyze the following features for each song:

  • artist: Name of the Artist.
  • song: Name of the Track.
  • duration_ms: Duration of the track in milliseconds.
  • explicit: Whether the lyrics or content of the song or its music video meet one or more criteria that could be considered offensive or unsuitable for children.
  • year: Release Year of the track.
  • popularity: The higher the value the more popular the song is.
  • danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
  • key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
  • loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Values typically range between -60 and 0 dB.
  • mode: Mode indicates the modality (major or minor) of a track. Major is represented by 1 and minor by 0.
  • speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.
  • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • instrumentalness: Predicts whether a track contains no vocals. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content.
  • liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
  • tempo: The overall estimated tempo of a track in beats per minute (BPM).
  • genre: Genre of the track.
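
The encoded columns above can be made easier to read before any analysis. The following is a minimal sketch, assuming the columns follow the definitions listed here; `PITCH_CLASSES`, `key_name`, `add_readable_columns`, and the derived column names are illustrative and not part of the dataset, and the three-row frame is toy data used only for demonstration.

```python
import pandas as pd

# Standard pitch-class names, indexed 0-11 (0 = C, 1 = C♯/D♭, 2 = D, ...).
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def key_name(key):
    """Return the pitch name for a key code, or None when no key was detected (-1)."""
    return PITCH_CLASSES[key] if 0 <= key < 12 else None


def add_readable_columns(df):
    """Add hypothetical helper columns: key_name and duration_min."""
    out = df.copy()
    out["key_name"] = out["key"].map(key_name)
    out["duration_min"] = out["duration_ms"] / 60_000  # milliseconds to minutes
    return out


# Toy rows for illustration only; not a substitute for the real dataset.
toy = pd.DataFrame({"key": [1, 7, -1], "duration_ms": [211160, 250546, 180000]})
decoded = add_readable_columns(toy)
```

Keeping the original `key` and `duration_ms` columns alongside the derived ones means later numeric analysis is unaffected.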

Preview¶

In [1]:
import pandas as pd

df_spotify = pd.read_csv('spotify_top_songs.csv')
df_spotify.head()
Out[1]:
   artist          song                    duration_ms  explicit  year  popularity  danceability  energy  key  loudness  mode  speechiness  acousticness  instrumentalness  liveness  valence  tempo    genre
0  Britney Spears  Oops!...I Did It Again  211160       False     2000  77          0.751         0.834   1    -5.444    0     0.0437       0.3000        0.000018          0.3550    0.894    95.053   pop
1  blink-182       All The Small Things    167066       False     1999  79          0.434         0.897   0    -4.918    1     0.0488       0.0103        0.000000          0.6120    0.684    148.726  rock, pop
2  Faith Hill      Breathe                 250546       False     1999  66          0.529         0.496   7    -9.007    1     0.0290       0.1730        0.000000          0.2510    0.278    136.859  pop, country
3  Bon Jovi        It's My Life            224493       False     2000  78          0.551         0.913   0    -4.063    0     0.0466       0.0263        0.000013          0.3470    0.544    119.992  rock, metal
4  *NSYNC          Bye Bye Bye             200560       False     2000  65          0.614         0.928   8    -4.806    0     0.0516       0.0408        0.001040          0.0845    0.879    172.656  pop
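
The preview suggests two things worth handling before analysis: some rows carry a release year of 1999, outside the stated 2000-2019 range, and `genre` can hold several comma-separated labels. Below is a minimal sketch of one way to handle both, using a few toy rows copied from the preview. The helper name `prepare` and the choice to drop out-of-range years are assumptions, not steps the project has committed to.

```python
import pandas as pd


def prepare(df, first_year=2000, last_year=2019):
    """Keep rows within the year range and give each genre label its own row."""
    in_range = df[df["year"].between(first_year, last_year)].copy()
    # "rock, pop" -> ["rock", " pop"]; explode then yields one row per label.
    in_range["genre"] = in_range["genre"].str.split(",")
    exploded = in_range.explode("genre")
    exploded["genre"] = exploded["genre"].str.strip()
    return exploded.reset_index(drop=True)


# Toy rows taken from the preview above, restricted to three columns.
toy = pd.DataFrame({
    "song": ["Oops!...I Did It Again", "All The Small Things", "It's My Life"],
    "year": [2000, 1999, 2000],
    "genre": ["pop", "rock, pop", "rock, metal"],
})
clean = prepare(toy)
```

Note that exploding genres repeats a song once per label, so per-year averages should be computed on the unexploded frame unless per-genre breakdowns are the goal.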

Method¶

Solution¶

We will group the songs by release year (2000-2019). Doing so gives us the 'schema' (the average feature values) of a popular song in each year, which lets us fit a line of best fit to the yearly change in each feature and extrapolate it to what an ideal hit would look like in 2023.
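
The grouping and line-of-best-fit idea can be sketched as follows. This is a rough illustration under stated assumptions rather than the project's final method: it uses an ordinary least-squares line per feature via `numpy.polyfit`, the names `yearly_profile`, `extrapolate`, and `FEATURES` are hypothetical, and the toy frame holds synthetic values, not rows from the dataset.

```python
import numpy as np
import pandas as pd

FEATURES = ["danceability", "energy"]  # illustrative subset of the columns


def yearly_profile(df, features):
    """Average each feature over all songs released in the same year."""
    return df.groupby("year")[features].mean()


def extrapolate(profile, target_year):
    """Fit a straight line to each feature's yearly mean and evaluate it at target_year."""
    years = profile.index.to_numpy(dtype=float)
    center = years.mean()  # centering the years keeps the fit well conditioned
    preds = {}
    for col in profile.columns:
        slope, intercept = np.polyfit(years - center, profile[col].to_numpy(dtype=float), deg=1)
        preds[col] = slope * (target_year - center) + intercept
    return pd.Series(preds, name=target_year)


# Synthetic values for demonstration only.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2002, 2002],
    "danceability": [0.50, 0.60, 0.55, 0.56, 0.55, 0.57],
    "energy": [0.80, 0.80, 0.79, 0.806, 0.79, 0.802],
})
profile = yearly_profile(toy, FEATURES)
forecast = extrapolate(profile, 2023)
```

A straight line can carry a bounded feature such as danceability outside its 0.0 to 1.0 range when projected far enough ahead, so forecasts may need clipping or a different model.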

Concerns¶

A big concern is the scope of the proposal: the dataset covers many metadata features and a wide range of songs spanning two decades. As a result, there could be a lot of 'noise' in the predictive outputs, and it might be hard to chart a generalized trendline. To mitigate this, we could limit the scope to a smaller set of features (e.g. predict popularity from the change in danceability / energy / valence / tempo, or predict genre from duration / danceability).
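
One possible way to choose that smaller feature set is to score how well a straight line explains each feature's yearly average and keep only the features with a clear trend. The sketch below assumes a year-indexed frame of yearly means as input and uses R² as the score. The function name `trend_r2`, the use of R² as the criterion, and the toy numbers are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def trend_r2(profile):
    """Return the R^2 of a straight-line fit for each column, highest first."""
    years = profile.index.to_numpy(dtype=float)
    x = years - years.mean()  # center the years for numerical stability
    scores = {}
    for col in profile.columns:
        y = profile[col].to_numpy(dtype=float)
        slope, intercept = np.polyfit(x, y, deg=1)
        residual = y - (slope * x + intercept)
        total = y - y.mean()
        ss_tot = float(total @ total)
        # A constant column has no variance to explain; score it as 0.
        scores[col] = 1.0 - float(residual @ residual) / ss_tot if ss_tot > 0 else 0.0
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic yearly means: one feature with a clean trend, one that jumps around.
toy_profile = pd.DataFrame(
    {"steady": [0.50, 0.52, 0.54, 0.56], "noisy": [0.60, 0.40, 0.65, 0.45]},
    index=pd.Index([2000, 2001, 2002, 2003], name="year"),
)
scores = trend_r2(toy_profile)
```

Features with a low score are the ones most likely to add noise to a trendline forecast. Where to set the cutoff is a judgment call left open here.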