Making a movie that grosses well at the box office is difficult. What makes a movie successful? Is it the runtime, the genre, or the movie rating? People usually have strong feelings about movies, and some movies have gained legendary status, but does a highly rated movie equal more profits, or is there some other factor that drives ratings?
IMDB is an extensive movie database that serves as the main hub for professional and amateur movie critics. Many use the IMDB movie rating scale as the main benchmark when ranking movies. The IMDB movie scale goes from 1 to 10, with 1 being the worst rating and 10 being the best. The database also includes other data about the movies such as budget, genre, release date, actors, etc.
The goal of this project is to identify the relationship between a movie's features and the success of the movie.
If this prediction is successful, it could help smaller moviemakers succeed. It could also raise the quality of movies in general, as moviemakers could see which features they need to focus on to make a movie that people will like.
A negative outcome could be that moviemakers rely on what has previously worked instead of thinking outside the box and introducing new and creative ways of moviemaking.
We will use a Kaggle Dataset of 5000 Movies on IMDB to observe the following features for each movie:
['color', 'director_name', 'num_critic_for_reviews', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name', 'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link', 'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes']
We plan to set aside features from this dataset that we think are irrelevant, such as color, actors, facebook likes, and title_year.
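As a minimal sketch of how these features could be set aside, the snippet below drops them from a small frame. The frame `df_sample` and the name-matching rules are assumptions made for illustration; the values are copied from the first two rows of the `df_movies.head()` preview, and the same filter could be applied to `df_movies`.

```python
import pandas as pd

# Hypothetical two-row sample using a subset of the dataset's column names.
df_sample = pd.DataFrame({
    "color": ["Color", "Color"],
    "director_name": ["James Cameron", "Gore Verbinski"],
    "duration": [178.0, 169.0],
    "actor_2_name": ["Joel David Moore", "Orlando Bloom"],
    "actor_1_facebook_likes": [1000.0, 40000.0],
    "budget": [237000000.0, 300000000.0],
    "title_year": [2009.0, 2007.0],
    "imdb_score": [7.9, 7.1],
})

# Assumed matching rules: color, title_year, actor names, and every
# column whose name ends in "_facebook_likes".
unwanted = [
    col for col in df_sample.columns
    if col in ("color", "title_year")
    or col.endswith("_facebook_likes")
    or (col.startswith("actor_") and col.endswith("_name"))
]

df_reduced = df_sample.drop(columns=unwanted)
print(list(df_reduced.columns))
```

Matching on name patterns rather than listing every column by hand keeps the filter short, but it also removes director_facebook_likes, which may or may not be intended.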
We plan to focus on these features:
We want to look at how these features are tied to the IMDB score and how the rating can be increased by changing them.
Different people like different genres of movies. Someone who likes action might not like drama and might therefore give a drama a lower score. Ratings are subjective and can be misleading. However, we assume that each rating is averaged over many people and is backed up by professional critics, whom we assume to be unbiased.
There is also a mix of qualitative and quantitative features for a movie so we need to figure out the best way of interpreting what is most essential for a movie's success.
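One possible way to put qualitative and quantitative features on the same footing is to turn the categorical columns into 0/1 indicator columns. The sketch below is an assumption about how this could be done, not a settled method for the project; `df_sample` is a hypothetical frame holding values from the first four rows of the `df_movies.head()` preview.

```python
import pandas as pd

# Hypothetical sample: genres are pipe-separated strings in this dataset.
df_sample = pd.DataFrame({
    "genres": [
        "Action|Adventure|Fantasy|Sci-Fi",
        "Action|Adventure|Fantasy",
        "Action|Adventure|Thriller",
        "Action|Thriller",
    ],
    "content_rating": ["PG-13", "PG-13", "PG-13", "PG-13"],
    "duration": [178.0, 169.0, 148.0, 164.0],
})

# One indicator column per genre; a movie can belong to several genres.
genre_flags = df_sample["genres"].str.get_dummies(sep="|")

# One indicator column per content rating (single-valued category).
rating_flags = pd.get_dummies(df_sample["content_rating"], prefix="rating")

# Numeric and encoded columns side by side, ready for a numeric model.
features = pd.concat([df_sample[["duration"]], genre_flags, rating_flags], axis=1)
print(features.head())
```

Multi-valued columns such as genres need `str.get_dummies` with a separator, while single-valued columns such as content_rating can go through `pd.get_dummies` directly.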
Furthermore, this dataset does not take into account storytelling, which many consider an important aspect of movies.
import pandas as pd
df_movies = pd.read_csv("movie_data.csv")
df_movies.head()
 | color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | ... | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action\|Adventure\|Fantasy\|Sci-Fi | ... | 3054.0 | English | USA | PG-13 | 237000000.0 | 2009.0 | 936.0 | 7.9 | 1.78 | 33000 |
1 | Color | Gore Verbinski | 302.0 | 169.0 | 563.0 | 1000.0 | Orlando Bloom | 40000.0 | 309404152.0 | Action\|Adventure\|Fantasy | ... | 1238.0 | English | USA | PG-13 | 300000000.0 | 2007.0 | 5000.0 | 7.1 | 2.35 | 0 |
2 | Color | Sam Mendes | 602.0 | 148.0 | 0.0 | 161.0 | Rory Kinnear | 11000.0 | 200074175.0 | Action\|Adventure\|Thriller | ... | 994.0 | English | UK | PG-13 | 245000000.0 | 2015.0 | 393.0 | 6.8 | 2.35 | 85000 |
3 | Color | Christopher Nolan | 813.0 | 164.0 | 22000.0 | 23000.0 | Christian Bale | 27000.0 | 448130642.0 | Action\|Thriller | ... | 2701.0 | English | USA | PG-13 | 250000000.0 | 2012.0 | 23000.0 | 8.5 | 2.35 | 164000 |
4 | NaN | Doug Walker | NaN | NaN | 131.0 | NaN | Rob Walker | 131.0 | NaN | Documentary | ... | NaN | NaN | NaN | NaN | NaN | NaN | 12.0 | 7.1 | NaN | 0 |
5 rows × 28 columns
We believe we can use several methods for this problem. We can start by clustering the movies by genre, language, and age rating to get a better understanding of how each group performs. Then we can use regression on the quantitative features, such as movie length or budget, to see which values decrease or increase the IMDB rating.
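The two steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the final method: a plain `groupby` average stands in for the clustering step, a one-variable least-squares line via `np.polyfit` stands in for the regression, and `df_sample` is a hypothetical frame holding four rows from the `df_movies.head()` preview. Four rows are far too few to draw any conclusion; the point is only the shape of the computation.

```python
import numpy as np
import pandas as pd

# Hypothetical sample taken from the first four preview rows.
df_sample = pd.DataFrame({
    "country": ["USA", "USA", "UK", "USA"],
    "content_rating": ["PG-13", "PG-13", "PG-13", "PG-13"],
    "duration": [178.0, 169.0, 148.0, 164.0],
    "imdb_score": [7.9, 7.1, 6.8, 8.5],
})

# Step 1 (stand-in for clustering): average score per category value.
mean_by_country = df_sample.groupby("country")["imdb_score"].mean()
print(mean_by_country)

# Step 2: fit imdb_score = slope * duration + intercept by least squares.
slope, intercept = np.polyfit(df_sample["duration"], df_sample["imdb_score"], 1)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")

# The sign of the slope suggests whether longer movies in the sample
# tend to score higher or lower.
predicted = slope * df_sample["duration"] + intercept
```

On the full dataset the same pattern would extend to several predictors, for example by fitting duration and budget together, and the rows with missing values would need to be handled first.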