I don't know about you all, but I never actually watched the last Avengers movie. Why? I wanted to save it for a rainy day, something to watch when school was no longer beating me up. After a while, though, that stopped being my reason for not watching. Instead, it was because everyone and their grandma had already spoiled the entire movie across every social media platform imaginable (even Pinterest!).
Now, of course, I was devastated, and I would not wish that form of torture on even my worst enemies. Content (specifically movies) deserves to be enjoyed in its entirety, and people should be able to scroll through the internet in peace without fear of spoilers. This is where our friendly neighbour Data Science comes to the rescue! We could use an ML model to detect whether a piece of content (e.g., a social media post or user review) contains spoilers.
Cool?
Trick question.
Yes.
Moving swiftly along before I go Hulk with my rage, let's take a look at the data. The data I am proposing to use comes from Kaggle and is the self-proclaimed IMDB Spoiler Dataset!
The Kaggle download actually gives us two files: one with details about different movies, and the other with user reviews of those movies. For the purpose of our model, we will be looking at the latter, i.e., IMDB_reviews.json.
```python
import pandas as pd

# takes a hot second to load since there's a lot of data...
# lines=True tells pandas the file holds one JSON object per line
df_reviews = pd.read_json('IMDB_reviews.json', lines=True)

# quick glance at the data
df_reviews.head()
```
Feature | Data Type | Meaning |
---|---|---|
review_date | str | When the review was written |
movie_id | str | The unique identifying number of the movie being reviewed |
user_id | str | The unique identifying number of the user who made the review |
is_spoiler | boolean | Whether or not the user's review contained spoilers of the movie |
review_text | str | The actual review |
rating | int | The rating from 1-10 (one being lowest, ten highest) the user gave the movie |
review_summary | str | One line summary of the review |
I am confident this data will be more than sufficient due to two main factors:
On second thought, our IMDB_movie_details.json file shall not be ignored! The approach I believe to be best for testing whether a review contains spoilers comes from this publication, where they looked at the similarity between a review and the movie summary. Hopefully, from there it will just be a matter of deciding how much (or how little) similarity means a review contains a spoiler.
I will be the first to admit that I am not fully sure how this will be done, but I'm excited to learn some cool things along the way to make it possible!
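To make the similarity idea a little more concrete, here is a minimal sketch of one way it could work. It is not necessarily what the publication did: I am assuming plain bag-of-words cosine similarity as a stand-in, and the helper names (`tokenize`, `cosine_similarity`, `looks_like_spoiler`), the toy texts, and the 0.3 threshold are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the words the two texts have in common.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has nothing in common with anything
    return dot / (norm_a * norm_b)


def looks_like_spoiler(review, summary, threshold=0.3):
    """Flag a review whose similarity to the summary reaches the threshold.

    The 0.3 default is a placeholder assumption, not a tuned value.
    """
    return cosine_similarity(review, summary) >= threshold


# Made-up toy texts, purely for illustration (not taken from the dataset).
summary = "the hero sacrifices himself to destroy the villain and save the universe"
spoiler_review = "I cried when the hero sacrifices himself to destroy the villain"
safe_review = "Great acting and a fantastic soundtrack, go see it"

print(cosine_similarity(spoiler_review, summary))
print(cosine_similarity(safe_review, summary))
print(looks_like_spoiler(spoiler_review, summary))
```

Raw word counts let common words like "the" inflate the score, so a real version would probably want stop-word removal or TF-IDF weighting, and the threshold would have to be chosen by checking it against the is_spoiler labels rather than guessed.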