spoiler alert!¶

The Problem¶

and boy is it a problem...¶

Avengers Endgame Google Search

Synopsis¶

I don't know about you all, but I never actually watched the last Avengers movie. Why? I wanted to save it for a rainy day: something to watch when school was no longer beating me up. After a while, though, that was no longer my reason for not watching. Instead, it was because everyone and their grandma had already spoiled the entire movie across every social media platform imaginable (even Pinterest!).

Now, of course, I was devastated, and I would not wish that form of torture on even my worst enemies. Content (specifically movies) deserves to be enjoyed in its entirety, and people should be able to scroll through the internet in peace without fear of spoilers. This is where our friendly neighbour Data Science comes to the rescue! We could use an ML model to detect whether a piece of content (e.g., a social media post or user review) contains spoilers.


Cool?

Trick question.

Yes


Two disturbing stats¶

  1. A 2014 TiVo study found that more than 78% of respondents have had a show, movie, etc. spoiled for them before!
  2. In this same study, 2% of the respondents admitted to being terrible people who intentionally spoil shows for others[^1].

The Data¶

wow, so cool!¶

Cute cartoon Hulk smash gif

Moving swiftly along before I go Hulk with my rage, let's take a look at the data. The data I am proposing for use comes from Kaggle, and is a self-proclaimed IMDB Spoiler Dataset!

Scale of The Data¶

Picture showing the scale of the dataset

Data Dictionary¶

The Kaggle download actually gives us two files: one with details about different movies, and the other with user reviews of those movies. For the purpose of our model, we will be looking at the latter, i.e., IMDB_reviews.json.

In [ ]:
import pandas as pd

# takes a hot second to load since there's a lot of data...
df_reviews = pd.read_json('IMDB_reviews.json', lines=True)
In [ ]:
# quick glance at the data
df_reviews.head()

Features & their meanings¶

| Feature | Data Type | Meaning |
| --- | --- | --- |
| review_date | str | When the review was written |
| movie_id | str | The unique identifying number of the movie being reviewed |
| user_id | str | The unique identifying number of the user who made the review |
| is_spoiler | boolean | Whether or not the user's review contained spoilers of the movie |
| review_text | str | The actual review |
| rating | int | The rating from 1-10 (one being lowest, ten highest) the user gave the movie |
| review_summary | str | One-line summary of the review |


I am confident this data will be more than sufficient for two main reasons:

  • There is enough data in it to both train and test a model
  • It contains the two fields we need to check for spoilers:
    • is_spoiler (the label we can check our predictions against during testing)
    • review_text (the text we will have to find a way of analyzing for spoilers)
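
To make the train/test idea concrete, here is a minimal sketch of a holdout split. It runs on a tiny made-up DataFrame rather than the real df_reviews, and the helper name holdout_split, the 80/20 ratio, and the seed are all assumptions for illustration, not something the dataset prescribes.

```python
import pandas as pd

# Toy stand-in for df_reviews (made-up rows), keeping only the two
# columns we care about: the text and the is_spoiler label.
toy_reviews = pd.DataFrame({
    "review_text": [f"review number {i}" for i in range(10)],
    "is_spoiler": [True, False] * 5,
})

def holdout_split(df, test_frac=0.2, seed=42):
    """Randomly hold out test_frac of the rows for testing (hypothetical helper)."""
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)  # everything not sampled stays in training
    return train, test

train_df, test_df = holdout_split(toy_reviews)
print(len(train_df), len(test_df))

# Worth checking how balanced the label is before trusting any accuracy score.
print(train_df["is_spoiler"].value_counts(normalize=True))
```

On the real data the same function could be called as holdout_split(df_reviews), assuming the JSON file loaded as shown earlier.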

Potential Problems¶

sigh¶

  1. Sometimes spoilers are super subtle and hard to catch (even as a human). It's the difference between saying "xyz character dies" and saying "xyz character who you may or may not see again". When approaching this data we would have to decide how strong a hint has to be before it counts as is_spoiler.
    • i.e., what makes a spoiler really a spoiler?
  2. Sometimes the review_summary is the one that contains the spoilers instead of the actual review_text. This would mean that our model would potentially have to consider three features.
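
One possible way around the second problem is to glue the summary and the body together so the model only ever sees one text field. This is just a sketch on made-up rows; combine_text and the full_text column are hypothetical names, not part of the dataset.

```python
import pandas as pd

# Two made-up reviews: in the first, the spoiler lives in the summary,
# and in the second the summary is missing altogether.
toy_reviews = pd.DataFrame({
    "review_summary": ["He dies at the end!", None],
    "review_text": ["Great pacing and a strong cast.", "Loved every minute."],
})

def combine_text(df):
    """Join summary and body into one string per review (hypothetical helper).

    Missing values are treated as empty strings so the concatenation
    never produces NaN.
    """
    summary = df["review_summary"].fillna("")
    body = df["review_text"].fillna("")
    return (summary + " " + body).str.strip()

toy_reviews["full_text"] = combine_text(toy_reviews)
print(toy_reviews["full_text"].tolist())
```

The trade-off is that the model can no longer tell which field a phrase came from, which may or may not matter for spoiler detection.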

The Approach¶

this is going to be fun¶

On second thought, our IMDB_movie_details.json file shall not be ignored! The approach I believe to be best for testing whether a review contains spoilers comes from this publication, where the authors looked at the similarity between a review and the movie summary! Hopefully from there it will just be a matter of us deciding how much (or how little) similarity means a review contains a spoiler.
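
To get a feel for what "similarity" could mean here, below is a rough sketch using plain bag-of-words cosine similarity. This is a stand-in I picked for illustration and not necessarily the method from the publication; the movie, the example texts, the function names, and the 0.3 threshold are all made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)

def looks_like_spoiler(review, plot_summary, threshold=0.3):
    """Flag a review whose similarity to the plot summary reaches the threshold.

    The threshold is an arbitrary placeholder; choosing it is the hard part.
    """
    return cosine_similarity(review, plot_summary) >= threshold

# Made-up movie so nobody gets spoiled by this notebook.
plot_summary = "The detective discovers the butler stole the diamond"
spoilery_review = "I can't believe the butler stole the diamond"
vague_review = "Great acting and a fun soundtrack"

print(looks_like_spoiler(spoilery_review, plot_summary))
print(looks_like_spoiler(vague_review, plot_summary))
```

Raw word counts let common words like "the" inflate the score, so a real attempt would probably want stop-word removal or TF-IDF weighting before settling on a threshold.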

I will be the first to admit that I am not fully sure how this will be done, but I'm excited to learn some cool things along the way to make it possible!

"Want to join me" gif