Cyberbullying is a growing problem in today's digital age, and it can have serious consequences for individuals, including mental health problems and social isolation. My goal is to build a predictive model that can detect and predict the risk of cyberbullying on social media platforms such as Twitter and Facebook. I will use natural language processing techniques to analyze the text of messages and identify patterns that are indicative of cyberbullying behavior. Find more here: CNN
I will use publicly available datasets (from Kaggle) collected from social media platforms such as Twitter and Facebook. The main feature is the text of the message. Each message is labeled 0 for hate speech, 1 for harmful speech, and 2 if it is neutral, to facilitate supervised learning. One dataset, with 24000 tweets, may not be enough to train the model to a high accuracy. To combat this issue, I have combined multiple datasets into one. Using multiple datasets also helps reduce any bias present in a single dataset. Since I am not the one labelling the data, combining sources also gives a more general picture of what people consider to be cyberbullying. The datasets I have used are:
Cyberbullying Dataset
Toxic Tweets Dataset
Cyberbullying Classification
Feature | Description |
---|---|
class | A categorical variable indicating the class of the tweet, where 0 represents hate speech, 1 represents offensive language (referred to as harmful speech in this project), and 2 represents neither. |
tweet | The text of the tweet. |
First, I will preprocess the text data to remove unwanted characters (usernames, emojis, punctuation) and normalize the text. To extract features from the preprocessed text, I will consider bag-of-words, n-grams, and word embeddings, most likely settling on word embeddings. I will then use machine learning algorithms such as decision trees, random forests, and support vector machines to build predictive models that classify text into three categories (hate speech, harmful speech, neither). I will evaluate the models using accuracy, precision, recall, F1 score, and confusion matrices.
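To make the evaluation metrics concrete, here is a minimal sketch of how accuracy, precision, recall, and F1 can be derived from a three-class confusion matrix. It is plain Python for illustration only: the helper names (`confusion_matrix`, `per_class_scores`, `accuracy`) and the example labels are hypothetical and not part of the project code, and in practice a library such as scikit-learn would likely be used instead.

```python
LABELS = (0, 1, 2)  # 0 = hate speech, 1 = harmful speech, 2 = neither


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Count predictions; rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_scores(matrix):
    """Return a (precision, recall, f1) tuple for each class."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[r][k] for r in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


def accuracy(matrix):
    """Fraction of all examples that lie on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical labels, made up purely for illustration.
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred)
for label, (p, r, f1) in zip(LABELS, per_class_scores(cm)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy(cm):.2f}")
```

Reporting precision and recall per class matters here because the combined dataset may be imbalanced, in which case accuracy alone can hide poor performance on the smaller classes.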
import numpy as np
import pandas as pd
df = pd.read_csv('data/labeled_data.csv')
unwanted_columns = ['Unnamed: 0', 'count', 'hate_speech', 'offensive_language', 'neither']
# drop unwanted columns
df.drop(unwanted_columns, axis=1, inplace=True)
df.head()
| | class | tweet |
|---|---|---|
| 0 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... |
| 1 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... |
| 2 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... |
| 3 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... |
| 4 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... |
# Add additional data from another source
df_2 = pd.read_csv('data/FinalBalancedDataset.csv')
df_2.drop('Unnamed: 0', axis=1, inplace=True)
# in this dataframe, the toxic column is the class and is either 1 for toxic or 0 for non-toxic, convert it to 1 for harmful, and 2 for non-harmful
df_2['Toxicity'] = df_2['Toxicity'].apply(lambda x: 1 if x == 1 else 2)
df_2.rename(columns={'Toxicity': 'class'}, inplace=True)
# Add even more data from another source
df_3 = pd.read_csv('data/cyberbullying_tweets.csv')
# rename the columns to match the other dataframes
df_3.rename(columns={'tweet_text': 'tweet', 'cyberbullying_type': 'class'}, inplace=True)
# class is either not_cyberbullying or a string naming the type of cyberbullying; convert it to 2 for not_cyberbullying and 0 for any cyberbullying type
df_3['class'] = df_3['class'].apply(lambda x: 2 if x == 'not_cyberbullying' else 0)
# combine the three dataframes vertically; ignore_index avoids duplicate row labels
df = pd.concat([df, df_2, df_3], axis=0, ignore_index=True)
# import the plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
# count the number of tweets per class
value_counts = df['class'].value_counts()
# visualize the distribution of the classes
sns.barplot(x=value_counts.index, y=value_counts.values)
plt.title('Distribution of classes')
plt.xlabel('Class')
plt.ylabel('Number of tweets')
plt.show()
df.head()
| | class | tweet |
|---|---|---|
| 0 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... |
| 1 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... |
| 2 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... |
| 3 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... |
| 4 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... |
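The raw tweets still carry retweet markers, usernames, and runs of punctuation, so the planned preprocessing step could look roughly like the sketch below. It uses only the standard library. The regular expressions, the helper names `clean_tweet` and `ngram_counts`, and the sample string are my own assumptions for illustration, and the final cleaning rules may differ.

```python
import re
from collections import Counter

# Assumed patterns; which characters to strip is a design choice.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\bRT\b")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, usernames, retweet markers and punctuation."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = RETWEET_RE.sub(" ", text)     # done before lowercasing so only the marker "RT" matches
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)  # also drops digits and emojis
    return WHITESPACE_RE.sub(" ", text).strip()


def ngram_counts(text, n=2):
    """Count word n-grams in an already cleaned string (a simple bag of n-grams)."""
    tokens = text.split()
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical tweet, written for this example rather than taken from the dataset.
sample = "!!! RT @user_one: Check this out!!! https://example.com that is so cold"
cleaned = clean_tweet(sample)
print(cleaned)
print(ngram_counts(cleaned, n=2).most_common(3))
```

With pandas, the same function could be applied column-wise, for example `df['tweet'].apply(clean_tweet)`. Note that stripping every non-letter also removes emojis and digits, which may or may not be desirable for this task.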