import numpy as np
import pandas as pd

df = pd.read_csv('data/labeled_data.csv')

unwanted_columns = ['Unnamed: 0', 'count', 'hate_speech', 'offensive_language', 'neither']

# drop unwanted columns
df.drop(unwanted_columns, axis=1, inplace=True)

df.head()



# Add additional data from another source
df_2 = pd.read_csv('data/FinalBalancedDataset.csv')
df_2.drop('Unnamed: 0', axis=1, inplace=True)

# In this dataframe the Toxicity column holds the label: 1 for toxic, 0 for non-toxic.
# Remap it to this project's scheme: 1 for harmful, 2 for non-harmful.
df_2['Toxicity'] = df_2['Toxicity'].apply(lambda x: 1 if x == 1 else 2)

df_2.rename(columns={'Toxicity': 'class'}, inplace=True)


# Add even more data from another source
df_3 = pd.read_csv('data/cyberbullying_tweets.csv')
# rename the text and label columns to match the other dataframes
df_3.rename(columns={'tweet_text': 'tweet', 'cyberbullying_type': 'class'}, inplace=True)

# class is either 'not_cyberbullying' or a string naming the type of cyberbullying;
# map 'not_cyberbullying' to 2 (non-harmful) and every other value to 0 (harmful)
df_3['class'] = df_3['class'].apply(lambda x: 2 if x == 'not_cyberbullying' else 0)


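# combine the three dataframes vertically;
# ignore_index=True gives the result a fresh 0..n-1 index instead of repeating each source's index
df = pd.concat([df, df_2, df_3], axis=0, ignore_index=True)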


# plotting libraries used for the class distribution chart
import matplotlib.pyplot as plt
import seaborn as sns

# count the number of tweets per class
value_counts = df['class'].value_counts()

# visualize the distribution of the classes
sns.barplot(x=value_counts.index, y=value_counts.values)
plt.title('Distribution of classes')
plt.xlabel('Class')
plt.ylabel('Number of tweets')
plt.show()

df.head()

Feature	Description
class	A categorical variable indicating the class of the tweet, where 0 represents hate speech, 1 represents offensive language, and 2 represents neither.
tweet	The text of the tweet.
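
As a quick sanity check on the data dictionary above, the sketch below maps the integer codes to readable names and verifies that no other values appear in `class`. The `CLASS_NAMES` mapping and the `label_summary` helper are hypothetical names introduced here for illustration, and the small DataFrame is a toy stand-in for the real combined data.

```python
import pandas as pd

# Label scheme taken from the data dictionary (the constant name is illustrative).
CLASS_NAMES = {0: "hate speech", 1: "offensive language", 2: "neither"}


def label_summary(frame: pd.DataFrame) -> pd.Series:
    """Count tweets per class name, failing early on unknown codes."""
    unknown = set(frame["class"].unique()) - set(CLASS_NAMES)
    if unknown:
        raise ValueError(f"unexpected class codes: {sorted(unknown)}")
    return frame["class"].map(CLASS_NAMES).value_counts()


# Toy stand-in for the combined dataframe.
toy = pd.DataFrame({"class": [2, 1, 1, 0], "tweet": ["a", "b", "c", "d"]})
summary = label_summary(toy)
print(summary)
```

Running a check like this right after concatenation would surface a mislabeled source before any model is trained on it.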

	class	tweet
0	2	!!! RT @mayasolovely: As a woman you shouldn't...
1	1	!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2	1	!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3	1	!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4	1	!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...

Identifying and Predicting the Risk of Cyberbullying using Natural Language Processing

Problem Description

Dataset

Data Dictionary

Approach

Load and show data:
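
One way to make the loading step less repetitive is a small helper that renames each source's columns and remaps its labels before concatenation. This is only a sketch: `harmonize` and its parameters are hypothetical, and the in-memory frames below stand in for the CSV files, whose column names are assumed to match those used in this notebook (`Toxicity`, `tweet_text`, `cyberbullying_type`).

```python
import pandas as pd


def harmonize(frame, text_col, label_col, label_map=None, default=None):
    """Return a copy with only `class` and `tweet` columns.

    label_map translates source labels to the 0/1/2 scheme; labels
    missing from the map fall back to `default`.
    """
    out = frame.rename(columns={text_col: "tweet", label_col: "class"})
    out = out[["class", "tweet"]].copy()
    if label_map is not None:
        out["class"] = out["class"].map(lambda x: label_map.get(x, default))
    return out


# Toy stand-ins for the second and third sources.
toxic = pd.DataFrame({"Toxicity": [1, 0], "tweet": ["t1", "t2"]})
bully = pd.DataFrame({
    "tweet_text": ["t3", "t4"],
    "cyberbullying_type": ["not_cyberbullying", "age"],
})

combined = pd.concat(
    [
        # 1 = toxic -> 1 (harmful), anything else -> 2 (non-harmful)
        harmonize(toxic, "tweet", "Toxicity", {1: 1}, default=2),
        # not_cyberbullying -> 2, any cyberbullying type -> 0
        harmonize(bully, "tweet_text", "cyberbullying_type",
                  {"not_cyberbullying": 2}, default=0),
    ],
    ignore_index=True,
)
print(combined)
```

Selecting only `class` and `tweet` inside the helper keeps stray columns from one source from turning into NaN-filled columns in the combined frame.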