Hate Speech Detection


Abstract

In the digital age, online hate speech on social media networks can incite hate-motivated violence or even crimes against particular groups of people. According to statistics released by the FBI, hate-related attacks targeted at specific groups of people are at a 16-year high in the United States of America. [1] There is therefore a growing need to curb hate speech as much as possible through automatic detection, which also eases the load on moderators.

The datasets were obtained from Reddit and from Gab, a white supremacist forum. They contain human-labelled comments that were determined to be hate speech. [2]

Multiple modelling approaches will be explored, from classical machine learning models to state-of-the-art deep learning models. F1 score and recall are the metrics prioritised in model comparison. Where both are the same, the actual False Negative and False Positive counts will be compared.
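To make the comparison rule concrete, here is a minimal sketch of how recall and F1 follow from confusion-matrix counts, and how models could be ordered by F1, then recall, then False Negatives, then False Positives. The function names, model names, and counts are hypothetical and are not results from this project.

```python
def recall(tp, fn):
    """Share of actual hate speech comments that the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(tp, fp, fn):
    """F1 for the positive class, written directly in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def rank_key(counts):
    """Sort key: higher F1 first, then higher recall, then fewer FN, then fewer FP."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    return (-round(f1(tp, fp, fn), 4), -round(recall(tp, fn), 4), fn, fp)

# Hypothetical counts for two made-up models with the same F1.
models = {
    "model_a": {"tp": 80, "fp": 20, "fn": 20},
    "model_b": {"tp": 80, "fp": 10, "fn": 30},
}
ranking = sorted(models, key=lambda name: rank_key(models[name]))
print(ranking)
```

Both made-up models have an F1 of 0.8, so the ordering is decided by recall, which favours the model that misses fewer hate speech comments.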

Problem Statement

In this digital age, online hate speech has increased over the past few years. Studies have shown that online hate speech can lead to offline violence towards a certain group. [3]

In some cases, social media can play a more direct role: for example, the New Zealand shooting incident was broadcast live on Facebook. [4]

Given the societal concern and how widespread hate speech is becoming on the Internet, especially on social media, there is a strong need to identify online comments that constitute hate speech. [5]

Hate speech definition: Hate speech is speech that attacks a person or a group on the basis of protected attributes such as race, religion, ethnic origin, national origin, sex, disability, sexual orientation, or gender identity. [6]

Hate speech categories:

misogyny –> aimed at women
misandry –> aimed at men
racism –> aimed at a specific race
sexual orientation
religion
disability

EDA

Word Cloud

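import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

#specifying own stopwords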
stopwords = ["a", "about", "above", "after", "again", "against", "ain", "all", "am", "an", "and", "any", "are", "aren", "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can", "couldn", "couldn't", "d", "did", "didn", "didn't", "do", "does", "doesn", "doesn't", "doing", "don", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn", "hadn't", "has", "hasn", "hasn't", "have", "haven", "haven't", "having", "he", "her", "here", "hers", "herself", "him", "himself", "his", "how", "i", "if", "in", "into", "is", "isn", "isn't", "it", "it's", "its", "itself", "just", "ll", "m", "ma", "me", "mightn", "mightn't", "more", "most", "mustn", "mustn't", "my", "myself", "needn", "needn't", "no", "nor", "not", "now", "o", "of", "off", "on", "once", "only", "or", "other", "our", "ours", "ourselves", "out", "over", "own", "re", "s", "same", "shan", "shan't", "she", "she's", "should", "should've", "shouldn", "shouldn't", "so", "some", "such", "t", "than", "that", "that'll", "the", "their", "theirs", "them", "themselves", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "until", "up", "ve", "very", "was", "wasn", "wasn't", "we", "were", "weren", "weren't", "what", "when", "where", "which", "while", "who", "whom", "why", "will", "with", "won", "won't", "wouldn", "wouldn't", "y", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves", "could", "he'd", "he'll", "he's", "here's", "how's", "i'd", "i'll", "i'm", "i've", "let's", "ought", "she'd", "she'll", "that's", "there's", "they'd", "they'll", "they're", "they've", "we'd", "we'll", "we're", "we've", "what's", "when's", "where's", "who's", "why's", "would"] \
+ ['was', 'really', 'let', 'like', 'also', 'dankMemes', 'imgoingtohellforthis', 'KotakuInAction', 'MensRights', 'MetaCanada', 'MGTOW', \
  'PussyPass', 'PussyPassDenied', 'The_Donald', 'TumblrInAction', 'please', 'moderators', 'questions', 'concerns', 'contact', 'action', \
  'perform', 'bot', 'subreddit', 'dankmemes', 'kotakuinaction', 'mensrights', 'metacanada', 'mgtow', 'pussypass', 'pussypassdenied', \
   'the_donald', 'tumblrinaction', 'pussy', 'pass']
stopwords = set(stopwords)

wordcloud = WordCloud(stopwords=stopwords, background_color="white", max_font_size=80, max_words=20)

#text has to be one single string
all_text =' '.join([txt for txt in df.loc[:,'text']]).lower()
wordcloud.generate(all_text)
plt.figure(figsize=(10,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Figure: word cloud of the 20 most frequent words.

Top Unigrams

def get_top_n_unigram(corpus, n=None):
    # pass the stop words as a list; recent scikit-learn versions do not accept a set here
    vec = CountVectorizer(ngram_range=(1, 1), stop_words=list(stopwords)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_unigram(df.loc[:,'text'], 20)

unigram_df = pd.DataFrame(common_words, columns = ['text' , 'count'])

plt.figure(figsize=(12, 9));
unigram_df.groupby('text').sum()['count'].sort_values(ascending=True).plot(
    kind='barh');
plt.ylabel('');
plt.title('Top 20 unigrams', fontdict={'fontsize': 30});
#set large enough font size for ytick labels
plt.gca().tick_params(axis='y', labelsize=16);

Figure: horizontal bar chart of the top 20 unigrams.

Top Bigrams

def get_top_n_bigram(corpus, n=None):
    # pass the stop words as a list; recent scikit-learn versions do not accept a set here
    vec = CountVectorizer(ngram_range=(2, 2), stop_words=list(stopwords)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(df.loc[:,'text'], 20)

bigram_df = pd.DataFrame(common_words, columns = ['text' , 'count'])

plt.figure(figsize=(12, 9));
bigram_df.groupby('text').sum()['count'].sort_values(ascending=True).plot(
    kind='barh');
plt.ylabel('');
plt.title('Top 20 bigrams', fontdict={'fontsize': 30});
#set large enough font size for ytick labels
plt.gca().tick_params(axis='y', labelsize=16);

Figure: horizontal bar chart of the top 20 bigrams.
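The two helpers above differ only in their ngram_range. As a dependency-free illustration of what they count, here is a small sketch using only the standard library. It is a simplified stand-in rather than a replacement for CountVectorizer: it lowercases the text, keeps tokens of two or more word characters, and drops stop words before forming n-grams. The function name top_ngrams and the toy corpus are hypothetical.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def top_ngrams(corpus, n=1, k=20, stop_words=()):
    """Return the k most common n-grams across all documents."""
    stop_words = set(stop_words)
    counts = Counter()
    for doc in corpus:
        # Tokenise, then remove stop words before building n-grams.
        tokens = [t for t in TOKEN_PATTERN.findall(doc.lower()) if t not in stop_words]
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

docs = ["The cat sat on the mat", "The cat sat on the sofa"]  # toy corpus
top_unigrams = top_ngrams(docs, n=1, k=2, stop_words={"the", "on"})
top_bigrams = top_ngrams(docs, n=2, k=1, stop_words={"the", "on"})
print(top_unigrams)
print(top_bigrams)
```

Because stop words are removed first, a bigram can join two words that were not adjacent in the original sentence, which is worth keeping in mind when reading the bigram chart.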

Modelling

BOW Modelling

Pipeline

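import numpy as np
import pandas as pd
from IPython.display import display
from nltk.corpus import stopwords as nltk_stopwords  # alias avoids shadowing the custom `stopwords` set defined earlier
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

stopwords_nltk =  set(nltk_stopwords.words('english'))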

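def superPipeline(Dataframes,Vectorizerlist,ClassifierList,Dfnames,pipe_params,methodgridname, df_column='lemma'):
    # df_column defaults to 'lemma' (an assumption) so that calls omitting it do not raise a TypeError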
    Methodgrid=[]
    metnum=len(Dataframes)*len(Vectorizerlist)*len(ClassifierList)
    n=0
    for index,df in enumerate(Dataframes):
        X=Dataframes[index][df_column]
        y=Dataframes[index]['hate']
        X_train,X_test,y_train,y_test=train_test_split(X,y,stratify=y,random_state=28)

        for Vectorizer in Vectorizerlist:
            for Classifier in ClassifierList:
                n+=1
                print(f'{n} of {metnum} of methods attempting')
                method={}
                pipe = Pipeline([
                    ('vec', Vectorizer ),
                    ('class', Classifier)
                ])

                gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5,verbose=1,n_jobs=-1, scoring='f1')
                gs.fit(X_train, y_train)
                method=(gs.best_params_)
                method['Cross_Val_Score']=(gs.best_score_)
                method['Test_Score']=gs.score(X_test,y_test)
                method['Vectorizer']=str(Vectorizer).split('(')[0]
                method['Data']=str(Dfnames[index])
                method['Classifier']=str(Classifier).split('(')[0]
                Methodgrid.append(method)

                tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test)).ravel()
                print(f"{str(Classifier).split('(')[0]} Confusion Matrix:")
                print(f"True Negatives: {tn}")
                print(f"False Positives: {fp}")
                print(f"False Negatives: {fn}")
                print(f"True Positives: {tp}")
                print('\n')

                report = classification_report(y_test, gs.predict(X_test), target_names=['Predict 0', 'Predict 1'], output_dict=True)
                class_table = pd.DataFrame(report).transpose()
                display(class_table)

    Methodgrid=pd.DataFrame(Methodgrid)
    Methodgrid.to_csv(methodgridname,index=False)
    return Methodgrid
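The function unpacks confusion_matrix(...).ravel() as tn, fp, fn, tp. For binary labels 0 and 1 that ordering comes from rows being the true class and columns the predicted class. A minimal standard-library sketch of the same bookkeeping is shown below, with 1 standing for hate speech. The helper name confusion_counts and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = hate speech."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # hate speech correctly flagged
        elif actual == 1:
            fn += 1  # hate speech that was missed
        elif predicted == 1:
            fp += 1  # harmless comment wrongly flagged
        else:
            tn += 1  # harmless comment correctly passed
    return tn, fp, fn, tp

y_true = [0, 0, 0, 1, 1, 1, 1]  # toy labels
y_pred = [0, 1, 0, 1, 1, 0, 0]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp)  # 2 1 2 2
```

False Negatives are the missed hate speech comments, which is why they carry the most weight in the comparisons that follow.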

Choosing best vectorizer

dataframes=[df]
df_names = ['df']
vectorizer_lst = [TfidfVectorizer(),CountVectorizer()]
classifier_lst = [LogisticRegression(), MultinomialNB()]
pipe_params = {
                    'vec__max_features': [int(i) for i in np.linspace(5000,20000,4)],
                    'vec__min_df': [2],
                    'vec__max_df': [.95],
                    'vec__ngram_range': [(1,1),(1,2)],
                    'vec__stop_words':[stopwords_nltk]
                }
superPipeline(dataframes, vectorizer_lst, classifier_lst, df_names, pipe_params, 'vectorizer_grid.csv', 'lemma')

1 of 4 of methods attempting
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.1min finished
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)


LogisticRegression Confusion Matrix:
True Negatives: 7364
False Positives: 322
False Negatives: 1136
True Positives: 3769


2 of 4 of methods attempting
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.0min finished


MultinomialNB Confusion Matrix:
True Negatives: 7188
False Positives: 498
False Negatives: 1804
True Positives: 3101


3 of 4 of methods attempting
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.4min finished
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)


LogisticRegression Confusion Matrix:
True Negatives: 7219
False Positives: 467
False Negatives: 925
True Positives: 3980


4 of 4 of methods attempting
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.3min finished


MultinomialNB Confusion Matrix:
True Negatives: 6309
False Positives: 1377
False Negatives: 874
True Positives: 4031

	Classifier	Cross_Val_Score	Data	Test_Score	Vectorizer	vec__max_df	vec__max_features	vec__min_df	vec__ngram_range	vec__stop_words
0	LogisticRegression	0.882397	df	0.884203	TfidfVectorizer	0.95	5000	2	(1, 1)	{whom, me, until, m, couldn, you'd, her, but, ...
1	MultinomialNB	0.810966	df	0.817171	TfidfVectorizer	0.95	5000	2	(1, 2)	{whom, me, until, m, couldn, you'd, her, but, ...
2	LogisticRegression	0.886686	df	0.889445	CountVectorizer	0.95	20000	2	(1, 1)	{whom, me, until, m, couldn, you'd, her, but, ...
3	MultinomialNB	0.821159	df	0.821222	CountVectorizer	0.95	5000	2	(1, 1)	{whom, me, until, m, couldn, you'd, her, but, ...

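CountVectorizer generally scores slightly higher than TfidfVectorizer with both classifiers (a test score of 0.889 versus 0.884 for LogisticRegression, and 0.821 versus 0.817 for MultinomialNB), and three of the four best configurations use unigrams only. This is probably because hate speech in this dataset is signalled mostly by individual words rather than by word combinations.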
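The log line "Fitting 5 folds for each of 8 candidates, totalling 40 fits" follows directly from the parameter grid: every combination of values is one candidate, and each candidate is fitted once per fold. The sketch below reproduces that arithmetic with the standard library. The stop word entry is left out because it has a single value and does not change the count.

```python
from itertools import product

# Same values as the grid above; np.linspace(5000, 20000, 4) yields these four sizes.
param_grid = {
    "vec__max_features": [5000, 10000, 15000, 20000],
    "vec__min_df": [2],
    "vec__max_df": [0.95],
    "vec__ngram_range": [(1, 1), (1, 2)],
}
cv_folds = 5

# One candidate per combination of parameter values.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 8 40
```

Adding a third n-gram range, as in the SVC grid, gives 12 candidates and 60 fits, which is one reason that search takes longer.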

Choosing best vectorizer with SVC

dataframes=[df]
df_names = ['df']
vectorizer_lst = [TfidfVectorizer(),CountVectorizer()]
classifier_lst = [SVC()]
pipe_params = {
                    'vec__max_features': [int(i) for i in np.linspace(5000,20000,4)],
                    'vec__min_df': [2],
                    'vec__max_df': [.95],
                    'vec__ngram_range': [(1,1),(1,2),(1,3)],
                    'vec__stop_words':[stopwords_nltk]
                }
superPipeline(dataframes, vectorizer_lst, classifier_lst, df_names, pipe_params, 'vectorizer_grid_linearmodels.csv', 'tok_lemma')

1 of 2 of methods attempting
Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
/Users/clementow/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 39.0min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed: 54.3min finished
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)


SVC Confusion Matrix:
True Negatives: 7686
False Positives: 0
False Negatives: 4905
True Positives: 0




/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

	f1-score	precision	recall	support
Predict 0	0.758100	0.610436	1.000000	7686.000000
Predict 1	0.000000	0.000000	0.000000	4905.000000
accuracy	0.610436	0.610436	0.610436	0.610436
macro avg	0.379050	0.305218	0.500000	12591.000000
weighted avg	0.462772	0.372632	0.610436	12591.000000

2 of 2 of methods attempting
Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
/Users/clementow/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 44.8min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed: 64.7min finished
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)


SVC Confusion Matrix:
True Negatives: 7624
False Positives: 62
False Negatives: 3498
True Positives: 1407

	f1-score	precision	recall	support
Predict 0	0.810719	0.685488	0.991933	7686.000000
Predict 1	0.441481	0.957794	0.286850	4905.000000
accuracy	0.717258	0.717258	0.717258	0.717258
macro avg	0.626100	0.821641	0.639392	12591.000000
weighted avg	0.666877	0.791569	0.717258	12591.000000

	Classifier	Cross_Val_Score	Data	Test_Score	Vectorizer	vec__max_df	vec__max_features	vec__min_df	vec__ngram_range	vec__stop_words
0	SVC	0.000000	df	0.000000	TfidfVectorizer	0.95	5000	2	(1, 1)	{doesn't, under, was, were, down, against, out...
1	SVC	0.258985	df	0.441481	CountVectorizer	0.95	5000	2	(1, 3)	{doesn't, under, was, were, down, against, out...

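The SVM classifier (SVC) follows the same pattern as the classifiers above: CountVectorizer works better. With TfidfVectorizer the model did not predict a single comment as hate speech (an F1 score of 0 for the hate class), while with CountVectorizer it reached an F1 score of 0.44 with a recall of only 0.29.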
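The TfidfVectorizer + SVC run is a useful reminder of why accuracy alone is misleading here. A model that labels every comment as non-hate still gets about 61% accuracy, yet its recall and F1 for the hate class are zero, and its precision is undefined because nothing was predicted positive. The sketch below recomputes this from the confusion matrix counts reported above. The helper name safe_div is hypothetical.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when the ratio is undefined."""
    return numerator / denominator if denominator else 0.0

# Counts from the TfidfVectorizer + SVC confusion matrix above.
tn, fp, fn, tp = 7686, 0, 4905, 0

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = safe_div(tp, tp + fp)  # undefined (0/0), reported as 0.0
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * tp, 2 * tp + fp + fn)

print(round(accuracy, 6), precision, recall, f1)
```

The accuracy equals the share of non-hate comments in the test set (7686 of 12591), so it carries no information about how well hate speech is detected.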

Choosing best model based on CountVectorizer

With CountVectorizer as the determined vectorizer, it is time to choose the best model that works well with it.

def superPipeline(Dataframes,Vectorizerlist,ClassifierList,Dfnames,pipe_params,methodgridname, df_column='lemma'):
    '''
    Function that handles the pipeline to match each vectorizer with each classifier with their corresponding
    parameters for GridSearch.
    df_column defaults to 'lemma' (an assumption) so that calls omitting it do not raise a TypeError.
    '''
    Methodgrid=[]
    metnum=len(Dataframes)*len(Vectorizerlist)*len(ClassifierList)
    n=0
    for index,df in enumerate(Dataframes):
        X=Dataframes[index][df_column]
        y=Dataframes[index]['hate']
        X_train,X_test,y_train,y_test=train_test_split(X,y,stratify=y,random_state=28)

        for Vectorizer in Vectorizerlist:
            for Classifier in ClassifierList:
                n+=1
                print(f'{n} of {metnum} of methods attempting')
                method={}
                pipe = Pipeline([
                    ('vec', Vectorizer ),
                    ('class', Classifier)
                ])

                gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5,verbose=1,n_jobs=1, scoring='f1')
                gs.fit(X_train, y_train)
                method=(gs.best_params_)
                method['Cross_Val_Score']=(gs.best_score_)
                method['Test_Score']=gs.score(X_test,y_test)
                method['Vectorizer']=str(Vectorizer).split('(')[0]
                method['Data']=str(Dfnames[index])
                method['Classifier']=str(Classifier).split('(')[0]
                Methodgrid.append(method)

                tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test)).ravel()
                print(f"{str(Classifier).split('(')[0]} Confusion Matrix:")
                print(f"True Negatives: {tn}")
                print(f"False Positives: {fp}")
                print(f"False Negatives: {fn}")
                print(f"True Positives: {tp}")
                print('\n')

                report = classification_report(y_test, gs.predict(X_test), target_names=['Predict 0', 'Predict 1'], output_dict=True)
                class_table = pd.DataFrame(report).transpose()
                display(class_table)

    Methodgrid=pd.DataFrame(Methodgrid)
    Methodgrid.to_csv(methodgridname,index=False)
    return Methodgrid

dataframes=[df]
df_names = ['df']
vectorizer_lst = [CountVectorizer()]
classifier_lst = [LogisticRegression(), MultinomialNB(), ExtraTreesClassifier()]
pipe_params = {
                    'vec__max_features': [int(i) for i in np.linspace(5000,20000,4)],
                    'vec__min_df': [2],
                    'vec__max_df': [.95],
                    'vec__ngram_range': [(1,1),(1,2)],
                    'vec__stop_words':[stopwords_nltk]
                }
superPipeline(dataframes, vectorizer_lst, classifier_lst, df_names, pipe_params, 'class_grid.csv', 'lemma')

1 of 3 of methods attempting
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.6min finished
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)


LogisticRegression Confusion Matrix:
True Negatives: 7219
False Positives: 467
False Negatives: 925
True Positives: 3980

	f1-score	precision	recall	support
Predict 0	0.912066	0.886419	0.939240	7686.000000
Predict 1	0.851155	0.894985	0.811417	4905.000000
accuracy	0.889445	0.889445	0.889445	0.889445

2 of 3 of methods attempting
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:   59.2s finished


MultinomialNB Confusion Matrix:
True Negatives: 6309
False Positives: 1377
False Negatives: 874
True Positives: 4031

	f1-score	precision	recall	support
Predict 0	0.848611	0.878324	0.820843	7686.000000
Predict 1	0.781732	0.745377	0.821814	4905.000000
accuracy	0.821222	0.821222	0.821222	0.821222

3 of 3 of methods attempting
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  3.9min finished
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)


ExtraTreesClassifier Confusion Matrix:
True Negatives: 7092
False Positives: 594
False Negatives: 1048
True Positives: 3857

	f1-score	precision	recall	support
Predict 0	0.896247	0.871253	0.922717	7686.000000
Predict 1	0.824498	0.866547	0.786340	4905.000000
accuracy	0.869589	0.869589	0.869589	0.869589

	Classifier	Cross_Val_Score	Data	Test_Score	Vectorizer	vec__max_df	vec__max_features	vec__min_df	vec__ngram_range	vec__stop_words
0	LogisticRegression	0.886686	df	0.889445	CountVectorizer	0.95	20000	2	(1, 1)	{whom, me, until, m, couldn, you'd, her, but, ...
1	MultinomialNB	0.821159	df	0.821222	CountVectorizer	0.95	5000	2	(1, 1)	{whom, me, until, m, couldn, you'd, her, but, ...
2	ExtraTreesClassifier	0.866697	df	0.869589	CountVectorizer	0.95	15000	2	(1, 2)	{whom, me, until, m, couldn, you'd, her, but, ...

As we can see, the best F1 score still goes to LogisticRegression with a unigram CountVectorizer, at 88.16% (macro average). The model with the fewest False Negatives is Multinomial Naive Bayes (874); however, its score is the worst of the above classifiers. LogisticRegression comes second for the fewest False Negatives (925).
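The 88.16% quoted for LogisticRegression appears to be the unweighted (macro) average of the two per-class F1 scores rather than the F1 score of the hate class alone (0.851). This is an inference from the reported numbers, and the short check below recomputes it from the classification reports above.

```python
# Per-class F1 scores ("Predict 0", "Predict 1") copied from the reports above.
per_class_f1 = {
    "LogisticRegression": (0.912066, 0.851155),
    "MultinomialNB": (0.848611, 0.781732),
    "ExtraTreesClassifier": (0.896247, 0.824498),
}

# Macro average: plain mean of the per-class scores, ignoring class sizes.
macro_f1 = {name: sum(scores) / len(scores) for name, scores in per_class_f1.items()}

for name, score in macro_f1.items():
    print(f"{name}: {score:.2%}")
```

Because the macro average weights both classes equally, it sits above the hate-class F1 whenever the model does better on the majority non-hate class, as is the case for all three models here.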

SVC Optimization

So far the LogisticRegression model works best for classifying whether a comment in this dataset is hate speech or not. Let’s try another model, a support vector classifier (SVC), to see if the score can be improved further.

dataframes=[df]
df_names = ['df']
vectorizer_lst = [CountVectorizer(max_df=0.95, min_df=2, ngram_range=(1,1))]
classifier_lst = [SVC()]
pipe_params = {
                'class__C':[0.1,1,10]
                }

superPipeline(dataframes, vectorizer_lst, classifier_lst, df_names, pipe_params, 'svm.csv', 'tok_lemma')

1 of 1 of methods attempting
Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed: 53.8min finished


SVC Confusion Matrix:
True Negatives: 7407
False Positives: 279
False Negatives: 1291
True Positives: 3614

	f1-score	precision	recall	support
Predict 0	0.904175	0.851575	0.963700	7686.000000
Predict 1	0.821550	0.928333	0.736799	4905.000000
accuracy	0.875308	0.875308	0.875308	0.875308
macro avg	0.862863	0.889954	0.850250	12591.000000
weighted avg	0.871987	0.881477	0.875308	12591.000000

	Classifier	Cross_Val_Score	Data	Test_Score	Vectorizer	class__C
0	SVC	0.792316	df	0.82155	CountVectorizer	10

The SVC model has a macro-averaged F1 score of 86.29%, which is decent but still short of the LogisticRegression performance (88.16% before optimization). Although it produced fewer False Positives (279), it failed to detect many hate speech comments, which resulted in a high number of False Negatives (1,291).

Ensemble models

Next, we will try ensemble models to see how well they do and whether any of them can be a decent contender to LogisticRegression.

dataframes=[df]
df_names = ['df']
vectorizer_lst = [CountVectorizer()]
classifier_lst = [RandomForestClassifier(n_estimators=100), GradientBoostingClassifier(n_estimators=100), AdaBoostClassifier(n_estimators=100)]
pipe_params = {
                    'vec__max_features': [int(i) for i in np.linspace(5000,20000,4)],
                    'vec__min_df': [2],
                    'vec__max_df': [.95],
                    'vec__ngram_range': [(1,1),(1,2)],
                    'vec__stop_words':[stopwords_nltk]
                }
superPipeline(dataframes, vectorizer_lst, classifier_lst, df_names, pipe_params, 'class_ensemble_grid.csv')

1 of 3 of methods attempting
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed: 34.0min finished


RandomForestClassifier Confusion Matrix:
True Negatives: 7126
False Positives: 559
False Negatives: 842
True Positives: 4064

	precision	recall	f1-score	support
Predict 0	0.894327	0.927261	0.910496	7685.00000
Predict 1	0.879083	0.828373	0.852975	4906.00000
accuracy	0.888730	0.888730	0.888730	0.88873

2 of 3 of methods attempting
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  5.9min finished


GradientBoostingClassifier Confusion Matrix:
True Negatives: 7329
False Positives: 356
False Negatives: 1016
True Positives: 3890

	precision	recall	f1-score	support
Predict 0	0.878250	0.953676	0.914410	7685.000000
Predict 1	0.916156	0.792907	0.850087	4906.000000
accuracy	0.891033	0.891033	0.891033	0.891033

3 of 3 of methods attempting
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  4.0min finished


AdaBoostClassifier Confusion Matrix:
True Negatives: 7282
False Positives: 403
False Negatives: 927
True Positives: 3979

	precision	recall	f1-score	support
Predict 0	0.887075	0.947560	0.916321	7685.000000
Predict 1	0.908033	0.811048	0.856804	4906.000000
accuracy	0.894369	0.894369	0.894369	0.894369

	vec__max_df	vec__max_features	vec__min_df	vec__ngram_range	vec__stop_words	Cross_Val_Score	Test_Score	Vectorizer	Data	Classifier
0	0.95	15000	2	(1, 2)	{hers, you'll, ain, being, shouldn, isn't, aga...	0.861168	0.852975	CountVectorizer	df	RandomForestClassifier
1	0.95	20000	2	(1, 1)	{hers, you'll, ain, being, shouldn, isn't, aga...	0.856908	0.850087	CountVectorizer	df	GradientBoostingClassifier
2	0.95	10000	2	(1, 2)	{hers, you'll, ain, being, shouldn, isn't, aga...	0.861670	0.856804	CountVectorizer	df	AdaBoostClassifier

AdaBoost does best among the ensemble models, with a macro-averaged F1 score of 88.66%. However, since we want the model to minimise false negatives, the RandomForest classifier is preferable: it has the fewest false negatives (842, versus 927 for AdaBoost and 1,016 for GradientBoosting) and a decent macro-averaged F1 score of 88.17%. This slightly edges LogisticRegression (before optimization) by 0.01%.
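
The percentages quoted above match the macro-averaged F1, that is, the unweighted mean of the two per-class F1 scores. As a worked check, the sketch below recomputes it from the confusion-matrix counts printed earlier. The helper names `f1_from_counts` and `macro_f1` are illustrative and not part of the original notebook.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one class from its true positives, false positives and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


def macro_f1(tn, fp, fn, tp):
    """Unweighted mean of the F1 scores of class 1 (hate) and class 0 (no hate)."""
    f1_hate = f1_from_counts(tp, fp, fn)
    # For class 0 the roles swap: TN are its hits, FN its false positives, FP its misses.
    f1_no_hate = f1_from_counts(tn, fn, fp)
    return (f1_hate + f1_no_hate) / 2


# Counts copied from the confusion matrices printed above: (TN, FP, FN, TP)
results = {
    "RandomForestClassifier": (7126, 559, 842, 4064),
    "GradientBoostingClassifier": (7329, 356, 1016, 3890),
    "AdaBoostClassifier": (7282, 403, 927, 3979),
}

for name, (tn, fp, fn, tp) in results.items():
    recall_hate = tp / (tp + fn)
    print(f"{name}: macro F1 = {macro_f1(tn, fp, fn, tp):.4f}, "
          f"hate recall = {recall_hate:.4f}, false negatives = {fn}")
```

Reading the three rows side by side makes the trade-off explicit: AdaBoost has the highest macro F1, while RandomForest has the highest recall on the hate class.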

RandomForest optimization

Since RandomForest and LogisticRegression are the top two models and perform very close to each other, we shall optimize the parameters of each model.

dataframes=[df]
df_names = ['df']
vectorizer_lst = [CountVectorizer(max_features=15000, max_df=0.95, min_df=2, ngram_range=(1,2), stop_words=stopwords_nltk)]
classifier_lst = [RandomForestClassifier()]
pipe_params = {
               'class__n_estimators': [10, 100, 200],
               'class__max_depth': [None, 1, 3, 5, 7, 9],
               'class__max_features': [3, 5, 6]
                }
superPipeline(dataframes, vectorizer_lst, classifier_lst, df_names, pipe_params, 'class_rf_grid.csv')

1 of 1 of methods attempting
Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 180 out of 180 | elapsed: 76.4min finished


RandomForestClassifier Confusion Matrix:
True Negatives: 7448
False Positives: 237
False Negatives: 1877
True Positives: 3029

	precision	recall	f1-score	support
Predict 0	0.798713	0.969161	0.875720	7685.000000
Predict 1	0.927434	0.617407	0.741312	4906.000000
accuracy	0.832102	0.832102	0.832102	0.832102

	class__max_depth	class__max_features	class__n_estimators	Cross_Val_Score	Test_Score	Vectorizer	Data	Classifier
0	None	6	200	0.735024	0.741312	CountVectorizer	df	RandomForestClassifier

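The grid search over the number of estimators, tree depth and features per split was meant to ensure that the model generalises well to unseen data. However, the selected configuration compromises on F1 score and recall, especially for the hate speech class, where recall drops to 61.7% and F1 to 74.1%.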

Logistic Regression optimization

dataframes=[df]
df_names = ['df']
vectorizer_lst = [CountVectorizer()]
classifier_lst = [LogisticRegression()]
pipe_params = {

                    'vec__max_features': [10000,15000,17500,None],
                    'vec__min_df': [2,3],
                    'vec__max_df': [.95,.9],
                    'vec__ngram_range': [(1,1), (1,2),(1,3)],
                    'vec__stop_words':[stopwords_nltk],
                    'class__C':[0.1,1,10]
                }

superPipeline(dataframes, vectorizer_lst, classifier_lst, df_names, pipe_params, 'logreg.csv')

1 of 1 of methods attempting
Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  8.8min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 22.8min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed: 44.0min finished
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)


LogisticRegression Confusion Matrix:
True Negatives: 7319
False Positives: 367
False Negatives: 949
True Positives: 3956

	f1-score	precision	recall	support
Predict 0	0.917513	0.885220	0.952251	7686.000000
Predict 1	0.857391	0.915105	0.806524	4905.000000
accuracy	0.895481	0.895481	0.895481	0.895481

	Classifier	Cross_Val_Score	Data	Test_Score	Vectorizer	class__C	vec__max_df	vec__max_features	vec__min_df	vec__ngram_range	vec__stop_words
0	LogisticRegression	0.853168	df	0.857391	CountVectorizer	0.1	0.95	10000	3	(1, 1)	{whom, me, until, m, couldn, you'd, her, but, ...

After optimization, LogisticRegression is the superior model: it has a better F1 score and its results are more interpretable. It also generalises better to unseen data than RandomForest.
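
The log line "Fitting 5 folds for each of 144 candidates, totalling 720 fits" follows directly from the size of the parameter grid. The sketch below reproduces that count using only the standard library. The stop-word set is replaced by a placeholder string, since only the number of options matters here.

```python
from itertools import product

# Same option lists as the LogisticRegression grid above; the stop-word set is
# replaced by a placeholder because only the number of options matters here.
pipe_params = {
    "vec__max_features": [10000, 15000, 17500, None],
    "vec__min_df": [2, 3],
    "vec__max_df": [0.95, 0.9],
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vec__stop_words": ["<stopwords_nltk>"],
    "class__C": [0.1, 1, 10],
}

# Every combination of one value per parameter is one grid-search candidate.
candidates = [dict(zip(pipe_params, combo)) for combo in product(*pipe_params.values())]

n_folds = 5
n_candidates = len(candidates)
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

Adding the two extra n-gram ranges used in the POS experiment, (1, 4) and (1, 5), raises this to 240 candidates and 1,200 fits per vectorizer.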

Modelling using POS Tags

By modelling the part-of-speech (POS) tags of the comments in our dataset, we want to see whether the grammatical structure of the insults carries any signal that can be used as features for classification.

Pre-processing

X = df['pos']
y = df['hate']

X_train_pos, X_test_pos, y_train_pos, y_test_pos = train_test_split(X, y, stratify=y, random_state=28)

y_train_pos.value_counts(normalize=True)

0    0.610389
1    0.389611
Name: hate, dtype: float64

Modelling

We want to see whether increasing the n-gram range of the vectorizers leads to more effective classification.

dataframes=[df]
df_names = ['df']
vectorizer_lst = [TfidfVectorizer(), CountVectorizer()]
classifier_lst = [LogisticRegression()]
pipe_params = {

                    'vec__max_features': [10000,15000,17500,None],
                    'vec__min_df': [2,3],
                    'vec__max_df': [.95,.9],
                    'vec__ngram_range': [(1,1), (1,2),(1,3), (1,4), (1,5)],
                    'vec__stop_words':[stopwords_nltk],
                    'class__C':[0.1,1,10]
                }

superPipeline(dataframes, vectorizer_lst, classifier_lst, df_names, pipe_params, 'logreg_pos_ngram.csv', 'pos')

1 of 2 of methods attempting
Fitting 5 folds for each of 240 candidates, totalling 1200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 20.2min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 38.1min
/Users/clementow/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
[Parallel(n_jobs=-1)]: Done 1200 out of 1200 | elapsed: 63.3min finished
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)


LogisticRegression Confusion Matrix:
True Negatives: 5701
False Positives: 1985
False Negatives: 2935
True Positives: 1970

	f1-score	precision	recall	support
Predict 0	0.698566	0.660144	0.741738	7686.000000
Predict 1	0.444695	0.498104	0.401631	4905.000000
accuracy	0.609245	0.609245	0.609245	0.609245

2 of 2 of methods attempting
Fitting 5 folds for each of 240 candidates, totalling 1200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 20.3min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 52.8min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 90.7min
[Parallel(n_jobs=-1)]: Done 1200 out of 1200 | elapsed: 145.6min finished
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
/Users/clementow/anaconda3/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)


LogisticRegression Confusion Matrix:
True Negatives: 6244
False Positives: 1442
False Negatives: 3388
True Positives: 1517

	f1-score	precision	recall	support
Predict 0	0.721099	0.648256	0.812386	7686.000000
Predict 1	0.385809	0.512673	0.309276	4905.000000
accuracy	0.616393	0.616393	0.616393	0.616393

	Classifier	Cross_Val_Score	Data	Test_Score	Vectorizer	class__C	vec__max_df	vec__max_features	vec__min_df	vec__ngram_range	vec__stop_words
0	LogisticRegression	0.447407	df	0.444695	TfidfVectorizer	10	0.95	10000	3	(1, 5)	{whom, me, until, m, couldn, you'd, her, but, ...
1	LogisticRegression	0.413341	df	0.385809	CountVectorizer	1	0.95	10000	3	(1, 3)	{whom, me, until, m, couldn, you'd, her, but, ...

The grid search results above show that there is no obvious pattern the model can learn from the POS tags of the comments: the best F1 score on the hate class is only 44.5%. This is likely because most comments are grammatically structured in the same way whether or not they are hate speech.
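
To illustrate why POS-tag n-grams carry little signal, the sketch below builds n-grams from two hypothetical tag sequences. The tags are Penn Treebank style and written by hand, not produced by the notebook's tagger. An insult and a neutral sentence with the same grammatical shape yield identical features, so a classifier working only on these features cannot separate them.

```python
def tag_ngrams(tags, n_min=1, n_max=3):
    """Return all contiguous n-grams of POS tags for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tags) - n + 1):
            grams.append(" ".join(tags[start:start + n]))
    return grams


# Hand-written, hypothetical tag sequences (not output from the notebook's tagger).
# "You are a <slur>"  ->  pronoun, verb, determiner, noun
hateful_tags = ["PRP", "VBP", "DT", "NN"]
# "You are a teacher" ->  pronoun, verb, determiner, noun
neutral_tags = ["PRP", "VBP", "DT", "NN"]

hateful_features = tag_ngrams(hateful_tags)
neutral_features = tag_ngrams(neutral_tags)

print(hateful_features)
print("Identical feature sets:", hateful_features == neutral_features)
```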

Models with best parameters

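def print_results(model, pred):
    # Build the confusion matrix from the predictions passed in, so that it
    # always agrees with the classification report below.
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()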
    print(f"{str(model).split('(')[0]} Confusion Matrix:")
    print(f"True Negatives: {tn}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Positives: {tp}")
    print('\n')

    report = classification_report(y_test, pred, target_names=['Predict 0', 'Predict 1'], output_dict=True)
    class_table = pd.DataFrame(report).transpose()
    display(class_table)

X=df['tok_lemma']
y=df['hate']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=28)

Logistic Regression

cvec = CountVectorizer(max_df=0.95, max_features=10000, min_df=3, ngram_range=(1,1))
logreg = LogisticRegression(C=1.0)

X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)
logreg.fit(X_train_cvec, y_train)
pred_lr = logreg.predict(X_test_cvec)

/Users/clementow/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

print_results(logreg, pred_lr)

LogisticRegression Confusion Matrix:
True Negatives: 7183
False Positives: 503
False Negatives: 920
True Positives: 3984

	precision	recall	f1-score	support
Predict 0	0.886462	0.934556	0.909874	7686.000000
Predict 1	0.887898	0.812398	0.848472	4904.000000
accuracy	0.886974	0.886974	0.886974	0.886974
macro avg	0.887180	0.873477	0.879173	12590.000000
weighted avg	0.887021	0.886974	0.885957	12590.000000

Logistic Regression with balanced classes

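def get_class_weights(y):
    # value_counts() is indexed by class label, so iterate over its items
    # rather than enumerating positions (which are sorted by frequency).
    counts = y.value_counts()
    majority = counts.max()
    return {cls: float(majority / count) for cls, count in counts.items()}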

class_weights = get_class_weights(y)
class_weights

{0: 1.0, 1: 1.5673209278613307}
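
For comparison, scikit-learn's `class_weight='balanced'` option uses the heuristic `n_samples / (n_classes * count)`. The sketch below applies both formulas to illustrative class counts. The counts are the test-split supports of 7,686 and 4,904 printed earlier in this post, used only as an example because the full-data counts are not shown. The absolute weights differ, but the hate class is up-weighted by the same factor relative to class 0.

```python
def majority_ratio_weights(counts):
    """Weights as in get_class_weights above: majority count divided by each class count."""
    majority = max(counts.values())
    return {cls: majority / count for cls, count in counts.items()}


def balanced_weights(counts):
    """scikit-learn's 'balanced' heuristic: n_samples / (n_classes * class count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative counts only: the test-split supports printed earlier in this post.
counts = {0: 7686, 1: 4904}

ratio_w = majority_ratio_weights(counts)
balanced_w = balanced_weights(counts)

print("majority ratio:", ratio_w)
print("balanced:      ", balanced_w)
# Both schemes up-weight the hate class by the same factor relative to class 0.
print("relative weight of class 1:", ratio_w[1] / ratio_w[0], balanced_w[1] / balanced_w[0])
```

Because the class weights scale the data-fit term of the loss, a uniform rescaling of all weights should behave much like a change in the regularisation strength `C`, so the two schemes are not guaranteed to give identical models at the same `C`.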

cvec = CountVectorizer(max_df=0.95, max_features=10000, min_df=3, ngram_range=(1,1))
logreg_bal = LogisticRegression(C=1.0, class_weight=class_weights)

X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)
logreg_bal.fit(X_train_cvec, y_train)
pred_lr_bal = logreg_bal.predict(X_test_cvec)

/Users/clementow/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

print_results(logreg_bal, pred_lr_bal)

LogisticRegression Confusion Matrix:
True Negatives: 7063
False Positives: 623
False Negatives: 845
True Positives: 4059

	precision	recall	f1-score	support
Predict 0	0.893146	0.918944	0.905861	7686.0000
Predict 1	0.866937	0.827692	0.846860	4904.0000
accuracy	0.883400	0.883400	0.883400	0.8834
macro avg	0.880042	0.873318	0.876361	12590.0000
weighted avg	0.882937	0.883400	0.882879	12590.0000

Best Model Comparison Summary

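	Macro F1 score	Macro recall
Logistic Regression	87.91%	87.35%
Logistic Regression with class weights	87.64%	87.33%

Logistic Regression without class weights is the best performing model, with a macro-averaged F1 score of 87.91% and the best macro-averaged recall. Note that the class-weighted model has slightly higher recall on the hate class itself (82.77% versus 81.24%) and fewer false negatives (845 versus 920), at the cost of more false positives (623 versus 503).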

Best Model Interpretation

coefs=logreg.coef_[0]

word_coef = pd.DataFrame({'word': cvec.get_feature_names() , 'coeff': np.exp(coefs)})

print("Top 50 Features for Logistic Regression & CVec on Hate = 1")
word_coef.sort_values(by='coeff' , ascending=False).head(50)

Top 50 Features for Logistic Regression & CVec on Hate = 1

	word	coeff
6017	nigger	376.591530
3113	faggot	363.563494
7510	retard	161.153371
7512	retarded	139.822428
2057	cunt	100.222272
8388	spic	85.527322
5872	muzzie	79.058264
9730	wetback	72.413391
3111	fag	66.935363
2701	dyke	45.077296
7511	retardation	35.366813
4879	kike	33.158387
9230	twat	32.439435
9801	wigger	29.150413
3114	faggotry	25.496369
7513	retards	23.443762
7182	raghead	22.446466
6018	niggers	16.405627
9110	tranny	15.947645
5119	libtard	12.504445
2061	cunty	11.256686
4123	homos	9.526618
3117	faggy	9.053693
9548	vietnam	8.863271
598	autism	7.939513
4119	homo	7.409377
5818	mudshark	7.393040
6010	nig	7.299513
7112	pussyboy	6.797214
2058	cuntfuse	6.649853
8205	slut	6.633148
2059	cuntish	6.316475
5860	mussolini	6.211510
6015	nigga	5.953319
1442	chinaman	5.951831
2060	cunts	5.916409
5780	moslem	5.701641
2939	esque	5.655120
3886	halfwit	5.635636
298	americunt	5.583840
7111	pussy	5.564052
8984	thundercunt	5.550721
9772	whitey	5.458723
6016	niggas	5.407924
5826	mulatto	5.371978
7948	sewage	5.210900
4091	hoe	5.181927
878	bitch	5.125506
7321	redneck	4.943599
8808	tard	4.932617

The model has learned that many different offensive words contribute to hate speech. This interpretability is one reason why Logistic Regression has been the go-to classifier for so many years, with reasonable performance.

Ideally, to reach above 90%, a context-based classifier might be useful to reduce the false negatives.

Unigrams are likely the most useful features for the CountVectorizer because most offensive terms are single words that, on their own, are highly indicative of hate speech.
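
Because the `coeff` column is `np.exp` of the raw coefficient, each value can be read as an odds ratio: one additional occurrence of the word multiplies the odds of the hate class by that factor, holding the other word counts fixed. The sketch below works through this arithmetic for a few values from the feature tables in this section. The 5% baseline probability is an arbitrary assumption chosen for illustration, not a quantity from the model.

```python
import math


def updated_probability(baseline_prob, odds_ratio, occurrences=1):
    """Probability after multiplying the baseline odds by odds_ratio per occurrence."""
    baseline_odds = baseline_prob / (1 - baseline_prob)
    new_odds = baseline_odds * odds_ratio ** occurrences
    return new_odds / (1 + new_odds)


# Odds ratios copied from the feature tables (exp of the logistic regression coefficients).
odds_ratios = {"retard": 161.153371, "redneck": 4.943599, "van": 0.102721}

baseline = 0.05  # assumed baseline probability, for illustration only

for word, ratio in odds_ratios.items():
    log_odds_coefficient = math.log(ratio)  # the raw coefficient before np.exp
    prob = updated_probability(baseline, ratio)
    print(f"{word}: coefficient = {log_odds_coefficient:+.3f}, "
          f"P(hate) goes from {baseline:.2f} to {prob:.3f}")
```

Values above 1 push a comment towards the hate class and values below 1 push it away, which is why the two tables are sorted in opposite directions.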

print("Top 20 Features for Logistic Regression & CVec on Hate = 0")
word_coef.sort_values(by='coeff' , ascending=True).head(20)

Top 20 Features for Logistic Regression & CVec on Hate = 0

	word	coeff
9475	van	0.102721
952	boat	0.133900
6277	organisation	0.147655
2350	detail	0.172020
5363	manipulate	0.175165
1288	carter	0.183946
1187	butch	0.200971
6476	pastor	0.203492
2916	er	0.207313
128	adoption	0.207375
4781	judaism	0.238692
9350	unhinged	0.241239
1471	chuckle	0.241812
1710	complicit	0.242037
4779	jt	0.249178
6789	porno	0.251294
8384	spew	0.254795
2212	defender	0.258314
5308	madness	0.258361
531	assist	0.258451

Unigrams that push a comment away from the hate speech class are usually non-offensive words. Of course, given more context, comments containing them may or may not be hate speech related.

Best Model Misclassifications

# Create figure for distribution graph
plt.figure(figsize = (10,7))

# Creating two histograms of observations, with blue from non-hate and orange from hate
plt.hist(pred_df[pred_df['actual'] == 0]['pred_probs'],
         bins=25,
         color='b',
         alpha = 0.6,
         label='Outcome = 0 (No Hate)')
plt.hist(pred_df[pred_df['actual'] == 1]['pred_probs'],
         bins=25,
         color='orange',
         alpha = 0.6,
         label='Outcome = 1 (Hate)')

# Labeling of axes.
plt.title('Distribution of P(Outcome = 1)', fontsize=20)
plt.ylabel('Frequency', fontsize=18)
plt.xlabel('Predicted Probability that Outcome = 1', fontsize=18)

# Creating of legends
plt.legend(fontsize=20);

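[Figure: Distribution of P(Outcome = 1), overlaid histograms of predicted probabilities for actual non-hate and hate comments]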

print_results(logreg, pred_lr)

LogisticRegression Confusion Matrix:
True Negatives: 7183
False Positives: 503
False Negatives: 920
True Positives: 3984

	precision	recall	f1-score	support
Predict 0	0.886462	0.934556	0.909874	7686.000000
Predict 1	0.887898	0.812398	0.848472	4904.000000
accuracy	0.886974	0.886974	0.886974	0.886974
macro avg	0.887180	0.873477	0.879173	12590.000000
weighted avg	0.887021	0.886974	0.885957	12590.000000
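
The false negatives and false positives examined in the following sections are the rows where the thresholded prediction disagrees with the label. Below is a minimal sketch of that selection, assuming a table with `actual` and `pred_probs` fields like `pred_df` above and a 0.5 decision threshold. The rows are made-up toy values, not data from the notebook.

```python
# Toy rows standing in for pred_df (hypothetical values, for illustration only).
rows = [
    {"comment": "comment A", "actual": 1, "pred_probs": 0.09},
    {"comment": "comment B", "actual": 0, "pred_probs": 0.86},
    {"comment": "comment C", "actual": 1, "pred_probs": 0.97},
    {"comment": "comment D", "actual": 0, "pred_probs": 0.12},
]

THRESHOLD = 0.5  # assumed decision threshold


def split_errors(rows, threshold=THRESHOLD):
    """Return (false_negatives, false_positives) for the given probability threshold."""
    # False negative: actually hate, but the predicted probability is not above the threshold.
    false_negatives = [r for r in rows if r["actual"] == 1 and r["pred_probs"] <= threshold]
    # False positive: actually not hate, but the predicted probability is above the threshold.
    false_positives = [r for r in rows if r["actual"] == 0 and r["pred_probs"] > threshold]
    return false_negatives, false_positives


fn_rows, fp_rows = split_errors(rows)
print("False negatives:", [r["comment"] for r in fn_rows])
print("False positives:", [r["comment"] for r in fp_rows])
```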

Analysis of False Negatives

Warning: This section contains derogatory comments, which are discussed for academic and research purposes. Please skip to the subsequent sections if you find this offensive.

# sampling a subset of 5 false negatives (without replacement, so no comment is shown twice)
lr_fn_df = pd.read_csv('../assets/logreg_fn.csv')
for i in np.random.choice(lr_fn_df.index, 5, replace=False):
    display(lr_fn_df.loc[i,'comment'])
    print('\nResponse:')
    print(lr_fn_df.loc[i,'response'])
    print('\n')
    print(f"Probability: {lr_fn_df.loc[i,'pred_probs']}")
    print('----'*10)

"I shouldn't even have to use the /s, it is incredibly obvious that I am being sarcastic either way."

Response:
['This thread is closed. We ask that you all review our user policies and refrain from hate speech of any kind. This is a final warning. Ty.', 'Its inappropriate to use words that directly attack one based on their sexual preferences (incel and faggot) and gender (cunt).', 'Do not discuss hate speech by using hate speech there are ways to avoid this. Name calling also does not add to the conversation.']

Probability: 0.0933714693538914
----------------------------------------

'I don’t know the full details of the IMC shit. I was bored with the Kraut stupidity the first time round so this round holds little interest for me. The name wild goose however I do know. He is a legitimately horrible individual who is butt buddies with Ralph and the other IBS cunts. I believe he was the one who doxed Sargon and celebrated the suicide of the sister of a GG’er back in the day.   Honestly it’s not what he changed his mind to that makes me distrust him. It’s the degree to which he flipped and the time it took to happen. I don’t give him any credibility and doubt every thing he says and does.   You do you mate. Just remember that someone’s actions belie the truth of their words. '

Response:
['Use of the c-word here is not acceptable because it is hurtful and offensive as it denigrates women.', 'Using the term "c---" pejoratively is offensive and should be avoided.']

Probability: 0.007523960445078103
----------------------------------------

".... they're the 'dumbest of the dumb' who believe the insane Left and try to prove they're NOT against being killed by religious imbeciles !"

Response:
['Use of the r-word is unacceptable in discourse as it demeans and insults people with mental disabilities.', 'I like Cindy Lauper and her song Girls Just Want to Have Fun', 'Please avoid expressions that denigrate women or people of other religions.']

Probability: 0.06008171222560287
----------------------------------------

"Sold yourself out dbag. Throwing buzzwords around instead of thinking makes you a inept yet dangerous person. Seek help, there's no shame in mental illness. "

Response:
["Next we need a law where if you're caught over and over lying about these things you go to jail.", "There's no need for that language. This is a warning.", 'Stop using a medical condition as an insult.']

Probability: 0.11629146743911055
----------------------------------------

"She's a big big BIG phony beyond her heritage Claims.  She was a Financial advisor for the Clinton administration.  She audited the Derivative bonds before the Real Estate bubble even ever happened. She found that it was pure garbage and junk and would destroy the Financial health of America if those type of financial vehicles were allowed to be sold.   Had She had just a smidgen of the meddle Trump has, She would have went public. Instead She was told to keep her finding quiet and was kicked up to the Big corner office. Then upgraded on a fast track to be the Senator.   I knew She was the Shittiest person to be entrusted with creating the CFPB. The verbiage in the legalese of most Financial contracts are more oppressive than ever. Companies can legally fuck you over 10 different ways to Sunday. As long as they mention in their contract to read the terms and condition that may be in other supplemental materials like on a Website, or even have to call or email for the hard copy of those terms. The terms don't even have to be in the fine print on those contracts any more. Just be available somewhere in the universe. But make no mistake the terms and conditions have never been less transparent in the History of Mankind trading goods and services for a currency.   This Bitch is the most Vile and Evil Cunt to ever have existed.  She created more loopholes to fuck over the most vulnerable and week and people don't even realize it. They think the CFPB is there to protect them. It's there to create legal ways to fuck you over. "

Response:
["Name calling the people you disagree with isn't going to help solve anything.", 'Removing the foul language will help others to understand your point of view more.', "While an argument can be made on whether certain derogatory terms are applied evenly, it doesn't help to aggressively label someone that way"]

Probability: 0.01147579121017993
----------------------------------------

The false negatives can be grouped into the following categories.

Subjective

‘I don’t know the full details of the IMC shit. I was bored with the Kraut stupidity the first time round so this round holds little interest for me. The name wild goose however I do know. He is a legitimately horrible individual who is butt buddies with Ralph and the other IBS c****. I believe he was the one who doxed Sargon and celebrated the suicide of the sister of a GG’er back in the day. Honestly it’s not what he changed his mind to that makes me distrust him. It’s the degree to which he flipped and the time it took to happen. I don’t give him any credibility and doubt every thing he says and does. You do you mate. Just remember that someone’s actions belie the truth of their words. ‘

What counts as hate speech is subjective: some human labellers may be more offended by certain swear words that are deemed derogatory, whereas others may find them more acceptable.

Misspelled or joined-together derogatory terms

‘A f***ing life sentence? Jeez, give that lady her p***y pass back.’

“Hell if you’re over here it’s the only word, damn thing fits in just about ever c****ing sentence you can come up with.”

‘The rest of his life in Pakistan? Not so sure that’s a preferable option to being treated like a hero as part of a Moslim gang in some British jail.’

Sometimes the words are joined together or separated, deliberately or accidentally. Since each misspelled variant is rare in the training data, the model does not recognise it as hate speech related.
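
A unigram bag-of-words model only scores tokens that match its vocabulary exactly, so a spelling variant contributes nothing to the prediction. The sketch below shows the miss with a tiny vocabulary taken from the feature table above, then one possible mitigation using `difflib` from the standard library. The fuzzy matching step is an illustrative suggestion, not something the original notebook does, and the 0.8 cutoff is an arbitrary assumption.

```python
import difflib

# A few tokens that appear in the fitted vocabulary (see the feature table above).
vocabulary = ["moslem", "halfwit", "redneck"]


def known_tokens(text, vocabulary):
    """Tokens of text that match the vocabulary exactly, as a unigram vectorizer would."""
    return [tok for tok in text.lower().split() if tok in vocabulary]


def fuzzy_tokens(text, vocabulary, cutoff=0.8):
    """Map each token to its closest vocabulary entry, if one is similar enough."""
    matches = []
    for tok in text.lower().split():
        close = difflib.get_close_matches(tok, vocabulary, n=1, cutoff=cutoff)
        if close:
            matches.append((tok, close[0]))
    return matches


comment = "part of a Moslim gang"
print(known_tokens(comment, vocabulary))   # exact matching finds no vocabulary token
print(fuzzy_tokens(comment, vocabulary))   # fuzzy matching maps the variant to a known token
```

A looser cutoff would catch more variants but would also start mapping harmless words onto offensive ones, so any such step would need to be validated against false positives.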

Dubious ones

“Cool, now we know it’s false because you said it. :D”

“Half these clowns on here can’t even spell philosophy without the spell checker!”

The comments above do not seem to be hate speech in nature, so their labels might be due to mislabelling.

Contextual to the conversation

‘Exacly my point, and thats why we have the second amendment so if any of those monkeys try and give me consequences for my speech i can blow them away’

“Being white just means your better than anyone else and should go the extra length for all the dumb brown people in the world(SJW logic) Based on various garbage articles I’ve seen online(vox buzzfeed salon etc) Black people can’t figure out how to be places okkn time so we need to be patient with them. Black people can’t figure out basic manners and politeness. Black people are scared of milk. Black peop e need their own little safe spaces so e don’t spook them. Black people can’t find jobs and should live off the table scraps of white people. I could go on and on. If you frame it this way, it’s pretty clear SJW’s are convijnced white people are superior in every way, and shouldn’t be held to the same standard. Being black is a mentalcretardation(and any other skin color to)”

’ Wait until Canadian women’s muslim owners are fucking them at will and then selling them to the highest bidder! They will wish they had moved to the United States where a president would protect them! ‘

“Sold yourself out dbag. Throwing buzzwords around instead of thinking makes you a inept yet dangerous person. Seek help, there’s no shame in mental illness. “

Finally, some comments do not contain any derogatory words and yet are still deemed hate speech given the wider context of the conversation.

Analysis of False Positives

Warning: This section contains derogatory comments, which are discussed for academic and research purposes. Please skip to the subsequent sections if you find this offensive.

# sampling a subset of 10 false positives (without replacement, so no comment is shown twice)
lr_fp_df = pd.read_csv('../assets/logreg_fp.csv')
for i in np.random.choice(lr_fp_df.index, 10, replace=False):
    display(lr_fp_df.loc[i,'comment'])
    print('\n')
    print(f"Probability: {lr_fp_df.loc[i,'pred_probs']}")
    print('----'*10)

'Point taken.  When I see gang rape, I think Muzzie.'

Probability: 0.8553899900368416
----------------------------------------

'>You\'re still pretending that there are radical dominionists in the wings, when the reality is that the contemporary right has become substantially more secular and in some ways much more socially liberal than it used to be.  If the contemporary right is becoming more liberal, that doesn\'t mean the contemporary right is changing - it means more people are accepting of liberal values (which the right has traditionally not been). And I fully agree.   But, I honestly think you are trying to paint a picture of a new, improved, updated right-wing - accepting of gays, accepting of personal freedoms when it comes to issues such as abortion etc... which at best, is simply a thinly veneered attempt to attract "new blood" by throwing out some of the ideas that simply aren\'t going to wash in the modern world (eg - Milo being "flamboyantly gay" - while at the same time taking an anti-gay stance on just about every gay rights issue - such as gay marriage which, even though he has indulged in it himself, he claims should be "called something else".)  And at worst (in the case of the far-right) is exactly the same as the old-right has always been.  You have to admit, the so-called new, "liberal" right wing which you are trying to convince me exists, and the old fashioned Bible-thumping, cross-burning right-wing have an awful lot in common - Love of Trump, hatred of feminism, hatred of Islam, rejection of #metoo, belief in traditional gender roles, willingness to engage in conspiracy theories (Pizza Gate, Sandy Hook being a false flag) and, as we\'re discussing this on a Gamergate subreddit, hatred of Anita Sarkeesian/Zoe Quinn.   And if we include Breitbart readers/commenters (who certainly would include themselves, and I don\'t really think you can exclude them simply because they don\'t fit your idea of what constitutes the "new right"), then we can include white nationalism in the mix, too.  
Breitbart, by the way, to some extent, represents the more moderate end of the "far right" - I could easily have cited Andrew Anglin or Richard Spencer - outright white nationalists.    You may say they are different, but they are unequivocally the same in many ways. So it\'s really just a PR exercise.   '

Probability: 0.7350872438669452
----------------------------------------

'That is a fair concern. However, I am a hillbilly stuck in Denver. Everytime I hear one of these harpies try to to act like we are evil men because we are white and straight, I feel like reminding them just how dangerous we really are. How easy would it be for you and your buddies to leave many leftist hats on the ground? I know it would not be a challenge on my end.'

Probability: 0.7547324069011335
----------------------------------------

"There are a lot of women who are fucking cunts too, but I still love the women in my life and I know there are a majority out there who aren't dumb cunts. It doesn't mean all women have to apologize for the cunts out there. Fucking shit man, fuck progressives."

Probability: 0.9999936816613868
----------------------------------------

'Is this something that YouTube was actually planning on doing? I ask because the source was Keemstar, who no one should have as a source for anything other than how to be a total cunt. '

Probability: 0.6660993244536081
----------------------------------------

'It reminds me of the "Redneck Revolt" contingent of Antifa. They should probably change their name to Redneck Surrender.'

Probability: 0.8374726632394158
----------------------------------------

'The first three episodes were golden, I was blown away. SBC is far better doing this kind of comedy than scripted film roles. I\'ve been a fan of his since the 11 o\'clock show, and he\'s been receiving the same criticisms since those days - he\'s remained remarkably unfazed.  The show ran out of steam though. The latter half of the season was by no means terrible, but it peaked in those first three episodes. Getting that state senator to walk around saying "nigger", getting those guys dressed up as teenage girls for the Mexican coming out party, opening a brand new state of the art mosque - these were all up there with some of the best work SBC has ever done. There was really nowhere left to go after those bits, they set the bar too high too early. '

Probability: 0.6289210491449319
----------------------------------------

'Well i definitely agree with the controller being superior in some cases, I for example prefer using it in dark souls and some other 2D platformers. Just that this case was a shooter, FPS, on PC and she was saying that controller was better wtf are you using M+KB and that IMO is beyond retarded. '

Probability: 0.7490682692133128
----------------------------------------

'> The people who get called racist, xenophobic or "right-wingers" tend to be the ones who conflate all Muslims with radical Islamists.  Ayaan Hirsi Ali, Maajid Nawaz and even *Tommy Fucking Robinson* frequently make clear they aren\'t talking about all Muslims, and they frequently distinguish between "Islamists" and "most Muslims in the west." They still get called "radical racist xenophobic right-wingers."  >The experience you will gain, backed with factual statistics, will show you that the vast majority of Muslims are peaceful and simply want to be left alone to get on with their lives.   The factual statistics also suggest that a very large percentage of Muslims are socially and theologically conservative, and would favor laws that restrict our civil liberties in the name of their religion. A survey of British Muslims found that a *supermajority* thought homosexuality should be criminalized.  Sure, they aren\'t necessarily *jihadists*. But if you believe in forcing society to live by Islamic norms *even via the ballot box* then you\'re an Islamist ("Jihadists" are Islamists who support terrorism as a means to forcing society to live by Islamic norms).   Yes, there are many Muslims whom are not Islamists. But we need to take the problem seriously. We saw the theocratic nonsense spouted by the religious right back in the George W Bush administration for the *threat it was*, and we didn\'t make excuses for them like "but they\'re nonviolent, they want a democratic process to restrict our rights." Islamists should be viewed with the same suspicion, if not more, that was cast upon the Dominionists.  >When people start making claims that "all" Muslims are jihadists, or being Muslim inherently means you are a violent extremist, or follow an extremist ideology - that\'s when you will get called racist, xenophobic or right-wing.  Again, not even Tommy Robinson supports that viewpoint. 
In addition, some people are *very eager to conflate* the proposition that "some verses in the Quran and some theological positions that are prominent in the Islamic world logically support Jihadists" with the proposition that "all Muslims are violent extremists." Take a look at how Sam Harris was treated by Ben Affleck.   >Do you have a list of "official" organs of the "establishment left"? I didn\'t know there was such a thing.  Think "major, center-left social-democratic political parties," left-leaning MSM outlets, and most academics. '

Probability: 0.9918611748424442
----------------------------------------

'Frankly I think it’s a good thing that racists have to hide their disgusting inhumane views, but sadly that’s changing. We’ve got these idiots on TiA, and cunts with tiki torches and cargo shorts on TV. Mental.  '

Probability: 0.9392706010700892
----------------------------------------

The false positives can be grouped into the following categories.

Mislabelled

‘Point taken. When I see gang rape, I think Muzzie.’

‘Frankly I think it’s a good thing that racists have to hide their disgusting inhumane views, but sadly that’s changing. We’ve got these idiots on TiA, and c**** with tiki torches and cargo shorts on TV. Mental. ‘

‘Is this something that YouTube was actually planning on doing? I ask because the source was Keemstar, who no one should have as a source for anything other than how to be a total c***. ‘

‘It reminds me of the “Redneck Revolt” contingent of Antifa. They should probably change their name to Redneck Surrender.’

The comments above contain hateful or derogatory language even though the dataset labels them as non-hate speech. The model did well in being sensitive to such words, and in doing so it surfaced comments that were mislabelled.

Sensitive to strong words

‘Well i definitely agree with the controller being superior in some cases, I for example prefer using it in dark souls and some other 2D platformers. Just that this case was a shooter, FPS, on PC and she was saying that controller was better wtf are you using M+KB and that IMO is beyond retarded. ‘

‘The first three episodes were golden, I was blown away. SBC is far better doing this kind of comedy than scripted film roles. I've been a fan of his since the 11 o'clock show, and he's been receiving the same criticisms since those days - he's remained remarkably unfazed. The show ran out of steam though. The latter half of the season was by no means terrible, but it peaked in those first three episodes. Getting that state senator to walk around saying “n*****”, getting those guys dressed up as teenage girls for the Mexican coming out party, opening a brand new state of the art mosque - these were all up there with some of the best work SBC has ever done. There was really nowhere left to go after those bits, they set the bar too high too early. ‘

‘That is a fair concern. However, I am a hillbilly stuck in Denver. Everytime I hear one of these harpies try to to act like we are evil men because we are white and straight, I feel like reminding them just how dangerous we really are. How easy would it be for you and your buddies to leave many leftist hats on the ground? I know it would not be a challenge on my end.’

In general, the model is sensitive to derogatory terms. We trained it to be so in order to catch more hate speech and lower the number of false negatives.

“Hillbilly” is considered a derogatory term in America for people who live in the countryside. However, it is not hate speech here because the author directs it at themselves and not at other people, so the comment does not satisfy the hate speech definition.

Hence, more context is needed to ascertain whether a comment is indeed hate speech.

Many top features of the model in one comment

“There are a lot of women who are f***ing c**** too, but I still love the women in my life and I know there are a majority out there who aren’t dumb c****. It doesn’t mean all women have to apologize for the c**** out there. F***ing shit man, f*** progressives.”

’> The people who get called racist, xenophobic or “right-wingers” tend to be the ones who conflate all Muslims with radical Islamists. Ayaan Hirsi Ali, Maajid Nawaz and even Tommy Fucking Robinson frequently make clear they aren't talking about all Muslims, and they frequently distinguish between “Islamists” and “most Muslims in the west.” They still get called “radical racist xenophobic right-wingers.” >The experience you will gain, backed with factual statistics, will show you that the vast majority of Muslims are peaceful and simply want to be left alone to get on with their lives. The factual statistics also suggest that a very large percentage of Muslims are socially and theologically conservative, and would favor laws that restrict our civil liberties in the name of their religion. A survey of British Muslims found that a supermajority thought homosexuality should be criminalized. Sure, they aren't necessarily jihadists. But if you believe in forcing society to live by Islamic norms even via the ballot box then you're an Islamist (“Jihadists” are Islamists who support terrorism as a means to forcing society to live by Islamic norms). Yes, there are many Muslims whom are not Islamists. But we need to take the problem seriously. We saw the theocratic nonsense spouted by the religious right back in the George W Bush administration for the threat it was, and we didn't make excuses for them like “but they're nonviolent, they want a democratic process to restrict our rights.” Islamists should be viewed with the same suspicion, if not more, that was cast upon the Dominionists. >When people start making claims that “all” Muslims are jihadists, or being Muslim inherently means you are a violent extremist, or follow an extremist ideology - that's when you will get called racist, xenophobic or right-wing. Again, not even Tommy Robinson supports that viewpoint. 
In addition, some people are very eager to conflate the proposition that “some verses in the Quran and some theological positions that are prominent in the Islamic world logically support Jihadists” with the proposition that “all Muslims are violent extremists.” Take a look at how Sam Harris was treated by Ben Affleck. >Do you have a list of “official” organs of the “establishment left”? I didn't know there was such a thing. Think “major, center-left social-democratic political parties,” left-leaning MSM outlets, and most academics. ‘

Because of the model's sensitivity, a comment that contains many such words is assigned a very high probability of being hate speech. However, the first comment might not really be hate speech; the judgement is subjective, and some readers would still find it offensive.
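To make this effect concrete, here is a minimal sketch of how a bag-of-words logistic regression turns several strongly weighted words into a high probability. The weights, intercept and token names are made up for illustration and are not the coefficients of the fitted model.

```python
import math

# Hypothetical weights for illustration only; NOT the fitted coefficients.
WEIGHTS = {"slur_a": 2.0, "slur_b": 1.5, "insult": 1.0}
INTERCEPT = -3.0


def hate_probability(tokens, weights=WEIGHTS, intercept=INTERCEPT):
    """Sigmoid of the intercept plus the summed weights of the tokens."""
    # Each occurrence of a weighted token adds its weight to the score,
    # which mirrors a count-based bag-of-words representation.
    score = intercept + sum(weights.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-score))


# One weighted word leaves the comment below the 0.5 threshold,
# while several weighted words push it well above.
one_term = hate_probability(["that", "insult"])
many_terms = hate_probability(["slur_a", "slur_b", "insult", "insult"])
print(one_term < 0.5 < many_terms)
```

Since every weighted word adds to the same score, a long comment that merely quotes or discusses such words can be flagged even when its intent is not hateful.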

Overall, the model did well in predicting whether comments are hate speech. However, many of the misclassifications could be avoided with more context, which is an active research topic in NLP.

Limitations and Future Work

One challenge in automatic hate speech detection is the subjectivity of whether a comment counts as hate speech. This can be better managed by having several people label each comment, cross-referencing their labels and taking a majority vote.
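A majority vote over several annotators can be sketched as follows. The function name and the choice to return `None` on a tie are assumptions made for illustration; they are not part of the original labelling process.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label, or None when there is no clear winner."""
    if not votes:
        return None  # nobody has labelled this comment yet
    counts = Counter(votes).most_common()
    # A tie between the two most common labels means there is no majority.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Three hypothetical annotators: 1 = hate speech, 0 = not hate speech.
print(majority_label([1, 1, 0]))  # 1
print(majority_label([1, 0]))     # None
```

Comments that come back as `None` could be sent to an additional annotator instead of being given an arbitrary label.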

Another challenge is that new derogatory slang is coined every few years, so models developed now might become obsolete in the future. Regular retraining on new data sets will therefore be paramount in overcoming this problem.

As with any hate speech classification problem, context is needed in many cases to determine whether a comment is hate speech. Looking at how a word is used within the text, together with linguistic features, gives a better understanding of the comment. Understanding sarcasm is an area of ongoing research that would also help immensely in NLP tasks and improve accuracy. Future models should therefore learn to read the context to the left and right of a target word, or take “multiple views” of the same comment by using multi-view ensemble stacking models.
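As a rough illustration of reading context to the left and right of a target word, the sketch below collects a fixed window of neighbouring tokens. The helper name and the window size are hypothetical, and a real model would learn from such context rather than only extract it.

```python
def context_window(tokens, target, size=2):
    """Return (left, right) token lists around each occurrence of target."""
    windows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - size):i]      # up to `size` tokens before
            right = tokens[i + 1:i + 1 + size]     # up to `size` tokens after
            windows.append((left, right))
    return windows


tokens = "however i am a hillbilly stuck in denver".split()
print(context_window(tokens, "hillbilly"))
# [(['am', 'a'], ['stuck', 'in'])]
```

Here the left context (“am a”) hints that the term is self-directed, which is the kind of signal a bag-of-words model cannot see.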

Conclusions

With the rise of social media and the ability of users to stay anonymous, hate speech detection is more important than ever in the digital age.

We presented current approaches to this classification task and explored different techniques, including deep learning models and state-of-the-art models such as BERT.

Classifier	F1 score	Recall
Logistic Regression	87.91%	87.35%
Logistic Regression with balanced class	87.64%	87.33%
LSTM - word embeddings on dataset	85.18%	-
LSTM & CNN - word embeddings on dataset	84.95%	-
LSTM 1 - pre-trained word embeddings	81.75%	-
LSTM 2 - pre-trained word embeddings	84.36%	-
BERT	87%	87%

Even though context is important in determining whether a comment is hate speech, the simplest classifier, Logistic Regression, is the best performing one, with an F1 score of 87.91% and a recall of 87.35%. This shows that a simpler classifier can at times be the better choice: it is easier to interpret, and its superior F1 score made it easy to select as the best model.

Compared with BERT, the state-of-the-art NLP model, Logistic Regression still performed very well and generalised well for this specific task. It is no wonder that Logistic Regression has been around for many years and continues to be widely used.
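For reference, the two metrics in the table can be computed from confusion-matrix counts as in the sketch below. The counts in the example are invented for illustration and are not the results of any model in this post.

```python
def precision(tp, fp):
    """Share of predicted hate speech comments that really are hate speech."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of actual hate speech comments that the model caught."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts: 80 true positives, 20 false positives, 10 false negatives.
print(round(recall(80, 10), 4), round(f1_score(80, 20, 10), 4))
```

Recall penalises only false negatives, whereas F1 also accounts for false positives through precision, which is why the two are reported together.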

Clement Ow
