Spam Classification Using Word2vec Embeddings
In this tutorial, we will be performing spam classification using word2vec embeddings. However, before we get to the code, let’s take a moment to understand word2vec in more detail.
Word Embeddings
Word embeddings are vector representations of words that capture the semantic and syntactic relationships between words. They are used in many natural language processing (NLP) tasks, such as next word prediction, machine translation, question answering and many more.
Analogy
Let’s say I want to learn the meaning of a new word. I could read a few sentences that contain the word and then try to figure out what it means based on the context.
Word2vec works in a similar way. The model is trained on a large corpus of text, and it learns to represent words as vectors in a high-dimensional space. Vectors for semantically similar words lie close together in that space, while vectors for semantically dissimilar words lie far apart.
“You shall know a word by the company it keeps” — J.R. Firth
Linguistic regularities and patterns in Word2vec
The word representations learned by Word2vec are very interesting because they encode many linguistic regularities and patterns.
For example, the result of the vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) is closer to vec(“Paris”) than to any other word vector.
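If you want to try this analogy yourself, here is a minimal sketch using Gensim’s downloader API. The model name below (“glove-wiki-gigaword-100”, a GloVe model rather than word2vec, and a sizeable download) is only an assumption for illustration; any pretrained word-vector model that exposes most_similar behaves the same way.
import gensim.downloader as api

# Illustrative only: "glove-wiki-gigaword-100" is an assumed pretrained model (a ~100+ MB download)
vectors = api.load("glove-wiki-gigaword-100")

# vec("madrid") - vec("spain") + vec("france") should land near vec("paris")
print(vectors.most_similar(positive=["madrid", "france"], negative=["spain"], topn=3))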
Word2vec Details
Word2vec is a neural network algorithm that learns word embeddings by looking at the context in which words appear.
Training Data Generation
The process of generating training data for this algorithm is as follows:
- Take a large amount of text data and choose a context window (say, a window of 4 words)
- Slide this window across the text and generate training examples for the model, as shown in the sketch below
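As a rough illustration (this helper is hypothetical and not part of the spam-classification pipeline later in the tutorial), pairs of context words and a target word can be generated like this:
def make_training_pairs(tokens, window=2):
    """Slide a window over the tokens and emit (context words, target word) pairs."""
    pairs = []
    for i, target in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        pairs.append((context, target))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
for context, target in make_training_pairs(tokens)[:3]:
    print(context, "->", target)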
Variants
It has two main variants: CBOW and Skip-gram.
CBOW
The continuous bag-of-words (CBOW) model predicts the target word based on the context of the surrounding words in the window.
For example, take the sentence “the quick brown fox jumps over the lazy dog” with a three-word window centered on “fox”: the CBOW model would try to predict the word “fox” based on the context words “brown” and “jumps”.
Skip-gram
The skip-gram model predicts the context words around a given target word.
Similar to the example above, given the target word “fox”, the skip-gram model would try to predict the context words “brown” and “jumps”.
In short, CBOW predicts the target word from a window of context words, while Skip-gram predicts the context words from the target word.
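In Gensim, the variant is selected with the sg parameter of Word2Vec: sg=0 trains CBOW (the default) and sg=1 trains skip-gram. A minimal sketch, using the example sentence above as a toy corpus:
from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

# sg=0 selects CBOW (the default); sg=1 selects skip-gram
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)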
Let’s move on to the coding part now. We will be performing spam classification using word2vec embeddings.
1. Firstly, we need to download the dataset from the following URL: https://www.kaggle.com/datasets/bagavathypriya/spam-ham-dataset?resource=download
2. Once the dataset is downloaded, we need to upload it to our workspace.
from google.colab import files
# In Colab, select the downloaded CSV to upload it into the runtime
uploaded = files.upload()
3. Load the dataset into a Pandas DataFrame.
import pandas as pd

# The file is tab-separated and has no header row
dataset = pd.read_csv('spamhamdata.csv', sep='\t', header=None)
dataset.head()
4. Add column names to the dataset.
dataset.columns = ["label", "content"]
dataset.head(5)
5. Tokenize the content column, as the word2vec model expects a list of words.
import gensim

dataset['content_list'] = dataset['content'].apply(lambda x: gensim.utils.simple_preprocess(x))
dataset.head()
6. Split the dataset into train and test sets.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(dataset['content_list'],
                                                     dataset['label'],
                                                     test_size=0.2)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
7. Train a Word2Vec model on this dataset using the Gensim library. vector_size is the embedding dimension, window is the context window size, and min_count=1 keeps every word that appears at least once.
import gensim

w2v_model = gensim.models.Word2Vec(X_train,
                                   vector_size=100,
                                   window=5,
                                   min_count=1)
8. Based on the word vectors learnt, compute an aggregated (averaged) vector for each message.
import numpy as np

# Keep only tokens that are in the Word2Vec vocabulary
words = set(w2v_model.wv.index_to_key)
X_train_vect = [np.array([w2v_model.wv[i] for i in ls if i in words])
                for ls in X_train]
X_test_vect = [np.array([w2v_model.wv[i] for i in ls if i in words])
               for ls in X_test]

# Average the word vectors of each message; fall back to a zero vector
# if none of the message's tokens are in the vocabulary
X_train_avg = []
for v in X_train_vect:
    X_train_avg.append(v.mean(axis=0) if v.size else np.zeros(w2v_model.vector_size))

X_test_avg = []
for v in X_test_vect:
    X_test_avg.append(v.mean(axis=0) if v.size else np.zeros(w2v_model.vector_size))
9. After getting the sentence vectors, let’s train a random forest classifier on top of them.
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc_classifier = rfc.fit(X_train_avg, y_train)
10. Predict on the test set using the trained classifier.
y_pred = rfc_classifier.predict(X_test_avg)
11. Compute metrics such as accuracy, precision, recall, and the confusion matrix for the predictions.
from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix
accuracy = (y_pred==y_test).sum()/len(y_pred)
precision = precision_score(y_test, y_pred, pos_label='ham')
recall = recall_score(y_test, y_pred, pos_label='ham')
print('Accuracy: {} , Precision: {} , Recall: {}'.format(accuracy, precision, recall))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
Please let me know in the comments if this was helpful. You can also try experimenting with different data splits, word2vec parameters, and classifiers to see how they affect the performance of the model.
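For example, one quick experiment (a sketch that reuses the averaged vectors X_train_avg and X_test_avg from step 8; logistic regression is just one possible alternative) is to swap the random forest for a different classifier:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Reuse the averaged sentence vectors with a different classifier
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_avg, y_train)
print(classification_report(y_test, lr.predict(X_test_avg)))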
References
Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality”: https://arxiv.org/pdf/1310.4546.pdf
Ready to take your coding and ML/AI skills to the next level? Visit my page for details on my 1:1 mentorship program and upcoming webinars!