Comparison of different Word Embeddings on Text Similarity — A use case in NLP
Brief Introduction
Natural Language Processing (NLP) is one of the key components of Artificial Intelligence (AI): it gives machines the ability to understand human language. A lot of information is generated in unstructured format, be it reviews, comments, posts, or articles, and a large amount of this data is in natural language. NLP allows machines to understand and extract patterns from such text data by applying various techniques such as text similarity, information retrieval, document classification, entity extraction, and clustering.
Text Similarity is one of the essential techniques of NLP, used to find the closeness between two chunks of text by their meaning or by their surface form. Computers require data to be converted into a numeric format to perform any machine learning task. To do so, various word embedding techniques such as Bag of Words, TF-IDF, and word2vec are used to encode the text data. This allows you to perform NLP operations such as finding the similarity between two sentences to extract semantically similar questions from an FAQ corpus, searching for similar documents in a database, or recommending semantically similar news articles.
Quick Summary
In this blog, we will walk through the standard approach for text similarity using NLP techniques, which includes text pre-processing, word embedding techniques, and vector similarity. We will go through the details of each method in its own section. We will also highlight a case study where we used text similarity and computed a performance evaluation on test queries. The performance of the approach has been measured based on the output generated after assigning a threshold score for similarity, along with the accuracy of that output.
Methodology
In order to perform text similarity using NLP techniques, these are the standard steps to be followed:
- Text Pre-Processing:
In day-to-day practice, information is gathered from multiple sources, be it the web, documents, or transcriptions from audio. This information may contain various kinds of garbage values, noisy text, and encoding artifacts, which need to be cleaned before performing further NLP tasks. The pre-processing phase should include removing non-ASCII values, special characters, HTML tags, and stop words, converting from the raw format, and so on.
- Feature Extraction:
To convert the text data into a numeric format, it needs to be encoded. Various encoding techniques are widely used to extract word embeddings from text data; such techniques include bag-of-words, TF-IDF, and word2vec.
- Vector Similarity:
Once we have vectors for the given text chunks, statistical methods can be used to compute the similarity between the generated vectors. Such techniques include cosine similarity, Euclidean distance, Jaccard distance, and word mover’s distance. Cosine similarity is the technique most widely used for text similarity.
- Decision Function:
From the similarity score, a custom function needs to be defined to decide whether the score classifies the pair of chunks as similar or not. Cosine similarity returns a score between 0 and 1, where 1 means the two chunks are exactly similar and 0 means they share nothing. In regular practice, if the similarity score is more than 0.5, the pair is likely to be at least somewhat similar, but this can vary based on the application and use case; a minimal decision function is sketched after this list.
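As a concrete illustration of the decision step, a minimal threshold-based sketch (the function name is_similar and the 0.5 default are just illustrative assumptions, not part of the case study below):
def is_similar(similarity_score, threshold=0.5):
    # the threshold is application-specific and should be tuned per use case
    return similarity_score >= threshold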
Implementation Steps
- Data Preparation
Here, we have considered a list of questions and prepared semantically similar questions for them. We have considered all possible pairs of questions along with their actual labels, which define whether the pair of questions is similar or not.
- Data Pre-processing
Once we had prepared the pairs of questions with their actual labels, we performed the following text pre-processing techniques in order to clean the text for the subsequent tasks.
- Uniform case
For uniformity of case, all the sentences are converted to lower case.
- Remove stop words
The most widely used library for pre-processing is NLTK, and the list of stop words it provides includes ‘the’, ‘is’, ‘are’, ‘a’, ‘an’, and so on.
You can install NLTK library using the following command.
$ pip install nltk
Install all the supporting data and libraries.
$ python -m nltk.downloader all
Words like ‘no’ and ‘not’ are used in negative sentences and are useful for semantic similarity. So before removing stop words, observe the data and, based on your application, select which stop words to filter; a variant of the stop-word set that keeps negations is sketched after the pre-processing example below.
- Remove punctuation
Punctuation characters are $, “, !, ?, etc. The Python string class provides the list of punctuation characters. We remove punctuation because it does not provide any information related to semantic similarity.
- Remove non-ASCII characters
Like punctuation, non-ASCII characters are not useful to capture semantic similarity.
Sample code for pre-processing:
import string

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from unidecode import unidecode

def pre_process(corpus):
    # convert the input corpus to lower case.
    corpus = corpus.lower()
    # collect the list of stop words from nltk and punctuation from the
    # string class, and combine them into a single list.
    stopset = stopwords.words('english') + list(string.punctuation)
    # remove stop words and punctuation from the string.
    # word_tokenize splits the input corpus into word tokens.
    corpus = " ".join([i for i in word_tokenize(corpus) if i not in stopset])
    # remove non-ascii characters
    corpus = unidecode(corpus)
    return corpus

pre_process("Sample of non ASCII: Ceñía. How to remove stopwords and punctuations?")
Output:
'sample non ascii cenia remove stopwords punctuations'
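As mentioned in the stop-word step above, negation words can carry meaning that matters for semantic similarity. A small variant of the stop-word set that keeps them (a sketch, not part of the pipeline used later in this post) could be:
# keep negation words, since they can flip the meaning of a sentence
negations = {'no', 'not', 'nor'}
stopset = [w for w in stopwords.words('english') if w not in negations] + list(string.punctuation)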
- Lemmatization
Lemmatization is the process of reducing the morphological variants of a word to its root/base form. The root word is called the lemma. A lemmatization algorithm reduces the word ‘chocolates’ to the root word ‘chocolate’. The NLTK library provides a WordNetLemmatizer.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

sentence = "The striped bats are hanging on their feet for best"
words = word_tokenize(sentence)
for w in words:
    print(w, " : ", lemmatizer.lemmatize(w))
Output:
The : The
striped : striped
bats : bat
are : are
hanging : hanging
on : on
their : their
feet : foot
for : for
best : best
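By default, WordNetLemmatizer treats every token as a noun, which is why ‘are’ and ‘hanging’ come back unchanged above. Passing a part-of-speech hint (shown here purely as an illustration, not as part of our pipeline) changes the result:
print(lemmatizer.lemmatize("are", pos="v"))      # be
print(lemmatizer.lemmatize("hanging", pos="v"))  # hang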
Feature Extraction
Features are the representation of a sequence of words or a sentence as a numeric vector. Using different word embeddings, we can represent the same sentence differently in numbers. Here we will use TF-IDF, Word2Vec and Smooth Inverse Frequency (SIF).
- TF-IDF
Using TF-IDF embeddings, each word is represented as a single scalar number based on its TF-IDF score. TF-IDF is the combination of TF (Term Frequency) and IDF (Inverse Document Frequency). TF gives the count of word t in document d; mathematically, we write tf(t,d). IDF captures how common or rare the word is across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word.
Mathematically,
idf(t,D) = log(N / df_t), where N (or |D|) is the total number of documents and df_t is the number of documents in which the term t appears.
TF-IDF(t,d,D) = tf(t,d) · idf(t,D)
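For a quick worked example (using base-10 logarithms here for simplicity): if a term appears 2 times in a document and occurs in 10 out of 100 documents, then tf(t,d) = 2 and idf(t,D) = log(100/10) = 1, so TF-IDF(t,d,D) = 2 × 1 = 2. Note that concrete implementations such as scikit-learn use smoothed variants of this formula.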
The scikit-learn library provides an easy implementation of TF-IDF. Install it in the current environment using the following command.
$ pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# sentence pair
corpus = ["A girl is styling her hair.", "A girl is brushing her hair."]
for c in range(len(corpus)):
    corpus[c] = pre_process(corpus[c])

# creating the vocabulary using uni-grams and bi-grams
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf_vectorizer.fit(corpus)
feature_vectors = tfidf_vectorizer.transform(corpus)
To create a vocabulary for TF-IDF, we can select different n-grams (groups of n words). For example, “New York” is treated as a single token in a bi-gram vocabulary. The vector generated using TF-IDF is sparse: it gives a zero TF-IDF score to the words that are not present in the document.
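To inspect what the snippet above produced, you can print the learned uni-gram/bi-gram vocabulary and a dense view of the sparse matrix. This assumes the tfidf_vectorizer and feature_vectors defined above and a scikit-learn version that provides get_feature_names_out (older versions expose get_feature_names instead):
# vocabulary learned from the two pre-processed sentences
print(tfidf_vectorizer.get_feature_names_out())
# e.g. ['brushing' 'brushing hair' 'girl' 'girl brushing' 'girl styling' 'hair' 'styling' 'styling hair']
# dense view of the sparse TF-IDF matrix; absent terms get a zero score
print(feature_vectors.toarray())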
- Word2vec
Using Word2Vec embeddings, each word is represented as a multidimensional vector. Two unsupervised algorithms, Skip-gram and CBoW, are used to generate the word embeddings.
The size of the embedding can vary based on the target size selected at the time of training. For this case study, we are using a pre-trained model trained on a Wikipedia dataset. This model has approximately 600,000 words in its vocabulary, and each word is represented by a vector of length 300. In the vector space, similar words lie nearer to each other.
The Gensim library is one of the most popular libraries for word embedding operations. It allows you to load a pre-trained model, extract word vectors, train a model from scratch, and fine-tune a pre-trained model.
Install gensim using the following command.
$ pip install gensim
Load the model:
from gensim.models import Word2Vec
import numpy as np

# give the path of the model to the load function
word_emb_model = Word2Vec.load('word2vec.bin')
To represent a sentence as a vector, we take the mean of the embeddings of all its words that are present in the Word2Vec vocabulary, and ignore the words that are not in the vocabulary.
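A minimal sketch of this averaging step (the helper name get_mean_feature_vector is ours, and the gensim 3.x wv.vocab attribute is assumed, matching the SIF code below):
def get_mean_feature_vector(sentence, word_emb_model=word_emb_model):
    # keep only tokens that are present in the Word2Vec vocabulary
    words = [w for w in sentence.split() if w in word_emb_model.wv.vocab]
    if not words:
        return np.zeros(word_emb_model.vector_size)
    # average the word vectors to get a single sentence vector
    return np.mean([word_emb_model.wv[w] for w in words], axis=0)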
- SIF
SIF also uses Word2Vec word embeddings, but unlike the plain mean, which gives equal weight to every word in the sentence even when a word is irrelevant for semantic similarity, it takes a weighted average of the word embeddings. Every word embedding is weighted by a/(a + p(w)), where a is a parameter that is typically set to 0.001 and p(w) is the estimated frequency of the word in a corpus.
from collections import Counter
import itertools

def map_word_frequency(document):
    return Counter(itertools.chain(*document))

def get_sif_feature_vectors(sentence1, sentence2, word_emb_model=word_emb_model):
    sentence1 = [token for token in sentence1.split() if token in word_emb_model.wv.vocab]
    sentence2 = [token for token in sentence2.split() if token in word_emb_model.wv.vocab]
    # pass the two token lists so that chain(*...) flattens them into words
    word_counts = map_word_frequency([sentence1, sentence2])
    embedding_size = 300  # size of the vectors in the word embeddings
    a = 0.001
    sentence_set = []
    for sentence in [sentence1, sentence2]:
        vs = np.zeros(embedding_size)
        sentence_length = len(sentence)
        for word in sentence:
            a_value = a / (a + word_counts[word])  # smooth inverse frequency, SIF
            vs = np.add(vs, np.multiply(a_value, word_emb_model.wv[word]))  # vs += sif * word_vector
        vs = np.divide(vs, sentence_length)  # weighted average
        sentence_set.append(vs)
    return sentence_set
Vector Similarity
The generated embeddings need to be compared in order to get the semantic similarity between two vectors. A few statistical methods are used to find the similarity between two vectors, which are:
- Cosine Similarity
- Word mover’s distance
- Euclidean distance
- Cosine similarity
It is the most widely used method to compare two vectors. It measures the cosine of the angle between the two vectors, computed from their dot product. For an angle of 0 degrees the cosine is 1, and it is less than 1 for any other angle.
For our case study, we had used cosine similarity.
from sklearn.metrics.pairwise import cosine_similarity

def get_cosine_similarity(feature_vec_1, feature_vec_2):
    return cosine_similarity(feature_vec_1.reshape(1, -1), feature_vec_2.reshape(1, -1))[0][0]
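Putting the pieces together, a small end-to-end usage sketch, reusing the pre_process, get_sif_feature_vectors, and get_cosine_similarity functions defined above on the earlier sentence pair (assuming the words survive the vocabulary filter of the loaded model):
s1 = pre_process("A girl is styling her hair.")
s2 = pre_process("A girl is brushing her hair.")
v1, v2 = get_sif_feature_vectors(s1, s2)
print(get_cosine_similarity(v1, v2))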
- Word mover’s distance
This uses the word embeddings of the words in two texts to measure the minimum distance that the words in one text need to “travel” in semantic space to reach the words in the other text.
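Gensim exposes this measure directly on the word vectors. A brief sketch, assuming the word_emb_model loaded earlier (and, for gensim 3.x, the pyemd package that wmdistance relies on):
# lower distance means the two token lists are semantically closer
distance = word_emb_model.wv.wmdistance("girl styling hair".split(), "girl brushing hair".split())
print(distance)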
- The Euclidean distance
The Euclidean distance between two points is the length of the path connecting them, and the Pythagorean theorem gives this distance. A drawback for text: if one sentence is much longer than the other, the Euclidean distance between their vectors grows, so the sentences look different by this measure even when they have the same meaning.
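For comparison, a Euclidean distance helper in the same style as get_cosine_similarity (the function name is ours, and it uses the NumPy import from earlier):
def get_euclidean_distance(feature_vec_1, feature_vec_2):
    # smaller distance means the vectors (and hence the sentences) are closer
    return np.linalg.norm(feature_vec_1 - feature_vec_2)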
Case Study
Any organization with a large customer base needs customer representatives to maintain its services and improve the customer experience. This can be a product provider, a service provider, or a non-profit organization that regularly communicates with people around the world or within the organization. To resolve common queries, organizations usually maintain a Frequently Asked Questions (FAQ) list with answers, delivered via a web page, chatbot, email, or a customer representative. To automate this process, we can build a system with NLP functionality that reads a query, understands it, finds the semantically similar question present in the FAQ, and returns the answer for it. This makes the process much easier than handling it manually. For this, text similarity can be used to identify a similar question from the FAQ list.
We have used multiple word embedding techniques, along with the required pre-processing tasks, to compute the similarity, and we observed the performance by evaluating the proportion of questions answered automatically and how many of those answers were correct. We have also assigned a custom threshold value for the similarity score, which can differ for each use case based on the business requirement.
- Dataset
We have considered 125 questions and prepared test queries: 4 to 5 paraphrased variants for each question in the question bank. These ~500 queries were used to compute the performance of the system.
- Evaluation
We have applied various pre-processing and word-embedding techniques and evaluated the text-similarity operation on the test queries. For the evaluation, we computed two different scores: 1. the ratio of questions answered, i.e., for which a similar question was found in the knowledge base, and 2. the number of correctly answered questions. The aim was to get the maximum number of questions answered with a high accuracy score.
Conclusion
In this blog, the overall approach to text similarity using NLP techniques has been explained, including text pre-processing, feature extraction with various word-embedding techniques (BoW, TF-IDF, Word2vec, SIF), and multiple vector similarity techniques. A case study has also been presented along with its performance evaluation. In our case study, we achieved 92.7% accuracy on the 74.6% of questions that were answered, using a fine-tuned word2vec model along with the SIF technique and a 0.8 threshold value for the similarity score. The choice of pre-processing techniques, word-embedding techniques, and the similarity-score threshold can differ based on the use case and business requirements.
If you’re looking for similar technical competence or want to add semantic similarity functionality to your existing system, feel free to reach out at info@intellica.ai