Word Embeddings
Introduction
There are three different techniques that can be used to learn word embeddings: an embedding layer trained jointly with a downstream task, Word2Vec, and GloVe.
Embedding Layer
An Embedding Layer, for lack of a better name, is a word embedding that is learned jointly with a neural network on a specific NLP task such as Language Modeling or Document Classification.
Following are the steps to learn an embedding layer for a clean and pre-processed text corpus (a minimal code sketch follows these steps):
Tokenize the text corpus and convert each word to a one-hot encoded vector. The size of the one-hot encoded vectors is the vocabulary size of the text corpus.
The one-hot encoded vectors are then mapped to word vectors of lower dimension, e.g. 768 or 512.
If an MLP is used as the neural network, the word vectors are concatenated and fed as input to the MLP.
If an RNN is used, each word vector is fed as one time step of the input sequence.
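A minimal sketch of these steps, assuming Keras and a binary document-classification task; the vocabulary size, embedding dimension, sequence length, and layer sizes are illustrative assumptions rather than values from the text:

```python
# Embedding layer learned jointly with an MLP classifier (assumed Keras setup;
# vocabulary size, dimensions, and the classification task are illustrative).
import numpy as np
from tensorflow.keras import layers, models

vocab_size = 10_000   # vocabulary size of the tokenized corpus (assumed)
embed_dim = 512       # lower-dimensional word vectors
seq_len = 20          # assumed fixed number of tokens per document

token_ids = layers.Input(shape=(seq_len,), dtype="int32")
# Each token id is equivalent to a one-hot vector of length vocab_size;
# the Embedding layer maps it to a dense word vector of length embed_dim.
word_vecs = layers.Embedding(vocab_size, embed_dim)(token_ids)
# MLP case: concatenate the word vectors into a single input vector.
concatenated = layers.Flatten()(word_vecs)
hidden = layers.Dense(128, activation="relu")(concatenated)
output = layers.Dense(1, activation="sigmoid")(hidden)  # e.g. binary document classification

model = models.Model(token_ids, output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data: integer token ids stand in for one-hot encoded words.
X = np.random.randint(0, vocab_size, size=(32, seq_len))
y = np.random.randint(0, 2, size=(32, 1))
model.fit(X, y, epochs=1, verbose=0)

# The learned word embeddings are the weights of the Embedding layer.
embeddings = model.layers[1].get_weights()[0]  # shape: (vocab_size, embed_dim)
```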
Word2Vec
Word2Vec focuses on learning about words from their local usage context, using an unsupervised learning approach. It was developed to make neural-network-based training of word embeddings more efficient. It provides two different modeling architectures:
Continuous Bag of Words (CBOW)
Skip-Gram
CBOW
The CBOW architecture learns the embedding by predicting the current (centre) word from its context (surrounding) words. The size of the context window is configurable.
Following are the steps to learn CBOW word embeddings (a minimal code sketch follows these steps):
Generate the training dataset from the text corpus. For the sentence "the quick brown fox jumps over the lazy cat" and, for example, a context window of size 2, some (context, target) instances would be: ([the, quick, fox, jumps], brown) and ([quick, brown, jumps, over], fox).
Convert the words to one-hot encoded vectors.
Map the one-hot encoded vectors to lower-dimensional word vectors.
The context word embeddings are passed to a lambda layer, which averages them to produce a single context embedding.
The context embedding is passed through a dense softmax layer to predict the target word.
The predicted target word is matched with the actual word to calculate the categorical cross-entropy loss.
The loss is back-propagated to adjust the weights of the neural network.
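A minimal sketch of the CBOW architecture described above, assuming Keras; the vocabulary size, embedding dimension, and window size are illustrative assumptions:

```python
# Minimal CBOW sketch (assumed Keras setup; vocabulary size, embedding
# dimension, and window size are illustrative, not taken from the text).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size = 10_000   # assumed vocabulary size
embed_dim = 100       # assumed embedding dimension
window = 2            # context words on each side -> 2 * window context inputs

context_ids = layers.Input(shape=(2 * window,), dtype="int32")
# One shared embedding table; each row is a word vector.
context_vecs = layers.Embedding(vocab_size, embed_dim)(context_ids)
# Lambda layer: average the context word embeddings into one context embedding.
context_embedding = layers.Lambda(lambda x: tf.reduce_mean(x, axis=1))(context_vecs)
# Dense softmax layer predicts the target (centre) word over the whole vocabulary.
target_probs = layers.Dense(vocab_size, activation="softmax")(context_embedding)

cbow = models.Model(context_ids, target_probs)
# Sparse categorical cross-entropy is the categorical cross-entropy loss
# computed directly on integer target-word ids.
cbow.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy (context, target) pairs standing in for pairs generated from the corpus.
X = np.random.randint(0, vocab_size, size=(64, 2 * window))
y = np.random.randint(0, vocab_size, size=(64,))
cbow.fit(X, y, epochs=1, verbose=0)

# cbow.layers[1].get_weights()[0] holds the learned word embeddings.
```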
Skip-Gram
Skip-Gram's learning objective is the opposite of CBOW's: it learns to predict the context (surrounding) words given the current (centre) word.
Following are the steps to learn Skip-Gram word embeddings (a minimal code sketch follows these steps):
Generate the training dataset from the text corpus. For the sentence "the quick brown fox jumps over the lazy cat", with a context window of size 2 and randomly sampled negative pairs, some (word, context word, label) instances would be: (fox, brown, 1), (fox, jumps, 1), (fox, lazy, 0), (fox, cat, 0).
Convert the words to one-hot encoded vectors and map them to lower-dimensional word vectors.
The pair of word embeddings is passed through a merge layer, which computes the dot product of the two embeddings.
The output of the merge layer is passed through a dense sigmoid layer, which predicts whether the pair is a true (word, context) pair (1) or not (0). The mean squared error (MSE) loss is calculated against the actual label.
The loss is back-propagated to adjust the weights of the neural network.
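A minimal sketch of this Skip-Gram setup, assuming Keras; the vocabulary size and embedding dimension are illustrative assumptions:

```python
# Minimal Skip-Gram sketch with (word, context, label) pairs
# (assumed Keras setup; sizes are illustrative, not taken from the text).
import numpy as np
from tensorflow.keras import layers, models

vocab_size = 10_000   # assumed vocabulary size
embed_dim = 100       # assumed embedding dimension

word_in = layers.Input(shape=(1,), dtype="int32")      # current (centre) word
context_in = layers.Input(shape=(1,), dtype="int32")   # candidate context word

word_vec = layers.Reshape((embed_dim,))(layers.Embedding(vocab_size, embed_dim)(word_in))
context_vec = layers.Reshape((embed_dim,))(layers.Embedding(vocab_size, embed_dim)(context_in))

# Merge layer: dot product between the two word embeddings.
dot = layers.Dot(axes=1)([word_vec, context_vec])
# Dense sigmoid layer outputs the probability that the pair is a true (word, context) pair.
pred = layers.Dense(1, activation="sigmoid")(dot)

skipgram = models.Model([word_in, context_in], pred)
# MSE against the 0/1 label, as in the steps above (binary cross-entropy is another common choice).
skipgram.compile(optimizer="adam", loss="mse")

# Dummy positive (label 1) and negative (label 0) pairs.
words = np.random.randint(0, vocab_size, size=(128, 1))
contexts = np.random.randint(0, vocab_size, size=(128, 1))
labels = np.random.randint(0, 2, size=(128, 1))
skipgram.fit([words, contexts], labels, epochs=1, verbose=0)
```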
GloVe
GloVe (Global Vectors) is an extension of the Word2Vec approach. It marries global matrix factorization techniques with local context-based learning to learn word embeddings.
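For reference, GloVe's training objective is a weighted least-squares regression on the logarithm of word co-occurrence counts; this is the standard formulation from the GloVe paper rather than something derived in this note:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

where $X_{ij}$ counts how often word $j$ appears in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context-word vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that down-weights rare and very frequent co-occurrences.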