# Word Embeddings

## Introduction

This page covers three techniques for learning word embeddings: an embedding layer trained jointly with a task-specific model, Word2Vec, and GloVe.

## Embedding Layer

An embedding layer, for lack of a better name, is a word embedding that is learnt jointly with a neural network on a specific NLP task, such as language modeling or document classification.

The steps to learn an embedding layer from a clean, pre-processed text corpus are:

1. Tokenize the text corpus and convert each word to a one-hot encoded vector. The size of each one-hot vector equals the vocabulary size of the corpus.
2. Map the one-hot encoded vectors to lower-dimensional word vectors, e.g. of size 768 or 512.
3. If an MLP is used as the neural net, the word vectors are concatenated and fed as input to the MLP.
4. If an RNN is used, each word vector is used as one input in the sequence.
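
As a minimal sketch of steps 1 to 4, the snippet below shows that multiplying a one-hot vector by an embedding matrix selects a single row of that matrix, and how the resulting word vectors might be arranged for an MLP or an RNN. The toy vocabulary, the embedding size of 3, and the helper names (`one_hot`, `embed_by_matmul`, `embed_by_lookup`) are assumptions made for illustration; the weights are random rather than trained.

```python
import random

# Hypothetical toy vocabulary and embedding size, chosen only for illustration.
vocab = ["the", "quick", "brown", "fox"]
word_to_index = {word: i for i, word in enumerate(vocab)}
embedding_dim = 3

random.seed(0)
# Embedding matrix: one row of (here random, normally trainable) weights per word.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in vocab
]

def one_hot(index, size):
    """Return a one-hot vector with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_matmul(one_hot_vector, matrix):
    """Multiply a one-hot row vector by the embedding matrix."""
    return [
        sum(one_hot_vector[row] * matrix[row][col] for row in range(len(matrix)))
        for col in range(len(matrix[0]))
    ]

def embed_by_lookup(index, matrix):
    """Select the row directly; equivalent to the multiplication above."""
    return list(matrix[index])

fox = word_to_index["fox"]
via_matmul = embed_by_matmul(one_hot(fox, len(vocab)), embedding_matrix)
via_lookup = embed_by_lookup(fox, embedding_matrix)

window = ["quick", "brown"]
# MLP-style input: concatenate the word vectors of a fixed-length window.
mlp_input = [x for w in window for x in embedding_matrix[word_to_index[w]]]
# RNN-style input: keep one word vector per time step.
rnn_input = [embedding_matrix[word_to_index[w]] for w in window]
```

Because the multiplication and the row lookup give the same vector, the one-hot step can be treated as conceptual: storing word indices and looking up rows avoids building vocabulary-sized vectors.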

## Word2Vec

Word2Vec focuses on learning word representations from their local usage context, using an unsupervised learning approach. It was developed to make neural-network-based training of word embeddings more efficient. It provides two modeling architectures:

1. Continuous Bag of Words (CBOW)
2. Skip-Gram

### CBOW

The CBOW architecture learns the embedding by predicting the current (centre) word from its context (surrounding) words. The size of the context window is configurable.

Following are the steps to learn CBOW word embeddings:

1. Generate a training dataset from the text corpus. For the sentence "the quick brown fox jumps over the lazy cat" and a context window of one word on each side, some training instances would be:

```python
# (context words, target word) pairs for a window of one word on each side
[
    (["the", "brown"], "quick"),
    (["quick", "fox"], "brown"),
    (["brown", "jumps"], "fox"),
]
```

2. Convert the words to one-hot encoding vectors.
3. Map the one-hot encoded vectors to lower dimensional word vectors.
4. The context word embeddings are passed to an averaging layer (for example, a lambda layer in Keras), which averages them into a single context embedding.
5. The context embedding is passed through a dense softmax layer to predict the target word.
6. The predicted target word is compared with the actual target word to compute the categorical cross-entropy loss.
7. The loss is back-propagated to adjust the weights of the neural network.
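
The steps above can be sketched in plain Python as follows. This is an illustrative forward pass only, under stated assumptions: a window of one word on each side, an embedding size of 4, random untrained weights, and hypothetical helper names (`cbow_pairs`, `cbow_forward`). The back-propagation step is left out, since in practice a deep learning framework would handle it.

```python
import math
import random

def cbow_pairs(tokens, window=1):
    """Build (context_words, target_word) pairs for positions with a full window."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "the quick brown fox jumps over the lazy cat".split()
vocab = sorted(set(tokens))
word_to_index = {w: i for i, w in enumerate(vocab)}
pairs = cbow_pairs(tokens, window=1)

random.seed(0)
dim = 4  # assumed embedding size
# Input embeddings (vocab x dim) and softmax layer weights (dim x vocab).
embeddings = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
output_weights = [[random.uniform(-0.5, 0.5) for _ in vocab] for _ in range(dim)]

def cbow_forward(context, target):
    """Return (softmax probabilities, cross-entropy loss) for one training pair."""
    vectors = [embeddings[word_to_index[w]] for w in context]
    # Average the context word vectors into one context embedding.
    context_vec = [sum(column) / len(vectors) for column in zip(*vectors)]
    # Dense layer: one score per vocabulary word.
    logits = [
        sum(context_vec[d] * output_weights[d][j] for d in range(dim))
        for j in range(len(vocab))
    ]
    # Numerically stable softmax.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Categorical cross-entropy against the one-hot encoded target word.
    loss = -math.log(probs[word_to_index[target]])
    return probs, loss

probs, loss = cbow_forward(*pairs[0])
```

Here `pairs[0]` is the instance with context `["the", "brown"]` and target `"quick"`. Training would repeat this forward pass over all pairs and update both weight matrices from the loss gradient.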

### Skip-Gram

Skip-Gram's learning objective is the opposite of CBOW's: it learns to predict the context (surrounding) words given the current (centre) word.

The steps to learn Skip-Gram word embeddings are:

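1. Generate a training dataset from the text corpus. Each instance pairs two words with a label: 1 if the second word appears in the context of the first, and 0 if it does not (a negative sample). For the sentence "the quick brown fox jumps over the lazy cat", some dataset instances would be: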

```python
# ((target word, other word), label): 1 = real context pair, 0 = negative sample
[
    (("quick", "brown"), 1),
    (("quick", "cat"), 0),
    (("lazy", "cat"), 1),
]
```

2. Convert the words to one-hot encoded vectors and map them to lower-dimensional word vectors.
3. The pair of word embeddings is passed through a merge layer, which computes the dot product between the two embeddings.
4. The output of the merge layer is passed through a dense sigmoid layer, which outputs a prediction between 0 and 1. The mean squared error (MSE) loss is calculated against the actual label.
5. The loss is back-propagated to adjust the weights of the neural network.
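
A rough sketch of these steps in plain Python is shown below. It assumes a window of one word on each side, one negative sample per positive pair, an embedding size of 4, and random untrained weights; the names `skipgram_samples`, `predict`, and `mse_loss` are hypothetical. As a simplification, the sigmoid is applied directly to the dot product instead of going through a dense layer with its own weight and bias, and the weight update is omitted.

```python
import math
import random

def skipgram_samples(tokens, window=1, negatives_per_positive=1, seed=0):
    """Return ((target, other), label) samples: 1 for real neighbours, 0 for noise."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    positives = set()
    for i, target in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                positives.add((target, tokens[j]))
    samples = [(pair, 1) for pair in sorted(positives)]
    for target, _context in sorted(positives):
        # Words that never occur inside the target's context window.
        candidates = [w for w in vocab if w != target and (target, w) not in positives]
        for _ in range(negatives_per_positive):
            samples.append(((target, rng.choice(candidates)), 0))
    return samples

tokens = "the quick brown fox jumps over the lazy cat".split()
samples = skipgram_samples(tokens)

random.seed(0)
dim = 4  # assumed embedding size
vocab = sorted(set(tokens))
# Separate embedding tables for the target word and the paired word.
target_emb = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
context_emb = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def predict(target, other):
    """Dot product of the two embeddings, squashed to (0, 1) by a sigmoid."""
    dot = sum(a * b for a, b in zip(target_emb[target], context_emb[other]))
    return 1.0 / (1.0 + math.exp(-dot))

def mse_loss(batch):
    """Mean squared error between predictions and the 0/1 labels."""
    return sum((label - predict(t, o)) ** 2 for (t, o), label in batch) / len(batch)

loss = mse_loss(samples)
```

With untrained weights the predictions sit near 0.5, so the loss starts close to 0.25; training would push predictions for real context pairs towards 1 and for negative samples towards 0.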

## GloVe

GloVe (Global Vectors for Word Representation) is an extension of the Word2Vec approach. It combines global matrix factorization techniques with local context window based learning to learn word embeddings.
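
To make the combination more concrete, the sketch below counts word co-occurrences inside a local context window and then scores a set of word vectors against the logarithm of those counts with a weighted least-squares loss, which is the factorization side of the idea. Everything here is an assumption for illustration: the window of 2, the embedding size of 4, the plain (unweighted by distance) counts, the weighting function with `x_max=100` and `alpha=0.75`, and the names `cooccurrence_counts`, `weight`, and `glove_loss`. No optimization step is shown.

```python
import math
import random
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair occurs within `window` positions."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        start, stop = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                counts[(word, tokens[j])] += 1.0
    return counts

def weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the influence of very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

tokens = "the quick brown fox jumps over the lazy cat".split()
counts = cooccurrence_counts(tokens)
vocab = sorted(set(tokens))

rng = random.Random(0)
dim = 4  # assumed embedding size
# Word vectors, context vectors, and one bias per word for each role.
word_vec = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
ctx_vec = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
word_bias = {w: 0.0 for w in vocab}
ctx_bias = {w: 0.0 for w in vocab}

def glove_loss():
    """Weighted squared error between dot products and log co-occurrence counts."""
    total = 0.0
    for (wi, wj), x in counts.items():
        dot = sum(p * q for p, q in zip(word_vec[wi], ctx_vec[wj]))
        total += weight(x) * (dot + word_bias[wi] + ctx_bias[wj] - math.log(x)) ** 2
    return total

loss = glove_loss()
```

Minimizing this loss with gradient descent would make the dot product of two word vectors approximate the log of how often the words co-occur across the whole corpus.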

