Word2Vec

25 Oct 2019

Word2Vec is a group of related models that are used to produce word embeddings. They follow several similar designs, which I intend to summarize here.

Skip-Gram Model

Word2Vec models adopt the Skip-Gram Model to model a word’s context. The Skip-Gram Model is a BOW (bag of words) model that defines the context of a word as the N words before and the N words after the target word (2N words in total), regardless of their order.

Each training instance has an input word and, as the target, a randomly sampled surrounding word from its context. Given sufficient training, the model should be able to predict a probability distribution over the vocabulary that gives high probability to words that frequently appear in the input word’s context.
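As a minimal sketch (plain Python; the function name and example sentence are mine), here is how the skip-gram context pairs could be generated with window size N:

```python
def skipgram_pairs(tokens, n=2):
    """Generate (input_word, context_word) training pairs.

    Every word within n positions before or after the target
    (up to 2n words) counts as context, regardless of order.
    """
    pairs = []
    for i, target in enumerate(tokens):
        window = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
        for context in window:
            pairs.append((target, context))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
print(skipgram_pairs(sentence, n=2)[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```

This enumerates every (input, context) pair; randomly sampling one surrounding word per instance, as described above, amounts to drawing from these pairs.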

Network structure

Word2Vec models use a shallow structure of just two layers: a hidden layer plus an output layer.

Hidden layer:

The hidden layer actually produces what we want: the encoded word vector.

It takes as input a one-hot encoded vector for the input word and produces the word vector passed to the output layer. Since the input is one-hot encoded (1 at the input word’s position, 0 everywhere else), the multiplication simply selects the corresponding row from the hidden layer’s weight matrix. This means the hidden layer’s weight matrix is actually a collection of word vectors.
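A small numpy sketch (toy vocabulary size, embedding dimension, and names are made up) of why the one-hot multiplication is just a row lookup:

```python
import numpy as np

V, D = 10, 4                      # toy vocabulary size and embedding dimension
W_hidden = np.random.randn(V, D)  # hidden layer weights: one row per word

word_idx = 3
one_hot = np.zeros(V)
one_hot[word_idx] = 1.0

h = one_hot @ W_hidden            # multiplying by a one-hot vector...
assert np.allclose(h, W_hidden[word_idx])  # ...just selects row `word_idx`
```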

Output layer:

The output layer produces a vector for a randomly sampled context word of an input word. However, as is usually the case, our vocabulary size is huge, which makes a vanilla softmax expensive to compute:

$$ p(w_i) = \frac{e^{o_i}}{\sum_{j=1}^{V} e^{o_j}} $$

where $o_i$ is the network output for word $i$ and $V$ is the size of our vocabulary.
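For reference, a direct (unoptimized) sketch of this softmax; the point is that normalizing touches all $V$ outputs for every training example:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())   # subtract the max for numerical stability
    return e / e.sum()        # the sum is O(V) work per example

V = 100_000                   # a realistically large vocabulary
o = np.random.randn(V)        # network outputs, one per word
p = softmax(o)                # normalization sums over all V entries
```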

To reduce the training cost, there are several solutions to this problem:

Negative Sampling

This is a way to decrease the softmax computational cost by decreasing the number of dimensions involved: rather than computing the softmax over the whole output vector ($V$-dimensional), it randomly samples N negative positions plus the target positive position, so the softmax is computed over only N+1 dimensions.

The paper says that selecting N in the range of 5-20 words works well for smaller datasets, while you can get away with only 2-5 words for large datasets.
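A rough numpy sketch of the reduced softmax described above (names are mine; negatives are drawn uniformly here just for brevity, the actual sampling distribution is described next):

```python
import numpy as np

def sampled_softmax_prob(o, positive_idx, num_neg=5, rng=np.random):
    """Softmax over the positive position plus num_neg sampled negative
    positions (N + 1 dimensions instead of V)."""
    V = o.shape[0]
    candidates = np.delete(np.arange(V), positive_idx)  # everything but the positive
    neg = rng.choice(candidates, size=num_neg, replace=False)
    idx = np.concatenate(([positive_idx], neg))         # N + 1 positions
    e = np.exp(o[idx] - o[idx].max())
    return (e / e.sum())[0]       # probability assigned to the positive word

V = 100_000
o = np.random.randn(V)
print(sampled_softmax_prob(o, positive_idx=42, num_neg=5))
```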

The negative samples are drawn from a unigram distribution, that is, the probability of selecting a word is proportional to how frequently it appears in our corpus:

$$ P(w_i) = \frac{n(w_i)}{\sum_{j=1}^{V} n(w_j)} $$

where $n(w_i)$ is the number of times $w_i$ appears in the corpus.

The authors state in their paper that they tried a number of variations on this equation, and the one which performed best was to raise the word counts to the 3/4 power:

$$ P(w_i) = \frac{n(w_i)^{3/4}}{\sum_{j=1}^{V} n(w_j)^{3/4}} $$
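A short numpy sketch of building this sampling distribution (the counts are made up); the 3/4 power flattens the distribution slightly, so rare words are sampled a bit more often than their raw frequency would suggest:

```python
import numpy as np

counts = np.array([523, 310, 104, 57, 6], dtype=float)  # made-up counts n(w_i)
probs = counts ** 0.75                 # n(w_i)^(3/4)
probs /= probs.sum()                   # normalize: divide by sum_j n(w_j)^(3/4)

# draw 5 negative samples from the adjusted unigram distribution
negatives = np.random.choice(len(counts), size=5, p=probs)
print(probs, negatives)
```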

Hierarchical Softmax

Hierarchical Softmax is another softmax alternative; it adopts Huffman coding to compress the output dimension.

A Huffman binary tree needs to be constructed before training. The basic idea is to put frequent tokens at shallow leaf nodes and infrequent tokens at deep leaf nodes. A word’s code is then a sequence of zeros and ones, where zero means go left and one means go right, starting from the root node until the word’s leaf node is reached. In this way, the output dimension is at most the height of the Huffman tree $H$.

The output layer will be $H$-dimensional, with a sigmoid applied to each unit to decide whether to go left or right in the tree.
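A minimal sketch of the two pieces described above, assuming a {word: count} dictionary (names are mine; in practice each decision score comes from the model, here random scores stand in):

```python
import heapq
import numpy as np

def build_huffman_codes(freqs):
    """Build Huffman codes (0 = left, 1 = right) for a {word: count} dict.
    Frequent words end up with short codes, i.e. shallow leaves."""
    heap = [(count, i, word) for i, (word, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    parents, next_id = {}, len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)       # two least frequent nodes
        c2, _, right = heapq.heappop(heap)
        node = ("internal", next_id)            # synthetic internal node id
        parents[left] = (node, 0)
        parents[right] = (node, 1)
        heapq.heappush(heap, (c1 + c2, next_id, node))
        next_id += 1
    codes = {}
    for word in freqs:
        code, n = [], word
        while n in parents:                     # walk up to the root
            n, bit = parents[n]
            code.append(bit)
        codes[word] = code[::-1]                # root-to-leaf order
    return codes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_probability(code, decision_scores):
    """P(word) as a product of sigmoid decisions along its path:
    sigmoid(s) for a 'go right' (1) step, 1 - sigmoid(s) for 'go left' (0)."""
    p = 1.0
    for bit, s in zip(code, decision_scores):
        p *= sigmoid(s) if bit == 1 else 1.0 - sigmoid(s)
    return p

freqs = {"the": 120, "on": 40, "cat": 15, "sat": 10, "mat": 5}
codes = build_huffman_codes(freqs)
print(codes)   # frequent words get short codes: 'the' -> 1 bit, 'mat' -> 4 bits
print(word_probability(codes["cat"], np.random.randn(len(codes["cat"]))))
```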

Training tricks


Mikolov, Tomas; et al. (2013). “Efficient Estimation of Word Representations in Vector Space”. Accessed from: https://arxiv.org/abs/1301.3781

Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). “Distributed Representations of Words and Phrases and their Compositionality”. Accessed from: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/