Motivation

Represent each word by a vector, chosen so that the vector captures some of the word's meaning.

What is the advantage of a vector representation?
It can capture closeness in meaning, e.g. similar words have similar vectors.
How do we know whether an embedding is good?
- Usually, we evaluate an embedding by its predictive ability.
- Skip-gram model (distributional context model): a prediction model. Given a context word, learn a probability distribution over the whole vocabulary. Our objective is to maximize:
  $$
  L(\theta;w) = \sum_{t=1}^T \sum_{\Delta\in I}\log p_\theta(w^{(t+\Delta)}|w^{(t)})
  $$
  Here $I$ is the set of window offsets $\Delta$. The context is a single point; the predicted words form a window around it.
- CBOW (continuous bag-of-words model): the reverse of the skip-gram model. Given the context words, learn the probability of the word they surround. The context is a window; the predicted word is a point.
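
The difference between the two models can be seen in the training pairs each one extracts from the same text. The sketch below is only an illustration: the function names, the symmetric window, and the `window` parameter are assumptions, not part of the notes.

```python
def window_positions(t, length, window):
    """Positions t + delta inside the text, for delta in [-window, window], delta != 0."""
    return [t + d for d in range(-window, window + 1)
            if d != 0 and 0 <= t + d < length]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: the context is a point, each word in the window is predicted."""
    pairs = []
    for t, center in enumerate(tokens):
        for pos in window_positions(t, len(tokens), window):
            pairs.append((center, tokens[pos]))
    return pairs


def cbow_pairs(tokens, window=2):
    """CBOW: the context is a window, the single word in the middle is predicted."""
    pairs = []
    for t, center in enumerate(tokens):
        context = [tokens[pos] for pos in window_positions(t, len(tokens), window)]
        pairs.append((context, center))
    return pairs
```

For the text `["the", "cat", "sat"]` with `window=1`, skip-gram yields four (context word, predicted word) pairs, while CBOW yields one (context window, predicted word) pair per position.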

Models

$$
w\mapsto (\vec{x_w},b_w)\in\mathbb{R}^{d+1}
$$

Each word $w$ is represented by a vector $\vec{x_w}\in\mathbb{R}^d$ and a scalar bias $b_w$.

Negative sampling is a sampling method. Sampling is a natural choice when the object to sum over is exponentially large. One classic sampling method is MCMC.

Sampling: in maximum likelihood estimation, each update for one context word has to touch the vectors of all the other words, because every word appears in the sum of exponential terms that normalizes the probability. This is very time-consuming. Instead, we update only a few words' vectors, and these few words are selected by sampling.
Negative: every word other than the expected one is called a negative word.
- How do we know which word is the expected word? The word that itself serves as the context must be a positive word.
How are the word vectors updated?
What is the relationship between negative sampling and PMI?
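
The notes leave the update rule open. One common form is a gradient step on the skip-gram negative-sampling loss $-\log\sigma(u_{pos}\cdot v)-\sum_{neg}\log\sigma(-u_{neg}\cdot v)$. The sketch below assumes that loss, drops the bias terms, and draws negatives uniformly; all names (`sample_negatives`, `negative_sampling_step`, `lr`) are hypothetical. Its only purpose is to show that one step touches a handful of vectors rather than the whole vocabulary.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sample_negatives(vocab_size, positive, k, rng):
    """Draw k word ids uniformly at random, skipping the positive word (assumed scheme)."""
    negatives = []
    while len(negatives) < k:
        j = rng.randrange(vocab_size)
        if j != positive:
            negatives.append(j)
    return negatives


def negative_sampling_step(word_vecs, ctx_vecs, center, positive, negatives, lr=0.1):
    """One in-place gradient step; only 1 + 1 + len(negatives) vectors change."""
    v = word_vecs[center]
    grad_v = [0.0] * len(v)
    # Label 1 for the positive word, 0 for each sampled negative word.
    for j, label in [(positive, 1.0)] + [(n, 0.0) for n in negatives]:
        u = ctx_vecs[j]
        g = sigmoid(dot(u, v)) - label  # derivative of the loss w.r.t. the score u.v
        for d in range(len(v)):
            grad_v[d] += g * u[d]       # accumulate with the old value of u
            u[d] -= lr * g * v[d]       # update this word's vector
    for d in range(len(v)):
        v[d] -= lr * grad_v[d]
```

With a vocabulary of size $V$ and $k$ negatives, each step costs on the order of $k$ vector updates instead of $V$.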

The key idea of GloVe is:

Extract a count matrix from the given dataset.
- Each row is a word in the vocabulary
- Each column is a context
- Each entry $n_{ij}$ is the number of times word $i$ appears in context $j$
Construct another matrix from the vector representations of the words and the contexts.
- Each entry is the inner product of a word vector and a context vector.
Update the vectors to make these two matrices as close as possible.

Optional: use a weight function to weight entries with different counts.

In this way, the vector representation problem is transformed into a matrix factorization problem.

$$
\min_{X,Y}\|M-X^\top Y\|_F^2
$$
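
The two matrices and the distance between them can be sketched in a few lines. This follows the formula exactly as written in these notes, without the optional weight function. Treating neighbouring words as the contexts, and the names `count_matrix` and `factorization_loss`, are assumptions made for illustration. `X[i]` and `Y[j]` play the role of the $i$-th column of $X$ and the $j$-th column of $Y$.

```python
def count_matrix(tokens, vocab, window=1):
    """n[i][j] = number of times word i appears with context j (here: a neighbouring word)."""
    index = {w: i for i, w in enumerate(vocab)}
    n = [[0] * len(vocab) for _ in vocab]
    for t, word in enumerate(tokens):
        for delta in range(-window, window + 1):
            if delta != 0 and 0 <= t + delta < len(tokens):
                n[index[word]][index[tokens[t + delta]]] += 1
    return n


def factorization_loss(M, X, Y):
    """Squared Frobenius norm of M - X^T Y, computed entry by entry."""
    loss = 0.0
    for i, row in enumerate(M):
        for j, m_ij in enumerate(row):
            inner = sum(a * b for a, b in zip(X[i], Y[j]))  # word vector . context vector
            loss += (m_ij - inner) ** 2
    return loss
```

Minimizing `factorization_loss` over `X` and `Y` is then the matrix factorization problem stated above.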

How does GloVe solve the exponential problem?
- It uses unnormalized “probabilities” instead of normalized ones, which avoids the exponential terms, since those terms are only there for normalization.

The objective function: $f(x)=\sum_{i=1}^n f_i(x)$
Traditional gradient descent uses the full gradient:

$$
\frac{\partial f}{\partial x}=\sum_i \frac{\partial f_i}{\partial x}
$$
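The stochastic way estimates it from a single randomly chosen term (the factor $n$ makes the estimate unbiased and is usually absorbed into the learning rate):

$$
\frac{\partial f}{\partial x}\approx n\,\frac{\partial f_\gamma}{\partial x}, \quad \gamma\sim\text{Uniform}\{1,\dots,n\}
$$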
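Key idea:
- follow the derivative, as in ordinary gradient descent
- evaluate it on only part of the objective (a single term $f_\gamma$)
- draw that part at random at each step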
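
A small worked example of the two gradients follows. It assumes the toy objective $f(x)=\sum_i (x-a_i)^2$, whose minimizer is the mean of the $a_i$. The objective, the step size, and the function names are illustrative choices, not taken from the notes.

```python
import random


def full_gradient(x, data):
    """Gradient of f(x) = sum_i (x - a_i)^2: every term contributes."""
    return sum(2.0 * (x - a) for a in data)


def stochastic_gradient(x, data, rng):
    """Gradient of one term f_gamma, with gamma drawn uniformly at random."""
    gamma = rng.randrange(len(data))
    return 2.0 * (x - data[gamma])


def sgd(data, steps=2000, lr=0.01, seed=0):
    """Stochastic gradient descent; the factor n is absorbed into the step size lr."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        x -= lr * stochastic_gradient(x, data, rng)
    return x
```

Each stochastic step looks at one data point instead of all $n$, and with a small constant step size the iterate settles in a neighbourhood of the minimizer rather than exactly on it.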