Vector Quantization
Assign vectors to groups, each group represented by a centroid. Methods: K-Means, Mixture Model.
K-Means
- $k$: the number of clusters
- means: the centroids are obtained by averaging the points assigned to each cluster.
Objective
$$
\min_{U,Z} J(U,Z)=\sum_{i=1}^N \sum_{j=1}^K z_{ij}\,\|x_i-u_j\|^2 = \|X-UZ^\top\|_F^2
$$
Here, we use the squared Euclidean distance, not the Euclidean distance. (In the matrix form, the dimensions force $X$ to hold the data points as columns, $U$ the centroids as columns, and $Z$ to be the one-hot assignment matrix.)
Algorithms
Alternating minimization
- update $Z$, given $U$: assign each point to its closest centroid in $U$.
- update $U$, given $Z$: set each centroid to the average of the points assigned to it (see the sketch below).
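A minimal NumPy sketch of these two alternating updates; the data is stored row-wise here and all names are illustrative, not from the notes:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain alternating minimization: Z-update, then U-update."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    U = X[rng.choice(N, size=k, replace=False)].astype(float)   # initial centroids, shape (k, d)
    for _ in range(n_iter):
        # Z-update (given U): assign every point to its closest centroid (squared Euclidean).
        dist2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)   # (N, k)
        z = dist2.argmin(axis=1)
        # U-update (given Z): each centroid becomes the mean of its assigned points.
        for j in range(k):
            if np.any(z == j):
                U[j] = X[z == j].mean(axis=0)
    return U, z
```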
Practical considerations
- Quadratic convergence rate.
- Computational cost: $O(Nkd)$ per iteration.
Variants
Initialization: K-Means++
Since the initialization of $U$ (and hence $Z$) matters a lot for both the computational cost and the final result, this variant tries to find a better way to initialize $U$.
The intuition behind this approach is that spreading out the $k$ initial cluster centers is a good thing:
- the first cluster center is chosen uniformly at random from the data points that are being clustered,
- after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point’s closest existing cluster center.
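A short sketch of this seeding procedure ($D^2$-sampling); the function name and the use of NumPy are assumptions for illustration:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ seeding: spread out the k initial centers via D^2-sampling."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centers = [X[rng.integers(N)]]                       # first center: uniform at random
    for _ in range(k - 1):
        C = np.array(centers)
        # squared distance of every point to its closest already-chosen center
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[rng.choice(N, p=d2 / d2.sum())])  # probability proportional to D_i^2
    return np.array(centers)
```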
Sub-training dataset: core set
The idea: the dataset may be too big to run the K-Means algorithm on directly, so we extract a sub-sample from it. This method addresses how to select that sub-training dataset.
How big should the sub-dataset be?
The sample size $m$ depends on the desired $\epsilon$-approximation guarantee (which holds with probability $1-\delta$):
$$
m\propto \frac{dk\log k+\log 1/\delta}{\epsilon^2}
$$
Which points should be selected?
Points are sampled at random according to a probability that combines uniform selection with the squared distance of each point to the centroids:
$$
p_i=\frac{1}{2N}+\frac{D_i^2}{2\sum_{j=1}^N D_j^2}
$$
Why weight the data points? (see the original paper)
The weights guarantee that the expected quantization error on the sampled dataset does not change.
- the original vector quantization error is
$$
\phi_{\chi}(Q)=\sum_{x\in \chi}d(x,Q)^2 = \sum_{x\in\chi}q(x)\frac{d(x,Q)^2}{q(x)}
$$
The quantization error can hence be approximated by sampling $m$ points from $\chi$ using $q(x)$ and assigning them weights inversely proportional to $q(x)$. Back on the original dataset, if we assign each data point selection probability $\frac{1}{N}$ and weight $1$, we get $\mathbb{E}(\phi_\chi(Q))=\frac{1}{N}\mathbb{E}\big(\sum_{x\in\chi}d(x,Q)^2\big)$.
The error on the new dataset $\zeta$ is $\phi_\zeta(Q)=\sum_{x\in\zeta}p(x)\,w(x)\,d(x,Q)^2$,
Set $w(x)=\frac{1}{mp(x)}$
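A sketch of this sampling step. As an assumption for illustration, $D_i$ is taken here as the distance of $x_i$ to the overall data mean (as in the lightweight-coreset construction); if the notes' $D_i$ refers to distances to current centroids instead, swap that one line. Names are illustrative:

```python
import numpy as np

def sample_coreset(X, m, seed=0):
    """Sample m weighted points with p_i = 1/(2N) + D_i^2 / (2 * sum_j D_j^2)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Assumption: D_i = distance of x_i to the overall data mean.
    D2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    p = 1.0 / (2 * N) + D2 / (2 * D2.sum())
    idx = rng.choice(N, size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])          # w(x) = 1/(m p(x)) keeps the quantization error unbiased
    return X[idx], w
```

Running a weighted K-Means on the returned points and weights then stands in for K-Means on the full dataset.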
Mixture Model
Idea
Probabilistic clustering: each data point is not hard-assigned to a single cluster; instead, it belongs to each cluster with some probability.
Gaussian Mixture Model
- mixture: several models mixed together
- gaussian: each model follows a multivariate Gaussian distribution.
- Is the categorical distribution over $j$ given in advance and the same for all $x$? Or is it related to $x_i$, in which case it should be written $\pi_{ij}$? How do we know?
- It is the same for all $x$. Here $\pi_j$ means $p(z_{ij}=1)$, known from our dataset as a whole, not $p(z_{ij}=1\mid x_i)$.
- So, it only applies when we already know the categories for all $x$
$$
p(x;\theta)=\sum_{j=1}^K\pi_j p(x;\mu_j,\Sigma_j)
$$
We can understand it this way: the model is composed of several “small” models. For a data point $x$, its probability of being generated by the “big” model is determined by its probability under each “small” model and by the proportion of that “small” model. For example, if $x$ is generated by the $j$-th model with probability 1 (and 0 by the others), and the $j$-th model generates $\pi_j\cdot N$ points in total ($N$ being the size of the dataset), then the probability that $x$ is generated by the whole model is $1\cdot\pi_j$.
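A small SciPy sketch of evaluating this mixture density; the parameter values below are made up purely for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters for a K = 2 mixture in d = 2 dimensions (not from the notes).
pis    = np.array([0.3, 0.7])                              # mixing proportions pi_j
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]      # means mu_j
sigmas = [np.eye(2), 0.5 * np.eye(2)]                      # covariances Sigma_j

def mixture_density(x):
    """p(x; theta) = sum_j pi_j * N(x; mu_j, Sigma_j)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, sigmas))

print(mixture_density(np.array([1.0, 1.0])))
```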
ML method
The maximum likelihood method usually
- makes some general assumption about the distribution (here: the Gaussian distribution),
- then tries to obtain / “infer” the specifics from the available data.
Generation: the EM algorithm
By MLE, we want to estimate the parameters $\theta$. Since the objective function has a summation inside a logarithm, we use the EM algorithm together with Jensen's inequality.
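Concretely, introducing a distribution $q_{ij}$ over the latent assignments of $x_i$ and applying Jensen's inequality gives the lower bound that EM maximizes:
$$
\log p(X;\theta)=\sum_{i=1}^N\log\sum_{j=1}^K q_{ij}\,\frac{\pi_j\,p(x_i;\mu_j,\Sigma_j)}{q_{ij}}
\ \ge\ \sum_{i=1}^N\sum_{j=1}^K q_{ij}\log\frac{\pi_j\,p(x_i;\mu_j,\Sigma_j)}{q_{ij}},
$$
with equality when $q_{ij}\propto\pi_j\,p(x_i;\mu_j,\Sigma_j)$, i.e. $q_{ij}=p(z_{ij}=1\mid x_i)$.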
What is the role of $q_j$?
- $q_j$ should really be $q_{ij}$: it depends on $x_i$ and means $p(z_{ij}=1\mid x_i)$.
Why do we also estimate $\pi_j$? Isn't it given in advance, with no effect on the solution for $\theta$?
- It has no direct effect on $\theta$, but it directly influences $q_{ij}$ and thus indirectly influences $\theta$. Therefore, in each step we also need to optimize $\pi_j$.
Inference
Inference of the latent variable $j$ is easy: it is given directly by the responsibilities $q_{ij}$, which is exactly what the E-step computes.
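A compact NumPy/SciPy sketch of one EM iteration: the E-step computes the responsibilities $q_{ij}$ (the inference step above), and the M-step re-estimates $\pi_j$, $\mu_j$, $\Sigma_j$. The small ridge added to the covariances is an assumption for numerical stability, not part of the notes:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, sigmas):
    """One EM iteration for a Gaussian mixture. X: (N, d) data, row-wise."""
    N, d = X.shape
    K = len(pis)
    # E-step: q_ij = p(z_ij = 1 | x_i) (responsibilities).
    q = np.column_stack([pis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=sigmas[j])
                         for j in range(K)])
    q /= q.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted updates of the parameters.
    Nj = q.sum(axis=0)                                   # effective cluster sizes
    pis = Nj / N
    mus = [(q[:, j] @ X) / Nj[j] for j in range(K)]
    sigmas = []
    for j in range(K):
        Xc = X - mus[j]
        sigmas.append((q[:, j][:, None] * Xc).T @ Xc / Nj[j] + 1e-6 * np.eye(d))  # ridge: assumption
    return pis, mus, sigmas, q
```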
Model Selection
Data fit
Complexity
Can be measured by the number of free parameters, $\kappa(\theta)$.
AIC
$$
\text{AIC}(\theta|X)=-\log p(X;\theta)+\kappa(\theta)
$$
BIC
$$
\text{BIC}(\theta|X)=-\log p(X;\theta)+\frac{1}{2}\kappa(\theta)\log N
$$
BIC penalizes complexity more heavily than the AIC criterion.
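With these definitions, the BIC penalty exceeds the AIC penalty exactly when $\tfrac{1}{2}\kappa(\theta)\log N>\kappa(\theta)$, i.e. $\log N>2$, which holds for any dataset with $N>e^2\approx 7.4$ points.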
A single AIC (BIC) value is meaningless on its own. One has to repeat the analysis for different values of $K$ and compare the results: the most suitable number of clusters corresponds to the smallest AIC (BIC) value.
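A sketch of this sweep over $K$ using scikit-learn's GaussianMixture. Note that sklearn's built-in `aic()`/`bic()` methods use the conventional $-2\log p(X;\theta)+\dots$ scaling, so the notes' definitions are computed directly below; the free-parameter count assumes full covariance matrices:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k(X, k_values):
    """Fit a GMM for each K and score it with the AIC/BIC definitions above."""
    N, d = X.shape
    scores = {}
    for k in k_values:
        gm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
        neg_loglik = -gm.score(X) * N                      # score() returns the mean log-likelihood
        kappa = k * d + k * d * (d + 1) // 2 + (k - 1)     # means + covariances + mixing weights
        scores[k] = (neg_loglik + kappa,                   # AIC
                     neg_loglik + 0.5 * kappa * np.log(N)) # BIC
    return scores  # pick the K with the smallest AIC (or BIC)
```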