CIL5 Data Clustering


Vector Quantization

Assign vectors to groups, each represented by a centroid. Methods: K-Means, Mixture Models.

K-Means

  • $k$: the number of clusters
  • means: the centroids are obtained by averaging the points assigned to each cluster.

Objective

$$
\min_{U,Z}J(U,Z)=\sum_{i=1}^N \sum_{j=1}^K z_{ij}\,\|x_i-u_j\|^2 = \|X-UZ^\top\|_F^2
$$

Here, we use the squared Euclidean distance, not the Euclidean distance.

Algorithms

Alternating minimization

  • update $Z$, given $U$: assign each point to its closest centroid in $U$.
  • update $U$, given $Z$: set each centroid to the average of the points assigned to it (a minimal sketch follows below).
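
A minimal NumPy sketch of these two alternating updates (the function and variable names here are my own, not from the notes):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain alternating minimization (Lloyd's iteration).

    X: (N, d) data matrix; returns centroids U of shape (k, d) and labels z of shape (N,).
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    U = X[rng.choice(N, size=k, replace=False)]  # naive random initialization
    for _ in range(n_iters):
        # Update Z given U: assign each point to its closest centroid (O(Nkd)).
        dists = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # Update U given Z: each centroid becomes the mean of its assigned points.
        U_new = np.array([X[z == j].mean(axis=0) if np.any(z == j) else U[j]
                          for j in range(k)])
        if np.allclose(U_new, U):  # stop once the centroids no longer move
            break
        U = U_new
    return U, z
```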

Practical considerations

  • Quadratic convergence rate.
  • Computational cost: $O(Nkd)$ per iteration ($N$ points, $k$ clusters, $d$ dimensions).

Variant

initialization: K-Means++

Since the initialization of $U$ and $Z$ strongly affects both the computational cost and the final result, this variant tries to find a better way to initialize $U$.

From Wikipedia:

The intuition behind this approach is that spreading out the $k$ initial cluster centers is a good thing (see the sketch after this list):

  1. the first cluster center is chosen uniformly at random from the data points that are being clustered,
  2. after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point’s closest existing cluster center.
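
A small sketch of this seeding procedure in plain NumPy (the function name `kmeans_pp_init` is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ seeding: first centroid uniform, then D^2-weighted sampling."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centroids = [X[rng.integers(N)]]              # step 1: uniform at random
    for _ in range(k - 1):
        C = np.asarray(centroids)
        # squared distance of every point to its closest already-chosen centroid
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(N, p=d2 / d2.sum())])  # step 2: D^2 weighting
    return np.asarray(centroids)
```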

sub-training dataset: coreset

The idea: the dataset is too big to run the K-Means algorithm on directly, so we extract a sub-sample from it. This method addresses how to select that sub-training dataset.

  • how big should the sub-dataset be?

    • $m$ depends on the desired $\epsilon$-approximation guarantee, which holds with probability at least $1-\delta$:

      $$
      m\propto \frac{dk\log k+\log 1/\delta}{\epsilon^2}
      $$

  • which points should be selected?

    • they are sampled at random according to a probability that combines uniform selection with the squared distance of each point to the centroids:

      $$
      p_i=\frac{1}{2N}+\frac{D_i^2}{2\sum_{j=1}^N D_j^2}
      $$

    • why weight the data points? (see the original paper)

      • the weight guarantees that the expected quantization error on the sampled dataset stays unchanged.

        • the original vector quantization error is

        $$
        \phi_{\mathcal{X}}(Q)=\sum_{x\in \mathcal{X}}d(x,Q)^2 = \sum_{x\in\mathcal{X}}q(x)\frac{d(x,Q)^2}{q(x)}
        $$

        The quantization error can hence be approximated by sampling $m$ points from $\mathcal{X}$ using $q(x)$ and assigning them weights inversely proportional to $q(x)$. Back to the original dataset: if each data point is selected with probability $\frac{1}{N}$ and weight $1$, the expected contribution of one sample is $\mathbb{E}[d(x,Q)^2]=\frac{1}{N}\sum_{x\in\mathcal{X}}d(x,Q)^2=\frac{1}{N}\phi_{\mathcal{X}}(Q)$.

        • the (weighted) error on the new dataset $\zeta$ is $\phi_\zeta(Q)=\sum_{x\in\zeta}w(x)\,d(x,Q)^2$, where each point is included with sampling probability $p(x)$.

          Setting $w(x)=\frac{1}{mp(x)}$ makes $\mathbb{E}[\phi_\zeta(Q)]=\phi_{\mathcal{X}}(Q)$, so the expected quantization error is preserved (a sketch of this sampling follows below).
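
A rough sketch of this weighted sampling step; here $D_i$ is taken to be the distance of $x_i$ to the dataset mean (an assumption matching the lightweight-coresets construction, whereas the notes above say "to the centroids"), and the function name is illustrative:

```python
import numpy as np

def lightweight_coreset(X, m, seed=0):
    """Sample m points with p_i = 1/(2N) + D_i^2 / (2 * sum_j D_j^2) and
    weights w_i = 1/(m * p_i), so the expected quantization error is preserved."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Assumption: D_i is the distance of x_i to the dataset mean.
    D2 = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    p = 0.5 / N + 0.5 * D2 / D2.sum()
    idx = rng.choice(N, size=m, replace=True, p=p)
    return X[idx], 1.0 / (m * p[idx])             # (coreset points, weights)
```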

Mixture Model

Idea

Probabilistic clustering: each data point is not hard-assigned to a single cluster; instead, it belongs to each cluster with some probability.

Gaussian Mixture Model

  • mixture: several models mixed together
  • gaussian: each component model follows a multivariate Gaussian distribution.
  • Is the categorical distribution of $j$ given ahead of time and the same for all $x$? Or is it related to $x_i$, in which case it should be written as $\pi_{ij}$? How should we know?
    • It is the same for all $x$. Here $\pi_j$ means $p(z_j=1)$; it is a property of the dataset as a whole, not $p(z_{ij}=1|x_i)$.
    • So it only describes the category proportions over all $x$, not the assignment of an individual point.

$$
p(x;\theta)=\sum_{j=1}^K\pi_j p(x;\mu_j,\Sigma_j)
$$

We can understand it this way: the model is composed of several different models. For a data point $x$, its probability of being generated by the “big” model is determined by its probability of being generated by each “small” model, together with the proportion of that “small” model. For example, if we know $x$ is generated by the $j$-th model with probability 1 (and 0 by the other models), and the $j$-th model generates $\pi_j\cdot N$ points in total ($N$ being the size of the dataset), then the probability that $x$ is generated by the whole model is $1\cdot\pi_j$.
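
For concreteness, a small sketch evaluating this mixture density with SciPy; the two components and their parameter values are made up:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two made-up components: mixture weights pi_j, means mu_j, covariances Sigma_j.
pis    = np.array([0.3, 0.7])
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 0.5 * np.eye(2)]

def mixture_density(x):
    """p(x; theta) = sum_j pi_j * N(x; mu_j, Sigma_j)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=sig)
               for pi, mu, sig in zip(pis, mus, sigmas))

print(mixture_density(np.array([1.0, 1.0])))
```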

ML method

Maximum likelihood method usually

  • makes some general assumption about the distribution. Here: the Gaussian distribution.
  • then tries to obtain / “infer” the specifics from the available data.

Generation: the EM algorithm

By MLE we want to estimate the parameters $\theta$. Since the objective function has a summation inside the logarithm, we use the EM algorithm together with Jensen’s inequality (a sketch of one EM iteration follows the questions below).

  • what is the role of $q_j$?

    • $q_j$ should really be $q_{ij}$: it depends on $x_i$ and means $p(z_{ij}=1|x_i)$.
  • why do we also estimate $\pi_j$? It seems to be given in advance and to have no effect on the solution for $\theta$.

    • It has no direct effect on $\theta$, but it directly influences $q_{ij}$ and thus indirectly influences $\theta$. Therefore, in each step we also need to optimize $\pi_j$.
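
A compact sketch of one EM iteration for the GMM, assuming NumPy/SciPy; the responsibilities $q_{ij}$ are computed in the E-step and $\pi_j,\mu_j,\Sigma_j$ are re-estimated in the M-step (function and variable names are mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, sigmas):
    """One EM iteration for a GMM.

    X: (N, d); pis: (K,); mus: (K, d); sigmas: (K, d, d).
    """
    N, K = X.shape[0], len(pis)
    # E-step: responsibilities q_ij = p(z_ij = 1 | x_i), proportional to pi_j * N(x_i; mu_j, Sigma_j)
    q = np.column_stack([pis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=sigmas[j])
                         for j in range(K)])
    q /= q.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi_j, mu_j, Sigma_j from the responsibilities.
    Nj = q.sum(axis=0)                            # effective number of points per component
    pis = Nj / N
    mus = (q.T @ X) / Nj[:, None]
    sigmas = np.stack([(q[:, j, None] * (X - mus[j])).T @ (X - mus[j]) / Nj[j]
                       for j in range(K)])
    return pis, mus, sigmas, q
```

Iterating `em_step` until the log-likelihood stops improving gives the maximum-likelihood estimate; the returned $q$ also provides the inference discussed next, since $\arg\max_j q_{ij}$ is the most likely component for $x_i$.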

Inference

It is easy to infer the latent variable $j$: its posterior is given by the responsibility $q_{ij}$.

Model Selection

Data fit

Complexity

Can be measured by the number of free parameters, $\kappa(\theta)$.

AIC

$$
\text{AIC}(\theta|X)=-\log p(X;\theta)+\kappa(\theta)
$$

BIC

$$
\text{BIC}(\theta|X)=-\log p(X;\theta)+\frac{1}{2}\kappa(\theta)\log N
$$

BIC penalizes complexity more heavily than AIC (the factor $\frac{1}{2}\log N$ exceeds $1$ once $N\ge 8$).

A single AIC (BIC) value is meaningless on its own. One has to repeat the analysis for different values of $K$ and compare: the most suitable number of clusters corresponds to the smallest AIC (BIC) value.
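
As a sketch of the bookkeeping, assuming a $K$-component GMM with full covariance matrices (the parameter count and the hypothetical `fit_gmm` helper mentioned in the comment are my assumptions, not from the notes):

```python
import numpy as np

def n_free_params(K, d):
    """kappa(theta) for a K-component GMM with full covariances:
    K-1 mixture weights + K*d means + K*d*(d+1)/2 covariance entries."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

def aic_bic(log_lik, K, d, N):
    """AIC and BIC in the convention used above (negative log-likelihood plus penalty)."""
    kappa = n_free_params(K, d)
    return -log_lik + kappa, -log_lik + 0.5 * kappa * np.log(N)

# Compare several K and keep the one with the smallest score, e.g.
# (fit_gmm is a hypothetical routine returning the maximized log-likelihood):
# scores = {K: aic_bic(fit_gmm(X, K), K, X.shape[1], X.shape[0]) for K in range(1, 10)}
```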


Author: Fululu