CIL3 Topic Model


  • Are the topics fixed and given ahead of time, i.e. do we already have a set of available topics?
    • Yes. Also, in pLSA, all the words and documents are given.
  • Are the given data regularized? Do they all have the same dimension, or arbitrary lengths?
    • They all have the same dimension; we need to pre-process the data to achieve this (see the sketch after this list).
  • What are the approaches to solve this problem?
    • Key idea: non-negative matrix factorization
    • Practical method: the EM algorithm
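
A minimal sketch of the pre-processing step, assuming scikit-learn's CountVectorizer is available; the toy corpus is made up. It turns raw documents of arbitrary length into a fixed-dimension word-count matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: documents of arbitrary length.
docs = [
    "topic models describe documents as mixtures of topics",
    "words are generated from topic-specific word distributions",
    "matrix factorization recovers the latent topics",
]

# Count occurrences of every vocabulary word in every document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # shape: (N documents, M words)
D = X.T.toarray()                    # D: M x N, D_ij = count of word i in document j

print(D.shape)
```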

pLSA: Probabilistic Latent Semantic Analysis

  • Where does the "probabilistic" show up?
    • Each occurrence of $word_j$ in $document_i$ is generated probabilistically by some $topic_z$.
  • What is latent?
    • The topics are the latent variables.
  • What does "semantic" mean?
    • Each topic has a fixed distribution over the vocabulary; this distribution is the semantics of that latent variable.
  • What is the natural motivation of this model?
    • Given the word occurrences in an article and the word distribution of every topic, infer the topics of that article.
  • How do we do pLSA?
    • Ideally, use MLE to solve for the matrix $V$.
    • In practice, use EM to solve for both $U$ and $V$.

Model

Suppose the word occurrences of a document are probabilistic. This may seem weird: once a document is given, all of its words are given, so $p(word|document)$ should be 1; in other words, a document is defined by the words it contains. The model instead treats the document as being generated word by word.
We compute this probability by conditioning on the topics: how much of this document belongs to a specific topic, and how likely is each word to appear under that topic? From these two quantities we obtain $p(word|document)$. The quantity $p(topic|document)$ is what we actually want.

Objective

$$
p(w|d)=\sum_{z=1}^K p(w|z)p(z|d)
$$

$z$ represents topics, $w$ represents words, and $d$ represents documents.

We want to maximize $p(w|d)$ (the likelihood of the observed words) by choosing $p(z|d)$.

  • How do we know $p(w|z)$? Is it given ahead of time?
    • Ideally, yes; in practice, no. We also need to solve for this matrix.
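
As a tiny worked example of this mixture, with two topics and made-up numbers:

$$
p(w|d)=p(w|z_1)p(z_1|d)+p(w|z_2)p(z_2|d)=0.1\cdot 0.6+0.3\cdot 0.4=0.18
$$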

Data format

A matrix $D: M\times N$, where $M$ is the total number of distinct words and $N$ is the total number of documents. $D_{ij}$ is the number of occurrences of word $w_i$ in document $d_j$ (or the transpose of this, depending on convention).

  • In real-life cases, will $M$ be very large? How should we decide $N$, i.e. how many documents to use?

To restate this problem from a matrix perspective, we

  1. decompose $D$ as $UV$

  2. assuming $U$ is given, choose $V$ to maximize the log-likelihood of $D$ (writing $x_{ij}=D_{ij}$ for the count of word $w_i$ in document $d_j$):

    $$
    \max\sum_{i=1}^{M}\sum_{j=1}^{N}x_{ij}\log p(w_i|d_j)
    $$

  • $U: M\times T$, where $T$ is the number of topics. This matrix is assumed given and stores the word distribution of each topic.
  • $U_{ij}$ is the probability that $word_i$ appears under $topic_j$; it must be non-negative, $U_{ij}\geq 0$.
  • The word distribution under a specific topic must sum to 1: $\sum_{i=1}^M U_{ij}=1$.
  • $V: T\times N$. This is the matrix we want to choose; by choosing it we maximize the log-likelihood of $D$. $V_{ij}$ is the proportion of $topic_i$ in $document_j$: a document is assumed to be a mixture of topics. $V$ must also satisfy (see the numerical sketch after this list)
    • non-negativity: $V_{ij}\geq 0$
    • columns summing to 1: $\sum_{i=1}^T V_{ij}=1$.
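
A minimal numerical sketch of this factorization view, with made-up sizes; the factors are normalized so that their columns sum to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T = 1000, 50, 10           # words, documents, topics (made-up sizes)

# Column-stochastic factors: each topic's word distribution, each document's topic mixture.
U = rng.random((M, T)); U /= U.sum(axis=0, keepdims=True)   # U_ij >= 0, columns sum to 1
V = rng.random((T, N)); V /= V.sum(axis=0, keepdims=True)   # V_ij >= 0, columns sum to 1

P = U @ V                        # P_ij = p(w_i | d_j) = sum_z p(w_i|z) p(z|d_j)
X = rng.poisson(5, size=(M, N))  # fake count matrix standing in for D

log_likelihood = np.sum(X * np.log(P))
print(log_likelihood)
```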

Challenge

The challenge is that $U$ is not given; we have no prior knowledge of it. To solve this problem, we

  • introduce one more variable $q_{zij}$, the probability that word $i$ in document $j$ is generated by topic $z$, with $\sum_z q_{zij}=1$
  • use the EM (expectation-maximization) method (see the sketch after this list).
    • Expectation step: solve for $q$. $q_{zij}=\frac{p(w_i|z)p(z|d_j)}{\sum_{k=1}^T p(w_i|k)p(k|d_j)}$
    • Maximization step: solve for $U$ and $V$.
      • $u_{iz}=\frac{\sum_j x_{ij}q_{zij}}{\sum_i\sum_j x_{ij}q_{zij}}$: the probability that topic $z$ generates word $i$ is the expected "number" of occurrences of word $i$ generated by topic $z$ across all documents, divided by the expected "number" of all words generated by topic $z$ across all documents.
      • $v_{zj}=\frac{\sum_i x_{ij}q_{zij}}{\sum_i x_{ij}}$: the proportion of topic $z$ in document $j$ is the expected number of words generated by topic $z$ in document $j$, divided by the total number of words in document $j$.
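
A compact NumPy sketch of these EM updates (a toy implementation under the notation above, not the original course code; `plsa_em` is a hypothetical helper name):

```python
import numpy as np

def plsa_em(X, T, n_iters=100, seed=0):
    """Toy pLSA via EM. X: (M, N) count matrix, T: number of topics.
    Returns U (M, T) with p(w_i|z) in its columns and V (T, N) with p(z|d_j) in its columns."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, T)); U /= U.sum(axis=0, keepdims=True)
    V = rng.random((T, N)); V /= V.sum(axis=0, keepdims=True)

    for _ in range(n_iters):
        # E-step: q[z, i, j] = p(w_i|z) p(z|d_j) / sum_k p(w_i|k) p(k|d_j)
        q = U.T[:, :, None] * V[:, None, :]           # shape (T, M, N)
        q /= q.sum(axis=0, keepdims=True) + 1e-12

        # M-step: re-estimate U and V from expected counts x_ij * q_zij
        weighted = X[None, :, :] * q                  # shape (T, M, N)
        U = weighted.sum(axis=2).T                    # (M, T): sum over documents j
        U /= U.sum(axis=0, keepdims=True) + 1e-12
        V = weighted.sum(axis=1)                      # (T, N): sum over words i
        V /= X.sum(axis=0, keepdims=True) + 1e-12
    return U, V

# Usage on a small random count matrix:
X = np.random.default_rng(1).poisson(2, size=(30, 8))
U, V = plsa_em(X, T=3)
```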

Remarks

  • Convergence is guaranteed.
  • But not necessarily to the global optimum.
  • The result is influenced by the initial values $U_0$ and $V_0$, so the solution is not unique.

LDA: Latent Dirichlet Allocation

  • What is different from pLSA?
    • Here, we take a different attitude toward the word-topic matrix (the essence) and the topic-document matrix (a nuisance), while in pLSA the two matrices are treated on the same level.
    • Also, LDA makes it convenient to do inference for new documents.
  • Where does the "Dirichlet" show up?
    • In the document-topic matrix, we assume the topic proportions of a document follow a Dirichlet distribution.
    • In the word-topic matrix, we assume the word distribution of a topic follows a Dirichlet distribution.
  • What does "allocation" mean?
    • Topics are allocated to documents according to a Dirichlet distribution.

zhihu

Dirichlet Distribution

  • Why do we use the Dirichlet distribution here? Is there a computational advantage, or is it closer to real life?
    • The Dirichlet distribution describes the probabilities of $K$ classes: it is a distribution over $K$ continuous variables that are non-negative and sum to 1, and it takes $K$ parameters $\alpha_1,\dots,\alpha_K$.
    • We use it because it is well suited to modeling a distribution over "probabilities".
    • It is the "simplest" conjugate distribution. This is very nice: it means the prior and the posterior come from the same distribution family.
      • The Gaussian distribution is the conjugate prior/posterior (for the mean) with respect to a Gaussian likelihood.
      • With a multinomial likelihood, the Dirichlet is the choice! (See the sketch after this list.)
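
A small NumPy sketch of this conjugacy: starting from a Dirichlet prior and multinomial counts, the posterior is again a Dirichlet with the counts added to the alphas (the parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3
alpha = np.array([1.0, 2.0, 3.0])     # Dirichlet prior parameters (made-up)

theta = rng.dirichlet(alpha)          # one sample: K non-negative values summing to 1
counts = rng.multinomial(100, theta)  # multinomial observations given theta

# Conjugacy: the posterior over theta is Dirichlet(alpha + counts).
alpha_posterior = alpha + counts
posterior_mean = alpha_posterior / alpha_posterior.sum()
print(theta, counts, posterior_mean)
```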

Model

  • How do we solve for the matrices $U$ and $V$?
    • In LDA, there are two steps:
      • obtain $V$; many methods are available:
        • MCMC
        • variational EM
      • for each new document's count vector $x = (x_1,x_2,\cdots, x_M)$, the word counts over all possible words, find its topic vector $u=(u_1, u_2, \cdots, u_K)$.

Since the vector $u$ is not fixed but in some sense random, we assume it follows a Dirichlet distribution; by conjugacy with the multinomial word counts, the posterior over $u$ given $x$ is again a Dirichlet. We then apply the Bayesian approach to infer $u$ from $x$.
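
A minimal usage sketch, assuming scikit-learn's LatentDirichletAllocation (which implements a variational method); the corpus and parameter values are made up:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stocks and markets moved after the earnings report",
    "the team won the match in the final minutes",
    "new results on matrix factorization and topic models",
]

# Build the word-count matrix, then fit LDA with K = 2 topics.
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic mixtures
word_topics = lda.components_       # per-topic word weights (unnormalized)

# For a new document (vectorized the same way), lda.transform() gives
# its topic vector without refitting the model.
```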

NMF: Non-Negative Matrix Factorization

  • What is NMF?
    • All entries of the factor matrices are non-negative (and, in the probabilistic setting above, L1-normalized).
  • When is NMF applied?
    • To approximate a probabilistic matrix while reducing its dimension.

Objective

  • A log-likelihood objective, used when modeling probabilities; that is exactly pLSA.
  • A quadratic (least-squares) cost (see the sketch after this list).
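
A minimal sketch of the quadratic-cost variant, assuming scikit-learn's NMF (pLSA corresponds instead to the log-likelihood / KL-type objective); sizes and data are made up:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(2, size=(30, 8)).astype(float)   # toy non-negative count matrix

# Factorize X ~ W H with non-negative factors; the Frobenius norm is the quadratic cost.
nmf = NMF(n_components=3, init="random", random_state=0, max_iter=500)
W = nmf.fit_transform(X)     # (30, 3), non-negative
H = nmf.components_          # (3, 8), non-negative

reconstruction_error = np.linalg.norm(X - W @ H)
print(reconstruction_error)
```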
