CN11 Model selection


Objectives

  • how information-theoretic concepts can be used to solve the overfitting problem in community detection.
    • only in community detection? Can it be generalized to other fields?

Questions

  • how is the entropy of a probability mass function defined?

    • $H(x)=-\sum_i p(x_i)\log p(x_i)$
  • how is the entropy of a statistical ensemble defined?

    • via the probability distribution over the microstates of the ensemble
  • What is the entropy of a micro-canonical statistical ensemble?

    • the logarithm of the number of microstates
  • how can entropy be used for community detection based on the stochastic block model?

    • define description length and minimize it.
  • how is the description length of the stochastic block model defined?
    $$
    H(M,z)=\log Z(M,z) = \log \prod_{1\leq k\leq l\leq B}{N_{kl}\choose M_{kl}}
    $$

    $$
    =\sum_{1\leq k\leq l\leq B}\log {N_{kl}\choose M_{kl}}
    $$

  • Explain the basic idea behind flow compression.

    • describe a random walk on the network and minimize its description length.
  • How is the MapEquation defined? How is it used in InfoMap?

    • $L(z) = qH(Q)+\sum_ip_iH(P_i)$
    • as the objective function that guides the community search
  • How can we calculate the MapEquation for an undirected and unweighted network?

    • stationary probability of edge, $1/m$
  • How can we generalize InfoMap to directed and weighted networks?

    • eigenvectors
  • How can we quickly discover a community mapping that minimizes the MapEquation?

    • heuristic algorithms, e.g. simulated annealing, greedy optimization.

Keywords

  • entropy
  • description length
  • flow compression
  • InfoMap

Entropy

Definition of Shannon entropy

$$
H(X)=-\sum_{i} P(X=i)\log P(X=i)
$$

  • meaning: the amount of information needed to determine the value of $X$.
    • the higher $H(X)$ is, the more bits of information (beyond knowing $X$'s distribution) we need to determine the value of $X$.
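
As a quick numerical check, here is a minimal Python sketch of this definition using base-2 logarithms (the function name is just for illustration):

```python
import math

def shannon_entropy(pmf):
    """H(X) = -sum_i P(X=i) * log2 P(X=i) for a probability mass function."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(shannon_entropy([0.9, 0.1]))   # biased coin: ~0.47 bits
```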

Examples

$G(n,p)$ model

$$
H(n,p)=-\sum_{m=0}^{D}{D \choose m}\, p(n, m)\log p(n,m)
$$

where ${D \choose m}$ is the number of microstates with exactly $m$ edges, $D ={n+1\choose 2}$ is the number of possible edges, and
$$
p(n,m)=p^m(1-p)^{D-m}
$$
is the probability of a single microstate with $m$ edges.
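
A small sketch evaluating this sum for a toy case, assuming $D={n+1\choose 2}$ as above (`entropy_gnp` is an illustrative name):

```python
import math

def entropy_gnp(n, p):
    """H(n, p): sum over edge counts m, weighting the probability
    p^m (1-p)^(D-m) of a single microstate by the C(D, m) graphs with m edges."""
    D = math.comb(n + 1, 2)                 # number of possible edges, as above
    H = 0.0
    for m in range(D + 1):
        p_m = p**m * (1 - p)**(D - m)       # probability of one specific microstate
        if p_m > 0:
            H -= math.comb(D, m) * p_m * math.log2(p_m)
    return H

print(entropy_gnp(5, 0.3))   # entropy in bits
```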

$G(n,m)$ model

$$
H(n,m)=-T p_0\log p_0
$$

where $T={D \choose m}$ is the number of microstates, each occurring with probability
$$
p_0=\frac{1}{T}
$$

$$
\Rightarrow H(n,m)=-\log p_0=\log T
$$
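
And the corresponding check for $G(n,m)$, where the entropy reduces to $\log T$:

```python
import math

def entropy_gnm(n, m):
    """H(n, m) = log T, with T = C(D, m) equiprobable microstates."""
    D = math.comb(n + 1, 2)                 # number of possible edges, as above
    return math.log2(math.comb(D, m))

print(entropy_gnm(5, 4))   # log2 C(15, 4) ≈ 10.4 bits
```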

Micro-canonical ensemble

  • All microstates are equiprobable.

  • The entropy of the ensemble is the logarithm of the partition function, i.e. the number of possible microstates.

  • the likelihood is simply the inverse of the number of realizations.

    • the smaller the number of realizations, the larger the likelihood, and the smaller the entropy

    • MLE = minimize entropy
      $$
      H(\theta)=\log Z(\theta)
      $$

    • $Z(\theta)$ is the number of instances (microstates) compatible with $\theta$.

Stochastic block model

Modify the block matrix $M$ so that $M_{ij}$ represents the number of edges between block $i$ and block $j$, rather than the probability of an edge appearing. This turns the stochastic block model into a micro-canonical ensemble.

We have two parameters, $M$ and $z$, which are specified independently.

  • $N_{kl}$ is determined by the block assignment vector $z$: once we know $z$, we know how many nodes are in each block and can compute the number of possible edges between blocks $k$ and $l$.

  • $E_{kl}$ is determined by $z$ together with the topology of the network: we count the observed edges between blocks $k$ and $l$ by combining each node's block assignment with its links.

  • $M$: note that not every $M$ is consistent with a given $z$; a contradiction arises, e.g., when $M_{kl} > N_{kl}$.

  • the likelihood is based on three objects:

    • the block matrix $M$
    • the block assignment vector $z$
    • the known network (with link information)

    $$
    H(M,z)=\log Z(M,z)= \log \prod_{1\leq k\leq l\leq B}{N_{kl}\choose M_{kl}}
    $$

    $$
    =\sum_{1\leq k\leq l\leq B}\log {N_{kl}\choose M_{kl}}
    $$

Given $z$, the likelihood is maximized by setting $M_{kl}=E_{kl}$ (only then is the observed network one of the microstates), which gives
$$
\hat{H}(z)=\sum_{1\leq k\leq l\leq B}\log {N_{kl}\choose E_{kl}}
$$
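
A minimal sketch of this computation for a toy network and partition, assuming a simple undirected graph without self-loops (so $N_{kk}={n_k\choose 2}$ and $N_{kl}=n_k n_l$ for $k\neq l$); the helper name and edge-list format are just for illustration:

```python
import math
from collections import Counter

def sbm_entropy(edges, z, B):
    """Micro-canonical SBM entropy Ĥ(z) = sum_{k<=l} log2 C(N_kl, E_kl),
    assuming a simple undirected graph without self-loops.
    edges: list of (u, v) pairs; z: dict node -> block label in {0, ..., B-1}."""
    sizes = Counter(z.values())                 # number of nodes per block
    E = Counter()                               # observed edges per block pair
    for u, v in edges:
        k, l = sorted((z[u], z[v]))
        E[(k, l)] += 1
    H = 0.0
    for k in range(B):
        for l in range(k, B):
            n_k, n_l = sizes.get(k, 0), sizes.get(l, 0)
            N_kl = math.comb(n_k, 2) if k == l else n_k * n_l   # possible edges
            H += math.log2(math.comb(N_kl, E.get((k, l), 0)))
    return H

# toy example: two triangles joined by a single edge, with the "natural" partition
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
z = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(sbm_entropy(edges, z, B=2))   # ≈ 3.17 bits: only the inter-block edge is uncertain
```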

Description length

This is a trade-off between explanatory power and complexity, similar to AIC (interestingly, AIC is also rooted in information theory). It works by adding a penalty for the parameters. The advantage of this criterion is that the resulting number can be directly interpreted as bits of information.

Definition

$$
\lambda(z)=\hat{H}(z)+\Delta(z)
$$

  • $\Delta(z)$: the number of bits of information needed to encode the parameters.

Example

For the stochastic block model, an approximation of the parameter cost is
$$
\Delta(z)\approx m\cdot h(x)+n\log B,\quad h(x)=(1+x)\log(1+x)-x\log(x)
$$

$$
x= \frac{B(B+1)}{2m}
$$

so that $\lambda(z)=\hat{H}(z)+\Delta(z)$.
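
A minimal sketch of this approximation, in bits ($n$ nodes, $m$ edges, $B$ blocks; `delta_parameters` is an illustrative name):

```python
import math

def delta_parameters(n, m, B):
    """Approximate parameter cost Δ(z) ≈ m·h(x) + n·log2(B),
    with h(x) = (1+x)·log2(1+x) - x·log2(x) and x = B(B+1)/(2m)."""
    x = B * (B + 1) / (2 * m)
    h = (1 + x) * math.log2(1 + x) - x * math.log2(x)
    return m * h + n * math.log2(B)

# e.g. the toy network from the previous sketch: n = 6 nodes, m = 7 edges, B = 2 blocks
print(delta_parameters(6, 7, 2))   # parameter cost in bits; add Ĥ(z) to get λ(z)
```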

Flow compression

Idea

use the community structure (clusters) to compress the description of the flow of a random walk on the network

Hierarchical coding scheme

Huffman code

prefix-free coding: no codeword is a prefix of another

Source coding theorem

the minimum per-symbol length of a lossless code is given by the Shannon entropy

Huffman coding is asymptotically optimal for $n$ i.i.d. symbols with known probability mass function $P$:
$$
L_n\to n\cdot H(P)
$$
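
A small self-contained sketch that builds a Huffman code with a heap and compares its average codeword length to the Shannon entropy (dyadic probabilities are chosen here, so the two coincide exactly):

```python
import heapq
import math

def huffman_code(pmf):
    """Build a prefix-free Huffman code for symbols with given probabilities."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)       # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(pmf)
avg_len = sum(p * len(code[s]) for s, p in pmf.items())
entropy = -sum(p * math.log2(p) for p in pmf.values())
print(code)                 # e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(avg_len, entropy)     # 1.75 bits per symbol in both cases (dyadic probabilities)
```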

MapEquation

Definition

The minimal average code length per step for a mapping of nodes into $k$ communities
$$
L(\vec{z_k})=qH(Q)+\sum_{i=1}^k p_i H(P_i)
$$

  • $q$: the per-step probability that the random walker switches between communities
  • $p_i$: the per-step probability that the random walker moves within community $i$.
    • $$
      q+\sum_{i=1}^kp_i=1
      $$

    • note that this is not $\sum_{i=1}^kp_i=1$

  • $H(Q)$: the bits needed to encode clusters.
  • $H(P_i)$: the bits needed to encode nodes within cluster $i$.

Example

undirected and unweighted network

each edge is traversed with the same probability $\frac{1}{m}$ per step

a node with degree $d_v$ has stationary visit probability $\frac{d_v}{2m}$

from these we can compute $q$ and the $p_i$ (see the sketch below)
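
A sketch of the resulting two-level map equation for an undirected, unweighted network. It follows the standard Rosvall and Bergstrom formulation, in which each module codebook also contains the module's exit codeword, so the codebook use rates differ slightly from the simplified $p_i$ convention above; names and the toy network are illustrative:

```python
import math
from collections import defaultdict

def plogp(x):
    return x * math.log2(x) if x > 0 else 0.0

def map_equation(edges, z):
    """Two-level map equation for an undirected, unweighted network.
    edges: list of (u, v) pairs; z: dict node -> module label."""
    m = len(edges)
    deg = defaultdict(int)
    exit_edges = defaultdict(int)          # edges leaving each module
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        if z[u] != z[v]:
            exit_edges[z[u]] += 1
            exit_edges[z[v]] += 1
    p = {v: d / (2 * m) for v, d in deg.items()}           # node visit rates d_v / 2m
    modules = set(z.values())
    q_i = {i: exit_edges[i] / (2 * m) for i in modules}    # module exit rates
    q = sum(q_i.values())
    # index codebook: encode which module the walker enters next
    L = q * (-sum(plogp(q_i[i] / q) for i in modules)) if q > 0 else 0.0
    # module codebooks: member node visits plus the module's exit codeword
    for i in modules:
        members = [v for v in z if z[v] == i]
        p_use = q_i[i] + sum(p[v] for v in members)
        rates = [q_i[i]] + [p[v] for v in members]
        L += p_use * (-sum(plogp(r / p_use) for r in rates))
    return L

# toy example: two triangles joined by a single edge
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
print(map_equation(edges, {v: 0 if v < 3 else 1 for v in range(6)}))  # good partition, ~2.3 bits
print(map_equation(edges, {v: v % 2 for v in range(6)}))              # bad partition, larger L
```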

generalized to directed/weighted network

Instead of using the relative frequencies of links within and across communities, we now actually have to calculate eigenvectors to find the stationary visitation probabilities of nodes and communities.
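
A minimal power-iteration sketch for these stationary visit probabilities; note that InfoMap typically also adds a small teleportation probability (as in PageRank) to guarantee a unique stationary distribution on arbitrary directed networks, which is omitted here:

```python
import numpy as np

def stationary_visit_probabilities(W, tol=1e-12, max_iter=10_000):
    """Stationary visit probabilities of a random walk on a directed, weighted
    network, i.e. the leading left eigenvector of the row-normalised transition
    matrix, computed by power iteration.  W[i, j] = weight of link i -> j."""
    T = W / W.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    pi = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(max_iter):
        nxt = pi @ T
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return pi

# toy directed, weighted network (strongly connected, so the walk converges)
W = np.array([[0.0, 2.0, 1.0],
              [1.0, 0.0, 1.0],
              [2.0, 0.0, 0.0]])
print(stationary_visit_probabilities(W))   # node visit rates; module rates are sums over members
```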

Detection algorithms

InfoMap

find community assignment (with any number of communities) s.t.
$$
\hat{\vec{z}}=\arg\min_{\vec{z}} L(\vec{z})
$$

Heuristic algorithms (to find/discover a good assignment)

  • Simulated annealing
  • Greedy optimization
  • Genetic algorithms
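
As a rough illustration of the greedy strategy, here is a toy local-search sketch (not the actual InfoMap algorithm; all helper names are hypothetical) that repeatedly moves single nodes between neighbouring communities as long as a supplied objective, e.g. the map equation $L(\vec z)$, decreases:

```python
import random

def greedy_minimize(nodes, neighbors, objective, sweeps=20, seed=0):
    """Toy greedy local search: start from singleton communities and repeatedly
    move single nodes into a neighbour's community whenever that lowers objective(z)."""
    rng = random.Random(seed)
    z = {v: i for i, v in enumerate(nodes)}            # singleton partition
    for _ in range(sweeps):
        improved = False
        order = list(nodes)
        rng.shuffle(order)
        for v in order:
            start = z[v]
            best_c, best_val = start, objective(z)
            for c in {z[u] for u in neighbors[v]}:     # candidate communities
                z[v] = c
                val = objective(z)
                if val < best_val - 1e-12:
                    best_c, best_val = c, val
            z[v] = best_c
            if best_c != start:
                improved = True
        if not improved:                               # local optimum reached
            break
    return z
```

Combined with the `map_equation` sketch above, this could be called as `greedy_minimize(list(range(6)), neighbors, lambda z: map_equation(edges, z))`, where `neighbors` maps each node to its adjacent nodes.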
