Objectives
- Understand how information-theoretic concepts can be used to address the overfitting problem in community detection.
- Is this limited to community detection, or can it be generalized to other fields?
Questions
how is the entropy of a probability mass function defined?
- $H(x)=-\sum_i p(x_i)\log p(x_i)$
how is the entropy of a statistical ensemble defined?
- using the distribution
What is the entropy of a micro-canonical statistical ensemble?
- the log of the number of microstates
how can entropy be used for community detection based on the stochastic block model?
- define description length and minimize it.
how is the description length of stochastic block model defined?
$$
H(M,z)=\log Z(M,z) = \log \prod_{1\leq k\leq l\leq B}{N_{kl}\choose M_{kl}}=\sum_{1\leq k\leq l\leq B}\log {N_{kl}\choose M_{kl}}
$$
Explain the basic idea behind flow compression.
- model the flow on the network as a random walk and minimize the description length of the walk's trajectory.
How is the MapEquation defined? How is it used in InfoMap?
- $L(z) = qH(Q)+\sum_ip_iH(P_i)$
- as a guide (the objective function to be minimized)
How can we calculate the MapEquation for an undirected and unweighted network?
- stationary visit probability of each edge is $1/m$, of each node $d_v/2m$
How can we generalize InfoMap to directed and weighted network?
- compute the stationary visit probabilities from eigenvectors
How can we quickly discover a community mapping that minimizes the MapEquation?
- heuristic algorithms, e.g. simulated annealing, greedy optimization.
Keywords
- entropy
- description length
- flow compression
- InfoMap
Entropy
Definition of Shannon entropy
$$
H(x)=-\sum_{i} P(X=i)\log P(X=i)
$$
- meaning: describe the amount of information needed to know the value of $x$.
- the higher $H(x)$ is, the more bits of information (beyond knowledge of $x$'s own distribution) we need to determine the value of $x$.
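As a quick sanity check of the definition, here is a minimal Python sketch (the function name `shannon_entropy` is just illustrative) that computes the entropy of a probability mass function in bits:
```python
from math import log2

def shannon_entropy(pmf):
    """Shannon entropy (in bits) of a probability mass function given as a list of probabilities."""
    return -sum(p * log2(p) for p in pmf if p > 0)

# A fair coin needs 1 bit per outcome, a biased coin less.
print(shannon_entropy([0.5, 0.5]))   # 1.0
print(shannon_entropy([0.9, 0.1]))   # ~0.47
```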
Examples
$G(n,p)$ model
$$
H(n,p)=-\sum_{m=0}^{D}T\, p(n, m)\log p(n,m)
$$
where $T={D \choose m}$ is the number of graphs with exactly $m$ edges and $D ={n+1\choose 2}$ is the number of possible node pairs
$$
p(n,m)=p^m(1-p)^{D-m}
$$
$G(n,m)$ model
$$
H(n,m)=-T p_0\log p_0
$$
$$
p_0=\frac{1}{T}
$$
$$
\Rightarrow H(n,m)=-\log p_0=\log T
$$
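Since all $T$ graphs with exactly $m$ edges are equiprobable, $G(n,m)$ is already a micro-canonical ensemble and its entropy is simply $\log T$; a one-function check (name illustrative):
```python
from math import comb, log2

def gnm_entropy(n, m):
    """Entropy (bits) of the micro-canonical G(n,m) ensemble: log of the number of microstates."""
    D = comb(n + 1, 2)           # possible node pairs, as defined above
    return log2(comb(D, m))      # H(n,m) = log T with T = C(D, m)

print(gnm_entropy(5, 4))         # log2(C(15, 4)) ~ 10.4 bits
```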
Micro-canonical ensemble
All microstates are equiprobable.
The entropy of the ensemble is the logarithm of the partition function, i.e. the number of possible microstates.
the likelihood of an observed realization is simply the inverse of the number of microstates
the smaller this number, the larger the likelihood and the smaller the entropy
MLE = minimize entropy
$$
H(\theta)=\log Z(\theta)
$$
$Z(\theta)$ is the number of microstates.
stochastic block model
Modify the block matrix $M$ so that $M_{ij}$ represents the number of edges between block $i$ and block $j$, rather than the probability of an edge appearing. This turns the stochastic block model into a micro-canonical ensemble.
We have two parameters: the block matrix $M$ and the block assignment vector $z$. They are specified independently.
$N_{kl}$ is determined by the block assignment vector $z$: once we know $z$, we know how many nodes are in each block and can compute the number of possible node pairs between blocks $k$ and $l$.
$E_{kl}$ is determined by $z$ and the topology of the network: we combine each node's block assignment with its links to other nodes to count the edges observed between blocks $k$ and $l$.
Note, however, that not every combination of $M$ and $z$ is consistent; e.g. $M_{kl} > N_{kl}$ is a contradiction.
the likelihood is based on three objects:
- the block matrix $M$
- the block assignment vector $z$
- the known network (with link information)
$$
H(M,z)=\log Z(M,z)= \log \prod_{1\leq k\leq l\leq B}{N_{kl}\choose M_{kl}}=\sum_{1\leq k\leq l\leq B}\log {N_{kl}\choose M_{kl}}
$$
The maximum likelihood given $z$ is obtained by setting $\hat{M}_{kl}=E_{kl}$, the only choice consistent with the observed network; the corresponding entropy is
$$
\hat{H}(z)=\sum_{1\leq k\leq l\leq B}\log {N_{kl}\choose E_{kl}}
$$
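A minimal sketch of how $\hat{H}(z)$ could be computed for a simple undirected graph from an edge list and a block assignment $z$; the counting convention for within-block pairs ($N_{kk}={n_k\choose 2}$, i.e. no self-loops) is an assumption here, and logs are base 2 so the result is in bits:
```python
from math import comb, log2
from collections import Counter

def sbm_entropy(edges, z, B):
    """H_hat(z) = sum over block pairs k<=l of log2 C(N_kl, E_kl).

    edges: list of undirected edges (u, v) on nodes 0..n-1
    z:     block assignment, z[v] in 0..B-1
    B:     number of blocks
    Assumes a simple graph without self-loops, so N_kk = C(n_k, 2).
    """
    n_k = Counter(z)                        # nodes per block
    E = Counter()                           # observed edges per block pair
    for u, v in edges:
        k, l = sorted((z[u], z[v]))
        E[(k, l)] += 1
    H = 0.0
    for k in range(B):
        for l in range(k, B):
            N_kl = comb(n_k[k], 2) if k == l else n_k[k] * n_k[l]   # possible node pairs
            H += log2(comb(N_kl, E[(k, l)]))
    return H

# Toy example: two triangles connected by one edge, assigned to two blocks.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
z = [0, 0, 0, 1, 1, 1]
print(sbm_entropy(edges, z, B=2))           # ~3.17 bits
```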
Description length
This is a trade-off between explanatory power and complexity, like AIC (interestingly, AIC is also based on information theory). It works by adding a penalty for the parameters. The nice thing about this criterion is that the number can be directly interpreted as bits of information.
Definition
$$
\lambda(z)=\hat{H}(z)+\Delta(z)
$$
- $\Delta(z)$: the bits of information needed to encode the parameters.
Example
For the stochastic block model, the parameter cost can be approximated, giving
$$
\lambda(z)\approx\hat{H}(z)+m\cdot h(x)+n\cdot\log(B),\quad h(x)=(1+x)\log(1+x)-x\log(x)
$$
$$
x= \frac{B(B+1)}{2m}
$$
where $B$ is the number of blocks.
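Putting the fitted entropy and the approximate parameter cost together gives a score that can be compared across different numbers of blocks $B$; a sketch (base-2 logs throughout, with the fitted entropy $\hat{H}(z)$ passed in as a number):
```python
from math import log2

def description_length(H_hat, n, m, B):
    """Approximate description length lambda(z) = H_hat(z) + m*h(x) + n*log2(B).

    H_hat: fitted entropy of the SBM for the partition z (in bits)
    n, m:  number of nodes and edges
    B:     number of blocks used by z
    """
    x = B * (B + 1) / (2 * m)
    h = (1 + x) * log2(1 + x) - x * log2(x)     # h(x) = (1+x)log(1+x) - x log x
    return H_hat + m * h + n * log2(B)

# With more blocks the entropy term shrinks but the parameter cost grows,
# so minimizing lambda(z) picks a parsimonious number of communities.
# H_hat is the toy value from the sketch above (two triangles, B = 2).
print(description_length(H_hat=3.17, n=6, m=7, B=2))
```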
Flow compression
Idea
use the community structure (clusters) to compress the description of the flow (a random walk) on the network
Hierarchical coding scheme
Huffman code
prefix coding
Source coding theorem
the minimum per-symbol length of a lossless code is given by the Shannon entropy
Huffman coding is asymptotically optimal for $n$ i.i.d. symbols with known probability mass function $P$
$$
L_n\to n\cdot H(P)
$$
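To make the source coding theorem concrete, here is a small sketch that builds a Huffman prefix code with `heapq` and compares its expected per-symbol length to the Shannon entropy; the symbol probabilities are made up for illustration:
```python
import heapq
from math import log2

def huffman_code(pmf):
    """Build a prefix (Huffman) code for a dict {symbol: probability}; returns {symbol: codeword}."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

pmf = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(pmf)
avg_len = sum(pmf[s] * len(w) for s, w in code.items())
entropy = -sum(p * log2(p) for p in pmf.values())
print(code, avg_len, entropy)   # average length 1.75 bits = entropy for this dyadic pmf
```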
MapEquation
Definition
The minimal average per-step code length needed to describe the random walk, given a mapping of the nodes to $k$ communities
$$
L(\vec{z_k})=qH(Q)+\sum_{i=1}^k p_i H(P_i)
$$
- $q$: the per-step probability that the random walker switches between communities
- $p_i$: the per-step probability that the random walker moves within community $i$.
$$
q+\sum_{i=1}^kp_i=1
$$
Note: not $\sum_{i=1}^kp_i=1$.
- $H(Q)$: the bits needed to encode clusters.
- $H(P_i)$: the bits needed to encode nodes within cluster $P_i$
Example
undirected and unweighted network
each edge has the same visit probability $\frac{1}{m}$
node with degree $d_v$ has the stationary visit probability $\frac{d_v}{2m}$
so we can get $q$ and $p_i$ directly from edge counts (see the sketch below)
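Below is a minimal sketch of the two-level map equation for an undirected, unweighted network, following the Rosvall and Bergstrom formulation in which each community codebook also contains an exit codeword (so the codebook usage rates are $q_i+\sum_{v\in i}\pi_v$, a slightly different bookkeeping than the simplified $q+\sum_i p_i=1$ above); names are illustrative:
```python
from math import log2
from collections import defaultdict

def H(probs):
    """Shannon entropy (bits) of a normalized distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def map_equation(n, edges, z):
    """Two-level map equation L(z) for an undirected, unweighted network.

    n:     number of nodes (0..n-1)
    edges: list of undirected edges (u, v)
    z:     community assignment, z[v] is the community of node v
    """
    m = len(edges)
    deg = [0] * n
    exit_rate = defaultdict(float)   # q_i: per-step probability of leaving community i
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        if z[u] != z[v]:
            exit_rate[z[u]] += 1 / (2 * m)
            exit_rate[z[v]] += 1 / (2 * m)
    pi = [deg[v] / (2 * m) for v in range(n)]               # stationary visit probabilities
    communities = set(z)
    q = sum(exit_rate.values())                             # total switching probability
    L = q * H([exit_rate[i] / q for i in communities]) if q > 0 else 0.0
    for i in communities:
        members = [v for v in range(n) if z[v] == i]
        p_circ = exit_rate[i] + sum(pi[v] for v in members)  # usage rate of community i's codebook
        # Community codebook: one codeword per member node plus one exit codeword.
        P_i = [pi[v] / p_circ for v in members] + [exit_rate[i] / p_circ]
        L += p_circ * H(P_i)
    return L

# Two triangles joined by a single edge: the "natural" partition yields the shortest code.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(map_equation(6, edges, [0, 0, 0, 1, 1, 1]))   # ~2.32 bits per step
print(map_equation(6, edges, [0, 1, 0, 1, 0, 1]))   # ~3.95 bits per step (worse partition)
print(map_equation(6, edges, [0, 0, 0, 0, 0, 0]))   # ~2.56 bits per step (single community)
```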
generalized to directed/weighted networks
Instead of using the relative frequencies of links within and across communities, we now actually have to calculate eigenvectors to find the stationary visitation probabilities of nodes and communities.
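A sketch of how those stationary visit probabilities could be obtained by power iteration on the row-stochastic transition matrix; the teleportation probability 0.15 is an assumption added here to guarantee convergence on arbitrary directed networks, not something prescribed by these notes:
```python
import numpy as np

def stationary_probs(weights, alpha=0.15, tol=1e-12, max_iter=10_000):
    """Stationary visit probabilities of a random walk on a directed, weighted network.

    weights: (n, n) array, weights[u, v] = weight of the link u -> v
    alpha:   teleportation probability (ensures a unique stationary distribution)
    """
    n = weights.shape[0]
    out_strength = weights.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; dangling nodes teleport uniformly.
    T = np.where(out_strength > 0, weights / np.where(out_strength == 0, 1, out_strength), 1 / n)
    pi = np.full(n, 1 / n)
    for _ in range(max_iter):
        new_pi = alpha / n + (1 - alpha) * (pi @ T)   # power-iteration step with teleportation
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi

# Toy directed, weighted network on 3 nodes.
W = np.array([[0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
print(stationary_probs(W))   # visit probabilities summing to 1
```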
Detection algorithms
InfoMap
find community assignment (with any number of communities) s.t.
$$
\hat{\vec{z}}=\arg\min_{\vec{z}} L(\vec{z})
$$
Heuristic algorithms to find/discover a good assignment:
- Simulated annealing
- Greedy optimization
- Genetic algorithms
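As an illustration of the greedy-optimization idea (not InfoMap's actual search, which uses a more elaborate Louvain-style aggregation scheme), a generic local-move heuristic: repeatedly try moving single nodes to other communities and keep any move that lowers the objective. Any objective can be plugged in for $L(\vec{z})$; the toy objective below (number of inter-community edges) is only there to make the sketch runnable, and unlike the MapEquation it is trivially minimized by a single community.
```python
import random

def greedy_minimize(objective, z, n_communities, max_sweeps=50, seed=0):
    """Local-move heuristic: accept any single-node reassignment that lowers objective(z)."""
    rng = random.Random(seed)
    z = list(z)
    best = objective(z)
    for _ in range(max_sweeps):
        improved = False
        for v in rng.sample(range(len(z)), len(z)):      # visit nodes in random order
            current = z[v]
            for c in range(n_communities):
                if c == current:
                    continue
                z[v] = c
                score = objective(z)
                if score < best:
                    best, current, improved = score, c, True
            z[v] = current                               # keep the best community found for v
        if not improved:
            break
    return z, best

# Toy objective standing in for L(z): number of edges that run between communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cut_size = lambda z: sum(z[u] != z[v] for u, v in edges)
print(greedy_minimize(cut_size, [0, 1, 0, 1, 0, 1], n_communities=2))
```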