Saturday, July 15, 2017

A small note on the mixing times of Markov chains

This is a short note written after reading the notes by Nathanael Berestycki; the objective is just to review some interesting stories about this subject and to prepare for the coming summer school. I decided to write it on my blog since this permits me to write down the ideas as quickly as possible without paying too much attention to the details and technical parts.

Basic notions of mixing time

First, we have to talk about the motivation for studying this subject. In abstract measure theory, we study the topology of Radon measures on a metric space and the distances between them; the distance that induces weak-star convergence is the Lévy–Prokhorov distance. That theory is quite profound, while the discrete-state situation is much easier since the relevant topologies coincide, and we use the total variation distance
$$\| \mu - \nu \| = \sup_{A \subset S} \mu(A) - \nu(A)$$
to represent the distance between two measures, which is simple and elegant. On the other hand, the study of this subject also arises from the emergence of many computer algorithms: we have to know whether they converge and, moreover, the quantitative behavior of the convergence. In particular, the cutoff phenomenon tells us that sometimes we have to run enough steps of the Markov chain to make sure it has converged.

There are other equivalent definitions of the total variation distance, such as
$$\begin{eqnarray*}\| \mu - \nu \| &=& \sup_{A \subset S} \mu(A) - \nu(A) \\ &=& \frac{1}{2}\sum_{x\in S} |\mu(x) - \nu(x)| \end{eqnarray*}$$
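As a quick sanity check, here is a minimal numerical sketch in Python (with numpy; the variable names are mine) verifying on a small state space that the supremum over sets and the half-$L^1$ sum agree:

```python
import numpy as np
from itertools import combinations

mu = np.array([0.1, 0.2, 0.3, 0.4])
nu = np.array([0.25, 0.25, 0.25, 0.25])

# sup over all subsets A of S of mu(A) - nu(A)
sup_form = max(mu[list(A)].sum() - nu[list(A)].sum()
               for r in range(len(mu) + 1)
               for A in combinations(range(len(mu)), r))

# half of the L^1 distance
half_l1 = 0.5 * np.abs(mu - nu).sum()

print(sup_form, half_l1)  # both equal 0.2
```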

Concerning the distance of a Markov chain from equilibrium: if the chain has an invariant measure $\pi$, then we use
$$d(t) = \sup_{x \in S}\|P^t(x, \cdot) -  \pi(\cdot)  \|$$
to describe the distance from the invariant measure. It often interacts with another quantity,
$$ \rho(t) = \sup_{x,y \in S}\|P^t(x, \cdot) -  P^t(y, \cdot) \| $$
since the two are equivalent:
$$ d(t) \leq \rho(t) \leq 2 d(t)$$.
The mixing time is defined as
$$t_{mix}(\epsilon) = \inf \{t : d(t) \leq \epsilon \}$$
and we usually use $t_{mix} = t_{mix}(1/e)$.
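A minimal sketch (Python/numpy; the chain and names are mine) that computes $d(t)$ by matrix powers and reads off $t_{mix}$ for the lazy random walk on $\mathbb{Z}/n\mathbb{Z}$, which reappears below:

```python
import numpy as np

n = 8
P = np.zeros((n, n))
for x in range(n):
    P[x, x] = 0.5                     # lazy: stay with probability 1/2
    P[x, (x + 1) % n] = 0.25          # step forward
    P[x, (x - 1) % n] = 0.25          # step backward
pi = np.full(n, 1.0 / n)              # invariant measure is uniform

def d(t):
    Pt = np.linalg.matrix_power(P, t)
    return max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(n))

t_mix = next(t for t in range(1, 10**6) if d(t) <= 1 / np.e)
print(t_mix, np.e * n**2)             # t_mix sits well below the bound e*n^2
```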

The convergence to the invariant measure follows from the Perron-Frobenius theorem for irreducible, aperiodic, finite-state Markov chains, and sometimes we also use Lyapunov functions to treat the infinite-state case. However, these criteria are not so sharp and give only a qualitative description.

Before finishing this section, we recall the cutoff phenomenon. Roughly speaking, it means that $d(t)$ falls drastically near the point $t_{mix}$. This phenomenon is a little hard to understand: until now, no universal method to analyze it has been found, and one studies models case by case. Moreover, sometimes it happens and sometimes it can be avoided, and we may not see it in simulations.

Coupling technique

To estimate the speed of convergence, a first technique is coupling, perhaps the most useful and powerful method, which applies in every case. A coupling of measures $\mu, \nu$ is a pair $(X,Y)$ where $X$ has law $\mu$ and $Y$ has law $\nu$. We remark that the marginal laws can never recover the joint distribution, so this condition does not fix the law of $(X,Y)$.

A third characterization of the total variation distance is
$$\| \mu - \nu \| = \inf_{\text{couplings } (X,Y)}\mathbb{P}(X \ne Y)$$.
This formula gives an upper bound on $d(t)$, and hence on $t_{mix}$, since for any coupling of the chains started from $x$ and $y$,
$$\begin{eqnarray*} d(t) \leq \rho(t) &=& \sup_{x,y \in S}\|P^t(x, \cdot) -  P^t(y, \cdot) \| \\&\leq& \sup_{x,y} \mathbb{P}(X_t \ne Y_t) \end{eqnarray*}$$
and a useful technique is to demand that once the two Markov chains meet, they move together; this implies
$$\boxed{d(t) \leq \sup_x \mathbb{P}_x(\tau_{X,Y} > t)}$$
where $\tau_{X,Y}$ is the coalescence time of the two chains.

One example is the lazy random walk on the cycle $\mathbb{Z} / n\mathbb{Z}$. The Markov chain stays in place with probability $1/2$ and moves forward or backward with probability $1/4$ each. We start two chains $(X,Y)$ and let a first coin decide which one moves and a second one decide the direction. Once they meet, they move together. In fact, the meeting of the two chains reduces to the gambler's ruin problem. Using the first moment (Markov's inequality),
$$d(t) \leq \sup_x \mathbb{P}_x(\tau_{X,Y} > t) \leq \frac{\sup_x \mathbb{E}_x(\tau_{X,Y})}{t} \leq \frac{n^2}{t}$$
we deduce that
$$t_{mix} \leq e n^2$$.
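A simulation sketch of this coupling (Python; function and variable names are mine): a first coin decides which chain moves, a second the direction, and we record the coalescence time, whose mean matches the gambler's ruin value $k(n-k)$:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 16

def coalescence_time(x0, y0):
    x, y, t = x0, y0, 0
    while x != y:
        step = rng.choice([-1, 1])    # second coin: the direction
        if rng.random() < 0.5:        # first coin: which chain moves
            x = (x + step) % n
        else:
            y = (y + step) % n
        t += 1
    return t

# started at distance k = n/2, gambler's ruin predicts E[tau] = k(n-k) = n^2/4
taus = [coalescence_time(0, n // 2) for _ in range(2000)]
print(np.mean(taus), n**2 / 4)
```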

At the end of this part, we give a small lemma about the sub-multiplicativity (equivalently, log-sub-additivity) of $\rho$:
$$\rho(s+t) \leq \rho(s) \, \rho(t)$$
The proof is easy with the coupling technique.
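A sketch of that proof: choose a coupling on $[0,s]$ attaining $\rho(s)$, and, on the event $\{X_s \neq Y_s\}$, restart an optimal coupling from the pair $(X_s, Y_s)$ for $t$ more steps; by the Markov property,
$$\rho(s+t) \leq \sup_{x,y}\mathbb{P}(X_{s+t} \neq Y_{s+t}) \leq \sup_{x,y}\mathbb{P}(X_s \neq Y_s) \, \sup_{x',y'}\mathbb{P}(X'_t \neq Y'_t) = \rho(s)\,\rho(t)$$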

Strong stationary time technique

A strong stationary time is a very special technique: it is a randomized stopping time $\tau$ such that at time $\tau$ the law of the chain is exactly the invariant law, independently of $\tau$. Thus
$$d(t) \leq \mathbb{P}(\tau > t)$$.

This idea sounds crazy, but models where it works do exist. One famous example is the top-to-random shuffle. In this model, the cards below the original bottom card are uniformly random, since a priori there is no order among them. Therefore, once the bottom card comes to the top and is inserted back into the deck, the deck becomes totally random. The mixing time is easy to compute and is the same as in the coupon collector problem: $t_{mix} \sim n\log n$.
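A simulation sketch (Python; the names are mine) of this strong stationary time: we follow the original bottom card and stop one step after it reaches the top, when it is reinserted uniformly at random:

```python
import numpy as np

rng = np.random.default_rng(7)

def strong_stationary_time(n):
    deck = list(range(n))                  # card n-1 starts at the bottom
    t = 0
    while True:
        t += 1
        top = deck.pop(0)                  # take the top card
        deck.insert(rng.integers(n), top)  # insert it at a uniform position
        if top == n - 1:                   # the original bottom card has just
            return t                       # gone back in: the deck is uniform

n = 52
taus = [strong_stationary_time(n) for _ in range(500)]
print(np.mean(taus), n * np.log(n))        # coupon-collector scale n log n
```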

However, this technique only applies in some specific cases and requires good observation.

Spectral technique: viewpoint from operators

The spectral method is especially powerful in the case of reversible Markov chains, as it relates the speed of convergence to the spectral gap. In this way the spectral gap $\gamma$, the convergence in total variation distance, and the expansion ratio are naturally related.

We know that an irreducible, aperiodic, finite, reversible Markov chain has a sequence of eigenvalues $1 = \lambda_1 > \lambda_2 \geq \lambda_3 \geq \cdots \geq \lambda_n \geq -1$. We can extract a family of orthogonal eigenvectors by diagonalization; however, we would like a basis orthogonal with respect to the inner product
$$\langle f, g\rangle_{\pi} = \sum_{x \in S} f(x)g(x)\pi(x)$$
instead of one orthogonal in the usual $L^2$ inner product. To realize this, we find an orthonormal basis $\phi_i$ of the operator $A$, where
$$A(x,y) = \sqrt{\frac{\pi(x)}{\pi(y)}}P(x,y) \Leftrightarrow A = D^{1/2}PD^{-1/2}$$
with $D$ the diagonal matrix with entries $\pi(x)$. By reversibility $A$ is symmetric, and it shares the same eigenvalues as $P$, that is,
$$A\phi_i = \lambda_i \phi_i$$.
Then, defining $f_i = D^{-1/2}\phi_i$, we have
$$\langle f_i, f_j\rangle_{\pi} = \langle \phi_i, \phi_j \rangle = \delta_{ij}$$
$$Pf_i = D^{-1/2}(D^{1/2}PD^{-1/2})\phi_i = D^{-1/2}A\phi_i = \lambda_i D^{-1/2}\phi_i = \lambda_i f_i$$
So $(f_i)$ is the orthonormal basis we wanted. The spectral decomposition tells us
$$A^t = \sum_{i=1}^n \lambda_i^t \phi_i \phi_i^T $$
and when we plug the $f_i$ in, we obtain
$$\boxed{\frac{P^t(x,y)}{\pi(y)} = \sum_{i=1}^n f_i(x) f_i(y) \lambda_i^t}$$
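A numerical sketch (Python/numpy; the small chain and all names are mine) of this construction, checking the boxed identity for a reversible chain:

```python
import numpy as np

# a small reversible chain on {0,1,2}
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])            # invariant and reversible for P

A = np.diag(pi**0.5) @ P @ np.diag(pi**-0.5) # A = D^{1/2} P D^{-1/2}, symmetric
lam, phi = np.linalg.eigh(A)                 # orthonormal eigenvectors of A
f = np.diag(pi**-0.5) @ phi                  # columns are f_i = D^{-1/2} phi_i

t = 5
lhs = np.linalg.matrix_power(P, t) / pi      # P^t(x,y) / pi(y)
rhs = f @ np.diag(lam**t) @ f.T              # sum_i f_i(x) f_i(y) lambda_i^t
print(np.allclose(lhs, rhs))                 # True
```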

One may wonder why we would like to find an expression like this. In fact, it is interesting since
$$\begin{eqnarray*}d(t) &=& \sup_{x \in S}\|P^t(x, \cdot) - \pi(\cdot)\| \\ &=& \sup_{x \in S}\frac{1}{2}\sum_{y \in S}\pi(y) \left|\frac{P^t(x,y)}{\pi(y) }- 1 \right| \\&=& \sup_{x \in S}\frac{1}{2}\left\|\sum_{i=2}^n f_i(x) f_i \lambda_i^t \right\|_{L^1(\pi)}  \\ &\leq& \sup_{x \in S}\frac{1}{2} \left\|\sum_{i=2}^n f_i(x) f_i \lambda_i^t \right\|_{L^2(\pi)} \\ &=& \sup_{x \in S}\frac{1}{2}\sqrt{\sum_{i=2}^n f_i^2(x) \lambda_i^{2t}} \end{eqnarray*}$$

Continuing the analysis, we obtain the inequality
$$\boxed{(t_{rel} - 1)\log\left(\frac{1}{2\epsilon}\right) \leq t_{mix}(\epsilon) \leq \log\left(\frac{1}{2\epsilon \sqrt{\pi_{min}}}\right)t_{rel}}$$
where the relaxation time is defined as
$$t_{rel} = \frac{1}{\gamma}$$
with $\gamma = 1 - \max(\lambda_2, |\lambda_n|)$ the (absolute) spectral gap.
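A sketch of where the upper bound comes from: using $\sum_{i=1}^n f_i^2(x) = 1/\pi(x)$ (take $t=0$, $y=x$ in the boxed identity) and writing $\lambda_* = \max(\lambda_2, |\lambda_n|) = 1-\gamma$, the $L^2$ bound above gives
$$d(t) \leq \frac{1}{2}\sqrt{\sum_{i=2}^n f_i^2(x) \lambda_*^{2t}} \leq \frac{\lambda_*^t}{2\sqrt{\pi_{min}}} \leq \frac{e^{-\gamma t}}{2\sqrt{\pi_{min}}}$$
so $d(t) \leq \epsilon$ as soon as $t \geq t_{rel}\log\left(\frac{1}{2\epsilon\sqrt{\pi_{min}}}\right)$.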

Spectral technique: viewpoint from geometry

The result obtained via the eigenvalues of the operator is very powerful since it gives both upper and lower bounds. However, computing the spectral gap is not so easy in general, so many other methods have been proposed to estimate the gap $\gamma$. Here we list two methods: via the optimal constant in the Poincaré inequality, and via Cheeger's inequality.

If we define the Dirichlet energy under the measure $\pi$ as
$$\begin{eqnarray*}E(f,f) &=& \sum_{e}Q(e) (\nabla f(e))^2\\ Q(x,y)  &=& \pi(x)P(x,y)\end{eqnarray*}$$,
then the classical discrete Poincaré inequality gives
$$\boxed{Var_{\pi}(f) \leq \frac{1}{\gamma} E(f,f)}$$.
The constant $1/\gamma$ is optimal, so if we prove such an inequality with another constant $C$, we necessarily have $t_{rel} \leq C$, and so we get an upper bound on the mixing time.

Another method comes from the famous Cheeger inequality. The expansion ratio of a set $A$ is
$$\Phi(A) = \frac{Q(A, A^c)}{\pi(A)}$$
and the expansion ratio of the chain is
$$\Phi_* = \inf_{\pi(A) \leq 1/2} \Phi(A)$$.
The famous Cheeger inequality tells us that
$$\frac{\Phi_*^2}{2} \leq \gamma  \leq 2\Phi_*$$

We remark that one good lower bound for the mixing time is
$$t_{mix} \geq \frac{1}{4\Phi_*}$$
which means that if the graph is poorly connected (has a bottleneck), the chain will take a long time to mix.
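A brute-force sketch (Python/numpy; exponential in the number of states, so only for tiny chains, and all names are mine) computing $\Phi_*$ for the lazy walk on the cycle and checking Cheeger's inequality against the spectral gap:

```python
import numpy as np
from itertools import combinations

n = 8                                   # lazy random walk on Z/n, pi uniform
P = np.zeros((n, n))
for x in range(n):
    P[x, x] = 0.5
    P[x, (x + 1) % n] = P[x, (x - 1) % n] = 0.25
pi = np.full(n, 1.0 / n)
Q = pi[:, None] * P                     # Q(x,y) = pi(x) P(x,y)

def expansion(A):
    Ac = [x for x in range(n) if x not in A]
    return Q[np.ix_(list(A), Ac)].sum() / pi[list(A)].sum()

phi_star = min(expansion(A)
               for r in range(1, n // 2 + 1)        # ensures pi(A) <= 1/2
               for A in combinations(range(n), r))

lam = np.linalg.eigvalsh(np.diag(pi**0.5) @ P @ np.diag(pi**-0.5))
gamma = 1 - np.sort(lam)[-2]            # spectral gap 1 - lambda_2
print(phi_star**2 / 2, gamma, 2 * phi_star)   # Phi^2/2 <= gamma <= 2 Phi
```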


Comparison method and continuous-time Markov chains

All the results on mixing times can also be converted to continuous time, where the jump times come from a Poisson process but the state space is still finite. We follow the same routine and get (for a transitive chain)
$$d(t) \leq \frac{1}{2}d_2(t) = \frac{1}{2}\sqrt{\sum_{j=2}^n e^{-2t\mu_j}} , \qquad \mu_j = 1-\lambda_j$$

The comparison theorem says that for two Markov chains $X,Y$ with the same invariant measure, if $E_X(f,f) \leq A\,E_Y(f,f)$ for all $f$, then $\mu_j^X \leq A \mu_j^Y$. This helps us analyze a Markov chain given the information of another chain that is already well understood.
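A sketch of why the comparison works (assuming, as above, that the two chains share the same invariant measure $\pi$): by the Courant-Fischer variational characterization,
$$\mu_j = \min_{\dim V = j} \; \max_{0 \neq f \in V} \frac{E(f,f)}{\|f\|^2_{L^2(\pi)}}$$
so $E_X(f,f) \leq A\, E_Y(f,f)$ for every $f$ immediately gives $\mu_j^X \leq A \mu_j^Y$.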


Coupling from the past

Propp and Wilson proposed an interesting algorithm called coupling from the past, which produces an exact sample from the invariant measure in some specific cases, for example the Ising model. The simulation requires a partial order on the state space that is preserved by the dynamics. Unlike ordinary MCMC, we run the chain from the past. The coupling technique says that, generally, the invariant measure is reached once a coupling has coalesced. So we check whether
$$G(-t,0)(-1) = G(-t,0)(1)$$
where $-1,1$ represent respectively the minimal and maximal elements and $G(-t,0)$ is the random update map from time $-t$ to time $0$. If the two agree, we take this value; if not, we prepend a further segment of simulation even earlier, until they coalesce. The monotonicity implies that it is as if we had simulated from $-\infty$ and all the trajectories had met.
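A minimal sketch of monotone CFTP (Python; the chain, a lazy reflecting walk on $\{0,\dots,n\}$ with uniform invariant law, and all names are my own choice): the key points are that the update map is monotone and that the randomness attached to each time step in the past is reused when we extend further back:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                   # states 0..n

def update(x, u):
    # monotone update: order-preserving in x for every fixed u
    if u < 0.25:
        return max(x - 1, 0)
    if u > 0.75:
        return min(x + 1, n)
    return x

def cftp():
    T, us = 1, []                        # us[k] drives the step at time -(k+1)
    while True:
        while len(us) < T:
            us.append(rng.random())
        lo, hi = 0, n                    # minimal and maximal states at time -T
        for k in range(T - 1, -1, -1):   # apply the steps at times -T, ..., -1
            lo, hi = update(lo, us[k]), update(hi, us[k])
        if lo == hi:                     # coalescence: an exact sample from pi
            return lo
        T *= 2                           # go further into the past

samples = [cftp() for _ in range(2000)]
print(np.bincount(samples, minlength=n + 1) / len(samples))  # ~ uniform
```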

Finally, we remark that although this algorithm terminates almost surely, its running time is a random variable, so there is a risk that it takes a long time before ending.


Cover time

The cover time is another concept similar to the mixing time: it is the time needed to visit every state at least once. The Matthews inequality gives the control
$$t_{cov} \leq \left(1 + \frac{1}{2} + \cdots + \frac{1}{n}\right) t_{hit}$$
where $t_{hit} = \max_{x,y} \mathbb{E}_x(\tau_y)$ is the maximal hitting time.
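A quick Monte Carlo sketch (Python; names are mine) for the simple random walk on the cycle $\mathbb{Z}/n\mathbb{Z}$, where $t_{hit} = \max_k k(n-k) = n^2/4$ for even $n$ and the exact expected cover time is $n(n-1)/2$, comfortably below the Matthews bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20

def cover_time():
    x, seen, t = 0, {0}, 0
    while len(seen) < n:
        x = (x + rng.choice([-1, 1])) % n
        t += 1
        seen.add(x)
    return t

est = np.mean([cover_time() for _ in range(500)])
matthews = sum(1.0 / k for k in range(1, n + 1)) * n**2 / 4
print(est, n * (n - 1) / 2, matthews)   # estimate, exact value, Matthews bound
```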

One surprising result is the relation between the cover time and the discrete Gaussian free field (DGFF) $h$ on the graph:
$$t_{cov} \sim |E| \left(\mathbb{E}\left(\max_{x \in S} h_x\right)\right)^2$$.

Others

We conclude that the mixing time has to be estimated carefully. The coupling method has a lot of flexibility but requires good observation and technique, while the spectral method looks more like the analysis of the heat equation. We can also reformulate the subject in the language of electrical networks, so that we can use the analysis of the network structure to estimate the mixing time.

Wednesday, July 12, 2017

The Devotion of Suspect X

Today I watched the film adapted from Keigo Higashino's novel online. It seems China, Japan, and Korea have all made adaptations; the one I watched was the Japanese version, which I found pretty good. Here I just jot down some casual impressions.

Perhaps because of my own field, I identified strongly with the male lead and could basically guess what he would do. There are quite a few people in this world who truly love mathematics and have really mastered it, but in the end only a handful make epoch-making achievements; most are merely participants in the whole enterprise. So mathematicians who feel their ambitions are unfulfilled and their years wasted, and those who finally face reality and become math teachers, actually form a rather large group. People who do mathematics are rational and simple, yet sometimes their feelings are inexpressibly intense; once their heads heat up they can be more fanatical than ordinary people. But the nature of the discipline means their communication skills, or their view of the subjective and objective worlds, always have some slight problems.

Where does the problem lie? From the mathematical point of view, theoretical frameworks and models can be constructed by oneself, and if applications are not considered, one can even ignore whether the framework and the model are reasonable. This is different from physicists: they may sometimes lack logical rigor, but they never lack sharp observation, and their ultimate purpose is still to explain phenomena; theory is merely an aid.

Once this is understood, reading Ishigami and Yukawa becomes much simpler. Ishigami is exactly that hot-headed mathematician with low emotional intelligence. He did so much, but did she understand? Did she accept it? Did she want it? He never asked anything and never tried to find out. The scheme he designed was, in a sense, a construction of his own, truly treating everyone as gears. I think Ishigami's pain at the end came partly from the helplessness of a one-sided devotion that could not be accepted, and perhaps partly from feeling that his teammate had been stupid?

Saying the teammate was stupid goes a bit far; I do not mean to disparage the female lead here. Everyone makes their own choices in life: if you accepted someone's love so full of injustice, on top of your own guilt, could you really live at ease?

Yukawa's decision was a helpless one. Many people say the two admired each other as kindred spirits, and that he revealed the truth in the end only to settle who was the better mind. I think the mutual admiration was certainly there, but his conversation with Utsumi already showed that he himself understood how cruel the truth was, and that he was torn about whether to let this construction stand. But physics itself seeks the truth of the facts; a system is not admitted to be right merely because it is logically rigorous and self-consistent. I think that is why he chose to expose this cruel scheme.

At the very end: I think this film is quite instructive for the intellectual and psychological education of mathematics students. To some extent, we cannot live only in our own world. People say love is the poison of mathematics, but it need not go that far. If one can give everything for love, even one's own future, why not choose instead for the two to live on well together? For example, Ishigami could have been honest about everything and chosen to walk the rest of the road with the mother and daughter, no matter what.