Basic notions of mixing time
First, let us discuss the motivation for studying this subject. In abstract measure theory, one studies the topology of Radon measures on a metric space and distances on them; the distance which metrizes weak-star convergence is the Lévy–Prokhorov distance. That theory is very profound, while the situation for a discrete state space is much simpler since the relevant topologies coincide, and we use the total variation distance
$$\| \mu - \nu \| = \sup_{A \subset S} \left(\mu(A) - \nu(A)\right)$$
to represent the distance between two measures, which is simple and elegant. On the other hand, the study of this subject arose with the emergence of many computer algorithms: we need to know whether they converge and, moreover, the quantitative behavior of the convergence. In particular, the cutoff phenomenon tells us that we sometimes have to run the Markov chain for enough steps to make sure it has converged.
There are other equivalent expressions for the total variation distance, such as
$$\begin{eqnarray*}\| \mu - \nu \| &=& \sup_{A \subset S} \left(\mu(A) - \nu(A)\right) \\ &=& \frac{1}{2}\sum_{x\in S} |\mu(x) - \nu(x)| \end{eqnarray*}$$
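As a sanity check, here is a tiny numerical verification (a sketch, on a made-up 3-point example) that the two expressions agree:

```python
from itertools import chain, combinations

# Hypothetical 3-point example: check that the sup-over-events form and
# the half-L1 form of the total variation distance agree.
mu = {0: 0.5, 1: 0.3, 2: 0.2}
nu = {0: 0.2, 1: 0.3, 2: 0.5}
S = list(mu)

half_l1 = 0.5 * sum(abs(mu[x] - nu[x]) for x in S)
sup_A = max(
    sum(mu[x] - nu[x] for x in A)
    for A in chain.from_iterable(combinations(S, k) for k in range(len(S) + 1))
)
print(half_l1, sup_A)  # both equal 0.3
```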
For a Markov chain with invariant measure $\pi$, we use
$$d(t) = \sup_{x \in S}\|P^t(x, \cdot) - \pi(\cdot) \|$$
to describe the distance from the invariant measure at time $t$. It is often used together with another quantity
$$ \rho(t) = \sup_{x,y \in S}\|P^t(x, \cdot) - P^t(y, \cdot) \| $$
since the two are equivalent
$$ d(t) \leq \rho(t) \leq 2 d(t)$$.
The mixing time is defined as
$$t_{mix}(\epsilon) = \inf \{t | d(t) \leq \epsilon \}$$.
and we usually take $t_{mix} = t_{mix}(1/e)$.
Convergence to the invariant measure follows from the Perron–Frobenius theorem for irreducible, aperiodic, finite-state Markov chains, and one sometimes also uses Lyapunov functions to treat the infinite-state case. However, these criteria are not sharp and give only a qualitative description.
Before finishing this section, we recall the cutoff phenomenon. Roughly speaking, it says that $d(t)$ drops drastically near $t_{mix}$. This phenomenon is somewhat hard to understand: so far no universal method is known to analyse it, and the models are studied case by case. Moreover, it occurs in some models but not in others, where it is absent even in simulations.
Coupling technique
To estimate the speed of convergence, a first technique is coupling; it may be the most useful and powerful method, applying in essentially every case. A coupling of measures $\mu, \nu$ is a pair $(X,Y)$ where $X$ has law $\mu$ and $Y$ has law $\nu$. We remark that the marginal laws never determine the joint distribution, so this condition does not fix the law of $(X,Y)$.
A third characterization of the total variation distance is
$$\| \mu - \nu \| = \inf_{\text{couplings}}\mathbb{P}(X \ne Y)$$.
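The infimum is attained by the so-called maximal coupling; here is a minimal sketch of its construction (the example distributions and names are ours): draw $X = Y$ from the normalized overlap $\min(\mu,\nu)$ with probability $1 - \|\mu - \nu\|$, and otherwise draw $X, Y$ independently from the normalized residuals.

```python
import random

# Sketch of the maximal coupling attaining P(X != Y) = ||mu - nu||,
# on a made-up 3-point example.
mu = {0: 0.5, 1: 0.3, 2: 0.2}
nu = {0: 0.2, 1: 0.3, 2: 0.5}
overlap = {x: min(mu[x], nu[x]) for x in mu}
tv = 1 - sum(overlap.values())

def sample_maximal_coupling():
    if random.random() < 1 - tv:
        # with probability 1 - TV, draw X = Y from the normalized overlap
        x = random.choices(list(overlap), weights=list(overlap.values()))[0]
        return x, x
    # otherwise draw X and Y independently from the normalized residuals
    x = random.choices(list(mu), weights=[mu[k] - overlap[k] for k in mu])[0]
    y = random.choices(list(nu), weights=[nu[k] - overlap[k] for k in nu])[0]
    return x, y

pairs = [sample_maximal_coupling() for _ in range(10000)]
print(sum(x != y for x, y in pairs) / len(pairs), tv)  # both close to 0.3
```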
This characterization gives an upper bound on $d(t)$, hence on $t_{mix}$, since
$$\begin{eqnarray*} d(t) \leq \rho(t) &=& \sup_{x,y \in S}\|P^t(x, \cdot) - P^t(y, \cdot) \| \\&\leq& \sup_{x,y \in S} \mathbb{P}_{x,y}(X_t \ne Y_t) \end{eqnarray*}$$
and a useful device is to demand that once the two Markov chains meet, they move together. Writing $\tau_{X,Y}$ for the meeting time, this yields
$$\boxed{d(t) \leq \sup_{x,y} \mathbb{P}_{x,y}(\tau_{X,Y} > t)}$$
One example is the lazy random walk on the cycle $\mathbb{Z} / n\mathbb{Z}$. The Markov chain stays in place with probability $1/2$ and moves forward or backward with probability $1/4$ each. We start two chains $(X,Y)$ and let a first coin decide which one moves and a second coin decide the direction. Once they meet, they move together. The meeting of the two chains then reduces to a gambler's ruin problem, so using Markov's inequality on the first moment,
$$d(t) \leq \sup_{x,y} \mathbb{P}_{x,y}(\tau_{X,Y} > t) \leq \frac{\sup_{x,y}\mathbb{E}_{x,y}(\tau_{X,Y})}{t} \leq \frac{n^2}{t}$$
We deduce that
$$t_{mix} \leq e n^2$$.
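A quick simulation of this coupling (a sketch; the function name is ours) confirms that the meeting time is of order $n^2$:

```python
import random

# Coupled lazy walks on Z/nZ: one coin picks which chain moves, another
# picks the direction, and the chains stick together after meeting.
def meeting_time(n, x0, y0):
    x, y, t = x0, y0, 0
    while x != y:
        step = random.choice([-1, 1])
        if random.random() < 0.5:
            x = (x + step) % n
        else:
            y = (y + step) % n
        t += 1
    return t

n = 20
samples = [meeting_time(n, 0, n // 2) for _ in range(2000)]
print(sum(samples) / len(samples))  # gambler's ruin: about n^2 / 4 = 100 here
```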
At the end of this part, we give a small lemma on the submultiplicativity of $\rho$:
$$\rho(s+t) \leq \rho(s) \times \rho(t)$$
The proof is easy by the coupling technique: run an optimal coupling for $s$ steps and, on the event that the two chains have not yet met, run a fresh optimal coupling from their current positions for $t$ more steps.
Strong stationary time technique
A strong stationary time is a very special device: a randomized stopping time $\tau$ such that $X_\tau$ has the invariant law and is independent of $\tau$. Thus
$$d(t) \leq \mathbb{P}(\tau > t)$$.
This idea sounds surprising, but models realizing it do exist. One famous example is the top-to-random shuffle. In this model, the cards below the original bottom card are uniformly random, since a priori there is no information about the order below it. Therefore, once the original bottom card reaches the top and is inserted back into the deck, the deck is completely random. The mixing time is then easy to compute and matches the coupon collector problem: $t_{mix} \sim n\log n$.
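A short simulation of this strong stationary time (a sketch with hypothetical names) shows its mean is indeed close to $n \log n$:

```python
import math
import random

# Strong stationary time for the top-to-random shuffle: one step after
# the original bottom card first reaches the top.
def top_to_random_sst(n):
    deck = list(range(n))
    bottom = deck[-1]                           # remember the original bottom card
    t = 0
    while deck[0] != bottom:
        card = deck.pop(0)                      # take the top card...
        deck.insert(random.randrange(n), card)  # ...insert it uniformly
        t += 1
    return t + 1                                # one last step reinserts the old bottom card

n = 52
samples = [top_to_random_sst(n) for _ in range(1000)]
print(sum(samples) / len(samples), n * math.log(n))  # both of order n log n
```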
However, this technique only applies in specific cases and requires a good observation.
Spectral technique: viewpoint from operators
The spectral method is especially powerful for reversible Markov chains; it relates the speed of convergence to the spectral gap. The gap $\gamma$, the convergence in total variation distance, and the expansion ratio are thus naturally related.
We know that an irreducible, aperiodic, reversible finite Markov chain has real eigenvalues $1 = \lambda_1 > \lambda_2 \geq \lambda_3 \geq \cdots \geq \lambda_n \geq -1$. Diagonalization naturally gives a family of orthogonal eigenvectors; however, we would like a basis orthogonal with respect to the inner product
$$\langle f, g\rangle_{\pi} = \sum_{x \in S} f(x)g(x)\pi(x)$$
instead of a basis orthogonal in the usual $L^2$ inner product. To achieve this, we find an orthonormal basis $\phi_i$ for the operator $A$, where
$$A(x,y) = \sqrt{\frac{\pi(x)}{\pi(y)}}P(x,y) \Leftrightarrow A = D^{1/2}PD^{-1/2}, \qquad D = \mathrm{diag}(\pi),$$
which is symmetric by reversibility and shares the same eigenvalues as $P$, that is,
$$A\phi_i = \lambda_i \phi_i$$.
Then, defining $f_i = D^{-1/2}\phi_i$, we have
$$\langle f_i, f_j\rangle_{\pi} = \langle \phi_i, \phi_j \rangle = \delta_{ij}$$
$$Pf_i = D^{-1/2}(D^{1/2}PD^{-1/2})\phi_i = D^{-1/2}A\phi_i = \lambda_i D^{-1/2}\phi_i = \lambda_i f_i$$
so the $f_i$ form the desired orthonormal basis. The orthogonal decomposition tells us
$$A^t = \sum_{i=1}^n \lambda_i^t \phi_i \phi_i^T $$
and plugging in the $f_i$, we obtain
$$\boxed{\frac{P^t(x,y)}{\pi(y)} = \sum_{i=1}^n f_i(x) f_i(y) \lambda_i^t}$$
One may wonder why we would want an expression like this. In fact, it is of interest since
$$\begin{eqnarray*}d(t) &=& \sup_{x \in S}\|P^t(x, \cdot) - \pi(\cdot)\| \\ &=& \sup_{x \in S}\frac{1}{2}\sum_{y \in S}\pi(y) \left|\frac{P^t(x,y)}{\pi(y) }- 1 \right| \\&=& \sup_{x \in S}\frac{1}{2}\left\|\sum_{i=2}^n f_i(x) \lambda_i^t f_i \right\|_{L^1(\pi)} \\ &\leq& \sup_{x \in S}\frac{1}{2} \left\|\sum_{i=2}^n f_i(x) \lambda_i^t f_i \right\|_{L^2(\pi)} \\ &=& \sup_{x \in S}\frac{1}{2}\sqrt{\sum_{i=2}^n f_i^2(x) \lambda_i^{2t}} \end{eqnarray*}$$
Continuing the analysis, we obtain the inequality
$$\boxed{(t_{rel} - 1)\log(\frac{1}{2\epsilon}) \leq t_{mix}(\epsilon) \leq \log(\frac{1}{2\epsilon \sqrt{\pi_{min}}})t_{rel}}$$
where the relaxation time is defined as
$$t_{rel} = \frac{1}{\gamma}$$
with $\gamma = 1 - \max_{i \geq 2} |\lambda_i|$ the (absolute) spectral gap.
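As a numerical illustration (a sketch; the chain and names are ours), take the lazy walk on $\mathbb{Z}/8\mathbb{Z}$, compute $t_{rel}$, and compare $d(t)$ with the bound $\frac{1}{2}(1-\gamma)^t/\sqrt{\pi_{min}}$ obtained above:

```python
import numpy as np

# Lazy walk on Z/8Z, whose invariant measure is uniform. We compute the
# relaxation time and compare d(t) with the L^2 bound derived above.
n = 8
P = np.zeros((n, n))
for x in range(n):
    P[x, x] = 0.5
    P[x, (x + 1) % n] = 0.25
    P[x, (x - 1) % n] = 0.25
pi = np.full(n, 1 / n)

# Here P is symmetric, so its eigenvalues are already real.
lam = np.sort(np.linalg.eigvalsh(P))[::-1]
gamma = 1 - max(abs(lam[1]), abs(lam[-1]))   # absolute spectral gap
print("t_rel =", 1 / gamma)

for t in [1, 5, 10, 50]:
    Pt = np.linalg.matrix_power(P, t)
    d_t = max(0.5 * np.abs(Pt[x] - pi).sum() for x in range(n))
    bound = 0.5 * (1 - gamma) ** t / np.sqrt(pi.min())
    print(t, d_t, bound)
```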
Spectral technique: viewpoint from geometry
The result obtained from the eigenvalues of the operator is powerful since it gives both upper and lower bounds. However, computing the spectral gap is not easy in general, so many other methods have been proposed to estimate $\gamma$. Here we list two: the optimal constant in the Poincaré inequality, and Cheeger's inequality. If we define the Dirichlet energy under the measure $\pi$ as
$$\begin{eqnarray*}E(f,f) &=& \sum_{e}Q(e) \left(\nabla f(e)\right)^2\\ Q(x,y) &=& \pi(x)P(x,y)\end{eqnarray*}$$,
then the classical discrete Poincaré inequality gives
$$\boxed{Var_{\pi}(f) \leq \frac{1}{\gamma} E(f,f)}$$.
The constant is optimal, so if we can prove such an inequality with some constant $C$, then necessarily $t_{rel} \leq C$, and we get an upper bound on the mixing time.
Another method comes from the famous Cheeger inequality. The expansion ratio of a set $A$ is
$$\Phi(A) = \frac{Q(A, A^c)}{\pi(A)}$$
and the expansion ratio of the graph is
$$\Phi_* = \min_{\pi(A) \leq 1/2} \Phi(A)$$.
The famous Cheeger inequality tells us that
$$\frac{\Phi_*^2}{2} \leq \gamma \leq 2\Phi_*$$
We remark that a good lower bound for the mixing time is
$$t_{mix} \geq \frac{1}{4\Phi_*}$$
which means that if the graph is poorly connected, the chain takes a long time to mix.
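On a small chain one can check Cheeger's inequality by brute force over all subsets (a sketch; feasible only for tiny state spaces):

```python
import itertools
import numpy as np

# Brute-force check of Cheeger's inequality on the lazy walk on Z/8Z
# with uniform pi.
n = 8
P = np.zeros((n, n))
for x in range(n):
    P[x, x] = 0.5
    P[x, (x + 1) % n] = 0.25
    P[x, (x - 1) % n] = 0.25
pi = np.full(n, 1 / n)
Q = pi[:, None] * P                          # Q(x, y) = pi(x) P(x, y)

def phi(A):
    comp = [y for y in range(n) if y not in A]
    return Q[np.ix_(list(A), comp)].sum() / pi[list(A)].sum()

phi_star = min(
    phi(A)
    for k in range(1, n // 2 + 1)            # pi(A) <= 1/2  <=>  |A| <= n/2
    for A in itertools.combinations(range(n), k)
)
gamma = 1 - np.sort(np.linalg.eigvalsh(P))[-2]   # gap 1 - lambda_2
print(phi_star**2 / 2, "<=", gamma, "<=", 2 * phi_star)
```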
Comparison method and continuous-time Markov chains
All the mixing time results can also be transferred to continuous time, where the jump times form a Poisson process but the state space is still finite. Following the same routine, we get
$$d(t) \leq \frac{1}{2}d_2(t) = \frac{1}{2}\sqrt{\sum_{j=2}^n e^{-2t\mu_j}}, \qquad \mu_j = 1-\lambda_j$$
The comparison theorem says that for two Markov chains $X, Y$ with the same invariant measure, if $E_X(f,f) \leq A\, E_Y(f,f)$ for all $f$, then $\mu_j^X \leq A \mu_j^Y$. This lets us analyse one Markov chain given information about another chain which is already well understood.
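A toy check of the comparison theorem (a sketch; the pair of chains is our own choice): if $P_2 = (I + P_1)/2$, then $E_1(f,f) = 2E_2(f,f)$, so with $A = 2$ the theorem predicts $\mu_j^{(1)} \leq 2\mu_j^{(2)}$, here with equality.

```python
import numpy as np

# With P2 = (I + P1)/2 we have E_1(f,f) = 2 E_2(f,f), so A = 2 and the
# comparison theorem predicts mu_j^1 <= 2 mu_j^2 (here, equality).
n = 8
P1 = np.zeros((n, n))
for x in range(n):
    P1[x, x] = 0.5
    P1[x, (x + 1) % n] = 0.25
    P1[x, (x - 1) % n] = 0.25
P2 = (np.eye(n) + P1) / 2

mu1 = np.sort(1 - np.linalg.eigvalsh(P1))
mu2 = np.sort(1 - np.linalg.eigvalsh(P2))
print(np.allclose(mu1, 2 * mu2))  # True
```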
Coupling from the past
Propp and Wilson proposed an interesting algorithm called coupling from the past, which produces an exact sample from the invariant measure in some specific cases, for example the Ising model. The method requires a partial order on the state space that is preserved by the dynamics. Unlike ordinary MCMC, we simulate from the past. The coupling viewpoint says that, in general, the invariant measure is reached once a coupling has coalesced. We then check whether
$$G(-t,0)(-1) = G(-t,0)(1)$$
where $-1,1$ represent respectively the minimal and maximal elements. If they agree, we output this common value; if not, we prepend a further segment of simulation before time $-t$, reusing the previously drawn randomness, until the trajectories coalesce. By monotonicity, everything behaves as if we had simulated from $-\infty$ with all trajectories coalescing.
Finally, we remark that although this program terminates almost surely, its running time is a random variable, so there is a risk that it takes a long time before ending.
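Here is a minimal sketch of monotone coupling from the past on a toy chain (a lazy reflecting walk on $\{0,\dots,N\}$; the Ising case works the same way with spin configurations ordered coordinatewise, but is longer to write):

```python
import random

# Monotone CFTP (Propp-Wilson) on a lazy reflecting walk on {0, ..., N}.
N = 10

def update(x, u):
    # Deterministic update map: the same uniform u drives every copy,
    # and x -> update(x, u) is monotone in x.
    if u < 0.25:
        return max(x - 1, 0)
    elif u < 0.5:
        return min(x + 1, N)
    return x  # lazy step

def cftp():
    t = 1
    randomness = []  # u's for times -1, -2, ...; reused between rounds
    while True:
        while len(randomness) < t:
            randomness.append(random.random())
        # Run the minimal (0) and maximal (N) states from time -t to 0,
        # applying the SAME randomness in chronological order.
        lo, hi = 0, N
        for s in range(t - 1, -1, -1):  # times -t, ..., -1
            u = randomness[s]
            lo, hi = update(lo, u), update(hi, u)
        if lo == hi:
            return lo    # coalesced: exact sample from the invariant law
        t *= 2           # go further into the past and retry

print(cftp())
```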
Cover time
The cover time is another concept similar to the mixing time: it is the time needed to visit every state at least once. The Matthews inequality gives the control $$t_{cov} \leq \left(1 + \frac{1}{2} + \cdots + \frac{1}{n}\right) t_{hit}$$
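For the cycle these quantities are explicit, so we can compare a simulation with both the exact value $\mathbb{E}(t_{cov}) = n(n-1)/2$ and the Matthews bound, using $t_{hit} = \max_k k(n-k) = n^2/4$ (a sketch; the example is ours):

```python
import random

# Empirical cover time of simple random walk on the cycle Z/nZ.
def cover_time(n):
    x, seen, t = 0, {0}, 0
    while len(seen) < n:
        x = (x + random.choice([-1, 1])) % n
        seen.add(x)
        t += 1
    return t

n = 30
samples = [cover_time(n) for _ in range(500)]
H_n = sum(1 / k for k in range(1, n + 1))
print("simulated:", sum(samples) / len(samples))
print("exact n(n-1)/2:", n * (n - 1) / 2)
print("Matthews bound:", H_n * n * n / 4)
```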
One surprising result is the relation between the cover time and the discrete Gaussian free field (DGFF) $(h_x)$ on the graph:
$$t_{cov} \sim |E| \left(\mathbb{E}\left(\max_{x \in S} h_x\right)\right)^2$$.
Others
We conclude that the mixing time has to be estimated carefully. The coupling method has a lot of flexibility but requires good observation and technique, while the spectral method looks more like the analysis of the heat equation. We can also reformulate the subject in the language of electrical networks, so that the analysis of the structure helps estimate the mixing time.