统计推断(七) Typical Sequence

典型集

1. 一些定理

Markov inequality: $r.v. \ \ \mathsf{x}\ge0$ \[ \mathbb{P}(x\ge\mu)\le \frac{\mathbb{E}[x]}{\mu} \] Proof: omit...

Weak law of large numbers(WLLN): $\vec{y}=[y_1,y_2,...,y_N]^T, \ \ \ \ y_i \sim p \ \ \ i.i.d$ \[ \lim_{N\to\infty}\mathbb{P}(|L_p(\vec{y})+H(p)|>\varepsilon)=0, \ \ \forall \varepsilon>0 \] Proof: omit...

2. Typical set

Definition: $\mathcal{T}_\varepsilon(p;N)=\{\vec{y}\in\mathcal{Y}^N:|L_p(\vec{y})+H(p)|<\varepsilon\}$
Properties
- WLLN $\Longrightarrow P\left(\vec{y}\in\mathcal{T}_\varepsilon(p;N)\right)\simeq1$, $N$ large
- $L_p(\vec{y})\simeq H(p) \Longrightarrow p_y(\vec{y})\simeq 2^{-NH(p)}$
- $\Longrightarrow |\mathcal{T}_\varepsilon(p;N)|\simeq 2^{NH(p)}$
- 当 p 不是均匀分布的时候，$\frac{|\mathcal{T}_\varepsilon(p;N)|}{|\mathcal{Y}^N|}\to0$，也就是说典型集中元素(序列)个数在所有可能的元素(序列)中所占比例趋于 0，但是典型集中元素概率的和却趋近于 1
Theorem

Asymptotic Equipartition Property(AEP)

\[ \lim_{N\to\infty}P(\mathcal{T}_\varepsilon(p;N))=1 \\ \] \[ 2^{-N(H(p)+\epsilon)} \leq p_{\mathrm{y}}(\boldsymbol{y}) \leq 2^{-N(H(p)-\epsilon)}, \forall \boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N) \]

for a sufficient large $N$ \[ (1-\epsilon) 2^{N(H(p)-\epsilon)} \leq\left|\mathcal{T}_{\epsilon}(p ; N)\right| \leq 2^{N(H(p)+\epsilon)} \] Proof: \[ \begin{aligned}\left|\mathcal{T}_{\epsilon}(p ; N)\right| &=\sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} 1 \\ &=2^{N(H(p)+\epsilon)} \sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} 2^{-N(H(p)+\epsilon)} \\ & \leq 2^{N(H(p)+\epsilon)} \sum_{\boldsymbol{y} \in \mathcal{T}_{\epsilon}(p ; N)} p_{\mathbf{y}}(\boldsymbol{y}) \\ &=2^{N(H(p)+\epsilon)} P\left\{\mathcal{T}_{\epsilon}(p ; N)\right\} \\ & \leq 2^{N(H(p)+\epsilon)} \end{aligned} \]

3. Divergence $\varepsilon$-typical set

WLLN: $\vec{y}=[y_1,y_2,...,y_N]^T, \ \ \ \ y_i \sim p \ \ \ i.i.d$ $$ L_{p | q}()= = _{n=1}^{N} \

{N } (|L{p | q}()-D(p | q)|>)=0 $$ Remarks: 前面只考虑的均值，这里还考虑了另一个分布
Definition: $\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^T, \ \ \ \ y_i \sim p \ \ \ i.i.d$ \[ \mathcal{T}_{\epsilon}(p | q ; N)=\left\{\boldsymbol{y} \in \mathcal{Y}^{N}:\left|L_{p | q}(\boldsymbol{y})-D(p \| q)\right| \leq \epsilon\right\} \]
Properties
- WLLN $\Longrightarrow q_{\mathbf{y}}(\boldsymbol{y}) \approx p_{\mathbf{y}}(\boldsymbol{y}) 2^{-N D(p \| q)}$
- $Q\left\{\mathcal{T}_{\epsilon}(p | q ; N)\right\} \approx 2^{-N D(p \| q)} \to0$
- Remarks: p 的典型集可能是 q 的非典型集，在 $N$ 很大的时候，不同分布的 typical set 是正交的
Theorem \[ (1-\epsilon) 2^{-N(D(p \| q)+\epsilon)} \leq Q\left\{\mathcal{T}_{\epsilon}(p \| q ; N)\right\} \leq 2^{-N(D(p \| q)-\epsilon)} \]

4. Large deviation of sample averages

Theorem (Cram´er’s Theorem): $\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^T, \ \ \ y_i \sim q \ \ \ i.i.d$ with mean $\mu<\infty$ and $\gamma>\mu$ \[ \lim _{N \rightarrow \infty}-\frac{1}{N} \log \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right)=E_{C}(\gamma) \] where $E_C(\gamma)$ is referred as Chernoﬀ exponent \[ E_{C}(\gamma) \triangleq D(p(\cdot ; x) \| q),\ \ \ p(\cdot ; x)=q(y) e^{x y-\alpha(x)} \] and with $x>0$ chosen such that \[ \mathbb{E}_{p(\cdot;x)}[y]=\gamma \] Proof:

$\begin{aligned} \mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) &=\mathbb{P}\left(e^{x \sum_{n=1}^{N} y_{n}} \geq e^{N x \gamma}\right) \\ & \leq e^{-N x \gamma} \mathbb{E}\left[e^{x \sum_{n=1}^{N} y_{n}}\right] \\ &=e^{-N x \gamma}\left(\mathbb{E}\left[e^{x y}\right]\right)^{N} \\ & \leq e^{-N\left(x_{*} \gamma-\alpha\left(x_{*}\right)\right)} \end{aligned}$

$\varphi(x)=x\gamma-\alpha(x)$ 是凸的，最大值取在 $\mathbb{E}_{p\left(\cdot ; x_{*}\right)}[y]=\dot{\alpha}\left(x_{*}\right)=\gamma$

可以证明 $x_{*} \gamma-\alpha\left(x_{*}\right)=x_{*} \dot{\alpha}\left(x_{*}\right)-\alpha\left(x_{*}\right)=D\left(p\left(\cdot ; x_{*}\right) \| q\right)$

于是有 $\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \leq e^{-N E_{C}(\gamma)}$

下界的证明，暂时略...

用到的两个事实：$p(y;x)=q(y)\exp(xy-\alpha(x))$

$D(p(y;x)||q(y))$ 随着 x 单调增加

$\mathbb{E}_{p(;x)}[y]$ 随着 x 单调增加

Remarks:

这个定理也相当于表达了 $\mathbb{P}\left(\frac{1}{N} \sum_{n=1}^{N} y_{n} \geq \gamma\right) \cong 2^{-N E_{\mathrm{C}}(\gamma)}$

相当于是分布 q 向由 $\mathbb{E}[y]=\sum_{n=1}^{N} y_{n} \geq \gamma$ 所定义的一个凸集中投影，恰好投影到边界(线性分布族) $\mathbb{E}[y]=\gamma$ 上，而 $q$ 向线性分布族的投影恰好就是 (10) 中的指数族表达式

5. Types and type classes

Definition: $\vec{\boldsymbol{y}}=[y_1,y_2,...,y_N]^T$ (不关心真实服从的是哪个分布)
- type(实质上就是一个经验分布)定义为
\[ \hat{p}(b ; \mathbf{y})=\frac{1}{N} \sum_{n=1}^{N} \mathbb{1}_{b}\left(y_{n}\right)=\frac{N_{b}(\mathbf{y})}{N} \]
- $\mathcal{P}_{N}^{y}$ 表示长度为 $N$ 的序列所有可能的 types
- type class: $\mathcal{T}_{N}^{y}(p)=\left\{\mathbf{y} \in y^{N}: \hat{p}(\cdot ; \mathbf{y}) \equiv p(\cdot)\right\},\ \ \ p\in\mathcal{P}_{N}^{y}$
Exponential Rate Notation: $f(N) \doteq 2^{N \alpha}$ \[ \lim _{N \rightarrow \infty} \frac{\log f(N)}{N}=\alpha \] Remarks: $\alpha$ 表示了指数上面关于 $N$ 的阶数(log、线性、二次 ...)
Properties
- $\left|\mathcal{P}_{N}^{y}\right| \leq(N+1)^{|y|}$
- $q^{N}(\mathbf{y})=2^{-N(D(\hat{p}(\cdot \mathbf{y}) \| q)+H(\hat{p}(\cdot ; \mathbf{y})))}$ $p^{N}(\mathbf{y})=2^{-N H(p)} \quad \text { for } \mathbf{y} \in \mathcal{T}_{N}^{y}(p)$
- $c N^{-|y|} 2^{N H(p)} \leq\left|\mathcal{T}_{N}^{y}(p)\right| \leq 2^{N H(p)}$
Theorem \[ c N^{-|y|} 2^{-N D(p \| q)} \leq Q\left\{\mathcal{T}_{N}^{y}(p)\right\} \leq 2^{-N D(p \| q)} \\ Q\left\{\mathcal{T}_{N}^{y}(p)\right\} \doteq 2^{-N D(p \| q)} \]

6. Large Deviation Analysis via Types

Definition: $\mathcal{R}=\left\{\mathbf{y} \in y^{N}: \hat{p}(\cdot ; \mathbf{y}) \in \mathcal{S} \cap \mathcal{P}_{N}^{y}\right\}$

Sanov’s Theorem: \[ Q\left\{\mathrm{S} \cap \mathcal{P}_{N}^{y}\right\} \leq(N+1)^{|y|} 2^{-N D\left(p_{*} \| q\right)} \\ Q\left\{\mathrm{S} \cap \mathcal{P}_{N}^{y}\right\} \dot\leq 2^{-N D\left(p_{*} \| q\right)} \\ p_{*}=\underset{p \in \mathcal{S}}{\arg \min } D(p \| q) \]

7. Asymptotics of hypothesis testing

LRT: $L(\boldsymbol{y})=\frac{1}{N} \log \frac{p_{1}^{N}(\boldsymbol{y})}{p_{0}^{N}(\boldsymbol{y})}=\frac{1}{N} \sum_{n=1}^{N} \log \frac{p_{1}\left(y_{n}\right)}{p_{0}\left(y_{n}\right)} \frac{>}{<} \gamma$
$P_{F}=\mathbb{P}_{0}\left\{\frac{1}{N} \sum_{n=1}^{N} t_{n} \geq \gamma\right\} \approx 2^{-N D\left(p^{*} \| p_{0}^{\prime}\right)}$
$P_{M}=1-P_{D} \approx 2^{-N D\left(p^{*} \| p_{1}^{\prime}\right)}$
$D\left(p^{*} \| p_{0}^{\prime}\right)-D\left(p^{*} \| p_{1}^{\prime}\right)=\int p^{*}(t) \log \frac{p_{1}^{\prime}(t)}{p_{0}^{\prime}(t)} \mathrm{d} t=\int p^{*}(t) t \mathrm{d} t=\mathbb{E}_{p^{*}}[\mathrm{t}]=\gamma$

8.Asymptotics of parameter estimation

Strong Law of Large Numbers(SLLN): \[ \mathbb{P}\left(\lim _{N \rightarrow \infty} \frac{1}{N} \sum_{n=1}^{N} w_{n}=\mu\right)=1 \] Central Limit Theorem(CLT): \[ \lim _{N \rightarrow \infty} \mathbb{P}\left(\frac{1}{\sqrt{N}} \sum_{n=1}^{N}\left(\frac{w_{n}-\mu}{\sigma}\right) \leq b\right)=\Phi(b) \] 以下三个强度依次递减

依概率 1 收敛(SLLN)：$\mathsf{x}_{N} \stackrel{w . p .1}{\longrightarrow} a$

概率趋于 0(WLLN):

依分布收敛: $\mathsf{x}_{N} \stackrel{d}{\longrightarrow} p$

Asymptotics of ML Estimation

Theorem: \[ \hat{x}_{N}=\arg \max _{x} L_{N}(x ; \mathbf{y}) \] 在满足某些条件下(mild conditions)，有 \[ \begin{array}{c}{\hat{x}_{N} \stackrel{w \cdot p \cdot 1}{\longrightarrow} x_{0}} \\ {\sqrt{N}\left(\hat{x}_{N}-x_{0}\right) \stackrel{d}{\longrightarrow} \mathcal{N}\left(0, J_{y}\left(x_{0}\right)^{-1}\right)}\end{array} \]

统计推断

#典型集

统计推断(七) Typical Sequence

https://glooow1024.github.io/2020/02/03/statistic/SI_Ch7_TypicalSequence/

作者

Glooow

发布于

2020年2月3日

许可协议

统计推断(八) Model Selection 上一篇

统计推断(六) Modeling 下一篇

统计推断(七) Typical Sequence

1. 一些定理

2. Typical set

3. Divergence \(\varepsilon\)-typical set

4. Large deviation of sample averages

5. Types and type classes

6. Large Deviation Analysis via Types

7. Asymptotics of hypothesis testing

8.Asymptotics of parameter estimation