Parametric model
A statistical model is parametric if the probability distribution of $X$ belongs to a family of distributions indexed by a parameter $\theta$ of finite dimension.
Our main goal is to use the observations $X_1, \dots, X_n$ to learn the value of $\theta$. Note that this is possible only if each probability distribution $P_\theta \in \mathcal{P}$ corresponds to a unique parameter $\theta$; the model is then said to be identifiable.
In most practical cases, the probability distribution $P_\theta$ has a density.
The model $\mathcal{P} = \{ P_\theta \sim \mathcal{N}(\mu, \sigma^2), \theta = (\mu, \sigma^2) \in \mathbb{R} \times \mathbb{R}_+ \}$ of Gaussian distributions is dominated, with density function $p_\theta(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$.
The model $\mathcal{P} = \{ P_\theta \sim \mathcal{B}(\theta), \theta \in [0, 1] \}$ of Bernoulli distributions is dominated, with density function $p_\theta(x) = \theta^x (1 - \theta)^{1 - x}$ for $x \in \{0, 1\}$.
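As an illustration (not part of the original notes), here is a minimal Python sketch evaluating both density functions with `scipy.stats`; the parameter values `mu`, `sigma2` and `theta` are arbitrary choices.

```python
# A minimal sketch (illustration only) evaluating the two dominated models above
# with scipy.stats; the parameter values mu, sigma2, theta are arbitrary choices.
import numpy as np
from scipy import stats

mu, sigma2 = 1.0, 4.0          # Gaussian parameters theta = (mu, sigma^2)
theta = 0.3                    # Bernoulli parameter

x = 0.5
# Gaussian density p_theta(x) = (2*pi*sigma^2)^(-1/2) * exp(-(x - mu)^2 / (2*sigma^2))
p_gauss = stats.norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))
print(p_gauss, np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2))

# Bernoulli density p_theta(x) = theta^x * (1 - theta)^(1 - x) on {0, 1}
for x in (0, 1):
    print(stats.bernoulli.pmf(x, theta), theta ** x * (1 - theta) ** (1 - x))
```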
Decision function
Depending on our goal, we may consider different sets of actions $\mathcal{A}$:
- If we want to find the value of $\theta$ based on observations, then $\mathcal{A} = \Theta$;
- If we want to answer a binary question, i.e. do a hypothesis test, then $\mathcal{A} = \{0, 1\}$;
- If we want to find a confidence region for $\theta$, then $\mathcal{A} = \mathcal{P}(\Theta)$.
Loss function
A loss function is used to measure the quality of an action $a \in \mathcal{A}$. This quality depends on $\theta$.
A common loss function is $L(\theta, a) = (\theta - a)^2$.
Risk
In order to evaluate the quality of a decision function $\delta$, we use the risk: we average the loss over all possible observations, $R(\theta, \delta) = \mathbb{E}(L(\theta, \delta(X)))$.
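To make this concrete, here is a minimal Monte Carlo sketch (an illustration under assumed choices: Gaussian observations, the quadratic loss above, and the sample mean as decision function) estimating the risk $R(\theta, \delta) = \mathbb{E}(L(\theta, \delta(X)))$.

```python
# A minimal Monte Carlo sketch (assumptions: Gaussian observations, quadratic loss,
# and the sample mean as decision function delta) estimating the risk
# R(theta, delta) = E[ L(theta, delta(X)) ].
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 2.0, 1.0, 10          # true parameter, noise level, sample size
n_rep = 100_000                         # number of simulated data sets

X = rng.normal(theta, sigma, size=(n_rep, n))   # X = (X_1, ..., X_n), i.i.d. N(theta, sigma^2)
delta = X.mean(axis=1)                          # decision: estimate theta by the sample mean
loss = (theta - delta) ** 2                     # quadratic loss L(theta, a) = (theta - a)^2

print(loss.mean())        # Monte Carlo estimate of the risk
print(sigma ** 2 / n)     # theoretical value E[(theta - X_bar)^2] = sigma^2 / n
```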
For discrete distributions, the expected value is the weighted average of the possible outcomes. Let $x_1, \ldots, x_n$ be the possible outcomes and $p_1, \ldots, p_n$ the corresponding probabilities.
$$ \mathbb{E}(X) = \sum_{i = 1}^n x_i p_i $$

For continuous probability densities, the expected value can be computed using an integral.
$$ \mathbb{E}(X) = \int_{-\infty}^{\infty} x p_\theta(x) \; \mathrm{d} x $$
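The two formulas can be checked numerically; the sketch below (an illustration with arbitrary parameter values) computes $\mathbb{E}(X)$ as a weighted sum for a Bernoulli variable and as a numerical integral for a Gaussian variable.

```python
# A small sketch (illustration only) computing E(X) with the two formulas above:
# a weighted sum for a Bernoulli(theta) variable, and a numerical integral for a
# Gaussian N(mu, sigma^2) variable.
import numpy as np
from scipy import stats
from scipy.integrate import quad

theta = 0.3
outcomes, probs = np.array([0, 1]), np.array([1 - theta, theta])
print(np.sum(outcomes * probs))          # E(X) = sum_i x_i p_i = theta

mu, sigma = 1.0, 2.0
integral, _ = quad(lambda x: x * stats.norm.pdf(x, mu, sigma), -np.inf, np.inf)
print(integral)                          # E(X) = integral of x p(x) dx = mu
```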
Notions of statistic

Let $P_\theta$ be a probability distribution that depends on some unknown parameter $\theta \in \Theta$ with $\Theta \subset \mathbb{R}^d$ for some finite $d$. We assume that $P_\theta$ is dominated, which means that it has a density function $p_\theta$. We denote by $X = (X_1, \ldots, X_n)$ a sequence of $n$ i.i.d. samples from $P_\theta$, and by $x = (x_1, \ldots, x_n)$ the values taken by $X$.
Some statistics are better than others. There may be statistics that do not provide any information at all (free statistics) or statistics that provide all the information available (sufficient statistics).
In other words, a statistic $T(X)$ is sufficient when the information it gives about $\theta$ is all the information that $X$ provides.
Let $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ be i.i.d. Suppose we know $\sigma^2$ but not $\mu$. The sample mean, $T_1(X) = \frac{1}{n} \sum_{i = 1}^n X_i$, is a sufficient statistic because, given $T_1(X)$, no further information about $\mu$ can be obtained from $X$.
On the other hand, the sample median, $T_2(X)$, is not a sufficient statistic, since knowing $X$ itself would provide more information about $\mu$.
Unfortunately, it is not always obvious whether a statistic is sufficient. The following criterion (the Fisher-Neyman factorization theorem) helps us identify sufficient statistics: a statistic $T$ is sufficient for $\theta$ if and only if the density can be written as $p_\theta(x) = g_\theta(T(x)) h(x)$ for some functions $g_\theta$ and $h$.
In the Bernoulli model $\mathcal{P} = \{ P_\theta \sim \mathcal{B}(\theta), \theta \in [0, 1] \}$ the statistic $T(x) = \sum_{i = 1}^n x_i$ is sufficient.
For a single experiment, $p_\theta(x) = \theta^x (1 - \theta)^{1 - x}$. Since $X_1, \ldots, X_n$ are independent, their joint density function is
$$ \begin{align*} p_\theta(x) &= p_\theta(x_1) p_\theta(x_2) \ldots p_\theta(x_n) \\ &= \prod_{i = 1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} \\ &= \left( \prod_{i = 1}^n \theta^{x_i} \right) \left( \prod_{i = 1}^n (1 - \theta)^{1 - x_i} \right) \\ &= \theta^{\sum_{i = 1}^n x_i} (1 - \theta)^{n - \sum_{i = 1}^n x_i} \end{align*} $$

So, taking $g_\theta(t) = \theta^t (1 - \theta)^{n - t}$ and $h(x) = 1$, we get $p_\theta(x) = g_\theta(T(x)) h(x)$. By the factorization theorem, $T(X)$ is sufficient.
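A simulation can illustrate (though not prove) this sufficiency: conditionally on $T(X) = t$, the distribution of $X$ should not depend on $\theta$. The sketch below uses arbitrary values of $n$, $t$ and $\theta$.

```python
# A simulation sketch (illustration, not a proof) of the sufficiency of T(X) = sum X_i
# in the Bernoulli model: conditionally on T(X) = t, the distribution of X does not
# depend on theta, so the empirical conditional frequencies should match across thetas.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
n, t, n_rep = 3, 1, 200_000

def conditional_frequencies(theta):
    X = rng.binomial(1, theta, size=(n_rep, n))
    keep = X[X.sum(axis=1) == t]                       # condition on T(X) = t
    counts = Counter(map(tuple, keep))
    total = sum(counts.values())
    return {x: c / total for x, c in sorted(counts.items())}

# For both values of theta, each of the 3 sequences with one success has
# conditional probability close to 1/3.
print(conditional_frequencies(0.2))
print(conditional_frequencies(0.7))
```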
Entropy
We can use entropy to measure the amount of information in $X$.
The entropy of the random variable $X$ is
$$ H(X) = - \int_{-\infty}^{\infty} p(x) \log{p(x)} \; \mathrm{d}x = - \mathbb{E}(\log{p(X)}). $$

The entropy of a Bernoulli random variable $X \sim \mathcal{B}(\theta)$ is:
$$ \begin{align*} H(X) &= - \sum_{x = 0}^1 p_\theta(x) \log{p_\theta(x)} \\ &= - \theta^0 (1 - \theta)^{1 - 0} \log{(\theta^0 (1 - \theta)^{1 - 0})} - \theta^1 (1 - \theta)^{1 - 1} \log{(\theta^1 (1 - \theta)^{1 - 1})} \\ &= - (1 - \theta) \log{(1 - \theta)} - \theta \log{\theta} \end{align*} $$

In general, for two random variables $X$, $Y$ with joint probability density function $p(x, y)$, the conditional entropy of $Y$ given $X$, that is the entropy of the conditional distribution $p(y \vert x)$ averaged over the distribution of $X$, is:
$$ \begin{align*} H(Y \vert X) &= \int p(x) H(Y \vert X = x) \; \mathrm{d} x \\ &= - \int p(x) \left( \int p(y \vert x) \log{p(y \vert x)} \; \mathrm{d} y \right) \; \mathrm{d} x \end{align*} $$

Note, here, that $p(x) p(y \vert x)$ is equal to $p(x, y)$. Then,
$$ \begin{align*} H(Y \vert X) &= - \int p(x, y) \log{p(y \vert x)} \; \mathrm{d} x \, \mathrm{d} y \\ &= - \mathbb{E}(\log{p(Y \vert X)}) \end{align*} $$

For any random variables $X_1, \ldots, X_n$ (not necessarily independent),
$$ H(X_1, \ldots, X_n) = H(X_1) + H(X_2 \vert X_1) + \ldots + H(X_n \vert X_1, \ldots, X_{n - 1}) $$

If $X$ and $Y$ are independent, then $H(Y \vert X) = H(Y)$. This means that
$$ H(X, Y) = H(X) + H(Y) $$

An important consequence of the chain rule is that a statistic cannot contain more information than the original data.
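As a quick numerical illustration (not part of the original notes), the sketch below checks both facts with two independent Bernoulli($\frac{1}{2}$) variables and the statistic $T(X_1, X_2) = X_1 + X_2$, using logarithms in base 2.

```python
# A quick numerical check (illustration, using log base 2) of H(X, Y) = H(X) + H(Y) for
# independent variables and of H(T(X)) <= H(X) for the statistic T(X_1, X_2) = X_1 + X_2,
# with X_1, X_2 two independent Bernoulli(1/2) variables.
import numpy as np
from scipy.stats import entropy

p_joint = np.full(4, 1 / 4)                  # P(X_1 = a, X_2 = b) for (a, b) in {0, 1}^2
p_single = np.array([1 / 2, 1 / 2])          # P(X_i = a)
p_sum = np.array([1 / 4, 1 / 2, 1 / 4])      # P(X_1 + X_2 = 0, 1, 2)

H_joint = entropy(p_joint, base=2)           # 2 bits
H_sum = entropy(p_sum, base=2)               # 1.5 bits
print(H_joint, 2 * entropy(p_single, base=2))   # independence: H(X_1, X_2) = 1 + 1 = 2
print(H_sum, H_sum <= H_joint)                  # data processing: 1.5 <= 2
```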
We now introduce a notion of information gap between two probability distributions $P$, $Q$ related to entropy.
The Kullback-Leibler divergence between $P$ and $Q$ is defined by:
$$ D(P \Vert Q) = \int p(x) \log{\left( \frac{p(x)}{q(x)} \right)} \; \mathrm{d} x = \mathbb{E}\left( \log{\frac{p(X)}{q(X)}} \right) $$

with $X \sim P$. In the expectation formula above, the random variable $\log{\frac{p(X)}{q(X)}}$ plays the role of $X$.
The Kullback-Leibler divergence between two Bernoulli distributions, $P \sim \mathcal{B}(p)$ and $Q \sim \mathcal{B}(q)$, is:
$$ \begin{align*} D(P \Vert Q) &= \mathbb{E}\left( \log{\frac{p^X (1 - p)^{1 - X}}{q^X (1 - q)^{1 - X}}} \right) \\ &= \log{\frac{p^0 (1 - p)^{1 - 0}}{q^0 (1 - q)^{1 - 0}}} p^0 (1 - p)^{1 - 0} + \log{\frac{p^1 (1 - p)^{1 - 1}}{q^1 (1 - q)^{1 - 1}}} p^1 (1 - p)^{1 - 1} \\ &= (1 - p) \log{\frac{1 - p}{1 - q}} + p \log{\frac{p}{q}} \end{align*} $$
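The closed form above can be checked numerically; the sketch below (an illustration with arbitrary values of $p$ and $q$) compares it with `scipy.special.rel_entr`, which computes $p \log(p / q)$ elementwise.

```python
# A small numerical check (illustration only) of the closed form above for the
# Kullback-Leibler divergence between two Bernoulli distributions, using
# scipy.special.rel_entr, which computes p * log(p / q) elementwise.
import numpy as np
from scipy.special import rel_entr

def kl_bernoulli(p, q):
    return (1 - p) * np.log((1 - p) / (1 - q)) + p * np.log(p / q)

p, q = 0.3, 0.6
P = np.array([1 - p, p])
Q = np.array([1 - q, q])
print(kl_bernoulli(p, q), rel_entr(P, Q).sum())   # the two values agree
```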
Fisher information

The Fisher information quantifies the information on $\theta$ contained in $X$ when $X \sim P_\theta$. Its standard notation is $I(\theta)$ or $I_X(\theta)$.
There are some conditions for the existence of $I(\theta)$:
- $\frac{\partial \log{p_\theta}}{\partial \theta}$ exists,
- The support of $p_\theta$ is independent of $\theta$,
- Integration with respect to $x$ and differentiation with respect to $\theta$ can be interchanged, i.e. $\int \frac{\partial p_\theta(x)}{\partial \theta} \; \mathrm{d} x = \frac{\partial}{\partial \theta} \int p_\theta(x) \; \mathrm{d} x$.
Scalar case
We first consider the scalar case $\Theta \subset \mathbb{R}$. The score is defined as $S(X) = \frac{\partial \log{p_\theta(X)}}{\partial \theta}$, and the Fisher information as its variance, $I(\theta) = \mathrm{Var}(S(X))$.
Since $\mathbb{E}(S(X)) = 0$, we can simplify $I(\theta)$ to
$$ I(\theta) = \mathbb{E}(S^2(X)) $$

We can also say that the Fisher information is minus the expected curvature of the log-likelihood:
$$ I(\theta) = - \mathbb{E}\left( \frac{\partial^2 \log{p_\theta(X)}}{\partial \theta^2} \right). $$

For a Bernoulli random variable $X \sim \mathcal{B}(\theta)$ with $\theta \in (0, 1)$, we have:
$$ \begin{align*} S(X) &= \frac{\partial}{\partial \theta} \log{(\theta^X (1 - \theta)^{1 - X})} \\ &= \frac{\partial}{\partial \theta} \left[ X \log{\theta} + (1 - X) \log{(1 - \theta)} \right] \\ &= \frac{X}{\theta} - \frac{1 - X}{1 - \theta} = \frac{X - \theta}{\theta(1 - \theta)} \end{align*} $$

Then,
$$ \begin{align*} I(\theta) &= \mathbb{E}(S^2(X)) \\ &= (1 - \theta) S^2(0) + \theta S^2(1) \\ &= (1 - \theta) \left(- \frac{1}{1 - \theta} \right)^2 + \theta \left( \frac{1}{\theta} \right)^2 \\ &= \frac{1}{1 - \theta} + \frac{1}{\theta} = \frac{1}{\theta (1 - \theta)} \end{align*} $$
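Both expressions of the Fisher information, $\mathbb{E}(S^2(X))$ and $- \mathbb{E}\left( \frac{\partial^2 \log{p_\theta(X)}}{\partial \theta^2} \right)$, can be checked by simulation; the sketch below is an illustration with an arbitrary value of $\theta$.

```python
# A Monte Carlo sketch (illustration only) checking the two expressions of the Fisher
# information for X ~ B(theta): E[S^2(X)] and -E[d^2 log p_theta(X) / d theta^2],
# both equal to 1 / (theta (1 - theta)).
import numpy as np

rng = np.random.default_rng(0)
theta, n_rep = 0.3, 1_000_000
X = rng.binomial(1, theta, size=n_rep)

score = (X - theta) / (theta * (1 - theta))                 # S(X) = (X - theta) / (theta (1 - theta))
curvature = -X / theta ** 2 - (1 - X) / (1 - theta) ** 2    # second derivative of X log(theta) + (1 - X) log(1 - theta)

print(np.mean(score ** 2))        # ~ 1 / (theta (1 - theta))
print(-np.mean(curvature))        # same value
print(1 / (theta * (1 - theta)))  # exact value
```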
Now, let $X$, $Y$ be two random variables with joint probability density function $p_\theta(x, y)$. We define the Fisher information of $Y$ given $X$ about $\theta$ by:

$$ \begin{align*} I_{Y \vert X}(\theta) &= \int p_\theta(x) \mathrm{Var}\left( \frac{\partial \log{p_\theta(Y \vert X = x)}}{\partial \theta} \right) \; \mathrm{d} x \\ &= \mathbb{E}\left[ \left( \frac{\partial \log{p_\theta(Y \vert X)}}{\partial \theta} \right)^2 \right] \end{align*} $$

Observe that $I_{Y \vert X}(\theta) = I_Y(\theta)$ if $X$ and $Y$ are independent.
For any random variables $X_1, \ldots, X_n$ (not necessarily independent),
$$ I_{X_1, \ldots, X_n}(\theta) = I_{X_1}(\theta) + I_{X_2 \vert X_1}(\theta) + \ldots + I_{X_n \vert X_1, \ldots, X_{n - 1}}(\theta). $$

For any statistic $T$,
$$ \begin{align*} H(T(X)) &\leq H(X) \\ I_{T(X)}(\theta) &\leq I_X(\theta) \quad \forall \theta \in \Theta \end{align*} $$

Observe that $0 \leq I_{T(X)}(\theta) \leq I_X(\theta)$ with:
- $I_{T(X)}(\theta) = 0$ if $T$ is free,
- $I_{T(X)}(\theta) = I_X(\theta)$ if $T$ is sufficient (by the chain rule).
The link between both notions of information is provided by the following result:
$$ \forall h, \quad D(P_\theta \Vert P_{\theta + h}) = \frac{1}{2} I(\theta) h^2 + o(h^2). $$
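The sketch below illustrates this expansion numerically in the Bernoulli model, where $I(\theta) = \frac{1}{\theta (1 - \theta)}$; the value of $\theta$ is arbitrary.

```python
# A numerical sketch (illustration only) of the expansion D(P_theta || P_{theta+h})
# = (1/2) I(theta) h^2 + o(h^2) in the Bernoulli model, where I(theta) = 1 / (theta (1 - theta)).
import numpy as np

def kl_bernoulli(p, q):
    return (1 - p) * np.log((1 - p) / (1 - q)) + p * np.log(p / q)

theta = 0.3
I = 1 / (theta * (1 - theta))
for h in (0.1, 0.01, 0.001):
    # the two columns get closer as h -> 0
    print(h, kl_bernoulli(theta, theta + h), 0.5 * I * h ** 2)
```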
Vectorial case

When $\theta$ is a vector, that is $\Theta \subset \mathbb{R}^d$, the score is the gradient $S(X) = \nabla_\theta \log{p_\theta(X)}$ and the Fisher information becomes a $d \times d$ matrix.
The score is centered, $\mathbb{E}(S(X)) = 0$, so that the Fisher information matrix can be written:
$$ I(\theta) = \mathrm{Cov}(S(X)) = \mathbb{E}(S(X) S(X)^T). $$

The chain rule and the data processing inequality still hold. The link with the Kullback-Leibler divergence is given by:
$$ \forall h, \quad D(P_\theta \Vert P_{\theta + h}) = \frac{1}{2} h^T I(\theta) h + o(\Vert h \Vert^2). $$

Exercises
(Identifiability) For any $\theta = (\theta_1, \theta_2) \in \mathbb{R}^2$, let $X \sim \mathcal{N}(\theta_1 \lor \theta_2, 1)$ and $Y \sim \mathcal{N}(\theta_1 \land \theta_2, 1)$ be independent random variables, where $\lor$ denotes the maximum and $\land$ the minimum. We denote by $P_\theta$ the probability distribution of $(X, Y)$.
(a) Explain why the corresponding statistical model is parametric.
(b) Is this model identifiable?
Answer
(a) The model is parametric because the distribution $P_\theta$ of $(X, Y)$ is indexed by the finite-dimensional parameter $\theta = (\theta_1, \theta_2) \in \mathbb{R}^2$.
(b) No. We have $\mathbb{E}(X) = \theta_1 \lor \theta_2$ and $\mathbb{E}(Y) = \theta_1 \land \theta_2$, but these quantities are symmetric in $(\theta_1, \theta_2)$: the parameters $(\theta_1, \theta_2)$ and $(\theta_2, \theta_1)$ define the same distribution of $(X, Y)$, so we can never distinguish between them. The model is not identifiable.

(Statistics) Let $X = (X_1, X_2)$ where $X_1, X_2 \sim \mathcal{N}(\theta, 1)$ are independent random variables, with $\theta \in \mathbb{R}$ some unknown parameter.
(a) Show that $T(X) = X_1 - X_2$ is a free statistic.
(b) Show that $T(X) = X_1 + X_2$ is a sufficient statistic.
Answer
(a) Let $T(X) = X_1 - X_2$. As a linear combination of independent Gaussian variables, $T(X)$ is Gaussian, with $$ \mathbb{E}(T(X)) = \mathbb{E}(X_1 - X_2) = \mathbb{E}(X_1) - \mathbb{E}(X_2) = \theta - \theta = 0 $$ $$ \mathrm{Var}(T(X)) = \mathrm{Var}(X_1 - X_2) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) = 1 + 1 = 2 $$ So $T(X) \sim \mathcal{N}(0, 2)$: its distribution does not depend on $\theta$. Hence, it is a free statistic.
(b) Let $T(X) = X_1 + X_2$. $$ \begin{align*} p_\theta(x) &= \frac{1}{2 \pi} e^{- \frac{(x_1 - \theta)^2}{2} - \frac{(x_2 - \theta)^2}{2}} \\ &= \frac{1}{2 \pi} e^{- \frac{1}{2} (x_1^2 - 2 x_1 \theta + \theta^2 + x_2^2 - 2 x_2 \theta + \theta^2)} \\ &= \frac{1}{2 \pi} e^{x_1 \theta + x_2 \theta - \theta^2 - \frac{1}{2} (x_1^2 + x_2^2)} \\ &= \frac{1}{2 \pi} e^{\theta (x_1 + x_2 - \theta) - \frac{x_1^2 + x_2^2}{2}} \\ \end{align*} $$ Then, let $g_\theta(t) = \frac{1}{2 \pi} e^{\theta (t - \theta)}$ and $h(x) = e^{- \frac{1}{2} (x_1^2 + x_2^2)}$. This means that $p_\theta(x) = g_\theta(T(x)) h(x)$. By the Fisher factorization theorem, $T$ is sufficient.

(Statistics) Let $X = (X_1, X_2)$ where $X_1, X_2$ are independent random variables, uniformly distributed over $(0, \theta)$ for some unknown parameter $\theta > 0$.
(a) Show that $T(X) = \max{(X_1, X_2)}$ is a sufficient statistic.
(b) Is $T(X) = X_1 + X_2$ a free statistic? A sufficient statistic?
Answer
(a) Note that the distribution of $\min{(X_1, X_2)}$ given $T(X) = t$ is uniform in the interval $(0, t)$, and $\mathbb{P}[X = (\min{(X_1, X_2)}, t)] = \mathbb{P}[X = (t, \min{(X_1, X_2)})] = \frac{1}{2}$. Since the conditional distribution of $X$ given $T(X)$ does not depend on $\theta$, $T$ is a sufficient statistic.
(b) The distribution of $T(X) = X_1 + X_2$ is supported on the interval $(0, 2 \theta)$ (it is a triangular distribution). Since the distribution of $T$ depends on $\theta$, $T$ is not free. Also, the probability distribution of $X_1$ given $T(X) = t$ is uniform in the interval $(\max{(0, t - \theta)}, \min{(t, \theta)})$, which still depends on $\theta$. Thus, $T$ is not sufficient.

(Entropy, Fisher information) Let $X \sim \mathcal{N}(\theta, \sigma^2)$ where $\theta \in \mathbb{R}$ is some unknown parameter.
(a) Compute the entropy $H(X)$.
(b) Compute the Fisher information $I_X(\theta)$.
(c) Interpret the results.
Answer
(a) Since $H(X) = - \mathbb{E}[\log{p}]$, we have $$ \begin{align*} H(X) &= - \mathbb{E}\left[ \log{\left( \frac{1}{\sigma \sqrt{2 \pi}} \right)} - \frac{(X - \theta)^2}{2 \sigma^2} \right] \\ &= \log{(\sigma \sqrt{2 \pi})} + \frac{1}{2 \sigma^2} \underbrace{\mathbb{E}[(X - \theta)^2]}_{\mathrm{Var}(X)} \\ &= \log{(\sigma \sqrt{2 \pi})} + \frac{1}{2 \sigma^2} \sigma^2 \\ &= \frac{1}{2} + \log{(\sigma \sqrt{2 \pi})} \end{align*} $$ (b) First, let us compute the score $S(X)$. $$ S(X) = \frac{\partial \log{p_\theta(X)}}{\partial \theta} = \frac{\partial}{\partial \theta}\left[ - \log{(\sigma \sqrt{2 \pi})} - \frac{(X - \theta)^2}{2 \sigma^2} \right] = \frac{X - \theta}{\sigma^2} $$ Now, we can find $I_X(\theta)$ as: $$ I_X(\theta) = \mathbb{E}[S^2(X)] = \frac{1}{\sigma^4} \mathbb{E}[(X - \theta)^2] = \sigma^{-2}. $$ (c) Neither the entropy nor the Fisher information depends on $\theta$. While the entropy increases with $\sigma$, the Fisher information decreases with $\sigma$ (more randomness means less information about $\theta$).

(Entropy, data processing inequality) Let $X = (X_1, X_2)$ where $X_1$, $X_2$ are independent Bernoulli random variables with parameter $\frac{1}{2}$.
(a) Compute $H(X)$.
(b) Compute $H(T(X))$ with $T(X) = X_1 + X_2$.
(c) Check the data processing inequality.
Answer
(a) We consider the log base 2. $$ \begin{align*} H(X_i) &= - \mathbb{E}[\log{p}] \\ &= - \mathbb{E}[X_i \log{2^{-1}} + (1 - X_i) \log{(1 - 2^{-1})}] \\ &= \mathbb{E}[X_i + (1 - X_i)] = 1 \\ \implies H(X) &= H(X_1) + H(X_2) = 1 + 1 = 2 \end{align*} $$ (b) We can calculate the entropy with the discrete formula. $$ \begin{align*} H(T(X)) &= - \mathbb{E}[\log{p_{T(X)}}] \\ &= - \sum_{i = 0}^2 \mathbb{P}[T(X) = i] \log{(\mathbb{P}[T(X) = i])} \\ &= - [ 2^{-2} \log{2^{-2}} + 2^{-1} \log{2^{-1}} + 2^{-2} \log{2^{-2}} ] \\ &= 2^{-1} + 2^{-1} + 2^{-1} \\ &= \frac{3}{2} \end{align*} $$ (c) Indeed, $$ H(T(X)) \leq H(X) \quad \text{since} \quad \frac{3}{2} \leq 2. $$

(Data processing inequality, sufficiency) Let $X = (X_1, \ldots, X_n)$ where $X_1, \ldots, X_n$ are i.i.d. Bernoulli random variables with parameter $\theta$. Let $T(x) = x_1 + \ldots + x_n$ for any $x \in \mathbb{R}^n$.
(a) Compute $I_X(\theta)$ and $I_{T(X)}(\theta)$.
(b) Check the data processing inequality.
(c) Is $T$ a sufficient statistic for $X$?
Answer
(a) Since $X_1, \ldots, X_n$ are independent, $$ I_X(\theta) = n I_{X_1}(\theta) = \frac{n}{\theta (1 - \theta)}. $$ $T(x) = x_1 + \ldots + x_n \implies T(X) \sim \mathcal{B}(n, \theta) \implies p_\theta(t) = \binom{n}{t} \theta^t (1 - \theta)^{n - t}$ $$ S(T) = \frac{T}{\theta} - \frac{n - T}{1 - \theta} = \frac{T - n \theta}{\theta (1 - \theta)} $$ Since $\mathbb{E}(S(T)) = 0$ and $\mathrm{Var}(T) = n \theta (1 - \theta)$, $$ I_{T(X)}(\theta) = \mathrm{Var}(S(T)) = \frac{\mathrm{Var}(T)}{\theta^2 (1 - \theta)^2} = \frac{n}{\theta (1 - \theta)}. $$ (b) Indeed, $$ \frac{n}{\theta (1 - \theta)} \leq \frac{n}{\theta (1 - \theta)}. $$ (c) Yes, because $I_X(\theta) = I_{T(X)}(\theta)$.

(Fisher information and Kullback-Leibler divergence) Let $X \sim \mathcal{N}(\theta, \sigma^2)$ where the variance $\sigma^2$ is known. Denote by $P_\theta$ the corresponding probability distribution.
(a) Compute the Kullback-Leibler divergence $D(P_\theta \Vert P_{\theta + h})$ for any $h \in \mathbb{R}$.
(b) Compute the Fisher information $I_X(\theta)$.
(c) Check the behaviour of $D(P_\theta \Vert P_{\theta + h})$ when $h \to 0$.
Answer
(a) $$ D(P_\theta \Vert P_{\theta + h}) = \mathbb{E}\left[ \log{\frac{p_\theta(X)}{p_{\theta + h}(X)}} \right] $$ Let us calculate $\log{\frac{p_\theta(X)}{p_{\theta + h}(X)}}$. $$ \begin{align*} \log{\frac{p_\theta(X)}{p_{\theta + h}(X)}} &= - \frac{(X - \theta)^2}{2 \sigma^2} - \left( - \frac{(X - (\theta + h))^2}{2 \sigma^2} \right) \\ &= \frac{1}{2 \sigma^2} \left[ - (X^2 - 2 X \theta + \theta^2) + X^2 - 2 X (\theta + h) + (\theta + h)^2 \right] \\ &= \frac{1}{2 \sigma^2} \left[ 2 X \theta - \theta^2 - 2 X \theta - 2 X h + (\theta^2 + 2 \theta h + h^2) \right] \\ &= \frac{1}{2 \sigma^2} \left[ h^2 - 2 X h + 2 \theta h \right] \\ &= \frac{h}{2 \sigma^2} \left( h + 2 \theta - 2 X \right) \end{align*} $$ Now, we can compute $D(P_\theta \Vert P_{\theta + h})$. $$ \begin{align*} D(P_\theta \Vert P_{\theta + h}) &= \mathbb{E}\left[ \frac{h}{2 \sigma^2} \left( h + 2 \theta - 2 X \right) \right] \\ &= \frac{h}{2 \sigma^2} \left( h + 2 \theta - 2 \mathbb{E}[X] \right) \\ &= \frac{h}{2 \sigma^2} \left( h + 2 \theta - 2 \theta \right) \\ &= \frac{h^2}{2 \sigma^2}. \end{align*} $$ (b) First, let us compute the score $S(X)$. $$ S(X) = \frac{\partial}{\partial \theta}\left[ - \log{(\sigma \sqrt{2 \pi})} - \frac{(X - \theta)^2}{2 \sigma^2} \right] = \frac{X - \theta}{\sigma^2} $$ Then, $$ I_X(\theta) = \mathbb{E}[S^2(X)] = \sigma^{-2}. $$ (c) We have $$ D(P_\theta \Vert P_{\theta + h}) = \frac{h^2}{2 \sigma^2} = \frac{1}{2} I_X(\theta) h^2, $$ so the expansion $D(P_\theta \Vert P_{\theta + h}) = \frac{1}{2} I(\theta) h^2 + o(h^2)$ holds exactly here, and in particular $D(P_\theta \Vert P_{\theta + h}) \to 0$ as $h \to 0$.

(Independent random variables) Let $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$ where $X_k \sim \mathcal{N}(\cos{(k \theta)}, 1)$ and $Y_k \sim \mathcal{N}(\sin{(k \theta)}, 1)$, $k = 1, \ldots, n$, are independent variables.
(a) Compute the Fisher informations $I_X(\theta)$ and $I_Y(\theta)$.
(b) Deduce the Fisher information $I_{X, Y}(\theta)$.
Answer
(a) Since the $X_k$ (respectively the $Y_k$) are independent, we have $$ \begin{align*} I_X(\theta) &= \sum_{k = 1}^n I_{X_k}(\theta) \\ I_Y(\theta) &= \sum_{k = 1}^n I_{Y_k}(\theta) \end{align*} $$ Let us calculate $I_{X_k}(\theta)$. $$ \begin{align*} S(X_k) &= \frac{\partial}{\partial \theta}\left[ - \log{\sqrt{2 \pi}} - \frac{(X_k - \cos{(k \theta)})^2}{2} \right] = - k \sin{(k \theta)} (X_k - \cos{(k \theta)}) \\ \implies I_{X_k}(\theta) &= \mathbb{E}[S(X_k)^2] = k^2 \sin^2{(k \theta)} \mathbb{E}[(X_k - \cos{(k \theta)})^2] = k^2 \sin^2{(k \theta)}. \end{align*} $$ Similarly, $I_{Y_k}(\theta) = k^2 \cos^2{(k \theta)}$. Thus, $$ \begin{align*} I_X(\theta) &= \sum_{k = 1}^n k^2 \sin^2{(k \theta)} \\ I_Y(\theta) &= \sum_{k = 1}^n k^2 \cos^2{(k \theta)} \end{align*} $$ (b) Since $X$ and $Y$ are independent, we have $$ I_{X, Y}(\theta) = I_X(\theta) + I_Y(\theta) = \sum_{k = 1}^n k^2 = \frac{n (n + 1) (2n + 1)}{6} $$
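As a quick numerical check (illustration only, with arbitrary $n$ and $\theta$), the two sums indeed add up to $\frac{n (n + 1) (2n + 1)}{6}$:

```python
# A quick check (illustration only) that the two Fisher informations computed above add
# up to sum_{k=1}^n k^2 = n (n + 1) (2n + 1) / 6, for arbitrary theta and n.
import numpy as np

n, theta = 7, 0.4
k = np.arange(1, n + 1)
I_X = np.sum(k ** 2 * np.sin(k * theta) ** 2)
I_Y = np.sum(k ** 2 * np.cos(k * theta) ** 2)
print(I_X + I_Y, n * (n + 1) * (2 * n + 1) / 6)   # both equal 140.0
```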
(Fisher information matrix) For any $\theta = (\theta_1, \theta_2) \in \mathbb{R}^2$, let $Y_1 \sim \mathcal{N}(\theta_1, 1)$, $Y_2 \sim \mathcal{N}(\theta_2, 1)$ be independent random variables. Let $X = (X_1, X_2) \sim P_\theta$ with $X_1 = 2Y_1 + Y_2$, $X_2 = Y_1 + Y_2$.

(a) Show that the model is identifiable.
(b) Compute the Fisher information matrix $I_X(\theta)$.
(c) Deduce the behavior of $D(P_\theta \Vert P_{\theta + h})$ when $h \to 0$.
Answer
(a) The map $\theta = (\theta_1, \theta_2) \mapsto \mu(\theta) = (2 \theta_1 + \theta_2, \theta_1 + \theta_2)$ is linear with invertible matrix $\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$, so it is injective: two different parameters give two different mean vectors, hence two different distributions of $X$. The model is identifiable.
(b) We can say that $X \sim \mathcal{N}(\mu, \Gamma)$ where $\mu$ is the mean vector and $\Gamma$ is the covariance matrix. $$ \begin{align*} \mu &= \begin{pmatrix} 2 \theta_1 + \theta_2 \\ \theta_1 + \theta_2 \end{pmatrix} \\ \Gamma &= \begin{pmatrix} \mathrm{Cov}(X_1, X_1) & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Cov}(X_2, X_2) \end{pmatrix} \end{align*} $$ where $$ \begin{align*} \mathrm{Cov}(X_1, X_1) &= \mathrm{Cov}(2 Y_1 + Y_2, 2 Y_1 + Y_2) = 5 \\ \mathrm{Cov}(X_1, X_2) &= \mathrm{Cov}(X_2, X_1) = \mathrm{Cov}(Y_1 + Y_2, 2 Y_1 + Y_2) = 3 \\ \mathrm{Cov}(X_2, X_2) &= \mathrm{Cov}(Y_1 + Y_2, Y_1 + Y_2) = 2 \\ \implies \Gamma &= \begin{pmatrix} 5 & 3 \\ 3 & 2 \end{pmatrix} \end{align*} $$ We can now write the probability density of $X$: $$ p_\theta(\mathbf{x}) = \frac{1}{\sqrt{(2 \pi)^2 \det{\Gamma}}} e^{-\frac{1}{2} (\mathbf{x} - \mu)^\top \Gamma^{-1} (\mathbf{x} - \mu)} $$ Then, we compute the score $S(X)$, keeping in mind that only $\mu$ depends on $\theta$: $$ \begin{align*} S(X) &= \nabla_\theta \log{p_\theta(X)} \\ &= \nabla_\theta \left( - \log{\left( 2 \pi \sqrt{\det{\Gamma}} \right)} - \frac{1}{2} (X - \mu)^\top \Gamma^{-1} (X - \mu) \right) \\ &= - \frac{1}{2} \nabla_\theta \left( (X - \mu)^\top \Gamma^{-1} (X - \mu) \right) \\ &= - \frac{1}{2} \left( - 2 (\nabla_\theta \mu)^\top \Gamma^{-1} (X - \mu) \right) \\ &= (\nabla_\theta \mu)^\top \Gamma^{-1} (X - \mu), \end{align*} $$ where $\nabla_\theta \mu = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$ is the Jacobian of $\mu$ with respect to $\theta$. The score is a column vector of size $2$, so $$ \begin{align*} I_X(\theta) &= \mathbb{E}[S(X) S(X)^\top] \\ &= \mathbb{E}[ (\nabla_\theta \mu)^\top \Gamma^{-1} (X - \mu) (X - \mu)^\top \Gamma^{-1} (\nabla_\theta \mu) ] \\ &= (\nabla_\theta \mu)^\top \Gamma^{-1} \mathbb{E}[(X - \mu) (X - \mu)^\top] \Gamma^{-1} (\nabla_\theta \mu) \\ &= (\nabla_\theta \mu)^\top \Gamma^{-1} \Gamma \Gamma^{-1} (\nabla_\theta \mu) \\ &= (\nabla_\theta \mu)^\top \Gamma^{-1} (\nabla_\theta \mu) \\ &= I_2, \end{align*} $$ since $\Gamma = (\nabla_\theta \mu) (\nabla_\theta \mu)^\top$, so that $(\nabla_\theta \mu)^\top \Gamma^{-1} (\nabla_\theta \mu) = (\nabla_\theta \mu)^\top ((\nabla_\theta \mu)^\top)^{-1} (\nabla_\theta \mu)^{-1} (\nabla_\theta \mu) = I_2$.

(c) By the result of the vectorial case, $$ D(P_\theta \Vert P_{\theta + h}) = \frac{1}{2} h^\top I_X(\theta) h + o(\Vert h \Vert^2) = \frac{1}{2} \Vert h \Vert^2 + o(\Vert h \Vert^2), $$ which tends to $0$ as $h \to 0$.
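As a numerical check (illustration only), the sketch below verifies that $(\nabla_\theta \mu)^\top \Gamma^{-1} (\nabla_\theta \mu)$ reduces to the identity matrix when $\Gamma = (\nabla_\theta \mu) (\nabla_\theta \mu)^\top$.

```python
# A numerical check (illustration only) of the result above: with mu(theta) = A theta
# where A = [[2, 1], [1, 1]] and Gamma = A A^T, the Fisher information matrix
# (grad mu)^T Gamma^{-1} (grad mu) = A^T Gamma^{-1} A reduces to the identity I_2.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])        # Jacobian of mu(theta) = (2 theta_1 + theta_2, theta_1 + theta_2)
Gamma = A @ A.T                   # covariance of X = A Y with Cov(Y) = I_2
I_X = A.T @ np.linalg.inv(Gamma) @ A
print(Gamma)                      # [[5, 3], [3, 2]]
print(np.round(I_X, 10))          # identity matrix I_2
```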