Conjugacy
According to Bayes’ theorem, the posterior is proportional to the product of the prior and the likelihood. The specification of the prior can be tricky for two reasons: First, the prior should encapsulate our knowledge about the problem before we see any data. This is often difficult to describe. Second, it is often not possible to compute the posterior distribution analytically. However, there are some priors that are computationally convenient and are called conjugate priors.
Definition (Conjugate Prior). A prior is conjugate for the likelihood function if the posterior is of the same form/type as the prior.
Conjugacy is particularly convenient because we can algebraically calculate our posterior distribution by updating the parameters of the prior distribution. When considering the geometry of probability distributions, conjugate priors retain the same distance structure as the likelihood.
To introduce a concrete example of conjugate priors, consider the Binomial distribution (defined on discrete random variables) and the Beta distribution (defined on continuous random variables).
Beta-Binomial Conjugacy
Consider a Binomial random variable $x \sim Bin(N,\mu)$, where
$p(x|N,\mu)=\binom{N}{x}\mu^x(1-\mu)^{N-x},\quad x=0,1,\ldots,N$
is the probability of finding $x$ times the outcome “heads” in $N$ coin flips, where $\mu$ is the probability of a “head”. We place a Beta prior on the parameter $\mu$, that is, $\mu \sim Beta(\alpha,\beta)$, where
$p(\mu|\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\mu^{\alpha-1}(1-\mu)^{\beta-1}$
If we now observe some outcome $x = h$, that is, we see $h$ heads in $N$ coin flips, we compute the posterior distribution on $\mu$ as
$p(\mu|x=h,N,\alpha,\beta) \propto p(x|N,\mu)p(\mu|\alpha,\beta)$
$\propto \mu^h(1-\mu)^{N-h}\mu^{\alpha-1}(1-\mu)^{\beta-1}$
$=\mu^{h+\alpha-1}(1-\mu)^{N-h+\beta-1}$
$\propto Beta(h+\alpha,N-h+\beta)$
That is, the posterior distribution is a Beta distribution, just like the prior: the Beta prior is conjugate for the parameter $\mu$ in the Binomial likelihood function.
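To make the algebraic update concrete, here is a minimal numerical sketch that checks a grid-evaluated, unnormalized posterior against the closed-form $Beta(h+\alpha,N-h+\beta)$ using SciPy; the prior parameters and observed counts below are illustrative assumptions, not values from the text.

```python
# A minimal numerical sketch of the Beta-Binomial update above; the prior
# parameters (alpha, beta) and the data (N, h) are illustrative assumptions.
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import beta, binom

alpha, beta_prior = 2.0, 3.0   # assumed Beta prior parameters
N, h = 10, 7                   # assumed: 7 heads observed in 10 flips

mu = np.linspace(1e-3, 1 - 1e-3, 500)

# Unnormalized posterior: Binomial likelihood times Beta prior on a grid of mu.
unnormalized = binom.pmf(h, N, mu) * beta.pdf(mu, alpha, beta_prior)
posterior_grid = unnormalized / trapezoid(unnormalized, mu)

# Closed-form posterior predicted by conjugacy: Beta(h + alpha, N - h + beta).
posterior_closed = beta.pdf(mu, h + alpha, N - h + beta_prior)

print(np.max(np.abs(posterior_grid - posterior_closed)))  # small, up to grid error
```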
Beta-Bernoulli Conjugacy
Let $x \in \{0,1\}$ be distributed according to the Bernoulli distribution with parameter $\theta \in [0,1]$, that is, $p(x=1|\theta)=\theta$. This can also be expressed as $p(x|\theta)=\theta^x(1-\theta)^{1-x}$. Let $\theta$ be distributed according to a Beta distribution with parameters $\alpha,\beta$, that is, $p(\theta|\alpha,\beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$.
Multiplying the Beta and the Bernoulli distribution, we get
$p(\theta|x,\alpha,\beta) \propto p(x|\theta)p(\theta|\alpha,\beta)$
$\quad \quad \propto \theta^x(1-\theta)^{1-x}\theta^{\alpha-1}(1-\theta)^{\beta -1}$
$\quad \quad = \theta^{\alpha+x-1}(1-\theta)^{\beta+(1-x)-1}$
$\propto p(\theta|\alpha+x,\beta+(1-x))$
This is the Beta distribution with parameters $(\alpha+x,\beta+(1-x))$.
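Because each Bernoulli observation updates the Beta parameters in closed form, the posterior can be maintained by simple counting. The following sketch applies the update $(\alpha,\beta)\to(\alpha+x,\beta+(1-x))$ one observation at a time; the prior and the observation sequence are made up for illustration.

```python
# A small sketch of sequential updating under Beta-Bernoulli conjugacy; the
# prior and the observation sequence below are made up for illustration.
alpha, beta = 1.0, 1.0             # Beta(1, 1) (uniform) prior on theta
observations = [1, 0, 1, 1, 0, 1]  # hypothetical Bernoulli outcomes x in {0, 1}

for x in observations:
    # Each observation updates the Beta parameters in closed form:
    # (alpha, beta) -> (alpha + x, beta + (1 - x)).
    alpha += x
    beta += 1 - x

# Posterior after all observations, with posterior mean alpha / (alpha + beta).
print(alpha, beta, alpha / (alpha + beta))
```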
Sufficient Statistics
A statistic of a random variable is a deterministic function of that random variable. For example, if $x=[x_1,\ldots,x_N]^T$ is a vector of univariate Gaussian random variables, that is, $x_n \sim \mathcal{N}(\mu,\sigma^2)$, then the sample mean $\bar{\mu}=\frac{1}{N}(x_1+\cdots+x_N)$ is a statistic.
Sir Ronald Fisher discovered the notion of sufficient statistics: the idea that there are statistics that will contain all available information that can be inferred from data corresponding to the distribution under consideration. In other words, sufficient statistics carry all the information needed to make inference about the population, that is, they are the statistics that are sufficient to represent the distribution.
For a set of distributions parametrized by $\theta$, let $X$ be a random variable with distribution $p(x | \theta_0)$ given an unknown $\theta_0$. A vector $\phi(x)$ of statistics is called sufficient statistics for $\theta_0$ if they contain all possible information about $\theta_0$. To be more formal about “contain all possible information”, this means that the probability of $x$ given $\theta$ can be factored into a part that does not depend on $\theta$, and a part that depends on $\theta$ only via $\phi(x)$.
The Fisher-Neyman factorization theorem formalizes this notion:
Let $X$ have a probability density function $p(x|\theta)$. Then the statistics $\phi(x)$ are sufficient for $\theta$, if and only if $p(x|\theta)$ can be written in the form
$p(x|\theta)=h(x)g_\theta(\phi(x))$
where $h(x)$ is a distribution independent of $\theta$ and $g_\theta$ captures all the dependence on $\theta$ via sufficient statistics $\phi(x)$.
If $p(x | \theta)$ does not depend on $\theta$, then $\phi(x)$ is trivially a sufficient statistic for any function $\phi$. The more interesting case is that $p(x | \theta)$ is dependent only on $\phi(x)$ and not $x$ itself. In this case, $\phi(x)$ is a sufficient statistic for $\theta$.
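As a standard illustration of the factorization (a sketch using an i.i.d. sample $x_1,\ldots,x_N$ assumed here, not taken from the text above), the Bernoulli likelihood factors with $h(x)=1$ and sufficient statistic $\phi(x)=\sum_n x_n$:

```latex
% Sketch: sufficiency of the sum of N i.i.d. Bernoulli observations,
% via the Fisher-Neyman factorization with h(x) = 1.
\begin{align*}
p(x_1,\ldots,x_N \mid \theta)
  &= \prod_{n=1}^{N} \theta^{x_n}(1-\theta)^{1-x_n}
   = \theta^{\sum_{n} x_n}\,(1-\theta)^{N-\sum_{n} x_n} \\
  &= \underbrace{1}_{h(x)} \cdot
     \underbrace{\theta^{\phi(x)}(1-\theta)^{N-\phi(x)}}_{g_\theta(\phi(x))},
     \qquad \phi(x) = \sum_{n=1}^{N} x_n .
\end{align*}
```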
In machine learning, we consider a finite number of samples from a distribution. One could imagine that for simple distributions (such as the Bernoulli) we only need a small number of samples to estimate the parameters of the distributions. We could also consider the opposite problem: If we have a set of data (a sample from an unknown distribution), which distribution gives the best fit? A natural question to ask is, as we observe more data, do we need more parameters $\theta$ to describe the distribution? It turns out that the answer is yes in general, and this is studied in non-parametric statistics.
A converse question is to consider which class of distributions has finite-dimensional sufficient statistics, that is, the number of parameters needed to describe them does not increase arbitrarily. The answer is exponential family distributions, described in the following section.
Exponential Family
Many of the probability distributions “with names” that we find in statistics textbooks were discovered to model particular types of phenomena. For example, we have seen the Gaussian distribution. It is often difficult to figure out which distribution to use. We saw that many of the operations required for inference can be conveniently calculated when the distribution is Gaussian. As we collect more data, we do not need more parameters to describe the distribution. Since we are interested in learning from data, we want parameter estimation to behave nicely. It turns out that the class of distributions called the exponential family provides the right balance of generality while retaining favorable computation and inference properties.
There are three possible levels of abstraction we can have when considering distributions (of discrete or continuous random variables). At level one (the most concrete end of the spectrum), we have a particular named distribution with fixed parameters, for example a univariate Gaussian $\mathcal{N}(0,1)$, with zero mean and unit variance. In machine learning, we often use the second level of abstraction, that is, we fix the parametric form (the univariate Gaussian) and infer the parameters from data. For example, we assume a univariate Gaussian $\mathcal{N}(\mu,\sigma^2)$, with unknown mean $\mu$ and unknown variance $\sigma^2$, and use a maximum likelihood fit to determine the best parameters $(\mu,\sigma^2)$.
A third level of abstraction is to consider families of distributions, and we consider the exponential family. The univariate Gaussian is an example of a member of the exponential family. Many of the widely used statistical models, including all the “named” models (Binomial, Bernoulli, Beta, etc.), are members of the exponential family. They can all be unified into one concept. Exponential families are the only families that enjoy finite-dimensional sufficient statistics under repeated independent sampling.
An exponential family is a family of probability distributions, parameterized by $\theta \in \mathbb{R}^D$, of the form
$p(x|\theta)=h(x)\exp(\langle\theta,\phi(x)\rangle-A(\theta))$
where $\phi(x)$ is the vector of sufficient statistics. In general, any inner product can be used, and for concreteness we will use the standard dot product here, $\langle\theta,\phi(x)\rangle=\theta^T\phi(x)$.
The factor $h(x)$ can be absorbed into the dot product term by adding another entry $\log h(x)$ to the vector of sufficient statistics $\phi(x)$, and constraining the corresponding parameter $\theta_0 = 1$. The term $A(\theta)$ is the normalization constant that ensures that the distribution sums up or integrates to one and is called the log-partition function. A good intuitive notion of exponential families can be obtained by ignoring these two terms and considering exponential families as distributions of the form
$p(x | \theta) \propto \exp(\theta^T \phi(x))$
As a concrete example, consider the univariate Gaussian distribution $\mathcal{N}(\mu,\sigma^2)$ and let $\phi(x)=\begin{bmatrix}x\\ x^2\end{bmatrix}$. Then by using the definition of exponential family,
$p(x|\theta) \propto \exp(\theta_1 x + \theta_2 x^2)$
Setting $\theta=\begin{bmatrix}\frac{\mu}{\sigma^2} & -\frac{1}{2\sigma^2}\end{bmatrix}^T$ and substituting, we obtain $p(x|\theta) \propto \exp\left(\frac{\mu x}{\sigma^2}-\frac{x^2}{2\sigma^2}\right) \propto \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$. Therefore, the univariate Gaussian is a member of the exponential family with sufficient statistics $\phi(x)=\begin{bmatrix}x\\ x^2\end{bmatrix}$, and natural parameters given by $\theta$.
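As a complementary sketch in code (using the Bernoulli rather than the Gaussian treated above, with an arbitrary parameter value chosen for illustration), the distribution $p(x|\theta)=\theta^x(1-\theta)^{1-x}$ can be rewritten in exponential-family form with sufficient statistic $\phi(x)=x$, natural parameter $\eta=\log\frac{\theta}{1-\theta}$, and log-partition function $A(\eta)=\log(1+e^{\eta})$:

```python
# A sketch (parameter value assumed for illustration) of the Bernoulli written
# in exponential-family form: p(x|eta) = exp(eta * x - A(eta)), with sufficient
# statistic phi(x) = x, natural parameter eta = log(theta / (1 - theta)),
# log-partition function A(eta) = log(1 + exp(eta)), and h(x) = 1.
import numpy as np

theta = 0.3                          # assumed Bernoulli parameter
eta = np.log(theta / (1 - theta))    # natural parameter
A = np.log(1 + np.exp(eta))          # log-partition function

for x in (0, 1):
    p_standard = theta**x * (1 - theta)**(1 - x)
    p_expfam = np.exp(eta * x - A)
    print(x, p_standard, p_expfam)   # both parameterizations give the same value
```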