
4.7 Conjugacy and Exponential Family

 

Conjugacy

According to Bayes’ theorem, the posterior is proportional to the product of the prior and the likelihood. The specification of the prior can be tricky for two reasons. First, the prior should encapsulate our knowledge about the problem before we see any data, which is often difficult to describe. Second, it is often not possible to compute the posterior distribution analytically. However, there are some priors that are computationally convenient; these are called conjugate priors.

In Bayesian probability theory, if the posterior distribution $p(\theta | x)$ is in the same probability distribution family as the prior probability distribution $p(\theta)$, the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function $p(x | \theta)$.

A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise numerical integration may be necessary. Further, conjugate priors may give intuition by more transparently showing how a likelihood function updates a prior distribution.

The concept, as well as the term "conjugate prior", were introduced by Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory.

Definition (Conjugate Prior). A prior is conjugate for the likelihood function if the posterior is of the same form/type as the prior.

Conjugacy is particularly convenient because we can calculate our posterior distribution algebraically by updating the parameters of the prior distribution. When considering the geometry of probability distributions, conjugate priors retain the same distance structure as the likelihood.

To introduce a concrete example of conjugate priors, consider the Binomial distribution (defined on discrete random variables) and the Beta distribution (defined on continuous random variables).

Beta-Binomial Conjugacy

Consider a Binomial random variable $x \sim Bin(N,\mu)$, where

$p(x|N,\mu)=\binom{N}{x}\mu^x(1-\mu)^{N-x},\quad x=0,1,\ldots,N$

is the probability of finding $x$ times the outcome “heads” in $N$ coin flips, where $\mu$ is the probability of a “head”. We place a Beta prior on the parameter $\mu$, that is, $\mu \sim Beta(\alpha,\beta)$, where

$p(\mu|\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\mu^{\alpha-1}(1-\mu)^{\beta-1},\quad \mu \in [0,1].$

If we now observe some outcome $x = h$, that is, we see $h$ heads in $N$ coin flips, we compute the posterior distribution on $\mu$ as

$p(\mu|x=h,N,\alpha,\beta) \propto p(x|N,\mu)p(\mu|\alpha,\beta)$

$\propto \mu^h(1-\mu)^{N-h}\mu^{\alpha-1}(1-\mu)^{\beta-1}$

$=\mu^{h+\alpha-1}(1-\mu)^{N-h+\beta-1}$

$\propto Beta(h+\alpha,N-h+\beta)$

The posterior distribution is a Beta distribution, like the prior; that is, the Beta prior is conjugate for the parameter $\mu$ in the Binomial likelihood function.
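To make the update concrete, here is a minimal numerical sketch (not part of the original notes) that applies the conjugate update with SciPy; the prior parameters and the data $h$, $N$ are illustrative values.

```python
from scipy import stats

# Illustrative values: Beta(alpha, beta) prior; h heads observed in N flips
alpha, beta = 2.0, 2.0
N, h = 10, 7

# Conjugate update: the posterior is Beta(h + alpha, N - h + beta)
posterior = stats.beta(h + alpha, N - h + beta)

# The posterior mean (h + alpha) / (N + alpha + beta) moves from the prior
# mean alpha / (alpha + beta) towards the empirical frequency h / N
print("prior mean    :", alpha / (alpha + beta))
print("posterior mean:", posterior.mean())
```

Note that no numerical integration is needed: conjugacy reduces the Bayesian update to adding the observed counts to the prior parameters.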

Beta-Bernoulli Conjugacy

Let $x \in \{0,1\}$ be distributed according to the Bernoulli distribution with parameter $\theta \in [0,1]$, that is, $p(x=1|\theta)=\theta$. This can also be expressed as $p(x|\theta)=\theta^x(1-\theta)^{1-x}$. Let $\theta$ be distributed according to a Beta distribution with parameters $\alpha,\beta$, that is, $p(\theta|\alpha,\beta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$.

Multiplying the Beta and the Bernoulli distribution, we get

$p(\theta|x,\alpha,\beta) \propto p(x|\theta)p(\theta|\alpha,\beta)$

$\quad \quad \propto \theta^x(1-\theta)^{1-x}\theta^{\alpha-1}(1-\theta)^{\beta -1}$

$\quad \quad = \theta^{\alpha+x-1}(1-\theta)^{\beta+(1-x)-1}$

$\propto p(\theta|\alpha+x,\beta+(1-x))$

This is a Beta distribution with parameters $(\alpha+x,\beta+(1-x))$.
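Applying this update once for each of $N$ independent observations $x_1,\ldots,x_N$ gives (a short extension of the derivation above, used implicitly in the example at the end of this section)

$p(\theta|x_1,\ldots,x_N,\alpha,\beta) \propto \theta^{\alpha+\sum_{n=1}^N x_n-1}(1-\theta)^{\beta+N-\sum_{n=1}^N x_n-1}$

that is, a Beta distribution with parameters $(\alpha+\sum_n x_n,\ \beta+N-\sum_n x_n)$: the prior parameters simply accumulate the counts of ones and zeros.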

Sufficient Statistics

A statistic of a random variable is a deterministic function of that random variable. For example, if $x=[x_1,\ldots,x_N]^T$ is a vector of univariate Gaussian random variables, that is, $x_n \sim \mathcal{N}(\mu,\sigma^2)$, then the sample mean $\bar{\mu}=\frac{1}{N}(x_1+\cdots+x_N)$ is a statistic.

Sir Ronald Fisher discovered the notion of sufficient statistics: the idea that there are statistics that will contain all available information that can be inferred from data corresponding to the distribution under consideration. In other words, sufficient statistics carry all the information needed to make inference about the population, that is, they are the statistics that are sufficient to represent the distribution.

For a set of distributions parametrized by $\theta$, let $X$ be a random variable with distribution $p(x | \theta_0)$ for an unknown $\theta_0$. A vector $\phi(x)$ of statistics is called a sufficient statistic for $\theta_0$ if it contains all possible information about $\theta_0$. To be more formal about “contain all possible information”, this means that the probability of $x$ given $\theta$ can be factored into a part that does not depend on $\theta$, and a part that depends on $\theta$ only via $\phi(x)$.

The Fisher-Neyman factorization theorem formalizes this notion:

Let $X$ have a probability density function $p(x|\theta)$. Then the statistics $\phi(x)$ are sufficient for $\theta$ if and only if $p(x|\theta)$ can be written in the form

$p(x|\theta)=h(x)g_\theta(\phi(x))$

where $h(x)$ is a distribution independent of $\theta$ and $g_\theta$ captures all the dependence on $\theta$ via sufficient statistics $\phi(x)$.

If $p(x | \theta)$ does not depend on $\theta$, then $\phi(x)$ is trivially a sufficient statistic for any function $\phi$. The more interesting case is when $p(x | \theta)$ depends on $x$ only through $\phi(x)$; in this case, $\phi(x)$ is a sufficient statistic for $\theta$.
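As a worked illustration of the factorization (an added example using the Bernoulli distribution from earlier): for $N$ i.i.d. samples $x_1,\ldots,x_N$ with $x_n \sim Ber(\mu)$,

$p(x_1,\ldots,x_N|\mu)=\prod_{n=1}^N \mu^{x_n}(1-\mu)^{1-x_n}=\mu^{\sum_n x_n}(1-\mu)^{N-\sum_n x_n}$

which has the Fisher-Neyman form with $h(x)=1$ and $g_\mu$ depending on the data only through $\phi(x)=\sum_n x_n$. Hence the number of ones (together with $N$) is sufficient for $\mu$; the individual outcomes carry no further information.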

In machine learning, we consider a finite number of samples from a distribution. One could imagine that for simple distributions (such as the Bernoulli) we only need a small number of samples to estimate the parameters of the distribution. We could also consider the opposite problem: if we have a set of data (a sample from an unknown distribution), which distribution gives the best fit? A natural question to ask is whether, as we observe more data, we need more parameters $\theta$ to describe the distribution. It turns out that the answer is yes in general, and this is studied in non-parametric statistics.

A converse question is to consider which classes of distributions have finite-dimensional sufficient statistics, that is, for which the number of parameters needed to describe them does not increase arbitrarily with the amount of data. The answer is the exponential family of distributions, described in the following section.

Exponential Family

Many of the probability distributions “with names” that we find in statistics textbooks were discovered to model particular types of phenomena. For example, we have seen the Gaussian distribution. It is often difficult to figure out which distribution to use. We saw that many of the operations required for inference can be conveniently calculated when the distribution is Gaussian. As we collect more data, we do not need more parameters to describe the distribution. Since we are interested in learning from data, we want parameter estimation to behave nicely. It turns out that the class of distributions called the exponential family provides the right balance of generality while retaining favorable computation and inference properties.

There are three possible levels of abstraction we can have when considering distributions (of discrete or continuous random variables). At level one (the most concrete end of the spectrum), we have a particular named distribution with fixed parameters, for example a univariate Gaussian $\mathcal{N}(0,1)$ with zero mean and unit variance. In machine learning, we often use the second level of abstraction, that is, we fix the parametric form (the univariate Gaussian) and infer the parameters from data. For example, we assume a univariate Gaussian $\mathcal{N}(\mu,\sigma^2)$ with unknown mean $\mu$ and unknown variance $\sigma^2$, and use a maximum likelihood fit to determine the best parameters $(\mu,\sigma^2)$.

A third level of abstraction is to consider families of distributions, and here we consider the exponential family. The univariate Gaussian is an example of a member of the exponential family. Many of the widely used statistical models, including all the “named” models (Binomial, Bernoulli, Beta, etc.), are members of the exponential family. They can all be unified into one concept. Exponential families are the only families that enjoy finite-dimensional sufficient statistics under repeated independent sampling.

An exponential family is a family of probability distributions, parameterized by $\theta \in \mathbb{R}^D$, of the form

$p(x|\theta)=h(x)\exp(\langle\theta,\phi(x)\rangle-A(\theta))$

where $\phi(x)$ is the vector of sufficient statistics. In general, any inner product can be used, and for concreteness we will use the standard dot product here, $\langle\theta,\phi(x)\rangle=\theta^T\phi(x)$.

The factor $h(x)$ can be absorbed into the dot product term by adding another entry $\log h(x)$ to the vector of sufficient statistics $\phi(x)$ and constraining the corresponding parameter to $\theta_0 = 1$ (made explicit below). The term $A(\theta)$ is the normalization constant that ensures the distribution sums up or integrates to one; it is called the log-partition function. A good intuitive notion of exponential families can be obtained by ignoring these two terms and considering exponential families as distributions of the form

$p(x | \theta) \propto \exp(\theta^T \phi(x))$

For this form of parametrization, the parameters $\theta$ are called the natural parameters.
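To make the absorption of $h(x)$ explicit (a small added rewriting): with the augmented vectors

$\tilde{\phi}(x)=\begin{bmatrix}\phi(x)\\ \log h(x)\end{bmatrix}, \quad \tilde{\theta}=\begin{bmatrix}\theta\\ 1\end{bmatrix}$

we have $h(x)\exp(\langle\theta,\phi(x)\rangle-A(\theta)) = \exp(\langle\tilde{\theta},\tilde{\phi}(x)\rangle-A(\theta))$, which is again of exponential family form with the last parameter fixed to one.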

Examples:

Gaussian as Exponential Family
Consider the univariate Gaussian distribution $\mathcal{N}(\mu,\sigma^2)$.
Let $\phi(x)=\begin{bmatrix} x\\ x^2 \end{bmatrix}$.
Then, using the definition of the exponential family,
$p(x | \theta) \propto \exp(\theta^T \phi(x))=\exp(\theta_1 x+\theta_2 x^2)$
Setting
$\theta=\begin{bmatrix}\frac{\mu}{\sigma^2} & \frac{-1}{2\sigma^2}\end{bmatrix}^T$

we obtain
$p(x | \theta) \propto \exp\left(\frac{\mu x}{\sigma^2} - \frac{x^2}{2\sigma^2}\right) \propto \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$

Therefore, the univariate Gaussian distribution is a member of the exponential family with sufficient statistics $\phi(x)=\begin{bmatrix} x\\ x^2 \end{bmatrix}$, and natural parameters given by $\theta$.
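As a sanity check (an added sketch, not part of the original notes), we can evaluate the exponential family form numerically and compare it against the standard Gaussian density. The log-partition function $A(\theta)=-\frac{\theta_1^2}{4\theta_2}-\frac{1}{2}\log(-2\theta_2)$ and base measure $h(x)=1/\sqrt{2\pi}$ are not derived in the text; they follow from normalizing the Gaussian.

```python
import numpy as np

def exp_family_pdf(x, theta, phi, A, h):
    """p(x | theta) = h(x) * exp(<theta, phi(x)> - A(theta))"""
    return h(x) * np.exp(np.dot(theta, phi(x)) - A(theta))

mu, sigma = 1.0, 2.0
theta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])  # natural parameters
phi = lambda x: np.array([x, x**2])                       # sufficient statistics
A = lambda t: -t[0]**2 / (4 * t[1]) - 0.5 * np.log(-2 * t[1])
h = lambda x: 1.0 / np.sqrt(2 * np.pi)

x = 0.5
print(exp_family_pdf(x, theta, phi, A, h))                                   # exponential family form
print(np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi)))  # N(mu, sigma^2) density
```

Both lines print the same value, confirming that the natural parameterization above describes the same distribution.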

Bernoulli as Exponential Family

Recall the Bernoulli distribution
$p(x|\mu)=\mu^x(1-\mu)^{1-x},\quad x \in \{0,1\}$
This can be written in exponential family form
$p(x | \mu) = \exp[\log(\mu^x(1-\mu)^{1-x})]$
$\quad=\exp[x \log \mu+(1-x)\log(1-\mu)]$
$\quad=\exp[x \log \mu-x\log(1-\mu)+\log(1-\mu)]$
$\quad=\exp[x \log\frac{\mu}{1-\mu}+\log(1-\mu)]$

This can be identified as being in exponential family form
$h(x)=1$
$\theta=\log(\frac{\mu}{1-\mu})$
$\phi(x)=x$
$A(\theta)=-\log(1-\mu)=\log(1+\exp(\theta))$

The relationship between $\theta$ and $\mu$ is invertible, so that
$\mu=\frac{1}{1+\exp(-\theta)}$

Remark. The relationship between the original Bernoulli parameter $\mu$ and the natural parameter $\theta$ is known as the sigmoid or logistic function. Observe that $\mu \in (0, 1)$ but $\theta \in \mathbb{R}$, and therefore the sigmoid function squeezes a real value into the range $(0, 1)$. This property is useful in machine learning; for example, it is used in logistic regression as well as a nonlinear activation function in neural networks.
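A tiny added sketch of the sigmoid/logit pair, showing that the two maps are inverses of each other:

```python
import numpy as np

def sigmoid(theta):
    """Natural parameter theta -> Bernoulli parameter mu in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-theta))

def logit(mu):
    """Bernoulli parameter mu -> natural parameter theta = log(mu / (1 - mu))."""
    return np.log(mu / (1.0 - mu))

theta = 1.5
mu = sigmoid(theta)
print(mu)           # a value in (0, 1)
print(logit(mu))    # recovers theta = 1.5
```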

Example:

You have written a computer program that sometimes compiles and sometimes does not (although the code does not change). You decide to model the apparent stochasticity (success vs. no success) $x$ of the compiler using a Bernoulli distribution with parameter $\mu$:
$p(x |\mu) = \mu^x(1-\mu)^{1-x}, \quad x \in \{0, 1\}$
Choose a conjugate prior for the Bernoulli likelihood and compute the posterior distribution
$p(\mu | x_1,\ldots, x_N)$.
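A possible sketch of this exercise in code (the simulated outcomes, the true success rate 0.6, and the uniform Beta(1, 1) prior are assumptions for illustration). By the Beta-Bernoulli result above, after observing $x_1,\ldots,x_N$ the posterior is $Beta(\alpha+\sum_n x_n,\ \beta+N-\sum_n x_n)$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated compile outcomes (1 = success); the true rate is unknown in practice
x = rng.binomial(1, 0.6, size=50)
N, s = len(x), x.sum()

# Conjugate Beta(alpha, beta) prior on mu; posterior is Beta(alpha + s, beta + N - s)
alpha, beta = 1.0, 1.0                     # uniform prior on [0, 1]
posterior = stats.beta(alpha + s, beta + N - s)

print("successes / trials  :", s, "/", N)
print("posterior mean of mu:", posterior.mean())
```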

