
4.6 Bernoulli, Binomial, Beta and Poisson Distributions

Bernoulli Distribution

The Bernoulli distribution, named after the Swiss mathematician Jacob Bernoulli, is the discrete probability distribution of a random variable which takes the value 1 with probability $\mu$ and the value 0 with probability $1-\mu$. Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes–no question. Such questions lead to outcomes that are Boolean-valued: a single bit whose value is success/yes/true/one with probability $\mu$ and failure/no/false/zero with probability $1-\mu$. The Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (so $N$ would be 1 for such a binomial distribution).

The Bernoulli distribution is a distribution for a single binary random variable $X$ with state $x \in \{0,1\}$. It is governed by a single continuous parameter $\mu \in  [0, 1] $ that represents the probability of $X = 1$. The Bernoulli distribution $Ber(\mu)$ is defined as

$p(x \mid \mu) = \mu^x(1-\mu)^{1-x}, \quad x \in \{0, 1\}$

$E[X]=\mu$

This is due to the fact that for a Bernoulli distributed random variable $X$ with  $Pr(X=1)=\mu$ and $Pr(X=0)=1-\mu$ we find

$E [X]=\Pr(X=1)\cdot 1+\Pr(X=0)\cdot 0=\mu \cdot 1+ (1-\mu)\cdot 0=\mu$

The variance of $X$ is $V[X]=E[X^2]-(E[X])^2$. Since $X$ takes only the values 0 and 1, we have $X^2=X$ and hence $E[X^2]=E[X]=\mu$, so

$V[X]=\mu-\mu^2=\mu(1-\mu)$

where $E[X]$ and $V[X]$ are the mean and variance of the binary random variable $X$.

An example where the Bernoulli distribution can be used is when we are interested in modeling the probability of “heads” when flipping a coin.
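As a quick numerical check, here is a minimal sketch (using NumPy; the sample size and seed are arbitrary choices) that draws Bernoulli samples and compares the empirical mean and variance against $\mu$ and $\mu(1-\mu)$:

```python
import numpy as np

mu = 0.3                          # assumed probability of heads (X = 1)
rng = np.random.default_rng(0)    # fixed seed for reproducibility

# One million Bernoulli(mu) draws: 1 with probability mu, 0 otherwise
samples = (rng.random(1_000_000) < mu).astype(int)

print("empirical mean    :", samples.mean())   # close to mu = 0.3
print("empirical variance:", samples.var())    # close to mu*(1-mu) = 0.21
```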

Binomial Distribution

The binomial distribution with parameters $N$ and $\mu$ is the discrete probability distribution of the number of successes in a sequence of $N$ independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: success (with probability $\mu$) or failure (with probability $1-\mu$).

The Binomial distribution is a generalization of the Bernoulli distribution to a distribution over integers. In particular, the Binomial can be used to describe the probability of observing $m$ occurrences of $X = 1$ in a set of $N$ samples from a Bernoulli distribution where $p(X = 1) = \mu \in [0, 1]$. The Binomial distribution $Bin(N, \mu)$ is defined as
$p(m|N,\mu)=\binom{N}{m}\mu^m(1-\mu)^{N-m}$,
where $\binom{N}{m}=\frac{N!}{(N-m)!m!}$ is the binomial coefficient, from which the distribution takes its name.
Example:
Suppose a biased coin comes up heads with probability 0.3 when tossed. The probability of seeing exactly 4 heads in 6 tosses is
$p(4|6,0.3)=\binom{6}{4}0.3^4(1-0.3)^{6-4}=0.059535$
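The same number can be reproduced in a couple of lines of Python (a sketch using only the standard library; `math.comb` requires Python 3.8 or later):

```python
import math

def binom_pmf(m, N, mu):
    """p(m | N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)."""
    return math.comb(N, m) * mu**m * (1 - mu)**(N - m)

print(binom_pmf(4, 6, 0.3))   # ~0.059535
```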

A Binomial random variable can be written as the sum $X=X_1+X_2+\cdots+X_N$ of $N$ independent Bernoulli random variables, each with mean $\mu$ and variance $\mu(1-\mu)$. Expectations add, so
$E[X]=E[X_1]+E[X_2]+\cdots+E[X_N]=\mu+\mu+\cdots+\mu=N\mu$
Because the $X_i$ are independent, the variances add as well:
$V[X]=V[X_1]+V[X_2]+\cdots+V[X_N]=N\mu(1-\mu)$
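Since a Binomial draw is just the sum of $N$ Bernoulli trials, both formulas are easy to verify by simulation; the sketch below (with arbitrary parameter choices) does exactly that:

```python
import numpy as np

N, mu = 6, 0.3
rng = np.random.default_rng(0)

# Each row holds N Bernoulli(mu) trials; summing a row gives one Binomial draw
counts = (rng.random((100_000, N)) < mu).sum(axis=1)

print("empirical mean:", counts.mean())   # close to N*mu = 1.8
print("empirical var :", counts.var())    # close to N*mu*(1-mu) = 1.26
```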


An example where the Binomial could be used is if we want to describe the probability of observing $m$ “heads” in $N$ coin-flip experiments, if the probability of observing heads in a single experiment is $\mu$.



Beta Distribution
The beta distribution is a family of continuous probability distributions defined on the interval [0, 1], parameterized by two positive shape parameters, denoted by $\alpha$ and $\beta$, that appear as exponents of the random variable and control the shape of the distribution.

We may wish to model a continuous random variable on a finite interval. The Beta distribution is a distribution over a continuous random variable $\mu \in [0, 1]$, which is often used to represent the probability of some binary event (e.g., the parameter governing the Bernoulli distribution). The Beta distribution $Beta(\alpha,\beta)$ is governed by two parameters $\alpha > 0, \beta > 0$ and is defined as

$p(\mu|\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\mu^{\alpha-1}(1-\mu)^{\beta-1}$
$E[\mu]=\frac{\alpha}{\alpha+\beta}, V[\mu]=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
where $\Gamma(\cdot)$ is the Gamma function, defined as
$\Gamma(t)=\int_0^\infty x^{t-1}\exp(-x)\,dx, \quad t>0$
and satisfying $\Gamma(t+1)=t\Gamma(t)$
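As an illustration, here is a sketch that evaluates the density and the two moments directly from the formulas above, using `math.gamma` for $\Gamma(\cdot)$ (the parameter values $\alpha = 2$, $\beta = 8$ are just an example):

```python
import math

def beta_pdf(mu, a, b):
    """p(mu | alpha, beta) with the Gamma-function normalizing constant."""
    coeff = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coeff * mu**(a - 1) * (1 - mu)**(b - 1)

a, b = 2.0, 8.0
print("density at mu=0.2:", beta_pdf(0.2, a, b))        # ~3.02
print("mean:", a / (a + b))                             # alpha/(alpha+beta) = 0.2
print("variance:", a * b / ((a + b)**2 * (a + b + 1)))  # ~0.0145
```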


Let's ignore the coefficient $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$ for a moment: it is just a normalizing constant that makes the density integrate to 1.
The difference between the Binomial and the Beta is that the former models the number of successes ($m$), while the latter models the probability ($\mu$) of success.
In other words, the probability is a parameter in the Binomial; in the Beta, the probability is a random variable.
You can choose the $\alpha$ and $\beta$ parameters to reflect your prior belief. If you think the probability of success is very high, say 90%, set $\alpha = 90$ and $\beta = 10$. If you think otherwise, set $\beta = 90$ and $\alpha = 10$.

As $\alpha$ becomes larger (more successful events), the peak of the probability distribution shifts towards the right, whereas an increase in $\beta$ moves the distribution towards the left (more failures). Also, the distribution narrows if both $\alpha$ and $\beta$ increase, because we are more certain.
Example: probability of probability
Suppose the probability that someone agrees to go on a date with you follows a Beta distribution with $\alpha = 2$ and $\beta = 8$. What is the probability that your success rate will be greater than 50%?
$P(\mu > 0.5) = 1 - \mathrm{CDF}(0.5) \approx 0.01953$
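That tail probability can be checked numerically; here is a sketch using SciPy's survival function `beta.sf`, which computes $1 - \mathrm{CDF}$:

```python
from scipy.stats import beta

# P(mu > 0.5) for Beta(alpha=2, beta=8)
print(beta.sf(0.5, 2, 8))   # ~0.01953
```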

Dr. Bognar at the University of Iowa built a calculator for the Beta distribution, which I found useful and beautiful. You can experiment with different values of $\alpha$ and $\beta$ and visualize how the shape changes.
Why do we use the Beta distribution?
If we just want a probability distribution to model a probability, any distribution over (0, 1) would work, and creating one is easy: take any function that stays positive and doesn't blow up anywhere between 0 and 1, integrate it from 0 to 1, and divide the function by that result. You now have a probability distribution that can be used to model a probability. Why, then, do we insist on using the Beta distribution rather than an arbitrary one?

The Beta distribution is the conjugate prior for the Bernoulli, binomial, negative binomial and geometric distributions (notice these are all distributions that involve successes and failures) in Bayesian inference.

Computing a posterior using a conjugate prior is very convenient, because it lets you avoid the expensive numerical computation otherwise involved in Bayesian inference.
If we choose the Beta distribution as a prior, we know during the modeling phase that the posterior will also be a Beta distribution. Therefore, after carrying out more experiments (asking more people to go on a date with you), you can compute the posterior simply by adding the number of acceptances and rejections to the existing parameters $\alpha$ and $\beta$ respectively, instead of multiplying the likelihood with the prior distribution.
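A sketch of that update rule, with made-up counts: if the prior is $Beta(\alpha, \beta)$ and we observe $m$ acceptances and $N - m$ rejections, the posterior is $Beta(\alpha + m,\ \beta + N - m)$.

```python
def beta_posterior(alpha, beta, acceptances, rejections):
    """Conjugate update: a Beta(alpha, beta) prior plus Bernoulli/Binomial
    data gives a Beta(alpha + acceptances, beta + rejections) posterior."""
    return alpha + acceptances, beta + rejections

# Prior belief Beta(2, 8); suppose we then observe 3 acceptances, 7 rejections
a_post, b_post = beta_posterior(2, 8, acceptances=3, rejections=7)
print(a_post, b_post)                                  # Beta(5, 15)
print("posterior mean:", a_post / (a_post + b_post))   # 0.25
```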

Poisson Distribution
In probability theory and statistics, the Poisson distribution, named after the French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur with a known constant mean rate and independently of the time since the last event.

For instance, a call center receives an average of 180 calls per hour, 24 hours a day. The calls are independent; receiving one does not change the probability of when the next one will arrive. The number of calls received during any minute has a Poisson probability distribution: the most likely numbers are 2 and 3, but 1 and 4 are also likely, and there is a small probability of it being as low as zero and a very small probability it could be 10.

A discrete random variable $X$ is said to have a Poisson distribution with parameter $\lambda > 0$ if it has a probability mass function given by:
$f(k;\lambda)=\Pr(X=k)=\frac{\lambda^k e^{-\lambda}}{k!}$
where
$k$ is the number of occurrences ($k = 0, 1, 2, \ldots$)
$e$ is Euler's number ($e \approx 2.71828$)
The positive real number $\lambda$ is equal to the expected value of $X$ and also to its variance:
$\lambda=E[X]=V[X]$

The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare. The number of such events that occur during a fixed time interval is, under the right circumstances, a random number with a Poisson distribution.

The equation can be adapted if, instead of the average number of events $\lambda$, we are given a rate $r$ at which events occur. Then $\lambda = rt$, where $r$ is the number of events per unit of time and $t$ is the length of the interval, and
$P(k \text{ events in interval } t)=\frac{(rt)^{k}e^{-rt}}{k!}$
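Returning to the call-center example: 180 calls per hour is a rate of $r = 3$ calls per minute, so the number of calls in one minute is Poisson with $\lambda = rt = 3$. A standard-library sketch that reproduces the probabilities described above:

```python
import math

def poisson_pmf(k, lam):
    """Pr(X = k) = lam^k * exp(-lam) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 180 / 60   # r = 3 calls per minute, t = 1 minute => lambda = 3
for k in range(11):
    print(k, round(poisson_pmf(k, lam), 4))
# k = 2 and k = 3 are the most likely counts (both ~0.224);
# Pr(X = 0) ~ 0.0498 is small and Pr(X = 10) ~ 0.0008 is very small.
```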



