
4.6 Bernoulli, Binomial, Beta and Poisson Distributions

Bernoulli Distribution

The Bernoulli distribution, named after the Swiss mathematician Jacob Bernoulli, is the discrete probability distribution of a random variable that takes the value 1 with probability $\mu$ and the value 0 with probability $1-\mu$. Less formally, it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes–no question. Such questions lead to outcomes that are Boolean-valued: a single bit whose value is success/yes/true/one with probability $\mu$ and failure/no/false/zero with probability $1-\mu$. The Bernoulli distribution is a special case of the binomial distribution in which a single trial is conducted (so $N=1$ for such a binomial distribution).

The Bernoulli distribution is a distribution for a single binary random variable $X$ with state $x \in \{0,1\}$. It is governed by a single continuous parameter $\mu \in [0, 1]$ that represents the probability of $X = 1$. The Bernoulli distribution $Ber(\mu)$ is defined as

$p(x\,|\,\mu) = \mu^x(1-\mu)^{1-x}, \quad x \in \{0, 1\}$

$E[X]=\mu$

This is due to the fact that for a Bernoulli-distributed random variable $X$ with $Pr(X=1)=\mu$ and $Pr(X=0)=1-\mu$ we find

$E [X]=\Pr(X=1)\cdot 1+\Pr(X=0)\cdot 0=\mu \cdot 1+ (1-\mu)\cdot 0=\mu$

The variance of $X$ is $V[X]=E[X^2]-(E[X])^2$. Since $x \in \{0,1\}$, we have $X^2=X$ and hence $E[X^2]=E[X]=\mu$, so

$V[X]=\mu-\mu^2=\mu(1-\mu)$

where $E[X]$ and $V[X]$ are the mean and variance of the binary random variable $X$.

An example where the Bernoulli distribution can be used is when we are interested in modeling the probability of “heads” when flipping a coin.
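As a quick sanity check, here is a minimal Python sketch (standard library only; the helper name bernoulli_pmf and the choice $\mu = 0.7$ are purely illustrative) that evaluates the Bernoulli pmf and verifies $E[X]=\mu$ and $V[X]=\mu(1-\mu)$ by simulation.

```python
import random

def bernoulli_pmf(x, mu):
    """Bernoulli pmf: p(x|mu) = mu^x * (1-mu)^(1-x) for x in {0, 1}."""
    return mu**x * (1 - mu)**(1 - x)

mu = 0.7
print(bernoulli_pmf(1, mu))  # P(X=1) = 0.7
print(bernoulli_pmf(0, mu))  # P(X=0) = 0.3

# Empirical check of E[X] = mu and V[X] = mu(1-mu)
samples = [1 if random.random() < mu else 0 for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # close to 0.7 and 0.21
```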

Binomial Distribution

The binomial distribution with parameters $N$ and $\mu$ is the discrete probability distribution of the number of successes in a sequence of $N$ independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: success (with probability $\mu$) or failure (with probability $1-\mu$).

The Binomial distribution is a generalization of the Bernoulli distribution to a distribution over integers. In particular, the Binomial can be used to describe the probability of observing $m$ occurrences of $X = 1$ in a set of $N$ samples from a Bernoulli distribution where $p(X = 1) = \mu \in [0, 1]$. The Binomial distribution $Bin(N, \mu)$ is defined as
$p(m|N,\mu)=\binom{N}{m}\mu^m(1-\mu)^{N-m}$,
$\binom{N}{m}$ is the binomial coefficient, equal to $\binom{N}{m}=\frac{N!}{(N-m)!\,m!}$,
and hence the name of the distribution.
Example:
Suppose a biased coin comes up heads with probability 0.3 when tossed. The probability of seeing exactly 4 heads in 6 tosses is
$p(4|6,0.3)=\binom{6}{4}0.3^4(1-0.3)^{6-4}=0.059535$

Writing $X = X_1 + X_2 + \cdots + X_N$ as a sum of $N$ independent Bernoulli random variables $X_i$, the mean follows from linearity of expectation:
$E[X]=E[X_1]+E[X_2]+\cdots+E[X_N]=\mu+\mu+\cdots+\mu=N\mu$
Because the $X_i$ are independent, their variances add, and each has variance $\mu(1-\mu)$:
$V[X]=V[X_1]+V[X_2]+\cdots+V[X_N]=N\mu(1-\mu)$


An example where the Binomial could be used is if we want to describe the probability of observing $m$ “heads” in $N$ coin-flip experiments, where the probability of observing heads in a single flip is $\mu$.
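The following Python sketch (standard library only; math.comb requires Python 3.8+, and binomial_pmf is an illustrative name) reproduces the biased-coin example above and checks the moment formulas numerically.

```python
from math import comb

def binomial_pmf(m, N, mu):
    """Binomial pmf: p(m|N, mu) = C(N, m) * mu^m * (1-mu)^(N-m)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

# The biased-coin example: exactly 4 heads in 6 tosses with mu = 0.3
print(binomial_pmf(4, 6, 0.3))  # 0.059535

# Numerical check of E[X] = N*mu and V[X] = N*mu*(1-mu)
N, mu = 6, 0.3
mean = sum(m * binomial_pmf(m, N, mu) for m in range(N + 1))
var = sum((m - mean) ** 2 * binomial_pmf(m, N, mu) for m in range(N + 1))
print(mean, var)  # 1.8 and 1.26
```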



Beta Distribution
The beta distribution is a family of continuous probability distributions defined on the interval $[0, 1]$ and parameterized by two positive shape parameters, denoted $\alpha$ and $\beta$, that appear as exponents of the random variable and control the shape of the distribution.

We may wish to model a continuous random variable on a finite interval. The Beta distribution is a distribution over a continuous random variable $\mu \in [0, 1]$, which is often used to represent the probability of some binary event (e.g., the parameter governing the Bernoulli distribution). The Beta distribution $Beta(\alpha,\beta)$ is governed by two parameters $\alpha > 0$, $\beta > 0$ and is defined as

$p(\mu|\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\mu^{\alpha-1}(1-\mu)^{\beta-1}$
$E[\mu]=\frac{\alpha}{\alpha+\beta}, V[\mu]=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
where $\Gamma(\cdot)$ is the Gamma function, defined as
$\Gamma(t)=\int_0^\infty x^{t-1}\exp(-x)\,dx, \quad t>0$
which satisfies the recurrence $\Gamma(t+1)=t\Gamma(t)$.
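Python's math.gamma implements $\Gamma$, so both properties can be checked directly; a minimal sketch (the value $t = 4.5$ is arbitrary):

```python
from math import gamma

t = 4.5
print(gamma(t + 1), t * gamma(t))  # both ≈ 52.3428, confirming Gamma(t+1) = t*Gamma(t)
print(gamma(5))                    # Gamma(n) = (n-1)!, so Gamma(5) = 4! = 24.0
```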


Let's ignore the coefficient $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$ for a moment; it is just a normalizing constant that makes the function integrate to 1.
The difference between the binomial and the Beta is that the former models the number of successes ($m$), while the latter models the probability ($\mu$) of success.
In other words, in the binomial the probability is a parameter; in the Beta, the probability is itself a random variable.
You can choose the $\alpha$ and $\beta$ parameters to reflect your prior belief. If you think the probability of success is very high, say 90%, set $\alpha$ to 90 and $\beta$ to 10; if you think otherwise, set $\beta$ to 90 and $\alpha$ to 10.

As $\alpha$ becomes larger (more successful events), the peak of the probability distribution shifts towards the right, whereas an increase in $\beta$ moves the distribution towards the left (more failures). Also, the distribution narrows if both $\alpha$ and $\beta$ increase, since we are more certain.
Example: probability of probability
Suppose the probability that someone agrees to go on a date with you follows a Beta distribution with $\alpha = 2$ and $\beta = 8$. What is the probability that your success rate will be greater than 50%?
$P(X>0.5) = 1 - CDF(0.5) \approx 0.01953$
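This value can be reproduced numerically; a sketch assuming SciPy is available:

```python
from scipy.stats import beta

# Belief about the acceptance rate: Beta(alpha=2, beta=8)
dist = beta(2, 8)

print(dist.mean())        # E[mu] = alpha/(alpha+beta) = 0.2
print(1 - dist.cdf(0.5))  # P(mu > 0.5) ≈ 0.01953
```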

Dr. Bognar at the University of Iowa built a calculator for the Beta distribution, which I found useful and beautiful. You can experiment with different values of $\alpha$ and $\beta$ and visualize how the shape changes.
Why do we use the Beta distribution?
If we just want a probability distribution to model a probability, any distribution over $(0,1)$ would work, and creating one is easy: take any function that stays positive and doesn't blow up anywhere between 0 and 1, integrate it from 0 to 1, and divide the function by that result. You now have a probability distribution that can model a probability. Why, then, do we insist on using the Beta distribution rather than an arbitrary one?

The Beta distribution is the conjugate prior for the Bernoulli, binomial, negative binomial and geometric distributions (notice that these are all distributions that involve success and failure) in Bayesian inference.

Computing a posterior using a conjugate prior is very convenient, because you can avoid the expensive numerical computation otherwise involved in Bayesian inference.
If we choose the Beta distribution as a prior, we already know during the modeling phase that the posterior will also be a Beta distribution. Therefore, after carrying out more experiments (asking more people to go on a date with you), you can compute the posterior simply by adding the number of acceptances and rejections to the existing parameters $\alpha$ and $\beta$, respectively, instead of multiplying the likelihood by the prior distribution.
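A minimal sketch of this conjugate update (the helper name update_beta and the observation counts are hypothetical):

```python
def update_beta(alpha, beta, successes, failures):
    """Beta-Bernoulli conjugate update: posterior is Beta(alpha+s, beta+f)."""
    return alpha + successes, beta + failures

# Start from the Beta(2, 8) prior, then observe 3 acceptances and 7 rejections
alpha, beta = update_beta(2, 8, successes=3, failures=7)
print(alpha, beta)             # posterior is Beta(5, 15)
print(alpha / (alpha + beta))  # posterior mean = 0.25
```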

Poisson Distribution
In probability theory and statistics, the Poisson distribution, named after the French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur with a known constant mean rate and independently of the time since the last event.

For instance, suppose a call center receives an average of 180 calls per hour, 24 hours a day. The calls are independent; receiving one does not change the probability of when the next one will arrive. The number of calls received during any minute then has a Poisson distribution with mean 3: the most likely counts are 2 and 3, but 1 and 4 are also likely, there is a small probability of the count being as low as zero, and a very small probability it could be 10.

A discrete random variable $X$ is said to have a Poisson distribution with parameter $\lambda > 0$ if it has a probability mass function given by:
$f(k;\lambda)=Pr(X=k)=\frac{\lambda^k e^{-\lambda}}{k!}$
where
$k$ is the number of occurrences ($k=0,1,2,\ldots$)
$e$ is Euler's number ($e \approx 2.71828$)
The positive real number $\lambda$ is equal to the expected value of $X$ and also to its variance
$\lambda=E(X)=Var(X)$

The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare. The number of such events that occur during a fixed time interval is, under the right circumstances, a random number with a Poisson distribution.

The equation can be adapted if, instead of the average number of events $\lambda$, we are given a rate $r$ at which events occur ($r$ events per unit of time). For an interval of length $t$, $\lambda = rt$, and
$P(k \text{ events in interval } t)=\frac{(rt)^{k}e^{-rt}}{k!}$
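Putting this together for the call-center example ($r = 3$ calls per minute, $t = 1$ minute, so $\lambda = rt = 3$), a standard-library Python sketch:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Poisson pmf: Pr(X = k) = lam^k * e^(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)

lam = 3  # r = 3 calls per minute, t = 1 minute
for k in range(6):
    print(k, round(poisson_pmf(k, lam), 4))
# k = 2 and k = 3 are the most likely counts, each with probability ≈ 0.224

# Numerical check that E[X] = Var[X] = lambda
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in range(100))
print(mean, var)  # both ≈ 3
```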



