
4.3 Sum Rule, Product Rule, and Bayes’ Theorem

We think of probability theory as an extension of logical reasoning; probabilistic modeling provides a principled foundation for designing machine learning methods. Once we have defined probability distributions corresponding to the uncertainties of the data and our problem, it turns out that there are only two fundamental rules: the sum rule and the product rule.

Let $p(x,y)$ be the joint distribution of the two random variables $x, y$. The distributions $p(x)$ and $p(y)$ are the corresponding marginal distributions, and $p(y |x)$ is the conditional distribution of $y$ given $x$.

Sum Rule

The addition rule states that the probability that at least one of two events occurs is the sum of their individual probabilities minus the probability that both occur.

The addition rule is: $P(A∪B)=P(A)+P(B)−P(A∩B)$

Suppose $A$ and $B$ are disjoint, i.e., their intersection is empty. Then the probability of their intersection is zero; in symbols, $P(A∩B)=0$. The addition rule then simplifies to:

$P(A∪B)=P(A)+P(B)$  when   $A∩B=∅$


For two random variables with joint distribution $p(x,y)$, the sum rule states
$p(x)=\begin{cases}\sum_{y\in\mathbb{Y}} p(x,y) & \text{if } y \text{ is discrete}\\ \int_{\mathbb{Y}} p(x,y)\,dy & \text{if } y \text{ is continuous}\end{cases}$
where $\mathbb{Y}$ are the states of the target space of random variable $Y$. This means that we sum out (or integrate out) the set of states $y$ of the random variable $Y$. The sum rule is also known as the marginalization property. The sum rule relates the joint distribution to a marginal distribution. In general, when the joint distribution contains more than two random variables, the sum rule can be applied to any subset of the random variables, resulting in a marginal distribution of potentially more than one random variable. More concretely, if $x = [x_1,\ldots, x_D]^T$, we obtain the marginal
$p(x_i)=\int p(x_1,\ldots,x_D)\, dx_{\setminus i}$
by repeated application of the sum rule, where we integrate/sum out all random variables except $x_i$; the subscript $\setminus i$ reads "all except $i$."
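As a quick illustration (not part of the original notes), the following Python sketch marginalizes a small discrete joint distribution stored as a NumPy array; the numerical values in `p_xy` are made up for the example.

```python
import numpy as np

# Illustrative discrete joint distribution p(x, y): rows index x, columns index y.
# The entries are made up and sum to 1.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

# Sum rule (marginalization): sum out the variable we are not interested in.
p_x = p_xy.sum(axis=1)   # marginal p(x): approximately [0.4, 0.6]
p_y = p_xy.sum(axis=0)   # marginal p(y): approximately [0.35, 0.35, 0.30]

print(p_x, p_y)
```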

Product Rule
The multiplication rule states that the probability that $A$ and $B$ both occur is equal to the probability that $B$ occurs times the conditional probability that $A$ occurs given that $B$ occurs.

The multiplication rule can be written as: 
$P(A∩B)=P(B)⋅P(A|B)$

Switching the roles of $A$ and $B$, we can also write the rule as:
$P(A∩B)=P(A)⋅P(B|A)$

Note that when A and B are independent, we have that $P(B|A)=P(B)$, so the formula becomes 
$P(A∩B)=P(A)\cdot P(B)$

The product rule relates the joint distribution to the conditional distribution via
$p(x,y)=p(y|x)p(x)$

The product rule can be interpreted as saying that every joint distribution of two random variables can be factorized into (written as a product of) two other distributions. The two factors are the marginal distribution of the first random variable, $p(x)$, and the conditional distribution of the second random variable given the first, $p(y |x)$. Since the ordering of random variables is arbitrary in $p(x, y)$, the product rule also implies $p(x,y) = p(x | y)p(y)$. To be precise, for discrete random variables the product rule is expressed in terms of probability mass functions; for continuous random variables, it is expressed in terms of probability density functions.
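Continuing the same illustrative joint table from the sum-rule sketch (the values are assumed, not from the notes), a short check shows numerically that the joint factorizes as $p(y|x)p(x)$:

```python
import numpy as np

# Same illustrative joint table as in the marginalization sketch.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

p_x = p_xy.sum(axis=1)              # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]   # conditional p(y | x): each row sums to 1

# Product rule: the joint factorizes as p(y | x) * p(x).
print(np.allclose(p_y_given_x * p_x[:, None], p_xy))   # True
```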

In machine learning and Bayesian statistics, we are often interested in making inferences about unobserved (latent) random variables given that we have observed other random variables. Let us assume we have some prior knowledge $p(x)$ about an unobserved random variable $x$ and some relationship $p(y | x)$ between $x$ and a second random variable $y$, which we can observe. If we observe $y$, we can use Bayes’ theorem to draw some conclusions about $x$ given the observed values of $y$. Bayes’ theorem (also called Bayes’ rule or Bayes’ law) is a direct consequence of the product rule:

$p(x|y)=\frac{p(y|x)\,p(x)}{p(y)}$


$p(x)$ is the prior, which encapsulates our subjective prior  knowledge of the unobserved (latent) variable $x$ before observing any data. We can choose any prior that makes sense to us, but it is critical to ensure that the prior has a nonzero pdf (or pmf) on all plausible $x$, even if they are very rare.

The likelihood $p(y | x)$ describes how $x$ and $y$ are related, and in the case of discrete probability distributions, it is the probability of the data $y$ if we were to know the latent variable $x$. Note that the likelihood is not a distribution in $x$, but only in $y$. We call $p(y | x)$ either the “likelihood of x (given y)” or the “probability of y given x” but never the likelihood of $y$.

Bayes’ theorem allows us to invert the relationship between $x$ and $y$ given by the likelihood. Therefore, Bayes’ theorem is sometimes called the probabilistic inverse.


Example-1

You might be interested in finding out a patient’s probability of having liver disease if they are an alcoholic. “Being an alcoholic” is the test (kind of like a litmus test) for liver disease.

A could mean the event “Patient has liver disease.” Past data tells you that 10% of patients entering your clinic have liver disease. P(A) = 0.10.

B could mean the litmus test that “Patient is an alcoholic.” Five percent of the clinic’s patients are alcoholics. P(B) = 0.05.

You might also know that among those patients diagnosed with liver disease, 7% are alcoholics. This is your P(B|A), the probability that a patient is alcoholic, given that they have liver disease, is 7%.

Bayes’ theorem tells you:
P(A|B) = (0.07 * 0.1)/0.05 = 0.14
In other words, if the patient is an alcoholic, their chance of having liver disease is 0.14 (14%). This is an increase from the 10% suggested by past data, but it is still unlikely that any particular patient has liver disease.
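A minimal Python sketch of this arithmetic, using a hypothetical helper `bayes` that simply evaluates $P(A|B)=P(B|A)P(A)/P(B)$:

```python
# Hypothetical helper that just evaluates Bayes' theorem for Example-1.
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# P(liver disease | alcoholic)
print(round(bayes(p_b_given_a=0.07, p_a=0.10, p_b=0.05), 2))   # 0.14
```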

Example-2

What is the probability of there being a fire when you see smoke? You know these facts:
Fires happen 1% of the time
Smoke occurs 10% of the time
Smoke occurs 90% of the time when there is a fire

The first thing we could do is write the probability down in the language of probability theory. Using F for fire and S for smoke, we want $P(F|S)$: given that there is smoke, how likely is it that there is a fire?


This is where Bayes' theorem comes in handy.
$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$
$P(F|S)=\frac{P(S|F)P(F)}{P(S)}$

Since we know all the facts, the probability of smoke given that there is a fire is
$P(S|F)=0.90$
the probability of there being a fire is
$P(F)=0.01$
and the probability of there being smoke is
$P(S)=0.10$
Substituting into the formula,
$P(F|S)=\frac{P(S|F)P(F)}{P(S)}=\frac{0.90\times 0.01}{0.10}=0.09=9\%$

This means that when we see smoke, there is a fire 9% of the time.

Example-3
What is the probability of having the disease given a positive test result? You know these facts:
A test for the disease correctly identifies it 99% of the time (true positive rate)
A test for the disease incorrectly flags it in healthy people 2% of the time (false positive rate)
1% of the population suffers from the disease (D)

Here let T denote a positive test result and D denote having the disease.
The probability of a positive result given that you have the disease (a true positive) is
$P(T|D)=0.99$
The probability of a positive result given that you do not have the disease (a false positive) is
$P(T|\bar{D})=0.02$
The probability of having the disease is
$P(D)=0.01$.
Then the probability of not having the disease is
$P(\bar{D})=1-P(D)=0.99$

$P(T|\bar{D})$ reads 'the probability of a positive result given not having the disease'.

$P(D|T)=\frac{P(T|D)P(D)}{P(T)}$
$P(T)=P(T|D)P(D)+P(T|\bar{D})P(\bar{D})$
$P(D|T)=\frac{P(T|D)P(D)}{P(T|D)P(D)+P(T|\bar{D})P(\bar{D})}$
$P(D|T)=\frac{0.99×0.01}{0.99×0.01+0.02×0.99}=0.33=33\%$
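The same computation in Python, written as a sketch that spells out the law-of-total-probability denominator (the variable names are ours, not from the question):

```python
# Example-3 as a sketch: Bayes' theorem with the law of total probability in the denominator.
p_pos_given_d = 0.99        # P(positive | disease), the true positive rate
p_pos_given_not_d = 0.02    # P(positive | no disease), the false positive rate
p_d = 0.01                  # P(disease)
p_not_d = 1 - p_d           # P(no disease)

p_pos = p_pos_given_d * p_d + p_pos_given_not_d * p_not_d   # evidence P(T)
p_d_given_pos = p_pos_given_d * p_d / p_pos

print(round(p_d_given_pos, 2))   # 0.33
```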

Example-4
Given the following statistics, what is the probability that a woman has cancer if she has a positive mammogram result?

One percent of women over 50 have breast cancer.
Ninety percent of women who have breast cancer test positive on mammograms.
Eight percent of women who do not have breast cancer will nevertheless test positive (false positives).


Step 1: Assign events to A or X. You want to know what a woman’s probability of having cancer is, given a positive mammogram. For this problem, actually having cancer is A and a positive test result is X.
Step 2: List out the parts of the equation (this makes it easier to work the actual equation):
P(A)=0.01
P(~A)=0.99
P(X|A)=0.9
P(X|~A)=0.08
Step 3: Insert the parts into the equation and solve. Note that as this is a medical test, we’re using the same form of the equation as in Example-3:
(0.9 × 0.01) / ((0.9 × 0.01) + (0.08 × 0.99)) ≈ 0.10

The probability of a woman having cancer, given a positive test result, is 10%.

Example-5
There are two bags. The first bag contains four mangos and two apples; the second bag contains four mangos and four apples. We also have a biased coin, which shows “heads” with probability 0.6 and “tails” with probability 0.4. If the coin shows “heads”, we pick a fruit at random from bag 1; otherwise we pick a fruit at random from bag 2. Your friend flips the coin (you cannot see the result), picks a fruit at random from the corresponding bag, and presents you a mango.
What is the probability that the mango was picked from bag 2?
Hint: Use Bayes’ theorem.
We apply Bayes’ theorem and compute the posterior $p(b_2|m)$, the probability that the mango was picked from bag 2.
$p(b_2 |m) =\frac{p(m|b_2)p(b_2)}{p(m)}$
where
the evidence is
$p(m) = p(b_1)p(m|b_1) + p(b_2)p(m| b_2) =\frac{3}{5}\cdot\frac{2}{3}+\frac{2}{5}\cdot\frac{1}{2}=\frac{3}{5}$
the prior is
$p(b_2)=\frac{4}{10}=\frac{2}{5}$
and the likelihood is
$p(m|b_2)=\frac{1}{2}$
Therefore
$p(b_2 |m) =\frac{p(m|b_2)p(b_2)}{p(m)}=\frac{\frac{1}{2}\cdot\frac{2}{5}}{\frac{3}{5}}=\frac{1}{3}$
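As a sanity check (not required by the exercise), a short Monte Carlo simulation of the coin-and-bags experiment should give an estimate close to $1/3$:

```python
import random

# Monte Carlo check of Example-5: estimate p(bag 2 | mango).
bag1 = ['mango'] * 4 + ['apple'] * 2
bag2 = ['mango'] * 4 + ['apple'] * 4

trials, mangos, mangos_from_bag2 = 100_000, 0, 0
for _ in range(trials):
    heads = random.random() < 0.6            # biased coin: P(heads) = 0.6
    fruit = random.choice(bag1 if heads else bag2)
    if fruit == 'mango':
        mangos += 1
        if not heads:                        # tails means the fruit came from bag 2
            mangos_from_bag2 += 1

print(mangos_from_bag2 / mangos)             # close to 1/3
```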


6.Let J and T be independent events, where P(J)=0.4 and P(T)=0.7.
i. Find P(J∩T)
ii. Find P(J∪T)
iii. Find P(J∩T′)

i. $P(J∩T)=P(J)\cdot P(T)=0.4\times 0.7=0.28$
ii. $P(J∪T)=P(J)+P(T)-P(J∩T)=0.4+0.7-0.28=0.82$
iii. $P(J∩T')=P(J)\cdot P(T')=P(J)(1-P(T))=0.4\times 0.3=0.12$

7.Let A and B be events such that P(A)=0.45 , P(B)=0.35 and P(A∪B)=0.5. Find P(A∣B)

P(A∣B)=P(A∩B)/P(B)
P(A∪B)=P(A)+P(B)-P(A∩B)
P(A∩B)=P(A)+P(B)-P(A∪B)=0.45+0.35-0.5=0.3
P(A∣B)=0.3/0.35=0.8571

8.While watching a game of Cricket, you observe someone who is clearly supporting Mumbai Indians. What is the probability that they were actually born within 25KM of Mumbai? Assume that:
• the probability that a randomly selected person is born within 25KM of
Mumbai is 1/20;
• the chance that a person born within 25KMs of Mumbai actually supports MI
is 7/10 ;
• the probability that a person not born within 25KM of Mumbai supports MI
is 1/10.

Let B be the event that the person was born within 25KM of Mumbai, and let S be the event that the person supports Mumbai Indians.
$P(B)=1/20$
$P(S|B)=7/10$
$P(S|\bar{B})=1/10$

Find P(B|S).
$P(B|S)=\frac{P(S|B)P(B)}{P(S|B)P(B)+P(S|\bar{B})P(\bar{B})}$
$P(B|S)=\frac{7/10\times 1/20}{7/10\times 1/20+ 1/10\times 19/20}=\frac{0.035}{0.13}$
$P(B|S)\approx 0.269$

9.Show that if two events A and B are independent, then A and B' are independent

Proof: The events A and B are independent, so, P(A ∩ B) = P(A) P(B).
It is noted that the events A ∩ B and A ∩ B’ are mutually exclusive and together they form the event A.
A = ( A ∩ B) ∪ (A ∩ B').
Also, P(A) = P[(A ∩ B) ∪ (A ∩ B')].
or, P(A) = P(A ∩ B) + P(A ∩ B').
or, P(A) = P(A) P(B) + P(A ∩ B')
or, P(A ∩ B') = P(A) − P(A) P(B) = P(A) (1 – P(B)) = P(A) P(B')

10. What is the expected number of coin flips for getting a head?
Ans:
Let the expected number of coin flips be x. Then we can write an equation for it -
a. If the first flip is a head, then we are done. The probability of this event is 1/2 and the number of coin flips for this event is 1.
b. If the first flip is tails, then we have wasted one flip. Since consecutive flips are independent events, the solution in this case can be recursively framed in terms of x: the probability of this event is 1/2, and the expected number of coin flips from now onwards is x. But we have already wasted one flip, so the total number of flips is x+1.

The expected value x is the sum of the expected values of these two cases. Using the rule of linearity of the expectation and the definition of Expected value, we get

x = (1/2)(1) + (1/2) (1+x)
Solving, we get x = 2.

Thus the expected number of coin flips for getting a head is 2.
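A quick simulation sketch (illustrative only) that estimates this expectation empirically:

```python
import random

# Estimate the expected number of flips of a fair coin until the first head.
def flips_until_head():
    flips = 0
    while True:
        flips += 1
        if random.random() < 0.5:   # head
            return flips

trials = 100_000
print(sum(flips_until_head() for _ in range(trials)) / trials)   # close to 2
```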

11.What is the expected number of coin flips for getting two consecutive heads?

Let the expected number of coin flips be x. The case analysis goes as follows:
a. If the first flip is tails, then we have wasted one flip. The probability of this event is 1/2 and the expected total number of flips required is x+1.
b. If the first flip is heads and the second flip is tails, then we have wasted two flips. The probability of this event is 1/4 and the expected total number of flips required is x+2.
c. If the first flip is heads and the second flip is also heads, then we are done. The probability of this event is 1/4 and the total number of flips required is 2.

Adding, the equation that we get is -
x = (1/2)(x+1) + (1/4)(x+2) + (1/4)2

Solving, we get x = 6.

Thus, the expected number of coin flips for getting two consecutive heads is 6.

12. (Generalization) What is the expected number of coin flips for getting N consecutive heads, given N?

Let the expected number of coin flips be x. Based on the previous exercises, we can break the whole case analysis into two basic parts:

a) If the first tail in the experiment occurs on the 1st, 2nd, 3rd, ..., or N'th flip, then we have to start all over again.
b) Otherwise (the first N flips are all heads), we are done.

For the 1st flip as tail, the part of the equation is $(1/2)(x+1)$
For the 2nd flip as tail, the part of the equation is $(1/4)(x+2)$
...
For the k'th flip as tail, the part of the equation is $(1/(2^k))(x+k)$
...
For the N'th flip as tail, the part of the equation is $(1/(2^N))(x+N)$
The part of the equation corresponding to case (b) is $(1/(2^N))(N)$

Adding,

$x = (1/2)(x+1) + (1/4)(x+2) + ... + (1/(2^k))(x+k) + .. + (1/(2^N))(x+N) + (1/(2^N))(N)$

Solving this equation is left as an exercise to the reader. The entire equation can be very easily reduced to the following form:

$x = 2^{N+1}-2$

Thus, the expected number of coin flips for getting N consecutive heads is $2^{N+1} - 2$.
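A small simulation sketch (illustrative, with $N=3$ chosen arbitrarily) comparing the empirical mean against the closed form $2^{N+1}-2$:

```python
import random

# Compare the simulated mean number of flips until N consecutive heads
# with the closed form 2**(N + 1) - 2 (here N = 3, so the exact value is 14).
def flips_until_n_heads(n):
    flips, streak = 0, 0
    while streak < n:
        flips += 1
        streak = streak + 1 if random.random() < 0.5 else 0
    return flips

N, trials = 3, 100_000
estimate = sum(flips_until_n_heads(N) for _ in range(trials)) / trials
print(estimate, 2 ** (N + 1) - 2)
```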
13.A biased coin (with probability of obtaining a head equal to p > 0) is tossed repeatedly and independently until the first head is observed. Compute the probability that the first head appears at an even numbered toss.

Let $H_1$ be the event that a head appears on the first toss, and let $E$ be the event that the first head appears on an even-numbered toss.
We want $P(E)$.
Using the Theorem of Total Probability:
$P(E|H_1)=0$ (given $H_1$, that a head appears on the first toss, $E$ cannot occur);
$P(E|H_1')=P(E')=1-P(E)$
(given that a head does not appear on the first toss, the required conditional probability is merely the probability that the sequence concludes after a further odd number of tosses, that is, the probability of $E'$). Hence $P(E)$ satisfies

$P(E)=P(E|H_1)P(H_1)+P(E|H_1')P(H_1')$
$P(E)=0 \times p +(1-P(E))(1-p)$
$P(E)=(1-p)-P(E)(1-p)$
$P(E)+P(E)-pP(E)=(1-p)$
$P(E)(2-p)=(1-p)$
$P(E)=\frac{(1-p)}{(2-p)}$

An easier way to see this

Let $p$ be the probability of getting a head and $q=1-p$ be the probability of not getting a head.
The first head can appear on toss 2, 4, 6, and so on.
Thus, the probability of getting the first head on an even-numbered toss is the probability of getting the first head on toss 2, 4, 6, 8, etc.:

$P(E)=qp+q^3p+q^5p+q^7p+.....$
$P(E)=\frac{qp}{1-q^2}$
$P(E)=\frac{p(1-p)}{1-(1-p)^2}$
$P(E)=\frac{p(1-p)}{1-(1-2p+p^2)}=\frac{p(1-p)}{2p-p^2}$
$P(E)=\frac{p(1-p)}{p(2-p)}$
$P(E)=\frac{1-p}{2-p}$
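As a numerical check (not part of the original solution), a simulation with an arbitrarily chosen $p=0.3$ can be compared against $(1-p)/(2-p)$:

```python
import random

# Estimate P(first head on an even-numbered toss) and compare with (1 - p) / (2 - p).
p = 0.3                                    # arbitrary probability of a head
trials, even_first_head = 100_000, 0
for _ in range(trials):
    toss = 0
    while True:
        toss += 1
        if random.random() < p:            # first head appears on this toss
            break
    if toss % 2 == 0:
        even_first_head += 1

print(even_first_head / trials, (1 - p) / (2 - p))   # both close to 0.412
```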

14.Two players A and B are competing at a trivia quiz game involving a series of questions. On any individual question, the probabilities that A and B give the correct answer are p and q respectively, for all questions, with outcomes for different questions being independent. The game finishes when a player wins by answering a question correctly. Compute the probability that A wins if
i. A answers the first question,
ii. B answers the first question.

Let the sample space $S$ be all possible infinite sequences of answers.

Let $A$ be the event that A answers the first question, $F$ the event that the game ends after the first question, and $W$ the event that A wins.

We want $P(W|A)$ and $P(W|\bar{A})$.

Using the Theorem of Total Probability with the partition $\{F,\bar{F}\}$,

$P(W|A) = P(W|A \cap F)P(F|A) + P(W|A \cap \bar{F})P(\bar{F}|A)$

Now, clearly $P(F|A) = p$, $P(\bar{F}|A) = 1-p$ and $P(W|A\cap F)=1$, while $P(W|A\cap \bar{F})=P(W|\bar{A})$ (if A answers first and misses, the game effectively restarts with B answering first), so that

$P(W|A) = 1\cdot p + P(W|\bar{A})(1-p) = p + P(W|\bar{A})(1-p)$

Similarly,

$P(W|\bar{A}) = P(W|\bar{A} \cap F)P(F|\bar{A}) + P(W|\bar{A} \cap \bar{F})P(\bar{F}|\bar{A})$

We have $P(F|\bar{A}) = q$ and $P(\bar{F}|\bar{A}) = 1-q$, but $P(W|\bar{A}\cap F)=0$. Finally, $P(W|\bar{A}\cap \bar{F})=P(W|A)$, so that

$P(W|\bar{A}) = 0\cdot q + P(W|A)(1-q) = P(W|A)(1-q)$

Solutions

i. Substituting the second equation into the first,
$P(W|A) = p + P(W|A)(1-q)(1-p)$
$P(W|A)\left(1-(1-p)(1-q)\right) = p$
$P(W|A) = \frac{p}{1-(1-p)(1-q)}$

Answer: $P(W|A)=\frac{p}{1-(1-p)(1-q)}$

ii. Then
$P(W|\bar{A}) = P(W|A)(1-q) = \frac{p(1-q)}{1-(1-p)(1-q)}$

Answer: $P(W|\bar{A})=\frac{p(1-q)}{1-(1-p)(1-q)}$
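A small helper (our own sketch; the values $p=0.6$ and $q=0.5$ are arbitrary examples) that evaluates the two closed-form answers:

```python
# Evaluate the closed-form answers of question 14 for arbitrary example values of p and q.
def p_a_wins(p, q, a_answers_first=True):
    """Probability that A wins the quiz game."""
    base = p / (1 - (1 - p) * (1 - q))
    return base if a_answers_first else base * (1 - q)

p, q = 0.6, 0.5                                       # assumed example values
print(round(p_a_wins(p, q, a_answers_first=True), 3))   # 0.75  (A answers the first question)
print(round(p_a_wins(p, q, a_answers_first=False), 3))  # 0.375 (B answers the first question)
```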

15.You roll a fair die twice. Let the random variable X be the product of the outcomes of the two rolls. What is the probability mass function of X? What are the expected value and the standard deviation of X?


The sample space of equally likely outcomes is
11 21 31 41 51 61
12 22 32 42 52 62
13 23 33 43 53 63
14 24 34 44 54 64
15 25 35 45 55 65
16 26 36 46 56 66
The possible products x, their counts among the 36 outcomes, and the pmf P(X=x) = count/36 are:
x:      1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 30, 36
count:  1, 2, 2, 3, 2, 4, 2, 1, 2, 4, 2, 1, 2, 2, 2, 1, 2, 1
P(X=x): 1/36, 2/36, 2/36, 3/36, 2/36, 4/36, 2/36, 1/36, 2/36, 4/36, 2/36, 1/36, 2/36, 2/36, 2/36, 1/36, 2/36, 1/36

$E(X)=1\cdot\tfrac{1}{36}+2\cdot\tfrac{2}{36}+3\cdot\tfrac{2}{36}+4\cdot\tfrac{3}{36}+5\cdot\tfrac{2}{36}+6\cdot\tfrac{4}{36}
\\+8\cdot\tfrac{2}{36}+9\cdot\tfrac{1}{36}+10\cdot\tfrac{2}{36}+12\cdot\tfrac{4}{36}+15\cdot\tfrac{2}{36}+16\cdot\tfrac{1}{36}
\\+18\cdot\tfrac{2}{36}+20\cdot\tfrac{2}{36}+24\cdot\tfrac{2}{36}+25\cdot\tfrac{1}{36}+30\cdot\tfrac{2}{36}+36\cdot\tfrac{1}{36}=\frac{441}{36}=12.25$
$V(X)=E(X^2)-[E(X)]^2=\frac{8281}{36}-(12.25)^2\approx 79.96$
$\sigma=\sqrt{V(X)}\approx 8.94$
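The pmf, mean, and standard deviation can also be computed directly in Python; this sketch enumerates all 36 outcomes (our own check, not part of the original solution):

```python
from collections import Counter
from math import sqrt

# Enumerate all 36 equally likely outcomes and build the pmf of X = product of the two rolls.
counts = Counter(i * j for i in range(1, 7) for j in range(1, 7))
pmf = {x: c / 36 for x, c in sorted(counts.items())}

mean = sum(x * p for x, p in pmf.items())                     # E(X)
variance = sum(x * x * p for x, p in pmf.items()) - mean**2   # E(X^2) - E(X)^2

print(pmf)                        # e.g. P(X=1)=1/36, P(X=2)=2/36, P(X=4)=3/36, ...
print(mean, sqrt(variance))       # E(X) = 12.25, sigma ≈ 8.94
```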

Differentiate between independent events and mutually exclusive events in probability. Let $A$ and $B$ be two independent events, where $P(A) = 0.4$, $P(B) = p$ and $P(A∪B) = 0.7$. Find the value of $p$. (university question)

Two events are mutually exclusive when they cannot occur at the same time, whereas two events are independent when the occurrence of one does not affect the probability of the other.

Since $A$ and $B$ are independent, $P(A \cap B)= P(A)\cdot P(B)=0.4p$
$P(A \cup B)=P(A)+P(B)-P(A \cap B)$
$0.7=0.4 + p-0.4p$
$0.6p=0.3$
$p=0.5$

Since machine learning algorithms are implemented on a computer, the mathematical formulations are expressed as numerical optimization methods.Training a machine learning model often boils down to finding a good set of parameters. The notion of “good” is determined by the objective function or the probabilistic model. Given an objective function, finding the best value is done using optimization algorithms. There are two main branches of continuous optimization constrained and unconstrained. By convention, most objective functions in machine learning are intended to be minimized, that is, the best value is the minimum value. Intuitively finding the best value is like finding the valleys of the objective function, and the gradients point us uphill. The idea is to move downhill (opposite to the gradient) and hope to find the deepest point. For unconstrained optimization, this is the only concept we need,but there are several design choices. For constrained optimization, we need to intr