
4.4 Summary Statistics and Independence

We are often interested in summarizing sets of random variables and comparing pairs of random variables. A statistic of a random variable is a deterministic function of that random variable. The summary statistics of a distribution provide one useful view of how a random variable behaves, and, as the name suggests, provide numbers that summarize and characterize the distribution.

The mean and the variance are two well-known summary statistics. There are two ways to compare a pair of random variables: first, by asking whether the two random variables are statistically independent, and second, by computing an inner product between them.

Means and Covariances
Mean and (co)variance are often useful to describe properties of probability distributions (expected values and spread). There is a useful family of distributions (called the exponential family), where the statistics of the random variable capture all possible information.

The concept of the expected value is central to machine learning, and the foundational concepts of probability itself can be derived from the expected value.

Expected Value
The expected value of a function $g:\mathbb{R} \to \mathbb{R}$ of a univariate continuous random variable $X \sim p(x)$ is given by

$E_X[g(x)]=\int_{\mathcal{X}} g(x)p(x)\,dx$

Correspondingly, the expected value of a function $g$ of a discrete random variable $X \sim p(x)$ is given by

$E_X[g(x)]=\sum_{x \in \mathcal{X}}g(x)p(x)$

where $\mathcal{X}$ is the set of possible outcomes (the target space) of the random variable $X$.
We consider multivariate random variables $X$ as a finite vector of univariate random variables $[X_1, \ldots,X_D]^T$. For multivariate random variables, we define the expected value elementwise
$E_X[g(x)]=\begin{bmatrix}E_{X_1}[g(x_1)]\\
\vdots \\
E_{X_D}[g(x_D)]
\end{bmatrix} \in \mathbb{R}^D$

where the subscript $E_{X_d}$ indicates that we are taking the expected value with respect to the $d$th element of the vector $x$. 
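As a quick numerical illustration, the discrete expected value can be computed directly from the outcomes and their probabilities. The sketch below uses a hypothetical fair-die example (not from the text) to evaluate $E_X[g(x)]=\sum_x g(x)p(x)$ with NumPy.

```python
import numpy as np

# Hypothetical example: a fair six-sided die, so X takes values 1..6 with p(x) = 1/6
x = np.arange(1, 7)            # possible outcomes (the target space)
p = np.full(6, 1 / 6)          # probability of each outcome

g = lambda x: x ** 2           # some function g of the random variable

# Discrete expected value: E_X[g(x)] = sum over x of g(x) p(x)
expected_g = np.sum(g(x) * p)
print(expected_g)              # 91/6 ≈ 15.17
```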

Mean, Median and Mode
The mean of a random variable $X$ with states $x \in \mathbb{R}^D$ is an average and is defined as
$E_X[x]=\begin{bmatrix}E_{X_1}[x_1]\\
\vdots \\
E_{X_D}[x_D]
\end{bmatrix} \in \mathbb{R}^D$

where
$E_{X_d}[x_d]=\int_{\mathcal{X}} x_d\,p(x_d)dx_d$
if $X$ is a continuous random variable, and

$E_{X_d}[x_d]=\sum_{x_i \in \mathcal{X}}x_i\,p(x_d=x_i)$
if $X$ is a discrete random variable, for $d = 1, \ldots, D$.

In one dimension, there are two other intuitive notions of “average”: the median and the mode. The median is the “middle” value if we sort the values, i.e., 50% of the values are greater than the median and 50% are smaller than the median. This idea can be generalized to continuous values by considering the value where the cdf is 0.5. For distributions that are asymmetric or have long tails, the median provides an estimate of a typical value that is closer to human intuition than the mean. Furthermore, the median is more robust to outliers than the mean. The generalization of the median to higher dimensions is non-trivial, as there is no obvious way to “sort” in more than one dimension.

The mode is the most frequently occurring value. For a discrete random variable, the mode is defined as the value of $x$ having the highest frequency of occurrence. For a continuous random variable, the mode is defined as a peak in the density $p(x)$. A particular density $p(x)$ may have more than one mode, and furthermore there may be a very large number of modes in high-dimensional distributions. Therefore, finding all the modes of a distribution can be computationally challenging.
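A minimal sketch of the three notions of “average” on a small, made-up sample (NumPy only; the mode here is found by counting unique values):

```python
import numpy as np

# Hypothetical skewed sample with an outlier
data = np.array([1, 2, 2, 2, 3, 4, 5, 40])

mean = np.mean(data)                     # pulled towards the outlier 40
median = np.median(data)                 # middle value, robust to the outlier

# Mode: the most frequently occurring value
values, counts = np.unique(data, return_counts=True)
mode = values[np.argmax(counts)]

print(mean, median, mode)                # 7.375 2.5 2
```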

Covariance, Variance and Standard Deviation
The covariance between two univariate random variables $X, Y \in \mathbb{R}$ is given by the expected product of their deviations from their respective means, i.e.,
$Cov[x,y]=E[(x-E[x])(y-E[y])]$
By using the linearity of expectation, this expression can be rewritten as the expected value of the product minus the product of the expected values, i.e.,
$Cov[x,y]=E[xy]-E[x]E[y]$

The covariance of a variable with itself, $Cov[x,x]$, is called the variance and is denoted by $V_X[x]$. The square root of the variance is called the standard deviation and is often denoted by $\sigma(x)$. The notion of covariance can be generalized to multivariate random variables.
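Before moving to the multivariate case, a small numerical sketch (with a made-up joint sample) checking that the two covariance formulas agree; the $1/N$ version is computed by hand since np.cov defaults to the unbiased $1/(N-1)$ estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated sample: y depends on x plus noise
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)

# Definition: Cov[x, y] = E[(x - E[x])(y - E[y])]
cov_def = np.mean((x - x.mean()) * (y - y.mean()))

# Raw-score form: Cov[x, y] = E[xy] - E[x]E[y]
cov_raw = np.mean(x * y) - x.mean() * y.mean()

print(cov_def, cov_raw)   # both ≈ 2.0 for this construction
```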

Covariance (Multivariate)
If we consider two multivariate random variables $X$ and $Y$ with states $x \in \mathbb{R}^D$ and $y \in \mathbb{R}^E$ respectively, the covariance between $X$ and $Y$ is defined as
$Cov[x, y] = E[xy^T] - E[x]E[y]^T = Cov[y, x]^T \in  \mathbb{R}^{D \times E} $
For a multivariate random variable, the variance describes the relation between individual dimensions of the random variable.

Variance (Multivariate)
The variance of a multivariate random variable $X$ with states $x \in \mathbb{R}^D$ and a mean vector $\mu \in \mathbb{R}^D$ is defined as

$V_X[x] = Cov_X[x,x]$
$\quad=E_X[(x -\mu)(x -\mu)^T] = E_X[xx^T] - E_X[x]E_X[x]^T$
$\quad=\begin{bmatrix}
Cov[x_1,x_1] &Cov[x_1,x_2] &\ldots & Cov[x_1,x_D]\\
Cov[x_2,x_1]& Cov[x_2,x_2] &\ldots &Cov[x_2,x_D] \\
\vdots & \vdots & \ddots & \vdots\\
Cov[x_D,x_1]& Cov[x_D,x_2] & \ldots & Cov[x_D,x_D]
\end{bmatrix}$
The $D \times D$ matrix above is called the covariance matrix of the multivariate random variable $X$. The covariance matrix is symmetric and positive semidefinite and tells us something about the spread of the data. On its diagonal, the covariance matrix contains the variances of the marginals. The off-diagonal entries are the cross-covariance terms $Cov[x_i, x_j]$ for $i, j = 1,\ldots,D$, $i \ne j$.
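A minimal sketch, assuming a hypothetical 2-dimensional random variable, of building the covariance matrix from a sample with the $1/N$ normalization:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D sample: N observations of x in R^2, stored row-wise
N = 5_000
x1 = rng.normal(size=N)
x2 = 0.5 * x1 + rng.normal(scale=0.5, size=N)
X = np.column_stack([x1, x2])          # shape (N, 2)

mu = X.mean(axis=0)                    # mean vector in R^2
centered = X - mu

# Covariance matrix: E[(x - mu)(x - mu)^T], estimated with 1/N
Sigma = centered.T @ centered / N

print(Sigma)                           # diagonal: variances; off-diagonal: Cov[x1, x2]
print(np.allclose(Sigma, Sigma.T))     # symmetric: True
```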

When we want to compare the covariances between different pairs of random variables, it turns out that the variance of each random variable affects the value of the covariance. The normalized version of covariance is called the correlation.

Correlation
The correlation between two random variables $X,Y$ is given by

$Corr[x,y]=\frac{Cov[x,y]}{\sqrt{V[x]V[y]}}\in [-1,1]$

The correlation matrix is the covariance matrix of standardized random variables, $x/\sigma(x)$. In other words, each random variable is divided by its standard deviation (the square root of the variance) in the correlation matrix.
The covariance (and correlation) indicate how two random variables are related. Positive correlation $Corr[x, y]$ means that when $x$ grows, $y$ is also expected to grow. Negative correlation means that as $x$ increases, $y$ decreases.
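A brief sketch (again on made-up data) showing that the correlation is the covariance of the standardized variables, and that it matches NumPy's np.corrcoef:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical negatively related pair
x = rng.normal(size=10_000)
y = -0.8 * x + rng.normal(scale=0.6, size=10_000)

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr = cov_xy / (x.std() * y.std())         # Cov[x, y] / sqrt(V[x] V[y])

print(corr)                                 # ≈ -0.8 for this construction
print(np.corrcoef(x, y)[0, 1])              # same value (up to sampling noise)
```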


Empirical Mean and Covariance
The definitions in the previous sections are often also called the population mean and covariance, as they refer to the true statistics of the population. In machine learning, we need to learn from empirical observations of data.

Consider a random variable $X$. There are two conceptual steps to go from population statistics to the realization of empirical statistics. First, we use the fact that we have a finite dataset (of size $N$) to construct an empirical statistic that is a function of a finite number of identical random variables,
$X_1,\ldots,X_N$.

Second, we observe the data, that is, we look at the realization $x_1,\ldots, x_N$ of each of the random variables and apply the empirical statistic. Specifically, given a particular dataset we can obtain an estimate of the mean, which is called the empirical mean or sample mean. The same holds for the empirical covariance.

 The empirical mean vector is the arithmetic average of the observations for each variable, and it is defined as

$\bar{x}=\frac{1}{N}\sum_{n=1}^N x_n$ where $x_n \in \mathbb{R}^D$

Similar to the empirical mean, the empirical covariance matrix is a $D\times D$ matrix
$\Sigma=\frac{1}{N}\sum_{n=1}^N (x_n-\bar{x})(x_n-\bar{x})^T$
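For completeness, a sketch of the corresponding NumPy one-liners on a hypothetical dataset; np.cov with bias=True gives the $1/N$ normalization used above (the default is $1/(N-1)$):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1_000, 3))            # hypothetical dataset: N = 1000 points in R^3

x_bar = X.mean(axis=0)                     # empirical mean vector
Sigma = np.cov(X, rowvar=False, bias=True) # empirical covariance, 1/N normalization

# Manual version for comparison
Sigma_manual = (X - x_bar).T @ (X - x_bar) / len(X)
print(np.allclose(Sigma, Sigma_manual))    # True
```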

The standard definition of variance is the expectation of the squared deviation of a random variable $X$ from its expected value $\mu$, i.e.,
$V_X[x] := E_X[(x -\mu)^2] $

The variance as expressed above is the mean of a new random variable $Z := (X - \mu)^2$.
When estimating the variance empirically, we need to resort to a two-pass algorithm: one pass through the data to calculate the mean $\mu$, and then a second pass using this estimate $\mu$ to calculate the variance. It turns out that we can avoid two passes by rearranging the terms. The formula can be converted to the so-called raw-score formula for variance:

$V_X[x]=E_X[x^2]-(E_X[x])^2$

It can be calculated empirically in one pass through the data, since we can accumulate $x_i$ (to calculate the mean) and $x_i^2$ simultaneously. The raw-score version of the variance can be useful in machine learning, e.g., when deriving the bias–variance decomposition.
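A minimal one-pass sketch of the raw-score formula, accumulating $\sum x_i$ and $\sum x_i^2$ in a single loop over a hypothetical data stream (note that this form can be numerically unstable when the mean is large relative to the spread):

```python
import numpy as np

rng = np.random.default_rng(4)
stream = rng.normal(loc=5.0, scale=2.0, size=100_000)   # hypothetical data stream

n, s, s2 = 0, 0.0, 0.0
for xi in stream:            # single pass: accumulate sum and sum of squares
    n += 1
    s += xi
    s2 += xi * xi

mean = s / n
var = s2 / n - mean ** 2     # raw-score formula: E[x^2] - (E[x])^2

print(mean, var)             # ≈ 5.0 and ≈ 4.0
print(np.var(stream))        # two-pass reference value for comparison
```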

Sums of Random Variables

Consider two random variables $X,Y$ with states $x, y \in \mathbb{R}^D$. Then:
$E[x + y] = E[x] + E[y] $
$E[x - y] = E[x] - E[y] $
$V[x + y] = V[x] + V[y] + Cov[x, y] + Cov[y,x]$ 
$V[x - y] = V[x] + V[y] - Cov[x,y] - Cov[y, x] $
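A quick simulation sketch (with correlated, made-up variables) confirming the identity $V[x + y] = V[x] + V[y] + Cov[x, y] + Cov[y, x]$:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = 0.3 * x + rng.normal(size=100_000)    # correlated with x

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]

print(lhs, rhs)    # the two values agree (up to floating-point rounding)
```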
 
Statistical Independence

Two random variables $X,Y$ are statistically independent if and only if
$p(x, y) = p(x)p(y) $

Intuitively, two random variables $X$ and $Y$ are independent if the value of $y$ (once known) does not add any additional information about $x$ (and vice versa). If $X,Y$ are (statistically) independent, then
$p(y |x) = p(y)$
$p(x |y) = p(x)$
$V_{X,Y} [x + y] = V_X[x] + V_Y [y]$
$Cov_{X,Y} [x, y] = 0$
The converse of the last point does not hold in general, i.e., two random variables can have covariance zero and still not be statistically independent.
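A classic counterexample, sketched numerically: take $X$ symmetric around zero and $Y = X^2$. Then $Y$ is fully determined by $X$ (clearly dependent), yet their covariance is zero.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=1_000_000)   # symmetric around 0
y = x ** 2                       # fully determined by x, hence not independent

# Cov[x, y] = E[x^3] - E[x] E[x^2] = 0 for a distribution symmetric about 0
print(np.cov(x, y, bias=True)[0, 1])    # ≈ 0, despite the strong dependence
```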

In machine learning, we often consider problems that can be modeled as independent and identically distributed (i.i.d.) random variables $X_1,\ldots,X_N$. For more than two random variables, the word “independent” usually refers to mutually independent random variables, where all subsets are independent. The phrase “identically distributed” means that all the random variables are from the same distribution.

Conditional Independence
Two random variables $X$ and $Y$ are conditionally independent given $Z$ if and only if
$p(x,y | z) = p(x | z)p(y | z)$ for all $z \in \mathcal{Z}$ such that $p(z) > 0$
We write $X \perp Y |Z$ to denote that $X$ is conditionally independent of $Y$ given $Z$.
We can expand the left-hand side using the product rule:
$p(x, y | z) = p(x | y, z)p(y | z)$
Comparing these two equations, we see that
$p(x|y,z)=p(x|z)$
which can be interpreted as: “given that we know $z$, knowledge about $y$ does not change our knowledge of $x$”.
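A small simulation sketch of conditional independence (a hypothetical setup, not from the text: $Z$ selects a coin bias, and $X$, $Y$ are two flips of the selected coin). Given $Z$, the flips are independent, so $p(x, y \mid z) \approx p(x \mid z)\,p(y \mid z)$.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1_000_000

# Z picks one of two coin biases; X and Y are independent flips of the chosen coin
z = rng.integers(0, 2, size=N)           # z in {0, 1}
bias = np.where(z == 0, 0.2, 0.7)        # hypothetical biases for each value of z
x = rng.random(N) < bias                 # flip 1
y = rng.random(N) < bias                 # flip 2

mask = z == 1                            # condition on z = 1
p_xy = np.mean(x[mask] & y[mask])        # p(x=1, y=1 | z=1)
p_x = np.mean(x[mask])                   # p(x=1 | z=1)
p_y = np.mean(y[mask])                   # p(y=1 | z=1)

print(p_xy, p_x * p_y)                   # ≈ 0.49 for both: conditionally independent
# Note: marginally (without conditioning on z), x and y are NOT independent.
```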
