We are often interested in summarizing sets of random variables and comparing pairs of random variables. A statistic of a random variable is a deterministic function of that random variable. The summary statistics of a distribution provide one useful view of how a random variable behaves, and as the name suggests, they provide numbers that summarize and characterize the distribution.
The mean and the variance are two well-known summary statistics. There are two ways to compare a pair of random variables: first, by asking whether the two random variables are statistically independent, and second, by computing an inner product between them.
Means and Covariances
Mean and (co)variance are often useful to describe properties of probability distributions (expected values and spread). There is a useful family of distributions (called the exponential family), where the statistics of the random variable capture all possible information.
The concept of the expected value is central to machine learning, and the foundational concepts of probability itself can be derived from the expected value.
Expected Value
The expected value of a function $g:\mathbb{R} \to \mathbb{R}$ of a univariate continuous random variable $X \sim p(x)$ is given by
$E_X[g(x)]=\int_X g(x)p(x)dx$
Correspondingly, the expected value of a function $g$ of a discrete random variable $X \sim p(x)$ is given by
$E_X[g(x)]=\sum_{x \in X}g(x)p(x)$
where $X$ is the set of possible outcomes (the target space) of the random variable $X$.
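As a quick illustration (a hypothetical fair six-sided die, not an example from the text), the discrete expectation is just a probability-weighted sum over the target space:

```python
# Expected value of g(X) for a discrete random variable:
# a probability-weighted sum over the target space.
# Hypothetical example: a fair six-sided die and g(x) = x**2.
outcomes = [1, 2, 3, 4, 5, 6]        # target space of X
p = {x: 1 / 6 for x in outcomes}      # p(x) for each outcome

def expected_value(g, outcomes, p):
    """E_X[g(x)] = sum over x of g(x) * p(x)."""
    return sum(g(x) * p[x] for x in outcomes)

print(expected_value(lambda x: x, outcomes, p))       # E[X]   = 3.5
print(expected_value(lambda x: x ** 2, outcomes, p))  # E[X^2] ~ 15.17
```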
We consider multivariate random variables $X$ as a finite vector of univariate random variables $[X_1, \ldots,X_D]^T$. For multivariate random variables, we define the expected value element-wise
$E_X[g(x)]=\begin{bmatrix}E_{X_1}[g(x_1)]\\
\vdots \\
E_{X_D}[g(x_D)]
\end{bmatrix} \in \mathbb{R}^D$
where the subscript $E_{X_d}$ indicates that we are taking the expected value with respect to the $d$th element of the vector $x$.
Mean, Median and Mode
The mean of a random variable $X$ with states $x \in \mathbb{R}^D$ is an average and is defined as
$E_X[x]=\begin{bmatrix}E_{X_1}[x_1]\\
\vdots \\
E_{X_D}[x_D]
\end{bmatrix} \in \mathbb{R}^D$
where
$E_{X_d}[x_d]=\int_X x_d\,p(x_d)\,dx_d$
if $X$ is a continuous random variable.
$E_{X_d}[x_d]=\sum_{x_i \in X}x_i\,p(x_d=x_i)$
if $X$ is a discrete random variable.
In one dimension, there are two other intuitive notions of “average”: the median and the mode. The median is the “middle” value if we sort the values, i.e., 50% of the values are greater than the median and 50% are smaller. This idea can be generalized to continuous values by considering the value where the cdf is 0.5. For distributions that are asymmetric or have long tails, the median provides an estimate of a typical value that is closer to human intuition than the mean, and it is also more robust to outliers than the mean. The generalization of the median to higher dimensions is non-trivial, as there is no obvious way to “sort” in more than one dimension.
The mode is the most frequently occurring value. For a discrete random variable, the mode is defined as the value of $x$ having the highest frequency of occurrence. For a continuous random variable, the mode is defined as a peak in the density $p(x)$. A particular density $p(x)$ may have more than one mode, and furthermore there may be a very large number of modes in high-dimensional distributions. Therefore, finding all the modes of a distribution can be computationally challenging.
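A small sketch of these three notions on made-up samples, using NumPy and the standard library (the data values are purely illustrative):

```python
import numpy as np
from statistics import mode

# Hypothetical skewed sample with one outlier: the median is barely
# affected, while the mean is pulled towards the outlier.
data = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 100.0])

print(np.mean(data))          # 18.67 -- dragged up by the outlier
print(np.median(data))        # 2.5   -- robust "middle" value
print(mode([1, 2, 2, 3, 4]))  # 2     -- most frequently occurring value
```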
Covariance, Variance and Standard Deviation
The covariance between two univariate random variables $X, Y \in \mathbb{R}$ is given by the expected product of their deviations from their respective means, i.e.,
$Cov[x,y]=E[(x-E[x])(y-E[y])]$
By using linearity of expectation the expression can be rewritten as the expected value of the product minus the product of the expected values, i.e.,
$Cov[x,y]=E[xy]-E[x]E[y]$
The covariance of a variable with itself, $Cov[x,x]$, is called the variance and is denoted by $V_X[x]$. The square root of the variance is called the standard deviation and is often denoted by $\sigma(x)$. The notion of covariance can be generalized to multivariate random variables.
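The two covariance formulas above can be checked numerically; the following sketch uses hypothetical paired samples and the population ($1/N$) convention:

```python
import numpy as np

# Hypothetical paired samples; both covariance formulas agree
# (using the 1/N population convention throughout).
x = np.array([2.1, 2.5, 3.6, 4.0])
y = np.array([8.0, 10.0, 12.0, 14.0])

cov_deviation = np.mean((x - x.mean()) * (y - y.mean()))  # E[(x-E[x])(y-E[y])]
cov_rawscore  = np.mean(x * y) - x.mean() * y.mean()       # E[xy] - E[x]E[y]

print(cov_deviation, cov_rawscore)          # identical values (1.7)
print(np.sqrt(np.mean((x - x.mean())**2)))  # standard deviation sigma(x)
```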
Covariance (Multivariate).
If we consider two multivariate random variables $X$ and $Y$ with states $x \in \mathbb{R}^D$ and $y \in \mathbb{R}^E$ respectively, the covariance between $X$ and $Y$ is defined as
$Cov[x, y] = E[xy^T] - E[x]E[y]^T = Cov[y, x]^T \in \mathbb{R}^{D \times E} $
For a multivariate random variable, the variance describes the relation between individual dimensions of the random variable.
Variance (Multivariate)
The variance of a multivariate random variable $X$ with states $x \in \mathbb{R}^D$ and a mean vector $\mu \in \mathbb{R}^D$ is defined as
$V_X[x]=E_X[(x -\mu)(x -\mu)^T] = E_X[xx^T] - E_X[x]E_X[x]^T$
$\quad=\begin{bmatrix}
Cov[x_1,x_1] &Cov[x_1,x_2] &\ldots & Cov[x_1,x_D]\\
Cov[x_2,x_1]& Cov[x_2,x_2] &\ldots &Cov[x_2,x_D] \\
\vdots & \vdots & \ddots & \vdots\\
Cov[x_D,x_1]& Cov[x_D,x_2] & \ldots & Cov[x_D,x_D]
\end{bmatrix}$
The $D \times D$ matrix above is called the covariance matrix of the multivariate random variable $X$. The covariance matrix is symmetric and positive semidefinite and tells us something about the spread of the data. On its diagonal, the covariance matrix contains the variances of the marginals. The off-diagonal entries are the cross-covariance terms $Cov[x_i, x_j]$ for $i, j = 1,\ldots,D$, $i \ne j$.
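A brief numerical check of these properties, using randomly generated data (the mixing matrix below is an arbitrary choice made purely for illustration):

```python
import numpy as np

# Hypothetical 2-D data: the covariance matrix is symmetric and
# positive semidefinite, with the marginal variances on the diagonal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])

Sigma = np.cov(X, rowvar=False, bias=True)  # D x D covariance matrix
print(Sigma)
print(np.allclose(Sigma, Sigma.T))          # True: symmetric
print(np.linalg.eigvalsh(Sigma) >= -1e-12)  # all eigenvalues >= 0: PSD
```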
When we want to compare the covariances between different pairs of random variables, it turns out that the variance of each random variable affects the value of the covariance. The normalized version of covariance is called the correlation.
Correlation
The correlation between two random variables $X,Y$ is given by
$Corr[x,y]=\frac{Cov[x,y]}{\sqrt{V[x]V[y]}}\in [-1,1]$
The correlation matrix is the covariance matrix of standardized random variables, $x/\sigma(x)$. In other words, each random variable is divided by its standard deviation (the square root of the variance) in the correlation matrix.
The covariance (and correlation) indicate how two random variables are related; see the figure below. Positive correlation $Corr[x, y]$ means that when $x$ grows, $y$ is also expected to grow. Negative correlation means that as $x$ increases, $y$ decreases.
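The following sketch computes the correlation by normalizing the covariance with the standard deviations and compares it against NumPy's `corrcoef` (the data are made up):

```python
import numpy as np

# Correlation as normalized covariance on hypothetical data;
# np.corrcoef gives the same result since correlation is scale-invariant.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr   = cov_xy / (x.std() * y.std())  # Cov[x,y] / sqrt(V[x] V[y])

print(corr)                     # manual normalization
print(np.corrcoef(x, y)[0, 1])  # matches NumPy's correlation matrix entry
```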
Empirical Mean and Covariance
The definitions in the previous sections are often also called the population mean and covariance, as they refer to the true statistics of the population. In machine learning, we need to learn from empirical observations of data.
Consider a random variable $X$. There are two conceptual steps to go from population statistics to the realization of empirical statistics. First, we use the fact that we have a finite dataset (of size $N$) to construct an empirical statistic that is a function of a finite number of identical random variables,
$X_1,\ldots,X_N$.
Second, we observe the data, that is, we look at the realization $x_1,\ldots, x_N$ of each of the random variables and apply the empirical statistic. Specifically, given a particular dataset we can obtain an estimate of the mean, which is called the empirical mean or sample mean. The same holds for the empirical covariance.
The empirical mean vector is the arithmetic average of the observations for each variable, and it is defined as
$\bar{x}=\frac{1}{N}\sum_{n=1}^N x_n$, where $x_n \in \mathbb{R}^D$
Similar to the empirical mean, the empirical covariance matrix is a $D\times D$ matrix
$\Sigma=\frac{1}{N}\sum_{n=1}^N (x_n-\bar{x})(x_n-\bar{x})^T$
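A minimal sketch of the empirical mean and covariance for a small, hypothetical data matrix whose rows are the observations $x_n$:

```python
import numpy as np

# Empirical mean and covariance of N observations x_n in R^D
# (hypothetical data matrix with rows as observations).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
N = X.shape[0]

x_bar = X.mean(axis=0)                   # empirical mean vector
Sigma = (X - x_bar).T @ (X - x_bar) / N  # empirical covariance (1/N convention)

print(x_bar)
print(Sigma)
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))  # True
```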
The standard definition of variance is the expectation of the squared deviation of a random variable $X$ from its expected value $\mu$, i.e.,
$V_X[x] := E_X[(x -\mu)^2] $
When estimating the variance empirically, we need to resort to a two-pass algorithm: one pass through the data to calculate the mean $\mu$, and then a second pass using this estimate $\mu$ to calculate the variance. It turns out that we can avoid two passes by rearranging the terms. The formula can be converted to the so-called raw-score formula for variance:
$V_X[x]=E_X[x^2]-(E_X[x])^2$
It can be calculated empirically in one pass through the data, since we can accumulate $x_i$ (to calculate the mean) and $x_i^2$ simultaneously. The raw-score version of the variance can be useful in machine learning, e.g., when deriving the bias–variance decomposition.
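A one-pass sketch of the raw-score formula (note that this form can be numerically unstable for data with large magnitudes; it is shown only to mirror the formula, not as production code):

```python
# One-pass variance via the raw-score formula V[x] = E[x^2] - (E[x])^2:
# accumulate the sum of x_i and of x_i^2 in a single sweep over the data.
def one_pass_variance(data):
    n, s, s2 = 0, 0.0, 0.0
    for x in data:
        n += 1
        s += x
        s2 += x * x
    mean = s / n
    return s2 / n - mean * mean

print(one_pass_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # 4.0
```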
Sums of Random Variables
Consider two random variables $X,Y$ with states $x, y \in \mathbb{R}^D$. Then:
$E[x + y] = E[x] + E[y] $
$E[x - y] = E[x] - E[y] $
$V[x + y] = V[x] + V[y] + Cov[x, y] + Cov[y,x]$
$V[x - y] = V[x] + V[y] - Cov[x,y] - Cov[y, x] $
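These identities can be verified empirically; the sketch below uses hypothetical correlated samples and the population (`bias=True`) convention so that the identity for the variance of a sum holds up to floating-point error:

```python
import numpy as np

# Numerically checking V[x + y] = V[x] + V[y] + Cov[x,y] + Cov[y,x]
# on hypothetical correlated samples (population convention, ddof=0).
rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 0.5 * x + rng.normal(size=10_000)

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
print(lhs, rhs)  # approximately equal
```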
Statistical Independence
(Independence). Two random variables $X,Y$ are statistically independent if and only if
$p(x, y) = p(x)p(y) $
Intuitively, two random variables $X$ and $Y$ are independent if the value of $y$ (once known) does not add any additional information about $x$ (and vice versa). If $X,Y$ are (statistically) independent, then
$p(y |x) = p(y)$
$p(x |y) = p(x)$
$V_{X,Y} [x + y] = V_X[x] + V_Y [y]$
$Cov_{X,Y} [x, y] = 0$
The converse of the last point does not hold in general, i.e., two random variables can have zero covariance and still not be statistically independent.
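A classic counterexample, sketched numerically: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$; the covariance is zero even though $Y$ is a deterministic function of $X$:

```python
import numpy as np

# X uniform on {-1, 0, 1} and Y = X^2: Cov[x, y] = 0, yet Y is completely
# determined by X, so the two are not statistically independent.
x = np.array([-1.0, 0.0, 1.0])  # equally likely states of X
y = x ** 2                      # Y = X^2

cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov_xy)                   # 0.0

# Not independent: p(y=1 | x=0) = 0, but p(y=1) = 2/3.
```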
In machine learning, we often consider problems that can be modeled as independent and identically distributed (i.i.d.) random variables $X_1,\ldots,X_N$. For more than two random variables, the word “independent” usually refers to mutually independent random variables, where all subsets are independent. The phrase “identically distributed” means that all the random variables are drawn from the same distribution.
Conditional Independence
Two random variables $X$ and $Y$ are conditionally independent given $Z$ if and only if
$p(x,y | z) = p(x | z)p(y | z)$ for all $z \in Z $
We write $X \perp Y |Z$ to denote that $X$ is conditionally independent of $Y$ given $Z$.
We can expand the left-hand side using the product rule:
$p(x, y | z) = p(x | y, z)p(y | z)$
Comparing this with the definition of conditional independence above, we obtain
$p(x|y,z)=p(x|z)$
This can be read as: “given that we know $z$, knowledge about $y$ does not change our knowledge of $x$”.
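As a sketch, we can construct a toy joint distribution that factorizes as $p(x, y, z) = p(z)p(x|z)p(y|z)$ (the probability values below are arbitrary) and verify the defining property:

```python
# Toy joint distribution over binary variables, built (hypothetically)
# so that X and Y are conditionally independent given Z:
# p(x, y, z) = p(z) p(x|z) p(y|z).
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p(x|z)
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # p(y|z)

def p_xyz(x, y, z):
    return p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]

# Check the defining factorization p(x, y | z) = p(x | z) p(y | z) for all z.
for z in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            lhs = p_xyz(x, y, z) / p_z[z]                 # p(x, y | z)
            rhs = p_x_given_z[z][x] * p_y_given_z[z][y]   # p(x|z) p(y|z)
            assert abs(lhs - rhs) < 1e-12
print("X is conditionally independent of Y given Z for this factorized joint")
```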