Fundamental statistics are useful tools in applied machine learning for better understanding your data. They are also the tools that provide the foundation for more advanced linear algebra operations and machine learning methods, such as the covariance matrix and principal component analysis respectively. As such, it is important to have a strong grip on fundamental statistics in the context of linear algebra notation. In this tutorial, you will discover how fundamental statistical operations work and how to implement them using NumPy.
Expected Value and Mean
In probability, the average value of some random variable X is called the expected value or the expectation. The expected value uses the notation E with square brackets around the name of the variable, for example:
E[X]
It is calculated as the probability weighted sum of values that can be drawn.
E[X] = x_1 p_1 + x_2 p_2 + x_3 p_3 + \cdots + x_n p_n
In simple cases, such as flipping a coin or rolling a die, every outcome is equally likely. Therefore, the expected value can be calculated as the sum of all values multiplied by the reciprocal of the number of values.
E[X] = \frac{1}{n} (x_1 + x_2 + \cdots + x_n)
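As a minimal sketch of the two formulas above, the probability-weighted sum can be written as a dot product between a vector of values and a vector of probabilities. The loaded-die probabilities below are made up purely for illustration, and the variable names are our own.

```python
# expected value as a probability-weighted sum
from numpy import array

# possible outcomes of a six-sided die
values = array([1, 2, 3, 4, 5, 6])

# fair die: every outcome has probability 1/6
p_fair = array([1 / 6] * 6)
expected_fair = values.dot(p_fair)
print(expected_fair)

# hypothetical loaded die: a six comes up half of the time
p_loaded = array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
expected_loaded = values.dot(p_loaded)
print(expected_loaded)
```

With equal probabilities the result should come out at approximately 3.5, the same as the plain average of the values. The loaded die shifts the expected value toward 6.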
In statistics, the mean, or more technically the arithmetic mean or sample mean, can be estimated from a sample of examples drawn from the domain. It is confusing because mean, average, and expected value are often used interchangeably. In the abstract, the mean is denoted by the lowercase Greek letter mu (µ) and is calculated from the sample of observations, rather than all possible values.
The arithmetic mean can be calculated for a vector or matrix in NumPy by using the mean() function. The example below defines a 6-element vector and calculates the mean.
# vector mean
from numpy import array
from numpy import mean
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate mean
result = mean(v)
print(result)
[1 2 3 4 5 6]
3.5
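In linear algebra notation, the same mean can be sketched as the dot product of the vector with a vector of ones, divided by the number of elements. This is only an illustration of the notation, not a description of how mean() is implemented internally.

```python
# vector mean written as a dot product
from numpy import array
from numpy import ones
from numpy import mean

# define vector
v = array([1, 2, 3, 4, 5, 6])
n = v.shape[0]

# sum the elements with a dot product against a vector of ones, then divide by n
result = v.dot(ones(n)) / n
print(result)

# compare with the library function
print(mean(v))
```

Both print statements should report the same value as the example above.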
The mean() function can calculate the column or row means of a matrix by setting the axis argument to 0 or 1 respectively. The example below defines a 2 x 6 matrix and calculates both column and row means.
# matrix means
from numpy import array
from numpy import mean
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column means
col_mean = mean(M, axis=0)
print(col_mean)
# row means
row_mean = mean(M, axis=1)
print(row_mean)
o/p:
[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 1. 2. 3. 4. 5. 6.]
[ 3.5 3.5]
Variance and Standard Deviation
In probability, the variance of some random variable X is a measure of how much values in the distribution vary on average with respect to the mean. The variance is denoted as the function Var() on the variable.
Var[X]
Variance is calculated as the average squared difference of each value in the distribution from the expected value, or equivalently as the expected squared difference from the expected value.
Var[X] = E[(X - E[X])^2]
In statistics, the variance can be estimated from a sample of examples drawn from the domain. In the abstract, the sample variance is denoted by the lowercase sigma with a 2 superscript (σ^2), indicating that the units are squared, not that you must square the final value. The sum of the squared differences is multiplied by the reciprocal of the number of examples minus 1 to correct for a bias.
\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu)^2
In NumPy, the variance can be calculated for a vector or a matrix using the var() function. By default, the var() function calculates the population variance. To calculate the sample variance, you must set the ddof argument to the value 1. The example below defines a 6-element vector and calculates the sample variance.
# vector variance
from numpy import array
from numpy import var
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate variance
result = var(v, ddof=1)
print(result)
[1 2 3 4 5 6]
3.5
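To make the effect of the ddof argument concrete, the sketch below computes the sum of squared differences by hand and divides it once by n - 1 (sample variance) and once by n (population variance). The variable names are our own.

```python
# sample vs. population variance computed by hand
from numpy import array
from numpy import mean
from numpy import var

# define vector
v = array([1, 2, 3, 4, 5, 6])
n = len(v)

# squared difference of each value from the mean
squared_diffs = (v - mean(v)) ** 2

# divide by n - 1 for the sample variance, by n for the population variance
sample_var = squared_diffs.sum() / (n - 1)
population_var = squared_diffs.sum() / n

# each pair should agree with the library function
print(sample_var, var(v, ddof=1))
print(population_var, var(v))
```

The population variance is always the smaller of the two, and the gap shrinks as the number of observations grows.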
The var() function can calculate the column or row variances of a matrix by setting the axis argument to 0 or 1 respectively, the same as the mean() function above. The example below defines a 2 x 6 matrix and calculates both column and row sample variances.
# matrix variances
from numpy import array
from numpy import var
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column variances
col_var = var(M, ddof=1, axis=0)
print(col_var)
# row variances
row_var = var(M, ddof=1, axis=1)
print(row_var)
[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 0. 0. 0. 0. 0. 0.]
[ 3.5 3.5]
The standard deviation is calculated as the square root of the variance and is denoted as lowercase s.
s = \sqrt{\sigma^2}
NumPy also provides a function for calculating the standard deviation directly via the std() function. As with the var() function, the ddof argument must be set to 1 to calculate the sample standard deviation (the square root of the unbiased sample variance), and column and row standard deviations can be calculated by setting the axis argument to 0 and 1 respectively.
The example below demonstrates how to calculate the sample standard deviation for the rows and columns of a matrix.
# matrix standard deviation
from numpy import array
from numpy import std
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column standard deviations
col_std = std(M, ddof=1, axis=0)
print(col_std)
# row standard deviations
row_std = std(M, ddof=1, axis=1)
print(row_std)
[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 0. 0. 0. 0. 0. 0.]
[ 1.87082869 1.87082869]
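As a quick check of the relationship between the two statistics, the sketch below takes the square root of the row variances and compares it with the result of std(). The same ddof value has to be used in both calls for the comparison to hold.

```python
# standard deviation as the square root of the variance
from numpy import array
from numpy import sqrt
from numpy import std
from numpy import var
from numpy import allclose

# define matrix
M = array([
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6]])

# row standard deviations from the library function
row_std = std(M, ddof=1, axis=1)

# row standard deviations derived from the row variances
row_std_from_var = sqrt(var(M, ddof=1, axis=1))

# the two results should match up to floating point error
print(allclose(row_std, row_std_from_var))
```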
Covariance and Correlation
In probability, covariance is a measure of the joint variability of two random variables. It describes how the two variables change together. It is denoted as the function cov(X, Y), where X and Y are the two random variables being considered.
cov(X, Y)
Covariance is calculated as the expected value or average of the product of the differences of each random variable from its expected value, where E[X] is the expected value of X and E[Y] is the expected value of Y.
cov(X, Y) = E[(X - E[X]) (Y - E[Y])]
The sign of the covariance can be interpreted as whether the two variables change in the same direction (positive) or in opposite directions (negative). The magnitude of the covariance is not easily interpreted. A covariance value of zero indicates that the two variables have no linear relationship; it does not by itself mean that they are independent. NumPy does not have a function to calculate the covariance between two variables directly. Instead, it has a function for calculating a covariance matrix called cov() that we can use to retrieve the covariance. By default, the cov() function will calculate the unbiased or sample covariance between the provided random variables.
The example below defines two vectors of equal length with one increasing and one decreasing. We would expect the covariance between these variables to be negative. We access just the covariance for the two variables as the [0, 1] element of the square covariance matrix returned.
# vector covariance
from numpy import array
from numpy import cov
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate covariance
Sigma = cov(x,y)[0,1]
print(Sigma)
[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-7.5
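The same value can be reproduced directly from the definition of covariance, using the n - 1 divisor for the sample estimate. The sketch below does this by hand, then adds a second, made-up pair of vectors in which one variable is fully determined by the other and yet the covariance is zero, because the relationship is not linear. The variable names are our own.

```python
# covariance computed from its definition
from numpy import array
from numpy import cov
from numpy import mean

# same vectors as in the example above
x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])
n = len(x)

# average product of the differences from the means, with the n - 1 divisor
manual_cov = ((x - mean(x)) * (y - mean(y))).sum() / (n - 1)
print(manual_cov)
print(cov(x, y)[0, 1])

# illustrative case: b depends entirely on a, but not linearly
a = array([-2, -1, 0, 1, 2])
b = a ** 2
zero_cov = cov(a, b)[0, 1]
print(zero_cov)
```

The first two values should agree with each other, and the last should be zero up to floating point error even though b is a function of a.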
The covariance can be normalized to a score between -1 and 1 to make the magnitude interpretable by dividing it by the product of the standard deviations of X and Y. The result is called the correlation of the variables, also called the Pearson correlation coefficient, named for the developer of the method.
r = \frac{cov(X, Y)}{s_X s_Y}
Where r is the correlation coefficient of X and Y, cov(X, Y) is the sample covariance of X and Y, and s_X and s_Y are the standard deviations of X and Y respectively. NumPy provides the corrcoef() function for calculating the correlation between two variables directly. Like cov(), it returns a matrix, in this case a correlation matrix. As with the results from cov(), we can access just the correlation of interest as the [0, 1] value of the returned square matrix.
# vector correlation
from numpy import array
from numpy import corrcoef
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate correlation
corr = corrcoef(x,y)[0,1]
print(corr)
[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-1.0
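As a sketch of the normalization, the correlation can also be assembled by hand from cov() and std(). The sample standard deviations (ddof=1) are used so that the divisor matches the sample covariance that cov() returns by default.

```python
# correlation assembled from covariance and standard deviations
from numpy import array
from numpy import cov
from numpy import std
from numpy import corrcoef

# same vectors as in the example above
x = array([1, 2, 3, 4, 5, 6, 7, 8, 9])
y = array([9, 8, 7, 6, 5, 4, 3, 2, 1])

# sample covariance divided by the product of the sample standard deviations
r = cov(x, y)[0, 1] / (std(x, ddof=1) * std(y, ddof=1))
print(r)

# compare with the library function
print(corrcoef(x, y)[0, 1])
```

The two values should agree up to floating point error. Because the second vector is an exact decreasing linear function of the first, the correlation sits at the lower bound of the range.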
Covariance Matrix
The covariance matrix is a square and symmetric matrix that describes the covariance between two or more random variables. The diagonal of the covariance matrix holds the variances of each of the random variables; as such, it is often called the variance-covariance matrix. A covariance matrix is a generalization of the covariance of two variables and captures the way in which all variables in the dataset may change together.
Each entry of the covariance matrix is defined as C(i, j) = cov(X_i, X_j), where X is a matrix in which each column represents a random variable and X_i denotes its i-th column.
The covariance matrix provides a useful tool for separating the structured relationships in a matrix of random variables. This can be used to decorrelate variables or applied as a transform to other variables. It is a key element used in the Principal Component Analysis data reduction method, or PCA for short.
The covariance matrix can be calculated in NumPy using the cov() function. By default, this function will calculate the sample covariance matrix. The cov() function can be called with a single 2D array where each sub-array (row) contains the values of one feature. If this function is called with your data defined in a normal matrix format (rows of observations, columns of features), then a transpose of the matrix will need to be provided to the function in order to correctly calculate the covariance of the columns. Below is an example that defines a dataset with 5 observations across 3 features and calculates the covariance matrix.
# covariance matrix
from numpy import array
from numpy import cov
# define matrix of observations
X = array([
[1, 5, 8],
[3, 5, 11],
[2, 4, 9],
[3, 6, 10],
[1, 5, 10]])
print(X)
# calculate covariance matrix
Sigma = cov(X.T)
print(Sigma)
[[ 1 5 8]
[ 3 5 11]
[ 2 4 9]
[ 3 6 10]
[ 1 5 10]]
[[ 1. 0.25 0.75]
[ 0.25 0.5 0.25]
[ 0.75 0.25 1.3 ]]
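The properties described earlier can be checked on this result. The sketch below confirms that the diagonal matches the column sample variances and that the matrix is symmetric. It also shows that passing rowvar=False to cov() appears to be an equivalent alternative to transposing the data first.

```python
# checking properties of the covariance matrix
from numpy import array
from numpy import cov
from numpy import var
from numpy import diag
from numpy import allclose

# same matrix of observations as in the example above
X = array([
    [1, 5, 8],
    [3, 5, 11],
    [2, 4, 9],
    [3, 6, 10],
    [1, 5, 10]])
Sigma = cov(X.T)

# the diagonal holds the sample variance of each column
print(allclose(diag(Sigma), var(X, ddof=1, axis=0)))

# the matrix equals its own transpose
print(allclose(Sigma, Sigma.T))

# rowvar=False treats columns as variables, so no transpose is needed
print(allclose(Sigma, cov(X, rowvar=False)))
```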
The following program finds the covariance matrix without using the library function. You can check the result against the output of the standard cov() function. The idea is to subtract the column mean from each column, multiply the transpose of the centered matrix by the centered matrix itself, and divide by n - 1.
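# covariance matrix without the library function
import numpy as np
# define matrix of observations
X = np.array([
[64.0, 580.0, 29.0],
[66.0, 570.0, 33.0],
[68.0, 590.0, 37.0],
[69.0, 660.0, 46.0],
[73.0, 600.0, 55.0]])
# reference result from the library function
V = np.cov(X.T)
print(V)
# find the mean of each column
M = np.mean(X, axis=0)
print(M)
# subtract the column means from each row
X = X - M
# multiply the transposed centered matrix by the centered matrix
covm = X.T.dot(X)
# divide by n - 1 = 4 to get the sample covariance matrix
print(covm / 4)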
[[ 11.5 50. 34.75]
[ 50. 1250. 205. ]
[ 34.75 205. 110. ]]
[ 68. 600. 40.]
[[ 11.5 50. 34.75]
[ 50. 1250. 205. ]
[ 34.75 205. 110. ]]
The covariance matrix is used widely in linear algebra and the intersection of linear algebra and statistics called multivariate analysis. We have only had a small taste in this tutorial.