
Introduction to Statistics with Python

Fundamental statistics are useful tools in applied machine learning for a better understanding of your data. They are also the tools that provide the foundation for more advanced linear algebra operations and machine learning methods, such as the covariance matrix and principal component analysis respectively. As such, it is important to have a strong grip on fundamental statistics in the context of linear algebra notation. In this tutorial, you will discover how fundamental statistical operations work and how to implement them using NumPy.
Expected Value and Mean
In probability, the average value of some random variable X is called the expected value or the expectation. The expected value uses the notation E with square brackets around the name of the variable; 
for example:
E[X]
It is calculated as the probability-weighted sum of the values that can be drawn.
$E[X] = x_1 p_1 + x_2 p_2 + x_3 p_3 + \cdots + x_n p_n$
In simple cases, such as flipping a coin or rolling a die, each outcome is equally likely. Therefore, the expected value can be calculated as the sum of all values multiplied by the reciprocal of the number of values.
$E[X] = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$
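As a quick illustration (a minimal sketch that is not part of the original text; the variable names are arbitrary), the expected value of a fair six-sided die can be computed both ways:
# expected value of a fair die roll (illustrative sketch)
from numpy import array
# the possible values and their (equal) probabilities
values = array([1,2,3,4,5,6])
probs = array([1/6]*6)
# probability-weighted sum: E[X] = x1*p1 + x2*p2 + ... + xn*pn
print((values * probs).sum())  # approximately 3.5
# with equal probabilities this reduces to the simple average
print(values.sum() / len(values))  # 3.5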
In statistics, the mean, or more technically the arithmetic mean or sample mean, can be estimated from a sample of examples drawn from the domain. This can be confusing because mean, average, and expected value are used interchangeably. In the abstract, the mean is denoted by the lowercase Greek letter mu ($\mu$) and is calculated from the sample of observations, rather than all possible values.
The arithmetic mean can be calculated for a vector or matrix in NumPy by using the mean() function. The example below defines a 6-element vector and calculates the mean.
# vector mean
from numpy import array
from numpy import mean
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate mean
result = mean(v)
print(result)
[1 2 3 4 5 6]
3.5
The mean() function can calculate the row or column means of a matrix by specifying the axis argument and the value 0 or 1 respectively. The example below defines a 2 x 6 matrix and calculates both column and row means.
# matrix means
from numpy import array
from numpy import mean
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column means
col_mean = mean(M, axis=0)
print(col_mean)
# row means
row_mean = mean(M, axis=1)
print(row_mean)
[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 1. 2. 3. 4. 5. 6.]
[ 3.5 3.5]
Variance and Standard Deviation
In probability, the variance of some random variable X is a measure of how much values in the distribution vary on average with respect to the mean. The variance is denoted as the function Var() on the variable.
Var[X]
Variance is calculated as the average squared difference of each value in the distribution from the expected value; that is, the expected squared difference from the expected value.
$Var[X] = E[(X - E[X])^2]$
In statistics, the variance can be estimated from a sample of examples drawn from the domain. In the abstract, the sample variance is denoted by the lowercase sigma with a 2 superscript indicating the units are squared (i.e. $\sigma^2$), not that you must square the final value. The sum of the squared differences is multiplied by the reciprocal of the number of examples minus 1 to correct for bias.
$\sigma^2 = \frac{1}{n-1}[(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2]$
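Before turning to NumPy's built-in function, the short sketch below (added for illustration; the variable names are arbitrary) computes the sample variance directly from this formula:
# sample variance computed directly from the formula (illustrative sketch)
from numpy import array
# define vector
x = array([1,2,3,4,5,6])
# sample mean
mu = x.sum() / len(x)
# sum of squared differences, divided by n-1 to correct for bias
s2 = ((x - mu) ** 2).sum() / (len(x) - 1)
print(s2)  # 3.5, matching var(x, ddof=1) below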
In NumPy, the variance can be calculated for a vector or a matrix using the var() function. By default, the var() function calculates the population variance. To calculate the sample variance, you must set the ddof argument to the value 1. The example below defines a 6-element vector and calculates the sample variance.
# vector variance
from numpy import array
from numpy import var
# define vector
v = array([1,2,3,4,5,6])
print(v)
# calculate variance
result = var(v, ddof=1)
print(result)
[1 2 3 4 5 6]
3.5
The var function can calculate the row or column variances of a matrix by specifying the axis argument and the value 0 or 1 respectively, the same as the mean function above. The example below defines a 2 x 6 matrix and calculates both column and row sample variances.
# matrix variances
from numpy import array
from numpy import var
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column variances
col_var = var(M, ddof=1, axis=0)
print(col_var)
# row variances
row_var = var(M, ddof=1, axis=1)
print(row_var)
[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 0. 0. 0. 0. 0. 0.]
[ 3.5 3.5]
The standard deviation is calculated as the square root of the variance and is denoted as lowercase s.
$s = \sqrt{\sigma^2}$
NumPy also provides a function for calculating the standard deviation directly via the std() function. As with the var() function, the ddof argument must be set to 1 to calculate the unbiased sample standard deviation. Column and row standard deviations can be calculated by setting the axis argument to 0 and 1 respectively.
The example below demonstrates how to calculate the sample standard deviation for the rows and columns of a matrix.
# matrix standard deviation
from numpy import array
from numpy import std
# define matrix
M = array([
[1,2,3,4,5,6],
[1,2,3,4,5,6]])
print(M)
# column standard deviations
col_std = std(M, ddof=1, axis=0)
print(col_std)
# row standard deviations
row_std = std(M, ddof=1, axis=1)
print(row_std)
[[1 2 3 4 5 6]
[1 2 3 4 5 6]]
[ 0. 0. 0. 0. 0. 0.]
[ 1.87082869 1.87082869]
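As a quick check (an illustrative sketch, not from the original text), taking the square root of the sample variance reproduces the sample standard deviation:
# standard deviation as the square root of the variance (sketch)
from numpy import array
from numpy import var
from numpy import std
from numpy import sqrt
v = array([1,2,3,4,5,6])
print(sqrt(var(v, ddof=1)))  # 1.8708...
print(std(v, ddof=1))        # same value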
Covariance and Correlation
In probability, covariance is the measure of the joint probability for two random variables. It describes how the two variables change together. It is denoted as the function cov(X, Y), where X and Y are the two random variables being considered.
$cov(X, Y)$
Covariance is calculated as the expected value, or average, of the product of the differences of each random variable from its expected value, where E[X] is the expected value of X and E[Y] is the expected value of Y.
$cov(X, Y) = E[(X - E[X]) \times (Y - E[Y])]$
The sign of the covariance can be interpreted as whether the two variables increase together (positive) or whether one increases as the other decreases (negative). The magnitude of the covariance is not easily interpreted. A covariance value of zero indicates that the variables have no linear relationship. NumPy does not have a function to calculate the covariance between two variables directly. Instead, it has a function for calculating a covariance matrix called cov() that we can use to retrieve the covariance. By default, the cov() function will calculate the unbiased or sample covariance between the provided random variables.
The example below defines two vectors of equal length, with one increasing and one decreasing. We would expect the covariance between these variables to be negative. We access just the covariance for the two variables as the [0, 1] element of the square covariance matrix returned.
# vector covariance
from numpy import array
from numpy import cov
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate covariance
Sigma = cov(x,y)[0,1]
print(Sigma)
[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-7.5
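To connect this result back to the definition, the sketch below (illustrative; not part of the original text) computes the sample covariance manually as the sum of products of deviations divided by n-1:
# sample covariance computed from the definition (sketch)
from numpy import array
from numpy import mean
x = array([1,2,3,4,5,6,7,8,9])
y = array([9,8,7,6,5,4,3,2,1])
# sum of products of deviations from the means, divided by n-1
manual_cov = ((x - mean(x)) * (y - mean(y))).sum() / (len(x) - 1)
print(manual_cov)  # -7.5, matching cov(x, y)[0, 1]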
The covariance can be normalized to a score between -1 and 1 to make the magnitude interpretable by dividing it by the standard deviations of X and Y. The result is called the correlation of the variables, also called the Pearson correlation coefficient, named for the developer of the method.
$r = \frac{cov(X, Y)}{s_X \cdot s_Y}$
Where r is the correlation coefficient of X and Y, cov(X, Y) is the sample covariance of X and Y, and $s_X$ and $s_Y$ are the standard deviations of X and Y respectively. NumPy provides the corrcoef() function for calculating the correlation between two variables directly. Like cov(), it returns a matrix, in this case a correlation matrix. As with the results from cov(), we can access just the correlation of interest as the [0, 1] element of the returned square matrix.
# vector correlation
from numpy import array
from numpy import corrcoef
# define first vector
x = array([1,2,3,4,5,6,7,8,9])
print(x)
# define second vector
y = array([9,8,7,6,5,4,3,2,1])
print(y)
# calculate correlation
corr = corrcoef(x,y)[0,1]
print(corr)
[1 2 3 4 5 6 7 8 9]
[9 8 7 6 5 4 3 2 1]
-1.0
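The same value can be reproduced directly from the formula by dividing the sample covariance by the product of the sample standard deviations (an illustrative sketch, not from the original text):
# correlation from covariance and standard deviations (sketch)
from numpy import array
from numpy import cov
from numpy import std
x = array([1,2,3,4,5,6,7,8,9])
y = array([9,8,7,6,5,4,3,2,1])
r = cov(x, y)[0,1] / (std(x, ddof=1) * std(y, ddof=1))
print(r)  # approximately -1.0, matching corrcoef(x, y)[0, 1]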
Covariance Matrix
The covariance matrix is a square and symmetric matrix that describes the covariance between two or more random variables. The diagonal of the covariance matrix contains the variances of each of the random variables; as such, it is often called the variance-covariance matrix. A covariance matrix is a generalization of the covariance of two variables and captures the way in which all variables in the dataset may change together.
Each entry of the covariance matrix is $C_{i,j} = cov(X_i, X_j)$, where $X$ is a matrix in which each column represents a random variable.
The covariance matrix provides a useful tool for separating the structured relationships in a matrix of random variables. This can be used to decorrelate variables or applied as a transform to other variables. It is a key element used in the Principal Component Analysis data reduction method, or PCA for short. The covariance matrix can be calculated in NumPy using the cov() function. By default, this function will calculate the sample covariance matrix. The cov() function can be called with a single 2D array where each sub-array contains a feature (e.g. column). If this function is called with your data defined in a normal matrix format (rows then columns), then a transpose of the matrix will need to be provided to the function in order to correctly calculate the covariance of the columns. Below is an example that defines a dataset with 5 observations across 3 features and calculates the covariance matrix.
# covariance matrix
from numpy import array
from numpy import cov
# define matrix of observations
X = array([
[1, 5, 8],
[3, 5, 11],
[2, 4, 9],
[3, 6, 10],
[1, 5, 10]])
print(X)
# calculate covariance matrix
Sigma = cov(X.T)
print(Sigma)
[[ 1 5 8]
[ 3 5 11]
[ 2 4 9]
[ 3 6 10]
[ 1 5 10]]
[[ 1. 0.25 0.75]
[ 0.25 0.5 0.25]
[ 0.75 0.25 1.3 ]]
The following program finds the covariance matrix without using the library function; you can check the result against the output of the standard function cov(). The idea is to subtract the mean of each column from every row, then multiply the transpose of the centered matrix with the centered matrix itself, and divide by n - 1.
import numpy as np
# define matrix of observations (rows are observations, columns are features)
X = np.array([[64.0, 580.0, 29.0],
              [66.0, 570.0, 33.0],
              [68.0, 590.0, 37.0],
              [69.0, 660.0, 46.0],
              [73.0, 600.0, 55.0]])
# reference result using the standard function
V = np.cov(X.T)
print(V)
# find the mean of each column
M = np.mean(X, axis=0)
print(M)
# subtract the mean array from each row (center the data)
X = X - M
# multiply the transpose of the centered matrix with the centered matrix
covm = X.T.dot(X)
# divide by n - 1 (here 5 - 1 = 4) for the sample covariance
n = X.shape[0]
print(covm / (n - 1))
[[ 11.5 50. 34.75]
[ 50. 1250. 205. ]
[ 34.75 205. 110. ]]
[ 68. 600. 40.]
[[ 11.5 50. 34.75]
[ 50. 1250. 205. ]
[ 34.75 205. 110. ]]
The covariance matrix is used widely in linear algebra and the intersection of linear algebra and statistics called multivariate analysis. We have only had a small taste in this tutorial.
