
PCA - Principal Component Analysis

Principal Component Analysis is an important machine learning method for dimensionality reduction. It uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number of dimensions or fewer. In this tutorial, you will discover the Principal Component Analysis method for dimensionality reduction and how to implement it from scratch in Python.


Principal Component Analysis, or PCA for short, is a method for reducing the dimensionality of data. It can be thought of as a projection method where data with $m$ columns (features) is projected into a subspace with $m$ or fewer columns, whilst retaining the essence of the original data. The PCA method can be described and implemented using the tools of linear algebra.
PCA is an operation applied to a dataset, represented by an $n \times m$ matrix $A$, that results in a projection of $A$ which we will call $B$.
Let's walk through the steps of this operation. Let
$A=\begin{bmatrix}
a_{11} & a_{12} \\
a_{21} & a_{22} \\
a_{31} & a_{32}
\end{bmatrix}$
$B=PCA(A)$

1. The first step is to calculate the mean of each column of $A$:
$M = mean(A)$

2. Subtract the column means from each value so that the data is centered around mean 0:
$C = A - M$

3. The next step is to calculate the covariance matrix of the centered matrix $C$. A covariance matrix holds the covariance of every column with every other column, including itself:
$V = cov(C)$

4. Finally, we calculate the eigendecomposition of the covariance matrix $V$. This results in a list of eigenvalues and a list of eigenvectors:
$values, vectors = eig(V)$
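Each eigenpair returned by this step satisfies $Vv = \lambda v$, which is easy to verify numerically. A minimal sketch, using a small symmetric matrix as a stand-in for a real covariance matrix (the values are illustrative):

```python
from numpy import array, allclose
from numpy.linalg import eig

# a small symmetric matrix standing in for a covariance matrix
V = array([[4.0, 2.0],
           [2.0, 3.0]])
values, vectors = eig(V)
# check V v = lambda v for the first eigenpair
lam, v = values[0], vectors[:, 0]
print(allclose(V.dot(v), lam * v))  # True
```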

The eigenvectors represent the directions or components of the reduced subspace of $B$, whereas the eigenvalues represent the magnitudes of those directions. The eigenvectors can be sorted by their eigenvalues in descending order to provide a ranking of the components or axes of the new subspace for $A$. If all eigenvalues have a similar value, then we know that the existing representation may already be reasonably compressed or dense and that the projection may offer little. Eigenvalues close to zero represent components or axes of $B$ that may be discarded. A total of $m$ or fewer components must be selected to comprise the chosen subspace. Ideally, we would select the $k$ eigenvectors, called principal components, that have the $k$ largest eigenvalues.
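Note that `numpy.linalg.eig` does not return the eigenvalues in any guaranteed order, so the ranking described above should be made explicit. A minimal sketch of sorting the components by eigenvalue with `argsort` (the data matrix here is illustrative):

```python
from numpy import array, argsort, cov, mean
from numpy.linalg import eig

A = array([[1.0, 2.0], [3.0, 4.5], [5.0, 7.0]])
C = A - mean(A, axis=0)
values, vectors = eig(cov(C.T))
# indices that sort the eigenvalues in descending order
order = argsort(values)[::-1]
values = values[order]
vectors = vectors[:, order]  # reorder eigenvector columns to match
print(values)  # largest eigenvalue first
```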

$B = select(values, vectors)$

Other matrix decomposition methods can be used, such as the Singular-Value Decomposition, or SVD. When SVD is used, the values are referred to as singular values and the vectors of the subspace are referred to as principal components. Once chosen, data can be projected into the subspace via matrix multiplication:
$P = B^T \times A$
where $A$ is the original data that we wish to project, $B^T$ is the transpose of the chosen principal components, and $P$ is the projection of $A$ (in practice the centered data $C$ is often projected instead of the raw $A$). This is called the covariance method for calculating the PCA, although there are alternative ways to calculate it. The following program implements these steps.
*********************************************************************************
# principal component analysis
from numpy import array
from numpy import hstack
from numpy import mean
from numpy import cov
from numpy.linalg import eig
# define a 4x4 matrix
A = array([
    [1, 2, 3, 4],
    [3, 5, 5, 6],
    [5, 12, 9, 8],
    [9, 2, 11, 12]])
print(A)
# column means
M = mean(A, axis=0)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# factorize covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# eig does not sort the eigenvalues; here the first two columns
# happen to correspond to the two largest eigenvalues
v1 = vectors[:, 0]
v2 = vectors[:, 1]
B = hstack([v1.reshape(4, 1), v2.reshape(4, 1)])
print(B)
# project the data onto the two chosen components, reducing 4 dimensions to 2
P = B.T.dot(A)
print(P.T)
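As noted above, the same components can be obtained through the SVD without forming the covariance matrix explicitly: the rows of $V^T$ from the SVD of the centered data are the principal components, and the squared singular values divided by $n-1$ are the eigenvalues. A minimal sketch using the same matrix (the sign of each component may differ from the eigendecomposition route):

```python
from numpy import array, mean
from numpy.linalg import svd

A = array([[1, 2, 3, 4],
           [3, 5, 5, 6],
           [5, 12, 9, 8],
           [9, 2, 11, 12]], dtype=float)
C = A - mean(A, axis=0)  # center the columns
U, s, Vt = svd(C, full_matrices=False)
# squared singular values over (n - 1) equal the covariance eigenvalues
print(s**2 / (len(A) - 1))
# project the centered data onto the first two components
P = C.dot(Vt[:2].T)
print(P)
```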

Principal Component Analysis in scikit-learn
We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily. When creating the class, the number of components can be specified as a parameter. The class is first fit on a dataset by calling the fit() function, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() function. Once fit, the explained variances (eigenvalues) and principal components can be accessed on the PCA class via the explained_variance_ and components_ attributes. The example below demonstrates using this class by first creating an instance, fitting it on a 3x2 matrix, accessing the values and vectors of the projection, and transforming the original data.
# principal component analysis with scikit-learn
from numpy import array
from sklearn.decomposition import PCA
# define matrix
A = array([
[1, 2],
[3, 4],
[5, 6]])
print(A)
# create the transform
pca = PCA(2)
# fit transform
pca.fit(A)
# access values and vectors
print(pca.components_)
print(pca.explained_variance_)
# transform data
B = pca.transform(A)
print(B)
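Because the fitted PCA object stores the projection, it can be reused on data that was not part of the fit, and n_components can be smaller than the input dimension. A small sketch of both points (the extra rows are illustrative values, not from the original example):

```python
from numpy import array
from sklearn.decomposition import PCA

A = array([[1, 2], [3, 4], [5, 6]])
# keep only the single strongest component
pca = PCA(n_components=1)
pca.fit(A)
print(pca.transform(A))    # 3x1: training data reduced to one dimension
# the same fitted projection applies to unseen rows
new = array([[2, 3], [7, 8]])
print(pca.transform(new))  # 2x1
```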
***********************************************************************
The following program demonstrates that the covariance matrix can be reconstructed from only the two largest eigenvalues and their eigenvectors, which is the essence of PCA.
from numpy import array
from numpy import hstack
from numpy import mean
from numpy import cov
from numpy import diag
from numpy.linalg import eig
# define matrix
A = array([
    [1, 2, 3, 4],
    [3, 5, 5, 6],
    [5, 12, 9, 8],
    [9, 2, 11, 12]])

M = mean(A, axis=0)
# center columns by subtracting column means
C = A - M
# calculate covariance matrix of centered matrix
V = cov(C.T)
print(V)
# factorize covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# take the two eigenvectors with the largest eigenvalues
v1 = vectors[:, 0]
v2 = vectors[:, 1]
B = hstack([v1.reshape(4, 1), v2.reshape(4, 1)])
# take the two largest eigenvalues
e1 = values[0]
e2 = values[1]
E = diag([e1, e2])
print(B)
print(E)
# reconstruct the covariance matrix from the rank-2 approximation
P = B.dot(E).dot(B.T)
print(P)
# covariance matrix
[[11.66666667  0.16666667 12.         11.66666667]
 [ 0.16666667 22.25        4.66666667  0.16666667]
 [12.          4.66666667 13.33333333 12.        ]
 [11.66666667  0.16666667 12.         11.66666667]]
# eigenvectors
[[ 5.45010324e-01  1.86077548e-01 -7.07106781e-01  4.10291229e-01]
 [ 2.07687030e-01 -9.64745338e-01  4.98993645e-15  1.61655591e-01]
 [ 6.02323493e-01 -4.08962150e-03 -2.38802808e-14 -7.98241620e-01]
 [ 5.45010324e-01  1.86077548e-01  7.07106781e-01  4.10291229e-01]]
# eigenvalues
[3.66587624e+01 2.22054899e+01 3.47384374e-16 5.24143926e-02]
# eigenvectors corresponding to the two largest eigenvalues
[[ 0.54501032  0.18607755]
 [ 0.20768703 -0.96474534]
 [ 0.60232349 -0.00408962]
 [ 0.54501032  0.18607755]]
# two largest eigenvalues
[[36.65876241  0.        ]
 [ 0.         22.20548986]]
# reconstructed covariance matrix
[[11.65784329  0.16319024 12.01716632 11.65784329]
 [ 0.16319024 22.24863028  4.67343023  0.16319024]
 [12.01716632  4.67343023 13.29993542 12.01716632]
 [11.65784329  0.16319024 12.01716632 11.65784329]]
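The closeness of the reconstruction can also be checked numerically: since the two discarded eigenvalues are near zero, the rank-2 approximation is almost exact. A sketch of the check, reusing the same matrix $A$ and measuring the relative error in the Frobenius norm:

```python
from numpy import array, mean, cov, diag, argsort
from numpy.linalg import eig, norm

A = array([[1, 2, 3, 4],
           [3, 5, 5, 6],
           [5, 12, 9, 8],
           [9, 2, 11, 12]])
C = A - mean(A, axis=0)
V = cov(C.T)
values, vectors = eig(V)
# keep the two components with the largest eigenvalues
keep = argsort(values)[::-1][:2]
B = vectors[:, keep]
P = B.dot(diag(values[keep])).dot(B.T)
# relative reconstruction error (small because discarded eigenvalues are tiny)
print(norm(V - P) / norm(V))
```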
