
Showing posts from August, 2021

3.7 Gradients in a Deep Neural Network

In many machine learning applications, we find good model parameters by performing gradient descent, which relies on the fact that we can compute the gradient of a learning objective with respect to the parameters of the model. For a given objective function, we can obtain the gradient with respect to the model parameters using calculus and the chain rule. We have already seen the gradient of a squared loss with respect to the parameters of a linear regression model. Consider the function

$f(x)=\sqrt{x^2+\exp(x^2)}+\cos\left(x^2+\exp(x^2)\right)$

By applying the chain rule, and noting that differentiation is linear, we compute the gradient

$\frac{\mathrm{d} f}{\mathrm{d} x}=\frac{2x + 2x\exp(x^2)}{2\sqrt{x^2+\exp(x^2)}}-\sin\left(x^2+\exp(x^2)\right)\left(2x+2x\exp(x^2)\right)$

Writing out the gradient in this explicit way is often impractical, since it results in a very lengthy expression for the derivative. In practice, it means that, if we are not careful, the implementation of the gradient could be significantly more expensive than computing the function, which imposes unnecessary overhead.
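As a quick sanity check on the derivative above, here is a minimal sketch (assuming NumPy; the names f and df are just illustrative) that compares the hand-derived gradient with a central finite-difference approximation at a few points:

import numpy as np

def f(x):
    # f(x) = sqrt(x^2 + exp(x^2)) + cos(x^2 + exp(x^2))
    u = x**2 + np.exp(x**2)
    return np.sqrt(u) + np.cos(u)

def df(x):
    # chain rule: with u = x^2 + exp(x^2), du/dx = 2x + 2x*exp(x^2)
    u = x**2 + np.exp(x**2)
    du = 2*x + 2*x*np.exp(x**2)
    return du / (2*np.sqrt(u)) - np.sin(u) * du

h = 1e-6
for x in [0.1, 0.5, 1.0]:
    numeric = (f(x + h) - f(x - h)) / (2*h)
    print(x, df(x), numeric)   # the two gradient values should agree closely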

3.6 Gradient of a Least-Squares Loss in a Linear Model

Let $L=\|Ax-b\|^2$, where $A$ is a matrix and $x$ and $b$ are vectors. Derive $\mathrm{d}L$ in terms of $\mathrm{d}x$. Introduce the error vector $e = Ax-b$, so that $L=\|e\|^2=e^Te$. Then $\frac{\partial L}{\partial e}=2e^T$ and $\frac{\partial e}{\partial x}=A$, so by the chain rule $\frac{\partial L}{\partial x}=\frac{\partial L}{\partial e}\frac{\partial e}{\partial x}=2(Ax-b)^TA$, and therefore $\mathrm{d}L = 2(Ax-b)^TA\,\mathrm{d}x$.
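A minimal NumPy sketch (the shapes of A, b, and x here are arbitrary choices) that checks the gradient $2(Ax-b)^TA$ against a finite-difference approximation of $L$:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
b = rng.normal(size=5)
x = rng.normal(size=3)

def L(x):
    r = A @ x - b
    return r @ r                       # squared Euclidean norm ||Ax - b||^2

grad = 2 * (A @ x - b) @ A             # analytic gradient, shape (3,)

# central finite differences, one coordinate at a time
h = 1e-6
fd = np.array([(L(x + h*e) - L(x - h*e)) / (2*h) for e in np.eye(3)])
print(np.allclose(grad, fd, atol=1e-4))   # expect True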

3.5 Gradients of Matrices

We will encounter situations where we need to take gradients of matrices with respect to vectors (or other matrices), which results in a multidimensional tensor. We can think of this tensor as a multidimensional array that collects partial derivatives. For example, if we compute the gradient of an $m \times n$ matrix $A$ with respect to a $p \times q$ matrix $B$, the resulting Jacobian would be $(m \times n) \times (p \times q)$, i.e., a four-dimensional tensor $J$, whose entries are given by $J_{ijkl}=\frac{\partial A_{ij}}{\partial B_{kl}}$. Since matrices represent linear mappings, we can exploit the fact that there is a vector-space isomorphism (linear, invertible mapping) between the space $\mathbb{R}^{m \times n}$ of $m \times n$ matrices and the space $\mathbb{R}^{mn}$ of $mn$-dimensional vectors. Therefore, we can reshape our matrices into vectors of lengths $mn$ and $pq$, respectively. The gradient using these vectorized matrices results in a Jacobian matrix of size $mn \times pq$. The following figure visualizes both approaches.
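To make the reshaping idea concrete, here is a small sketch (assuming NumPy; the linear map $A = MB$ and the dimensions are arbitrary choices for illustration) that builds the four-dimensional tensor of partial derivatives by finite differences and then reshapes it into an $mn \times pq$ Jacobian matrix:

import numpy as np

rng = np.random.default_rng(1)
m, p, q = 2, 4, 3
M = rng.normal(size=(m, p))
B = rng.normal(size=(p, q))

def f(B):
    return M @ B            # A = f(B) is an m x q matrix

# 4-D tensor of partials J[i, j, k, l] = dA_ij / dB_kl, built by finite differences
h = 1e-6
A = f(B)
J = np.zeros(A.shape + B.shape)
for k in range(p):
    for l in range(q):
        E = np.zeros_like(B)
        E[k, l] = h
        J[:, :, k, l] = (f(B + E) - f(B - E)) / (2 * h)

# reshaping the matrices into vectors turns the same object into a Jacobian matrix
J_flat = J.reshape(A.size, B.size)
print(J_flat.shape)                                  # (6, 12)
print(np.allclose(J_flat, np.kron(M, np.eye(q))))    # True for this linear map (row-major flattening)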

2.7 Orthogonal Complement

Having defined orthogonality, let us now look at vector spaces that are orthogonal to each other. This plays a major role in dimensionality reduction. Consider a $D$-dimensional vector space $V$ and an $M$-dimensional subspace $U \subseteq V$. Then its orthogonal complement $U^\perp$ is a $(D-M)$-dimensional subspace of $V$ and contains all vectors in $V$ that are orthogonal to every vector in $U$. Furthermore, $U \cap U^\perp=\{0\}$, so that any vector $x \in V$ can be uniquely decomposed into $x=\sum_{m=1}^{M} \lambda_m b_m + \sum_{j=1}^{D-M} \psi_j b_j^{\perp}$, $\lambda_m,\psi_j \in \mathbb{R}$, where $(b_1,\ldots,b_M)$ is a basis of $U$ and $(b_1^\perp,\ldots,b_{D-M}^\perp)$ is a basis of $U^\perp$. Therefore, the orthogonal complement can also be used to describe a plane $U$ (a two-dimensional subspace) in a three-dimensional vector space. More specifically, the vector $w$ with $\|w\|=1$, which is orthogonal to the plane $U$, is the basis vector of $U^\perp$. All vectors that are orthogonal to $w$ must lie in the plane $U$.
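As a small numerical illustration of this decomposition (assuming NumPy; the two-dimensional subspace $U \subseteq \mathbb{R}^3$ below is an arbitrary example), a full QR factorization gives orthonormal bases of $U$ and $U^\perp$, and a vector splits uniquely into its two components:

import numpy as np

rng = np.random.default_rng(2)
# U is spanned by the columns of this 3 x 2 matrix (arbitrary example)
U_span = rng.normal(size=(3, 2))

# full QR of the spanning set: the first two columns of Q span U,
# the remaining column spans the orthogonal complement U_perp
Q, _ = np.linalg.qr(U_span, mode='complete')
B_U, B_perp = Q[:, :2], Q[:, 2:]

x = rng.normal(size=3)
x_U = B_U @ (B_U.T @ x)            # component of x inside U
x_perp = B_perp @ (B_perp.T @ x)   # component of x inside U_perp

print(np.allclose(x, x_U + x_perp))    # unique decomposition: True
print(np.allclose(B_U.T @ B_perp, 0))  # the two bases are mutually orthogonal: True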

2.6 Orthonormal Basis

We found that in an $n$-dimensional vector space, we need $n$ basis vectors, i.e., $n$ vectors that are linearly independent. We will discuss the special case where the basis vectors are orthogonal to each other and where each basis vector has length 1. We will then call this basis an orthonormal basis. Consider an $n$-dimensional vector space $V$ and a basis $\{b_1,\ldots,b_n\}$ of $V$. If

$\langle b_i,b_j \rangle = 0$ for $i \ne j$    (1)

$\langle b_i,b_i \rangle = 1$    (2)

for all $i,j=1,\ldots,n$, then the basis is called an orthonormal basis (ONB). If only equation (1) is satisfied, then the basis is called an orthogonal basis. Equation (2) implies that every basis vector has length/norm 1.

Note: we can use Gaussian elimination to find a basis for a vector space spanned by a set of vectors. Assume we are given a set $\{\tilde{b}_1,\ldots,\tilde{b}_n\}$ of non-orthogonal and unnormalized basis vectors. We concatenate them into a matrix $\tilde{B} = [\tilde{b}_1,\ldots,\tilde{b}_n]$ and apply Gaussian elimination to the augmented matrix $[\tilde{B}\tilde{B}^T \mid \tilde{B}]$ to obtain an orthonormal basis.
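In practice, a common way to turn a set of non-orthogonal, unnormalized basis vectors into an orthonormal basis is the Gram-Schmidt process or, equivalently, a QR factorization (used here as a practical alternative to the Gaussian-elimination construction above); the NumPy sketch below applies np.linalg.qr to an arbitrary example set and checks conditions (1) and (2):

import numpy as np

# columns are a non-orthogonal, unnormalized basis of R^3 (arbitrary example)
B_tilde = np.array([[1., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 1.]])

# QR factorization: the columns of Q form an orthonormal basis
# of the space spanned by the columns of B_tilde
Q, R = np.linalg.qr(B_tilde)

# ONB conditions: <b_i, b_j> = 0 for i != j and <b_i, b_i> = 1, i.e. Q^T Q = I
print(np.allclose(Q.T @ Q, np.eye(3)))   # True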

2.5 Orthogonal Matrix

A square matrix $A \in \mathbb{R}^{n \times n}$ is an orthogonal matrix if and only if its columns are orthonormal, so that $AA^T=A^TA=I$, which implies that $A^{-1}=A^T$. Transformations by orthogonal matrices are special because the length of a vector $x$ is not changed when transforming it using an orthogonal matrix $A$. For the dot product we obtain

$\left \| Ax \right \|^2= (Ax)^T(Ax)=x^TA^TAx=x^TIx=x^Tx=\left \| x \right \|^2$

Moreover, the angle between two vectors $x, y$ is unchanged when transforming both of them using an orthogonal matrix $A$:

$\cos \omega=\frac{(Ax)^T(Ay)}{\left \| Ax \right \| \left \| Ay \right \|}=\frac{x^TA^TAy}{\sqrt{x^TA^TAx \, y^TA^TAy}}=\frac{x^Ty}{\left \| x \right \| \left \| y \right \|}$

This means that orthogonal matrices $A$ with $A^T=A^{-1}$ preserve both angles and distances.

E.g., the rotation matrix $\begin{bmatrix} \cos \theta & -\sin \theta\\ \sin \theta & \cos \theta \end{bmatrix}$

import numpy as np
# second row is an assumed completion (the original snippet was cut off): a rotation by -45 degrees, so the columns are orthonormal
B = np.array([[1/np.sqrt(2), 1/np.sqrt(2)], [-1/np.sqrt(2), 1/np.sqrt(2)]])
print(B @ B.T)   # approximately the identity matrix, so B is orthogonal
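To see the length- and angle-preservation properties numerically, here is a short self-contained sketch (the rotation angle and the test vectors are arbitrary choices):

import numpy as np

theta = np.pi / 6
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation matrix, hence orthogonal

x = np.array([3.0, 1.0])
y = np.array([-1.0, 2.0])

# lengths are preserved: ||Ax|| = ||x||
print(np.isclose(np.linalg.norm(A @ x), np.linalg.norm(x)))   # True

# angles are preserved: the cosine of the angle is the same before and after the transformation
cos_before = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
cos_after = (A @ x) @ (A @ y) / (np.linalg.norm(A @ x) * np.linalg.norm(A @ y))
print(np.isclose(cos_before, cos_after))   # True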