

Showing posts from June, 2021

3.7 Useful Identities for computing gradients

 

3.4 Gradients of Vector-Valued Functions

We discussed partial derivatives and gradients of functions $f:\mathbb{R}^n \to \mathbb{R}$ mapping to the real numbers. Let us generalize this concept of the gradient to vector-valued functions (vector fields) $f: \mathbb{R}^n \to \mathbb{R}^m$, where $n \ge 1$ and $m \ge 1$. For a function $f:\mathbb{R}^n \to \mathbb{R}^m$ and a vector $x=[x_1,\ldots,x_n]^T \in \mathbb{R}^n$, the corresponding vector of function values is given as $f(x)=\begin{bmatrix} f_1(x)\\ \vdots \\ f_m(x) \end{bmatrix} \in \mathbb{R}^m$, where each $f_i: \mathbb{R}^n \to \mathbb{R}$ maps onto $\mathbb{R}$. The partial derivative of a vector-valued function $f:\mathbb{R}^n \to \mathbb{R}^m$ with respect to $x_i \in \mathbb{R}$, $i=1,\ldots,n$, is given as the column vector $\frac{\partial f}{\partial x_i}=\begin{bmatrix} \frac{\partial f_1}{\partial x_i}\\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix} \in \mathbb{R}^m$. We know that the gradient of $f$ with respect to a vector is the row vector of partial derivatives, and every partial derivative $\frac{\partial f}{\partial x_i}$ is itself a column vector. Therefore we obtain the gradient of $f:\mathbb{R}^n \to \mathbb{R}^m$ by collecting these partial derivatives column by column into an $m \times n$ matrix, the Jacobian.
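As a rough illustration of this construction (not from the original post), the sketch below builds the Jacobian of an example vector field $f:\mathbb{R}^2 \to \mathbb{R}^3$ column by column using central differences; the function `f` and the helper `numerical_jacobian` are illustrative names, and NumPy is assumed.

```python
import numpy as np

def f(x):
    # An example vector field f: R^2 -> R^3, chosen purely for illustration.
    x1, x2 = x
    return np.array([x1**2 * x2, x1 + x2**3, np.sin(x1 * x2)])

def numerical_jacobian(f, x, h=1e-6):
    """Approximate the m x n Jacobian of f at x with central differences."""
    x = np.asarray(x, dtype=float)
    m, n = f(x).size, x.size
    J = np.zeros((m, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        J[:, i] = (f(x + e) - f(x - e)) / (2 * h)  # i-th column approximates df/dx_i
    return J

x0 = np.array([1.0, 2.0])
print(numerical_jacobian(f, x0))  # a 3 x 2 matrix of partial derivatives
```

Each column $i$ of the result approximates the partial derivative $\frac{\partial f}{\partial x_i}$, and stacking the columns side by side gives the $m \times n$ gradient (Jacobian).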

3.3 Partial Differentiation and Gradients-Jacobian

The generalization of the derivative to functions of several variables is the gradient. Here the function $f$ depends on one or more variables $x \in \mathbb{R}^n$, e.g., $f(x) = f(x_1,x_2)$. We find the gradient of the function $f$ with respect to $x$ by varying one variable at a time and keeping the others constant. The gradient is then the collection of these partial derivatives. Definition (Partial Derivative): For a function $f: \mathbb{R}^n \to \mathbb{R}$, $x \mapsto f(x)$, $x \in \mathbb{R}^n$, of $n$ variables $x_1,x_2,\ldots,x_n$, we define the partial derivatives as $\frac{\partial f}{\partial x_1}=\lim_{h \to 0}\frac{f(x_1+h,x_2,\ldots,x_n)-f(x)}{h}$, $\ldots$, $\frac{\partial f}{\partial x_n}=\lim_{h \to 0}\frac{f(x_1,x_2,\ldots,x_n+h)-f(x)}{h}$ and collect them in a row vector. This row vector is called the gradient of $f$ or the Jacobian and is the generalization of the derivative from the univariate case. Example: if $f(x_1,x_2)=x_1^2x_2+x_1x_2^3 \in \mathbb{R}$, then the partial derivatives of $f$ with respect to $x_1$ and $x_2$ are $\frac{\partial f(x_1,x_2)}{\partial x_1} = 2x_1x_2 + x_2^3$ and $\frac{\partial f(x_1,x_2)}{\partial x_2} = x_1^2 + 3x_1x_2^2$.
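As a quick check of the example above (not part of the post), a symbolic computation reproduces the two partial derivatives and collects them in a $1 \times 2$ row vector; SymPy is an assumed dependency here.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = x1**2 * x2 + x1 * x2**3

# Gradient as a 1 x 2 row vector (the Jacobian of the scalar-valued f).
grad = sp.Matrix([[sp.diff(f, x1), sp.diff(f, x2)]])
print(grad)  # Matrix([[2*x1*x2 + x2**3, x1**2 + 3*x1*x2**2]])
```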

3.2 Taylor Series and Taylor Polynomial

The Taylor series is a representation of a function $f$ as an infinite sum of terms. These terms are determined using derivatives of $f$ evaluated at $x_0$. Taylor Polynomial: The Taylor polynomial of degree $n$ of $f : \mathbb{R} \to \mathbb{R}$ at $x_0$ is defined as $T_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!} (x-x_0)^k$, where $f^{(k)}(x_0)$ is the $k^{th}$ derivative of $f$ at $x_0$ (which we assume exists) and $\frac{f^{(k)}(x_0)}{k!}$ are the coefficients of the polynomial. Taylor Series: For a smooth function $f \in C^\infty$, $f: \mathbb{R} \to \mathbb{R}$, the Taylor series of $f$ at $x_0$ is defined as $T_\infty(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x-x_0)^k$. At $x_0=0$, we obtain the Maclaurin series as a special instance of the Taylor series. If $f(x)=T_\infty(x)$, then $f$ is called analytic. In general, a Taylor polynomial of degree $n$ is an approximation of a function, which does not need to be a polynomial. The Taylor polynomial is similar to $f$ in a neighborhood around $x_0$.
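To make the definition concrete (this sketch is not from the post), the snippet below evaluates the degree-$n$ Taylor polynomial of $\exp$ at $x_0=0$, using the fact that every derivative of $\exp$ at $x_0$ equals $\exp(x_0)$; `taylor_exp` is an illustrative name and NumPy is assumed.

```python
import numpy as np
from math import factorial

def taylor_exp(x, x0=0.0, n=5):
    """Degree-n Taylor polynomial of exp at x0; every derivative of exp at x0 is exp(x0)."""
    return sum(np.exp(x0) / factorial(k) * (x - x0) ** k for k in range(n + 1))

x = 1.0
for n in (1, 3, 5, 9):
    print(n, taylor_exp(x, n=n), np.exp(x))  # approximation improves as n grows
```

The printed values approach $e \approx 2.71828$ as the degree increases, which is the sense in which $T_n$ approximates $f$ near $x_0$.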

3.1 Differentiation of Univariate Functions

Many algorithms in machine learning optimize an objective function with respect to a set of desired model parameters that control how well a model explains the data: finding good parameters can be phrased as an optimization problem. Examples include linear regression, where we look at curve-fitting problems and optimize linear weight parameters to maximize the likelihood. A function $f$ is a quantity that relates two quantities to each other. These quantities are typically inputs $x \in \mathbb{R}^D$ and targets (function values) $f(x)$, which we assume are real-valued if not stated otherwise. Here $\mathbb{R}^D$ is the domain of $f$, and the function values $f(x)$ are the image/codomain of $f$. We often write $f: \mathbb{R}^D \to \mathbb{R}$, $x \mapsto f(x)$ to specify a function. Example: the dot product $f(x)=x^Tx$, $x \in \mathbb{R}^2$, would be specified as $f : \mathbb{R}^2 \to \mathbb{R}$, $x \mapsto x_1^2+x_2^2$. Vector calculus is one of the fundamental mathematical tools we need in machine learning.
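A small, hedged sketch of the dot-product example (not part of the original post), assuming NumPy:

```python
import numpy as np

# The dot-product example: f(x) = x^T x = x1^2 + x2^2 for x in R^2.
def f(x):
    x = np.asarray(x, dtype=float)
    return x @ x

x = np.array([3.0, 4.0])
print(f(x))               # 25.0
print(x[0]**2 + x[1]**2)  # 25.0, the same value written componentwise
```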

2.15 Matrix Approximation

SVD is a way to factorize $A=U\Sigma V^T \in \mathbb{R}^{m \times n}$ into the product of three matrices, where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal and $\Sigma$ contains the singular values on its main diagonal. Instead of doing the full SVD factorization, we will now investigate how the SVD allows us to represent a matrix $A$ as a sum of simpler (low-rank) matrices $A_i$, which lends itself to a matrix approximation scheme that is cheaper to compute than the full SVD. We consider a rank-1 matrix $A_i \in \mathbb{R}^{m \times n}$ of the form $A_i=u_iv_i^T$, which is formed by the outer product of the $i^{th}$ orthogonal column vectors of $U$ and $V$. A matrix $A \in \mathbb{R}^{m \times n}$ of rank $r$ can be written as a sum of rank-1 matrices $A_i$, so that $A=\sum_{i=1}^{r}\sigma_iu_iv_i^{T}=\sum_{i=1}^{r}\sigma_iA_i$, where the outer-product matrices $A_i$ are weighted by the $i^{th}$ singular value $\sigma_i$. We summed up the $r$ rank-1 matrices to reconstruct $A$ exactly; if the sum runs only up to some intermediate value $k < r$, we obtain a rank-$k$ approximation of $A$.
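A minimal NumPy sketch of this idea (not from the post): sum only the first $k$ weighted outer products $\sigma_i u_i v_i^T$ to obtain a rank-$k$ approximation; the random test matrix is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the k largest singular values/vectors
A_k = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
# Equivalent: A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.linalg.matrix_rank(A_k))  # 2
print(np.linalg.norm(A - A_k, 2))  # spectral-norm error; equals s[k], the largest discarded value
```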

2.14 Singular Value Decomposition

The Singular Value Decomposition (SVD) of a matrix is a central matrix decomposition method in linear algebra. It can be applied to all matrices, not only to square matrices, and it always exists. It has been referred to as the 'fundamental theorem of linear algebra' (Strang, 1993). SVD Theorem: Let $A \in \mathbb{R}^{m \times n}$ be a rectangular matrix of rank $r \in [0,\min(m,n)]$. The SVD of $A$ is a decomposition of the form $A= U \Sigma V^T$ with an orthogonal matrix $U \in \mathbb{R}^{m \times m}$ with column vectors $u_i$, $i=1,\ldots,m$, and an orthogonal matrix $V \in \mathbb{R}^{n \times n}$ with column vectors $v_j$, $j=1,\ldots,n$. Moreover, $\Sigma$ is an $m \times n$ matrix with $\Sigma_{ii} = \sigma_i \ge 0$ and $\Sigma_{ij}=0$ for $i \ne j$. The diagonal entries $\sigma_i$, $i=1,\ldots,r$, of $\Sigma$ are called singular values, $u_i$ are called left singular vectors, and $v_j$ are called right singular vectors. By convention the singular values are ordered, i.e., $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0$.
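As a hedged illustration (not from the post), NumPy's `np.linalg.svd` returns exactly these factors; the small rectangular matrix below is an arbitrary example.

```python
import numpy as np

A = np.array([[ 1.0, 0.0, 1.0],
              [-2.0, 1.0, 0.0]])         # a 2 x 3 (non-square) matrix

U, s, Vt = np.linalg.svd(A)              # U: 2x2 orthogonal, Vt: 3x3 orthogonal, s: singular values
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)     # embed the singular values in a 2 x 3 matrix

print(s)                                 # sorted in decreasing order by convention
print(np.allclose(A, U @ Sigma @ Vt))    # True: A = U Sigma V^T
```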

2.13 Eigen decomposition and diagonalization

Eigendecomposition is also known as eigenvalue decomposition, spectral decomposition, or diagonalization. A diagonal matrix is a matrix that has value zero on all off-diagonal elements, i.e., it is of the form $D=\begin{bmatrix} c_1 & \cdots & 0\\ \vdots&\ddots & \vdots \\ 0& \cdots & c_n \end{bmatrix}$. Diagonal matrices allow fast computation of determinants, powers, and inverses: the determinant is the product of the diagonal entries, a matrix power $D^k$ is given by each diagonal element raised to the power $k$, and the inverse $D^{-1}$ is obtained by taking the reciprocal of each diagonal element, if all of them are nonzero. Definition (Diagonalizable): A matrix $A \in \mathbb{R}^{n \times n}$ is diagonalizable if it is similar to a diagonal matrix, i.e., if there exists an invertible matrix $P \in \mathbb{R}^{n \times n}$ such that $D=P^{-1}AP$. Eigendecomposition can be done only on a square matrix. Let $A \in \mathbb{R}^{n \times n}$ and let $\lambda_1,\lambda_2,\ldots,\lambda_n$ be the eigenvalues of $A$.
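A quick numerical check (not part of the post), assuming NumPy; the symmetric matrix below is an illustrative choice that is guaranteed to be diagonalizable.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # symmetric, hence diagonalizable

eigvals, P = np.linalg.eig(A)            # columns of P are eigenvectors of A
D = np.diag(eigvals)

print(eigvals)                                    # e.g. [3. 1.]
print(np.allclose(np.linalg.inv(P) @ A @ P, D))   # True: D = P^{-1} A P
print(np.allclose(A, P @ D @ np.linalg.inv(P)))   # True: A = P D P^{-1}
```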

2.12 Cholesky Decomposition of a matrix

Cholesky Decomposition: We have a square-root-like operation that gives us a decomposition of a number into identical components, e.g., $9 = 3 \cdot 3$. For matrices, we need to be careful that we compute a square-root-like operation on positive quantities. For symmetric positive definite matrices, we can choose from a number of square-root-equivalent operations. The Cholesky decomposition/Cholesky factorization provides a square-root-equivalent operation on symmetric positive definite matrices that is useful in practice. The decomposition is defined as follows: $A = LL^T$, where $A$ is the matrix being decomposed, $L$ is a lower triangular matrix, and $L^T$ is the transpose of $L$. The decomposition can also be written as the product of an upper triangular matrix and its transpose, for example $A = U^TU$, where $U$ is the upper triangular matrix. The Cholesky decomposition is used for solving linear least squares in linear regression, as well as in simulation and optimization methods. When decomposing symmetric positive definite matrices, the Cholesky factor $L$ is unique.
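A small sketch (not from the post) of both the factorization and its use for solving a linear system, assuming NumPy; the positive definite matrix and right-hand side are illustrative choices.

```python
import numpy as np

# A symmetric positive definite matrix (illustrative choice).
A = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 0.5],
              [1.0, 0.5, 2.0]])

L = np.linalg.cholesky(A)        # lower triangular Cholesky factor
print(np.allclose(A, L @ L.T))   # True: A = L L^T

# Using the factorization to solve A x = b, as in linear least squares
# (a dedicated triangular solver would normally be used for the two steps).
b = np.array([1.0, 2.0, 3.0])
y = np.linalg.solve(L, b)        # forward substitution: L y = b
x = np.linalg.solve(L.T, y)      # back substitution:    L^T x = y
print(np.allclose(A @ x, b))     # True
```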