Basic Mathematical Concepts for Machine Learning
In this blog post, we will explore a variety of basic mathematical concepts that underpin many machine learning algorithms.
1. Gaussian Processes
Gaussian Processes (GPs) are non-parametric methods that model a distribution over functions such that any finite collection of function values follows a multivariate Gaussian distribution. GPs assume a prior over functions and update the posterior over functions based on the observed data points.
Given a collection of data points $\{x^{(1)}, \dots, x^{(N)}\}$, GPs assume that they follow a jointly multivariate Gaussian distribution, defined by a mean function $\mu(x)$ and a covariance matrix $\Sigma$. Each entry at location $(i, j)$ in the covariance matrix is defined by a kernel function, $\Sigma_{i, j} = K(x^{(i)}, x^{(j)})$, also known as a covariance function. The core idea is that if two data points are considered similar by the kernel function, their function outputs should also be close.
The joint distribution of the observed outputs $y^{(1)}, \dots, y^{(N)}$ and the prediction at a test point $y^*$ is given by:
$$ \begin{aligned} \begin{bmatrix} y \\ y^* \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} \mu(X) \\ \mu(x^*) \end{bmatrix}, \begin{bmatrix} \Sigma(X, X) & \Sigma(X, x^*) \\ \Sigma(x^*, X) & \Sigma(x^*, x^*) \end{bmatrix} \right) \end{aligned} $$where $\Sigma(X, X)$ is the covariance matrix of the observed data points, $\Sigma(X, x^*)$ and $\Sigma(x^*, X)$ are the covariance matrices between the observed data points and the test point, and $\Sigma(x^*, x^*)$ is the covariance of the test point.
To make a prediction at the test point $x^*$, we compute the conditional distribution $p(y^* | x^*, X, y)$. Using properties of the multivariate Gaussian distribution, we obtain:
$$ \begin{aligned} y^* | x^*, X, y &\sim \mathcal{N} \Bigg( \mu(x^*) + \Sigma(x^*, X) \Sigma(X, X)^{-1} (y - \mu(X)), \\ &\qquad \Sigma(x^*, x^*) - \Sigma(x^*, X) \Sigma(X, X)^{-1} \Sigma(X, x^*) \Bigg) \end{aligned} $$This conditional distribution provides the predictive mean and variance at the test point $x^*$, which can be used for making predictions and quantifying uncertainty.
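To make these equations concrete, here is a minimal NumPy sketch of GP regression with a zero mean function and an RBF kernel; the bandwidth `sigma` and jitter `noise` values below are illustrative assumptions, not prescriptions:

```python
import numpy as np

# Minimal GP regression sketch: zero prior mean, RBF kernel.
# `sigma` (kernel bandwidth) and `noise` (jitter) are illustrative choices.

def rbf_kernel(A, B, sigma=1.0):
    # Pairwise squared Euclidean distances between rows of A and B.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def gp_predict(X, y, X_star, sigma=1.0, noise=1e-6):
    K = rbf_kernel(X, X, sigma) + noise * np.eye(len(X))  # Sigma(X, X)
    K_star = rbf_kernel(X, X_star, sigma)                 # Sigma(X, x*)
    K_ss = rbf_kernel(X_star, X_star, sigma)              # Sigma(x*, x*)
    K_inv = np.linalg.inv(K)
    mean = K_star.T @ K_inv @ y                 # predictive mean (zero prior mean)
    cov = K_ss - K_star.T @ K_inv @ K_star      # predictive covariance
    return mean, cov

# Toy usage: observe sin(x) at a few points, predict at new locations.
X = np.linspace(0, 5, 8).reshape(-1, 1)
y = np.sin(X).ravel()
X_star = np.array([[2.5], [6.0]])
mean, cov = gp_predict(X, y, X_star)
print(mean, np.diag(cov))  # predictive means and variances
```

For numerical stability, practical implementations typically factor $\Sigma(X, X)$ with a Cholesky decomposition rather than inverting it explicitly, as done here for brevity.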
In summary, Gaussian Processes provide a powerful and flexible approach to modeling complex relationships between data points. They have been used in a wide range of applications, from regression and classification tasks to optimization and reinforcement learning. By understanding the underlying principles of GPs, we can better appreciate their strengths and limitations, and make informed choices when using them in practice.
2. Kernel Methods and Kernels
Kernels are fundamental components in various machine learning algorithms, particularly in non-parametric, instance-based techniques. A kernel is essentially a similarity function between two data points, $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. It describes how sensitive the prediction for one data sample is to the prediction for the other, or in other words, how similar two data points are. The kernel should be symmetric, $K(x, x') = K(x', x)$.
Depending on the problem structure, some kernels can be decomposed into a feature map $\varphi$ applied to each data point, with the kernel value given by the inner product of the two feature vectors: $K(x, x') = \langle \varphi(x), \varphi(x') \rangle$.
Common Kernel Functions
- Linear Kernel: $K(x, x') = x^\top x'$
- Polynomial Kernel: $K(x, x') = (x^\top x' + c)^d$, where $c$ and $d$ are constants
- Radial Basis Function (RBF) Kernel / Gaussian Kernel: $K(x, x') = \exp(-\frac{\lVert x - x' \rVert^2}{2\sigma^2})$, where $\sigma^2$ is a constant
- Sigmoid Kernel: $K(x, x') = \tanh(\alpha x^\top x' + \beta)$, where $\alpha$ and $\beta$ are constants
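The following sketch implements the four kernels above in NumPy and, as a sanity check, verifies the feature-map view for the degree-2 polynomial kernel with $c = 0$, whose explicit feature map on 2-D inputs is $\varphi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$; all constants are illustrative:

```python
import numpy as np

# Sketches of the kernels listed above; hyperparameters are illustrative.

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, c=1.0, d=2):
    return (x @ xp + c) ** d

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))

def sigmoid_kernel(x, xp, alpha=0.5, beta=0.0):
    return np.tanh(alpha * (x @ xp) + beta)

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel with c = 0.
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

# The kernel value equals the inner product of the feature vectors.
assert np.isclose(polynomial_kernel(x, xp, c=0.0, d=2), phi(x) @ phi(xp))
```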
Kernel methods are a family of non-parametric, instance-based machine learning algorithms. Assuming we know the labels of all training samples $\{x^{(i)}, y^{(i)}\}$, the label for a new input $x$ can be predicted by a weighted sum $\sum_{i} K(x^{(i)}, x)y^{(i)}$. Popular kernel methods include Support Vector Machines (SVMs) and Kernel Principal Component Analysis (Kernel PCA).
Representer Theorem
The Representer Theorem is a fundamental result in kernel methods, which states that the optimal solution of a regularized problem in a Reproducing Kernel Hilbert Space (RKHS) can be expressed as a linear combination of kernel functions evaluated at the training points:
$$ f^*(x) = \sum_{i=1}^N \alpha_i K(x^{(i)}, x) $$The Representer Theorem is widely used in the derivation of kernel-based algorithms and ensures that these algorithms can be effectively implemented in practice.
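Kernel ridge regression is a compact illustration of the theorem: minimizing the regularized squared loss in the RKHS yields coefficients $\alpha = (K + \lambda I)^{-1} y$, so the fitted function is exactly a weighted sum of kernels at the training points. A minimal sketch, assuming an RBF kernel and an illustrative $\lambda$:

```python
import numpy as np

# Kernel ridge regression: f*(x) = sum_i alpha_i K(x_i, x),
# with alpha = (K + lam * I)^{-1} y. Kernel and lam are illustrative.

def rbf(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

X = np.random.default_rng(0).uniform(0, 5, size=(30, 1))
y = np.sin(X).ravel()

lam = 1e-3
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # coefficients alpha_i

X_new = np.array([[1.0], [4.0]])
f_new = rbf(X, X_new).T @ alpha   # f*(x) = sum_i alpha_i K(x_i, x)
print(f_new)                      # should be close to sin(1), sin(4)
```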
3. Taylor Series Expansion
The Taylor series expansion is a powerful mathematical tool used to approximate functions as an infinite sum of terms based on the function's derivatives. It enables us to express a function, $f(x)$, in terms of its derivatives evaluated at a specific point, $x = a$. The Taylor series expansion can be written as:
$$ f(x) = f(a) + \sum_{k=1}^\infty \frac{1}{k!} (x - a)^k \nabla^k_x f(x)\Big\vert_{x=a} $$where $\nabla^k$ denotes the $k$-th derivative. The Taylor series is particularly useful in situations where the function is too complex to compute directly, or when we need to obtain a simpler, more tractable approximation of the function.
The first-order Taylor expansion, also known as the linear approximation, keeps only the constant and first-derivative terms of the Taylor series:
$$ f(x) \approx f(a) + (x - a) \nabla_x f(x)\Big\vert_{x=a} $$This linear approximation is often used as a starting point for more advanced mathematical analysis or numerical methods, such as Newton's method for finding roots of a function. The higher-order terms in the Taylor series can be used to obtain a more accurate approximation of the function, but at the cost of increased complexity.
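As a small worked example of the linear approximation in action, Newton's method repeatedly sets the linearization $f(a) + (x - a) f'(a)$ to zero and solves for $x$; the test function and tolerances below are illustrative:

```python
# Newton's method for root finding, derived from the first-order Taylor
# expansion: f(x) ~ f(a) + (x - a) f'(a) = 0  =>  x = a - f(a) / f'(a).
# Demonstrated on f(x) = x^2 - 2, whose positive root is sqrt(2).

def newton(f, f_prime, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)   # solve the linearized equation for the update
        x -= step
        if abs(step) < tol:
            break
    return x

root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ~ 1.41421356...
```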
In some cases, it is useful to consider the Taylor series expansion in multiple dimensions. For a multivariate function, $f(\mathbf{x})$, where $\mathbf{x} \in \mathbb{R}^n$, the Taylor series expansion can be written as:
$$ f(\mathbf{x}) = f(\mathbf{a}) + (\mathbf{x} - \mathbf{a})^\top \nabla_{\mathbf{x}} f(\mathbf{x})\Big\vert_{\mathbf{x}=\mathbf{a}} + \frac{1}{2} (\mathbf{x} - \mathbf{a})^\top \nabla^2_{\mathbf{x}} f(\mathbf{x})\Big\vert_{\mathbf{x}=\mathbf{a}} (\mathbf{x} - \mathbf{a}) + \dots $$where $\nabla_{\mathbf{x}} f$ is the gradient vector and $\nabla^2_{\mathbf{x}} f$ is the Hessian matrix; the higher-order terms involve higher-order derivative tensors contracted with copies of $(\mathbf{x} - \mathbf{a})$.
The Taylor series expansion has widespread applications in various fields, including machine learning, optimization, numerical analysis, and physics. By understanding and leveraging the properties of Taylor series expansions, researchers can develop more effective and efficient algorithms to tackle complex problems.
4. Vector-to-vector Derivatives
Vector-to-vector derivatives are crucial in understanding the behavior of functions that map from one vector space to another. The Jacobian matrix is a key concept in this context, representing the first-order partial derivatives of a vector-valued function with respect to its input vector.
Given an input vector $\mathbf{x} \in \mathbb{R}^n$ (as a column vector) and a function $f: \mathbb{R}^n \to \mathbb{R}^m$, the derivative of $f$ with respect to $\mathbf{x}$ is an $m\times n$ matrix, also known as the Jacobian matrix:
$$ J = \frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \dots &\frac{\partial f_1}{\partial x_n} \\ \vdots & & \\ \frac{\partial f_m}{\partial x_1} & \dots &\frac{\partial f_m}{\partial x_n} \\ \end{bmatrix} \in \mathbb{R}^{m \times n} $$Throughout this section, integer subscript(s) are used to refer to a single entry out of a vector or matrix value; i.e., $x_i$ indicates the $i$-th value in the vector $\mathbf{x}$, and $f_i(\cdot)$ is the $i$-th entry in the output of the function.
The gradient of a vector with respect to a vector is defined as $\nabla_\mathbf{x} f = J^\top \in \mathbb{R}^{n \times m}$, and this formulation also holds when $m=1$ (i.e., for a scalar-valued function, whose gradient is a column vector).
Consider a scalar function $g(\mathbf{x}) = \mathbf{a}^\top \mathbf{x}$, where $\mathbf{a} \in \mathbb{R}^n$ and $\mathbf{x} \in \mathbb{R}^n$. The gradient of $g$ with respect to $\mathbf{x}$ can be computed as:
$$ \nabla_\mathbf{x} g = \begin{bmatrix} \frac{\partial g}{\partial x_1} \\ \vdots \\ \frac{\partial g}{\partial x_n} \end{bmatrix} = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} = \mathbf{a} $$Now, consider a more complex example. Let $h(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x}$, where $A \in \mathbb{R}^{n \times n}$ is a square matrix, and $\mathbf{x} \in \mathbb{R}^n$. The gradient of $h$ with respect to $\mathbf{x}$ can be computed as:
$$ \begin{align*} \nabla_\mathbf{x} h &= \begin{bmatrix} \frac{\partial h}{\partial x_1} \\ \vdots \\ \frac{\partial h}{\partial x_n} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\partial (\mathbf{x}^\top A \mathbf{x})}{\partial x_1} \\ \vdots \\ \frac{\partial (\mathbf{x}^\top A \mathbf{x})}{\partial x_n} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\partial (\sum_{i=1}^n \sum_{j=1}^n x_i A_{i,j} x_j)}{\partial x_1} \\ \vdots \\ \frac{\partial (\sum_{i=1}^n \sum_{j=1}^n x_i A_{i,j} x_j)}{\partial x_n} \end{bmatrix} \\ &= \begin{bmatrix} \sum_{i=1}^n \sum_{j=1}^n \frac{\partial (x_i A_{i,j} x_j)}{\partial x_1} \\ \vdots \\ \sum_{i=1}^n \sum_{j=1}^n \frac{\partial (x_i A_{i,j} x_j)}{\partial x_n} \end{bmatrix} \\ &= \begin{bmatrix} \sum_{j=1}^n A_{1,j} x_j + \sum_{i=1}^n x_i A_{i,1} \\ \vdots \\ \sum_{j=1}^n A_{n,j} x_j + \sum_{i=1}^n x_i A_{i,n} \end{bmatrix} \\ &= \begin{bmatrix} \sum_{j=1}^n A_{1,j} x_j + \sum_{i=1}^n A_{i,1} x_i \\ \vdots \\ \sum_{j=1}^n A_{n,j} x_j + \sum_{i=1}^n A_{i,n} x_i \end{bmatrix} \\ &= \begin{bmatrix} \sum_{j=1}^n (A_{1,j} + A_{j,1}) x_j \\ \vdots \\ \sum_{j=1}^n (A_{n,j} + A_{j,n}) x_j \end{bmatrix} \\ &= (A + A^\top) \mathbf{x} \end{align*} $$In this example, the gradient of the quadratic form $\mathbf{x}^\top A \mathbf{x}$ is $(A + A^\top) \mathbf{x}$ for a general square matrix $A$; when $A$ is symmetric, this simplifies to $2A\mathbf{x}$.
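A quick finite-difference check of this identity, sketched in NumPy with an arbitrary (deliberately non-symmetric) test matrix:

```python
import numpy as np

# Numerically verify grad(x^T A x) = (A + A^T) x via central differences.

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))   # deliberately not symmetric
x = rng.normal(size=n)

h = lambda v: v @ A @ v
analytic = (A + A.T) @ x

eps = 1e-6
numeric = np.array([
    (h(x + eps * e) - h(x - eps * e)) / (2 * eps)   # central difference per coordinate
    for e in np.eye(n)
])
assert np.allclose(analytic, numeric, atol=1e-5)
```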
Vector-to-vector derivatives play a crucial role in machine learning: they are used to compute gradients for optimization, to perform sensitivity analysis, and to understand how changes in the input variables affect the output.
5. Differential Equations
Differential equations describe the relationship between one or multiple functions and their derivatives. There are two main types of differential equations:
5.1 Ordinary Differential Equations (ODEs)
Ordinary Differential Equations (ODEs) contain an unknown function of a single independent variable, together with its derivatives. ODEs are the primary form of differential equations used in this post. A general form of an ODE appears as:
$$ F\left(x, y, \frac{dy}{dx}, \dots, \frac{d^ny}{dx^n}\right) = 0 $$
5.1.1 First-Order ODEs
First-order ODEs involve only the first derivative of the function. They can be written as:
$$ F\left(x, y, \frac{dy}{dx}\right) = 0 $$
5.1.2 Second-Order ODEs
Second-order ODEs involve the second derivative of the function. They can be expressed as:
$$ F\left(x, y, \frac{dy}{dx}, \frac{d^2y}{dx^2}\right) = 0 $$
5.2 Partial Differential Equations (PDEs)
Partial Differential Equations (PDEs) contain unknown multivariable functions and their partial derivatives. The general form of a PDE can be written as:
$$ F\left(x_1, x_2, \dots, x_n, u, \frac{\partial u}{\partial x_1}, \frac{\partial u}{\partial x_2}, \dots, \frac{\partial u}{\partial x_n}\right) = 0 $$where $u = u(x_1, x_2, \dots, x_n)$ is the unknown function.
5.2.1 Linear PDEs
Linear PDEs are a special class of PDEs where the unknown function and its partial derivatives appear only in the first power and are not multiplied by one another.
5.2.2 Nonlinear PDEs
Nonlinear PDEs are PDEs in which the unknown function or its partial derivatives appear nonlinearly, e.g., raised to higher powers or multiplied by one another.
5.3 Solution Methods for Differential Equations
Various solution methods can be employed to solve differential equations, including analytical methods such as separation of variables, integrating factors, and the method of undetermined coefficients, as well as numerical methods like Euler's method, Runge-Kutta methods, and finite difference methods.
For example, the separation of variables (Fourier method) can be used when all the terms containing one variable can be moved to one side, while the other terms are all moved to the other side:
$$ \begin{aligned} \text{Given }a\text{ is a constant scalar:}\quad\frac{dy}{dx} &= ay \\ \text{Move same variables to the same side:}\quad\frac{dy}{y} &= adx \\ \text{Put integral on both sides:}\quad\int \frac{dy}{y} &= \int adx \\ \ln (y) &= ax + C' \\ \text{Finally}\quad y &= e^{ax + C'} = C e^{ax} \end{aligned} $$Understanding differential equations is crucial in various fields, including physics, engineering, and economics, as they often model real-world phenomena and systems.
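To connect the analytical and numerical approaches, here is a sketch of Euler's method applied to $\frac{dy}{dx} = ay$, compared against the closed-form solution $y = Ce^{ax}$ derived above; the step size and constants are illustrative:

```python
import numpy as np

# Euler's method on dy/dx = a*y, whose exact solution (derived above)
# is y = C * exp(a*x). Constants and step size are illustrative.

a, y0 = 0.5, 1.0          # C = y0, since y(0) = C * e^0
dx = 0.01
xs = np.arange(0.0, 2.0, dx)

y = y0
for x in xs:
    y += dx * (a * y)     # Euler step: y_{k+1} = y_k + dx * f(x_k, y_k)

exact = y0 * np.exp(a * 2.0)
print(y, exact)           # Euler estimate vs. closed-form solution at x = 2
```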
6. Central Limit Theorem
The Central Limit Theorem (CLT) is a fundamental theorem in probability theory that states that the suitably standardized sum (or average) of a large number of independent and identically distributed (i.i.d.) random variables approaches a Gaussian distribution, regardless of the original distribution of the individual variables.
Let $x_1, x_2, \dots, x_N$ be a collection of i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. The CLT states that, as $N$ grows, the standardized sample mean converges in distribution to a standard Gaussian:
$$ \sqrt{N}\,\frac{\bar{x} - \mu}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1)\quad\text{as }N \to \infty, \qquad \bar{x} = \frac{1}{N}\sum_{i=1}^N x_i $$Here, $\xrightarrow{d}$ denotes convergence in distribution; informally, for large $N$ the sample mean is approximately distributed as $\mathcal{N}(\mu, \frac{\sigma^2}{N})$. To illustrate the convergence, let $S_N = x_1 + x_2 + \dots + x_N$. Then, we can rewrite the CLT as follows:
$$ \frac{S_N - N\mu}{\sqrt{N}\sigma} \xrightarrow{d} \mathcal{N}(0, 1) $$The proof of the CLT relies on characteristic functions and the continuity theorem for characteristic functions. Let $\phi_X(t) = \mathbb{E}[e^{itX}]$ be the characteristic function of a random variable $X$. Using the property $\phi_{aX + b}(t) = e^{itb}\phi_X(at)$ and the independence of the random variables, we can derive the characteristic function of the standardized sum:
$$ \begin{aligned} \phi_{\frac{S_N - N\mu}{\sqrt{N}\sigma}}(t) &= \phi_{\frac{1}{\sqrt{N}\sigma}(S_N - N\mu)}(t) \\ &= \phi_{\frac{1}{\sqrt{N}\sigma}(x_1 + x_2 + \dots + x_N - N\mu)}(t) \\ &= \prod_{i=1}^N \phi_{\frac{x_i - \mu}{\sqrt{N}\sigma}}(t) \\ &= \left[\phi_{\frac{X - \mu}{\sqrt{N}\sigma}}(t)\right]^N \end{aligned} $$By applying Taylor series expansion to $\phi_{\frac{X - \mu}{\sqrt{N}\sigma}}(t)$, we can show that the limit of the characteristic function converges to the characteristic function of the standard Gaussian distribution:
$$ \lim_{N \to \infty} \phi_{\frac{S_N - N\mu}{\sqrt{N}\sigma}}(t) = e^{-t^2/2} = \phi_{\mathcal{N}(0, 1)}(t) $$Concretely, the second-order Taylor expansion gives $\phi_{\frac{X - \mu}{\sqrt{N}\sigma}}(t) = 1 - \frac{t^2}{2N} + o\big(\frac{1}{N}\big)$, and raising this to the $N$-th power yields $e^{-t^2/2}$ in the limit. The CLT has profound implications for statistical inference, hypothesis testing, and machine learning. It is the foundation for many parametric statistical methods, such as the t-test and linear regression, and provides justification for using Gaussian distributions as priors in Bayesian methods and as assumptions in various machine learning algorithms.
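A quick simulation makes the theorem tangible: standardized sums of uniform draws behave like samples from $\mathcal{N}(0, 1)$. The sample sizes below are arbitrary choices:

```python
import numpy as np

# Empirical check of the CLT with Uniform(0, 1) variables.

rng = np.random.default_rng(0)
N, trials = 1_000, 5_000
mu, sigma = 0.5, np.sqrt(1 / 12)          # mean and std of Uniform(0, 1)

samples = rng.uniform(0, 1, size=(trials, N))
z = (samples.sum(axis=1) - N * mu) / (np.sqrt(N) * sigma)  # standardized sums

print(z.mean(), z.std())                  # ~ 0 and ~ 1
print(np.mean(np.abs(z) < 1.96))          # ~ 0.95, as expected for N(0, 1)
```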
7. Bayesian Inference
Bayesian inference is a statistical method that updates probabilities based on observed data, using Bayes' theorem. In machine learning, Bayesian methods can be applied to model uncertainty, update beliefs about model parameters, and make predictions.
Bayes' theorem is expressed as:
$$ P(H|D) = \frac{P(D|H)P(H)}{P(D)} $$where $P(H|D)$ is the posterior probability of hypothesis $H$ given data $D$, $P(D|H)$ is the likelihood of data $D$ given hypothesis $H$, $P(H)$ is the prior probability of hypothesis $H$, and $P(D)$ is the probability of observing data $D$. In the context of machine learning, $H$ can represent model parameters, and $D$ represents observed data.
For example, consider a simple Bayesian linear regression model:
$$ y = \mathbf{w}^{\top} \mathbf{x} + \epsilon $$where $y$ is the target variable, $\mathbf{w}$ is a weight vector, $\mathbf{x}$ is an input vector, and $\epsilon$ is a Gaussian noise term with zero mean and variance $\sigma^2$. We can assume a Gaussian prior on the weights $\mathbf{w}$:
$$ P(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \Sigma_p) $$Given observed data $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, the likelihood function is:
$$ P(D|\mathbf{w}) = \prod_{i=1}^n \mathcal{N}(y_i | \mathbf{w}^{\top} \mathbf{x}_i, \sigma^2) $$Using Bayes' theorem, we can compute the posterior distribution over the weights:
$$ P(\mathbf{w}|D) = \frac{P(D|\mathbf{w})P(\mathbf{w})}{P(D)} $$In practice, we often need to compute the marginal likelihood or evidence $P(D)$, which is given by:
$$ P(D) = \int P(D|\mathbf{w})P(\mathbf{w}) d\mathbf{w} $$This integral can be difficult to compute analytically, especially for high-dimensional or complex models. As a result, various approximate inference techniques, such as Markov Chain Monte Carlo (MCMC) methods, variational inference, and Laplace approximation, have been developed to estimate the posterior distribution or marginal likelihood.
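For the Gaussian prior and likelihood above, however, the posterior is available in closed form (a standard conjugacy result): $\Sigma_{\text{post}} = (\Sigma_p^{-1} + \sigma^{-2} X^\top X)^{-1}$ and $\boldsymbol{\mu}_{\text{post}} = \sigma^{-2}\, \Sigma_{\text{post}} X^\top \mathbf{y}$. A sketch on synthetic data, with illustrative prior and noise settings:

```python
import numpy as np

# Closed-form posterior for Bayesian linear regression with a Gaussian
# prior N(0, Sigma_p) and Gaussian noise. Data and settings are synthetic.

rng = np.random.default_rng(0)
n, d, sigma2 = 50, 2, 0.25
w_true = np.array([1.5, -0.7])
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)

Sigma_p = np.eye(d)                                   # prior covariance
Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_p) + X.T @ X / sigma2)
mu_post = Sigma_post @ X.T @ y / sigma2               # posterior mean

print(mu_post)        # should be close to w_true
print(Sigma_post)     # posterior uncertainty over the weights
```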
In conclusion, Bayesian inference provides a powerful framework for incorporating prior knowledge, modeling uncertainty, and updating beliefs as new data is observed. By understanding the underlying principles and methods, we can better apply Bayesian techniques to machine learning tasks and develop more robust and accurate models.
8. Optimization
Optimization is the process of finding the best solution to a problem by minimizing or maximizing an objective function. In machine learning, optimization algorithms are used to find the best model parameters that minimize the error or loss function. Common optimization techniques include gradient descent, stochastic gradient descent, and more advanced methods such as L-BFGS, AdaGrad, RMSprop, and Adam.
One of the most popular optimization methods is gradient descent. Given a differentiable loss function $L(\theta)$, where $\theta$ is the parameter vector, gradient descent iteratively updates the parameters using the negative gradient of the loss function:
$$ \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla L(\theta^{(t)}) $$where $\alpha$ is the learning rate, and $\nabla L(\theta^{(t)})$ is the gradient of the loss function with respect to the parameters at iteration $t$. The gradient can be computed using the chain rule of derivatives, as follows:
$$ \begin{aligned} \nabla L(\theta) &= \frac{\partial L}{\partial \theta} \\ &= \frac{\partial L}{\partial y} \frac{\partial y}{\partial \theta} \end{aligned} $$where $y$ denotes the model output. In stochastic gradient descent (SGD), instead of computing the gradient using the entire dataset, we use a random subset of the data (a mini-batch) to approximate the true gradient. This introduces some noise but can significantly speed up convergence:
$$ \theta^{(t+1)} = \theta^{(t)} - \alpha \nabla L_{\text{mini-batch}}(\theta^{(t)}) $$More advanced optimization methods often adapt the learning rate for each parameter based on their individual gradients. For example, the AdaGrad algorithm computes a running sum of the squared gradients and adjusts the learning rate for each parameter as follows:
$$ \begin{aligned} G_{t+1} &= G_t + \nabla L(\theta^{(t)}) \odot \nabla L(\theta^{(t)}) \\ \theta^{(t+1)} &= \theta^{(t)} - \frac{\alpha}{\sqrt{G_{t+1} + \epsilon}} \odot \nabla L(\theta^{(t)}) \end{aligned} $$where $G_t$ is the running sum of the squared gradients, $\odot$ denotes element-wise multiplication, and $\epsilon$ is a small constant to prevent division by zero.
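The sketch below runs plain gradient descent and AdaGrad side by side on a toy quadratic loss $L(\theta) = \frac{1}{2}\theta^\top A \theta$ (so $\nabla L = A\theta$); the learning rate and iteration count are illustrative:

```python
import numpy as np

# Gradient descent vs. AdaGrad on L(theta) = 0.5 * theta^T A theta,
# whose gradient is A @ theta. Hyperparameters are illustrative.

A = np.diag([10.0, 1.0])            # ill-conditioned quadratic
grad = lambda th: A @ th

theta_gd = np.array([1.0, 1.0])
theta_ada = theta_gd.copy()
G = np.zeros(2)                     # AdaGrad's running sum of squared gradients
alpha, eps = 0.1, 1e-8

for _ in range(100):
    g = grad(theta_gd)
    theta_gd -= alpha * g           # plain gradient descent step

    g = grad(theta_ada)
    G += g * g                      # accumulate squared gradients
    theta_ada -= alpha / np.sqrt(G + eps) * g   # per-parameter step size

print(theta_gd, theta_ada)          # both should approach the minimum at 0
```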
In conclusion, optimization is a crucial aspect of machine learning, as it allows us to find the best model parameters that minimize the error or loss function. By understanding the various optimization techniques and their properties, we can better choose the most appropriate method for a specific problem and improve the efficiency and effectiveness of our models.
9. Regularization
Regularization is a technique used in machine learning to prevent overfitting and improve generalization by introducing additional constraints or penalties to the objective function. Regularization methods add a regularization term $\Omega(\theta)$ to the loss function $L(\theta, \mathcal{D})$ to create a new regularized loss function:
$$ \tilde{L}(\theta, \mathcal{D}) = L(\theta, \mathcal{D}) + \lambda \Omega(\theta) $$where $\lambda$ is a regularization parameter that controls the balance between fitting the data and the complexity of the model, and $\theta$ represents the model parameters.
Two common regularization methods are L1 and L2 regularization:
L1 regularization (Lasso): This method adds an L1 penalty, which is proportional to the absolute value of the model parameters:
$$ \Omega(\theta) = \sum_{i=1}^{n} |\theta_i| $$
L1 regularization promotes sparsity in the model parameters, as it encourages some parameters to be exactly zero. This can be useful for feature selection.
L2 regularization (Ridge): This method adds an L2 penalty, which is proportional to the square of the magnitude of the model parameters:
$$ \Omega(\theta) = \sum_{i=1}^{n} \theta_i^2 $$
L2 regularization encourages the model parameters to be small but not necessarily zero. It can help prevent overfitting by avoiding extreme parameter values.
When optimizing the regularized loss function, we can use gradient-based methods such as gradient descent. The gradient of the regularized loss function with respect to the model parameters is given by:
$$ \nabla_{\theta} \tilde{L}(\theta, \mathcal{D}) = \nabla_{\theta} L(\theta, \mathcal{D}) + \lambda \nabla_{\theta} \Omega(\theta) $$For L1 regularization, the penalty is not differentiable at zero; a standard choice of (sub)gradient is the element-wise sign of the model parameters:
$$ \nabla_{\theta} \Omega(\theta) = \text{sign}(\theta) = \begin{cases} 1 & \text{if } \theta > 0 \\ 0 & \text{if } \theta = 0 \\ -1 & \text{if } \theta < 0 \end{cases} $$For L2 regularization, the gradient of the regularization term is proportional to the model parameters themselves:
$$ \nabla_{\theta} \Omega(\theta) = 2\theta $$Regularization is a powerful tool for preventing overfitting and improving the generalization capabilities of machine learning models by controlling their complexity.
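A minimal sketch of one regularized gradient step for each penalty, using a placeholder data-loss gradient; all constants are illustrative:

```python
import numpy as np

# One gradient step on a regularized loss, using the L1 and L2 penalty
# gradients derived above. The base loss gradient is a stand-in.

def grad_loss(theta):
    # Placeholder for the gradient of the data loss L(theta, D).
    return 2 * (theta - np.array([1.0, -2.0, 0.001]))

theta = np.array([0.5, -1.0, 0.3])
alpha, lam = 0.1, 0.01

# L2 (ridge) step: penalty gradient is 2 * theta.
theta_l2 = theta - alpha * (grad_loss(theta) + lam * 2 * theta)

# L1 (lasso) step: penalty (sub)gradient is sign(theta).
theta_l1 = theta - alpha * (grad_loss(theta) + lam * np.sign(theta))

print(theta_l2, theta_l1)
```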
10. Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features or variables in a dataset while preserving its essential structure and relationships. This is particularly useful for visualizing high-dimensional data, reducing computational complexity, and mitigating the curse of dimensionality.
One common dimensionality reduction technique is Principal Component Analysis (PCA). PCA aims to find a lower-dimensional subspace that maximizes the variance of the projected data. Given a dataset $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of samples and $d$ is the number of features, the covariance matrix of the data is given by:
$$ \Sigma = \frac{1}{n} X^\top X \in \mathbb{R}^{d \times d} $$where the data is assumed to be mean-centered (each feature has zero mean). PCA finds the eigenvectors and eigenvalues of the covariance matrix, which correspond to the principal components and the variances explained by these components, respectively. Let $\lambda_i$ and $v_i$ denote the $i$-th eigenvalue and eigenvector of $\Sigma$, respectively. The eigenvectors and eigenvalues satisfy the following relationship:
$$ \Sigma v_i = \lambda_i v_i $$By selecting the top $k$ eigenvectors corresponding to the $k$ largest eigenvalues, we can project the original data onto a lower-dimensional subspace:
$$ Z = X V_k \in \mathbb{R}^{n \times k} $$where $V_k \in \mathbb{R}^{d \times k}$ is the matrix formed by the top $k$ eigenvectors. This results in a reduced representation of the original data with $k$ dimensions.
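A compact NumPy sketch of this eigendecomposition view of PCA on synthetic data (note the explicit centering step, which the covariance formula above assumes):

```python
import numpy as np

# PCA via eigendecomposition of the covariance matrix, as described above.
# Synthetic 3-D data with most variance along the first axis; k = 1.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
X = X - X.mean(axis=0)                     # center the data

Sigma = X.T @ X / len(X)                   # covariance matrix (d x d)
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order

k = 1
V_k = eigvecs[:, ::-1][:, :k]              # top-k eigenvectors
Z = X @ V_k                                # projected data (n x k)
print(eigvals[::-1])                       # variance explained per component
```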
Another popular dimensionality reduction technique is t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE aims to find a lower-dimensional representation of the data that preserves pairwise similarities between data points, with an emphasis on local neighborhood structure. Given pairwise distances between the original data points, t-SNE converts them into similarity probabilities and minimizes the divergence between two probability distributions, one representing the pairwise similarities in the original space and the other in the lower-dimensional space. The Kullback-Leibler (KL) divergence between these two distributions is given by:
$$ C = \mathrm{KL}(P || Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} $$where $p_{ij}$ and $q_{ij}$ denote the probabilities of pairwise similarities in the original and lower-dimensional spaces, respectively. The t-SNE algorithm optimizes the lower-dimensional representation by minimizing the KL divergence using gradient descent.
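The sketch below computes only the t-SNE objective, not the full algorithm: given fixed high-dimensional similarities $P$ and a candidate embedding $Y$, it forms the Student-t similarities $q_{ij}$ and evaluates the KL divergence; the matrix $P$ here is a made-up normalized matrix, purely for illustration:

```python
import numpy as np

# The t-SNE objective only: Student-t similarities q_ij for a candidate
# embedding Y, and the KL divergence against fixed similarities P.

def tsne_kl(P, Y):
    # Pairwise squared distances in the embedding space.
    sq = np.sum(Y**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * Y @ Y.T
    W = 1.0 / (1.0 + sq)                   # Student-t kernel (1 degree of freedom)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                        # q_ij, normalized over all pairs
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

rng = np.random.default_rng(0)
P = rng.uniform(size=(5, 5)); P = (P + P.T) / 2   # made-up similarities
np.fill_diagonal(P, 0.0); P /= P.sum()            # symmetric, sums to 1
Y = rng.normal(size=(5, 2))                       # candidate 2-D embedding
print(tsne_kl(P, Y))                              # value minimized w.r.t. Y
```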
In summary, dimensionality reduction techniques like PCA and t-SNE play an essential role in various machine learning applications, such as data visualization, feature extraction, and noise reduction. By understanding these techniques and their underlying principles, practitioners can make informed choices when selecting appropriate methods for their specific tasks.
11. Conclusion
In conclusion, we have explored various basic mathematical concepts that underpin many machine learning algorithms. Understanding these concepts is essential for researchers and practitioners alike, as they provide the foundation for developing and implementing state-of-the-art machine learning models.
- Gaussian Processes
- Kernel Methods and Kernels
- Taylor Series Expansion
- Vector-to-vector Derivatives
- Differential Equations
- Central Limit Theorem
- Bayesian Inference
- Optimization
- Regularization
- Dimensionality Reduction