Linear Transformation and Deep Learning

kirubel Awoke

Introduction to Linear Transformations

A linear transformation is a mathematical concept used to map vectors from one space to another while preserving the operations of vector addition and scalar multiplication. It is defined as a function \(T\) that maps vectors \(\mathbf{v}\) in a vector space \(V\) to vectors \(T(\mathbf{v})\) in another vector space \(W\).

Mathematically, a transformation \(T\) is linear if it satisfies the following two conditions for all vectors \(\mathbf{v}, \mathbf{w}\) in the vector space and scalars \(c, d\):

  1. Additivity (Preserving Addition):

    $$T(\mathbf{v} + \mathbf{w}) = T(\mathbf{v}) + T(\mathbf{w})$$

    This means the transformation applied to the sum of two vectors is equal to the sum of the transformations applied to each vector individually.

  2. Homogeneity (Preserving Scalar Multiplication):

    $$T(c \mathbf{v}) = c \, T(\mathbf{v})$$

    This means that scaling a vector by a scalar and then applying the transformation is the same as applying the transformation first and then scaling the result by that scalar.

Linearity Condition

These two conditions can be combined into a single equation:

$$T(c \mathbf{v} + d \mathbf{w}) = c \, T(\mathbf{v}) + d \, T(\mathbf{w})$$

expresses the idea that the transformation applied to any linear combination of vectors (like \( c \mathbf{v} + d \mathbf{w}\)) is equal to the same linear combination of the transformations of the individual vectors. This is the essence of linearity.

A non-linear transformation might look like \(T(\mathbf{v}) = \mathbf{v} + \mathbf{u}_0\) (where \(\mathbf{u}_0\) is a fixed nonzero vector). This is not linear because it does not satisfy the additivity property:

$$T(\mathbf{v} + \mathbf{w}) = (\mathbf{v} + \mathbf{w}) + \mathbf{u}_0 \ne (\mathbf{v} + \mathbf{u}_0) + (\mathbf{w} + \mathbf{u}_0) = T(\mathbf{v}) + T(\mathbf{w})$$

Thus, the addition of a constant vector \(\mathbf{u}_0\) introduces a shift, turning the transformation into an affine transformation.

Matrix Representation of Linear Transformations

In the case of linear transformations between vector spaces, they can often be represented as matrix multiplications. If \(A\) is an \(m \times n\) matrix, then the transformation \(T\) that maps a vector \(\mathbf{v} \in \mathbb{R}^n\) to a vector in \(\mathbb{R}^m\) can be written as:

\(T(\mathbf{v}) = A\mathbf{v}\)

Here, \(A\) is the matrix representation of the transformation, and \(\mathbf{v}\) is the vector being transformed. The matrix \(A\) encodes how the transformation acts on the vector space. For each vector \(\mathbf{v}\), applying \(A\) to \(\mathbf{v}\) produces another vector in the output space \(\mathbb{R}^m\).
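As a quick sanity check, here is a minimal sketch (using NumPy, with an arbitrarily chosen \(2 \times 3\) matrix and test vectors, none of which come from the article) that applies \(T(\mathbf{v}) = A\mathbf{v}\) and numerically verifies additivity, homogeneity, and the combined linearity condition:

```python
import numpy as np

A = np.array([[2.0, -1.0, 0.0],
              [0.5,  3.0, 1.0]])        # A is 2x3, so T maps R^3 -> R^2

def T(v):
    """Apply the linear transformation encoded by the matrix A."""
    return A @ v

v = np.array([1.0, 2.0, -1.0])
w = np.array([0.0, -3.0, 4.0])
c, d = 2.5, -1.5

# Additivity: T(v + w) == T(v) + T(w)
print(np.allclose(T(v + w), T(v) + T(w)))                    # True

# Homogeneity: T(c v) == c T(v)
print(np.allclose(T(c * v), c * T(v)))                       # True

# Combined linearity condition: T(c v + d w) == c T(v) + d T(w)
print(np.allclose(T(c * v + d * w), c * T(v) + d * T(w)))    # True
```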

Key Properties of Linear Transformations

Zero Vector Transformation: A key property of linear transformations is that they always map the zero vector to the zero vector. This follows directly from homogeneity with \(c = 0\):

$$T(\mathbf{0}) = \mathbf{0}$$

This means that applying the transformation to the zero vector results in the zero vector in the output space.

Preservation of Structure: Linear transformations preserve the structure of vector spaces. In the context of geometry, this means that straight lines remain straight, and parallelism and ratios of distances between vectors are maintained under the transformation.

A linear transformation maps linear structures to linear structures:

  • Lines to lines

  • Planes to planes

  • Triangles to triangles

  • Equally spaced points to equally spaced points

This is because linear transformations preserve:

  • Vector addition

  • Scalar multiplication

Non-Linear Transformations

Transformations involving non-linear operations, such as squaring the components of vectors, taking their norms, or multiplying the components of two vectors together, do not preserve linearity.

A transformation like \(T(\mathbf{v}) = \mathbf{v}^2\) (squaring componentwise) or \(T(\mathbf{v}) = \| \mathbf{v} \|\) would not be linear because these operations do not satisfy the additivity or homogeneity properties. For instance, \(T(\mathbf{v} + \mathbf{w}) \ne T(\mathbf{v}) + T(\mathbf{w})\) if \(T\) involves squaring the components.

Example: The Norm / Length Transformation is NOT Linear

Transformation: \(T(\mathbf{v}) = \| \mathbf{v} \|\). This maps a vector in \(\mathbb{R}^n\) to a scalar (its Euclidean norm). Formally:

\(T : \mathbb{R}^n \rightarrow \mathbb{R}, \quad T(\mathbf{v}) = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}\)

\(T\) is not linear because it satisfies neither condition:

Additivity fails: in general \(T(\mathbf{v} + \mathbf{w}) = \| \mathbf{v} + \mathbf{w} \| \ne \| \mathbf{v} \| + \| \mathbf{w} \| = T(\mathbf{v}) + T(\mathbf{w})\) (by the triangle inequality, equality holds only when the vectors point in the same direction).

Homogeneity fails: \(T(c\mathbf{v}) = |c| \, T(\mathbf{v}) \ne c \, T(\mathbf{v})\) whenever \(c < 0\).
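A small numerical counterexample (again a NumPy sketch with arbitrarily chosen vectors) makes both failures concrete:

```python
import numpy as np

def T(v):
    # T(v) = ||v||, the Euclidean norm
    return np.linalg.norm(v)

v = np.array([1.0, 0.0])
w = np.array([0.0, 1.0])

print(T(v + w), T(v) + T(w))   # 1.414... vs 2.0   -> additivity fails
print(T(-3 * v), -3 * T(v))    # 3.0 vs -3.0       -> homogeneity fails for c < 0
```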

Example: Rotation is Linear

Transformation:

Let \(T\) be rotation by angle \(\theta = 30^\circ\) counter-clockwise in \(\mathbb{R}^2\). The transformation \(T(\mathbf{v})\) rotates every input vector \(\mathbf{v} \in \mathbb{R}^2\) by \(\theta = 30^\circ\).

Rotation Matrix:

\(T(\mathbf{v}) = R_{\theta} \mathbf{v},\) where \(R_{\theta} = \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{bmatrix}\)

For \(\theta = 30^\circ : \)

\(R_{30^\circ} = \begin{bmatrix} \cos(30^\circ) & -\sin(30^\circ) \\\\ \sin(30^\circ) & \cos(30^\circ) \end{bmatrix} = \begin{bmatrix} \frac{\sqrt{3}}{2} & -\frac{1}{2} \\\\ \frac{1}{2} & \frac{\sqrt{3}}{2} \end{bmatrix}\)

So:

\(T(\mathbf{v}) = R_{30^\circ} \mathbf{v}\)

Checking for linearity:

  • Additivity:

Let \(\mathbf{v}, \mathbf{w} \in \mathbb{R}^2\). Then

\(T(\mathbf{v} + \mathbf{w}) = R(\mathbf{v} + \mathbf{w}) = R\mathbf{v} + R\mathbf{w} = T(\mathbf{v}) + T(\mathbf{w})\), so additivity holds.

  • Homogeneity:

For a scalar \(c\),

\(T(c\mathbf{v}) = R(c\mathbf{v}) = c(R\mathbf{v}) = cT(\mathbf{v})\), so homogeneity holds.

Since \(R^\top R = I\), the dot product is preserved.

Orthogonality of Rotation Matrices: \(R^\top R = I\)

If a matrix \(R\) satisfies

\(R^\top R = I\)

then \(R\) is called an orthogonal matrix. This means:

  • The columns (and also the rows) of \(R\) form an orthonormal basis.

  • \(R\) preserves lengths and angles.

  • The inverse of \(R\) is its transpose: \(R^{-1} = R^\top\)

For rotation, the standard 2D rotation matrix

\(R = \begin{bmatrix} \cos \theta & -\sin \theta \\\\ \sin \theta & \cos \theta \end{bmatrix}\)

satisfies

\(RR^\top = \begin{bmatrix} \cos \theta & -\sin \theta \\\\ \sin \theta & \cos \theta \end{bmatrix} \begin{bmatrix} \cos \theta & \sin \theta \\\\ - \sin \theta & \cos \theta \end{bmatrix} = \begin{bmatrix} 1 & 0 \\\\ 0 & 1 \end{bmatrix} = I\)

So: Orthogonality holds.
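The following sketch (NumPy assumed, with arbitrary test vectors) builds \(R_{30^\circ}\) and checks additivity, homogeneity, orthogonality, and the preservation of the dot product numerically:

```python
import numpy as np

theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def T(v):
    # Rotate v counter-clockwise by 30 degrees
    return R @ v

v = np.array([2.0, -1.0])
w = np.array([0.5,  3.0])
c = -4.0

print(np.allclose(T(v + w), T(v) + T(w)))             # additivity holds
print(np.allclose(T(c * v), c * T(v)))                # homogeneity holds
print(np.allclose(R.T @ R, np.eye(2)))                # R^T R = I -> R is orthogonal
print(np.isclose(np.dot(T(v), T(w)), np.dot(v, w)))   # dot product is preserved
```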

Representing Differentiation as a Linear Transformation

We interpret the derivative operator \(T(u) = \frac{du}{dx}\) as a linear transformation between function spaces.

Derivative via Linearity

Let the function be:

\(u(x) = 6 - 4x + 3x^2\)

We decompose this function into a linear combination of the basis:

\( u(x) = 6 \cdot 1 + (-4) \cdot x + 3 \cdot x^2\)

Here, \(\{ 1, x, x^2 \}\) form a basis for the space of quadratic polynomials, \(V = \text{span}(1, x, x^2)\) , which is 3-dimensional.

We define a linear operator:

\(T(u) = \frac{du}{dx}\)

By linearity, we apply the derivative to each term individually:

\(\frac{du}{dx} = 6 \cdot \frac{d}{dx}(1) - 4 \cdot \frac{d}{dx}(x) + 3 \cdot \frac{d}{dx}(x^2) = 6 \cdot 0 - 4 \cdot 1 + 3 \cdot 2x = -4 + 6x\)

This reinforces the idea that the derivative operator is linear: once we know the derivatives of the basis functions, we can obtain the derivative of any function in the space as the same linear combination of those basis derivatives.

Matrix Representation of the Derivative

Choose the basis:

  • Input basis \(\{ 1, x, x^2 \} \)

  • Output basis \(\{ 1, x\}\)

Now we express the transformation \(T(u) = \frac{du}{dx}\) as a matrix \(A\) such that:

\(T(u) = A u \)

We compute \(T\) on each basis vector of the input:

\(T(1) = 0 \)

\(T(x) = 1 \)

\(T(x^2) = 2x\)

Now express each result in terms of the output basis \(\{ 1, x\}\)

| Input Basis | Derivative | Coefficients in Output Basis |
| --- | --- | --- |
| \(1\) | \(0\) | \([0, 0]\) |
| \(x\) | \(1\) | \([1, 0]\) |
| \(x^2\) | \(2x\) | \([0, 2]\) |

So the matrix \(A\) representing \(T\) is:

\[A = \begin{bmatrix} 0 & 1 & 0 \\\\ 0 & 0 & 2 \end{bmatrix}\]

Apply this to the coefficient vector of \(u(x) = a + b x + c x^2\):

\[\begin{bmatrix} 0 & 1 & 0 \\\\ 0 & 0 & 2 \end{bmatrix} \begin{bmatrix} a \\\\ b \\\\ c \end{bmatrix} = \begin{bmatrix} b \\\\ 2c \end{bmatrix} \;\Rightarrow\; b + 2cx\]
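Here is a minimal sketch (NumPy assumed) of this matrix acting on the coefficient vector of the earlier example \(u(x) = 6 - 4x + 3x^2\):

```python
import numpy as np

# The derivative acting on coefficient vectors [a, b, c] of
# u(x) = a + b x + c x^2 is just multiplication by A.
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])        # maps (a, b, c) -> (b, 2c)

u = np.array([6.0, -4.0, 3.0])         # u(x) = 6 - 4x + 3x^2
du = A @ u
print(du)                              # [-4.  6.]  i.e. du/dx = -4 + 6x
```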

Moments of a Linear (Affine) Transformation

Suppose a random vector \(x \in \mathbb{R}^n\) has mean \(\mu = \mathbb{E}[x]\) and covariance matrix \(\Sigma = \text{Cov}[x]\).

Now consider an affine transformation:

\(y = Ax + b\)

where:

\(A \in \mathbb{R}^{m \times n}\) is a matrix (linear part),

\(b \in \mathbb{R}^m\) is a bias vector (translation),

\(y \in \mathbb{R}^m\) is the transformed output.

Mean of the Transformed Variable

The expected value (mean) of \(y\) is:

\(E[y] = E[Ax + b] = A\,E[x] + b = A\mu + b \quad (2.1)\)

For any random vector \(x\), constant matrix \(A\), and constant vector \(b\), we have:

\(E[Ax] = A\,E[x], \quad E[b] = b\)

By definition,

\(E[y] = E[Ax + b]\)

Then we apply linearity of expectation:

\(E[Ax + b] = E[Ax] + E[b]\)

Then, pulling out the constants:

\(E[Ax] = A\,E[x], \quad E[b] = b\)

\(\Rightarrow E[y] = A\,E[x] + b\)

Let \(\mu = E[x]\). Then:

\(E[y] = A\mu + b\)

This is the transformed mean vector of the random variable \(y\) under the affine transformation.

Equation (2.1) follows from the linearity of expectation, which holds regardless of the distribution of \(x\).
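A quick Monte Carlo sketch (NumPy assumed, with arbitrarily chosen \(\mu\), \(\Sigma\), \(A\), and \(b\)) illustrates that the sample mean of \(y = Ax + b\) matches \(A\mu + b\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])

A = np.array([[1.0, 0.0, 2.0],
              [0.5, -1.0, 1.0]])
b = np.array([3.0, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)   # samples of x, shape (N, 3)
y = x @ A.T + b                                        # y = A x + b for each sample

print(y.mean(axis=0))   # empirical E[y], close to...
print(A @ mu + b)       # ...the theoretical A mu + b
```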

If \(f(x) = a^T x + b\) is a scalar-valued affine function (\(a \in \mathbb{R}^n\)), then:

\(E[a^T x + b] = a^T E[x] + b = a^T \mu + b \quad (2.2)\)

Where:

\(x \in \mathbb{R}^n\): a random vector

\(a \in \mathbb{R}^n\): a fixed vector (a linear functional or projection direction)

\(b \in \mathbb{R}\): a constant scalar

\(\mu = E[x]\): the mean vector of the random variable \(x\)

Here \(a^T x + b\) is a scalar-valued affine transformation of a vector-valued random variable: it is a linear projection of \(x\) onto the direction \(a\) (via the dot product \(a^T x\)), after which a bias term \(b\) is added, shifting the result.

This kind of map is fundamental in:

  • Linear regression (the prediction is \(\hat{y} = w^T x + b\))

  • Perceptrons and neural networks (each neuron computes \(z = w^T x + b\)); see the sketch after this list

  • Signal projections, compressed sensing, and PCA projections
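For instance, here is a minimal sketch of a single neuron's pre-activation (the weights, bias, and input below are made-up values for illustration, not from the article):

```python
import numpy as np

w = np.array([0.2, -0.5, 1.0])   # hypothetical weight vector
b = 0.1                          # hypothetical bias
x = np.array([1.5, 2.0, -0.3])   # one input feature vector

z = w @ x + b                    # linear projection of x onto w, then shifted by b
print(z)
```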

Expectation is linear, which means that for any scalars \(\alpha, \beta\) and random variables \(X, Y\):

\(E[\alpha X + \beta Y] = \alpha E[X] + \beta E[Y]\)

We can generalize this to vectors and matrices.

Step-by-Step Derivation

\(E[a^T x + b]\)

Separate the sum:

\(= E[a^T x] + E[b]\)

Pull out the constants: \(a^T\) is fixed, so it can be pulled out of the expectation:

\(E[a^T x] = a^T E[x]\)

and \(E[b] = b\), since \(b\) is a constant.

Final result:

\(E[a^T x + b] = a^T E[x] + b\)

Let \(\mu = E[x]\). Then:

\(E[a^T x + b] = a^T \mu + b\)

Covariance of the Transformed Variable

The covariance matrix encapsulates the pairwise relationships between variables. It tells us:

  • Whether two variables move together (positive covariance),

  • Whether they move in opposite directions (negative covariance),

  • How strongly they are related, based on the magnitude of the covariance values.

The covariance of \(y = Ax + b\) is:

\(\text{Cov}[y] = \text{Cov}[Ax + b] = A\,\text{Cov}[x]\,A^T = A \Sigma A^T \quad (2.3)\)

Let’s begin with the mathematical definition of the covariance matrix \(\Sigma\). If \(x\) is a vector of random variables, say \(x = [x_1, x_2, \ldots, x_n]^T\), then the covariance matrix \(\Sigma\) is defined as

\(\Sigma = \text{Cov}[x] = E[(x - \mu)(x - \mu)^T]\)

Where:

\(x\) is an \(n\)-dimensional random vector, \(x \in \mathbb{R}^n\)

\(\mu\) is the mean vector of \(x\), i.e., \(\mu = E[x]\)

\(E[\cdot]\) denotes the expectation (mean)

\((x - \mu)^T\) is the transpose of the vector \((x - \mu)\)

So, the covariance matrix \(\Sigma\) is a square matrix of dimension \(n \times n\), where the element in row \(i\), column \(j\) is the covariance between the variables \(x_i\) and \(x_j\):

\(\Sigma_{ij} = \text{Cov}[x_i, x_j] = E[(x_i - \mu_i)(x_j - \mu_j)]\)

If \(i = j\), \(\text{Cov}[x_i, x_i]\) is the variance of \(x_i\).

If \(i \ne j\), \(\text{Cov}[x_i, x_j]\) represents the covariance between \(x_i\) and \(x_j\), indicating how changes in one variable correspond to changes in another.

Now let’s derive equation (2.3). Start from the definition of covariance:

\(\text{Cov}[y] = E[(y - E[y])(y - E[y])^T]\)

Substitute \(y = Ax + b\) and \(E[y] = A\mu + b\):

\(\text{Cov}[y] = E[(Ax + b - A\mu - b)(Ax + b - A\mu - b)^T]\)

\(= E[(A(x - \mu))(A(x - \mu))^T]\)

\(= E[A(x - \mu)(x - \mu)^T A^T]\)

\(= A\,E[(x - \mu)(x - \mu)^T]\,A^T\)

\(= A \Sigma A^T\)

So the covariance transforms by congruence with the matrix \(A\): it is sandwiched between \(A\) and \(A^T\). Note that the bias \(b\) cancels out, so it has no effect on the covariance.

For a scalar affine function \(y = a^T x + b\), its variance is:

\(\text{Var}[y] = \text{Var}[a^T x + b] = a^T \Sigma a\)
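The same Monte Carlo approach (NumPy assumed, arbitrary parameters) can be used to check both \(\text{Cov}[Ax + b] = A \Sigma A^T\) and \(\text{Var}[a^T x + b] = a^T \Sigma a\):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.4, 0.0],
                  [0.4, 1.0, 0.3],
                  [0.0, 0.3, 0.5]])

A = np.array([[1.0, -1.0, 0.0],
              [2.0,  0.0, 1.0]])
b = np.array([5.0, -2.0])
a = np.array([1.0, 2.0, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=500_000)
y = x @ A.T + b

print(np.cov(y, rowvar=False))   # empirical Cov[y], close to...
print(A @ Sigma @ A.T)           # ...A Sigma A^T  (note: b has no effect)

z = x @ a + 3.0                  # scalar affine function a^T x + b
print(z.var(), a @ Sigma @ a)    # empirical variance vs a^T Sigma a
```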
