Camera Calibration with Zhang's Method

Ayush Saraswat
10 min read

In 3D Computer Vision, camera calibration is the process of determining a camera's parameters so that geometric computations on its images can be carried out with a specified accuracy.

These parameters include the focal length, optical centre, skew coefficient, radial and tangential distortions, etc. This process is really important because it improves the accuracy of our calculations and removes lens distortions. It enables us to perform 3D image reconstruction and helps maintain consistency of images across platforms.

Camera Matrix Model

While discussing the theory, we will consider a pinhole camera model, since the result is easy to derive for it and generalises readily.

The camera matrix model describes a set of important parameters that affect how a world point P is mapped to image coordinates P’ .

$$P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} f\frac{x}{z} + c_x \\ f\frac{y}{z} + c_y \end{bmatrix}$$

Where c_x and c_y are translation offsets. This can be easily obtained through similarity of triangles in the pinhole camera diagram.

Here point P has 3D coordinates \((x_1, x_2, x_3)\) and its projection Q has 2D coordinates \((y_1, y_2)\). Since the corresponding triangles are similar, we can say:

$${\displaystyle {\begin{pmatrix}y_{1}\\y_{2}\end{pmatrix}}=-{\frac {f}{x_{3}}}{\begin{pmatrix}x_{1}\\x_{2}\end{pmatrix}}}$$

You can compare this with the equations above to see that they are the same except for the variable names and the minus sign, which appears here because the image plane is placed behind the pinhole, so the image is inverted.

Since digital images are represented in pixels, we will introduce two scaling parameters \(k\) and \(l\) (of units pixels/cm) to account for this difference. So the new matrices are:

$$P' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} fk\frac{x}{z} + c_x \\ fl\frac{y}{z} + c_y \end{bmatrix} = \begin{bmatrix} \alpha\frac{x}{z} + c_x \\ \beta\frac{y}{z} + c_y \end{bmatrix}$$

We use Homogeneous Coordinates to express this in the form of matrix multiplications:

$$P'_h = \begin{bmatrix} \alpha x + c_x z \\ \beta y + c_y z \\ z \end{bmatrix} = \begin{bmatrix} \alpha & 0 & c_x & 0 \\ 0 & \beta & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}$$

which can also be represented as :

$$\begin{align} P'_h &= \begin{bmatrix} \alpha & 0 & c_x \\ 0 & \beta & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} I & 0 \end{bmatrix} P \\ \\ \\ P'_h &= K \begin{bmatrix} I & 0 \end{bmatrix} P \end{align}$$

where K is called the camera matrix. To account for a skewness of angle θ in the coordinate axis, the new camera matrix becomes:

$$K = \begin{bmatrix} \alpha & -\alpha \cot\theta & c_x \\ 0 & \frac{\beta}{\sin\theta} & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

the derivation of which you can find in this blog.

Therefore our camera matrix K has 5 degrees of freedom: 2 for focal length, 2 for offset, and 1 for skewness. These parameters are collectively known as the intrinsic parameters, as they are inherent to a given camera and describe its essential properties.
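To make this concrete, here is a minimal sketch (all numeric values and names are illustrative, not from any real camera) that builds the camera matrix K with the skew term and projects camera-frame points to pixel coordinates:

```python
import numpy as np

# Hypothetical intrinsic values, chosen only for illustration.
alpha, beta = 800.0, 810.0       # focal lengths in pixels
cx, cy = 320.0, 240.0            # principal point (optical centre) in pixels
theta = np.deg2rad(90.0)         # axis skew angle; 90 degrees means no skew

K = np.array([
    [alpha, -alpha / np.tan(theta), cx],
    [0.0,    beta / np.sin(theta),  cy],
    [0.0,    0.0,                   1.0],
])

def project(P_world, K):
    """Project camera-frame 3D points (N, 3) to pixel coordinates (N, 2)."""
    P_h = K @ P_world.T            # homogeneous pixel coordinates, shape (3, N)
    return (P_h[:2] / P_h[2]).T    # perspective divide by the third component

pts = np.array([[0.1, -0.05, 2.0], [0.0, 0.0, 1.0]])
print(project(pts, K))   # a point on the optical axis maps to (cx, cy)
```

A point on the optical axis, such as (0, 0, 1), lands exactly on the principal point, which is a quick sanity check for any implementation.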


Extracting Intrinsic Parameters

We will be using Zhang’s method, which builds on the Direct Linear Transform (DLT), to calculate the intrinsic parameters of the camera, since this is the simplest and most popular approach. In the original paper, the camera matrix is represented as:

$$K = \begin{bmatrix} \alpha & \gamma & c_x \\ 0 & \beta & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

where γ is used to represent skewness directly, and the β/sin θ term is absorbed into β since skewness is considered negligible. However, if skewness is accounted for separately, then the previous matrix can also be used as is.

A 2D image point is written as \(m = [u, \space v]^{\small T}\), and a 3D world point is written as \(M = [X, \space Y, \space Z]^{\small T}\). We can extend these points by adding a "1" at the end, creating augmented vectors: \(\tilde{m} = [u, v, 1]^T\) for the 2D point and \(\tilde{M} = [X, Y, Z, 1]^T\) for the 3D point.

In a pinhole camera model, the relationship between a 3D point M and its 2D projection m is given by:

$$s\tilde{m} = K[R \space t]\tilde{M}$$

where:

  • s is a scaling factor,

  • K is the camera's intrinsic matrix,

  • R is the rotation matrix,

  • t is the translation vector.

This can also be represented through Homography:

$$\begin{align} s\tilde{m} = H\tilde{M} \hspace{5mm} where \space H = K[r_1 \space r_2 \space t] \end{align}$$

r3 is not present because we assume the model plane lies on Z = 0 in the world coordinate system, so the third column of the rotation matrix is always multiplied by zero and drops out.
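Before these constraints can be used, each homography H must itself be estimated from plane-to-image point correspondences. A minimal DLT sketch (the function name and structure are my own; a production version would normalise the points first, as Hartley recommends):

```python
import numpy as np

def estimate_homography(world_xy, image_uv):
    """Direct Linear Transform: fit H such that s*[u, v, 1]^T = H @ [X, Y, 1]^T.

    world_xy, image_uv: (N, 2) arrays of corresponding points, N >= 4.
    """
    A = []
    for (X, Y), (u, v) in zip(world_xy, image_uv):
        # Each correspondence contributes two linear equations in the 9 entries of H.
        A.append([-X, -Y, -1, 0, 0, 0, u * X, u * Y, u])
        A.append([0, 0, 0, -X, -Y, -1, v * X, v * Y, v])
    # The null vector of A (right singular vector of the smallest singular value)
    # holds the entries of H up to scale.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]   # fix the free scale so H[2, 2] = 1
```

With exact, noise-free correspondences this recovers the homography to numerical precision; with real detections it gives a least-squares estimate.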

Through this we will obtain two constraints (see Appendix for detailed derivation):

$$\begin{align} h_1^T B h_2 \space &= \space 0 \quad \quad \text{and,} \\ h_1^T B h_1 \space &= \space h_2^T B h_2 \end{align}$$

where B is a symmetric matrix and it can be defined by a 6D vector:

$$b = [B_{11}, B_{12}, B_{22}, B_{13}, B_{23}, B_{33}]^{\small T}$$

Let the i’th column vector of H be \( h_i = [h_{i1}, h_{i2}, h_{i3}]^{\small T}\). Then we can write the multiplication of matrices \(h_i\), B and \(h_j\) like this:

$$h_i^{\small T} B h_j = v^{\small T}_{ij}b$$

with

$$\begin{align*} v_{ij} = [ & h_{i1}h_{j1}, \space (h_{i1}h_{j2} + h_{i2}h_{j1}), \space h_{i2}h_{j2}, \\ & (h_{i3}h_{j1} + h_{i1}h_{j3}), \space (h_{i3}h_{j2} + h_{i2}h_{j3}), \space h_{i3}h_{j3} ] \end{align*}$$

Therefore, the two fundamental constraints, for a given homography, can be rewritten as 2 homogeneous equations in b:

$$\begin{bmatrix} v_{12}^{\small T} \\ (v_{11} − v_{22})^{\small T} \end{bmatrix} b = 0$$

If n images of the model plane are observed, by stacking n such equations we will have Vb = 0.

Now we can solve for b via singular value decomposition (taking the right singular vector of V associated with the smallest singular value, e.g. with Eigen's JacobiSVD) and obtain the values of the B matrix, which in turn gives us all the intrinsic parameters.

$$\begin{align} v_{0} &= (B_{12}B_{13} − B_{11}B_{23})/(B_{11}B_{22} − B^2_{12}) \\[2ex] λ &= B_{33} − [B^2_{13} + v_{0}(B_{12}B_{13} − B_{11}B_{23})]/B_{11} \\[2ex] α &= \sqrt{λ/B_{11}} \\[2ex] β &= \sqrt{λB_{11}/(B_{11}B_{22} − B^2_{12})} \\[2ex] γ &= −B_{12}α^2\left(\frac{β}{λ}\right) \\[2ex] u_{0} &= \frac{γv_{0}}{β} − \frac{B_{13}α^2}{λ} . \end{align}$$
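The whole intrinsics step can be sketched in a few lines, assuming the homographies are already estimated (function names are mine; a robust version would normalise the data and use more images):

```python
import numpy as np

def v_ij(H, i, j):
    """Row vector v_ij built from columns i and j of homography H (0-indexed)."""
    hi, hj = H[:, i], H[:, j]
    return np.array([
        hi[0] * hj[0],
        hi[0] * hj[1] + hi[1] * hj[0],
        hi[1] * hj[1],
        hi[2] * hj[0] + hi[0] * hj[2],
        hi[2] * hj[1] + hi[1] * hj[2],
        hi[2] * hj[2],
    ])

def intrinsics_from_homographies(Hs):
    """Stack two constraints per homography, solve V b = 0 by SVD,
    then apply the closed-form formulas for the intrinsic parameters."""
    V = np.vstack([row for H in Hs
                   for row in (v_ij(H, 0, 1), v_ij(H, 0, 0) - v_ij(H, 1, 1))])
    _, _, Vt = np.linalg.svd(V)
    B11, B12, B22, B13, B23, B33 = Vt[-1]   # null vector of V, up to scale
    v0 = (B12 * B13 - B11 * B23) / (B11 * B22 - B12 ** 2)
    lam = B33 - (B13 ** 2 + v0 * (B12 * B13 - B11 * B23)) / B11
    alpha = np.sqrt(lam / B11)
    beta = np.sqrt(lam * B11 / (B11 * B22 - B12 ** 2))
    gamma = -B12 * alpha ** 2 * beta / lam
    u0 = gamma * v0 / beta - B13 * alpha ** 2 / lam
    return np.array([[alpha, gamma, u0], [0.0, beta, v0], [0.0, 0.0, 1.0]])
```

Note that the formulas are invariant to the sign of the recovered null vector b, so no sign fix-up is needed.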


Extracting Extrinsic Parameters

We already know that : \(\space H = \lambda K[r_1 \space r_2 \space t]\) where \(\lambda\) is an arbitrary scale factor. Now that we have acquired the \(K\) matrix, we can use its inverse to calculate all three extrinsic parameters.

$$[r_1 \space r_2 \space t ] = \frac{1}{\lambda}K^{-1}H$$

Since \(r_1\)​ and \(r_2\)​ are orthonormal we can solve for \(\lambda\) like this:

$$\begin{align} \| \lambda K^{-1} h_1 \| &= 1 \\ \| \lambda K^{-1} h_2 \| &= 1 \end{align}$$

which means :

$$\begin{align} \lambda = \frac{1}{\| K^{-1} h_1 \| } \\ \lambda = \frac{1}{\| K^{-1} h_2 \| } \end{align}$$

For robustness, we will average the two, so \(\lambda\) becomes:

$$\lambda = \frac{1}{2}\cdot \left( \frac{1}{\| K^{-1} h_1 \| } + \frac{1}{\| K^{-1} h_2 \| } \right)$$

Now we can compute the extrinsic parameters :
\(r_1 = \lambda K^{-1}h_1\) , \(r_2= \lambda K^{-1}h_2\), and \(t = \lambda K^{-1}h_3\), with \(r_3 = r_1 \times r_2\) completing the rotation matrix.
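A small sketch of this recovery step (names are mine; real implementations re-orthogonalise the resulting rotation matrix, e.g. via SVD, since the estimated columns are only approximately orthonormal):

```python
import numpy as np

def extrinsics_from_homography(K, H):
    """Recover R and t from H proportional to K [r1 r2 t]."""
    K_inv = np.linalg.inv(K)
    # Average the two scale estimates from the unit-norm columns r1 and r2.
    lam = 0.5 * (1.0 / np.linalg.norm(K_inv @ H[:, 0])
                 + 1.0 / np.linalg.norm(K_inv @ H[:, 1]))
    r1 = lam * K_inv @ H[:, 0]
    r2 = lam * K_inv @ H[:, 1]
    t = lam * K_inv @ H[:, 2]
    r3 = np.cross(r1, r2)            # complete the rotation with r3 = r1 x r2
    R = np.column_stack([r1, r2, r3])
    return R, t
```

Because only the ratio of H to K[r1 r2 t] matters, the recovered R and t are independent of the arbitrary positive scale on H.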


Distortion Parameters

In the pinhole camera model, some distortions may arise during image capture due to deformities of the lens. Camera calibration helps us mitigate the damage done by these distortions. There are mainly two types of distortion: radial and tangential.

Radial distortions occur because light bends more near the edges of the lens than at its centre. All radial distortions are symmetric around the optical centre of the lens. Two further types of radial distortion are barrel and pincushion distortion.

$$\begin{gather} x' = x + x[k_1(x^2 + y^2) + k_2(x^2+y^2)^2] \\ y' = y + y[k_1(x^2 + y^2) + k_2(x^2+y^2)^2] \end{gather}$$

(Figure: the left image shows barrel distortion; the right shows pincushion distortion.)

Tangential distortions occur when the lens and the image plane are not perfectly parallel. This misalignment causes a "tilt" in the image, leading to skewed lines. With \(r^2 = x^2 + y^2\):

$$\begin{gather} x' = x + [\space 2p_1xy + p_2(r^2 + 2x^2) \space ] \\ y' = y + [\space p_1(r^2 + 2y^2) + 2p_2xy \space ] \end{gather}$$
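The two models can be combined into one forward-distortion step on normalized image coordinates (a hypothetical helper of my own, following the equations above with \(r^2 = x^2 + y^2\)):

```python
def distort(x, y, k1, k2, p1, p2):
    """Apply radial (k1, k2) and tangential (p1, p2) distortion to
    normalized image coordinates x, y (i.e. after dividing by z)."""
    r2 = x * x + y * y
    radial = k1 * r2 + k2 * r2 * r2                       # radial term
    x_d = x + x * radial + (2 * p1 * x * y + p2 * (r2 + 2 * x * x))
    y_d = y + y * radial + (p1 * (r2 + 2 * y * y) + 2 * p2 * x * y)
    return x_d, y_d
```

Note that the optical centre (0, 0) is a fixed point of the model: both the radial and tangential terms vanish there.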


Levenberg-Marquardt Optimization

Our aim is to minimize the re-projection error by adjusting:

  1. 5 Intrinsic Parameters \(\left( f_x, f_y, c_x, c_y, s \right)\)

  2. 3 Radial Distortion parameters and 2 Tangential Distortion parameters \(\left( k_1, k_2, k_3 \right) \space and \space \left( p_1, p_2 \right )\)

  3. 6 Extrinsic parameters per image ( 3 for Rotation and 3 for translation)

This means that the total number of optimizable parameters is:

$$N_{total} = 5(\text{Intrinsic}) + 3(\text{Radial}) + 2(\text{Tangential}) + 6(\text{Extrinsics})\cdot N_{Images}$$

We will first convert the three rotation parameters from the extrinsics to Rodrigues vectors, and then we will optimize all of them together using the Levenberg-Marquardt algorithm. It is a way of combining the strengths of both Gradient Descent and the Newton-Raphson method.
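The matrix-to-Rodrigues conversion can be sketched like this (a bare-bones version of my own that ignores the angle-near-π edge case; OpenCV's cv2.Rodrigues handles all cases robustly):

```python
import numpy as np

def rodrigues_from_matrix(R):
    """Convert a 3x3 rotation matrix to its 3-parameter axis-angle vector."""
    # The rotation angle comes from the trace: tr(R) = 1 + 2*cos(angle).
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(angle, 0.0):
        return np.zeros(3)
    # The axis comes from the antisymmetric part of R.
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(angle))
    return angle * axis
```

Using 3 parameters instead of 9 keeps the optimizer on the rotation manifold, which is exactly why the extrinsics contribute 6 parameters per image rather than 12.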

We already know that :

$$\begin{align} \text{Gradient Descent equation: } x_{n+1} &= x_n - \lambda \nabla f(x_n) \\ \text{Newton-Raphson equation: } x_{n+1} &= x_n - \left( \nabla ^2 f(x_n) \right) ^{-1}\nabla f(x_n) \end{align}$$

But regular Gradient Descent relies heavily on the learning rate, which, if too small, makes convergence very slow and, if too large, prevents convergence entirely because we keep jumping around the minimum. It also does not take into account the local curvature of the cost function, which Newton-Raphson does.

But the problem with Newton-Raphson is that we have to invert the Hessian at every step, which is computationally intensive and less robust.

Levenberg-Marquardt finds a workaround by combining the update terms of both methods into one:

$$x_{n+1} = x_n - \left( \nabla ^2 f(x_n) + \lambda I \right) ^{-1} \nabla f(x_n)$$

here \(\lambda\) acts as a damping parameter, which is usually adjusted automatically in implementations. As theory suggests, this combination converges faster and more reliably than either of the two previous approaches alone. However, there is a slight twist. Since we know for sure that the cost function is a least-squares objective, we can exploit this structure for further optimization. In fact, the equation used in practical implementations of Levenberg-Marquardt differs considerably:

$$x_{n+1} = x_n - \left( \mathbf{J}_n^{\small \mathsf{T}}\mathbf{J}_n \space + \space \lambda _n \mathbf{D}_n \right)^{-1} \space \mathbf{J}_n^{\small \mathsf{T}}r(x_n)$$

where:

  • \(r(x_n)\): Residual vector (error between model predictions and observations).

  • \(\mathbf{J}_n\) : Jacobian matrix of the residual vector

  • \(\mathbf{D}_n\)​: Scaling matrix (usually diagonal). Two common choices:

    1. Identity Matrix.

    2. Diagonal of \(\mathbf{J}^\top _n \mathbf{J}_n\) (ensures scale-invariance).

These changes are made to increase speed and ensure robustness. Note that \(\mathbf{J}^\top _n \mathbf{J}_n\) is the Gauss-Newton approximation of the Hessian. We will not go into the details, since this is not a Levenberg-Marquardt article, but these techniques help a lot in maintaining numerical stability and speed while not sacrificing accuracy.
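To make the damping scheme concrete, here is a toy LM loop applied to a small curve-fitting problem (all names are mine; real calibration code would apply the same loop to the full re-projection residual over all parameters):

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, lam0=1e-3, iters=100):
    """Toy LM loop: x <- x - (J^T J + lam * D)^{-1} J^T r, with D = diag(J^T J).

    lam is shrunk after an accepted step and grown after a rejected one."""
    x, lam = np.asarray(x0, dtype=float), lam0
    cost = 0.5 * np.sum(residual(x) ** 2)
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        JtJ = J.T @ J
        D = np.diag(np.diag(JtJ))          # Marquardt's scale-invariant choice
        step = np.linalg.solve(JtJ + lam * D, J.T @ r)
        x_new = x - step
        cost_new = 0.5 * np.sum(residual(x_new) ** 2)
        if cost_new < cost:                # accept: move, trust the model more
            x, cost, lam = x_new, cost_new, lam * 0.3
        else:                              # reject: stay put, damp harder
            lam *= 3.0
    return x
```

When the cost drops, shrinking lam pushes the update toward the fast Gauss-Newton step; when it rises, growing lam pushes it toward a short, safe gradient-descent step.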

I have provided a small implementation of this theory. It can be used to gain a better understanding through code, or to see the difference between OpenCV’s implementation and this rough one.

( Note: Though OpenCV also follows the same principles of Zhang’s method in its implementation, its optimization and numerical-stability techniques are far superior, which is why it produces such good results. )

Due to the intermittent nature of the development of this article and the corresponding implementation, there may be some inconsistencies in the LaTeX practices used, for which I apologize in advance. However, they don’t affect the clarity of the equations or arguments, so it should be of no big concern.

Thank you for your time.


References

  1. Linear Camera Model by First Principles of Computer Vision

  2. A Flexible New Technique for Camera Calibration, Zhengyou Zhang

  3. Multiple View Geometry in Computer Vision, Richard Hartley and Andrew Zisserman


Appendix

Derivation of Homography constraints:

$$\begin{align} h_1 &= \lambda K r_1 \\ h_2 &= \lambda K r_2 \end{align}$$

where h₁ and h₂ are the first two columns of H.

Now, let's solve for r₁ and r₂:

$$\begin{align} r_1 &= \frac{1}{\lambda}K^{-1}h_1 \\ r_2 &= \frac{1}{\lambda}K^{-1}h_2 \end{align}$$

Using the rotation properties: for orthogonality (r₁ᵀr₂ = 0):

$$\begin{gather} r_1^{\small T}r_2 = 0 \\ \\ \left(\frac{1}{\lambda}K^{\small -1}h_1\right)^{\small T} \cdot \left( \frac{1}{\lambda}K^{\small -1}h_2 \right) = 0 \\ \\ h_1^{\small T} (K^{\small -T} K^{\small -1}) h_2 = 0 \end{gather}$$

Let B = K⁻ᵀK⁻¹, then:

$$h_1^{\small T}Bh_2 = 0$$

For equal length (r₁ᵀr₁ = r₂ᵀr₂):

$$\begin{gather} r_1^{\small T}r_1 = r_2^{\small T}r_2 \\ \\ \left(\frac{1}{\lambda}K^{\small -1}h_1\right) ^{\small T} \left(\frac{1}{\lambda}K^{\small -1}h_1\right) = \left(\frac{1}{\lambda}K^{\small -1}h_2\right) ^{\small T} \left(\frac{1}{\lambda}K^{\small -1}h_2\right) \\ \\ h_1^{\small T} (K^{\small -T} K^{\small -1}) h_1 = h_2^{\small T} (K^{\small -T}K^{\small -1}) h_2 \end{gather}$$

Again with B = K⁻ᵀK⁻¹:

$$h_1^{\small T}Bh_1 = h_2^{\small T}Bh_2$$

