
Conjugate gradient method

The gradient descent method may be inefficient because it can fall into a zigzag pattern, repeating essentially the same search directions many times. This problem is avoided in the conjugate gradient (CG) method, which never repeats a previous search direction and, for a quadratic function of $N$ variables, converges in at most $N$ iterations. The CG method is therefore a significant improvement over the gradient descent method.

We will first assume the function $f({\bf x})$ to be minimized is quadratic:

\begin{displaymath}f({\bf x})=\frac{1}{2}{\bf x}^T{\bf A}{\bf x}-{\bf b}^T{\bf x}+c \end{displaymath}

where ${\bf A}={\bf A}^T$ is symmetric. The gradient (the vector of first derivatives) of the function is:

\begin{displaymath}
{\bf g}=\frac{d}{d{\bf x}}f({\bf x})
=\frac{d}{d{\bf x}}\left(\frac{1}{2}{\bf x}^T{\bf A}{\bf x}-{\bf b}^T{\bf x}+c\right)
={\bf A}{\bf x}-{\bf b}
\end{displaymath}

and the Hessian matrix (second derivatives) of the function is simply ${\bf A}$, which is assumed to be positive definite, so that $f({\bf x})$ has a minimum.

At the solution ${\bf x}$ where $f({\bf x})$ is minimized, we have ${\bf g}={\bf A}{\bf x}-{\bf b}={\bf0}$. We see that minimizing the quadratic function $f({\bf x})=\frac{1}{2}{\bf x}^T{\bf A}{\bf x}-{\bf b}^T{\bf x}+c$ is equivalent to solving the linear equation system ${\bf A}{\bf x}={\bf b}$ for ${\bf x}$, given a symmetric positive definite matrix ${\bf A}$ and a vector ${\bf b}$. The CG method considered below can therefore be used for solving both problems. Later we will relax the condition on $f({\bf x})$ and consider using CG to minimize non-quadratic functions.
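
As a quick numerical illustration (a sketch in Python/NumPy, with an arbitrarily chosen positive definite ${\bf A}$ and vector ${\bf b}$, not taken from the text), the gradient of $f$ indeed vanishes at the solution of ${\bf A}{\bf x}={\bf b}$:

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])          # symmetric positive definite (example values)
b = np.array([1.0, 2.0])
c = 5.0

def f(x):
    # Quadratic objective f(x) = x^T A x / 2 - b^T x + c
    return 0.5 * x @ A @ x - b @ x + c

x_star = np.linalg.solve(A, b)      # solution of A x = b
g_star = A @ x_star - b             # gradient of f at x_star

print("x* =", x_star)               # minimizer of f
print("gradient at x* =", g_star)   # ~ [0, 0]: x* is the stationary (minimum) point of f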

Before discussing the CG algorithm, we first consider the concept of conjugate vectors, which is the key to the CG method. A set of vectors $\{{\bf d}_0,\cdots,{\bf d}_{N-1}\}$ satisfying

\begin{displaymath}
\langle{\bf d}_i,{\bf A}{\bf d}_j\rangle
={\bf d}_i^T{\bf A}{\bf d}_j=0\;\;\;\;(i\ne j),
\;\;\;\;\;\;\;\;{\bf A}^T={\bf A},\;\;\;\;\;0\le i,j\le N-1
\end{displaymath}

is said to be mutually conjugate with respect to ${\bf A}$. The vectors are also said to be A-conjugate or A-orthogonal to each other. Comparing this with two orthogonal vectors ${\bf u}$ and ${\bf v}$ satisfying $\langle{\bf u},{\bf v}\rangle=\langle {\bf v},{\bf u}\rangle=0$, we see that two conjugate vectors satisfying $\langle{\bf d}_i,{\bf A}{\bf d}_j\rangle
=\langle{\bf d}_j,{\bf A}{\bf d}_i\rangle=0$ can be considered as orthogonal to each other with respect to ${\bf A}$. As these $N$ vectors are linearly independent, they form a basis that spans the N-D space, in which any vector can be expressed as a linear combination of them.
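
For example (an illustrative sketch, not part of the original notes), the orthonormal eigenvectors of a symmetric positive definite ${\bf A}$ form one such A-orthogonal set, since ${\bf d}_i^T{\bf A}{\bf d}_j=\lambda_j{\bf d}_i^T{\bf d}_j=0$ for $i\ne j$:

import numpy as np

# Sketch: the (orthonormal) eigenvectors of a symmetric A are mutually A-orthogonal.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])      # symmetric positive definite (example values)
eigvals, D = np.linalg.eigh(A)       # columns of D: eigenvectors, used as d_0, d_1, d_2

print(np.round(D.T @ A @ D, 10))     # diagonal matrix: d_i^T A d_j = 0 for all i != j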

The basic strategy of the CG method is to follow a search path composed of $N$ segments, each along one of $N$ mutually conjugate search directions. After the nth iteration, which travels along the search direction ${\bf d}_n$ to arrive at ${\bf x}_{n+1}$, the search direction ${\bf d}_{n+1}$ of the next iteration will be A-conjugate to the previous one ${\bf d}_n$. By doing so, the total error at the initial guess ${\bf x}_0$

\begin{displaymath}
{\bf e}_0={\bf x}_0-{\bf x}=\sum_{i=0}^{N-1} c_i{\bf d}_i
\end{displaymath}

will be eliminated one component at a time in each of the $N$ iterations, so that after $N$ such iterations the error at ${\bf x}_N$ becomes zero, ${\bf e}_N={\bf x}_N-{\bf x}={\bf0}$, and ${\bf x}_N={\bf x}$ becomes the solution.

Now we consider specifically the CG algorithm with the following general iteration:

\begin{displaymath}
{\bf x}_{n+1}={\bf x}_n+\delta_n \; {\bf d}_n
={\bf x}_n-\frac{{\bf g}_n^T{\bf d}_n}{{\bf d}_n^T{\bf A}{\bf d}_n}{\bf d}_n
\end{displaymath}

where $\delta_n=-{\bf g}_n^T{\bf d}_n/{\bf d}_n^T{\bf A}{\bf d}_n$ is the optimal step size derived previously. Subtracting ${\bf x}={\bf x}_{n+1}-{\bf e}_{n+1}={\bf x}_n-{\bf e}_n$ from both sides, we get

\begin{displaymath}
{\bf e}_{n+1}
={\bf e}_n-\frac{{\bf g}_n^T{\bf d}_n}{{\bf d}_n^T{\bf A}{\bf d}_n}{\bf d}_n
\end{displaymath}

where ${\bf g}_n$ is the gradient of $f({\bf x})$ at ${\bf x}_n$, which can be found to be

\begin{displaymath}
{\bf g}_n={\bf A}{\bf x}_n-{\bf b}={\bf A}({\bf x}+{\bf e}_n)-{\bf b}
={\bf A}{\bf x}-{\bf b}+{\bf A}{\bf e}_n={\bf A}{\bf e}_n
\end{displaymath}

(Note that ${\bf A}{\bf x}-{\bf b}={\bf g}={\bf0}$.) Substituting this ${\bf g}_n={\bf A}{\bf e}_n$ into the previous equation we get

\begin{displaymath}
{\bf e}_{n+1}
={\bf e}_n-\frac{{\bf g}_n^T{\bf d}_n}{{\bf d}_n^T{\bf A}{\bf d}_n}{\bf d}_n
={\bf e}_n-\frac{{\bf d}_n^T{\bf A}{\bf e}_n}{{\bf d}_n^T{\bf A}{\bf d}_n}{\bf d}_n
\end{displaymath}

Pre-multiplying both sides by ${\bf d}^T_n{\bf A}$ we get

\begin{displaymath}
{\bf d}_n^T{\bf A}{\bf e}_{n+1}
={\bf d}^T_n{\bf A}\left({\bf e}_n-\frac{{\bf d}_n^T{\bf A}{\bf e}_n}{{\bf d}_n^T{\bf A}{\bf d}_n}{\bf d}_n\right)
={\bf d}^T_n{\bf A}{\bf e}_n-\left(\frac{{\bf d}_n^T{\bf A}{\bf e}_n}{{\bf d}_n^T{\bf A}{\bf d}_n}\right){\bf d}_n^T{\bf A}{\bf d}_n=0
\end{displaymath}

We see that after taking the nth step along the direction ${\bf d}_n$ to arrive at ${\bf x}_{n+1}$, the remaining error ${\bf e}_{n+1}$ is A-conjugate to the previous search direction ${\bf d}_n$. To reduce the error as much as possible, the next search direction ${\bf d}_{n+1}$ should be one of the directions that span the remaining error ${\bf e}_{n+1}$, i.e., A-conjugate to the previous search direction ${\bf d}_n$ (instead of merely orthogonal to it, as in the gradient descent method), so that the component of ${\bf e}_{n+1}$ along ${\bf d}_{n+1}$ can be eliminated. Carrying out such a search for $N$ iterations, each eliminating one of the $N$ components of the initial error ${\bf e}_0$, we get ${\bf e}_N={\bf0}$, i.e., the error is completely eliminated.

To see this, we premultiply both sides of the initial error equation ${\bf e}_0=\sum_{j=0}^{N-1}c_j{\bf d}_j$ by ${\bf d}^T_n{\bf A}$ and get

\begin{displaymath}
{\bf d}^T_n{\bf A}{\bf e}_0=\sum_{j=0}^{N-1}c_j{\bf d}^T_n{\bf A}{\bf d}_j
=c_n{\bf d}^T_n{\bf A}{\bf d}_n
\end{displaymath}

Solving for $c_n$ we get

\begin{displaymath}
c_n=\frac{{\bf d}^T_n{\bf A}{\bf e}_0}{{\bf d}^T_n{\bf A}{\bf d}_n}
=\frac{{\bf d}^T_n{\bf A}\left({\bf e}_0+\sum_{i=0}^{n-1}\delta_i{\bf d}_i\right)}{{\bf d}^T_n{\bf A}{\bf d}_n}
=\frac{{\bf d}^T_n{\bf A}{\bf e}_n}{{\bf d}^T_n{\bf A}{\bf d}_n}
\end{displaymath}

Here we have used the fact that ${\bf e}_n={\bf e}_0+\sum_{i=0}^{n-1}\delta_i{\bf d}_i$ and that the nth search direction ${\bf d}_n$ is A-orthogonal to all previous directions. Comparing this result with the expression for $\delta_n$ obtained above, we see that $\delta_n=-c_n$, and the expression for ${\bf e}_{n+1}$ can now be written as

\begin{displaymath}
{\bf e}_{n+1}={\bf e}_0+\sum_{j=0}^n\delta_j{\bf d}_j
={\bf e}_0-\sum_{j=0}^n c_j{\bf d}_j
=\sum_{j=0}^{N-1}c_j{\bf d}_j-\sum_{j=0}^n c_j{\bf d}_j=\sum_{j=n+1}^{N-1}c_j{\bf d}_j
\end{displaymath}

We see that as $n$ increases from 0 to $N-1$, the error is eliminated one component at a time, from ${\bf e}_0$ down to ${\bf e}_N={\bf0}$ after $N$ steps, at which point the true solution ${\bf x}_N={\bf x}$ is obtained.

The specific A-conjugate search directions $\{{\bf d}_0,\cdots,{\bf d}_{N-1}\}$ satisfying ${\bf d}_i^T{\bf A}{\bf d}_j=0$ for any $i\ne j$ can be constructed by the Gram-Schmidt process from any set of $N$ independent vectors $\{{\bf v}_0,\cdots,{\bf v}_{N-1}\}$, which can be used as a basis spanning the space. Specifically, starting from ${\bf d}_0={\bf v}_0$, we construct each subsequent ${\bf d}_n$ for $n>0$ from ${\bf v}_n$, with all of its components along the previous directions ${\bf d}_i$ ( $i=0,\cdots,n-1$) removed:

\begin{displaymath}
{\bf d}_n={\bf v}_n-\sum_{j=0}^{n-1} {\bf p}_{{\bf d}_j}({\bf v}_n)
={\bf v}_n-\sum_{j=0}^{n-1} \beta_{nj}{\bf d}_j
\end{displaymath}

where $\beta_{nj}{\bf d}_j$ ($j<n$) is the A-projection of ${\bf v}_n$ onto ${\bf d}_j$. The coefficient $\beta_{nj}$ can be found by pre-multiplying both sides by ${\bf d}^T_i{\bf A}$ with $i<n$ to get:

\begin{displaymath}
{\bf d}^T_i{\bf A}{\bf d}_n={\bf d}^T_i{\bf A}{\bf v}_n
-\sum_{j=0}^{n-1}\beta_{nj}{\bf d}^T_i{\bf A}{\bf d}_j
={\bf d}^T_i{\bf A}{\bf v}_n-\beta_{ni}{\bf d}^T_i{\bf A}{\bf d}_i=0
\end{displaymath}

Solving for $\beta_{nj}$ we get:

\begin{displaymath}
\beta_{nj}=\frac{{\bf d}^T_j{\bf A}{\bf v}_n}{{\bf d}^T_j{\bf A}{\bf d}_j}
\end{displaymath}

and the A-projection of ${\bf v}_n$ onto ${\bf d}_j$ is

\begin{displaymath}
{\bf p}_{{\bf d}_j}({\bf v}_n)=\beta_{nj}{\bf d}_j
=\left(\frac{{\bf d}^T_j{\bf A}{\bf v}_n}{{\bf d}^T_j{\bf A}{\bf d}_j}\right)
{\bf d}_j
\end{displaymath}

Now the nth direction can be expressed as:

\begin{displaymath}
{\bf d}_n={\bf v}_n-\sum_{j=0}^{n-1} \beta_{nj}{\bf d}_j
={\bf v}_n-\sum_{j=0}^{n-1}\left(\frac{{\bf d}^T_j{\bf A}{\bf v}_n}{{\bf d}^T_j{\bf A}{\bf d}_j}\right)
{\bf d}_j
\end{displaymath}
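
This construction is straightforward to carry out numerically. Below is a minimal sketch in Python/NumPy (the matrix, the basis vectors and the function name are illustrative choices, not from the original notes):

import numpy as np

def a_conjugate_directions(A, V):
    # Gram-Schmidt A-conjugation: build directions d_0,...,d_{N-1} (columns of D)
    # that are mutually A-orthogonal from the independent columns of V.
    N = V.shape[1]
    D = np.zeros_like(V)
    for n in range(N):
        d = V[:, n].copy()
        for j in range(n):                      # remove the A-projections onto previous d_j
            beta = (D[:, j] @ A @ V[:, n]) / (D[:, j] @ A @ D[:, j])
            d -= beta * D[:, j]
        D[:, n] = d
    return D

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])                 # symmetric positive definite (example values)
V = np.eye(3)                                   # any N independent vectors will do
D = a_conjugate_directions(A, V)
print(np.round(D.T @ A @ D, 10))                # diagonal matrix: d_i^T A d_j = 0 for i != j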

In particular, we could construct the $N$ A-orthogonal search directions ${\bf d}_n$ ( $n=0,\cdots,N-1$) from the gradient directions, ${\bf v}_n=-{\bf g}_n$. The benefit of choosing this particular set of directions is that the computational cost is much reduced, as we will see later. Premultiplying both sides of ${\bf e}_n=\sum_{j=n}^{N-1}c_j{\bf d}_j$ by ${\bf d}_m^T{\bf A}$ with $m<n$ we get:

\begin{displaymath}
{\bf d}_m^T{\bf A}{\bf e}_n={\bf d}_m^T{\bf g}_n
=\sum_{j=n}^{N-1}c_j{\bf d}_m^T{\bf A}{\bf d}_j=0
\end{displaymath}

We see that ${\bf d}_m^T{\bf g}_n={\bf g}_n^T{\bf d}_m=0$ for all $n>m$, i.e., ${\bf g}_n$ is orthogonal to all previously traveled directions ${\bf d}_m$. Now the Gram-Schmidt process can be written as

\begin{displaymath}
{\bf d}_m={\bf v}_m-\sum_{j=0}^{m-1} \beta_{mj}{\bf d}_j
=-{\bf g}_m-\sum_{j=0}^{m-1} \beta_{mj}{\bf d}_j
\end{displaymath}

where

\begin{displaymath}
\beta_{mj}=\frac{{\bf d}_j^T{\bf A}{\bf v}_m}{{\bf d}^T_j{\bf A}{\bf d}_j}
=-\frac{{\bf d}^T_j{\bf A}{\bf g}_m}{{\bf d}^T_j{\bf A}{\bf d}_j}\;\;\;\;\;\;(m>j)
\end{displaymath}

Premultiplying ${\bf g}_n^T$ ($n\ge m$) on both sides we get

\begin{displaymath}
{\bf g}_n^T{\bf d}_m=-{\bf g}_n^T{\bf g}_m-\sum_{j=0}^{m-1}\beta_{mj}{\bf g}_n^T{\bf d}_j
=\left\{\begin{array}{ll}-\vert\vert{\bf g}_n\vert\vert^2&m=n\\
0&m<n\end{array}\right.
\end{displaymath}

Note that all terms in the summation are zero as ${\bf g}_n^T{\bf d}_j=0$ for all $j<m\le n$. Substituting ${\bf g}_n^T{\bf d}_n=-\vert\vert{\bf g}_n\vert\vert^2$ into the expression for the step size $\delta_n$, we get

\begin{displaymath}
\delta_n=-\frac{{\bf g}_n^T{\bf d}_n}{{\bf d}_n^T{\bf A}{\bf d}_n}
=\frac{\vert\vert{\bf g}_n\vert\vert^2}{{\bf d}_n^T{\bf A}{\bf d}_n}
\end{displaymath}

Next we consider

\begin{displaymath}
{\bf g}_{m+1}={\bf A}{\bf x}_{m+1}-{\bf b}
={\bf A}({\bf x}_m+\delta_m{\bf d}_m)-{\bf b}
={\bf A}{\bf x}_m-{\bf b}+\delta_m{\bf A}{\bf d}_m
={\bf g}_m+\delta_m{\bf A}{\bf d}_m
\end{displaymath}

Premultiplying ${\bf g}_n^T$ on both sides we get

\begin{displaymath}
{\bf g}_n^T{\bf g}_{m+1}={\bf g}_n^T{\bf g}_m+\delta_m{\bf g}_n^T{\bf A}{\bf d}_m
\end{displaymath}

Solving for ${\bf g}_n^T{\bf A}{\bf d}_m$, and noting that the gradients are mutually orthogonal ( ${\bf g}_n^T{\bf g}_m=0$ for $m\ne n$, which follows from the $m<n$ case of the expression for ${\bf g}_n^T{\bf d}_m$ above), we get

\begin{displaymath}
{\bf g}_n^T{\bf A}{\bf d}_m
=\frac{1}{\delta_m}\left( {\bf g}_n^T{\bf g}_{m+1}-{\bf g}_n^T{\bf g}_m\right)
=\left\{\begin{array}{ll}
\vert\vert{\bf g}_n\vert\vert^2/\delta_{n-1}&m=n-1\\
0&m<n-1\end{array}\right.
\end{displaymath}

Finally the mth coefficient in the Gram-Schmidt process for ${\bf d}_n$ ($m<n$) can be written as

\begin{displaymath}
\beta_{nm}=-\frac{{\bf d}_m^T{\bf A}{\bf g}_n}{{\bf d}^T_m{\bf A}{\bf d}_m}
=-\frac{{\bf g}_n^T{\bf A}{\bf d}_m}{{\bf d}^T_m{\bf A}{\bf d}_m}
=\left\{\begin{array}{ll}
-\vert\vert{\bf g}_n\vert\vert^2/(\delta_{n-1}\,{\bf d}_{n-1}^T{\bf A}{\bf d}_{n-1})&m=n-1\\
0&m<n-1\end{array}\right.
\end{displaymath}

Note that $\beta_{nm}=0$ except when $m=n-1$, i.e., there is only one non-zero term left in the summation of the update formula for ${\bf d}_n$. This is the reason why we choose ${\bf v}_n=-{\bf g}_n$. We can therefore drop the second subscript and write $\beta_{n,n-1}=\beta_n$. Substituting the step size $\delta_{n-1}=\vert\vert{\bf g}_{n-1}\vert\vert^2/{\bf d}_{n-1}^T{\bf A}{\bf d}_{n-1}$ obtained above into the expression for $\beta_n$ we get

\begin{displaymath}\beta_n=-\frac{\vert\vert{\bf g}_n\vert\vert^2}{\vert\vert{\bf g}_{n-1}\vert\vert^2} \end{displaymath}

In summary, here is the conjugate gradient algorithm (note ${\bf g}_n=-{\bf r}_n$); a short implementation sketch follows the listed steps:

  1. Set $n=0$ and initialize the search direction (same as gradient descent):

    \begin{displaymath}{\bf d}_0={\bf r}_0=-{\bf g}_0={\bf b}-{\bf A}{\bf x}_0 \end{displaymath}

  2. Termination check: if the error $\vert\vert{\bf r}_n\vert\vert^2$ is smaller than a preset threshold, stop. Otherwise, find step size:

    \begin{displaymath}\delta_n=\frac{\vert\vert{\bf g}_n\vert\vert^2}{{\bf d}_n^T{\bf A}{\bf d}_n}
=\frac{\vert\vert{\bf r}_n\vert\vert^2}{{\bf d}_n^T{\bf A}{\bf d}_n} \end{displaymath}

  3. Step forward:

    \begin{displaymath}{\bf x}_{n+1}={\bf x}_n+\delta_n{\bf d}_n \end{displaymath}

  4. Update gradient:

    \begin{displaymath}{\bf g}_{n+1}={\bf g}_n+\delta_n{\bf A}{\bf d}_n \end{displaymath}

  5. Find coefficient for Gram-Schmidt process:

    \begin{displaymath}\beta_{n+1}=-\frac{\vert\vert{\bf g}_{n+1}\vert\vert^2}{\vert\vert{\bf g}_n\vert\vert^2}
=-\frac{\vert\vert{\bf r}_{n+1}\vert\vert^2}{\vert\vert{\bf r}_n\vert\vert^2} \end{displaymath}

  6. Update search direction:

    \begin{displaymath}{\bf d}_{n+1}=-{\bf g}_{n+1}-\beta_{n+1}{\bf d}_n
={\bf r}_{n+1}-\beta_{n+1}{\bf d}_n \end{displaymath}

    Set $n=n+1$ and go back to step 2.
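
The six steps above translate directly into code. Here is a minimal sketch in Python/NumPy (the function name, tolerance, and the right-hand side ${\bf b}$ are illustrative choices; the matrix ${\bf A}$ is the one used in the $N=3$ example later in this section):

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-12, max_iter=None):
    # Minimize f(x) = x^T A x / 2 - b^T x, i.e. solve A x = b, for symmetric
    # positive definite A, following the six steps listed above (r_n = -g_n).
    x = x0.astype(float)
    r = b - A @ x                        # step 1: r_0 = -g_0 = b - A x_0, d_0 = r_0
    d = r.copy()
    if max_iter is None:
        max_iter = len(b)                # at most N iterations for a quadratic f
    for _ in range(max_iter):
        if r @ r < tol:                  # step 2: termination check on ||r_n||^2
            break
        Ad = A @ d
        delta = (r @ r) / (d @ Ad)       # step 2: optimal step size
        x = x + delta * d                # step 3: step forward
        r_new = r - delta * Ad           # step 4: g_{n+1} = g_n + delta A d_n
        beta = -(r_new @ r_new) / (r @ r)    # step 5: beta_{n+1} = -||r_{n+1}||^2 / ||r_n||^2
        d = r_new - beta * d             # step 6: d_{n+1} = r_{n+1} - beta_{n+1} d_n
        r = r_new
    return x

A = np.array([[5.0, 3.0, 1.0],           # the matrix of the N = 3 example below
              [3.0, 4.0, 2.0],
              [1.0, 2.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])            # illustrative right-hand side
x = conjugate_gradient(A, b, x0=np.zeros(3))
print(np.allclose(A @ x, b))             # True: converged in at most N = 3 steps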

The algorithm above assumes the objective function $f({\bf x})$ to be quadratic with known ${\bf A}$. However, when $f({\bf x})$ is not quadratic and ${\bf A}$ is not available, the algorithm can be modified so that it does not depend on ${\bf A}$: the step size $\delta_n$ can be obtained by a line minimization along ${\bf d}_n$, and the coefficient $\beta_{n+1}$ can be computed from the gradients alone, as discussed below.

In the figure below, the conjugate gradient method is compared with the gradient descent method for the case of $N=2$. We see that the first search direction, $-{\bf g}_0$, is the same for both methods. However, the next direction ${\bf d}_1$ is A-orthogonal to ${\bf d}_0$ (as is the remaining error ${\bf e}_1$), unlike the search direction $-{\bf g}_1$ of the gradient descent method. The conjugate gradient method finds the solution ${\bf x}$ in $N=2$ steps, while the gradient descent method has to go through many more steps, each orthogonal to the previous one, before it finds the solution.

[Figure ConjugateGradient.png: search paths of the conjugate gradient and gradient descent methods for $N=2$]

Now we relax the requirement that the function $f({\bf x})$ be quadratic, but we still assume that it can be approximated by a quadratic function near its minimum. In this case the update of the search direction, or specifically the coefficient $\beta_{n+1}$, which was derived under the assumption that $f({\bf x})$ is quadratic, may no longer be valid. Some alternative formulas for $\beta_{n+1}$ (the Polak-Ribière and Hestenes-Stiefel forms, written here with the same sign convention as above) can be used:

\begin{displaymath}
\beta_{n+1}=-\frac{{\bf g}^T_{n+1}({\bf g}_{n+1}-{\bf g}_n)}{\vert\vert{\bf g}_n\vert\vert^2}
\;\;\;\;\;\;\mbox{or}\;\;\;\;\;\;
\beta_{n+1}=-\frac{{\bf g}^T_{n+1}({\bf g}_{n+1}-{\bf g}_n)}{{\bf d}_n^T({\bf g}_{n+1}-{\bf g}_n)}
\end{displaymath}

These expressions are identical to $\beta_{n+1}=-\vert\vert{\bf g}_{n+1}\vert\vert^2/\vert\vert{\bf g}_n\vert\vert^2$ when $f({\bf x})$ is indeed quadratic, since then ${\bf g}_{n+1}^T{\bf g}_n=0$ and ${\bf d}_n^T({\bf g}_{n+1}-{\bf g}_n)=\vert\vert{\bf g}_n\vert\vert^2$. Note that for a general $f({\bf x})$ the computed $\beta_{n+1}$ may have the wrong sign (here, $\beta_{n+1}>0$). If this happens, we reset $\beta_{n+1}=0$, i.e., the next search direction is simply ${\bf d}_{n+1}=-{\bf g}_{n+1}-\beta_{n+1}{\bf d}_n=-{\bf g}_{n+1}$, the same as in the gradient descent method.
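
A rough sketch of the resulting nonlinear CG iteration in Python/NumPy, using the first (Polak-Ribière type) formula above with the reset, and a simple backtracking line search standing in for a proper line minimization (the objective, the Rosenbrock function, and all parameter values are illustrative assumptions):

import numpy as np

def f(x):
    # Rosenbrock function: a standard non-quadratic test problem (minimum at (1, 1)).
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

def grad_f(x):
    return np.array([-2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0]**2),
                     200.0 * (x[1] - x[0]**2)])

def backtracking(f, x, d, g, alpha=1.0, rho=0.5, c=1e-4, max_shrink=60):
    # Crude stand-in for a line minimization: shrink alpha until the Armijo
    # sufficient-decrease condition holds.
    for _ in range(max_shrink):
        if f(x + alpha * d) <= f(x) + c * alpha * (g @ d):
            break
        alpha *= rho
    return alpha

x = np.array([-1.2, 1.0])
g = grad_f(x)
d = -g                                          # first direction: steepest descent
for n in range(2000):
    if g @ g < 1e-16:
        break
    alpha = backtracking(f, x, d, g)
    x = x + alpha * d
    g_new = grad_f(x)
    beta = -(g_new @ (g_new - g)) / (g @ g)     # Polak-Ribiere type coefficient (sign convention as above)
    if beta > 0:
        beta = 0.0                              # reset to the gradient direction
    d = -g_new - beta * d
    if g_new @ d >= 0:                          # safeguard: keep d a descent direction
        d = -g_new
    g = g_new
print(x, f(x))                                  # x approaches the minimum at (1, 1)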

Example To compare the conjugate gradient method and the gradient descent method, consider a very simple 2-D quadratic function

\begin{displaymath}
f({\bf x})=\frac{1}{2}{\bf x}^T{\bf A}{\bf x}
=\frac{1}{2}[x_1,\;x_2]\left[\begin{array}{cc}3 & 1\\ 1 & 2\end{array}\right]
\left[\begin{array}{c}x_1\\ x_2\end{array}\right]
\end{displaymath}

The performance of the gradient descent method depends significantly on the initial guess. For the specific initial guess ${\bf x}_0=[1.5,\;-0.75]^T$, the iteration gets into a zigzag pattern and the convergence is very slow, as shown in the table below.

n ${\bf x}=[x_1, x_2]$ $f({\bf x})$
0 1.500000, -0.750000 2.812500e+00
1 0.250000, -0.750000 4.687500e-01
2 0.250000, -0.125000 7.812500e-02
3 0.041667, -0.125000 1.302083e-02
4 0.041667, -0.020833 2.170139e-03
5 0.006944, -0.020833 3.616898e-04
6 0.006944, -0.003472 6.028164e-05
7 0.001157, -0.003472 1.004694e-05
8 0.001157, -0.000579 1.674490e-06
9 0.000193, -0.000579 2.790816e-07
10 0.000193, -0.000096 4.651361e-08
11 0.000032, -0.000096 7.752268e-09
12 0.000032, -0.000016 1.292045e-09
13 0.000005, -0.000016 2.153408e-10

However, as expected, the conjugate gradient method takes exactly $N=2$ steps from any initial guess to reach the solution, as shown below.

n ${\bf x}=[x_1, x_2]$ $f({\bf x})$
0 1.500000, -0.750000 2.812500e+00
1 0.250000, -0.750000 4.687500e-01
2 0.000000, -0.000000 1.155558e-33

[Figure GDvsCG1.png: gradient descent versus conjugate gradient iterations for the 2-D example]

For an $N=3$ example of $f({\bf x})=\frac{1}{2}{\bf x}^T{\bf A}{\bf x}$ with

\begin{displaymath}
{\bf A}=\left[\begin{array}{ccc}5 & 3 & 1\\ 3 & 4 & 2\\ 1 & 2 & 3\end{array}\right]
\end{displaymath}

from an initial guess ${\bf x}_0=[1,\;2,\;3]^T$, it takes the gradient descent method 36 iterations to reach ${\bf x}_{36}=[0.000017,\;-0.000020,\;0.000018]^T$ corresponding to $f({\bf x})=6.002596\times 10^{-10}$. From the same initial guess, it takes the conjugate gradient method only $N=3$ iterations to converge to the solution:

n ${\bf x}=[x_1, x_2, x_3]$ $f({\bf x})$
0 1.000000, 2.000000, 3.000000 4.500000e+01
1 -0.734716, -0.106441, 1.265284 2.809225e+00
2 0.123437, -0.209498, 0.136074 3.584736e-02
3 -0.000000, 0.000000, 0.000000 3.949119e-31
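
The $N=3$ run above is easy to reproduce. A short sketch in Python/NumPy (using the objective with the $1/2$ factor, so that ${\bf g}={\bf A}{\bf x}$ and ${\bf b}={\bf0}$):

import numpy as np

A = np.array([[5.0, 3.0, 1.0],
              [3.0, 4.0, 2.0],
              [1.0, 2.0, 3.0]])
x = np.array([1.0, 2.0, 3.0])
r = -A @ x                         # r_0 = -g_0 = b - A x_0 with b = 0
d = r.copy()
print(0, x, 0.5 * x @ A @ x)
for n in range(1, 4):
    delta = (r @ r) / (d @ A @ d)
    x = x + delta * d
    r_new = r - delta * (A @ d)
    d = r_new + (r_new @ r_new) / (r @ r) * d   # same as d = r_new - beta d with beta = -||r_new||^2/||r||^2
    r = r_new
    print(n, x, 0.5 * x @ A @ x)   # matches the table above row by row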

The conjugate gradient method for solving linear equation systems:

As discussed before, if ${\bf x}$ is the solution that minimizes the quadratic function $f({\bf x})={\bf x}^T{\bf A}{\bf x}/2-{\bf b}^T{\bf x}+c$, with ${\bf A}$ being symmetric and positive definite, it also satisfies $d f({\bf x})/d{\bf x}={\bf A}{\bf x}-{\bf b}={\bf0}$. In other words, the optimization problem is equivalent to the problem of solving the linear system ${\bf A}{\bf x}={\bf b}$, and both can be solved by the conjugate gradient method.

Now consider solving the linear system ${\bf A}{\bf x}={\bf b}$ with ${\bf A}={\bf A}^T$. Let ${\bf d}_i$ ($i=1,\cdots,N$) be a set of $N$ A-orthogonal vectors satisfying ${\bf d}_i^T{\bf A}{\bf d}_j=0$ for $i\ne j$. The solution ${\bf x}$ of the equation ${\bf A}{\bf x}={\bf b}$ can be represented by these $N$ vectors as

\begin{displaymath}
{\bf x}=\sum_{i=1}^N c_i{\bf d}_i
\end{displaymath}

Now we have

\begin{displaymath}
{\bf b}={\bf A}{\bf x}={\bf A}\left[\sum_{i=1}^N c_i{\bf d}_i\right]
=\sum_{i=1}^N c_i{\bf A}{\bf d}_i
\end{displaymath}

Premultiplying ${\bf d}_j^T$ on both sides we get

\begin{displaymath}
{\bf d}_j^T{\bf b}=\sum_{i=1}^N c_i{\bf d}_j^T{\bf A}{\bf d}_i=c_j{\bf d}_j^T{\bf A}{\bf d}_j
\end{displaymath}

Solving for $c_j$ we get

\begin{displaymath}
c_j=\frac{{\bf d}_j^T{\bf b}}{{\bf d}_j^T{\bf A}{\bf d}_j}
\end{displaymath}

Substituting this back to the expression for ${\bf x}$ we get the solution of the equation:

\begin{displaymath}
{\bf x}=\sum_{i=1}^N c_i{\bf d}_i
=\sum_{i=1}^N \left(\frac{{\bf d}_i^T{\bf b}}{{\bf d}_i^T{\bf A}{\bf d}_i}\right)
{\bf d}_i
\end{displaymath}

Also note that as ${\bf b}={\bf A}{\bf x}$, the ith term of the summation above is simply the A-projection of ${\bf x}$ onto the ith direction ${\bf d}_i$:

\begin{displaymath}
{\bf p}_{{\bf d}_i}({\bf x})
=\left(\frac{{\bf d}_i^T{\bf A}{\bf x}}{{\bf d}_i^T{\bf A}{\bf d}_i}\right){\bf d}_i
=\left(\frac{{\bf d}_i^T{\bf b}}{{\bf d}_i^T{\bf A}{\bf d}_i}\right){\bf d}_i
\end{displaymath}
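
A quick numerical check of this expansion (a Python/NumPy sketch; the eigenvectors of ${\bf A}$ are used as one convenient A-orthogonal set, and ${\bf b}$ is an arbitrary example):

import numpy as np

A = np.array([[5.0, 3.0, 1.0],
              [3.0, 4.0, 2.0],
              [1.0, 2.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])        # example right-hand side

_, D = np.linalg.eigh(A)             # eigenvectors of symmetric A form an A-orthogonal set
x = sum((D[:, i] @ b) / (D[:, i] @ A @ D[:, i]) * D[:, i] for i in range(3))
print(np.allclose(A @ x, b))         # True: the expansion recovers the solution of A x = b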

One application of the conjugate gradient method is to solve the normal equation to find the least-square solution of an over-constrained equation system ${\bf A}{\bf x}={\bf b}$, where the coefficient matrix ${\bf A}$ is $M$ by $N$ of rank $R$, i.e., $M>N=R$. As discussed previously, the normal equation of this system is

\begin{displaymath}{\bf A}^T{\bf A}{\bf x}={\bf A}^T{\bf b} \end{displaymath}

Here ${\bf A}^T{\bf A}$ is an $N$ by $N$ symmetric, positive definite matrix. This normal equation can be solved by the conjugate gradient method.
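
A minimal sketch of this use of CG in Python/NumPy (the over-determined system is randomly generated for illustration; the result is compared against a standard least-squares solver):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))      # M = 8 > N = 3, full column rank with probability one
b = rng.standard_normal(8)

M_mat = A.T @ A                      # N x N, symmetric positive definite
rhs = A.T @ b                        # right-hand side of the normal equation

x = np.zeros(3)
r = rhs - M_mat @ x
d = r.copy()
for _ in range(3):                   # at most N = 3 CG iterations
    if r @ r < 1e-20:
        break
    delta = (r @ r) / (d @ M_mat @ d)
    x = x + delta * d
    r_new = r - delta * (M_mat @ d)
    d = r_new + (r_new @ r_new) / (r @ r) * d
    r = r_new

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True: same least-squares solution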


Ruye Wang 2015-02-12