The gradient descent method may be inefficient because it can fall into a zigzag pattern, repeating essentially the same search directions many times. This problem is avoided by the conjugate gradient (CG) method, which never repeats a previous search direction and converges in at most $N$ iterations, where $N$ is the dimension of the problem. The CG method is therefore a significant improvement over the gradient descent method.

We will first assume the function to be minimized is quadratic:

$$f(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}-\mathbf{b}^T\mathbf{x}+c$$

where $\mathbf{A}$ is symmetric. The gradient (first derivatives) of the function is:

$$\mathbf{g}(\mathbf{x})=\nabla f(\mathbf{x})=\mathbf{A}\mathbf{x}-\mathbf{b}$$

and the Hessian matrix (second derivatives) of the function is simply $\mathbf{A}$, which is assumed to be positive definite, so that $f(\mathbf{x})$ has a minimum.

At the solution $\mathbf{x}^*$ where $f(\mathbf{x})$ is minimized, we have $\mathbf{g}(\mathbf{x}^*)=\mathbf{A}\mathbf{x}^*-\mathbf{b}=\mathbf{0}$. We see that the minimization of a quadratic function is equivalent to solving a linear equation system $\mathbf{A}\mathbf{x}=\mathbf{b}$ for a given symmetric positive definite matrix $\mathbf{A}$ and vector $\mathbf{b}$. The CG method considered below can therefore be used for solving both problems. Later we will relax the condition on $f(\mathbf{x})$ and consider using CG to minimize non-quadratic functions.
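This equivalence is easy to verify numerically. The following sketch uses a small made-up symmetric positive definite system: the solution of $\mathbf{A}\mathbf{x}=\mathbf{b}$ is a point where the gradient of the quadratic vanishes.

```python
import numpy as np

# Made-up symmetric positive definite system for illustration.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

x_star = np.linalg.solve(A, b)   # solve the linear system A x = b directly
grad = A @ x_star - b            # gradient of f(x) = x'Ax/2 - b'x at x_star
print(np.linalg.norm(grad))      # essentially zero: x_star minimizes f
```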

Before discussing the CG algorithm, we first consider the concept of conjugate
vectors, which is the key to the CG method. A set of vectors
$\{\mathbf{d}_0,\dots,\mathbf{d}_{N-1}\}$ satisfying

$$\mathbf{d}_i^T\mathbf{A}\,\mathbf{d}_j=0,\qquad i\ne j$$

is said to be mutually A-conjugate (or A-orthogonal). As $\mathbf{A}$ is positive definite, such vectors are linearly independent and can serve as a basis of the $N$-dimensional space.
The basic strategy of the CG method is for the iteration to follow a search
path composed of segments, each along one of the mutually conjugate
search directions. After the $n$th iteration, which traversed along the search
direction $\mathbf{d}_n$ to arrive at $\mathbf{x}_{n+1}$, the search direction
in the next iteration will be A-conjugate to the previous one
$\mathbf{d}_n$. By doing so, the total error at the initial guess

$$\mathbf{e}_0=\mathbf{x}_0-\mathbf{x}^*=\sum_{i=0}^{N-1}\delta_i\mathbf{d}_i$$

will be eliminated one component at a time in each of the $N$ iterations, so that after $N$ such iterations the error at $\mathbf{x}_N$ becomes zero, $\mathbf{e}_N=\mathbf{0}$, and $\mathbf{x}_N$ becomes the solution.

Now we consider specifically the CG algorithm with the following general
iteration:

$$\mathbf{x}_{n+1}=\mathbf{x}_n+\alpha_n\mathbf{d}_n$$

where

$$\alpha_n=-\frac{\mathbf{g}_n^T\mathbf{d}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}$$

is the optimal step size derived previously. Subtracting $\mathbf{x}^*$ from both sides, we get

$$\mathbf{e}_{n+1}=\mathbf{e}_n+\alpha_n\mathbf{d}_n$$

where $\mathbf{g}_n$ is the gradient of $f(\mathbf{x})$ at $\mathbf{x}_n$, which can be found to be

$$\mathbf{g}_n=\mathbf{A}\mathbf{x}_n-\mathbf{b}=\mathbf{A}(\mathbf{e}_n+\mathbf{x}^*)-\mathbf{b}=\mathbf{A}\mathbf{e}_n$$

(Note that $\mathbf{A}\mathbf{x}^*=\mathbf{b}$.) Substituting this into the expression for the step size we get

$$\alpha_n=-\frac{\mathbf{d}_n^T\mathbf{A}\mathbf{e}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}$$

Pre-multiplying both sides of the error recursion by $\mathbf{d}_n^T\mathbf{A}$ we get

$$\mathbf{d}_n^T\mathbf{A}\mathbf{e}_{n+1}=\mathbf{d}_n^T\mathbf{A}\mathbf{e}_n+\alpha_n\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n=\mathbf{d}_n^T\mathbf{A}\mathbf{e}_n-\mathbf{d}_n^T\mathbf{A}\mathbf{e}_n=0$$

We see that after taking the $n$th step along the direction $\mathbf{d}_n$ to arrive at $\mathbf{x}_{n+1}$, the remaining error $\mathbf{e}_{n+1}$ is A-conjugate to the previous search direction $\mathbf{d}_n$. To reduce the error as much as possible, the next search direction $\mathbf{d}_{n+1}$ should be along one of the directions that span the remaining error $\mathbf{e}_{n+1}$, i.e., A-conjugate to the previous search direction $\mathbf{d}_n$, instead of merely orthogonal to it as in the case of the gradient descent method, so that the component in $\mathbf{e}_{n+1}$ along $\mathbf{d}_{n+1}$ can be eliminated. Carrying out such a search in $N$ iterations, each eliminating one of the $N$ components of the initial error $\mathbf{e}_0$, we get $\mathbf{e}_N=\mathbf{0}$, i.e., the error is completely eliminated.

To see this, we premultiply both sides of the initial error equation

$$\mathbf{e}_0=\sum_{i=0}^{N-1}\delta_i\mathbf{d}_i$$

by $\mathbf{d}_n^T\mathbf{A}$ and get

$$\mathbf{d}_n^T\mathbf{A}\mathbf{e}_0=\sum_{i=0}^{N-1}\delta_i\mathbf{d}_n^T\mathbf{A}\mathbf{d}_i=\delta_n\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n$$

Solving for $\delta_n$ we get

$$\delta_n=\frac{\mathbf{d}_n^T\mathbf{A}\mathbf{e}_0}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}=\frac{\mathbf{d}_n^T\mathbf{A}\mathbf{e}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}$$

Here we have used the fact that $\mathbf{e}_n=\mathbf{e}_0+\sum_{i=0}^{n-1}\alpha_i\mathbf{d}_i$ and that the $n$th search direction $\mathbf{d}_n$ is A-orthogonal to all previous directions, so that $\mathbf{d}_n^T\mathbf{A}\mathbf{e}_n=\mathbf{d}_n^T\mathbf{A}\mathbf{e}_0$. Comparing this result with the expression for $\alpha_n$ obtained above, we see that $\delta_n=-\alpha_n$, and the expression for $\mathbf{e}_n$ can now be written as

$$\mathbf{e}_n=\mathbf{e}_0+\sum_{i=0}^{n-1}\alpha_i\mathbf{d}_i=\sum_{i=0}^{N-1}\delta_i\mathbf{d}_i-\sum_{i=0}^{n-1}\delta_i\mathbf{d}_i=\sum_{i=n}^{N-1}\delta_i\mathbf{d}_i$$

We see that as $n$ increases from $0$ to $N$, the error is gradually eliminated, from $\mathbf{e}_0=\sum_{i=0}^{N-1}\delta_i\mathbf{d}_i$ to $\mathbf{e}_N=\mathbf{0}$, one component at a time. After $N$ steps, the error is reduced to $\mathbf{e}_N=\mathbf{0}$ and the true solution is obtained: $\mathbf{x}_N=\mathbf{x}^*$.

The specific A-conjugate search directions
$\{\mathbf{d}_0,\dots,\mathbf{d}_{N-1}\}$ satisfying
$\mathbf{d}_i^T\mathbf{A}\mathbf{d}_j=0$ for any $i\ne j$ can be constructed
by the Gram-Schmidt process based
on any set of $N$ independent vectors
$\{\mathbf{u}_0,\dots,\mathbf{u}_{N-1}\}$,
which can be used as a basis to span the space. Specifically, starting from
$\mathbf{d}_0=\mathbf{u}_0$, we construct each of the subsequent $\mathbf{d}_n$ for
$n=1,\dots,N-1$ based on $\mathbf{u}_n$, with all of its components along the previous
directions $\mathbf{d}_i$ ($i<n$) removed:

$$\mathbf{d}_n=\mathbf{u}_n+\sum_{i=0}^{n-1}\beta_{ni}\mathbf{d}_i$$

where $-\beta_{ni}\mathbf{d}_i$ ($i<n$) is the A-projection of $\mathbf{u}_n$ onto $\mathbf{d}_i$. The coefficient $\beta_{nj}$ can be found by pre-multiplying both sides by $\mathbf{d}_j^T\mathbf{A}$ with $j<n$ to get:

$$0=\mathbf{d}_j^T\mathbf{A}\mathbf{d}_n=\mathbf{d}_j^T\mathbf{A}\mathbf{u}_n+\sum_{i=0}^{n-1}\beta_{ni}\mathbf{d}_j^T\mathbf{A}\mathbf{d}_i=\mathbf{d}_j^T\mathbf{A}\mathbf{u}_n+\beta_{nj}\mathbf{d}_j^T\mathbf{A}\mathbf{d}_j$$

Solving for $\beta_{nj}$ we get:

$$\beta_{nj}=-\frac{\mathbf{d}_j^T\mathbf{A}\mathbf{u}_n}{\mathbf{d}_j^T\mathbf{A}\mathbf{d}_j}$$

and the A-projection of $\mathbf{u}_n$ onto $\mathbf{d}_j$ is

$$-\beta_{nj}\mathbf{d}_j=\frac{\mathbf{d}_j^T\mathbf{A}\mathbf{u}_n}{\mathbf{d}_j^T\mathbf{A}\mathbf{d}_j}\,\mathbf{d}_j$$

Now the $n$th direction can be expressed as:

$$\mathbf{d}_n=\mathbf{u}_n-\sum_{i=0}^{n-1}\frac{\mathbf{d}_i^T\mathbf{A}\mathbf{u}_n}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i}\,\mathbf{d}_i$$
In particular, we could construct the A-conjugate search directions
$\mathbf{d}_n$ ($n=0,\dots,N-1$) based on the gradient directions,
$\mathbf{u}_n=-\mathbf{g}_n$. The benefit of choosing this particular set of
directions is that the computational cost is much reduced, as we will see
later. Premultiplying $\mathbf{d}_m^T\mathbf{A}$ (with $m<n$) on both sides of
$\mathbf{e}_n=\sum_{i=n}^{N-1}\delta_i\mathbf{d}_i$
we get:

$$\mathbf{d}_m^T\mathbf{g}_n=\mathbf{d}_m^T\mathbf{A}\mathbf{e}_n=\sum_{i=n}^{N-1}\delta_i\mathbf{d}_m^T\mathbf{A}\mathbf{d}_i=0$$

We see that $\mathbf{g}_n^T\mathbf{d}_m=0$ for all $m<n$, i.e., $\mathbf{g}_n$ is orthogonal to all previously traveled directions $\mathbf{d}_m$. Now the Gram-Schmidt process can be written as

$$\mathbf{d}_n=-\mathbf{g}_n+\sum_{i=0}^{n-1}\beta_{ni}\mathbf{d}_i$$

where

$$\beta_{ni}=\frac{\mathbf{d}_i^T\mathbf{A}\mathbf{g}_n}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i}$$
Premultiplying $\mathbf{g}_n^T$ on both sides we get

$$\mathbf{g}_n^T\mathbf{d}_n=-\mathbf{g}_n^T\mathbf{g}_n+\sum_{i=0}^{n-1}\beta_{ni}\,\mathbf{g}_n^T\mathbf{d}_i=-\mathbf{g}_n^T\mathbf{g}_n$$

Note that all terms in the summation are zero, as $\mathbf{g}_n^T\mathbf{d}_i=0$ for all $i<n$. Substituting this into the expression for the step size $\alpha_n=-\mathbf{g}_n^T\mathbf{d}_n/(\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n)$, we get

$$\alpha_n=\frac{\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}$$
Next we consider

$$\mathbf{g}_{m+1}=\mathbf{A}\mathbf{e}_{m+1}=\mathbf{A}(\mathbf{e}_m+\alpha_m\mathbf{d}_m)=\mathbf{g}_m+\alpha_m\mathbf{A}\mathbf{d}_m$$

Premultiplying $\mathbf{g}_n^T$ on both sides we get

$$\mathbf{g}_n^T\mathbf{g}_{m+1}=\mathbf{g}_n^T\mathbf{g}_m+\alpha_m\mathbf{g}_n^T\mathbf{A}\mathbf{d}_m$$

Solving for $\mathbf{g}_n^T\mathbf{A}\mathbf{d}_m$ we get

$$\mathbf{g}_n^T\mathbf{A}\mathbf{d}_m=\frac{1}{\alpha_m}\left(\mathbf{g}_n^T\mathbf{g}_{m+1}-\mathbf{g}_n^T\mathbf{g}_m\right)$$
Finally the $m$th coefficient in the Gram-Schmidt process for $\mathbf{d}_n$ ($m<n$) can be written as

$$\beta_{nm}=\frac{\mathbf{d}_m^T\mathbf{A}\mathbf{g}_n}{\mathbf{d}_m^T\mathbf{A}\mathbf{d}_m}=\frac{\mathbf{g}_n^T\mathbf{g}_{m+1}-\mathbf{g}_n^T\mathbf{g}_m}{\alpha_m\,\mathbf{d}_m^T\mathbf{A}\mathbf{d}_m}$$

Note that $\mathbf{g}_n^T\mathbf{g}_m=0$ except when $m=n$ (each $\mathbf{g}_m$ is a linear combination of $\mathbf{d}_0,\dots,\mathbf{d}_m$, all orthogonal to $\mathbf{g}_n$ when $m<n$), i.e., there is only one non-zero term, $m=n-1$, left in the summation of the update formula for $\mathbf{d}_n$. This is the reason why we choose $\mathbf{u}_n=-\mathbf{g}_n$. We can now drop the second subscript and write $\beta_n=\beta_{n,n-1}$. Substituting the step size $\alpha_{n-1}$ obtained above into the expression for $\beta_n$ we get

$$\beta_n=\frac{\mathbf{g}_n^T\mathbf{g}_n}{\alpha_{n-1}\,\mathbf{d}_{n-1}^T\mathbf{A}\mathbf{d}_{n-1}}=\frac{\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{g}_{n-1}^T\mathbf{g}_{n-1}}$$
In summary, here is the conjugate gradient algorithm (note $\mathbf{g}_n=\mathbf{A}\mathbf{x}_n-\mathbf{b}$):

- Set $n=0$ and initialize the search direction (same as gradient descent):
$$\mathbf{d}_0=-\mathbf{g}_0=\mathbf{b}-\mathbf{A}\mathbf{x}_0$$

- Termination check: if the error
$\|\mathbf{g}_n\|$ is smaller than a
preset threshold, stop. Otherwise, find the step size:
$$\alpha_n=\frac{\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{d}_n^T\mathbf{A}\mathbf{d}_n}$$

- Step forward:
$$\mathbf{x}_{n+1}=\mathbf{x}_n+\alpha_n\mathbf{d}_n$$

- Update the gradient:
$$\mathbf{g}_{n+1}=\mathbf{g}_n+\alpha_n\mathbf{A}\mathbf{d}_n$$

- Find the coefficient for the Gram-Schmidt process:
$$\beta_{n+1}=\frac{\mathbf{g}_{n+1}^T\mathbf{g}_{n+1}}{\mathbf{g}_n^T\mathbf{g}_n}$$

- Update the search direction:
$$\mathbf{d}_{n+1}=-\mathbf{g}_{n+1}+\beta_{n+1}\mathbf{d}_n$$

Set $n=n+1$ and go back to step 2.
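The steps above can be sketched in Python. The small symmetric positive definite system at the end is made-up test data, and `conjugate_gradient` is simply a name chosen here for illustration:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Minimize f(x) = x'Ax/2 - b'x, i.e., solve Ax = b, for SPD A."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b                # gradient g_0 = A x_0 - b
    d = -g                       # step 1: initial direction (gradient descent)
    for n in range(len(b)):      # at most N iterations in exact arithmetic
        if np.linalg.norm(g) < tol:        # step 2: termination check
            break
        Ad = A @ d
        alpha = (g @ g) / (d @ Ad)         # step 2: optimal step size
        x = x + alpha * d                  # step 3: step forward
        g_new = g + alpha * Ad             # step 4: update gradient
        beta = (g_new @ g_new) / (g @ g)   # step 5: Gram-Schmidt coefficient
        d = -g_new + beta * d              # step 6: update search direction
        g = g_new
    return x

# made-up 2-by-2 SPD example
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, np.zeros(2))
```

Note that the matrix $\mathbf{A}$ enters the loop only through the product $\mathbf{A}\mathbf{d}_n$, computed once per iteration.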

The algorithm above assumes the objective function $f(\mathbf{x})$ to be quadratic with known $\mathbf{A}$. However, when $f(\mathbf{x})$ is not quadratic and $\mathbf{A}$ is not available, the algorithm can be modified so that it does not depend on $\mathbf{A}$, by the following two methods.

- In step 2 above, the optimal step size is calculated based on $\mathbf{A}$. But it can also be found by line minimization based on any suitable algorithm for 1-D optimization. In step 4, the gradient is calculated based on $\mathbf{A}$, but it can also be computed directly at $\mathbf{x}_{n+1}$, once it is made available in step 3.
- If the Hessian matrix $\mathbf{H}(\mathbf{x}_n)$ of the objective function can be made available, it can be used to approximate $\mathbf{A}$ so that the optimal step size can still be obtained without the iterative 1-D line optimization. Although this method may result in more iterations, it avoids the 1-D optimization, which may be time consuming.

In the figure below, the conjugate gradient method is compared with the gradient descent method for the case of $N=2$. We see that the first search direction $\mathbf{d}_0$ is the same for both methods. However, the next direction $\mathbf{d}_1$ is A-orthogonal to $\mathbf{d}_0$, the same as the remaining error $\mathbf{e}_1$, and therefore different from the next search direction in the gradient descent method. The conjugate gradient method finds the solution in $N=2$ steps, while the gradient descent method has to go through many more steps, all orthogonal to each other, before it finds the solution.

Now we relax the requirement that the function $f(\mathbf{x})$ be quadratic, but we
still assume that it can be approximated as a quadratic function near its minimum.
In this case, the update of the search direction, or specifically the coefficient
$\beta_n$, which is derived based on the assumption that $f(\mathbf{x})$ is
quadratic, may no longer be valid. Some alternative formulas for $\beta_n$
can be used:

$$\beta_n^{FR}=\frac{\mathbf{g}_n^T\mathbf{g}_n}{\mathbf{g}_{n-1}^T\mathbf{g}_{n-1}}\qquad\text{(Fletcher-Reeves)}$$

$$\beta_n^{PR}=\frac{\mathbf{g}_n^T(\mathbf{g}_n-\mathbf{g}_{n-1})}{\mathbf{g}_{n-1}^T\mathbf{g}_{n-1}}\qquad\text{(Polak-Ribi\`ere)}$$

$$\beta_n^{HS}=\frac{\mathbf{g}_n^T(\mathbf{g}_n-\mathbf{g}_{n-1})}{\mathbf{d}_{n-1}^T(\mathbf{g}_n-\mathbf{g}_{n-1})}\qquad\text{(Hestenes-Stiefel)}$$

These expressions are identical to $\beta_n$ when $f(\mathbf{x})$ is indeed quadratic, as the gradients are then mutually orthogonal. Note that it is now possible for $\beta_n<0$. If this happens to be the case, we will use $\beta_n=0$, i.e., the next search direction is simply $-\mathbf{g}_n$, the same as in the gradient descent method.
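A minimal sketch of this nonlinear variant, using the Polak-Ribière coefficient with the negative-$\beta$ reset and a basic backtracking (Armijo) line search in place of the quadratic step size. The function name and the test function are made up for illustration:

```python
import numpy as np

def nonlinear_cg(f, grad, x0, tol=1e-8, max_iter=200):
    """Nonlinear CG: Polak-Ribiere beta with a backtracking line search."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # backtracking line search replaces the quadratic step size,
        # since A is unknown for a non-quadratic f
        alpha, fx = 1.0, f(x)
        while f(x + alpha * d) > fx + 1e-4 * alpha * (g @ d) and alpha > 1e-12:
            alpha *= 0.5
        x = x + alpha * d
        g_new = grad(x)
        # Polak-Ribiere coefficient; reset to gradient descent if negative
        beta = max(0.0, (g_new @ (g_new - g)) / (g @ g))
        d = -g_new + beta * d
        if g_new @ d >= 0:     # safeguard: keep d a descent direction
            d = -g_new
        g = g_new
    return x

# made-up smooth non-quadratic test function with minimum at (1, -2)
f = lambda x: (x[0] - 1)**4 + (x[1] + 2)**2
grad = lambda x: np.array([4 * (x[0] - 1)**3, 2 * (x[1] + 2)])
x_min = nonlinear_cg(f, grad, np.array([3.0, 0.0]))
```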

**Example** To compare the conjugate gradient method and the gradient descent method,
consider a very simple 2-D quadratic function

$$f(x,y)=\frac{1}{2}(3x^2+2xy+2y^2)=\frac{1}{2}\mathbf{x}^T\begin{pmatrix}3&1\\1&2\end{pmatrix}\mathbf{x}$$

The performance of the gradient descent method depends significantly on the initial guess. For the specific initial guess of $\mathbf{x}_0=(1.5,\,-0.75)^T$, the iteration gets into a zigzag pattern and the convergence is very slow, as shown in the table below.

| $n$ | $\mathbf{x}_n$ | $f(\mathbf{x}_n)$ |
|---|---|---|
| 0 | 1.500000, -0.750000 | 2.812500e+00 |
| 1 | 0.250000, -0.750000 | 4.687500e-01 |
| 2 | 0.250000, -0.125000 | 7.812500e-02 |
| 3 | 0.041667, -0.125000 | 1.302083e-02 |
| 4 | 0.041667, -0.020833 | 2.170139e-03 |
| 5 | 0.006944, -0.020833 | 3.616898e-04 |
| 6 | 0.006944, -0.003472 | 6.028164e-05 |
| 7 | 0.001157, -0.003472 | 1.004694e-05 |
| 8 | 0.001157, -0.000579 | 1.674490e-06 |
| 9 | 0.000193, -0.000579 | 2.790816e-07 |
| 10 | 0.000193, -0.000096 | 4.651361e-08 |
| 11 | 0.000032, -0.000096 | 7.752268e-09 |
| 12 | 0.000032, -0.000016 | 1.292045e-09 |
| 13 | 0.000005, -0.000016 | 2.153408e-10 |

However, as expected, the conjugate gradient method takes exactly $N=2$ steps from any initial guess to arrive at the solution, as shown below.

| $n$ | $\mathbf{x}_n$ | $f(\mathbf{x}_n)$ |
|---|---|---|
| 0 | 1.500000, -0.750000 | 2.812500e+00 |
| 1 | 0.250000, -0.750000 | 4.687500e-01 |
| 2 | 0.000000, -0.000000 | 1.155558e-33 |
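The two tables can be reproduced with a short script. The quadratic $f(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}$ with $\mathbf{A}=\begin{pmatrix}3&1\\1&2\end{pmatrix}$ used below is inferred from the tabulated values and should be treated as an assumption:

```python
import numpy as np

# Quadratic inferred from the tables above (assumption): f(x) = x'Ax/2.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x

# gradient descent with exact line search: slow zigzag convergence
x = np.array([1.5, -0.75])
for n in range(6):
    print(n, x, f(x))
    g = A @ x
    x = x - (g @ g) / (g @ A @ g) * g

# conjugate gradient: exact solution in N = 2 steps
x_cg = np.array([1.5, -0.75])
g = A @ x_cg
d = -g
for n in range(2):
    Ad = A @ d
    alpha = (g @ g) / (d @ Ad)
    x_cg = x_cg + alpha * d
    g_new = g + alpha * Ad
    d = -g_new + (g_new @ g_new) / (g @ g) * d
    g = g_new
print(x_cg)   # numerically (0, 0) after two steps
```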

For an example of dimension $N=3$, starting from an initial guess of $\mathbf{x}_0=(1,\,2,\,3)^T$, it takes the gradient descent method 36 iterations to approach the solution. From the same initial guess, it takes the conjugate gradient method only $N=3$ iterations to converge to the solution:

| $n$ | $\mathbf{x}_n$ | $f(\mathbf{x}_n)$ |
|---|---|---|
| 0 | 1.000000, 2.000000, 3.000000 | 4.500000e+01 |
| 1 | -0.734716, -0.106441, 1.265284 | 2.809225e+00 |
| 2 | 0.123437, -0.209498, 0.136074 | 3.584736e-02 |
| 3 | -0.000000, 0.000000, 0.000000 | 3.949119e-31 |

**Conjugate gradient method used for solving linear equation systems:**

As discussed before, if $\mathbf{x}^*$ is the solution that minimizes the quadratic function $f(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}-\mathbf{b}^T\mathbf{x}$, with $\mathbf{A}$ being symmetric and positive definite, it also satisfies $\mathbf{A}\mathbf{x}^*=\mathbf{b}$. In other words, the optimization problem is equivalent to the problem of solving the linear system $\mathbf{A}\mathbf{x}=\mathbf{b}$; both can be solved by the conjugate gradient method.

Now consider solving the linear system $\mathbf{A}\mathbf{x}=\mathbf{b}$
with $\mathbf{A}$ an $N\times N$ symmetric positive definite matrix. Let $\{\mathbf{d}_0,\dots,\mathbf{d}_{N-1}\}$ be a set of
A-orthogonal vectors satisfying $\mathbf{d}_i^T\mathbf{A}\mathbf{d}_j=0$
for $i\ne j$.
The solution of the equation can be
represented by these vectors as

$$\mathbf{x}=\sum_{i=0}^{N-1}\delta_i\mathbf{d}_i$$

Now we have

$$\mathbf{b}=\mathbf{A}\mathbf{x}=\sum_{i=0}^{N-1}\delta_i\mathbf{A}\mathbf{d}_i$$

Premultiplying $\mathbf{d}_j^T$ on both sides we get

$$\mathbf{d}_j^T\mathbf{b}=\sum_{i=0}^{N-1}\delta_i\mathbf{d}_j^T\mathbf{A}\mathbf{d}_i=\delta_j\mathbf{d}_j^T\mathbf{A}\mathbf{d}_j$$

Solving for $\delta_j$ we get

$$\delta_j=\frac{\mathbf{d}_j^T\mathbf{b}}{\mathbf{d}_j^T\mathbf{A}\mathbf{d}_j}$$

Substituting this back into the expression for $\mathbf{x}$ we get the solution of the equation:

$$\mathbf{x}=\sum_{i=0}^{N-1}\frac{\mathbf{d}_i^T\mathbf{b}}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i}\,\mathbf{d}_i$$

Also note that as $\mathbf{b}=\mathbf{A}\mathbf{x}$, the $i$th term of the summation above is simply the A-projection of $\mathbf{x}$ onto the $i$th direction $\mathbf{d}_i$:

$$\frac{\mathbf{d}_i^T\mathbf{b}}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i}\,\mathbf{d}_i=\frac{\mathbf{d}_i^T\mathbf{A}\mathbf{x}}{\mathbf{d}_i^T\mathbf{A}\mathbf{d}_i}\,\mathbf{d}_i$$
One application of the conjugate gradient method is to solve the normal equation
to find the least-squares solution of an over-constrained equation system
$\mathbf{A}\mathbf{x}=\mathbf{b}$, where the coefficient matrix $\mathbf{A}$ is $M$ by $N$
with $M>N$ and full rank $N$, i.e., its $N$ columns are linearly independent. As discussed previously, the normal equation of this
system is

$$\mathbf{A}^T\mathbf{A}\,\mathbf{x}=\mathbf{A}^T\mathbf{b}$$

Here $\mathbf{A}^T\mathbf{A}$ is an $N$ by $N$ symmetric, positive definite matrix, so this normal equation can be solved by the conjugate gradient method.
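This use of CG can be sketched as follows; the over-constrained system below is made-up random test data, and the CG loop is written inline so the example is self-contained:

```python
import numpy as np

# Made-up over-constrained system: M = 20 equations, N = 4 unknowns.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 4))
b = rng.normal(size=20)

# Normal equation: (A'A) x = A'b, with A'A symmetric positive definite
# since A has full rank N.
G = A.T @ A
v = A.T @ b

# conjugate gradient on the N-by-N normal equation
x = np.zeros(4)
g = G @ x - v
d = -g
for _ in range(4):     # converges in at most N = 4 steps in exact arithmetic
    if np.linalg.norm(g) < 1e-12:
        break
    Gd = G @ d
    alpha = (g @ g) / (d @ Gd)
    x = x + alpha * d
    g_new = g + alpha * Gd
    d = -g_new + (g_new @ g_new) / (g @ g) * d
    g = g_new
```

The result agrees with the least-squares solution computed by standard dense methods such as `numpy.linalg.lstsq`.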