First consider nonlinearly mapping all data points $\mathbf{x}_n$ ($n = 1, \dots, N$) to $\phi(\mathbf{x}_n)$ in a higher dimensional feature space $F$, where the covariance matrix can be estimated as

$$\mathbf{\Sigma} = \frac{1}{N} \sum_{n=1}^N \phi(\mathbf{x}_n)\, \phi(\mathbf{x}_n)^T$$

Plugging this into the eigenequation of the covariance matrix

$$\mathbf{\Sigma} \mathbf{v} = \lambda \mathbf{v}$$

we get

$$\frac{1}{N} \sum_{n=1}^N \phi(\mathbf{x}_n) \left( \phi(\mathbf{x}_n)^T \mathbf{v} \right) = \lambda \mathbf{v}$$

We see that the eigenvector $\mathbf{v}$ is a linear combination of the mapped data points:

$$\mathbf{v} = \sum_{n=1}^N a_n\, \phi(\mathbf{x}_n)$$

where

$$a_n = \frac{\phi(\mathbf{x}_n)^T \mathbf{v}}{N \lambda}$$

Left multiplying $\phi(\mathbf{x}_m)^T$ to both sides of the equation above, we get

$$N \lambda\, a_m = \phi(\mathbf{x}_m)^T \mathbf{v} = \sum_{n=1}^N a_n\, \phi(\mathbf{x}_m)^T \phi(\mathbf{x}_n) = \sum_{n=1}^N a_n\, k(\mathbf{x}_m, \mathbf{x}_n)$$

where

$$k(\mathbf{x}_m, \mathbf{x}_n) = \phi(\mathbf{x}_m)^T \phi(\mathbf{x}_n)$$

is the kernel representing an inner product of two vectors in space $F$. If we consider $m = 1, \dots, N$, the above scalar equation becomes the $m$-th component of the following vector equation:

$$\mathbf{K} \mathbf{a} = N \lambda\, \mathbf{a}$$

where

$$\mathbf{K} = [K_{mn}]_{N \times N}, \qquad K_{mn} = k(\mathbf{x}_m, \mathbf{x}_n)$$

is an $N \times N$ matrix of kernel elements ($m, n = 1, \dots, N$), and $\mathbf{a}^{(i)} = [a_1^{(i)}, \dots, a_N^{(i)}]^T$ ($i = 1, \dots, N$) are the eigenvectors of $\mathbf{K}$, which can be obtained by solving the eigenvalue problem of $\mathbf{K}$. As the eigenvalues of $\mathbf{K}$ are proportional to the eigenvalues of the covariance matrix $\mathbf{\Sigma}$ in the feature space, feature selection in regular PCA can be carried out by keeping only a small number of components corresponding to the largest eigenvalues without losing much information.
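The proportionality between the eigenvalues of $\mathbf{K}$ and those of the covariance matrix can be checked numerically for the special case of a linear kernel, i.e., $\phi(\mathbf{x}) = \mathbf{x}$, where $\mathbf{K} = \mathbf{X}\mathbf{X}^T$ and the covariance is $\mathbf{X}^T\mathbf{X}/N$. A minimal sketch with NumPy (the data here are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.standard_normal((N, d))     # N data points in R^d
X -= X.mean(axis=0)                 # de-mean so covariance = correlation

K = X @ X.T                         # kernel (Gram) matrix for phi(x) = x
Sigma = X.T @ X / N                 # covariance matrix in the feature space

eig_K = np.linalg.eigvalsh(K)[-d:]        # d largest eigenvalues of K (ascending)
eig_Sigma = np.linalg.eigvalsh(Sigma)     # all d eigenvalues of Sigma (ascending)

# The nonzero eigenvalues of K are N times the covariance eigenvalues
assert np.allclose(eig_K, N * eig_Sigma)
```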

Any new data point $\mathbf{x}$ can now be mapped to $\phi(\mathbf{x})$ in the
high-dimensional space $F$, where its $i$-th PCA component can be
obtained as its projection on the $i$-th eigenvector $\mathbf{v}_i$:

$$y_i = \mathbf{v}_i^T \phi(\mathbf{x}) = \sum_{n=1}^N a_n^{(i)}\, \phi(\mathbf{x}_n)^T \phi(\mathbf{x}) = \sum_{n=1}^N a_n^{(i)}\, k(\mathbf{x}_n, \mathbf{x})$$

These are the steps for the implementation of the kernel PCA algorithm:

- Choose a kernel $k(\mathbf{x}, \mathbf{y})$ in place of the explicit mapping $\phi$.
- Obtain $\mathbf{K}$ based on training data $\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$.
- Solve the eigenvalue problem of $\mathbf{K}$ to get $\lambda_i$ and $\mathbf{a}^{(i)}$ ($i = 1, \dots, N$).
- For each given data point $\mathbf{x}$, obtain its principal components in the feature space: $y_i = \sum_{n=1}^N a_n^{(i)} k(\mathbf{x}_n, \mathbf{x})$.
- Do whatever processing (e.g., feature selection, classification) in the feature space.
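The steps above can be sketched in NumPy. The Gaussian kernel, its width `sigma`, and the eigenvector normalization are illustrative choices, not fixed by the derivation:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def kernel_pca(X, num_components, kernel=gaussian_kernel):
    """Return the top principal components of each training point in feature space."""
    N = X.shape[0]
    # Step 2: kernel matrix K from the training data
    K = np.array([[kernel(xm, xn) for xn in X] for xm in X])
    # Step 3: solve the eigenvalue problem K a = (N lambda) a
    eigvals, eigvecs = np.linalg.eigh(K)              # ascending order
    idx = np.argsort(eigvals)[::-1][:num_components]  # largest eigenvalues first
    a = eigvecs[:, idx]                               # columns are a^(i), unit norm
    # Scale a^(i) so the feature-space eigenvectors v_i have unit length:
    # v_i^T v_i = a^(i)T K a^(i) = eigval_i (for unit-norm a^(i))
    a = a / np.sqrt(eigvals[idx])
    # Step 4: i-th component of x_m is y_i = sum_n a_n^(i) k(x_n, x_m) = (K a)_mi
    return K @ a

X = np.random.default_rng(1).standard_normal((20, 5))
Y = kernel_pca(X, num_components=2)
print(Y.shape)    # (20, 2)
```

Centering of $\mathbf{K}$ (discussed below) is omitted here to keep the sketch close to the listed steps.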

Some commonly used kernels are:

- Gaussian: $k(\mathbf{x}, \mathbf{y}) = \exp\left( -\|\mathbf{x} - \mathbf{y}\|^2 / 2\sigma^2 \right)$
- Sigmoid: $k(\mathbf{x}, \mathbf{y}) = \tanh\left( \kappa\, \mathbf{x}^T \mathbf{y} + \theta \right)$
- Polynomial: $k(\mathbf{x}, \mathbf{y}) = \left( \mathbf{x}^T \mathbf{y} + c \right)^d$
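These kernels can be written directly as functions; the parameter names (`sigma`, `kappa`, `theta`, `c`, `d`) are conventional choices:

```python
import numpy as np

def gaussian(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid(x, y, kappa=1.0, theta=0.0):
    # k(x, y) = tanh(kappa * x.y + theta)
    return np.tanh(kappa * np.dot(x, y) + theta)

def polynomial(x, y, c=1.0, d=2):
    # k(x, y) = (x.y + c)^d
    return (np.dot(x, y) + c) ** d

x = np.array([1.0, 0.0])
y = np.array([1.0, 0.0])
print(gaussian(x, y))      # 1.0 for identical points
print(polynomial(x, y))    # (1 + 1)^2 = 4.0
```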

The above discussion is based on the assumption that the data points have
zero mean, and therefore the covariance of two data points is the same as
their correlation. While the data points can easily be de-meaned in the
original space by subtracting the mean vector from all points, it has to be
done differently in the feature space. The de-meaned data points in the
feature space are:

$$\tilde{\phi}(\mathbf{x}_n) = \phi(\mathbf{x}_n) - \frac{1}{N} \sum_{m=1}^N \phi(\mathbf{x}_m)$$

But this cannot actually be carried out, as the mapping $\phi$ is not explicit and is never available. However, we can still obtain the kernel matrix $\tilde{\mathbf{K}}$ for the zero-mean data points $\tilde{\phi}(\mathbf{x}_n)$ in terms of $\mathbf{K}$ for $\phi(\mathbf{x}_n)$:

$$\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_N \mathbf{K} - \mathbf{K} \mathbf{1}_N + \mathbf{1}_N \mathbf{K} \mathbf{1}_N$$

where $\mathbf{1}_N$ is the $N \times N$ matrix with all elements equal to $1/N$.
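The centered kernel matrix requires only $\mathbf{K}$ itself. A minimal sketch, checked against the linear-kernel case where centering $\mathbf{K}$ must agree with de-meaning the data first:

```python
import numpy as np

def center_kernel(K):
    """K_tilde = K - 1_N K - K 1_N + 1_N K 1_N, with 1_N a matrix of 1/N entries."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N

# Check on a linear kernel: centering K equals de-meaning the data first
rng = np.random.default_rng(2)
X = rng.standard_normal((10, 4))
K = X @ X.T
Xc = X - X.mean(axis=0)
assert np.allclose(center_kernel(K), Xc @ Xc.T)
```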

In summary, the mapping $\phi$ in the kernel PCA algorithm is never explicitly specified, nor is the dimensionality of the feature space $F$. Similarly, the covariance matrix $\mathbf{\Sigma}$ and its eigenvectors $\mathbf{v}_i$ are only mentioned in the above derivation; they never need to be computed in the implementation. The potentially high dimensionality of $F$ costs no extra computation, as only the kernel $k(\mathbf{x}, \mathbf{y})$ is needed in the implementation.

Kernel PCA has all the advantages of regular PCA, as well as the implicit nonlinear mapping to a feature space $F$ in which the features representing the structure in the data may be better extracted. For example, data classification may be carried out much more easily in $F$ due to linear separability.