Bias-Variance Tradeoff

The mean squared error (MSE), defined below, is a common way to measure the error of an estimator of some parameter or, more generally, the performance of a supervised learning algorithm (for either regression or classification). Let $\hat{\theta}$ be an estimator of the parameter $\theta$ of a data model. Then the mean squared error of this estimator is

\begin{align*}
\MSE(\hat{\theta}) &= E[(\hat{\theta}-\theta)^2] \\
&= E[(\hat{\theta}-E(\hat{\theta})+E(\hat{\theta})-\theta)^2] \\
&= E[(\hat{\theta}-E(\hat{\theta}))^2]
   +2E[(\hat{\theta}-E(\hat{\theta}))(E(\hat{\theta})-\theta)]
   +E[(E(\hat{\theta})-\theta)^2] \\
&= E[(\hat{\theta}-E(\hat{\theta}))^2]+E[(E(\hat{\theta})-\theta)^2] \\
&= \Var(\hat{\theta})+\Bias^2(\hat{\theta}) \tag{217}
\end{align*}

Note that the middle (cross) term disappears because $E(\hat{\theta})-\theta$ is a constant that factors out of the expectation, and

\begin{equation*}
E[\hat{\theta}-E(\hat{\theta})] = E(\hat{\theta})-E(\hat{\theta}) = 0 \tag{218}
\end{equation*}
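As a quick numerical sanity check of the decomposition in (217), the following Python sketch uses an illustrative setup that is not from the text: the mean $\theta$ of a Gaussian is estimated with the deliberately biased shrinkage estimator $\hat{\theta}=\sum_i x_i/(n+1)$, and the Monte Carlo estimates of the MSE and of $\Var(\hat{\theta})+\Bias^2(\hat{\theta})$ are compared.

\begin{verbatim}
# Monte Carlo check that MSE(theta_hat) = Var(theta_hat) + Bias(theta_hat)^2.
# Illustrative setup (an assumption, not from the text): estimate the mean
# theta of a Gaussian using the deliberately biased shrinkage estimator
# theta_hat = sum(x_i) / (n + 1).
import numpy as np

rng = np.random.default_rng(0)
theta, n, trials = 2.0, 10, 200_000

samples = rng.normal(loc=theta, scale=1.0, size=(trials, n))
theta_hat = samples.sum(axis=1) / (n + 1)      # one estimate per trial

mse = np.mean((theta_hat - theta) ** 2)        # E[(theta_hat - theta)^2]
var = np.var(theta_hat)                        # E[(theta_hat - E theta_hat)^2]
bias = np.mean(theta_hat) - theta              # E(theta_hat) - theta

print(f"MSE           = {mse:.5f}")
print(f"Var + Bias^2  = {var + bias**2:.5f}")  # agree up to Monte Carlo error
\end{verbatim}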

Equation (217) indicates that the MSE is composed of two parts: the variance of the estimator and its squared bias. A model with low bias but high variance tends to overfit the noise in the training data, while one with low variance but high bias tends to underfit and miss the underlying structure. In general, a proper bias-variance tradeoff needs to be made for an algorithm to avoid both overfitting and underfitting, so that it learns the essential relationship in the data while not being affected by the inevitable noise in the data.
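To make the tradeoff concrete, the sketch below uses an assumed polynomial-fitting setup (not from the text): polynomials of increasing degree are fit to noisy samples of a smooth target function. A low degree underfits (high bias: large error on both training and test data), while a high degree tends to overfit (high variance: small training error but larger error on held-out data).

\begin{verbatim}
# Illustrative sketch (assumed setup, not from the text): polynomials of
# increasing degree fit to noisy samples of a smooth target function.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)              # essential relationship

x_train = rng.uniform(0.0, 1.0, 15)
y_train = f(x_train) + rng.normal(0.0, 0.2, 15)  # inevitable noise
x_test = np.linspace(0.0, 1.0, 200)
y_test = f(x_test)                               # noise-free target

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, "
          f"test MSE = {test_mse:.4f}")
# Degree 1 underfits (large train and test error); degree 9 tends to
# overfit (tiny train error, larger test error); degree 3 balances the two.
\end{verbatim}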