- Batch gradient descent:
One way to minimize that function and find the optimal θ is gradient descent, an iterative algorithm that gets closer to the optimal parameters with each step.
First: we start with some random initial θ.
Second: we repeatedly perform this update (simultaneously for every j):

θj := θj − α ∂/∂θj J(θ)
Annotations:
- a := b means assign the value of b to a
- α is the learning rate (more explanation later)
- ∂ is the partial derivative symbol
Third: let's calculate that partial derivative to make the iteration more explicit. For a single training example (x, y):

∂/∂θj J(θ) = ∂/∂θj ½(hθ(x) − y)² = (hθ(x) − y) · xj
So for a single training example we get the rule:

θj := θj + α (y(i) − hθ(x(i))) xj(i)

This rule is called the LMS (least mean squares) rule, also known as the Widrow-Hoff learning rule. The magnitude of the update is proportional to the error term (y(i) − hθ(x(i))). Thus, for instance, if we encounter a training example on which our prediction nearly matches the actual value of y(i), there is little need to change the parameters; in contrast, a larger change will be made if our prediction hθ(x(i)) has a large error (i.e., it is far from y(i)).
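The single-example rule above can be sketched in a few lines of NumPy (a minimal sketch; the function name `lms_update` and the concrete numbers are just illustrative, not from the notes):

```python
import numpy as np

def lms_update(theta, x_i, y_i, alpha):
    """One LMS (Widrow-Hoff) update for a single training example.

    theta : parameter vector, shape (n,)
    x_i   : feature vector of example i, shape (n,)
    y_i   : target value of example i
    alpha : learning rate
    """
    error = y_i - theta @ x_i           # the error term y(i) - h_theta(x(i))
    return theta + alpha * error * x_i  # step proportional to the error

# A large error produces a large step; a near-zero error barely moves theta.
theta = lms_update(np.zeros(2), np.array([1.0, 2.0]), 5.0, 0.1)
# theta is now [0.5, 1.0]
```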
Fourth: when there is more than one training example, the rule becomes (repeat until convergence, for every j):

θj := θj + α Σ(i=1..m) (y(i) − hθ(x(i))) xj(i)
So, this is simply gradient descent on the original cost function J.
Annotation reminder:
- The subscript j in θj goes from 1 to n and denotes which parameter is being updated
- The superscript (i) denotes which training example (data row) is being used
This method looks at every example in the entire training set on every step, and is called batch gradient descent.
Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function.
Here is an example of gradient descent as it is run to minimize a quadratic function:
The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48, 30). The x's in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through.