NEO Share

Sharing The Latest Tech News


Linear Regression theory

December 30, 2020 by systems

  1. Batch gradient descent:

One way to minimize that cost function and find the optimal θ is gradient descent.

Gradient descent is an iterative algorithm that gets closer to the optimal parameters over time.

First: we start with some random initial θ

Second: we repeatedly perform this iteration:
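The update referred to here is not rendered in this version of the post; reconstructed in standard notation (the form the annotations below describe), the iteration is:

```latex
\theta_j := \theta_j - \alpha \,\frac{\partial}{\partial \theta_j} J(\theta)
\qquad \text{(simultaneously for every parameter } j\text{)}
```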

Annotations:

  • a := b means assign the value of b to a
  • α is the learning rate (more explanation later)
  • ∂ is the partial derivative symbol

Third: Let’s calculate that partial derivative for clearer iterations:
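The derivation itself is missing from this version; assuming the usual setup $h_\theta(x) = \sum_{k} \theta_k x_k$ and, for one example, $J(\theta) = \tfrac{1}{2}\bigl(h_\theta(x) - y\bigr)^2$, the standard chain-rule calculation gives:

```latex
\frac{\partial}{\partial \theta_j} J(\theta)
  = \frac{\partial}{\partial \theta_j}\,\frac{1}{2}\bigl(h_\theta(x) - y\bigr)^2
  = \bigl(h_\theta(x) - y\bigr)\cdot
    \frac{\partial}{\partial \theta_j}\Bigl(\sum_{k} \theta_k x_k - y\Bigr)
  = \bigl(h_\theta(x) - y\bigr)\, x_j
```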

So for a single training example we get the rule:
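The rule is not shown in this version; given the error term named just below, it is presumably the standard LMS update for training example $i$:

```latex
\theta_j := \theta_j + \alpha\,\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)\, x_j^{(i)}
```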

This rule is called the LMS (least mean squares) rule, or the Widrow-Hoff learning rule. A natural property of it: the magnitude of the update is proportional to the error term (y^(i) − h_θ(x^(i))).

Thus, for instance, if we encounter a training example on which our prediction nearly matches the actual value of y^(i), then there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction h_θ(x^(i)) has a large error (i.e., if it is very far from y^(i)).

Fourth: When there is more than one training example, the rule becomes:
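This rule is also missing from this version; summing the single-example update over all m training examples (m is an assumed symbol for the training-set size) gives the standard batch form:

```latex
\theta_j := \theta_j + \alpha \sum_{i=1}^{m}
  \bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)\, x_j^{(i)}
\qquad \text{(for every } j\text{)}
```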

So, this is simply gradient descent on the original cost function J.

Annotation reminder:

  • The subscript j in θj goes from 1 to n and denotes which parameter
  • The superscript i denotes which training example (data row)

This method looks at every example in the entire training set on every step, and is called batch gradient descent.
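As a minimal sketch of batch gradient descent (not the article's own code; the function name is hypothetical and NumPy is assumed), each step computes the error on all m examples before updating θ:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Batch gradient descent for linear regression.

    X is an (m, n) design matrix, y an (m,) target vector.
    Each iteration uses ALL m examples -- that is what makes it "batch".
    """
    m, n = X.shape
    theta = np.zeros(n)                  # start from an initial guess (zeros here)
    for _ in range(iterations):
        predictions = X @ theta          # h_theta(x^(i)) for every example
        errors = y - predictions         # error terms (y^(i) - h_theta(x^(i)))
        theta = theta + alpha * (X.T @ errors)   # summed update over all examples
    return theta

# Usage: fit y = 2x + 1 on a few points (the column of ones gives the intercept)
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 2.0 * np.arange(5.0) + 1.0
theta = batch_gradient_descent(X, y, alpha=0.05, iterations=5000)
```

Because this cost is convex, the run converges to θ ≈ (1, 2) provided α is small enough, echoing the note below about the single global minimum.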

Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic function.
Here is an example of gradient descent as it is run to minimize a quadratic function:

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48,30). The x’s in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through.

Filed Under: Machine Learning

