# The gradient descent function

How to find the minimum of a function using an iterative algorithm.

Other articles from this series
• What machine learning is about, types of learning and classification algorithms, introductory examples.

• Finding the best-fitting straight line through points of a data set.

• It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.

• How to upgrade a linear regression algorithm from one to many input variables.

• A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.

• Get your feet wet with another fundamental machine learning algorithm for binary classification.

• Preparing the logistic regression algorithm for the actual implementation.

• Overfitting makes linear regression and logistic regression perform poorly. A technique called "regularization" aims to fix the problem for good.

In the previous episode of this Machine Learning series we took a look at the theory behind linear regression. We also outlined some paper formulas for minimizing the cost function J, or, in other words, for finding the best values of the theta's, the parameters of the hypothesis function.

During that stage we basically guessed those best theta's by staring at the graphical plot on the screen. That was definitely not an automated task: this is where an algorithm called gradient descent comes in handy.

Gradient descent is a generic algorithm, not limited to linear regression and its cost function. In the following example I will minimize an arbitrary function J, then in the next chapter I'll apply it to the original house pricing task.

## The gradient descent algorithm in a nutshell

You have a generic function J(theta_0, theta_1, ... theta_n) with an arbitrary number of parameters. You want to minimize it, that is

stackrel"minimize"{theta_0, theta_1, ... theta_n}\ J(theta_0, theta_1, ... theta_n)

What to do:

1. start with some initial guesses for theta_0, theta_1, ... theta_n. The common choice is to set them to 0, but sometimes you may want to initialize them to other values;

2. keep changing those theta_0, theta_1, ... theta_n values to try to reduce J until you find a minimum for that function.

Picture 1. below displays a generic three-dimensional function J(theta_0, theta_1). There are only two theta's there, in order to generate an understandable plot. The height, that is the vertical axis, shows the value of J, the output. Minimizing that function means finding the lowest values of J, which correspond to the blue depressed areas.

1. A generic, 3-dimensional function J.

## Applying the descent algorithm in the real world

Suppose that the function in picture 1. represents a hilly landscape with yourself standing on the central mountain. Your task is to walk downhill as quickly as possible by taking small baby steps until you cannot go down any further. You start looking around you, asking yourself: what direction should I go in order to take a baby step downhill? You find the best one given your current position, you take the step and then ask yourself the same question again. Repeat the process until you cannot go down any further, that is until you find a minimum.

Picture 2. below shows the baby steps experiment with two different routes. Your starting position on the hill corresponds to the initial values given to theta_0, theta_1. The black route has a slightly different starting point compared to the white one, which reveals an interesting property of the gradient descent algorithm: changing the initial value of the theta's might lead you to a different minimum.
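This property is easy to verify numerically. The sketch below is a minimal example of mine, not taken from the plots above: I assume a quartic function J(x) = x^4 - 3x^2 + x, which has two distinct local minima, and run the same descent from two different starting points.

```python
# Gradient descent on a non-convex function J(x) = x^4 - 3x^2 + x,
# which has two distinct local minima. The starting point decides
# which one the descent ends up in.

def dJ(x):
    # derivative of J(x) = x^4 - 3x^2 + x
    return 4 * x**3 - 6 * x + 1

def descend(x, alpha=0.01, steps=2000):
    for _ in range(steps):
        x = x - alpha * dJ(x)
    return x

left = descend(-2.0)   # start on the left slope
right = descend(2.0)   # start on the right slope
print(left, right)     # two different minima
```

Starting from x = -2 the descent settles near x ≈ -1.30, while starting from x = 2 it settles near x ≈ 1.13: same function, same algorithm, different minimum.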

2. Two different downhill routes in a three-dimensional function, determined by the starting point.

## Gradient descent algorithm definition

The complete gradient descent formula looks like the following:

 theta_j := theta_j - alpha * del/(del theta_j) J(theta_0, theta_1, ... theta_n) \ \ \ \ \ \ \ "for"\ j=0, j=1, ... j=n\ "until convergence"

Let me dissect it a little. First of all, the := symbol means assignment. It works like assigning a value to a variable in a programming language, for example `int x = 3`.

The term alpha is a number called the learning rate. It controls the step size while descending the hill: the larger alpha, the larger the steps. The remaining part is the derivative of the function J; I will delve into it shortly.

To put it briefly, gradient descent works as follows: for every theta_j of the J function (that is theta_0, theta_1, ... theta_n), compute the new value of theta_j by subtracting from it the derivative of the function with respect to theta_j, multiplied by the number alpha. Rinse and repeat until convergence. We will see the concept of convergence later; for now, think of it as a point in your downhill walk where you haven't reached the optimal result yet, but you are really quite near it, if not on it.

The correct way to implement a gradient descent algorithm requires simultaneous updates of theta_j. That is, store each value in a temporary variable:

 "temp"_0 = theta_0 - alpha * del/(del theta_0) J(theta_0, theta_1, ... theta_n)

 "temp"_1 = theta_1 - alpha * del/(del theta_1) J(theta_0, theta_1, ... theta_n)

 "..."

 "temp"_n = theta_n - alpha * del/(del theta_n) J(theta_0, theta_1, ... theta_n)

Then assign those temporary values to the actual theta_j:

 theta_0 = "temp"_0

 theta_1 = "temp"_1

 "..."

 theta_n = "temp"_n

## A gradient descent algorithm at work

Let's now try to understand what each part of the gradient descent algorithm does. I'm going to use a simpler example with only one parameter, theta_0: it will help a lot with visualization.

I have a generic function J(theta_0) to minimize, that is stackrel"min"{theta_0}\ J(theta_0), with theta_0 in RR (a real number). The function might look like the one in picture 3. below.

3. A generic function J(theta_0) that looks like a parabola. On the left, a randomly chosen point theta_0 before the gradient descent. On the right, the point is closer to the minimum after an algorithm iteration.

You may also notice the black point on the curve: that's the initial value of theta_0 set during the gradient descent initialization. It's just a random value. The gradient descent function will shift that point until it reaches the minimum, that is the bottom of the parabola. Let's see how.

The core part of the algorithm is the derivative: it churns out the slope of the tangent line to the curve at the black point. Once the value of the slope is known, the algorithm adds or subtracts that value (scaled by alpha) in order to move the point closer to the minimum.

In picture 3. the slope of the tangent line is positive, so the derivative will be positive. The derivative at a point is just a number; let's call it d, with d > 0. Then the update to theta_0 will be:

 theta_0 := theta_0 - alpha * d

We are subtracting a positive number from theta_0, which makes it smaller. The point will shift to the left, as you may see in the rightmost plot in picture 3.
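A tiny numeric check of this step (the parabola J(theta_0) = theta_0^2 is an assumption of mine, picked because its derivative is simply 2 * theta_0):

```python
# One gradient descent step on J(t) = t^2, whose derivative is d = 2 * t.
# At t = 3 the slope is positive (d = 6), so the point must move left.

alpha = 0.25
t = 3.0
d = 2 * t          # derivative at the current point: 6.0
t = t - alpha * d  # 3.0 - 0.25 * 6.0 = 1.5
print(t)           # 1.5: shifted left, toward the minimum at 0
```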

Of course the algorithm also works in case you have a tangent with a negative slope, like when you initialize theta_0 so that it lands in the leftmost part of the plot. Look at the picture below for an example:

4. The same function J(theta_0) seen before, with a different starting point. Here the tangent line has negative slope.

Up there the slope of the tangent line is negative, so the derivative will be negative. Since it's a negative number, let's write it as -d, with d > 0. Then the update to theta_0 will be:

 theta_0 := theta_0 - alpha * (-d)

that is

 theta_0 := theta_0 + alpha * d

We are adding a positive number to theta_0, which makes it larger. The point will shift to the right, as you may see in the rightmost plot in picture 4.

The process continues until the algorithm reaches the minimum. But what does that mean? The minimum, namely the bottom of the parabola, is the point where the tangent has slope zero: a perfectly horizontal line. The derivative of such a horizontal line is 0 (that is, d = 0), so the algorithm will eventually subtract nothing from theta_0, like so:

 theta_0 := theta_0 - alpha * d

that is

 theta_0 := theta_0 - 0

At this point, theta_0 does not shift anymore: the minimum has been reached.
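The whole loop can be sketched as follows. Again, the parabola J(theta_0) = theta_0^2 and the stopping threshold are assumptions of mine, used only to make the example concrete:

```python
# Gradient descent on J(t) = t^2 (derivative d = 2 * t), iterating
# until the step becomes negligible, i.e. the tangent is nearly flat.

alpha = 0.1
t = 5.0                   # initial value of theta_0

while True:
    d = 2 * t             # slope of the tangent at the current point
    step = alpha * d
    t = t - step
    if abs(step) < 1e-9:  # convergence: theta_0 barely moves anymore
        break

print(t)  # very close to 0, the bottom of the parabola
```

Note that the loop stops when the step, not theta_0 itself, becomes tiny: near the minimum the slope approaches 0, so the steps shrink on their own.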

## The gotchas of alpha, the learning rate

The parameter alpha (the learning rate) defines how big the steps are during gradient descent. It's the number that multiplies the derivative during the theta_0 updates. The greater the learning rate, the faster the algorithm descends toward the minimum point.

Defining a good value for alpha requires some planning. If it's too small, the algorithm will take many tiny baby steps to get to the minimum, as shown in picture 5. below, left side. Thus gradient descent can be slow.

If alpha is too large, the algorithm can overshoot the minimum point and fail to converge, as seen in picture 5. below, right side. Worse, it could diverge, moving further and further away from the minimum and never finishing at all.

5. The effects of different sizes for the learning rate alpha. Baby steps on the left (slow processing), failure to converge on the right (divergence).

The good news is that once you've found a proper value for the learning rate, the algorithm will reach the minimum naturally, step by step: as theta_0 slides toward the minimum, the slope of the tangent line keeps getting smaller, until it reaches 0. So there's no need to adjust alpha over time; it can remain fixed for the entire process.
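The three behaviors can be observed with a quick experiment. The parabola J(theta) = theta^2 and the specific alpha values below are assumptions of mine; with this function the update is theta := theta - alpha * 2 * theta, which diverges as soon as alpha exceeds 1:

```python
# The effect of the learning rate on J(t) = t^2, whose update rule is
# t := t - alpha * 2 * t. A too-small alpha converges slowly, a good
# alpha converges quickly, a too-large alpha diverges.

def run(alpha, steps=50, t=1.0):
    for _ in range(steps):
        t = t - alpha * 2 * t
    return t

tiny = run(0.01)   # still far from 0 after 50 steps: slow
good = run(0.4)    # essentially 0: fast convergence
huge = run(1.1)    # grows without bound: divergence
print(tiny, good, huge)
```

After 50 iterations the tiny rate is still far from the minimum, the good one has essentially reached it, and the large one has blown up, overshooting the bottom by a larger amount at every step.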

In the next episode I will apply the gradient descent algorithm to our initial house pricing problem and churn out some useful values from it.