The gradient descent in action


It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.


In the previous article I outlined the theory behind the gradient descent algorithm. It's now time to apply it to our original problem: the house pricing task. Let's make some good use out of it!

Concretely, I'm going to apply gradient descent to the cost function [texi]J(\theta_0, \theta_1)[texi] found in the second episode, in order to minimize it. Let me recap all the formulas found so far. We have a linear hypothesis function

[tex] h_\theta(x) = \theta_0 + \theta_1 x [tex]

and its cost function

[tex] J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1x^{(i)} - y^{(i)})^2 [tex]

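The cost function above translates directly into code. Below is a minimal sketch in plain Python; the function name `cost` and the tiny data set are illustrative choices of mine, not from the article.

```python
def cost(theta0, theta1, xs, ys):
    """Mean squared error, halved: J = 1/(2m) * sum((theta0 + theta1*x - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Points lying exactly on y = 2 + 3x give zero cost for theta0=2, theta1=3.
xs = [1.0, 2.0, 3.0]
ys = [5.0, 8.0, 11.0]
print(cost(2.0, 3.0, xs, ys))  # 0.0
```

Note the factor [texi]\frac{1}{2m}[texi]: halving the average squared error doesn't change where the minimum is, but it cancels the 2 that appears when taking the derivative.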

I will apply the gradient descent algorithm

[tex] \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, ... \theta_n) \\ \text{ for } j=0, j=1, ... j=n \text{ until convergence} [tex]

to the cost function in order to minimize it, that is

[tex] \min_{\theta_0, \theta_1} J(\theta_0, \theta_1) [tex]

Once we have found the minimized (i.e. best) values for [texi]\theta_0[texi] and [texi]\theta_1[texi] we can plug them into the linear hypothesis function and obtain the best fitting line through our data set.

Finding the derivative

The key part of our task is the derivative inside the gradient descent function, that is

[tex] \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} (\theta_0 + \theta_1x^{(i)} - y^{(i)})^2 [tex]

Nothing fancy here: I just plugged in the definition of the cost function.

According to the gradient descent algorithm, we have to work out the derivative in two cases, one for each [texi]\theta[texi] in our cost function ([texi]\theta_0[texi] and [texi]\theta_1[texi] of course). I don't want to bore you with the step-by-step development of those derivatives, so trust the following results:

[tex] \begin{align} j = 0 : \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) & = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \\\\ j = 1 : \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) & = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \end{align} [tex]
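Those two partial derivatives are easy to compute directly. Here is a sketch, reusing a hypothetical `xs`/`ys` data set; both derivatives share the same error term [texi]h_\theta(x^{(i)}) - y^{(i)}[texi], and only the second one multiplies it by [texi]x^{(i)}[texi].

```python
def gradients(theta0, theta1, xs, ys):
    """Return (dJ/dtheta0, dJ/dtheta1) for the linear regression cost."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d0 = sum(errors) / m                             # j = 0
    d1 = sum(e * x for e, x in zip(errors, xs)) / m  # j = 1
    return d0, d1

# At the minimum of a noiseless data set (y = 2 + 3x) both slopes are zero.
xs = [1.0, 2.0, 3.0]
ys = [5.0, 8.0, 11.0]
print(gradients(2.0, 3.0, xs, ys))
```

As the article says, each derivative is just a plain number: the slope of the tangent line at the current [texi](\theta_0, \theta_1)[texi].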

Remember: each of those derivatives gives us the slope of a tangent line, which is just a plain number.

The linear regression algorithm in all its splendor

Now that we have found those derivatives, we can plug them back into the gradient descent function:

[tex] \begin{align} \text{repeat until convergence \{} \\ \theta_0 & := \theta_0 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \\\\ \theta_1 & := \theta_1 - \alpha \cdot \frac{1}{m} \cdot \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \\\\ \text{\}} \end{align} [tex]
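The whole loop above can be sketched in a few lines of Python. The starting values, learning rate, and iteration count below are illustrative assumptions; the important detail is the simultaneous update, where both new thetas are computed from the old ones.

```python
def linear_regression(xs, ys, alpha=0.05, iterations=5000):
    """Batch gradient descent for one-feature linear regression."""
    theta0, theta1 = 0.0, 0.0  # arbitrary starting point
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old thetas.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# On data lying on y = 2 + 3x, the algorithm should recover theta0≈2, theta1≈3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]
theta0, theta1 = linear_regression(xs, ys)
print(theta0, theta1)
```

In a real implementation you would also stop early once the cost stops decreasing ("until convergence"), rather than running a fixed number of iterations.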

This is the heart of linear regression. You may not know that we have implicitly dodged a nasty problem that can arise in other applications besides linear regression. In the previous article I mentioned that, depending on where you initialize gradient descent, different initial values of the [texi]\theta[texi]'s might lead to different minima of the cost function. That can never happen in linear regression: here the cost function is always bowl-shaped, with a single minimum point at the bottom. It is said to be a convex function.

Running the algorithm

Now that I've packed all things together, I am able to run the algorithm and see how it behaves with the initial data set. The picture below briefly shows how the hypothesis function and the cost function progress as the whole thing runs.

Gradient descent algorithm at work
1. The gradient descent algorithm at work on the linear regression.

In the first step, I've initialized the parameters of the cost function at random values (the white dot), say for example [texi]\theta_0 = 100, \theta_1 = 3[texi]. Those values correspond to [texi]h(x) = 100 + 3x[texi], definitely not a good outcome for the hypothesis function, as you may see on the left.

As the algorithm takes further steps of gradient descent, new values for the [texi]\theta[texi]s pop up in the contour plot. Those new values correspond to improved hypotheses that fit the data better. Eventually the algorithm reaches the minimum of the cost function, which yields the hypothesis function that best fits the data set. We can now proudly use that hypothesis function to predict new values, namely new house prices given new sizes.

Some final words on the algorithm

The gradient descent algorithm used so far is called batch gradient descent, because each step runs over the entire data set. You can see that in the two derivatives we found earlier: the summation [texi]\sum_{i=1}^{m} ...[texi] forces the algorithm to scan the whole data set, from [texi]1[texi] to [texi]m[texi], where [texi]m[texi] is the total number of training examples. There are non-batch variants of gradient descent that look at a small subset of the training set on each step in order to speed up the computation. We will see some examples in the future.
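As a taste of those non-batch variants, here is a sketch of stochastic gradient descent, which updates the thetas after every single example instead of summing over all [texi]m[texi] of them. The function name, hyperparameters, and data set are my own illustrative choices; the article only hints at this family of algorithms.

```python
import random

def stochastic_gd(xs, ys, alpha=0.01, epochs=2000, seed=0):
    """Stochastic gradient descent: one update per training example."""
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)           # visit examples in random order
        for i in indices:
            error = theta0 + theta1 * xs[i] - ys[i]
            theta0 -= alpha * error            # same update rules, but m = 1
            theta1 -= alpha * error * xs[i]
    return theta0, theta1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]
print(stochastic_gd(xs, ys))
```

On a tiny data set like this the batch version is just as fast, but when [texi]m[texi] is in the millions, taking a step per example (or per small mini-batch) is dramatically cheaper than summing over the whole set before every step.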

Finally, our algorithm is clearly an iterative one. There are non-iterative alternatives based on the normal equation that find the minimum in a single closed-form step. However, the iterative versions scale better to large data sets. We will take a look at those alternatives soon.
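For the one-feature case of this article, the closed-form solution reduces to two well-known formulas involving the means of the data, a special case of the general (matrix) normal equation. A sketch, assuming the same kind of `xs`/`ys` data set as before:

```python
def normal_equation(xs, ys):
    """Closed-form least squares for simple (one-feature) linear regression."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # Slope: covariance of x and y divided by the variance of x.
    theta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
              / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the best-fit line passes through the point of means.
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]
print(normal_equation(xs, ys))  # exactly (2.0, 3.0) for y = 2 + 3x
```

No learning rate, no iterations: one pass over the data and you are done. The catch is that the general matrix form requires inverting an [texi]n \times n[texi] matrix, which becomes expensive when the number of features grows large.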

Sources

Machine learning @ Coursera - Gradient Descent For Linear Regression (link)
StackOverflow - Why gradient descent when we can solve linear regression analytically (link)