— Written by Triangles on April 01, 2017 • ID 52 —
A collection of practical tips and tricks to improve the gradient descent process and make it easier to understand.
Introduction to machine learning — What machine learning is about, types of learning and classification algorithms, introductory examples.
Linear regression with one variable — Finding the best-fitting straight line through points of a data set.
The gradient descent function — How to find the minimum of a function using an iterative algorithm.
The gradient descent in action — It's time to put together the gradient descent with the cost function, in order to churn out the final algorithm for linear regression.
Multivariate linear regression — How to upgrade a linear regression algorithm from one to many input variables.
Introduction to classification and logistic regression — Get your feet wet with another fundamental machine learning algorithm for binary classification.
The cost function in logistic regression — Preparing the logistic regression algorithm for the actual implementation.
Real-world data can come up in different orders of magnitude. For example, your age ranges from 0 to 100 years, while your yearly income from €10,000 to €10,000,000 (and more). Using such unprocessed data as input features for a linear regression system might slow down the gradient descent algorithm to a crawl.
It happens because, as we will see shortly, such unnormalized data warps the cost function that the gradient descent has to process, making the minimum point really difficult to reach.
Because of that, an important trick in machine learning and in linear regression is to make sure that all the input features are on a similar scale. This is a preparatory step you do in order to optimize the input data, known as feature scaling.
In feature scaling you basically normalize your input values. For example, say you have two features: age, which ranges from 0 to 100 years, and yearly income, which ranges from €10,000 to €10,000,000.
Below you will find a contour plot for the cost function [texi]J(\theta_1, \theta_2)[texi] as if we were using the raw, unprocessed values. As you can see, the result is a very thin and stretched version of it. The gradient descent algorithm would oscillate back and forth a lot, taking a long time to find its way to the minimum point.
With feature scaling we will bring back the original bowl-shaped figure, in order to let the gradient descent algorithm do its job efficiently. You have two options here: min-max scaling or standardization.
The idea is to get every input feature into approximately a [texi][-1,1][texi] range. The name comes from the use of the [texi]\min[texi] and [texi]\max[texi] functions, namely the smallest and greatest values in your dataset. It requires subtracting the minimum from each input value and dividing by the range (i.e. the maximum value minus the minimum value) of the input variable:
[tex] x_i^{\prime} = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)} [tex]
where [texi]x_i[texi] is the original [texi]i[texi]-th input value and [texi]x_i^{\prime}[texi] is its normalized version.
For example, say I'm dealing with the yearly income [texi]x_1[texi] and in particular I want to normalize the value of $30,000:
[tex] x_1^{\prime} = \frac{30,000 - 10,000}{10,000,000 - 10,000} \approx 0.002 [tex]
Just rinse and repeat such normalization for every value in your dataset. Of course if you are in a multivariate scenario remember to skip feature [texi]x_0[texi], since [texi]x_0 = 1[texi] as seen in the previous episode.
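The whole min-max procedure can be sketched in a few lines of plain Python; the income figures are the hypothetical ones from the example above:

```python
def min_max_scale(values):
    """Rescale every value into the [0, 1] range using the dataset's min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical yearly incomes, including the $30,000 from the example above.
incomes = [10_000, 30_000, 10_000_000]
print(round(min_max_scale(incomes)[1], 3))  # ≈ 0.002, as computed above
```

Note that this formula actually maps values into [texi][0,1][texi]; that is still well within the [texi][-1,1][texi] ballpark the gradient descent needs.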
This technique also goes under the name of z-score normalization, and many other confusing aliases I wish I could forget. In brief, you transform your data set so that the values have two properties of a standard normal distribution, namely mean 0 ([texi]\mu = 0[texi]) and standard deviation 1 ([texi]\sigma = 1[texi]). Unlike min-max scaling, with standardization you are thinking in terms of how many standard deviations a value lies from the mean of the entire data set.
The general formula for standardization:
[tex] x_i^{\prime} = \frac{x_i - \mu_i}{\sigma_i} [tex]
Following the links to my previous articles above, I'm able to compute the mean and the standard deviation of my data set. I'll show you an example with the yearly income [texi]x_1[texi]. Suppose I have collected five samples: $10,000, $30,000, $32,000, $35,000 and $150,000. The mean and the (population) standard deviation are:
[tex] \begin{equation} \mu_1 = 51,400 \qquad \sigma_1 \approx 50,078 \end{equation} [tex]
Now, let's apply the standardization to the value of $30,000 as I did before with the min-max scaling:
[tex] x_1^{\prime} = \frac{30,000 - 51,400}{50,078} \approx -0.4 [tex]
You can read it as [texi]-0.4[texi] standard deviations ([texi]-0.4\text{STD}[texi]) from the mean.
Rinse and repeat the procedure for every value in your dataset as for the min-max scaling, and remember to skip [texi]x_0[texi] in multivariate problems.
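Standardization follows the same pattern. Here is a minimal sketch in plain Python using the five income samples from the example (with the population standard deviation, as in the computation above):

```python
def standardize(values):
    """Center the values on their mean and divide by the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

# The five yearly income samples from the example above.
incomes = [10_000, 30_000, 32_000, 35_000, 150_000]
print(round(standardize(incomes)[1], 1))  # ≈ -0.4, i.e. -0.4 STD from the mean
```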
Using standardization is important when you are comparing measurements that have different units, like years and dollars. It is also a general requirement for many machine learning algorithms besides linear regression. As a rule of thumb I'd say: when in doubt, just standardize the data, it shouldn't hurt.
You want to know if the gradient descent is working correctly. Since the job of the gradient descent is to find the values of [texi]\theta[texi] that minimize the cost function, you could plot the cost function itself (i.e. its output) and see how it behaves as the algorithm runs.
The image below shows what I mean: the number of iterations on the horizontal axis, the cost function output on the vertical one. On each iteration the gradient descent churns out new values of [texi]\theta[texi]: you take those values and evaluate the cost function [texi]J(\theta)[texi]. You should see a descending curve if the algorithm behaves well: it means that it's minimizing the cost function correctly.
More generally, the gradient descent works properly when [texi]J(\theta)[texi] decreases after every iteration.
Plotting [texi]J(\theta)[texi] also tells you whether or not the gradient descent has converged. Different problems require different numbers of iterations until convergence, so in general you can assume that the algorithm has found a minimum when [texi]J(\theta)[texi] decreases by less than some small value [texi]\epsilon[texi] in one iteration.
Choosing a proper value for [texi]\epsilon[texi] is not an easy task. Some people set it to [texi]10^{-3}[texi] and automate the check in what is called an automatic convergence test: their algorithm stops when [texi]J(\theta)[texi] has decreased by less than [texi]\epsilon[texi] in one iteration.
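The automatic convergence test can be sketched like this; `run_until_convergence`, `step` and `cost` are hypothetical names standing in for a real gradient descent step and cost function:

```python
def run_until_convergence(step, cost, theta, epsilon=1e-3, max_iters=10_000):
    """Iterate until the cost decreases by less than epsilon in one iteration."""
    prev_cost = cost(theta)
    for _ in range(max_iters):
        theta = step(theta)
        curr_cost = cost(theta)
        if prev_cost - curr_cost < epsilon:
            break  # converged: the improvement fell below epsilon
        prev_cost = curr_cost
    return theta

# Toy example: minimize J(theta) = theta**2, whose gradient is 2 * theta.
alpha = 0.1
theta = run_until_convergence(step=lambda t: t - alpha * 2 * t,
                              cost=lambda t: t ** 2,
                              theta=5.0)
```

The `max_iters` cap is a safety net: if [texi]\alpha[texi] is so large that the cost never stops improving by more than [texi]\epsilon[texi] (or grows), the loop still terminates.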
If your [texi]J(\theta)[texi] plot seen before starts to look weird (curves going upward, a dramatically slow decrease, and so on), the gradient descent is not working properly: it is time to fix the learning rate [texi]\alpha[texi] by using a smaller value.
It has been proven mathematically that for a sufficiently small [texi]\alpha[texi], [texi]J(\theta)[texi] decreases on every iteration. On the other hand, if [texi]\alpha[texi] is too small, the gradient descent can be slow to converge.
The rule of thumb here is to try a range of [texi]\alpha[texi] values. Start with [texi]\alpha = 0.001[texi] and look at the [texi]J(\theta)[texi] plot. Does it decrease properly and rapidly? Then you are done. Otherwise, switch to [texi]\alpha = 0.01[texi] (a [texi]\times 10[texi] scale), rinse and repeat until the algorithm works fine.
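That search can be sketched as a small sweep: give each candidate [texi]\alpha[texi] a fixed number of iterations and compare the final costs. This is a toy illustration on [texi]J(\theta) = \theta^2[texi], not part of the original algorithm; a rate that stalls or oscillates (like [texi]\alpha = 1[texi] here) is immediately visible.

```python
def sweep_learning_rates(grad, cost, theta0, alphas, iters=50):
    """Run a fixed number of gradient steps for each alpha and record the final cost."""
    results = {}
    for alpha in alphas:
        theta = theta0
        for _ in range(iters):
            theta = theta - alpha * grad(theta)  # one gradient descent step
        results[alpha] = cost(theta)
    return results

# Toy objective J(theta) = theta**2 with gradient 2 * theta.
results = sweep_learning_rates(grad=lambda t: 2 * t,
                               cost=lambda t: t ** 2,
                               theta0=5.0,
                               alphas=(0.001, 0.01, 0.1, 1.0))
# alpha = 0.001 barely moves, alpha = 1.0 oscillates forever,
# alpha = 0.1 drives the cost close to zero.
```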
Machine Learning @ Coursera - Gradient Descent in Practice I - Feature Scaling (link)
Machine Learning @ Coursera - Gradient Descent in Practice II - Learning Rate (link)
Wikipedia - Feature Scaling (link)
Sebastianraschka.com - About Feature Scaling and Normalization (link)