
Linear regression with one variable

Finding the best-fitting straight line through points of a data set.

Linear regression is one of the most famous ways to describe your data and make predictions on it. Picture 1 below, borrowed from the first chapter of this stunning machine learning series, shows the housing prices of a fantasy country somewhere in the world. You are collecting real-estate information because you want to predict the house prices given, say, the size in square feet.

1. House prices given their size.

Given your input data, how can you predict any house price outside your initial data set? For example, how much is a 1100 square feet house worth? Linear regression will help answer that question: you shrink your data into a line (the dotted one in the picture above), with a corresponding mathematical equation. If you know the equation of that line, you can find any output (y) given any input (x).
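For instance, suppose the dotted line turned out to be §y = 50,000 + 200x§ (made-up numbers, purely for illustration): a 1100 square feet house would then be estimated at §50,000 + 200 * 1100 = 270,000§ dollars.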

Terminology and notations

When you gathered your initial data, you actually created the so-called training set, which is the set of housing prices. The algorithm's job is to learn from that data in order to predict the prices of new houses. You are using input data to train the program: that's where the name comes from.

The training set can be summarized in a table, like the one you see below:

size (feet²) §(x)§ | price ($) §(y)§
815                | 165,000
1510               | 310,000
2100               | 410,000
...                | ...

The number of training examples, that is the number of rows in the table above, is denoted by §m§; the input variable §x§ is the house size in the left column and §y§ is the output variable, namely the price, in the right column.

The pair §(x, y)§ denotes a single, generic training example, while §(x^{(i)}, y^{(i)})§ represents a specific training example. So if I write §(x^{(2)}, y^{(2)})§ I'm referring to the second row in the table above, where §x^{(2)} = 1510§ and §y^{(2)} = 310,000§.
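If it helps to see the notation in action, here is a minimal sketch in Python (the numbers come from the table above; the variable names are my own):

```python
# Training set from the table above: m = 3 examples.
x = [815, 1510, 2100]          # input variable: size in square feet
y = [165000, 310000, 410000]   # output variable: price in dollars

m = len(x)                     # number of training examples

# The second training example (x^(2), y^(2)). The article counts
# examples from 1, while Python lists are indexed from 0.
print(x[1], y[1])              # -> 1510 310000
```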

Naming the algorithm parts

2. Overview of a linear regression algorithm.

The training set of housing prices is fed into the learning algorithm. Its main job is to produce a function, which by convention is called §h§ (for hypothesis). You then use that hypothesis function to output the estimated house price §y§, by giving it the size of a house §x§ as input.

The hypothesis function

The hypothesis function must have a formula, like any other function in the world. That is:

§h_theta(x) = theta_0 + theta_1x§

Thetas (§theta_i§ in general) are the parameters of the function. Usually the theta subscript is dropped and the hypothesis function is simply written as §h(x)§.

That formula might look scary, but if you think about it, it's nothing fancier than the traditional equation of a line, except that we use §theta_0§ and §theta_1§ instead of §q§ and §m§. Do you remember?

§ f(x) = q + mx §

In fact, the hypothesis function is just the equation of the dotted line you saw in picture 1.

In our humble hypothesis function there is only one variable, namely §x§. For this reason our task is often called linear regression with one variable. Experts also call it univariate linear regression, where univariate means "one variable".
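Just to make the idea concrete, here is a minimal sketch of the hypothesis function in Python; the parameter values are made up, consistent with the illustrative numbers used earlier:

```python
def h(x, theta_0, theta_1):
    """Hypothesis function: a straight line with intercept theta_0
    and slope theta_1."""
    return theta_0 + theta_1 * x

# With made-up parameters, estimate the price of a 1100 square feet house.
print(h(1100, 50000, 200))     # -> 270000
```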

The cost function: a mathematical intuition

Well, at this point we know that there's a hypothesis function to be found. More precisely, we have to find the parameters §theta_0§ and §theta_1§ so that the hypothesis function best fits the training data. If you recall how the equation of a line works, those two parameters control the slope and the height of the line. By tweaking §theta_0§ and §theta_1§ we want to find the line that best represents our data. Picture 3 below shows what I mean:

3. Varying the values of §theta_0§ and §theta_1§ produces different outcomes: from good (left) to bad (right).

We definitely want something like the first example in the picture above. So how do we find proper values for §theta_0§ and §theta_1§? You certainly recall that our training set contains several examples where we know both the size of the house §x§ and its actual price §y§, because we previously surveyed that data. So, in a nutshell, the idea is: let's choose the hypothesis function parameters so that, at least on the existing training set, feeding the §x§ values into the hypothesis function yields reasonably accurate predictions for the §y§ values. Once we are satisfied, we can use the hypothesis function with its pretty parameters to make predictions on new input data.

From a mathematical point of view I want the difference §h_theta(x^{(i)}) - y^{(i)}§ to be very small for each §i§-th point in my data set. Here §h_theta(x^{(i)})§ is the prediction of the hypothesis when it is given the size of house number §i§ as input, while §y^{(i)}§ is the actual price of house number §i§. If that difference is small, it means that the hypothesis has made an accurate prediction, because it's close to the actual data.

The operation I described so far is part of the so-called mean squared error function (MSE), a function that does exactly what we want: it measures how close a fitted line is to some data points. The smaller the MSE, the closer the fit is to the data. There are actually many other functions that work well for such a task, but the MSE is the most commonly used one for regression problems.

If I plug our data into the MSE function, our final formula looks like this:

§text{MSE} = 1/{2m} sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)})^2§

Note the §1/{2m}§ and the summation part: we are computing a mean, halved for convenience. That 2 in the denominator will ease some calculations in future steps. Also, the differences are squared so that negative values do not cancel out positive ones.

Let me now expand the above equation. Since

§h_theta(x^{(i)}) = theta_0 + theta_1x^{(i)}§

then

§text{MSE} = 1/{2m} sum_{i=1}^{m} (theta_0 + theta_1x^{(i)} - y^{(i)})^2§

By convention we define a cost function (also known as a loss function) §J§, which is just the above equation written more compactly:

§J(theta_0, theta_1) = 1/{2m} sum_{i=1}^{m} (theta_0 + theta_1x^{(i)} - y^{(i)})^2§
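A minimal sketch of the cost function, assuming the training set is stored in two plain Python lists like before:

```python
def J(theta_0, theta_1, x, y):
    """Cost function: half the mean squared error of the hypothesis
    theta_0 + theta_1 * x over the m training examples."""
    m = len(x)
    total = sum((theta_0 + theta_1 * x_i - y_i) ** 2
                for x_i, y_i in zip(x, y))
    return total / (2 * m)
```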

Now, we want to find values of §theta_0§ and §theta_1§ so good that the above cost function produces the smallest possible value (because a small value means a small error). This is an optimization problem: the problem of finding the best solution from all feasible solutions. It can be written as

§stackrel"minimize"{theta_0, theta_1}\ J(theta_0, theta_1)§
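Just to see the optimization problem in action before any proper algorithm is introduced (the next chapter derives gradient descent for this), here is a sketch that hands the cost function to an off-the-shelf numerical optimizer. The use of scipy is my own assumption, not something from the article:

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([815, 1510, 2100])
y = np.array([165000, 310000, 410000])

def J(theta):
    """Cost as a function of the parameter vector [theta_0, theta_1]."""
    theta_0, theta_1 = theta
    return np.sum((theta_0 + theta_1 * x - y) ** 2) / (2 * len(x))

best = minimize(J, x0=[0.0, 0.0])
print(best.x)   # approximately the (theta_0, theta_1) pair that minimizes J
```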

Applying the cost function

Let's now feed our theoretical function with some real data. To better understand how the cost function works, I will temporarily set §theta_0 = 0§ so that our hypothesis function looks like

§h_theta(x) = theta_1x§

and the minimization task like

§stackrel"minimize"{theta_1}\ J(theta_1)§

This will help a lot with the cost function visualization: keeping §theta_0 != 0§ would require a three-dimensional plot that initially would be a source of annoyance. Just remember: with §theta_0 = 0§ the hypothesis function becomes a line passing through the origin §(0, 0)§, with §theta_1§ controlling its slope.

Cost function with one variable

Picture 4 shows the relationship between the hypothesis function and the cost function. Let's suppose that our data set is made of the three points you can see in the leftmost plot.

4. Hypothesis function (left) and cost function (right) with one variable.

Changing the value of §theta_1§, namely the slope of the hypothesis function, produces points on the cost function plot.

For example, with §theta_1 = 1§:

§J(theta_1 = 1) = 1/{2m} sum_{i=1}^{m} (h_theta(x^{(i)}) - y^{(i)})^2§

Since §theta_0 = 0§, the hypothesis reduces to §theta_1x^{(i)}§ and I can rewrite the cost as:

§J(theta_1 = 1) = 1/{2m} sum_{i=1}^{m} (theta_1x^{(i)} - y^{(i)})^2§

Now let's plug some numbers in:

§J(theta_1 = 1) = 1/6 [(1*1 - 1)^2 + (1*2 - 2)^2 + (1*3 - 3)^2]§

§J(theta_1 = 1) = 1/6 [(0)^2 + (0)^2 + (0)^2]§

§J(theta_1 = 1) = 0§

In words: for §theta_1 = 1§, the cost function has produced a value of 0. Let's try with the other two values:

§J(theta_1=0.5) = 1/6 [(0.5*1 - 1)^2 + (0.5*2 - 2)^2 + (0.5*3 - 3)^2] ~= 0.6§

§J(theta_1=0) = 1/6 [(0*1 - 1)^2 + (0*2 - 2)^2 + (0*3 - 3)^2] ~= 2.3§
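These numbers are easy to double-check with a few lines of code; I'm assuming here that the three data points in picture 4 are §(1, 1)§, §(2, 2)§ and §(3, 3)§, which is what the calculations above imply:

```python
def J(theta_1, xs=(1, 2, 3), ys=(1, 2, 3)):
    """Cost function with theta_0 fixed at 0."""
    m = len(xs)
    return sum((theta_1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

for t in (1, 0.5, 0):
    print(t, round(J(t), 1))   # -> 1 0.0 / 0.5 0.6 / 0 2.3
```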

In picture 4 you may find the values 0, 0.6 and 2.3 plotted as full dots on the cost function. The empty ones are values from other thetas, not shown in the hypothesis function plot but computed separately: they reveal that the cost function is actually a parabola with its minimum at 0.

You can read the whole picture in reverse: every point of the cost function corresponds to a specific slope of the hypothesis function. We decided to take the best value, namely the minimum of the cost function. Looking at the curve, the value that minimizes §J(theta_1)§ is §theta_1 = 1§: that value gives the best slope of the hypothesis function for our particular training set.

Cost function with two (or more) variables

I previously simplified the problem by setting §theta_0§ to zero, a decision that led to a 2-dimensional cost function plot: great for learning purposes but not so realistic. In the real world, 3-dimensional (and higher!) cost functions are quite common. Fortunately there are some neat ways to visualize them without losing too much information or mental sanity.

The first approach is to just draw the actual function in three dimensions, as shown in picture 5. Just note that, since we don't know the exact formula yet, the axis values are more or less random.

5. Three-dimensional cost function.

It looks like a cup, and the optimization problem consists of finding the lowest point at the bottom of it.

Sometimes, when the picture becomes too messy, it's common to switch to another representation: the contour plot. A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant horizontal slices, called contours, in a 2-dimensional format. Picture 6 shows what I'm talking about.

6. Contour plot of the cost function.

It's now much easier to read, don't you think? Our optimization task now requires finding the exact center of that figure, where the cost function reaches its minimum.
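For the curious, here is a sketch of how such a plot can be produced with matplotlib; the housing numbers and the grid ranges are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([815, 1510, 2100])
y = np.array([165000, 310000, 410000])

# Evaluate J(theta_0, theta_1) on a grid of parameter values.
theta_0, theta_1 = np.meshgrid(np.linspace(-100000, 150000, 200),
                               np.linspace(0, 400, 200))
J = np.zeros_like(theta_0)
for x_i, y_i in zip(x, y):
    J += (theta_0 + theta_1 * x_i - y_i) ** 2
J /= 2 * len(x)

# Each contour line is a constant horizontal slice of the 3-D surface.
plt.contour(theta_0, theta_1, J, levels=30)
plt.xlabel("theta_0")
plt.ylabel("theta_1")
plt.show()
```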

However, so far we have just guessed the best values of the thetas, simply by looking at the function's plot. What we actually want is for our program to find the values that minimize the cost function. In the next chapters we will see some proper algorithms for doing that.

Sources

Machine learning @ Coursera - Model representation (link)
Machine learning @ Coursera - Cost function (link)
Machine learning @ Coursera - Cost function intuition 1 (link)
Machine learning @ Coursera - Cost function intuition 2 (link)
Wikipedia - Mean squared error (link)
Wikipedia - Optimization problem (link)
Wikipedia - Loss function (link)
NIST/SEMATECH e-Handbook of Statistical Methods - Contour plot (link)
