
— Written by Triangles on August 06, 2015 • ID 10 —

Basic tools to split and analyze data sets.

Any set of data, arranged in ascending or descending order, can be divided into various parts, also known as *partitions* or *subsets*, regulated by *quantiles*. Quantile is a generic term for those values that divide the set into *n* partitions of equal size, so that each part represents 1/n of the set. Quantiles are not the partitions themselves; they are the numbers that define the partitions. You can think of them as a sort of numeric boundary.

There are various kinds of quantiles, like the *quartiles* (watch out for the different letter!), which divide a list of numbers into quarters. They are also known as *4-quantiles*. The list of quantile types is quite long; you can also have 3-quantiles, 5-quantiles, 6-quantiles, up to 1000-quantiles (and above, I guess). It all depends on the size of your data and what you have to do with it.

What follows is an example with quartiles, where *x* is a set of numbers and the subscripts 1, 2 and 3 mark the three quartiles:

§ x = {5, 6_1, 9, 11_2, 13, 20_3, 26} §

- first quartile, or Q1 = 6
- second quartile, or Q2 = 11
- third quartile, or Q3 = 20

The second quartile always corresponds to the median of the set *x*.
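
The quartiles of this small set can also be checked with a few lines of Python. This is only a sketch: it relies on `statistics.quantiles` from the standard library (Python 3.8 or later), whose default "exclusive" method happens to reproduce the values listed above. Other conventions for computing quartiles exist and may return slightly different numbers for the same data.

```python
import statistics

# The seven-number set from the example above.
x = [5, 6, 9, 11, 13, 20, 26]

# n=4 asks for the three cut points that split the data into quarters.
# The default method is "exclusive"; method="inclusive" can return other values.
q1, q2, q3 = statistics.quantiles(x, n=4)

print(q1, q2, q3)                  # the three quartiles
print(q2 == statistics.median(x))  # the second quartile is the median
```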

*Percentiles* are quite similar to quantiles: they split your set, but only into two partitions. For a generic *k*-th percentile, the lower partition contains *k*% of the data, and the upper partition contains the rest of the data, which amounts to (100 - *k*)%, because the total amount of data is 100%. Of course *k* can be any number between 0 and 100.

Say I have the usual set of numbers:

§ x = {1, 2, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 10, 11, 13, 14, 14, 18, 20, 20, 21} §

and I want to compute the *percentile ranking* of a specific number at position *n*, for example 13. In *x*, the number at position 13 is 8. I can use the generic formula:

§ "percentile(n)" = "number of values below n" / "size of set x" * 100 §

which turns into:

§ "percentile(13)" = 12 / 23 * 100 ~~ 52.2 §

In words, the number at position 13 splits the set in two parts: the lower one contains approximately 52.2% of the data, while the upper one contains approximately 47.8% of the data. I can also say that the number at position 13 is approximately the *52.2th percentile* of the set, namely the value that divides the data so that approximately 52.2% is below it.
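
Here is the same calculation as a minimal Python sketch. The function name `percentile_rank` is my own invention for illustration, and it assumes positions are counted from 1, as in the text.

```python
def percentile_rank(data, position):
    """Percentile ranking of the value at a 1-based position in the sorted data."""
    ordered = sorted(data)
    if not 1 <= position <= len(ordered):
        raise ValueError("position must be between 1 and len(data)")
    values_below = position - 1  # everything that comes before that position
    return values_below / len(ordered) * 100


x = [1, 2, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 10, 11,
     13, 14, 14, 18, 20, 20, 21]

rank = percentile_rank(x, 13)
print(f"value at position 13: {sorted(x)[12]}")
print(f"percentile ranking: {rank:.2f}%")
```

Note that the formula counts positions, not distinct values: the 8 at position 13 has two other 8s right before it, and they are counted as being "below" it.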

The opposite question is just as common. Given a set of numbers, I want to know the position *n* of the number that gives me a percentile ranking *r* (or the *r*-th percentile). Given the previous data set, I want to know what value exists at a percentile ranking of, say, 40% (or at the 40th percentile). I can use the generic formula:

§ n = r / 100 * ("size of set x" + 1) §

which is:

§ n = 40 / 100 * (23 + 1) = 9.6 §

It tells me that the number I'm looking for lies roughly in the 9.6th position of my set, between the 9th value (6) and the 10th value (7). Unfortunately I'm working with a discrete set of numbers and I don't have such a 9.6th value in there. I can easily approximate it: since 9.6 is closer to 10 than to 9, I round up and pick the 10th position, which corresponds to the number 7.

In words, the 40th percentile in *x* is the number at position 10 (7 in my data set).
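
The lookup can be sketched in Python as well. The function name `value_at_percentile` is hypothetical, and rounding to the nearest position is just the approximation used in this article. Statistical libraries often interpolate between the two neighbouring values instead, so they may report a slightly different result.

```python
def value_at_percentile(data, r):
    """Return (exact position, rounded position, value) for the r-th percentile."""
    ordered = sorted(data)
    exact_position = r / 100 * (len(ordered) + 1)
    # Round to the nearest whole position. Python's round() sends exact
    # halves (x.5) to the nearest even integer, which does not matter here.
    position = round(exact_position)
    # Keep the position inside the set for very low or very high rankings.
    position = max(1, min(len(ordered), position))
    return exact_position, position, ordered[position - 1]


x = [1, 2, 4, 5, 5, 5, 6, 6, 6, 7, 8, 8, 8, 9, 10, 11,
     13, 14, 14, 18, 20, 20, 21]

exact_position, position, value = value_at_percentile(x, 40)
print(f"exact position: {exact_position:.1f}")
print(f"rounded position: {position}, value: {value}")
```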

