Amar's Blog

Sep 1, 2007

Making sense of standard deviation

I love the feeling of getting to understand a seemingly abstract concept in intuitive, real-world terms. It means you can comfortably and freely use it in your head to analyse and understand things and to make predictions. No formulas, no paper, no Greek letters. It’s the basis for effective analytical thinking. The best measure of whether you’ve “got it” is how easily you can explain it to someone and have them understand it to the same extent. I think I recently reached that point with understanding standard deviation, so I thought I’d share those insights with you.

Standard deviation is one of those very useful and actually rather simple mathematical concepts that most people tend to sort-of know about, but probably don’t understand to a level where they can explain why it is used and why it is calculated the way it is. This is hardly surprising, given that good explanations are rare. The Wikipedia entry, for instance, like all entries on mathematics and statistics, is absolutely impenetrable.

First of all, what is deviation? Deviation is simply the “distance” of a value from the mean of the population that it’s part of:

[Figure: Deviation]

Now, it would be great to be able to summarise all these deviations with a single number. That’s exactly what standard deviation is for. But why don’t we simply use the average of all the deviations, ignoring their sign (the mean absolute deviation or, simply, mean deviation)? That would be quite easy to calculate. However, consider the following two variables (for simplicity, I will use data sets with a mean of zero in all my examples):

[Figure: Standard deviation vs. mean deviation]

There’s obviously more variation in the second data set than in the first, but the mean deviation won’t capture this; it’s 2 for both variables. The standard deviation, however, will be higher for the second variable: 2.24. This is the crux of why standard deviation is used. In finance, it’s called volatility, which I think is a great, descriptive name: the second variable is more volatile than the first. [Update: It turns out I wasn't being accurate here. Volatility is the standard deviation of the changes between values – a simple but significant difference.] Dispersion is another good word, but unfortunately it already has a more general meaning in statistics.
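As a rough sketch of the comparison, the two data sets below are hypothetical: they are not the values from the original chart, only numbers chosen to match the figures quoted (a mean of zero, a mean deviation of 2 for both, and a standard deviation of about 2.24 for the second). The standard deviation comes from Python's standard-library `statistics.pstdev`; `mean_deviation` is a helper name made up for this illustration.

```python
from statistics import fmean, pstdev

def mean_deviation(values):
    """Average distance from the mean, ignoring sign."""
    centre = fmean(values)
    return fmean([abs(v - centre) for v in values])

# Hypothetical data sets (assumed for illustration, not the original chart data).
first = [-2, 2, -2, 2]
second = [-1, 3, -3, 1]

for name, data in (("first", first), ("second", second)):
    # Columns: name, mean deviation, standard deviation (rounded)
    print(name, mean_deviation(data), round(pstdev(data), 2))
```

With these assumed values the mean deviation is identical for both sets, while the standard deviation is larger for the second.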

Next, let’s try to understand why this works; that is, how does the calculation of standard deviation capture this extra dispersion on top of the mean deviation?

Standard deviation is calculated by squaring all the deviations, taking the mean of those squares and finally taking the square root of that mean. It’s the root-mean-square (RMS) deviation (N below is the size of the sample):

RMS Deviation = √(Sum of Squared Deviations / N)

Intuitively, this may sound like a redundant process. (In fact, some people will tell you that this is done purely to eliminate the sign on the negative numbers, which is nonsense.) But let’s have a look at what happens. The green dots in the first graph below are the absolute deviations of the grey dots, and the blue dots in the second graph are the squared deviations:

[Figure: Root-mean-square]

The dotted blue line at 5 is the mean of the squared deviations (this is known as the variance). The square root of that is the RMS deviation, lying just above 2. Here you can see why the calculation works: the larger values get amplified compared to the smaller ones when squared, “pulling up” the resulting root-mean-square.
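The three steps (square, average, take the root) can be sketched in a few lines of Python. The data here is again hypothetical, picked so that the mean of the squares comes out at 5 as in the description; `rms_deviation` is an illustrative name, not a library function.

```python
import math

def rms_deviation(values):
    """Square the deviations, average the squares, take the square root."""
    mean = sum(values) / len(values)
    squared = [(v - mean) ** 2 for v in values]
    variance = sum(squared) / len(values)  # mean of the squared deviations
    return squared, variance, math.sqrt(variance)

# Hypothetical values: both sets have a mean of zero and a mean deviation of 2.
steady = [-2, 2, -2, 2]
uneven = [-1, 3, -3, 1]

for data in (steady, uneven):
    squared, variance, rms = rms_deviation(data)
    print(squared, variance, round(rms, 2))
```

In the uneven set, a deviation of 3 contributes 9 to the sum of squares while a deviation of 1 contributes only 1, so the larger deviations dominate the mean of the squares and lift the final root.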

That’s mostly all there is to it, really. However, there’s one more twist to calculating standard deviation that is worth understanding.

The problem is that, usually, you don’t have data on a complete population, but only on a limited sample. For example, you may do a survey of 100 people and try to infer something about the population of a whole city. From your data, you can’t determine the true mean and the true standard deviation of the population, only the sample mean and an estimate of the standard deviation. The sample values will tend to deviate less from the sample mean than from the true mean, because the sample mean itself is derived from, and therefore “optimised” for, the sample. As a consequence, the RMS deviation of a sample tends to be smaller than the true standard deviation of the population. This means that even if you take more and more samples and average their RMS deviations, you will not eventually reach the true standard deviation.
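One way to see this shortfall without relying on random numbers is to take a tiny, made-up "population" and enumerate every possible sample from it. The sketch below assumes a hypothetical population of four values and samples of size 3 drawn with replacement; these choices are purely illustrative.

```python
import math
from itertools import product

def rms_deviation(values):
    """Root-mean-square deviation of the values from their own mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# Hypothetical "whole population"; its true standard deviation is sqrt(5).
population = [-3, -1, 1, 3]
true_sd = rms_deviation(population)

# Every possible sample of 3 values, drawn with replacement (4**3 = 64 samples).
samples = list(product(population, repeat=3))
average_rms = sum(rms_deviation(s) for s in samples) / len(samples)

print(round(true_sd, 3), round(average_rms, 3))
```

Even averaged over all 64 possible samples, the RMS deviation measured around each sample's own mean stays below the population's true standard deviation.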

It turns out that to get rid of this so-called bias, you need to multiply your estimate of the variance by N/(N-1). (This can be mathematically proven, but unfortunately I have not been able to find a nice, intuitive explanation for why this is the correct adjustment.)

For the final formula, this means that instead of taking a straightforward mean of the squared deviations, we sum them and divide by the sample size minus 1:

Estimated SD = √(Sum of Squared Deviations / (N - 1))
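The formula can be sketched directly in Python and checked against the standard library's `statistics.stdev`, which also divides by N - 1. The second half of the snippet repeats the enumerate-every-sample idea on a hypothetical four-value population (samples of size 3, drawn with replacement) to show that the corrected variance estimate averages out to the true variance. Note that this check is on the variance, which is what the N/(N-1) adjustment is applied to. `estimated_sd` and the data are assumptions made for the illustration.

```python
import math
from itertools import product
from statistics import stdev

def estimated_sd(sample):
    """Square root of: sum of squared deviations divided by (N - 1)."""
    mean = sum(sample) / len(sample)
    return math.sqrt(sum((v - mean) ** 2 for v in sample) / (len(sample) - 1))

# A hypothetical sample: the hand-written formula agrees with statistics.stdev.
sample = [-3, 1, 3]
print(math.isclose(estimated_sd(sample), stdev(sample)))

# Hypothetical population with a true variance of 5.
population = [-3, -1, 1, 3]
samples = list(product(population, repeat=3))  # all 64 samples of size 3

# Average the corrected variance estimate over every possible sample.
average_variance = sum(estimated_sd(s) ** 2 for s in samples) / len(samples)
print(round(average_variance, 6))
```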

You can see how this will give you a slightly higher estimate than a straight root-mean-square, and how the larger the sample size, the less significant this adjustment becomes.
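Since dividing by N - 1 instead of N is the same as multiplying the variance by N/(N-1), the estimated SD equals the plain RMS deviation multiplied by √(N/(N-1)). A small sketch of how quickly that factor approaches 1 (`correction_factor` is an illustrative name):

```python
import math

def correction_factor(n):
    """Ratio of the (N - 1) estimate to the plain RMS deviation."""
    return math.sqrt(n / (n - 1))

for n in (2, 5, 10, 100, 1000):
    print(n, round(correction_factor(n), 4))
```

For a sample of 2 the estimate is raised by roughly 41%, while for a sample of 1000 the difference is about 0.05%.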

23 comments:

Alex UK said...

Thank you. It's one of the best explanations of standard deviation I have read.
Can you follow it up with more topics - like what is used when mean and standard deviation are the same? (skewness and kurtosis). I was going through it in the last couple of months.
Also it might be interesting to explain the median in the same way and relation between mean and median.
Thanks again.

Anonymous said...

I don't understand how the second data set has more variation than the first one. The average distance to the mean is the same for both, so if you get a random piece of data then it would on average be 2 units from the mean for both data sets.

Amar said...

anonymous:
Imagine a curve that moves up and down over time but does so smoothly, like a sine curve. Now imagine a second curve that follows the same general path but constantly fluctuates up and down as it follows the other. How do you capture that "jaggedness" of the second curve? If the fluctuation is symmetrical, the mean deviation could be almost the same for the two curves. The standard deviation, however, would be higher for the more volatile curve.

Anonymous said...

thank you so much for the interesting post! probably a stupid question, but could it be that if i have two samples A and B, that A has a higher standard deviation than B, but B has a higher mean deviation than A?

prakash malaysia said...

Amar, I will echo what alex said, this is probably the best i've seen so far. I'm from a control background and reqd to teach many control classes. Since the objective of control is to reduce variability, i get this question of standard d very often and especially about 'why n-1'. I think what you did was a great job... and u're right, its nice to finally understand something and be able to use it without wondering why its the way it is... thank you.

Dr Jayesh said...

Hi,
I found that your opening statement echoes word to word my feeling on understanding abstract concepts & also explaining them to others. Congratulations also on the lucid explanation. May I invite you to read this web page & comment?
http://www.leeds.ac.uk/educol/documents/00003759.htm

Dr Jayesh said...

Amar,
I am sure you will like this...
http://betterexplained.com/articles/an-intuitive-guide-to-exponential-functions-e/

jb said...

Thank you very much, great help, I finally understand it! Jonathan

Ben said...

Good explanation. I would like to join in with alex uk and encourage you to write more explanations.

Mr. Tollic said...

Woohoo! I finally figured out why that n-1 thing is there... After my exam... Good explanation by the way, very down to earth.

Jennifer said...

Thank you! I have been trying to find a good visualization and no one else has done one. It would be even better animated I think. I am printing this out and taking it to my statistics class

PrincessTinkles said...

dude. ur awesome. ive been doodling on my paper trying to explain to myself why we use root-mean-square rather than standard mean. and then, i serendipitously stumbled on ur blog.

Anonymous said...

great post, really helped me understand SD vs mean deviation
thanks!

Anonymous said...

Hi Amar,
Thanks for your really nice explanation of SD. It helped me a lot...
Oscar

E. Scrubb said...

Great explanation. Thank you. The use of 9,1,40,and 8, in particular were very helpful.

kalid said...

Hi, I just wanted to say thanks for the clear, concise explanation. I had wondered why we take RMS also, and simply "removing the sign" wasn't the reason. I now see it as "punishing" the outliers proportionately more (3^2 = 9 vs 2^2 = 4) to show that one population is more volatile than another. Appreciate the post.

Anonymous said...

I still don't understand the need of N-1 there.

bsodmike said...

The standard deviation is the square root of this variance, where the formula is similar to that on Wikipedia where the variance is divided by N.

In statistics, given a set of N samples from an unknown distribution, the sum of squared deviations is divided by N-1 instead of N (equivalently, the variance is multiplied by N/(N-1)) to account for the fact that the samples are taken from an unknown distribution.

To Amar: Thank you for the concise explanation of standard deviation; it is by far one of the simplest explanations I have come across ~ very well explained. If only the authors of many math texts were capable of explaining things as eloquently!

Shruthi said...

It was of great help to me. In our MBA course we are made to study from lousy books which gives a rote explanation to this concept. Wish you would write a book on statistics. Thank You.

Diablo said...

Your blog is a good attempt in explaining standard deviation. All this mathematics does not make sense to a normal user. A normal user will think how he can apply this knowledge of standard deviation. One good way of looking at it is Chebyshev's inequality.

1) At least 75% of the values are within 2 standard deviations from the mean.

2) At least 89% of the values are within 3 standard deviations from the mean.

If I am designing something that should work for at least 75% of the people, I will have to design it to accommodate all values within 2 standard deviations!

Diablo said...

http://blog.ashwins.com/2009/04/standard-deviation.html

AGB said...

"The square root of that is the RMS deviation, lying just above 2. Here you can see why the calculation works: the larger values get amplified compared to the smaller ones when squared, “pulling up” the resulting root-mean-square."

Amplifying a value just because it lies further away from the mean does not make too much sense to me - it just looks like we are amplifying the noisier samples. In fact, if you want to amplify noise, then why stop at the square? Why not square the square, and so on?

gmarble said...

I agree with AGB - why the square was chosen to amplify the noise is not intuitive.