Frequency distributions (see Related topics) illustrate graphically how the values in the population of data are dispersed in the form of a shape. In order to use frequency distributions we need more information than just the shape.
For example, one important parameter is where the centre of the distribution is located, known as central tendency, i.e. the averages, notably the arithmetic mean, median and mode amongst others.
We also need to know the dispersion of the data, that is, how far the values are spread out from the mean (e.g. whether they are all closely clustered around the mean or widely scattered). The three most usual measures of dispersion are:
The standard deviation is essential for:
Standard deviation is usually denoted by the Greek symbol sigma (σ). The calculation of σ depends on the format of the data or variables, which can be divided into three categories:
The variance is: (sum of the squared deviations of the values from their mean) divided by (sample size). The standard deviation is the square root of the variance.
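In symbolic form this is: σ² = Σ(x − x̄)² ÷ n, so σ = √[Σ(x − x̄)² ÷ n], where x is each value, x̄ is the mean of the values and n is the sample size.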
Example: Calculate the standard deviation for the following ten lengths:
Values: 12, 9, 3, 10, 12, 22, 7, 11, 15 and 19cm.
Mean = 120 ÷ 10 = 12
Item | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sum |
---|---|---|---|---|---|---|---|---|---|---|---|
Value (cm) | 12 | 9 | 3 | 10 | 12 | 22 | 7 | 11 | 15 | 19 | 120 |
Deviation from mean (cm) | 0 | -3 | -9 | -2 | 0 | 10 | -5 | -1 | 3 | 7 | 0 |
Deviation squared (cm²) | 0 | 9 | 81 | 4 | 0 | 100 | 25 | 1 | 9 | 49 | 278 |
Sum of the deviations squared = 278
so the variance = 278 ÷ 10 = 27.8 cm²
and standard deviation = √27.8 = 5.27 cm
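The worked example above can be reproduced step by step in a few lines. This is a minimal sketch in Python, not part of the original text; the variable names are illustrative, and the divisor is the sample size n, as in the calculation above.

```python
import math
import statistics

# The ten lengths from the example, in cm.
lengths_cm = [12, 9, 3, 10, 12, 22, 7, 11, 15, 19]

n = len(lengths_cm)
mean_cm = sum(lengths_cm) / n                      # 120 / 10
deviations = [x - mean_cm for x in lengths_cm]     # value minus mean
sum_sq = sum(d ** 2 for d in deviations)           # sum of squared deviations
variance = sum_sq / n                              # divide by sample size, in cm squared
std_dev = math.sqrt(variance)                      # back to cm

# Cross-check against the standard library's population formulas,
# which also divide by n.
assert math.isclose(variance, statistics.pvariance(lengths_cm))
assert math.isclose(std_dev, statistics.pstdev(lengths_cm))

print(f"mean = {mean_cm} cm, variance = {variance:.1f} cm^2, s.d. = {std_dev:.2f} cm")
```

Note that `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n, so they would give slightly larger figures than the ones in this example.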
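Binomial: σ = √[p(1 − p) × n], where p is the proportion of observations showing the attribute of interest, n is the number of observations and σ is the absolute standard deviation (expressed as a number of observations)
also σ = √[p(1 − p) ÷ n], where p and n are as above and σ is the proportional standard deviation (expressed as a fraction of the observations)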
Example: An activity sampling study shows that the subject was observed to be working on 36 out of a total of 50 random observations made during the day. Estimate the probable proportion of the day the subject was actually working.
Using the second, proportional, formula:
p = 36 out of 50 = 0.72, (or 72%).
So, σ = √[0.72 × (1 − 0.72) ÷ 50] = √(0.2016 ÷ 50) = 0.063 or 6.3%
Therefore, our estimate of the proportion of a day the subject was working
= p ± standard deviation = 0.72 ± 0.063 or 72% ± 6.3%
i.e. somewhere between 65.7% and 78.3%.
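The two binomial formulas can be compared on the activity sampling figures. The following is a small sketch in Python; the function name `binomial_sd` is hypothetical and chosen only for this illustration.

```python
import math

def binomial_sd(successes, n):
    """Return (p, absolute s.d., proportional s.d.) for a binomial sample."""
    p = successes / n
    absolute = math.sqrt(p * (1 - p) * n)       # in number of observations
    proportional = math.sqrt(p * (1 - p) / n)   # as a fraction of observations
    return p, absolute, proportional

# 36 "working" observations out of 50, as in the example.
p, sd_abs, sd_prop = binomial_sd(36, 50)

# One standard deviation either side of the observed proportion.
low, high = p - sd_prop, p + sd_prop

print(f"p = {p:.2f}")
print(f"absolute s.d. = {sd_abs:.2f} observations")
print(f"proportional s.d. = {sd_prop:.3f}")
print(f"estimate: {low:.1%} to {high:.1%}")
```

The absolute figure is simply the proportional one multiplied by n, so either form can be used as long as it is applied to the matching quantity (a count of observations or a proportion).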
Note on significance: as we have taken only one standard deviation in the calculation, this result is only reliable to about 68%. In other words, we are only 68% confident that the true proportion for the whole day actually IS between 65.7% and 78.3%. To be more accurate we need to take a larger sample; to be more confident in the result, we need to take more standard deviations. Statistical tables tell us that for 95.4% confidence we must take 2 s.d. and for 99.7% confidence we must take 3 s.d.
So in the above calculations the estimates become, respectively:
95.4% confidence: estimated proportion = 0.72 ± (2 × 0.063) or 72% ± 12.6%
99.7% confidence: estimated proportion = 0.72 ± (3 × 0.063) or 72% ± 18.9%
It is clear that the more confident we wish to be that the result is reliable, the bigger the possible error. (What you gain on the swings you lose on the roundabouts).
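The confidence figures quoted from statistical tables can be checked directly, assuming (as the tables do) that the estimate is approximately normally distributed. The sketch below uses Python's `math.erf`; the helper names `coverage` and `interval` are hypothetical, and the standard deviation is the rounded value of 0.063 used in the text.

```python
import math

def coverage(k):
    """Fraction of a normal distribution lying within k s.d. of its mean."""
    return math.erf(k / math.sqrt(2))

def interval(p, sd, k):
    """Estimate p plus or minus k standard deviations."""
    half_width = k * sd
    return p - half_width, p + half_width

p, sd = 0.72, 0.063   # figures from the activity sampling example

for k in (1, 2, 3):
    low, high = interval(p, sd, k)
    print(f"{k} s.d.: confidence {coverage(k):.1%}, "
          f"estimate {low:.1%} to {high:.1%}")
```

Under this assumption, three standard deviations cover a little over 99.7% of the distribution, and the trade-off is visible in the output: each extra standard deviation raises the confidence but widens the interval by another 6.3 percentage points on each side.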
Where n is not known
Example. A company calculates that the mean number of orders placed per week is 400, but obviously it cannot know the number of orders not placed.
This is a case of the Poisson distribution, the standard deviation for which is simply: σ = √mean. So in this example, σ = √400 = 20 orders.
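The Poisson calculation is short enough to check in a couple of lines. This is an illustrative Python sketch; the function name `poisson_sd` is hypothetical, and the one-standard-deviation range is added only to show how the figure might be used.

```python
import math

def poisson_sd(mean):
    """Standard deviation of a Poisson distribution with the given mean."""
    return math.sqrt(mean)

mean_orders = 400                      # mean orders per week, from the example
sd_orders = poisson_sd(mean_orders)

# One standard deviation either side of the weekly mean.
low, high = mean_orders - sd_orders, mean_orders + sd_orders

print(f"s.d. = {sd_orders:.0f} orders; typical week: {low:.0f} to {high:.0f} orders")
```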
Each distribution (such as Beta, Gamma, exponential, Weibull among others) has its own particular standard deviation formula.
The standard deviation of all other types of data, such as continuous and discrete data, can be used similarly to assess errors on sample means. However, standard deviations must first be converted into standard errors - but that is another story!