You need to decide what statistics you will use and what summary information will enable you to answer your survey question.
For example:
- Will you find the mean or median or both?
- Will you exclude extreme values?
When describing a distribution, statisticians usually comment on its centre, spread and shape.
a) Measures of centre
The centre of a set of data is important. Often you want to know which value occurs most commonly or the average of a set of values.
Numerical data
Mean: The arithmetic mean is calculated by finding the sum of all values in the data set and then dividing it by the number of values in the set. The mean is often called the average.
Median: The median is the middle value of a dataset when all the values are arranged from smallest to largest number. If a dataset contains an even number of values, the median is taken as the mean of the middle two. The median is the preferred measure of centre when the data is not symmetric or contains outliers.
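The two definitions above can be sketched in a few lines of Python using the standard `statistics` module. The values are made up for illustration; they show how a single extreme value pulls the mean upward while the median stays near the bulk of the data.

```python
from statistics import mean, median

# Made-up values for illustration only; 40 is an extreme value.
values = [2, 3, 3, 4, 5, 6, 40]

mean_value = mean(values)      # sum of the values divided by how many there are
median_value = median(values)  # middle value once the data is sorted

# With an even number of values, the median is the mean of the middle two.
even_values = [1, 3, 5, 7]
even_median = median(even_values)  # (3 + 5) / 2

print("mean:", mean_value)
print("median:", median_value)
print("median of even-sized set:", even_median)
```

Here the mean (9) is larger than six of the seven values, which is why the median (4) is usually the better summary when outliers are present.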
Numerical and categorical data – Figure 14
Mode: The mode is the value that occurs most often in a data set.
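As a minimal sketch, the mode can be found for both numerical and categorical data with Python's `statistics` module. The shoe sizes and colours below are invented examples. `multimode` (available from Python 3.8) is included as one possible way to handle a tie, which the definition above does not cover.

```python
from statistics import mode, multimode

# Invented example data: one numerical set, one categorical set.
shoe_sizes = [7, 8, 8, 9, 8, 10, 9]
colours = ["red", "blue", "red", "green", "red"]

numeric_mode = mode(shoe_sizes)   # the most frequent number
category_mode = mode(colours)     # the most frequent category

# If two values are equally common, multimode returns all of them.
tied_modes = multimode([1, 1, 2, 2, 3])

print(numeric_mode, category_mode, tied_modes)
```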
b) Measures of spread
The spread of a data set, when combined with its centre, will give a more complete picture.
Range
The range is the distance from the minimum value to the maximum value in the data set.
Range = maximum – minimum
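In code, the range formula is a one-line calculation. This sketch uses made-up values:

```python
# Made-up values for illustration only.
values = [12, 7, 3, 15, 9]

# Range = maximum - minimum
data_range = max(values) - min(values)

print("range:", data_range)
```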
Interquartile range – Figure 15
Quartiles: As the name suggests, quartiles divide data into four equal sets.
When observations are placed in ascending order according to their value, the first or lower quartile is the value of the observation at or below which one-quarter (25%) of observations lie. The second quartile is the median at or below which half (50%) of observations lie. The third or upper quartile is the value of the observation at or below which three-quarters (75%) of the observations lie.
Another way to think about this: the median divides the data into two equal halves. The lower quartile is the middle value of the lower half, and the upper quartile is the middle value of the upper half. The difference between the upper and lower quartiles (Q_{3} - Q_{1}) also indicates the spread of a data set. This is called the interquartile range (IQR).
Interquartile Range = Upper quartile - Lower quartile
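The "middle of each half" description can be sketched as follows. The function name `quartiles` and the data are assumptions made for illustration. Note that software packages use several slightly different quartile conventions, so other tools may give slightly different answers for the same data.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) by taking the median of each half of the data.

    Hypothetical helper: needs at least two values. For an odd number of
    values, the middle value is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[len(ordered) - half:]
    return median(lower_half), median(ordered), median(upper_half)

# Made-up values for illustration only.
data = [1, 3, 4, 6, 7, 8, 10, 12]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1  # Interquartile Range = Upper quartile - Lower quartile

print("Q1:", q1, "median:", q2, "Q3:", q3, "IQR:", iqr)
```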
The interquartile range spans the middle 50% of a data set and is not influenced by outliers because, in effect, the highest and lowest quarters of the data are removed.
Calculating the IQR will help you to identify potential outliers in the data. Any value above Q_{3} + 1.5 × IQR (upper fence) or below Q_{1} - 1.5 × IQR (lower fence) should be investigated to decide whether or not it needs to be excluded from the data before it is analysed further.
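A minimal sketch of the fence calculation is shown here. The helper name `fences` and the data are invented for illustration, and the quartile values are assumed to have been worked out already (here by taking the median of each half of the sorted data).

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up values for illustration only.
values = [2, 4, 5, 5, 6, 7, 8, 30]

# Assumed quartiles of `values`: the median of [2, 4, 5, 5] and of [6, 7, 8, 30].
q1, q3 = 4.5, 7.5

lower_fence, upper_fence = fences(q1, q3)

# Values outside the fences are only *potential* outliers to investigate.
flagged = [v for v in values if v < lower_fence or v > upper_fence]

print("fences:", lower_fence, upper_fence)
print("values to investigate:", flagged)
```

The code only flags values; deciding whether a flagged value is a mistake or a genuine extreme still requires checking the original data.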
Outliers
A potential outlier might be a mistake or an extreme value so you need to check the original data to determine if it is to be discarded or retained. However, cleaning data by discarding extreme values might give you a false view of variations that can be found in datasets. In turn, this could result in inaccurate modelling based on the data and false conclusions.
The standard deviation – Figure 16
The standard deviation measures how far, on average, the values in a data set lie from the mean (strictly, it is the square root of the mean squared distance from the mean). A data set with a higher standard deviation will be more spread out than one with a lower standard deviation. Compare the following two data sets:
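As a small illustration (the values below are invented and are not the data from Figure 16), two data sets can share the same mean yet have very different standard deviations. The sketch uses `statistics.pstdev`, which treats the data as a whole population; `statistics.stdev` would be the choice if the data were a sample.

```python
from statistics import mean, pstdev

# Invented data sets with the same mean but different spread.
tight = [9, 10, 10, 10, 11]
wide = [2, 6, 10, 14, 18]

tight_mean, wide_mean = mean(tight), mean(wide)
tight_sd, wide_sd = pstdev(tight), pstdev(wide)

print("means:", tight_mean, wide_mean)
print("standard deviations:", round(tight_sd, 2), round(wide_sd, 2))
```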
c) Shape
The normal distribution – Figure 17
Data which has a normal distribution is characterised by a symmetric, single-peaked curve. It is also known as the bell-shaped distribution.
You would expect data such as the height of Year 9 students or the length of cane toads to show a normal distribution. A normal distribution results from most of the data clustering around the mean, with less data occurring further away from the mean.
Bimodal – Figure 18
Bimodal data usually comes from two distinct populations. The bar chart of Year 5 girls' and Year 10 boys' heights shows two distinct modal groups.
Skew – Figure 19
Some non-symmetric data distributions can be described as skewed.
Data which is positively skewed typically has a cluster of lower end values and a taper of higher end values. Data which is negatively skewed typically has a cluster of higher end values and a taper of lower end values.
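One rough way to check the direction of skew is to compare the mean with the median: a long taper of high values tends to pull the mean above the median, and a long taper of low values tends to pull it below. The sketch below uses invented data and a hypothetical helper, `skew_direction`. It is a rule of thumb only and can be misleading for some data sets, so a plot of the distribution remains the better guide.

```python
from statistics import mean, median

def skew_direction(values):
    """Guess the skew direction by comparing mean and median (rule of thumb)."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "roughly symmetric"

# Invented data: a cluster of low values with a taper of high values...
right_tailed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]
# ...and a cluster of high values with a taper of low values.
left_tailed = [2, 11, 15, 16, 17, 17, 17, 18, 18, 19]

right_result = skew_direction(right_tailed)
left_result = skew_direction(left_tailed)

print(right_result, left_result)
```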
You could expect data such as total weekly income to show a positive skew. Fewer people are on a high weekly income than on low or medium incomes, so the distribution tapers towards the higher values. Another example of data you could expect to show a positive skew is the number of hours spent on Facebook by age. It's likely that the amount of time spent on Facebook will peak in the late teens to twenties and then taper off as a person's age increases.