Australian Bureau of Statistics

Education Services
 


Teacher Statistical Literacy


Concepts and definitions


    Statistics

    Statistics are numerical data that have been organised to serve a useful purpose. A major role of the ABS is to provide the Australian community with statistics that will help them make informed decisions. Statistical information provided by the ABS is used widely in Australia by governments, business people, researchers, members of the public, teachers and students.

    Data
    Data are observations or facts which, when collected, organised and evaluated, become information or knowledge.

    Data item
    A data item is the smallest piece of information that can be obtained from a survey or census.

    Dataset
    A dataset is data collected for a particular study. A dataset represents a collection of elements; and for each element, information on one or more characteristics is included.

    Outliers
    An outlier is an extreme value of the data. It is an observation value that is significantly different from the rest of the data. There may be more than one outlier in a set of data.
    Sometimes, outliers are significant pieces of data and should not be ignored. In other instances, they occur as a result of an error or misinformation and should be ignored. The decision to include or exclude an outlier needs to be clearly justified when discussing results.

    Example:
    The weights (in kilograms) of 30 students were measured and recorded in the stem and leaf plot shown in Figure 1. In this case, the stems are the whole-number values and the leaves are the decimal values. The outliers are 56.3 and 65.7.

    Stem | Leaf
    56   | 3
    57   |
    58   | 4 4 9
    59   | 0 0 2 3 8
    60   | 0 2 4 5 7 8 9
    61   | 1 2 4 4 5 6 7 9 9
    62   | 1 2 3 7
    63   |
    64   |
    65   | 7

    Fig 1 Stem and leaf plot

    Variables

    A variable is any measurable characteristic or attribute that can have different values for different subjects. Height, age, amount of income, country of birth, grades obtained at school and type of housing are examples of variables.

    Observation
    An observation is a single piece of data about a variable.

    Independent variable
    An independent variable is the variable whose values are independent of changes in the values of other variables. It is the variable deliberately controlled or changed to assess changes in the dependent variable.

    Dependent variable
    A dependent variable depends on the independent variable.

    Categorical variables
    Nominal variable
    A nominal variable describes a name or category. For example, for the variable 'method of travel to school' all its values are words such as bus, walk, car and tram. Nominal variables are often referred to as categorical variables.

    Ordinal variable
    An ordinal variable describes categories that have a natural order and are often represented by numbers. School year levels are an example.

    Numerical variables
    A numerical variable is one that describes a numerically measured value. Numerical variables can be either discrete or continuous.

    Continuous variable
    A continuous variable is a numeric variable that can take any value within a certain range. For example, distance, age and temperature are continuous variables.

    Discrete variable
    A discrete variable can only take a finite number of values within a certain range. An example of a discrete variable is the number of children in a family: a family can have 0, 1, 2 or 3 children, but not 2.5.

    Class interval
    A class interval is a group of data values for a variable. The intervals are generally the same size – for example, 4-6, 7-9 and 10-12. However, the intervals may have different sizes such as 4-6, 7-9 and 10-14. The boundaries of class intervals must not overlap so that each observation can be allocated to only one interval.
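
    The allocation rule can be illustrated with a minimal Python sketch. It assumes whole-number observations and uses the intervals 4-6, 7-9 and 10-12 from the text; the observations and the helper name find_interval are hypothetical and chosen only for illustration.

```python
# Class intervals from the text, as inclusive (lower, upper) bounds.
intervals = [(4, 6), (7, 9), (10, 12)]


def find_interval(value, intervals):
    """Return the single interval that contains value, or None if none does."""
    matches = [(lo, hi) for lo, hi in intervals if lo <= value <= hi]
    # Non-overlapping boundaries mean a value can match at most one interval.
    assert len(matches) <= 1, "class intervals must not overlap"
    return matches[0] if matches else None


# Hypothetical observations, used only to illustrate the allocation.
observations = [4, 5, 7, 9, 10, 12, 8]

# Count how many observations fall into each class interval.
counts = {interval: 0 for interval in intervals}
for value in observations:
    interval = find_interval(value, intervals)
    if interval is not None:
        counts[interval] += 1

for (lo, hi), count in counts.items():
    print(f"{lo}-{hi}: {count}")
```

    If two intervals shared a boundary (for example 4-6 and 6-9), the value 6 would match both and the check inside find_interval would fail, which is the situation the definition rules out.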








    Graphs and displays

    Graph
    A graph is a diagram representing a system of connections or interrelations among two or more variables by a number of distinctive dots, lines, bars, etc.

    Chart
    A chart is a visual representation of data. Bar charts, line charts and pie charts are common examples.

    Box and whisker plots (often called ‘box plots’) can be used to show the interquartile range. Figure 1 shows a box and whisker plot of student ages.
    Notice that a scale is drawn underneath. Box plots can be drawn horizontally or vertically.

    Frequency distribution tables can be used for nominal and numeric variables.

    Example:
    Twenty people were asked how many cars were registered to their households. The results were recorded as follows: 1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0. This data can be presented in a frequency distribution table – see Figure 2.
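
    The counting step behind the frequency distribution table can be sketched in Python as follows. This is an illustration only; the variable names are assumptions, and the data are the twenty responses listed in the example.

```python
from collections import Counter

# Number of cars registered to each of the twenty households in the example.
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

# Counter maps each distinct value (x) to the number of times it occurs (f).
frequency = Counter(cars)

print("Number of cars (x)  Frequency (f)")
for value in sorted(frequency):
    print(f"{value:<20}{frequency[value]}")
```

    The frequencies add up to 20, the number of people surveyed, which is a quick check that no response has been missed.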

    Stem and leaf plots are a convenient way to organise data. Each observation value is considered to consist of two parts - a stem and a leaf.

    • the stem is the first digit or digits
    • the leaf is the final digit

    Example:
    The number of books ten students read in one year were as follows: 12, 23, 19, 6, 10, 7, 15, 25, 21, 12.
    In ascending order, these are: 6, 7, 10, 12, 12, 15, 19, 21, 23, 25. Figure 3 is a stem and leaf plot of this data.

    In the stem and leaf plot (fig 3):

    • the stem '0' represents the class interval 0-9
    • the stem '1' represents the class interval 10-19
    • the stem '2' represents the class interval 20-29.

    If there are a large number of observations for each stem, the stem can be split in two. For example the interval 0-9 could be split into intervals 0-4 and 5-9. The stem would then be written as 0(0) and 0(5).
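
    A stem and leaf plot of the books example can be built with a short Python sketch. It assumes non-negative whole numbers, takes the final digit as the leaf and the remaining digits as the stem, and does not split stems; the function name stem_and_leaf is hypothetical.

```python
from collections import defaultdict

# Number of books read by ten students in one year (from the example).
books = [12, 23, 19, 6, 10, 7, 15, 25, 21, 12]


def stem_and_leaf(values):
    """Group each value's final digit (leaf) under its leading digits (stem)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 23 -> stem 2, leaf 3
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(books)

# Print every stem in the range, including any stem that has no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```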

    Time series
    A time series is a collection of observations of well-defined data items obtained through repeated measurements over time. For example, measuring the value of retail sales each month of the year would comprise a time series.

    Trend
    The ABS defines a trend as the long-term movement in a time series without calendar-related and irregular effects. It reflects the underlying change in that measure and is the result of influences such as population growth, price inflation and general economic changes.

    Fig 1 Box and whisker plot

    Number of cars (x)    Tally       Frequency (f)
    0                     ||||        4
    1                     ||||| |     6
    2                     |||||       5
    3                     |||         3
    4                     ||          2

    Fig 2 Frequency distribution table

    Stem | Leaf
    0    | 6 7
    1    | 0 2 2 5 9
    2    | 1 3 5

    Fig 3 Stem and leaf plot

    Summary statistics

    Mean
    The mean of a numeric variable is calculated by adding together the values of all observations in a dataset and then dividing by the number of observations in the set. It is often referred to as the average. Thus:
    Mean = sum of all the observations / number of observations

    For example, find the mean of these numbers 5, 3, 4, 5, 7, 6.

    Mean = (5 + 3 + 4 + 5 + 7 + 6) / 6
    = 30 / 6
    = 5

    Notice that the value of every member of the dataset is used to calculate the mean.
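
    The same calculation can be written as a brief Python sketch, shown here only as an illustration. It computes the mean directly from the definition and compares it with the mean function in Python's standard statistics module; the variable names are assumptions.

```python
from statistics import mean

# The six numbers from the worked example.
numbers = [5, 3, 4, 5, 7, 6]

# Mean = sum of all the observations / number of observations.
manual_mean = sum(numbers) / len(numbers)

# The standard library gives the same result.
library_mean = mean(numbers)

print(manual_mean, library_mean)
```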

    Median
    The median is the middle value of a dataset after the data have been placed in ascending order. If the number of observations is odd, the median is the single middle value; if it is even, the median is the mean of the middle two values.

    For example, dataset A contains 3, 7, 1, 9, 2, 5, 9. Rearranged in ascending order it becomes: 1, 2, 3, 5, 7, 9, 9. The middle number is 5, so the median is 5.
    Dataset B contains 1, 3, 4, 5, 10, 12, 13, and also has a median of 5 although the values of the data vary considerably.

    The position of the median can also be found by using the formula (n + 1) / 2, where n is the number of values in a set of ordered data.

    For dataset A: n = 7

    So the position of the median = (7 + 1) / 2
    = 8 / 2
    = 4

    The median is the fourth number, which has a value of 5.

    The above example is for an odd number of observations, i.e. n = 7. However, an extra step is necessary when the number of observations is even.
    For example, if n = 8 then
    the position of the median = (8 + 1) / 2
    = 9 / 2
    = 4.5
    This means that the position of the median lies between the fourth and fifth observations. To find the value of the median, add together the fourth and fifth observations and divide by two. For example, if the dataset is 1, 1, 4, 4, 8, 9, 9, 10, then the median is (4 + 8) / 2 = 6.

    The median value is decided by its location in the ordered dataset and not because of its actual value. Notice that the values of the other members of the dataset are not taken into consideration, only their position. There are as many values above the median as there are below.
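
    The position rule can be sketched in Python as below. This is a minimal illustration, not a definitive implementation: the function name median_by_position is hypothetical, and the result is checked against the median function in Python's standard statistics module.

```python
from statistics import median


def median_by_position(values):
    """Find the median using the position formula (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position in the ordered data
    if position.is_integer():
        # Odd n: the median is the value at that position.
        return ordered[int(position) - 1]
    # Even n: the position falls between two observations, so average them.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


dataset_a = [3, 7, 1, 9, 2, 5, 9]
dataset_b = [1, 3, 4, 5, 10, 12, 13]
even_dataset = [1, 1, 4, 4, 8, 9, 9, 10]

for data in (dataset_a, dataset_b, even_dataset):
    print(sorted(data), median_by_position(data), median(data))
```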

    The median is usually calculated for numeric variables but may also be calculated for an ordinal variable.

    Mode
    The mode is the most frequently observed value in a dataset. Mode is the only measure you can use when the data is categorical and has no order – for example, place of birth, favourite colour and hair colour. As the dataset is not numbers, you cannot add and divide, so you cannot find a mean. The dataset cannot be sorted from smallest to largest so you cannot find the middle value and median. The mode does not necessarily give an indication of a dataset’s centre. A set of data can have more than one mode (see Figure 1).

    For example, a group of friends in Year 10 have the following hair colours: red, brown, blonde, black, blonde, black, brown, brown, black, blonde, brown, brown, black.

    Hair colour    Frequency
    Red            1
    Brown          5
    Black          4
    Blonde         3

    The most common hair colour is brown so the mode is brown.
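
    Because the mode only needs counts, it can be found for this categorical data with a short Python sketch. The variable names are assumptions; Counter and multimode come from Python's standard library, and multimode returns every value that ties for the highest frequency, so it also handles data with more than one mode.

```python
from collections import Counter
from statistics import multimode

# Hair colours of the group of friends in the example.
hair = ["red", "brown", "blonde", "black", "blonde", "black", "brown",
        "brown", "black", "blonde", "brown", "brown", "black"]

# Frequency of each hair colour.
counts = Counter(hair)

# All values that share the highest frequency (a list, in case of ties).
modes = multimode(hair)

print(dict(counts))
print(modes)
```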

    Range
    The range is the actual spread of data including any outliers. It is the difference between the highest and lowest observation.
    Range = maximum value – minimum value
    For the following dataset of students' ages: 17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15

    Maximum value = 17
    Minimum value = 12

    Range = maximum value – minimum value
    = 17 – 12
    = 5

    The range of the students' ages is 5 years.
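
    As a quick illustration, the range calculation is a one-line subtraction in Python; the variable names here are assumptions.

```python
# Students' ages from the example.
ages = [17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15]

# Range = maximum value - minimum value.
data_range = max(ages) - min(ages)

print(max(ages), min(ages), data_range)
```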

    Quartiles
    Quartiles divide data into four equal groups. Using the example of 15 students above, we have the following ordered dataset: 12, 12, 13, 13, 14, 14, 15, 15, 15, 16, 16, 16, 17, 17, 17. We can divide this set into four equal sized groups with each group containing one quarter of the data:

    • The first quartile (Q1) is the value that 25% of the data is below.
    • The second quartile (Q2) is the value that 50% of the data is below. This is the same as the median.
    • The third quartile (Q3) is the value that 75% of the data is below.
    In the example:
    Q1 = 13
    Q2 = 15
    Q3 = 16
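
    The text does not say which rule was used to locate the quartiles, and several conventions exist. The Python sketch below assumes one common convention: Q2 is the median of the whole dataset, and Q1 and Q3 are the medians of the lower and upper halves, leaving out the middle value when the number of observations is odd. Under that assumption it reproduces the values in the example. The function name quartiles is hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention (an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd number of observations, skip the middle value itself.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


# Students' ages from the example.
ages = [17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15]

q1, q2, q3 = quartiles(ages)
print(q1, q2, q3)
```

    Other conventions can give slightly different quartiles for the same data, particularly for small datasets.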

    Interquartile range

    The interquartile range refers to the middle 50% of the data. Put another way, it is the difference between the upper quartile (75%) and the lower quartile (25%). The interquartile range is an indicator of the spread of the data. It reduces the influence of outliers, since the highest and lowest quarters are removed. The interquartile range is found by subtracting Q1 from Q3.

    Five number summary (quartiles)
    This is a useful way to summarise data. It consists of:

    • the lowest value
    • the first quartile (Q1)
    • the second quartile (Q2)
    • the third quartile (Q3)
    • the highest value.
    The range can be found from the difference between the highest and lowest value. The median is the second quartile (Q2) and the interquartile range is the difference between the third and first quartiles (Q3 – Q1).
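
    A five number summary of the students' ages, together with the range and interquartile range derived from it, can be sketched in Python as follows. The sketch uses quantiles from the standard statistics module with its default method; that choice is an assumption, it happens to agree with the quartiles quoted for this dataset, and other methods may differ on other data. The dictionary keys are illustrative names only.

```python
from statistics import quantiles

# Students' ages from the example.
ages = [17, 15, 14, 16, 14, 15, 16, 12, 17, 13, 12, 17, 13, 16, 15]

# n=4 splits the data into four groups and returns the three cut points.
q1, q2, q3 = quantiles(ages, n=4)

summary = {
    "lowest": min(ages),
    "Q1": q1,
    "Q2 (median)": q2,
    "Q3": q3,
    "highest": max(ages),
}

# Range and interquartile range follow directly from the summary.
data_range = summary["highest"] - summary["lowest"]
iqr = q3 - q1

print(summary)
print("range:", data_range, "interquartile range:", iqr)
```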

    Standard deviation
    Standard deviation (s) is the measure of spread most commonly used when the mean is the measure of centre. Standard deviation is most useful for symmetric distributions with no outliers.
    The standard deviation for a discrete variable made up of n observations is the positive square root of the variance as shown in Figure 3.
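
    The formula image itself is not reproduced in this text, so the divisor it uses is not known here. The Python sketch below therefore shows both common conventions as assumptions: dividing the sum of squared deviations by n (population form) and by n - 1 (sample form). The data are the six numbers from the mean example, the variable names are illustrative, and the results are compared with pstdev and stdev from the standard statistics module.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# The six numbers used in the mean example (mean = 5).
numbers = [5, 3, 4, 5, 7, 6]

centre = mean(numbers)
n = len(numbers)

# Squared distance of each observation from the mean.
squared_deviations = [(x - centre) ** 2 for x in numbers]

# Variance is the averaged squared deviation; standard deviation is its
# positive square root.
population_sd = sqrt(sum(squared_deviations) / n)      # divisor n
sample_sd = sqrt(sum(squared_deviations) / (n - 1))    # divisor n - 1

print(population_sd, pstdev(numbers))
print(sample_sd, stdev(numbers))
```

    The two results differ noticeably for a dataset this small, so it matters which convention a given formula uses.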
























    Fig 1 Unimodal, bimodal and multimodal distributions

    Fig 2 Quartiles

    Fig 3 Standard deviation formula




    Commonwealth of Australia 2008

    Unless otherwise noted, content on this website is licensed under a Creative Commons Attribution 2.5 Australia Licence together with any terms, conditions and exclusions as set out in the website Copyright notice. For permission to do anything beyond the scope of this licence and copyright terms contact us.