EXPLANATORY NOTES - SOCIOECONOMIC FACTORS AND STUDENT ACHIEVEMENT
DATA SOURCES
1 This publication uses Queensland school enrolments data from 2006 to 2013, National Assessment Program – Literacy and Numeracy (NAPLAN) data for Queensland from 2011 and data from the 2011 Census of Population and Housing (the Census).
School Enrolments data
2 The school enrolment data used in this publication was collected by the Queensland Department of Education, Training and Employment. The statistics are compiled from student level data, excluding name, for each child enrolled in a government school in Queensland at any time between 2006 and 2013 inclusive. This dataset contained records for 966,242 students.
National Assessment Program – Literacy and Numeracy
3 NAPLAN is a skills test that provides nationally comparable data on the literacy and numeracy performance of students at the time of the test. Each year, all students in Years 3, 5, 7, and 9 participate in tests covering reading, writing, language conventions and numeracy. The Australian Curriculum, Assessment and Reporting Authority (ACARA) manages the development of, and oversees the delivery of, tests for NAPLAN. Administration of the tests, including marking, is managed by the Test Administration Authority in each state or territory.
4 This publication uses NAPLAN results in Queensland for 2011. The statistics are compiled from student unit record level data, excluding name and address, for each child participating in the tests in government schools in Queensland. The NAPLAN data utilised in this study were provided to the ABS by ACARA. This dataset contained records for 156,160 students.
Census of Population and Housing
5 The Census is undertaken by the Australian Bureau of Statistics every five years, and is collected under the authority of the Census and Statistics Act 1905. For information about the 2011 Census please refer to the information provided on the Census 2011 Reference and Information section of the ABS website. Information about the data quality of the Census is also available on the ABS website under Census Data Quality.
SCOPE
6 The scope of this phase of the Measuring Educational Outcomes over the Life-course project utilising the Queensland school enrolments, NAPLAN and Census integrated dataset is restricted to people with a 2011 Census of Population and Housing record, who had a Queensland government school enrolment record in 2006 to 2013.
DATA INTEGRATION
7 Statistical data integration involves combining information from different administrative and/or survey sources to provide new datasets for statistical and research purposes. Further information on data integration is available on the National Statistical Service website – Data Integration.
8 Data linking is a key part of statistical data integration and involves the technical process of combining records from different source datasets using variables that are shared between the sources. Data linkage is typically performed on records that represent individual persons, rather than aggregates. The most common methods link records on exact matches for common variables ('deterministic' linkage), or close matches ranked by probabilities that the variables used will result in a true match ('probabilistic' linkage).
Creating a longitudinal school enrolments dataset from the school enrolments data
9 The creation of a longitudinal school enrolments dataset was carried out using the Student ID.
10 The Student ID is a unique identifier assigned to each individual student enrolled in the Queensland government school education system, from pre-year 1 to senior secondary.
11 Records were linked using Student ID across the multiple years of school enrolment records. Only information required for linking to Census was retained on the longitudinal school enrolments dataset.
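The following is a minimal sketch of how yearly enrolment files might be combined into one longitudinal record per student. The field names (`student_id`, `dob`, `sex`, `area`) and the sample values are hypothetical and are not drawn from the actual dataset.

```python
def build_longitudinal(yearly_files):
    """Combine yearly enrolment files into one record per Student ID."""
    students = {}
    for year in sorted(yearly_files):
        for rec in yearly_files[year]:
            entry = students.setdefault(
                rec["student_id"],
                {"dob": rec["dob"], "sex": rec["sex"], "area_by_year": {}},
            )
            # Keep the geographic code for each year the student was enrolled.
            entry["area_by_year"][year] = rec["area"]
    return students


# Hypothetical sample data, for illustration only.
yearly_files = {
    2010: [{"student_id": "S1", "dob": "2002-03-14", "sex": "F", "area": "A01"}],
    2011: [
        {"student_id": "S1", "dob": "2002-03-14", "sex": "F", "area": "A02"},
        {"student_id": "S2", "dob": "2001-07-02", "sex": "M", "area": "A05"},
    ],
}
longitudinal = build_longitudinal(yearly_files)
print(longitudinal["S1"]["area_by_year"])  # {2010: 'A01', 2011: 'A02'}
```

Only the variables needed for later linking are carried forward, which mirrors the approach described above.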
Linkage between the longitudinal school enrolments, Census and NAPLAN data
12 The Queensland longitudinal school enrolments dataset was linked to the 2011 Census of Population and Housing data using a deterministic linkage methodology that requires exact matches between variables common to both datasets. It is considered a "bronze" standard linkage because name and address information were not used in the linkage process. Data was linked using date of birth, sex, and codes representing small geographic areas.
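As an illustration only, a deterministic linkage of this kind could be sketched as follows. The record layout, field names and the rule of accepting only one-to-one matches are assumptions made for the example; they are not a description of the ABS production process.

```python
from collections import defaultdict


def deterministic_link(enrolments, census, keys=("dob", "sex", "area")):
    """Link records that agree exactly on every key, keeping one-to-one matches."""

    def index(records):
        buckets = defaultdict(list)
        for rec in records:
            values = tuple(rec.get(k) for k in keys)
            # Records with a missing linking value cannot be matched.
            if all(v is not None for v in values):
                buckets[values].append(rec)
        return buckets

    enrol_idx, census_idx = index(enrolments), index(census)
    links = []
    for values, enrol_recs in enrol_idx.items():
        census_recs = census_idx.get(values, [])
        # Accept the link only when the key is unique on both files.
        if len(enrol_recs) == 1 and len(census_recs) == 1:
            links.append((enrol_recs[0]["student_id"], census_recs[0]["census_id"]))
    return links


# Hypothetical sample data, for illustration only.
enrolments = [
    {"student_id": "S1", "dob": "2002-03-14", "sex": "F", "area": "A01"},
    {"student_id": "S2", "dob": "2002-03-14", "sex": "M", "area": "A01"},
    {"student_id": "S3", "dob": "2001-07-02", "sex": "M", "area": "A02"},
    {"student_id": "S4", "dob": None, "sex": "F", "area": "A02"},
]
census = [
    {"census_id": "C1", "dob": "2002-03-14", "sex": "F", "area": "A01"},
    {"census_id": "C2", "dob": "2001-07-02", "sex": "M", "area": "A02"},
    {"census_id": "C3", "dob": "2001-07-02", "sex": "M", "area": "A02"},
]
links = deterministic_link(enrolments, census)
print(links)  # [('S1', 'C1')]
```

In this example S2 has no Census counterpart, S3 has two candidate Census records and S4 has a missing date of birth, so only S1 is linked.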
13 As the school enrolments data was for multiple years, data was linked to the Census using geographic codes representing place of usual residence, place of usual residence one year ago, and place of usual residence five years ago.
14 The NAPLAN data did not include a Student ID that could be used to link the NAPLAN data longitudinally, or to link the NAPLAN data to the school enrolments data. The 2011 NAPLAN data was therefore linked to the 2011 school enrolments data using a bronze deterministic linkage methodology. Data was linked using date of birth, sex and School ID.
15 The Student ID obtained from the previous step was then used to link the 2011 NAPLAN data to the integrated school enrolments and 2011 Census dataset.
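The two-stage process described in the preceding paragraphs might be sketched as below. The identifiers (`naplan_id`, `school_id`, `census_id`), the lookup `student_to_census` and the rule of keeping only unambiguous matches are hypothetical choices made for illustration.

```python
from collections import Counter


def link_key(rec):
    return (rec["dob"], rec["sex"], rec["school_id"])


def attach_student_id(naplan, enrolments_2011):
    """Stage 1: find a Student ID for each NAPLAN record via date of birth, sex and School ID."""
    enrol_by_key = {}
    for rec in enrolments_2011:
        enrol_by_key.setdefault(link_key(rec), []).append(rec["student_id"])
    naplan_key_counts = Counter(link_key(rec) for rec in naplan)
    matched = {}
    for rec in naplan:
        candidates = enrol_by_key.get(link_key(rec), [])
        # Keep the match only if the key is unique on both files.
        if len(candidates) == 1 and naplan_key_counts[link_key(rec)] == 1:
            matched[rec["naplan_id"]] = candidates[0]
    return matched


def link_naplan_to_census(naplan, enrolments_2011, student_to_census):
    """Stage 2: follow the Student ID through to the already linked Census record."""
    student_ids = attach_student_id(naplan, enrolments_2011)
    return {
        naplan_id: student_to_census[student_id]
        for naplan_id, student_id in student_ids.items()
        if student_id in student_to_census
    }


# Hypothetical sample data, for illustration only.
naplan = [
    {"naplan_id": "N1", "dob": "2002-03-14", "sex": "F", "school_id": "Q100"},
    {"naplan_id": "N2", "dob": "2002-03-14", "sex": "M", "school_id": "Q100"},
    {"naplan_id": "N3", "dob": "2000-01-01", "sex": "M", "school_id": "Q200"},
]
enrolments_2011 = [
    {"student_id": "S1", "dob": "2002-03-14", "sex": "F", "school_id": "Q100"},
    {"student_id": "S2", "dob": "2002-03-14", "sex": "M", "school_id": "Q100"},
    {"student_id": "S3", "dob": "2002-03-14", "sex": "M", "school_id": "Q100"},
]
student_to_census = {"S1": "C1"}  # output of the enrolments-to-Census linkage
naplan_links = link_naplan_to_census(naplan, enrolments_2011, student_to_census)
print(naplan_links)  # {'N1': 'C1'}
```

Here N2 is ambiguous because two enrolled students share the same date of birth, sex and school, and N3 has no matching enrolment record.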
16 Information about linkage methodologies used in similar studies can be found in Research Paper: Assessing the Quality of Linking School Enrolment Records to 2011 Census Data: Deterministic Linkage Methods (cat. no. 1351.0.55.045) and Research Paper: Assessing the Quality of Different Data Linking Methodologies Across Time, Using Tasmanian Government School Enrolment Data (cat. no. 1351.0.55.047).
Linkage results
17 At the completion of the linkage process, 101,979 (65%) of the 156,160 records from the NAPLAN dataset were linked to the 2011 Census data. While the linkage rate is slightly lower than results from other bronze linkage projects using the 2011 Census data, the overall linkage accuracy for this project was estimated to be very high, at 96%. This was deemed to be the most appropriate balance of linkage rate and linkage accuracy. There is potential to raise the linkage rate; however, any small increase in the linkage rate would be outweighed by the loss in link accuracy. As the focus of this project is to analyse individual characteristics, including those for sub-populations, linkage accuracy was treated as being of higher importance than a slight increase in the linkage rate.
18 While of a high quality, these links still have a small chance of being false. This chance of error is influenced by a few factors. The first factor is the amount of missing or invalid information for the linking variables used. Matches can only be made on valid responses and any of the unique links could have potentially been duplicated in the records with missing or invalid information if that information was present.
19 The second factor is persons on the school enrolments/NAPLAN data who were missing from the Census data. While both sources of data are population counts, students may not have filled in a Census form in 2011 because they were no longer a resident of Australia, were abroad temporarily at the time of collection, or were missed for another reason. As with the first factor, these people who were missing from the 2011 Census could have created duplicate records for the links that were considered unique.
20 Another factor impacting on potential error is the quality of the variables used for linking. While inaccurate responses for variables have a small impact here, the larger impact comes from the efficacy of variables to match records uniquely out of a pool of possible links. Variables that are more likely to contain unique responses, such as date of birth, are more effective for linking than variables that are less likely to be unique, such as sex.
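The point about the efficacy of linking variables can be illustrated with a small sketch that measures the share of records identified uniquely by a given combination of variables. The function name `unique_share` and the sample records are hypothetical.

```python
from collections import Counter


def unique_share(records, keys):
    """Share of records whose combination of key values occurs exactly once."""
    combos = [tuple(rec[k] for k in keys) for rec in records]
    counts = Counter(combos)
    return sum(1 for combo in combos if counts[combo] == 1) / len(records)


# Hypothetical sample data, for illustration only.
records = [
    {"dob": "2002-03-14", "sex": "F"},
    {"dob": "2002-03-14", "sex": "M"},
    {"dob": "2001-07-02", "sex": "M"},
    {"dob": "2000-11-30", "sex": "F"},
]
print(unique_share(records, ["sex"]))         # 0.0
print(unique_share(records, ["dob"]))         # 0.5
print(unique_share(records, ["dob", "sex"]))  # 1.0
```

Sex alone identifies no record uniquely in this example, date of birth identifies half, and the two variables combined identify all four.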
CREATION OF DATA ITEMS
Parental Education
21 Records within a family were linked to enable the identification of parental characteristics. This was only undertaken for parents, natural or adopted children, step children, and foster children who were at home on Census night.
Mother's age at time of child's birth
22 Mother's age at the time of child's birth was only calculated where both the child and the mother were at home on Census night, and the child was identified as the natural or adopted child of both parents or lone female parent, or the step-child of the male parent. It could not be calculated for children living in a female same-sex couple family.
Housing Costs
23 Personal income is collected in Census as ranges, whereas mortgage and rent amounts are collected as dollar values. In order to calculate the proportion of income spent on housing costs an imputed median of the income range was used. As such, the proportions calculated may not be exact and households paying close to 30% may be incorrectly classified as being above or below 30%. Further information on the imputed medians can be found in the Income data in the Census Fact Sheet.
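The effect of using an imputed median can be shown with a short sketch. The income ranges and median values below are placeholders chosen for illustration; they are not the imputed medians published in the fact sheet, and the example assumes income and housing costs are expressed for the same period.

```python
# Hypothetical imputed medians (dollars per week), for illustration only.
IMPUTED_MEDIAN = {"$400-$599": 500.0, "$600-$799": 700.0, "$800-$999": 900.0}


def housing_cost_share(income_ranges, weekly_housing_cost):
    """Housing costs as a proportion of income, using imputed medians for each person."""
    household_income = sum(IMPUTED_MEDIAN[r] for r in income_ranges)
    return weekly_housing_cost / household_income


# Two income earners, with weekly housing costs of $370.
estimated_share = housing_cost_share(["$600-$799", "$400-$599"], 370.0)

# If the actual incomes sat near the top of each range, the true share is lower.
true_share = 370.0 / (790.0 + 590.0)

print(round(estimated_share, 3), estimated_share > 0.30)  # 0.308 True
print(round(true_share, 3), true_share > 0.30)            # 0.268 False
```

In this example the household is classified as paying more than 30% of income on housing, although its actual incomes could place it below that threshold.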
WEIGHTING
24 Weighting is the process of adjusting a sample to infer results for the relevant population. The estimates in this publication are obtained by assigning a 'weight' to each linked record. The weight is a value which indicates how many students' records are represented by the linked record. Weights aim to adjust for the fact that the linked student records may not be representative of all the student records. Weighting was used to ensure better representation of population sub-groups and to enhance the reliability of linked education data for longitudinal and cross-sectional analysis.
25 A 'two-step linking propensity calibration' procedure was used, which involves estimating the link rate using a logistic regression model.
26 The first step of the calibration process used methodology developed to adjust for non-response in sample surveys. The concepts of non-response and non-links differ in that the former is a result of an action by a person selected in a sample, and the latter is the failure to link a record likely as a result of the quality of its linking variables. However, both situations may result in under/over representation, and as such the methodology developed to adjust for non-response is suitable to apply to adjust for non-links. The Integrated 2011 Census, school enrolments and NAPLAN dataset is unique in that many characteristics of the non-links are known, and these characteristics can therefore be used as inputs into a non-links adjustment.
27 The propensity of a school enrolment record to be linked to a Census record was modelled, and each record was assigned an initial weight. Records in the linked dataset which share characteristics with unlinked records are given higher weights by this model, such that unlinked records are adequately represented on the linked file.
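A simplified sketch of the first step is given below. Instead of fitting a logistic regression, it estimates the link propensity as the observed link rate within each cell of records sharing the same characteristics, which is what a logistic model containing only indicator terms for every cell would produce. The field names and data are hypothetical, and the model used for this publication is likely to have been more detailed.

```python
from collections import defaultdict


def propensity_weights(records, cell_keys):
    """Initial weight for each linked record: the inverse of its cell's link rate."""
    totals, linked = defaultdict(int), defaultdict(int)
    for rec in records:
        cell = tuple(rec[k] for k in cell_keys)
        totals[cell] += 1
        linked[cell] += rec["linked"]
    weights = {}
    for rec in records:
        if rec["linked"]:
            cell = tuple(rec[k] for k in cell_keys)
            # Cells with a lower link rate receive a higher weight.
            weights[rec["id"]] = totals[cell] / linked[cell]
    return weights


# Hypothetical sample data: 1 = linked to Census, 0 = not linked.
flags = [("remote", 1), ("remote", 1), ("remote", 0), ("remote", 0),
         ("city", 1), ("city", 1), ("city", 1), ("city", 0)]
records = [{"id": f"r{i}", "area": area, "linked": linked}
           for i, (area, linked) in enumerate(flags)]
weights = propensity_weights(records, ["area"])
print(weights["r0"])  # 2.0
```

Linked records from the remote cell, where only half the records linked, each represent two students, and the weights of all linked records sum to the eight records on the original file.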
28 The second step of the calibration process used the weighted file as produced in step one, and calibrated it to the school enrolment totals. A parallel second step was also run, to calibrate to the NAPLAN totals.
29 Calibration was conducted using the following variables:
- which years of enrolment records were available for that student
- SEIFA Index of Relative Socio-Economic Disadvantage
- age group
- sex
- Indigenous status
- remoteness area
- grade
- school size
- numeracy, reading and writing bands.
30 The weights have a mean value of 1.53 and range between 1 and 19.
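The calibration step can be illustrated with raking (iterative proportional fitting), one common way of adjusting weights so that weighted counts agree with known totals. The notes above do not state which calibration algorithm was applied, so the choice of raking, the two calibration variables and the totals shown are assumptions made for this sketch.

```python
def rake(records, initial_weights, margins, iterations=50):
    """Adjust weights so weighted counts match known totals for each variable.

    margins maps each variable to a dict of {category: known total}.
    """
    weights = dict(initial_weights)
    for _ in range(iterations):
        for variable, targets in margins.items():
            # Current weighted count in each category of this variable.
            current = {}
            for rec in records:
                category = rec[variable]
                current[category] = current.get(category, 0.0) + weights[rec["id"]]
            # Scale each weight so the category totals match the targets.
            for rec in records:
                category = rec[variable]
                weights[rec["id"]] *= targets[category] / current[category]
    return weights


# Hypothetical linked records, initial weights and population totals.
records = [
    {"id": "a", "sex": "F", "grade": "Year 3"},
    {"id": "b", "sex": "F", "grade": "Year 5"},
    {"id": "c", "sex": "M", "grade": "Year 3"},
    {"id": "d", "sex": "M", "grade": "Year 5"},
]
initial_weights = {rec["id"]: 1.5 for rec in records}
margins = {
    "sex": {"F": 4.0, "M": 6.0},
    "grade": {"Year 3": 5.0, "Year 5": 5.0},
}
calibrated = rake(records, initial_weights, margins)
weighted_females = sum(calibrated[r["id"]] for r in records if r["sex"] == "F")
weighted_year3 = sum(calibrated[r["id"]] for r in records if r["grade"] == "Year 3")
```

After calibration the weighted counts for each sex and each grade agree with the supplied totals, in the same way that the published weights reproduce the enrolment and NAPLAN totals.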
POPULATIONS
31 This article uses all NAPLAN results in Queensland for 2011 which were successfully linked to the 2011 Census. This represents a weighted total of 144,375 students with numeracy results, 145,134 for reading and 144,865 for writing. The breakdowns of students by grade level are outlined in the following table.
WEIGHTED NUMBER OF STUDENTS WITH NAPLAN RESULTS, BY GRADE
Grade | Numeracy | Reading | Writing |
Year 3 | 36,865 | 37,034 | 36,849 |
Year 5 | 37,247 | 37,403 | 37,287 |
Year 7 | 37,845 | 38,009 | 37,902 |
Year 9 | 32,418 | 32,688 | 32,826 |
Total | 144,375 | 145,134 | 144,865 |
LOGISTIC REGRESSION ANALYSIS
32 Logistic regression is a widely used statistical technique for analysing relationships among variables and for prediction. Specifically, this technique models the relationship between a categorical dependent variable that is frequently binary in nature (e.g. 1/0, yes/no, success/failure) and a set of explanatory variables. The explanatory variables can be continuous, discrete, categorical or a combination of these. The objective in standard logistic regression is to model the conditional probability of an event of interest occurring (e.g. whether a student scores at or above the national minimum standard in reading).
33 The logistic regression model is generally expressed in terms of the odds of the event, with results reported as odds ratios relative to a reference group. In the context of this study, odds ratios below one indicate a decreased likelihood of scoring at or above the NAPLAN national minimum standard, while odds ratios above one are associated with an increased likelihood of scoring at or above the national minimum standard.
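The interpretation of values above and below one can be illustrated with a short sketch, reading those values as odds ratios. The probabilities and coefficients are invented for the example and are not results from this study.

```python
import math


def odds(probability):
    """Odds of an event: the probability it occurs divided by the probability it does not."""
    return probability / (1.0 - probability)


def odds_ratio(p_group, p_reference):
    """Odds of the event in a group relative to the odds in a reference group."""
    return odds(p_group) / odds(p_reference)


def predicted_probability(intercept, coefficient, x):
    """Probability implied by a logistic model with one explanatory variable."""
    return 1.0 / (1.0 + math.exp(-(intercept + coefficient * x)))


# Hypothetical example: 90% of the reference group and 80% of the comparison
# group score at or above the national minimum standard.
example_ratio = odds_ratio(0.80, 0.90)
print(round(example_ratio, 3))  # 0.444

# For a 0/1 explanatory variable, the exponential of the coefficient
# equals the odds ratio between the two groups.
p_reference = predicted_probability(2.0, -0.8, 0)
p_group = predicted_probability(2.0, -0.8, 1)
model_ratio = odds_ratio(p_group, p_reference)
```

An odds ratio of about 0.44 indicates that the comparison group has lower odds of scoring at or above the standard than the reference group. In the second example `model_ratio` equals `math.exp(-0.8)`, up to floating-point rounding.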
USE OF DATA
34 As the data in this publication is based on weighted estimates, there may be differences between figures in this publication and those published elsewhere.
35 While every effort was made to assure the quality of the statistics presented in this publication, they should be considered experimental and treated with caution.
36 Any discrepancies between totals and sums of components are due to rounding.