EXECUTIVE SUMMARY
As more governments and research sectors look towards using data integration to better answer policy questions, solving the problem of how to maximise linkage quality becomes paramount. This research paper compares two datasets to determine the impact that geographic information has on data linkage quality.
Previous ABS research examined National Centre for Vocational Education Research data with large-area geography, and investigated what effect this geography had on data linkage quality. The research recommended improvements to linkage methods and to the data itself to create a suitable linked dataset. Further work showed that improved linkage methods, such as basing linkage on the geographic information available and weighting or calibrating the linked dataset to the input dataset, led to a linked dataset of sufficient quality for statistical purposes. In turn, this improved method was used to successfully link Vocational Education and Training in Schools data to 2011 ABS Census of Population and Housing (Census) data for the publication Outcomes from Vocational Education and Training in Schools, experimental estimates (ABS cat. no. 4260.0).
In this current research, Queensland Curriculum and Assessment Authority data containing small-area geography was linked to the Census. The linkage quality of this data was compared to the results of the previous research to examine the effects of using small-area geography on data linkage quality.
The results showed that linking on small-area geography increased linkage rates compared to linking on large-area geography, even after improvements in linkage methods were considered. The linkage rate was 12.4 percentage points higher than when using large-area geography with improved linkage methods, and 22.9 percentage points higher than when using large-area geography without improved linkage methods. Further analysis showed that the more detailed geography also increased the representativeness of the integrated dataset, particularly for variables based on geography.
Lastly, for the dataset with small-area geography, more links were made on variables with lower duplicate rates, and hence the links were likely to be more accurate when compared to linking on variables with higher duplicate rates.
While linked datasets of sufficient quality can be created using large-area geography, more detailed geography improves the quality of linked datasets in multiple areas. Improvements to the detail of geographical and other information on administrative data should be sought to deliver these enhancements. However, improved data linkage methods have also made it possible to integrate data even with less detailed geographical information, and greater use can be made of existing datasets to create a richer and more informative picture of Australia.
1. INTRODUCTION
Statistical data integration, which involves combining information from different administrative or survey sources, expands the usefulness of available data by enriching it with more information without the cost or effort that would be involved in collecting this additional information (see endnote 1). The ABS has used statistical data integration with data from the 2011 Census of Population and Housing (Census) to enrich the information available from several administrative datasets. For example, integration with the Census has allowed analysis of the post-school employment and study outcomes of Vocational Education and Training (VET) in Schools students (see endnote 2), and of the settlement outcomes of migrants who entered Australia between 2000 and 2011 (see endnote 3).
Data linkage is the process of linking records together between two or more datasets based on variables that are common to those datasets. The ABS has explored several methods of linking data and has published the results in a number of research papers. Of particular relevance is a research paper released in 2013 investigating the feasibility of linking 2011 VET in Schools data to 2011 Census data (see endnote 4). The key finding in this research paper was that linking primarily on larger geographic units such as Statistical Area 2 and Statistical Local Area led to a relatively small proportion of records being linked, and to low levels of accuracy among those links. This is because larger regions contain more people than smaller regions, so individual records are more likely to share common information and are harder to distinguish from one another. Based on this finding, the paper recommended that further improvements in the linking method be investigated and that more detailed geographical information be sought to improve data linkage quality.
Further analysis has found that datasets with less detailed geographic information can be made more suitable for linkage by using different techniques. Two key methodological improvements have been made to improve the linkage quality of the resultant linked datasets.
- The first of these was to directly use existing geographical data such as Postcode, rather than coding to statistical units such as Statistical Area 2 and Statistical Local Area, and to combine large-area geographic variables together to create more specific geographic areas.
- The second was to weight the linked dataset to the input dataset. As the input dataset is a complete dataset of the VET in Schools population, it is possible to weight up individual records so that the linked population is representative of the original population. For instance, if the original population is composed of 40% females while the linked population is composed of 35% females, the females in the linked population can be given a higher weight so that they make up 40% of the weighted linked population. These methods mean that even datasets with relatively large-area geographic information can often be linked with sufficient quality. This weighting approach was used in the publication Outcomes from Vocational Education and Training in Schools, experimental estimates (see endnote 5) to successfully link a dataset containing large-area geographic information to the 2011 Census of Population and Housing.
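The weighting idea in the second point can be illustrated with a minimal sketch. It assumes simple post-stratification on a single variable, which is only one possible way to weight or calibrate a linked dataset; the paper does not specify the exact method used. The function name `poststratification_weights` and the toy records are hypothetical and chosen to mirror the 40% versus 35% example.

```python
from collections import Counter


def poststratification_weights(population, linked, variable):
    """Return one weight per category of `variable`.

    Each weight is (population count) / (linked count), so that the weighted
    linked records reproduce the population's distribution on `variable`.
    This is an illustrative assumption, not the published ABS method.
    """
    population_counts = Counter(record[variable] for record in population)
    linked_counts = Counter(record[variable] for record in linked)
    return {
        category: population_counts[category] / linked_counts[category]
        for category in linked_counts
    }


# Illustrative data only: 40% of the input population is female,
# but only 35% of the linked records are (7 of 20).
population = [{"sex": "F"}] * 40 + [{"sex": "M"}] * 60
linked = [{"sex": "F"}] * 7 + [{"sex": "M"}] * 13

weights = poststratification_weights(population, linked, "sex")

# Weighted share of females in the linked dataset after applying the weights.
weighted_total = sum(weights[record["sex"]] for record in linked)
weighted_females = sum(
    weights[record["sex"]] for record in linked if record["sex"] == "F"
)
weighted_female_share = weighted_females / weighted_total
```

In this sketch each linked female record carries a weight of 40/7 and each linked male record a weight of 60/13, so the weighted linked dataset has the same Sex composition as the input population.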
Despite these improvements in linkage methods and the use of weighting, further benefits that can be achieved by using small-area geography are still of significant interest. To obtain this data, and in line with the goals of the 'Building capability to maximise use of Vocational Education and Training data and ABS Census of Population and Housing data' project (see endnote 6), the ABS sought out small-area geographic data on VET in Schools. For this purpose, the Queensland Curriculum and Assessment Authority provided the ABS with VET in Schools data that was able to be geocoded to very small geographic regions. By comparing the results between the two research projects, it is possible to evaluate the effect that small-area geography has on data linkage quality.
2. THE DATA
The datasets used in this research were:
- 2011 Queensland Curriculum and Assessment Authority (QCAA) VET in Schools;
- 2011 National Centre for Vocational Education Research (NCVER) VET in Schools (Queensland records only); and
- 2011 ABS Census of Population and Housing (Census).
Table 2.1 shows the number of records in each dataset. As the NCVER dataset only included persons from 15-20 years of age, the QCAA and Census datasets were also filtered to this age range. Additionally, the NCVER dataset was filtered to only include Queensland records for a more accurate comparison with the QCAA dataset.
2.1 COUNT OF PERSON RECORDS
|
| Persons |
|
Census | 1,638,156 |
NCVER(a) | 80,367 |
QCAA(a) | 70,628 |
|
(a) VET in Schools students
Source: 2011 Census of Population and Housing; 2011 NCVER VET in Schools records; 2011 QCAA person records
The NCVER and QCAA datasets were processed and standardised for linkage to Census person records as in the previous research paper (see endnote 7).
3. THE LINKAGE PROCESS
This section provides an overview of the work undertaken to create the linked NCVER/Census integrated dataset and the linked QCAA/Census integrated dataset.
3.1 Deterministic data linking
VET in Schools records were linked to Census of Population and Housing records through exact matches on responses for common variables ('deterministic' linkage). For example, a variable that was common to each dataset was Sex which had the possible responses of '1' (male) or '2' (female), and if a record had a response of '1' on both datasets it would be one step closer to becoming a link. At least one geographic variable, Sex, and Date of birth or Age were kept as a minimum in all combinations that were used to search for links.
To link the QCAA and Census records together, different combinations of variables common to the Census and VET in Schools datasets were used (see table 3.1 below). The combinations were ranked in order of quality (where higher quality combinations are better able to uniquely identify people between datasets) and matches were sought for records iteratively through the combinations. In this linkage project, a ‘unique link’ was defined as instances where a record on one of the VET in Schools datasets had only one matching record on the Census, and that same Census record did not match to any other record on one of the VET in Schools datasets. This method also includes the concept of prioritising certain variables. For example, where a pair of records match on a combination using small-area geography (e.g. Mesh Block) and later match on a combination using large-area geography (e.g. Statistical Area 2), the first pair is accepted over the second. This is done to increase the accuracy of the data that is linked. Additionally, certain variables likely to have multiple links were combined to increase the chances of creating unique links during linkage.
3.1 NCVER, QCAA, AND CENSUS LINKING VARIABLES
|
NCVER variable | Census variable | QCAA variable |
|
- | Mesh Block (usual address) | Mesh Block (usual address) |
- | Statistical Area 1 (usual address) | Statistical Area 1 (usual address) |
Statistical Area 2 (usual address) | Statistical Area 2 (usual address) | Statistical Area 2 (usual address) |
Statistical Local Area (usual address) | Statistical Local Area (usual address) | Statistical Local Area (usual address) |
Statistical Area 2 and Statistical Local Area combined (usual address) | Statistical Area 2 and Statistical Local Area combined (usual address) | - |
Postcode (usual address) | Postcode (usual address) | - |
State Suburb Code (usual address) | State Suburb Code (usual address) | - |
Postcode and State Suburb Code combined (usual address) | Postcode and State Suburb Code combined (usual address) | - |
Day of birth | Day of birth | Day of birth |
Month of birth | Month of birth | Month of birth |
Year of birth | Year of birth | Year of birth |
Age | Age | Age |
Sex | Sex | Sex |
Country of birth | Country of birth | Country of birth |
Main language spoken at home | Main language spoken at home | Main language spoken at home |
|
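The iterative deterministic linkage described in section 3.1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the ABS implementation: records are plain dictionaries, the field names (`mb`, `sa2`, `dob`, `sex`) are hypothetical, records with a missing value are skipped for that combination, and records linked in an earlier (higher quality) pass are assumed to be removed from later passes.

```python
from collections import defaultdict


def linking_key(record, variables):
    """Return the record's values for `variables`, or None if any is missing."""
    values = tuple(record.get(name) for name in variables)
    return None if any(value is None for value in values) else values


def group_by_key(records, variables, already_linked):
    """Group the ids of not-yet-linked records by their linking key."""
    groups = defaultdict(list)
    for record_id, record in records.items():
        if record_id in already_linked:
            continue
        key = linking_key(record, variables)
        if key is not None:
            groups[key].append(record_id)
    return groups


def deterministic_link(vet, census, ranked_combinations):
    """Link VET records to Census records, best combination first.

    A link is accepted only when exactly one VET record and exactly one
    Census record share the same key (a 'unique link').
    """
    links = {}  # VET id -> (Census id, combination used)
    linked_census = set()
    for variables in ranked_combinations:
        vet_groups = group_by_key(vet, variables, links)
        census_groups = group_by_key(census, variables, linked_census)
        for key, vet_ids in vet_groups.items():
            census_ids = census_groups.get(key, [])
            if len(vet_ids) == 1 and len(census_ids) == 1:
                links[vet_ids[0]] = (census_ids[0], variables)
                linked_census.add(census_ids[0])
    return links


# Illustrative records only.
vet = {
    "v1": {"mb": "M1", "sa2": "S1", "dob": "1995-01-01", "sex": 1},
    "v2": {"mb": None, "sa2": "S1", "dob": "1994-02-02", "sex": 2},
    "v3": {"mb": "M9", "sa2": "S2", "dob": "1995-05-05", "sex": 1},
}
census = {
    "c1": {"mb": "M1", "sa2": "S1", "dob": "1995-01-01", "sex": 1},
    "c2": {"mb": "M2", "sa2": "S1", "dob": "1994-02-02", "sex": 2},
    "c3": {"mb": "M3", "sa2": "S2", "dob": "1995-05-05", "sex": 1},
    "c4": {"mb": "M4", "sa2": "S2", "dob": "1995-05-05", "sex": 1},
}
# Combinations ranked from highest to lowest quality.
ranked_combinations = [("mb", "dob", "sex"), ("sa2", "dob", "sex")]

links = deterministic_link(vet, census, ranked_combinations)
```

Here `v1` links on the Mesh Block combination, `v2` (missing Mesh Block) links on the Statistical Area 2 combination, and `v3` stays unlinked because two Census records agree with it at the Statistical Area 2 level.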
4. EVALUATION OF LINKAGE
To evaluate the quality of the integrated datasets, the following sections investigate:
- an indication of the accuracy of links made based on the amount and proportion of duplicate links and missing information (duplicate rate/missing information);
- how many records successfully linked to the Census dataset (linkage rate);
- whether person characteristics in the original dataset are proportionally the same in the linked datasets (representativeness).
4.1 Comparing duplicate rate and missing information
Missing information increases the likelihood that a false link is made, and hence, the more missing information there is, the less accurate the links are likely to be. This is because links are only made on non-missing data, and for any link that is accepted, a more accurate link may have been accepted if all the data was available. Table 4.1 below summarises the missing information, showing that the NCVER dataset is likely to be less accurate for linkage purposes than the QCAA dataset due to the lack of small-area geography variables available for linking. However, approximately 10% to 11% of records in the QCAA dataset are missing Mesh Block and Statistical Area 1 information, which will decrease the number of links that can be made on these variables. Also of note is the relatively high rate (approximately 11%) of missing Day, Month and Year of birth information on the Census dataset, which is likely to decrease accuracy for combinations using these three variables, although Age is available across all datasets.
4.1 PROPORTION OF MISSING DATA FROM INPUT DATASETS
|
| NCVER | QCAA | Census |
|
|
Linking variable | Proportion (%) | Proportion (%) | Proportion (%) |
|
Mesh Block (usual address) | na | 10.9 | 0.0 |
Statistical Area 1 (usual address) | na | 10.0 | 0.0 |
Statistical Area 2 (usual address) | 14.5 | 2.1 | 0.0 |
Statistical Local Area (usual address) | 21.0 | 2.5 | 0.0 |
Postcode (usual address) | <0.1 | na | 0.0 |
State Suburb Code (usual address) | 3.7 | na | 0.0 |
Day of birth | 0.0 | 0.0 | 11.0 |
Month of birth | 0.0 | 0.0 | 11.0 |
Year of birth | 0.0 | 0.0 | 11.1 |
Age, on 9 August 2011 | 0.0 | 0.0 | 0.0 |
|
Source: 2011 Census of Population and Housing; 2011 NCVER VET in Schools records; 2011 QCAA person records
The duplicate rate gives an indication of the likelihood of erroneously matching records when there is missing information, with less detailed variables and larger areas having more commonly shared characteristics. The higher the duplicate rate (i.e. the more common responses are to a set of variables) the higher the chance that an alternative, and potentially better link could have been made.
To calculate the duplicate rate, the number of records on the input dataset that agree with two or more Census records is divided by the number of records on the input dataset that agree with one or more Census records, based on a set of common variables. For example, there may be five potential matches on the Census for a certain VET in Schools record using Statistical Area 2, Date of birth and Sex, while there are two potential matches using Statistical Area 1, Date of birth and Sex. The combination that uses Statistical Area 1 in this example would have the lower duplicate rate.
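The duplicate rate calculation described above can be sketched as a short function. The record layout, field names and the handling of missing values (a record with any missing linking variable is treated as agreeing with nothing) are assumptions made for illustration; the function returns a proportion, which can be multiplied by 100 if a percentage is wanted.

```python
from collections import Counter


def duplicate_rate(input_records, census_records, variables):
    """Share of input records agreeing with 2+ Census records, among those
    agreeing with 1+ Census records, on the given linking variables."""

    def key(record):
        values = tuple(record.get(name) for name in variables)
        return None if any(value is None for value in values) else values

    # How many Census records carry each combination of values.
    census_counts = Counter(
        k for k in (key(record) for record in census_records) if k is not None
    )

    # Number of agreeing Census records for each usable input record.
    agreements = [
        census_counts[key(record)]
        for record in input_records
        if key(record) is not None
    ]
    one_or_more = sum(1 for count in agreements if count >= 1)
    two_or_more = sum(1 for count in agreements if count >= 2)
    # None signals that no input record agreed with any Census record.
    return two_or_more / one_or_more if one_or_more else None


# Illustrative records only.
census_records = [
    {"sa1": "A", "dob": "1995-01-01", "sex": 1},
    {"sa1": "A", "dob": "1995-01-01", "sex": 1},
    {"sa1": "B", "dob": "1994-06-15", "sex": 2},
    {"sa1": "C", "dob": "1995-03-03", "sex": 1},
]
input_records = [
    {"sa1": "A", "dob": "1995-01-01", "sex": 1},  # agrees with two Census records
    {"sa1": "B", "dob": "1994-06-15", "sex": 2},  # agrees with one Census record
    {"sa1": "D", "dob": "1996-01-01", "sex": 2},  # agrees with none
]

rate = duplicate_rate(input_records, census_records, ("sa1", "dob", "sex"))
```

In this toy example two input records agree with at least one Census record and one of them agrees with two, so the duplicate rate is one half.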
Table 4.2 shows that Mesh Block has a much lower duplicate rate than any other geographic variable. As the majority of links in the QCAA to Census dataset (84.5%) were made on Mesh Block, these links are likely to be more accurate than the links in the NCVER to Census dataset, which were made on larger-area geographic variables with higher duplicate rates. Similarly, links made on combined geographic variables have lower duplicate rates, and are therefore likely to have higher accuracy, than links made on Statistical Local Area, Postcode or State Suburb Code alone.
4.2 DUPLICATE RATE AND PROPORTION OF LINKED RECORDS
|
| NCVER to Census dataset | QCAA to Census dataset |
|
|
Variable | Duplicate rate | Proportion of records linked (%) | Duplicate rate | Proportion of records linked (%) |
|
Mesh Block (usual address) | na | na | 3.0 | 84.5 |
Statistical Area 1 (usual address) | na | na | 12.8 | 2.9 |
Postcode and State Suburb Code combined (usual address) | 13.0 | 78.1 | na | na |
Statistical Area 2 and Statistical Local Area combined (usual address) | 17.4 | 6.7 | na | na |
Statistical Area 2 (usual address) | 19.6 | 1.1 | 17.4 | 9.8 |
Statistical Local Area (usual address) | 27.9 | 4.4 | 28.8 | 3.0 |
Postcode (usual address) | 33.1 | 8.0 | na | na |
State Suburb Code (usual address) | 74.5 | 1.7 | na | na |
|
Source: NCVER to Census integrated dataset; QCAA to Census integrated dataset
4.2 Comparing linkage rates between datasets
Overall, small-area geography led to higher linkage rates (see table 4.3). In the earlier ABS research paper (see endnote 8), a linkage rate of 52.6% was obtained (when examining only Queensland NCVER records based on linkage at the Statistical Area 2 or Statistical Local Area level). This rate improved to 63.1% of the Queensland NCVER records when using the improved method (i.e. linking directly on original variables such as Postcode). Of the QCAA records, 75.5% were successfully linked to the Census, based primarily on linkage undertaken at the Mesh Block level.
Together, these results show that improved data linkage methods lead to an increase of approximately 10 percentage points in linkage rates, and that using small-area geography can increase the linkage rate further by approximately 12 percentage points. Without improvements in the method used, using small-area geography would increase the linkage rate by almost 23 percentage points.
4.3 LINKAGE RATES BETWEEN DATASETS
|
| Linkage rate (%) |
|
NCVER to Census dataset (original method) | 52.6 |
NCVER to Census dataset (improved method) | 63.1 |
QCAA to Census dataset | 75.5 |
|
Source: NCVER to Census integrated dataset; QCAA to Census integrated dataset
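The percentage point differences discussed in this section can be recomputed directly from the linkage rates in table 4.3. The short sketch below does only that arithmetic; the dictionary labels and the helper name `percentage_point_gain` are illustrative.

```python
# Linkage rates (%) taken from table 4.3.
linkage_rates = {
    "NCVER (original method)": 52.6,
    "NCVER (improved method)": 63.1,
    "QCAA": 75.5,
}


def percentage_point_gain(rates, baseline, comparison):
    """Difference between two linkage rates, in percentage points."""
    return round(rates[comparison] - rates[baseline], 1)


# Gain from the improved linkage method alone (large-area geography).
method_gain = percentage_point_gain(
    linkage_rates, "NCVER (original method)", "NCVER (improved method)"
)
# Further gain from small-area geography over the improved method.
geography_gain = percentage_point_gain(
    linkage_rates, "NCVER (improved method)", "QCAA"
)
# Gain from small-area geography over the original method.
total_gain = percentage_point_gain(
    linkage_rates, "NCVER (original method)", "QCAA"
)
```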
4.3 Comparing representativeness between datasets
This section investigates selected person variables and analyses whether the linked datasets are representative of the original population. While a high linkage rate means more records are available for further analysis, if the linked population does not accurately represent the original population then any output reported may not reflect reality.
The section below investigates Sex, Age, Socio-Economic Indexes For Areas (SEIFA), Remoteness Areas and Main language spoken at home characteristics of persons in the linked datasets.
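Sex is the first variable examined below (see table 4.4). Representativeness is summarised using the absolute sum of differences, that is, the sum across categories of the absolute percentage point differences between the unlinked and linked distributions, where a smaller value indicates a more representative linked dataset. This measure reveals that the linked QCAA/Census dataset is more representative of the original population's Sex characteristics than the linked NCVER/Census dataset. Age was assessed in the same way across the two integrated datasets, with similar results (see table 4.5 below).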
4.4 SEX CHARACTERISTICS BETWEEN DATASETS
|
| NCVER to Census dataset | QCAA to Census dataset |
|
|
Sex | Unlinked (%) | Linked (%) | Unlinked (%) | Linked (%) |
|
Male | 52.3 | 52.1 | 53.1 | 53.0 |
Female | 47.7 | 48.0 | 46.9 | 47.0 |
Unknown | <0.1 | 0.0 | 0.0 | 0.0 |
|
Absolute sum of percentage point differences between unlinked and linked datasets | na | 0.5 | na | 0.2 |
|
Source: 2011 NCVER VET in Schools records; 2011 QCAA person records; NCVER to Census integrated dataset; QCAA to Census integrated dataset
4.5 AGE CHARACTERISTICS BETWEEN DATASETS
|
| NCVER to Census dataset | QCAA to Census dataset |
|
|
Age | Unlinked (%) | Linked (%) | Unlinked (%) | Linked (%) |
|
15 | 24.7 | 25.6 | 27.1 | 27.6 |
16 | 41.6 | 40.9 | 40.1 | 39.6 |
17 | 29.8 | 29.5 | 28.9 | 29.0 |
18 | 3.5 | 3.6 | 3.5 | 3.4 |
19 | 0.4 | 0.4 | 0.4 | 0.4 |
20 | <0.1 | <0.1 | 0.1 | 0.1 |
|
Absolute sum of percentage point differences between unlinked and linked datasets | na | 2.0 | na | 1.2 |
|
Source: 2011 NCVER VET in Schools records; 2011 QCAA person records; NCVER to Census integrated dataset; QCAA to Census integrated dataset
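The absolute sum of percentage point differences reported in tables 4.4 and 4.5 can be reproduced from the published percentages. The sketch below uses the QCAA columns of those two tables; the function name is illustrative. Because the published percentages are already rounded to one decimal place, recomputing the measure this way can differ slightly from the published value for some of the other tables.

```python
def absolute_sum_of_differences(unlinked, linked):
    """Sum over categories of |unlinked % - linked %|, to one decimal place."""
    return round(
        sum(abs(unlinked[category] - linked[category]) for category in unlinked),
        1,
    )


# QCAA columns of table 4.4 (Sex).
qcaa_sex_unlinked = {"Male": 53.1, "Female": 46.9, "Unknown": 0.0}
qcaa_sex_linked = {"Male": 53.0, "Female": 47.0, "Unknown": 0.0}

# QCAA columns of table 4.5 (Age).
qcaa_age_unlinked = {15: 27.1, 16: 40.1, 17: 28.9, 18: 3.5, 19: 0.4, 20: 0.1}
qcaa_age_linked = {15: 27.6, 16: 39.6, 17: 29.0, 18: 3.4, 19: 0.4, 20: 0.1}

sex_difference = absolute_sum_of_differences(qcaa_sex_unlinked, qcaa_sex_linked)
age_difference = absolute_sum_of_differences(qcaa_age_unlinked, qcaa_age_linked)
```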
Comparisons by SEIFA show a more pronounced difference in the representativeness of the two datasets compared with the Sex and Age characteristics results (see table 4.6 below). The QCAA to Census dataset here is more representative of the original population's SEIFA characteristics than the NCVER to Census dataset by 1.3 percentage points.
4.6 SEIFA CHARACTERISTICS BETWEEN DATASETS
|
| NCVER to Census dataset | QCAA to Census dataset |
|
|
SEIFA | Unlinked (%) | Linked (%) | Unlinked (%) | Linked (%) |
|
Quintile 1 | 19.0 | 19.3 | 21.7 | 21.2 |
Quintile 2 | 16.0 | 16.9 | 17.7 | 17.8 |
Quintile 3 | 23.3 | 24.2 | 25.5 | 26.0 |
Quintile 4 | 19.3 | 19.4 | 21.3 | 21.9 |
Quintile 5 | 10.8 | 11.4 | 12.2 | 13.1 |
Unknown | 11.7 | 8.9 | 1.7 | 0.0 |
|
Absolute sum of percentage point differences between unlinked and linked datasets | na | 5.5 | na | 4.2 |
|
Source: 2011 NCVER VET in Schools records; 2011 QCAA person records; NCVER to Census integrated dataset; QCAA to Census integrated dataset
Remoteness Areas show a similar pattern to the Sex, Age and SEIFA results (see table 4.7 below). As before, the QCAA to Census dataset is more representative than the NCVER to Census dataset.
4.7 REMOTENESS AREAS CHARACTERISTICS BETWEEN DATASETS
|
| NCVER to Census dataset | QCAA to Census dataset |
|
|
Remoteness | Unlinked (%) | Linked (%) | Unlinked (%) | Linked (%) |
|
Major City of Australia | 55.9 | 56.7 | 62.6 | 64.4 |
Inner Regional Australia | 16.6 | 18.1 | 19.7 | 20.5 |
Outer Regional Australia | 14.0 | 14.6 | 14.3 | 13.7 |
Remote Australia | 1.0 | 0.8 | 1.0 | 0.9 |
Very Remote Australia | 0.9 | 0.8 | 0.7 | 0.5 |
Unknown | 11.7 | 8.9 | 1.7 | 0.0 |
|
Absolute sum of percentage point differences between unlinked and linked datasets | na | 6.1 | na | 5.3 |
|
Source: 2011 NCVER VET in Schools records; 2011 QCAA person records; NCVER to Census integrated dataset; QCAA to Census integrated dataset
Similarly, when examining Main language spoken at home in table 4.8 below, the QCAA to Census dataset is more representative of the original dataset than the NCVER to Census dataset.
4.8 MAIN LANGUAGE SPOKEN AT HOME CHARACTERISTICS BETWEEN DATASETS
|
| NCVER to Census dataset | QCAA to Census dataset |
|
|
Main language spoken at home | Unlinked (%) | Linked (%) | Unlinked (%) | Linked (%) |
|
English | 92.5 | 92.1 | 93.3 | 93.5 |
Southern European language | 0.3 | 0.4 | 0.4 | 0.4 |
Eastern European language | 0.3 | 0.4 | 0.4 | 0.4 |
Southwest and Central Asian language | 0.3 | 0.4 | 0.4 | 0.4 |
Southern Asian language | 0.3 | 0.4 | 0.4 | 0.4 |
Southeast Asian language | 1.0 | 1.3 | 1.2 | 1.3 |
Eastern Asian language | 1.1 | 1.2 | 1.2 | 1.1 |
Australian Indigenous language | 0.1 | 0.1 | 0.1 | 0.1 |
Other languages | 1.1 | 1.3 | 1.4 | 1.3 |
Unknown | 2.9 | 2.7 | 1.4 | 1.2 |
|
Absolute sum of percentage point differences between unlinked and linked datasets | na | 1.5 | na | 0.8 |
|
Source: 2011 NCVER VET in Schools records; 2011 QCAA person records; NCVER to Census integrated dataset; QCAA to Census integrated dataset
Overall, these results show that linking on small-area geography information leads to a more representative linked population compared to using large-area geographic information, particularly when examining variables based on geography such as Remoteness Areas and SEIFA.
5. LOOKING AHEAD
While linked datasets of sufficient quality can be created using large-area geography, more detailed geography improves the quality of linked datasets in multiple areas. Improvements to the detail of geographical and other information on administrative data should be sought to deliver this enhancement. However, improved data linkage methods have also made it possible to integrate data even with less detailed geographical information, and greater use can be made of existing datasets to create a richer and more informative picture of Australia.
ENDNOTES
1. Australian Bureau of Statistics, 2014, ABS Statistical Data Integration, Canberra, <www.abs.gov.au>.
2. Australian Bureau of Statistics, 2014, Outcomes from Vocational Education and Training in Schools, experimental estimates, Australia, 2006-2011 (cat. no. 4260.0).
3. Australian Bureau of Statistics, 2013, Understanding Migrant Outcomes - Enhancing the Value of Census Data, Australia, 2011 (cat. no. 3417.0).
4. Australian Bureau of Statistics, 2013, Assessing the Feasibility of Linking 2011 Vocational Education and Training in Schools Data to 2011 Census Data, Dec 2013 (cat. no. 1351.0.55.044).
5. See endnote 2.
6. National Statistical Service, 2014, Building capability to maximise use of Vocational Education and Training data and ABS Census of Population and Housing data, NSS, Canberra, <www.nss.gov.au>.
7. See endnote 4.
8. See endnote 4.