2080.5 - Information Paper: Australian Census Longitudinal Dataset, Methodology and Quality Assessment, 2011-2016

Quality Declaration

ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 27/02/2018

3. LINKAGE RESULTS

At the completion of the linkage process 927,520 (76%) of the 1,221,057 records from the 2011 ACLD Panel sample were linked to a 2016 Census record to create the linked 2011-2016 ACLD file with an estimated false link rate of 1.4%.

All results presented in this publication (unless identified in the relevant table) are based on characteristics from the 2011 ACLD Panel sample and have been confidentialised to prevent the identification of individuals.

Table 1 displays the linkage rate for a range of sub-populations.

TABLE 1 - LINKAGE RATES, By Selected Characteristics

	2011 Panel sample	Linked records	Linkage rate
	(no.)	(no.)	(%)

SEX
Male	600 724	450 092	74.9
Female	620 334	477 426	77.0

AGE GROUP
0-14	236 383	189 641	80.2
15-19	79 971	57 114	71.4
20-24	82 222	52 044	63.3
25-29	85 198	57 331	67.3
30-39	168 979	127 974	75.7
40-49	172 576	139 142	80.6
50-59	155 652	127 702	82.0
60-69	121 036	99 537	82.2
70-74	40 657	32 211	79.2
75 and over	78 384	44 823	57.2

INDIGENOUS STATUS
Non-Indigenous	1 171 794	897 076	76.6
Aboriginal	29 156	18 515	63.6
Torres Strait Islander	1 819	1 174	64.5
Both Aboriginal and Torres Strait Islander	1 243	802	64.6
Not stated	17 050	9 948	58.3

STATE/TERRITORY OF USUAL RESIDENCE
New South Wales	393 519	298 795	75.9
Victoria	304 513	233 623	76.7
Queensland	245 366	183 703	74.9
South Australia	91 555	71 650	78.3
Western Australia	125 449	95 053	75.8
Tasmania	28 580	21 831	76.4
Northern Territory	11 628	7 240	62.3
Australian Capital Territory	20 272	15 530	76.6

REMOTENESS AREAS
Major Cities	852 825	651 866	76.4
Inner Regional	228 174	174 567	76.5
Outer Regional	110 441	82 485	74.7
Remote	16 570	11 462	69.2
Very Remote	10 201	6 016	59.0
No Usual Address	2 593	1 002	38.6

Total(a)(b)(c)	1 221 057	927 520	76.0

(a) Data presented in the table have been perturbed. As a result, the sum of individual categories may not align with totals.

(b) Includes Other Territories.
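The linkage rate column in Table 1 is the number of linked records expressed as a percentage of the 2011 Panel sample records. The following is a minimal sketch of that calculation for a few rows, using counts copied from Table 1; the function and variable names are illustrative and are not part of the ACLD processing system.

```python
# Counts (2011 Panel sample, linked records) copied from Table 1.
TABLE_1 = {
    "Aged 20-24": (82_222, 52_044),
    "Aged 75 and over": (78_384, 44_823),
    "Northern Territory": (11_628, 7_240),
    "Total": (1_221_057, 927_520),
}


def linkage_rate(panel: int, linked: int) -> float:
    """Linked records as a percentage of Panel sample records."""
    if panel <= 0:
        raise ValueError("panel count must be positive")
    return 100.0 * linked / panel


# Round to one decimal place, as published in the table.
rates = {
    group: round(linkage_rate(panel, linked), 1)
    for group, (panel, linked) in TABLE_1.items()
}

for group, rate in rates.items():
    print(f"{group}: {rate}%")
```

Because the published counts have been perturbed, rates recomputed for other rows may differ slightly from the published figures.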

The linkage rates for the 2011-2016 ACLD were relatively consistent across most sub-populations and were in line with expected results. Compared with the overall linkage rate of 76%, the sub-populations which achieved the highest linkage rates were persons:

aged 60 to 69 years (82%), followed by 50 to 59 years (82%) and 0 to 14 years (80%);
of non-Indigenous origin (77%);
who usually lived in South Australia (78%); and
who usually lived in major cities (76%) and inner regional areas (77%).

The sub-populations which achieved the lowest linkage rates were persons:

aged 20-24 years (63%) and 75 years and over (57%);
of Aboriginal (64%), Torres Strait Islander (65%) or both Aboriginal and Torres Strait Islander origin (65%);
who usually lived in the Northern Territory (62%); and
who usually lived in remote (69%) and very remote areas (59%) or who had no usual address in 2011 (39%).

Traditionally, the Census Post Enumeration Survey (PES) has shown that the Census has higher rates of undercount for people of Aboriginal and/or Torres Strait Islander origin, those aged between 20 and 29 and for those in the Northern Territory. As expected, the lower ACLD linkage rates broadly aligned with the same groups that experience higher levels of undercount in the 2016 Census. One additional group that had lower linkage rates were persons aged 75 and over at the time of the 2011 Census who, due to age, had an increased risk of death over the ensuing five years. Further information on Census undercount can be found in Census of Population and Housing: Details of Overcount and Undercount, 2016 (cat. no. 2940.0).

Further data cubes demonstrating the linkage rates for various sub-populations are available as an attachment to this Information paper.

3.1 LINKAGE ACCURACY

The following quality measures were calculated for the ACLD and indicate a good level of overall quality:

The linkage rate, being the proportion of the 2011 ACLD Panel records linked to a 2016 Census record, including both true matches and false links.
The estimated proportion of correctly linked records, otherwise referred to as 'linkage precision'.
The consistency of reporting of common information between record pairs.

3.1.1 Linkage Precision

Not all record pairs assigned as links in a data linkage process are a true match, that is, a record pair belonging to the same individual. While the methodology is designed to ensure that the vast majority of links are true, some are actually false, i.e. the records in the link belong to different people rather than the same person. The linkage strategy used for the ACLD was designed to ensure a high level of accuracy while also achieving a sufficiently high number of links to enable longitudinal research. Accordingly, the strategy was restrictive and conservative.

One of the key measures of linkage quality is the proportion of links in the dataset that are false. The number of false links is able to be estimated through the use of methods such as clerically reviewing a sample of links, or by using modelling techniques. Once an estimate of the number of false links is obtained, a 'precision' can be calculated. The precision is an estimate of the proportion of links that are matches (i.e. belonging to the same entity).

Equation: Precision = (Total links - False link estimate)/Total links

Once the precision of the dataset is estimated, the false link rate follows directly as its complement (1 - precision).
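The relationship between precision and the false link rate can be illustrated with a short sketch. The false link count used here is an assumption, back-calculated from the published 1.4% false link rate purely for illustration; it is not a figure reported by the ABS.

```python
def precision(total_links: int, false_link_estimate: float) -> float:
    """Estimated proportion of links that are true matches."""
    if total_links <= 0:
        raise ValueError("total_links must be positive")
    return (total_links - false_link_estimate) / total_links


def false_link_rate(precision_value: float) -> float:
    """The false link rate is the complement of the precision."""
    return 1.0 - precision_value


TOTAL_LINKS = 927_520  # final number of links in the 2011-2016 ACLD

# Assumption for illustration only: a false link count implied by
# the published 1.4% false link rate.
assumed_false_links = round(TOTAL_LINKS * 0.014)

p = precision(TOTAL_LINKS, assumed_false_links)
print(f"precision: {p:.1%}, false link rate: {false_link_rate(p):.1%}")
```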

Precision estimation for the ACLD involved conducting clerical review on a stratified random sample of links. Potential links were stratified by their link weight value, with a minimum of 5% of links sampled from each individual link weight value (after rounding down to the nearest integer). After reviewing the sample, the results were used to calculate precision estimates for links grouped by pass and rounded link weight value. These estimates were then applied to the entire set of linkage results. This provided an estimate of precision for each individual link, which can be referred to as 'marginal precision'. Using the marginal precision, the 'cumulative precision' of the final set of one-to-one links could be estimated.
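The sampling step described above could be sketched as follows. This is an illustrative reading rather than the ABS implementation: it assumes that 'rounding down' applies to the link weight used to form strata, that at least one link is drawn from every stratum, and that each link is a hypothetical `(link_id, link_weight)` pair.

```python
import math
import random
from collections import defaultdict


def stratified_sample(links, fraction=0.05, seed=0):
    """Draw at least `fraction` of links from each rounded-down link weight.

    `links` is an iterable of (link_id, link_weight) pairs (hypothetical layout).
    Returns a dict mapping each integer link weight to the sampled link ids.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible

    # Stratify by link weight rounded down to the nearest integer.
    strata = defaultdict(list)
    for link_id, weight in links:
        strata[math.floor(weight)].append(link_id)

    sample = {}
    for weight, ids in strata.items():
        # Round the sample size up so the minimum fraction is always met.
        n = min(len(ids), max(1, math.ceil(fraction * len(ids))))
        sample[weight] = rng.sample(ids, n)
    return sample


# Hypothetical links: 300 links spread evenly over three link weight values.
example_links = [(i, 10 + (i % 3) + 0.5) for i in range(300)]
example_sample = stratified_sample(example_links)
print({w: len(ids) for w, ids in sorted(example_sample.items())})
```

The sampled links would then be clerically reviewed, and the proportion judged to be matches in each group would give that group's precision estimate.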

After producing both marginal and cumulative precision estimates, a cut-off point was selected. This cut-off is intended to optimise both the number of links and cumulative precision of the links retained above the cut-off point, while at the same time maintaining a high level of marginal precision for every individual link above the cut-off. The marginal precision estimates were used to select the cut-off, with all links with a marginal precision of at least 81% being retained. This resulted in a final file of 927,520 links once the cut-off was applied, with an estimated cumulative precision of 98.6%, or a false link rate of 1.4%, for these links.
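The interaction between marginal precision, the cut-off and cumulative precision can be illustrated with a small sketch. The link groups and their precision values below are invented for illustration; only the 81% threshold comes from the text, and cumulative precision is taken here to be the link-weighted average of the marginal precision of the retained links, which is an assumption about how the two measures relate.

```python
from dataclasses import dataclass


@dataclass
class LinkGroup:
    pass_number: int
    link_weight: int            # rounded link weight value
    n_links: int                # number of links in the group
    marginal_precision: float   # estimated from the clerical review sample


def apply_cutoff(groups, min_marginal=0.81):
    """Retain only groups whose marginal precision meets the cut-off."""
    return [g for g in groups if g.marginal_precision >= min_marginal]


def cumulative_precision(groups):
    """Expected share of true matches among all links in `groups`."""
    total = sum(g.n_links for g in groups)
    if total == 0:
        raise ValueError("no links to evaluate")
    expected_true = sum(g.n_links * g.marginal_precision for g in groups)
    return expected_true / total


# Hypothetical link groups from a single probabilistic pass.
groups = [
    LinkGroup(2, 30, 1000, 0.99),
    LinkGroup(2, 20, 500, 0.90),
    LinkGroup(2, 12, 200, 0.81),
    LinkGroup(2, 8, 300, 0.60),
]

kept = apply_cutoff(groups)
print(len(kept), round(cumulative_precision(kept), 3))
```

In this example the lowest group is dropped, which raises the cumulative precision of the retained links at the cost of 300 links.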

Clerical review relies upon judgment by a well trained individual. While efforts are taken to minimise the risk, it is therefore possible for a link to be incorrectly assigned as a match or non-match. An alternative way of measuring precision is through the use of models. We applied the method of Chipperfield et al (2018) to provide an independent model-based estimate of the precision. While the clerical estimate of cumulative precision was 98.6%, the model-based approach estimated the precision to be over 99%. The precision as estimated by the clerical review process was retained as the more conservative estimate.

Table 2 provides a summary of the precision estimate and false link rate by the pass where each link was selected (estimated via clerical review).

TABLE 2 - PRECISION ESTIMATES AND FALSE LINK RATES, By Pass Number

Pass Number (a)(b)	Proportion of Overall Links	Estimated True Link Rate / Precision Estimate	Estimated False Link Rate
(no.)	(%)	(%)	(%)

1	72.7	100	0
2	15.7	94.4	5.6
3	1.2	96.4	3.6
4	1.5	95.3	4.7
5	0.8	92.9	7.1
6	1.1	99.8	0.2
7	1.6	96.2	3.8
8	1.0	93.8	6.2
9	4.4	95.9	4.1
Total(b)	100	98.6	1.4

(a) Data presented in the table have not been perturbed.

(b) Pass number 1 refers to the deterministic linkage.
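As a consistency check, the overall precision in Table 2 can be recovered as the average of the pass-level precision estimates, weighted by each pass's share of links. The sketch below uses the published figures from Table 2; the variable names are illustrative.

```python
# pass number -> (proportion of overall links %, precision estimate %)
TABLE_2 = {
    1: (72.7, 100.0),
    2: (15.7, 94.4),
    3: (1.2, 96.4),
    4: (1.5, 95.3),
    5: (0.8, 92.9),
    6: (1.1, 99.8),
    7: (1.6, 96.2),
    8: (1.0, 93.8),
    9: (4.4, 95.9),
}

total_share = sum(share for share, _ in TABLE_2.values())

# Weight each pass's precision by its share of all links.
overall_precision = (
    sum(share * prec for share, prec in TABLE_2.values()) / total_share
)
overall_false_link_rate = 100.0 - overall_precision

print(round(overall_precision, 1), round(overall_false_link_rate, 1))
```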

The conservative and restrictive nature of the blocking and linking strategy, accompanied by quality controls that were implemented during clerical review, helped to minimise the estimated number of false links throughout the linkage process.

Almost three quarters (73%) of all links were achieved in the first pass of the project, which used a deterministic linking methodology to identify and filter matches. This pass implemented tight geographic and demographic restrictions to maximise the number of high quality links assigned and to limit the amount of alternative comparisons required. Using this approach, links were only accepted if a single unique record pair was identified.
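The rule that a deterministic link is accepted only when a single unique record pair is identified can be sketched as below. The field names and the choice of key fields are hypothetical; the actual linking variables and restrictions used in the first pass are not reproduced here.

```python
from collections import defaultdict


def deterministic_links(records_a, records_b, key_fields):
    """Link records that agree exactly on `key_fields`, accepting a link
    only when the key is unique on both files."""

    def key(rec):
        values = tuple(rec.get(f) for f in key_fields)
        # Records with any missing key field cannot be linked in this pass.
        return None if any(v is None for v in values) else values

    index_a, index_b = defaultdict(list), defaultdict(list)
    for rec in records_a:
        if key(rec) is not None:
            index_a[key(rec)].append(rec["id"])
    for rec in records_b:
        if key(rec) is not None:
            index_b[key(rec)].append(rec["id"])

    links = []
    for k, ids_a in index_a.items():
        ids_b = index_b.get(k, [])
        if len(ids_a) == 1 and len(ids_b) == 1:  # single unique record pair
            links.append((ids_a[0], ids_b[0]))
    return links


# Hypothetical key fields and records, for illustration only.
KEY_FIELDS = ("name_hash", "sex", "dob", "mesh_block")
panel_2011 = [
    {"id": "a1", "name_hash": "h1", "sex": "F", "dob": "1980-02-03", "mesh_block": "MB1"},
    {"id": "a2", "name_hash": "h2", "sex": "M", "dob": "1975-07-09", "mesh_block": "MB1"},
    {"id": "a3", "name_hash": "h3", "sex": "F", "dob": "1990-11-30", "mesh_block": "MB2"},
]
census_2016 = [
    {"id": "b1", "name_hash": "h1", "sex": "F", "dob": "1980-02-03", "mesh_block": "MB1"},
    {"id": "b2", "name_hash": "h2", "sex": "M", "dob": "1975-07-09", "mesh_block": "MB1"},
    {"id": "b3", "name_hash": "h2", "sex": "M", "dob": "1975-07-09", "mesh_block": "MB1"},
    {"id": "b4", "name_hash": "h3", "sex": "F", "dob": None, "mesh_block": "MB2"},
]

links = deterministic_links(panel_2011, census_2016, KEY_FIELDS)
print(links)
```

Record a2 is left unlinked because two 2016 records share its key, and a3 is left unlinked because date of birth is missing on the candidate record; in practice such records would be considered in later passes.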

3.1.2 Consistency of Common Information on Record Pairs

In data linkage projects, geographic boundaries function as blocking variables that restrict the search for links to records which agree on the defined geography. They are also used as linking variables, and when combined with other linking fields (such as hashed name, age, sex and date of birth), they provide a high level of uniqueness, and reduce the likelihood of linking to an incorrect record.
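The effect of blocking on the number of comparisons can be illustrated with a minimal sketch. The record layout and the `mesh_block` field name are assumptions made for illustration.

```python
from collections import defaultdict


def candidate_pairs(records_a, records_b, block_field):
    """Yield (id_a, id_b) only for records that agree on the blocking field."""
    blocks_b = defaultdict(list)
    for rec in records_b:
        if rec.get(block_field) is not None:
            blocks_b[rec[block_field]].append(rec)

    for rec_a in records_a:
        block = rec_a.get(block_field)
        if block is None:
            continue  # cannot be compared under this blocking strategy
        for rec_b in blocks_b.get(block, []):
            yield rec_a["id"], rec_b["id"]


# Hypothetical records: two geographic blocks on each file.
file_a = [
    {"id": "a1", "mesh_block": "MB1"},
    {"id": "a2", "mesh_block": "MB2"},
]
file_b = [
    {"id": "b1", "mesh_block": "MB1"},
    {"id": "b2", "mesh_block": "MB1"},
    {"id": "b3", "mesh_block": "MB2"},
]

pairs = list(candidate_pairs(file_a, file_b, "mesh_block"))
print(len(pairs), "blocked comparisons versus", len(file_a) * len(file_b), "unblocked")
```

A person who moved between Censuses would not be compared under a block based on current geography, which is why several passes with different blocking variables are typically used.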

Table 3 displays the number of records that had consistent information on key linking variables, grouped by levels of geography.

TABLE 3 - CONSISTENCY OF LINKED RECORDS, By Geography And Selected Linking Fields

Consistency of key linkage fields(a)(b)(c)
	(no.)	(%)

MESH BLOCK
First name hash, Surname hash, Age exact, Mesh Block, Sex, DOB Day and Month agree	530,305	57.2
First name hash, Surname hash, Age exact, Mesh Block, Sex agree	160,953	17.4
Age exact, Mesh Block, Sex, DOB Day and Month agree	96,202	10.4
Age exact, Mesh Block, Sex agree	7,176	0.8
Age +/- 2 years, Mesh Block, Sex agree	31,223	3.4

STATISTICAL AREA LEVEL 2
First name hash, Surname hash, Age +/- 2 years, SA2, Sex, DOB Day and Month agree	28,767	3.1
Age exact, SA2, Sex, DOB Day and Month agree	8,677	0.9
Age +/- 2 years, SA2, Sex agree	7,226	0.8

STATISTICAL AREA LEVEL 4
First name hash, Surname hash, Age +/- 2 years, SA4, Sex, DOB Day and Month agree	33,103	3.6
Age +/- 2 years, SA4, Sex, DOB Day and Month agree	8,103	0.9

Total records included	911,735	98.3

Total records linked	927,520	100

(a) Only includes records that agree on all key linking fields.

(b) Categories are mutually exclusive. Records that agree in each category are excluded from subsequent categories.
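The mutually exclusive categories in Table 3 can be thought of as an ordered list of rules, with each linked record counted against the first rule it satisfies. The sketch below covers only the first three Mesh Block categories, and the field identifiers are hypothetical labels chosen for illustration.

```python
# Ordered rules: a record pair is assigned to the first category whose
# required fields all agree. Only three Table 3 categories are shown.
RULES = [
    (
        "Names, exact age, Mesh Block, sex, DOB day and month",
        {"first_name_hash", "surname_hash", "age_exact", "mesh_block", "sex", "dob_day_month"},
    ),
    (
        "Names, exact age, Mesh Block, sex",
        {"first_name_hash", "surname_hash", "age_exact", "mesh_block", "sex"},
    ),
    (
        "Exact age, Mesh Block, sex, DOB day and month",
        {"age_exact", "mesh_block", "sex", "dob_day_month"},
    ),
]


def classify(agreeing_fields):
    """Return the first category satisfied by the set of agreeing fields,
    or None if the pair falls outside the categories listed."""
    for label, required in RULES:
        if required <= set(agreeing_fields):
            return label
    return None


all_agree = {"first_name_hash", "surname_hash", "age_exact", "mesh_block", "sex", "dob_day_month"}
no_names = {"age_exact", "mesh_block", "sex", "dob_day_month"}

print(classify(all_agree))
print(classify(no_names))
```

Ordering the rules from most to least restrictive is what ensures that a record agreeing on every field is counted once, in the first category, rather than in every category it satisfies.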

Over 98% of all records that were matched in the ACLD linkage process agreed on small to medium levels of geographic area combined with other key linking fields, such as first name and surname hash codes, age, sex and date of birth. While the number of consistent fields can give a strong indication of likely linkage quality, other factors should be taken into account, for example, the expected number of people in a geographic area that are likely to share a characteristic by chance. A tolerance of plus or minus two years was used at certain parts of the linkage process to cater for persons who may have understated their age in 2011 and/or overstated it in 2016 or vice versa.

By contrast, record pairs may have inconsistent information and yet be a match. Inconsistent information may be recorded for the same person in different Censuses due to a range of factors, including:

transcription errors in the Census, where the wrong category is selected or the information is transposed, such as the day the person was born being reported in the month field instead of in the day field;
data capture errors, where the Census form is scanned using Optical Character Recognition (OCR) software and certain characters may be mis-classified, such as a 1 captured as a 7 or a 3 as an 8;
reporting errors, where information is given for the wrong member of the household (e.g. person 1's information is reported for person 3) or where the person completing the Census form for a household guesses or estimates information about a fellow household member;
information that was not stated by the respondent and has been imputed as part of Census processing (such as age or sex); imputed values were set to missing for linking but are included in the analytical dataset;
census form questions are interpreted differently at each Census; or
questions are coded differently for each Census.

Of particular note is inconsistency due to non-reporting of name and date of birth. Respondents are becoming less likely to provide their date of birth, with the proportion reporting it decreasing from 90% in the 2011 Census to 81% in the 2016 Census. Further, just over one per cent of Australians had a missing, or blank, response for first name or surname in the 2016 Census. There appeared to be a relationship between having a missing response for both first name and surname and non-response on other variables. Of the people who did not report first name and surname, approximately half did not report at least one of sex, age, or Indigenous status. The vast majority of missing responses came from paper forms, with the overall level of missing responses in the 2016 Census remaining low.

3.2 CHARACTERISTICS OF LINKED AND UNLINKED 2011 ACLD PANEL SAMPLE

The random sample selected from the 2011 Census for the 2011 ACLD Panel was designed to maximise overlap with the 2006 ACLD Panel, while also being representative of the Australian population by age, sex and jurisdiction as well as other characteristics such as Indigenous status and country of birth. The 2011 Panel sample size was increased in comparison to the 2006 Panel sample size primarily due to the increase in the Australian population from 2006 to 2011. The 2011 Panel size was increased slightly to 5.7%, to achieve a linked sample size closer to 5% of the population after allowing for missed links and people no longer being in scope of the ACLD due to death or overseas migration.

Table 4 shows the distribution of key populations across the 2011 Census, the 2011 ACLD Panel sample and the linked results.

TABLE 4 - SELECTED CHARACTERISTICS, By 2011 Census, 2011 ACLD Panel Sample, ACLD Linked Results

	2011 Census		2011 Panel Sample		Linked Results		Weighted Linked Results (a)
	(no.)	(%)	(no.)	(%)	(no.)	(%)	(no.)	(%)

SEX
Male	10 634 012	49.4	600 724	49.2	450 092	48.5	10 440 753	49.5
Female	10 873 706	50.6	620 334	50.8	477 426	51.5	10 639 417	50.5

STATE/TERRITORY OF USUAL RESIDENCE
New South Wales	6 917 656	32.2	393 519	32.2	298 795	32.2	6 787 716	32.2
Victoria	5 354 039	24.9	304 513	24.9	233 623	25.2	5 304 805	25.2
Queensland	4 332 727	20.2	245 366	20.1	183 703	19.8	4 223 043	20.0
South Australia	1 596 569	7.4	91 555	7.5	71 650	7.7	1 548 407	7.3
Western Australia	2 239 171	10.4	125 449	10.3	95 053	10.2	2 182 402	10.4
Tasmania	495 351	2.3	28 580	2.3	21 831	2.4	476 403	2.3
Northern Territory	211 943	1.0	11 628	1.0	7 240	0.8	211 411	1.0
Australian Capital Territory	357 218	1.7	20 272	1.7	15 530	1.7	343 595	1.6

AGE GROUP
0-9	2 772 971	12.9	157 597	12.9	126 844	13.7	2 823 442	13.4
10-19	2 776 848	12.9	158 761	13.0	119 912	12.9	2 822 767	13.4
20-29	2 973 916	13.8	167 423	13.7	109 375	11.8	3 047 805	14.5
30-39	2 973 913	13.8	168 979	13.8	127 974	13.8	2 987 460	14.2
40-49	3 047 023	14.2	172 576	14.1	139 142	15.0	3 050 851	14.5
50-59	2 744 653	12.8	155 652	12.7	127 702	13.8	2 718 221	12.9
60-69	2 125 435	9.9	121 036	9.9	99 537	10.7	2 051 448	9.7
70-79	1 253 349	5.8	71 658	5.9	54 430	5.9	1 098 356	5.2
80 and over	839 609	3.9	47 387	3.9	22 603	2.4	479 854	2.3

INDIGENOUS STATUS
Non-Indigenous	19 900 765	92.5	1 171 794	96.0	897 076	96.7	20 228 715	96.0
Aboriginal and/or Torres Strait Islander	548 368	2.5	32 218	2.6	20 491	2.2	617 382	2.9
Aboriginal	495 754	2.3	29 156	2.4	18 515	2.0	558 748	2.7
Torres Strait Islander	31 407	0.1	1 819	0.1	1 174	0.1	34 407	0.2
Both Aboriginal and Torres Strait Islander	21 205	0.1	1 243	0.1	802	0.1	24 227	0.1
Not stated	1 058 585	4.9	17 050	1.4	9 948	1.1	233 961	1.1

Total (b)(c)(d)	21 507 719	100	1 221 057	100	927 520	100	21 080 214	100

(a) For more information on weighting see chapter 3.4.

(b) Data presented in the table have been perturbed. As a result the sum of individual categories may not align with totals.

(d) Includes Migratory areas.

The distribution of the ACLD file by sub-population was generally well aligned with both the 2011 Panel sample and the entire 2011 Census. When looking at the relative difference between these proportions, however, some differences are more clearly observed.

Compared with the entire 2011 Census, the linked 2011 ACLD Panel contains relatively more records for people aged 50-59 years, and to a lesser extent those aged 0-9 years, 40-49 years and 60-69 years. By contrast, the linked 2011 Panel contains relatively fewer records for people aged 20-29 years and 80 years and over. This is consistent with the 2006-2011 ACLD linkage as these subpopulations followed similar linkage rates.

In general, the distribution of weighted counts for the linked ACLD file is close to that of the entire 2011 Census, but it should be noted that the weighting process is not designed to produce counts corresponding to the population in 2011. Rather, the weighted population is that of people who were in scope of both the 2011 and 2016 Censuses (see Section 3.4 Weighting). Thus, for example, the lower proportion of older people in the linked file, even after weighting, reflects the impact on the 2011 Panel sample of deaths that occurred between 2011 and 2016.

Further data cubes demonstrating more detailed population distributions are provided as an attachment to this Information paper.

3.3 REASONS FOR UNLINKED RECORDS

There are two main reasons why records from the 2011 Panel sample were not linked to a 2016 Census record:

records belonging to the same individual were present in the 2011 Panel sample and the 2016 Census but these records failed to be linked because they contained missing or inconsistent information; or
there was no 2016 Census record corresponding to the 2011 Panel sample record because the person was not counted in the 2016 Census.

3.3.1 Missing and/or inconsistent information

In these cases, the true match was present in the pool of all record pairs but it was not identified because there was a high level of inconsistency between information on the 2011 ACLD Panel sample record and the 2016 Census record, or key linking fields were missing altogether. The reasons for the match being missed can be categorised into the following groups:

the missing or inconsistent information did not allow the record pair to be compared in the same blocking categories and could not be linked;
the record pair did not contain enough unique common information to distinguish the match from other potential record pairs;
the record pair was linked, but was attributed a low link weight as it contained a lot of missing or inconsistent information and was positioned below the cut-off identified in sample clerical review; or
the record pair was subjected to clerical review, but the high level of inconsistency did not enable it to be deemed a true link.

Accurate address coding was crucial in narrowing the search and differentiating between true and false links. It was a particular challenge for persons who had moved, since linkage was then dependent on the information supplied in 2016 about the person's address in 2011. Processing for the 2016 Census involved coding for address five years ago to a fine level of geography, ideally Mesh Block. This was not always possible, due to insufficient and/or incorrect address information being supplied for some persons, potentially due to recall issues.

3.3.2 No 2016 Census record

A person included in the 2011 ACLD Panel sample may have had no equivalent 2016 Census record because they were no longer in scope for the Census due to migration from Australia, or death between 2011 and 2016, or they may have been missed in the 2016 Census.

According to mortality data compiled by the ABS from data supplied by the Registrars of Births, Deaths and Marriages, approximately 913,000 people died in Australia between 2011 and 2016. If 5% of these people were selected in the 2011 Panel sample, then it could be estimated that up to 46,000 people could not have been linked due to death between 2011 and 2016. Similarly, migration data estimates that just over 1.4 million people left Australia as permanent emigrants over the same period, potentially resulting in up to 70,000 people from the 2011 Panel sample being unlikely to have a corresponding 2016 Census record. For more information please refer to the relevant releases of Migration, Australia (cat. no. 3412.0) and Deaths, Australia (cat. no. 3302.0).

Due to the size and complexity of the Census, it is inevitable that some people are missed and some are counted more than once. It is for this reason that the Census Post Enumeration Survey (PES) is run shortly after each Census, to provide an independent measure of Census coverage. The PES determines how many people should have been counted in the Census, how many were missed (undercount), and how many were counted more than once (overcount). It also provides information on the characteristics of those in the population who have been under- or overcounted.

The net undercount rate for the 2016 Census was 1%, with a higher rate for Aboriginal and Torres Strait Islander people than for the non-Indigenous population. Thus approximately 12,000 people from the 2011 Panel sample could have been missed in the 2016 Census. This estimate is a starting point only and does not take into account the likelihood of people being missed in successive Censuses. For more information please refer to Census of Population and Housing: Details of Overcount and Undercount, 2016 (cat. no. 2940.0).

When taking into account all of these factors, it is estimated that approximately 44% of the unlinked 2011 ACLD Panel sample (128,000 out of the 293,000 unlinked records) would not have a corresponding record in the 2016 Census. This would indicate that the initial linkage rate of 76% could be representative of up to 85% of the population that actually had an opportunity to be linked.
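The arithmetic behind these estimates can be reproduced with the rounded figures quoted in this section. The sketch below assumes, as the text does, that 5% of deaths and emigrants fall in the Panel sample and that the 1% net undercount rate applies uniformly to the Panel; rounding each component to the nearest thousand is an illustrative choice.

```python
PANEL = 1_221_057           # records in the 2011 Panel sample
LINKED = 927_520            # records linked to a 2016 Census record
SAMPLING_FRACTION = 0.05    # assumed share of deaths and emigrants in the Panel
DEATHS = 913_000            # approximate deaths, 2011 to 2016
EMIGRANTS = 1_400_000       # approximate permanent emigrants, 2011 to 2016
NET_UNDERCOUNT_RATE = 0.01  # 2016 Census net undercount rate


def to_nearest_thousand(value: float) -> int:
    return int(round(value, -3))


est_deaths = to_nearest_thousand(DEATHS * SAMPLING_FRACTION)
est_emigrants = to_nearest_thousand(EMIGRANTS * SAMPLING_FRACTION)
est_missed = to_nearest_thousand(PANEL * NET_UNDERCOUNT_RATE)

# Panel records unlikely to have a corresponding 2016 Census record.
no_2016_record = est_deaths + est_emigrants + est_missed

unlinked = PANEL - LINKED
share_of_unlinked = no_2016_record / unlinked

# Linkage rate among records that had an opportunity to be linked.
adjusted_linkage_rate = LINKED / (PANEL - no_2016_record)

print(est_deaths, est_emigrants, est_missed, no_2016_record)
print(f"{share_of_unlinked:.1%} of unlinked, adjusted rate {adjusted_linkage_rate:.1%}")
```

These are upper-bound style estimates: the three groups may overlap, and the sampling and undercount assumptions are simplifications.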

3.4 WEIGHTING

Weighting is the process of adjusting a sample to infer results for the relevant population. To do this, a 'weight' is allocated to each sample unit - in this case, persons. The weight can be considered an indication of how many people in the relevant population are represented by each person in the sample. Weights were created for linked records in the ACLD to enable longitudinal population estimates to be produced. Cross-sectional population estimates for 2011 and 2016 are available from each Census.

The 2011 Panel of the ACLD is a random sample of 5% of the Australian population in 2011. As such, each person in the sample should represent about 20 people in the population. Between Censuses, however, the in scope population changes as people die or move overseas. In addition, Census net undercount and data quality can affect the capacity to link equivalent records across waves. The weights of the linked records on the ACLD were calibrated to the estimated population that was in scope of both the 2011 and 2016 Censuses, 21,080,214 persons. The weights were based on four components: the design weight, undercoverage adjustment, missed link adjustment and population benchmarking.

The mean final weight for the linked records is 22.3 for females and 23.2 for males. The weights range between 14.8 and 83. The mean weight was higher for Aboriginal and Torres Strait Islander persons and for people in the Northern Territory.
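The mean weights by sex can be approximately recovered from Table 4 by dividing the weighted linked counts by the unweighted linked counts. The sketch below does this using the published (perturbed) figures, and also shows the nominal design weight implied by a 5% sample; it is a consistency check rather than a description of how the weights were derived.

```python
# (weighted linked results, linked results) copied from Table 4.
TABLE_4_SEX = {
    "Male": (10_440_753, 450_092),
    "Female": (10_639_417, 477_426),
}

# Nominal design weight for a 5% random sample: about 20 people per record.
design_weight = 1 / 0.05

# Mean final weight = weighted population / number of linked records.
mean_weight = {
    sex: weighted / linked
    for sex, (weighted, linked) in TABLE_4_SEX.items()
}

for sex, weight in mean_weight.items():
    print(f"{sex}: mean weight {weight:.1f} (design weight {design_weight:.0f})")
```

The gap between the design weight and the mean final weights reflects the subsequent adjustments for undercoverage and missed links and the calibration to population benchmarks.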

The population benchmark is based on the 2016 Estimated Resident Population (ERP), which is adjusted by the estimated probability that a person was also in Australia in 2011. This probability is formed using the 2016 Census reported address five years ago variable. Further information on this approach can be found in the paper Chipperfield, Brown & Watson (2016). See References section for details of this publication.

For more information about weighting please refer to the Appendix.