2940.0 - Census of Population and Housing - Details of Undercount, 2011  
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 21/06/2012   

LINKING AND MATCHING


OVERVIEW

While the PES questionnaire collected information on whether a person was counted in the Census, this information was used only as a means of sequencing respondents through the questionnaire. Whether someone was missed, counted or counted more than once was determined through a linking and matching exercise, in which connections between PES information and the related Census information were established. This exercise combined automated and manual processes to link and match close to 100,000 PES records to their counterparts among around 22 million Census records.

This section describes the various processes that were used in the 2011 PES, beginning with input processing (where the data were prepared for linking) through to the final matching outcomes.


INPUT EDITING - ADDRESS CODING

Address information was essential for matching people between the PES and Census. This was facilitated by identifying and coding all addresses collected in the PES to the Australian Statistical Geography Standard (ASGS). Addresses were coded to a Collector Workload (CLW) and a Statistical Area Level 1 (SA1) by an automated program, the AddressCoder@ABS application.

Address coding was even more important to the processing of the 2011 PES than in previous cycles. The introduction of Automated Data Linking (ADL) made it necessary to have a Census enumeration area that could be used as a filtering variable for a number of the ADL runs, requiring the positioning of each address within a single SA1 (where possible).

In addition to this coding, the accuracy and consistency of other address elements (such as street names, suburbs and postcodes) had to be checked. The CLW was also important for subsequent clerical match and search processing, as it was the default starting point for clerical dwelling and person searching.

PES addresses were divided into two categories:

  • Enumeration Addresses - the address at which the PES interview took place; and
  • Search Addresses - including the usual address of visitors, the address at which PES respondents were located on Census night, the address at which respondents were included on a Census form, and any other addresses where the respondent may have been included on a Census form.

The PES allowed up to seven search addresses to be recorded; however, the greatest number of search addresses recorded in the field for a single respondent in 2011 was two. Search addresses comprised around 10% of the total number of addresses recorded in the PES.

Table 16 shows that for every enumeration state except the ACT (46.8%), approximately 70-85% of search addresses were located within the same state as the enumeration address (ranging from 68.8% for Tasmania to 85.5% for Western Australia). This allowed PES respondents to be linked to their Census location in a state-based run of ADL. The remainder were linked in non-state-based ADL runs, and were distributed predominantly throughout the three most populous states of New South Wales, Victoria and Queensland.

16 SEARCH ADDRESS STATE, State of enumeration by state of search address - 2011

                    State of search address
Enumeration address     NSW   Vic.   Qld.     SA     WA   Tas.     NT    ACT
                          %      %      %      %      %      %      %      %

NSW                    78.0    4.6   10.6    0.9    2.5    0.2    1.3    1.9
Vic.                    7.8   75.0   10.6    1.6    2.7    1.1    0.6    0.6
Qld.                    8.0    3.7   83.9    0.8    1.4    1.1    0.7    0.4
SA                      5.0    4.7    7.6   76.1    3.8    0.8    1.1    1.1
WA                      5.2    3.9    3.2    0.5   85.5    0.4    1.2    0.1
Tas.                    4.1    8.7   11.9    2.2    3.9   68.8    0.2    0.2
NT                      3.1    5.7    6.3    2.2    3.6    0.7   78.0    0.4
ACT                    28.7    8.5    9.4    1.2    3.5    0.3    1.8   46.8



Search address data were collected directly from PES respondents and related to locations at which they were present up to two months previously. As such, the detail and accuracy of this information varied, ranging from perfectly spelt out addresses with street number, suburb, city and postcode, to 'vague addresses' such as "a motel in Sydney". Therefore, in order to code search addresses successfully, an additional two-stage process was carried out, as detailed below.

Address repair was conducted on all search addresses, that is, any address given in the PES that differed from the enumeration address. This was done manually by a team of coders who reviewed the address text fields and amended them through a variety of techniques. Quality assurance was then conducted.

Address coding was undertaken after address repair with the aim of identifying the correct geographic areas (Meshblock [MB], CLW and SA1) for all addresses (enumeration and search addresses), according to the ASGS. This was done by first running all addresses through the AddressCoder@ABS application. Quality assurance for this automated process involved a complete review of the addresses that were amended by the automated coder in order to fit into a geographic classification, and retention of all original addresses. Those records which were not automatically coded were then sent to a coding team for manual processing. This manual process utilised various methods, including mapping software, to thoroughly scrutinise addresses and achieve the most accurate geographic coding possible. Further quality assurance was then undertaken.


INPUT EDITING - ITEM DERIVATIONS

Most data on the PES file were of a sufficient quality to feed into both linking and matching processes and later output processing, without further detailed editing. However, certain validation processes highlighted issues that required amendments to be made.

Derivations were used to correct Age/Date of Birth (DOB) and Marital Status responses. Where one data field was missing (e.g. Age), but a similar one was available (e.g. DOB), the missing field was derived and populated. Derivations were also created by examining individual 'person level' records to derive 'dwelling level' information for the relevant dwelling (e.g. the number of Usual Residents and Visitors in the dwelling, or whether the dwelling contains any Indigenous respondents).
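
The Age/DOB derivation described above can be sketched in a few lines of Python. This is an illustration only: the function name `derive_age` and the reference date (Census night, 9 August 2011) are assumptions, as the source does not state which reference date or rules the PES derivations used.

```python
from datetime import date
from typing import Optional

# Assumed reference date for the illustration (Census night 2011).
REFERENCE_DATE = date(2011, 8, 9)


def derive_age(dob: Optional[date], age: Optional[int],
               ref: date = REFERENCE_DATE) -> Optional[int]:
    """Return the reported age, or derive it from DOB when age is missing."""
    if age is not None:
        return age          # reported value is kept as-is
    if dob is None:
        return None         # nothing available to derive from
    # Subtract one year if the birthday has not yet occurred by the reference date.
    had_birthday = (ref.month, ref.day) >= (dob.month, dob.day)
    return ref.year - dob.year - (0 if had_birthday else 1)
```

For example, under these assumptions a respondent born on 10 August 1980 with no reported age would be assigned an age of 30.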


INPUT EDITING - STANDARDISATION

In preparation for ADL, PES data were repaired and standardised through a three-stage process, converting them into a format that could then be compared with similarly standardised Census data through both automated and manual systems.
  • Data Repair was conducted to clean the data by removing non-alphabetic characters and capitalising the remainder, and by removing additional spaces.
  • Name standardisation involved converting common nicknames, abbreviations, misspellings or variations on a name to their 'origin name' (e.g. Beth, Eliza or Libby were converted to Elizabeth).
  • Data transformation/recoding was undertaken to ensure that each variable was comparable to its Census counterpart (e.g. ensuring PES numeric identifiers for Indigenous status matched those of the Census). Additional variables were then created from the existing PES data.
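
The first two stages above can be illustrated with a short Python sketch. The lookup table contains only the example given in the text (Beth, Eliza and Libby mapping to Elizabeth); the function names and the table structure are assumptions for illustration, not the ABS implementation.

```python
import re

# Hypothetical nickname table, populated only with the example from the text.
ORIGIN_NAMES = {"BETH": "ELIZABETH", "ELIZA": "ELIZABETH", "LIBBY": "ELIZABETH"}


def repair(text: str) -> str:
    """Data repair: drop non-alphabetic characters, capitalise, collapse spaces."""
    letters_only = re.sub(r"[^A-Za-z ]", "", text)
    return " ".join(letters_only.upper().split())


def standardise_first_name(name: str) -> str:
    """Name standardisation: map a repaired name to its 'origin name' if known."""
    cleaned = repair(name)
    return ORIGIN_NAMES.get(cleaned, cleaned)
```

With this sketch, `standardise_first_name("libby.")` gives `"ELIZABETH"`, while a name not in the table is returned in its repaired form.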


AUTOMATED DATA LINKING - LINKING

Automated Data Linking refers to the use of probabilistic linking methods to determine possible links between Census and PES data in an automated fashion, and was used as the primary linking method in 2011. Its introduction followed an evaluation exercise undertaken by linking experts within the ABS after the 2006 PES.

ADL uses a range of personal and address characteristics to evaluate the likelihood that a PES record and a Census record pertain to the same individual. The software used in both the 2006 quality study and the 2011 PES was Freely Extensible Biomedical Record Linkage (FEBRL), which was developed at the Australian National University.

ADL provided the opportunity to match persons in the 2011 PES with those in the 2011 Census who would have previously been too difficult to match, given the constraints of prior technology and processes. The key gains in matching effectiveness and efficiency provided by ADL in 2011 included:
  • the ability to conduct a more comprehensive search for PES respondents than was possible from previous clerical matching processes;
  • the ability to locate PES respondents at Census night addresses that were not identified in the PES; and
  • a reduced requirement for clerical matching resources.

A number of different linking runs were used in 2011 to compare PES and Census records, each of which focused on a slightly different combination of name, address and demographic variables. At the beginning of each run, a list of PES and Census records was obtained by selecting a subset of the PES and Census datasets based upon agreement on a small number of variables. This process, called 'blocking', was used to stratify identified links (i.e. links at earlier runs took precedence), and to reduce the quantity of poor quality links returned in each run. Table 17 shows the ADL runs and the relevant 'blocking' fields used in each run.

17 ADL Runs and relevant blocking fields

ADL run   Blocking field

1A        SA1 (Statistical Area Level 1)
1B        CLW (Collector Workload)
2         Postcode, Year of birth
3         State, Initial letter of standardised first name, Initial letter of surname, Marital status
4         Date of Birth (Day, Month, Year), Marital status
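
Blocking, as described above, restricts comparisons to record pairs that agree on a small set of fields. The Python sketch below is a generic illustration of that idea, not the FEBRL implementation; the record layout (dictionaries with an `id` key) and the field names are assumptions.

```python
from collections import defaultdict


def candidate_pairs(pes_records, census_records, blocking_fields):
    """Return (PES id, Census id) pairs that agree on every blocking field."""
    # Index Census records by their blocking key.
    index = defaultdict(list)
    for rec in census_records:
        key = tuple(rec.get(f) for f in blocking_fields)
        if None not in key:              # records missing a blocking field are skipped
            index[key].append(rec["id"])

    pairs = []
    for rec in pes_records:
        key = tuple(rec.get(f) for f in blocking_fields)
        if None in key:
            continue
        for census_id in index.get(key, []):
            pairs.append((rec["id"], census_id))
    return pairs
```

A run blocked on postcode and year of birth (run 2 in Table 17) would correspond to `blocking_fields=["postcode", "year_of_birth"]` in this sketch.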



Potential links were then assessed by assigning weights that reflected the level of agreement on selected data items from the two records. Large positive weights indicated probable matches, while large negative weights indicated probable non-matches. These weights were then grouped and organised in the CARDS and DLR processes, described in the next section.
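
A common way to produce weights of this kind in probabilistic linking is the Fellegi-Sunter approach, in which each compared field contributes a log-likelihood ratio. The source does not state the exact weighting used in 2011, so the Python sketch below is an assumption-laden illustration: the fields and the m- and u-probabilities are hypothetical values chosen only to show the sign behaviour described above.

```python
import math

# Hypothetical (m, u) probabilities per field:
#   m = P(fields agree | records are the same person)
#   u = P(fields agree | records are different people)
M_U = {"surname": (0.95, 0.01), "year_of_birth": (0.98, 0.02), "sex": (0.99, 0.5)}


def link_weight(pes, census, m_u=M_U):
    """Sum agreement/disagreement weights over the fields present in both records."""
    total = 0.0
    for field, (m, u) in m_u.items():
        if pes.get(field) is None or census.get(field) is None:
            continue                                  # missing values contribute nothing
        if pes[field] == census[field]:
            total += math.log2(m / u)                 # agreement: positive weight
        else:
            total += math.log2((1 - m) / (1 - u))     # disagreement: negative weight
    return total
```

Pairs agreeing on all fields receive a large positive total, and pairs disagreeing on all fields a large negative one.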


AUTOMATED DATA LINKING - CARDS AND DLR

Important to the effective use of ADL was a series of processes run after ADL output was obtained. The Collect, Analyse, Reduce, De-duplicate and Systematise (CARDS) process collated, processed, identified and rated the most plausible links from each ADL run for all PES respondents. The process then combined the person links from all ADL runs and removed any duplicates. The resulting output was a single numeric 'Person Link Rating' (PLR) for each individual linked pair (a PES respondent and a Census respondent), ranging from 0.1 to 10.0, based upon agreement on various characteristics.

Person links were then grouped into Platinum, Silver and Tin categories, based on their PLRs.
  • Platinum - those links which were so strong that clerical examination was not required;
  • Silver - those which were convincing links, but required some clerical review; and
  • Tin - those which were linked on broad fields (e.g. surname and age) and which were not considered informative.
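
The grouping of person links by PLR can be sketched as a simple threshold rule. The source gives the PLR range (0.1 to 10.0) but not the cut-offs between categories, so the thresholds below are hypothetical placeholders, as is the function name.

```python
def plr_category(plr, platinum_min=9.0, silver_min=5.0):
    """Assign a person link to Platinum, Silver or Tin (thresholds are hypothetical)."""
    if not 0.1 <= plr <= 10.0:
        raise ValueError("PLR is expected to lie between 0.1 and 10.0")
    if plr >= platinum_min:
        return "Platinum"   # strong enough that no clerical examination is needed
    if plr >= silver_min:
        return "Silver"     # convincing, but sent for clerical review
    return "Tin"            # linked on broad fields only
```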

The CARDS process concluded by identifying and rating dwelling links through the Dwelling Link Rating system. In order to identify dwelling links, all person links within one PES dwelling were grouped together into a 'dwelling'. Dwelling links were then created between that PES dwelling and the Census dwelling(s) of the linked Census respondents. A 'Dwelling Link Rating' (DLR) was then assigned to each dwelling link based on the number of people linked between the PES and Census dwellings proportional to the number (if any) that were not linked, and the PLRs of the links.

Similar to the person links, dwelling links were then stratified into Platinum, Silver and Tin categories based upon their DLRs, allowing strong links (e.g. those with many person links and high PLRs) to be investigated before weaker links (e.g. with few person links and low PLRs). For a dwelling link to be rated as Platinum, all its persons had to have a Platinum PLR and be linked to Census persons within a single dwelling. If there were missing people in either the PES or Census dwelling, or not every person had a Platinum link, the maximum rating the dwelling could be assigned was Silver. As with person links, the remainder of dwellings were placed into either Silver or Tin, based on the quality of the person links within.
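
The Platinum rule for dwelling links can be expressed as a small check. The sketch below returns only the highest rating a dwelling link could receive under the rule stated above; it does not compute the DLR itself, whose formula the source does not give. The data structures and names are assumptions for illustration.

```python
def dwelling_rating_ceiling(pes_people, census_dwelling_people, links):
    """Return 'Platinum' if the dwelling link meets the Platinum rule, else 'Silver'.

    pes_people: set of PES person ids in the PES dwelling
    census_dwelling_people: set of Census person ids in the linked Census dwelling
    links: dict mapping PES person id -> (Census person id, person link category)
    """
    linked_census = set()
    for pes_id in pes_people:
        link = links.get(pes_id)
        if link is None:
            return "Silver"                 # a PES person has no link
        census_id, category = link
        if category != "Platinum":
            return "Silver"                 # not every person link is Platinum
        if census_id not in census_dwelling_people:
            return "Silver"                 # linked to a different Census dwelling
        linked_census.add(census_id)
    if linked_census != set(census_dwelling_people):
        return "Silver"                     # a Census person is left unlinked
    return "Platinum"
```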


PROCESSING IN THE PES MATCH AND SEARCH SYSTEM (MSS)

While ADL was the next step in the evolution and continual improvement of PES processing, it could not entirely replace the clerical decision-making process that had previously been at the core of PES processing. Clerical judgment will always be required to resolve the more complex or ambiguous cases and to quality assure automated processes. Some adjustments to the clerical match and search processes were necessary in 2011 to ensure that the relative strengths of both ADL and the MSS were fully realised.

The MSS was the main PES clerical review facility and was specifically built for PES processing in 2006. In 2011, the MSS again allowed processing staff to clerically search, view, compare, and record matches between PES and Census data. PES processing staff used the MSS to record clerical matches of dwellings and people between PES and Census, and to clerically search for people on Census forms at alternative addresses provided in the PES. In 2011, it was also used to assure the quality of ADL output.

The initial phase of MSS processing involved confirming whether the ADL output was correct. Once a dwelling link was confirmed, the Census person records for that dwelling were clerically compared with the PES person records. The information compared included name, sex, date of birth, age, marital status, Indigenous status and country of birth. The extent to which each of these variables was the same, in both the PES and the Census, determined the ADL match status of the pair and the level of match.


AUTOMATED DATA LINKING - LINK UPGRADING

Link Upgrading was a process of secondary examination after the main runs of ADL and MSS clerical review were completed for each state. Once MSS had been run on the Silver links for each state, the highest rated Tin links for those PES people who were not matched were extracted (i.e. effectively upgraded) and entered into a second run of MSS processing.
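
The extraction step in Link Upgrading can be sketched as follows. The sketch assumes that a single highest-rated Tin link is taken per unmatched PES person; the source does not say how many links per person were upgraded, and the record layout is hypothetical.

```python
def links_to_upgrade(tin_links, matched_pes_ids):
    """Pick the highest-PLR Tin link for each PES person not yet matched.

    tin_links: iterable of dicts with keys 'pes_id', 'census_id' and 'plr'
    matched_pes_ids: set of PES person ids already matched in earlier processing
    """
    best = {}
    for link in tin_links:
        pes_id = link["pes_id"]
        if pes_id in matched_pes_ids:
            continue                          # already matched, nothing to upgrade
        current = best.get(pes_id)
        if current is None or link["plr"] > current["plr"]:
            best[pes_id] = link               # keep the strongest Tin link so far
    return best
```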


INTENSIVE SEARCH ACTIVITIES

Once all ADL links were reviewed, the final phase of MSS processing was to conduct an intensive clerical search for persons and dwellings not matched as a result of ADL-enabled processing. This was done by searching CLWs (and neighbouring CLWs) for addresses provided by respondents during the PES interview (search addresses), in order to locate possible Census forms where that person was included. This followed 2006 methodology, which is described in Census of Population and Housing - Undercount, 2006 (cat. no. 2940.0) and Census of Population and Housing - Details of Undercount, 2006 (cat. no. 2940.0).


MSS QUALITY ASSURANCE AND ADJUDICATION PROCESSES

To ensure the accuracy of MSS processing, quality assurance (QA) procedures were used in the match and search process whereby all PES records processed in MSS were processed a second time by a different clerk. There was no identifier on the workloads that allowed the PES processors to know whether they were processing an 'original' or a QA workload. Where the initial and the QA processing outcomes corresponded, the initial match status was accepted. Where there was a discrepancy between the initial match status and the QA match status, the records were flagged for adjudication by a senior officer who reviewed all information and determined which match status was correct. Where both the initial and QA records were deemed to be inaccurate, the adjudicator reprocessed the record.
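
The decision rule for combining the initial and QA outcomes can be summarised in a short sketch. The adjudicator is represented by a caller-supplied function, since the adjudication itself was a clerical judgment; all names here are hypothetical.

```python
def resolve_match_status(initial, qa, adjudicate):
    """Accept the initial status when QA agrees; otherwise defer to adjudication.

    adjudicate: function taking (initial, qa) and returning the final status.
    It may return either input, or a new status if both were judged inaccurate.
    """
    if initial == qa:
        return initial
    return adjudicate(initial, qa)
```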

The QA process was also useful in identifying potential processing issues or areas where processors were having difficulty. This allowed ongoing feedback to be provided to the PES processors and contributed to the overall quality assurance of PES processing.


DISCRETE INDIGENOUS COMMUNITY PROCESSING

MSS processing for discrete Indigenous communities followed the 2006 approach and involved searching the entire community for a person match, rather than just searching within a single dwelling. Person matching in discrete Indigenous communities used the same rules for determining a match as in the mainstream component, but allowed for the use of up to two alternate names for each person when matching on name.


CONFIDENCE OF MATCH DECISIONS

Table 18 shows the matching outcomes from the 2011 PES linking and matching processing. Of the 87,645 total mainstream matches, 52,398 (or 59.8%) were matched without clerical review and 34,653 (or 39.5%) were matched after clerical review of ADL links, with the remaining 594 (or 0.7%) matched as a result of intensive search processing.

18 MATCHING OUTCOMES(a) - 2011

                                               Matches
                                                   no.       %

Mainstream
  Matched
    Used in estimation
      ADL Platinum (not clerically reviewed)    51 791    59.9
      ADL Silver (clerically reviewed)          34 081    39.4
      Intensive search                             549     0.6
    Not used in estimation
      ADL Platinum (not clerically reviewed)       607    49.6
      ADL Silver (clerically reviewed)             572    46.7
      Intensive search                              45     3.7
    Total matched
      ADL Platinum (not clerically reviewed)    52 398    59.8
      ADL Silver (clerically reviewed)          34 653    39.5
      Intensive search                             594     0.7
  Not matched                                    6 894     7.1
Total mainstream                                94 539   100.0
ICF
  Matched                                        2 528    87.1
  Not matched                                      373    12.9
Total ICF                                        2 901   100.0
Total                                           97 440   100.0

(a) This table includes multiple matches for persons matched more than once. Therefore, totals do not sum to the total number of matched persons.
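
The percentage shares for total mainstream matches can be reproduced from the counts in Table 18. The Python check below uses only figures from the table; the variable names are illustrative.

```python
# Counts from the 'Total matched' rows of Table 18.
total_matched = {
    "ADL Platinum (not clerically reviewed)": 52398,
    "ADL Silver (clerically reviewed)": 34653,
    "Intensive search": 594,
}

matched = sum(total_matched.values())          # 87,645 mainstream matches
shares = {k: round(100 * v / matched, 1) for k, v in total_matched.items()}

# Adding the 6,894 unmatched records gives the total mainstream count.
total_mainstream = matched + 6894              # 94,539
```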



STATISTICAL IMPACT STUDY PROCESSING

In order to assess the impact of ADL on 2011 PES estimates, a Statistical Impact Study was conducted during linking and matching processing. For further information see the Statistical Impact of ADL Technical Note (in Explanatory Notes).