2080.5 - Information Paper: Australian Census Longitudinal Dataset, Methodology and Quality Assessment, 2006-2011  
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 18/12/2013   

2.5 DECISION MODEL

Even where the original data is of very high quality, the information on equivalent records may not be identical across all the blocking and linking variables. For this reason, several ‘passes’ are used to maximise the opportunity for equivalent records to be linked, with a different combination of blocking and linking variables for each pass. Records that are not linked on one pass are included in the pool of possible links for the next pass.

In deterministic linking, an exact match is required on each of the variables specified in the blocking and linking strategy (see Table 1). Using this approach, links were only accepted if a single record pair was identified. Where a record was included in more than one possible pair, it was returned to the pool of unlinked records for subsequent passes.
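The unique-pair rule and the return of ambiguous records to the pool can be sketched as follows. This is a minimal illustration only, not the ABS implementation: the field names (`name`, `dob`, `area`), the toy records and the choice of variables in each pass are hypothetical, and the variables actually used are those in Table 1.

```python
from collections import defaultdict


def deterministic_pass(file_a, file_b, keys):
    """Link records that agree exactly on every variable in `keys`.

    A link is accepted only when the record pair is the single possible
    pair for both records; ambiguous records stay unlinked.
    """
    index_a = defaultdict(list)
    index_b = defaultdict(list)
    for rec_id, rec in file_a.items():
        index_a[tuple(rec[k] for k in keys)].append(rec_id)
    for rec_id, rec in file_b.items():
        index_b[tuple(rec[k] for k in keys)].append(rec_id)

    links = {}
    for key, ids_a in index_a.items():
        ids_b = index_b.get(key, [])
        if len(ids_a) == 1 and len(ids_b) == 1:
            links[ids_a[0]] = ids_b[0]
    return links


def run_passes(file_a, file_b, passes):
    """Run each pass on the records left unlinked by earlier passes."""
    links = {}
    remaining_a, remaining_b = dict(file_a), dict(file_b)
    for keys in passes:
        new_links = deterministic_pass(remaining_a, remaining_b, keys)
        links.update(new_links)
        for a_id, b_id in new_links.items():
            del remaining_a[a_id]
            del remaining_b[b_id]
    return links


# Hypothetical toy files: the two "bob" records are ambiguous on name and
# date of birth alone, so they are resolved only when area is added.
file_a = {
    "A1": {"name": "ann", "dob": "1980-01-01", "area": "X"},
    "A2": {"name": "bob", "dob": "1975-05-05", "area": "Y"},
    "A3": {"name": "bob", "dob": "1975-05-05", "area": "Z"},
}
file_b = {
    "B1": {"name": "ann", "dob": "1980-01-01", "area": "X"},
    "B2": {"name": "bob", "dob": "1975-05-05", "area": "Y"},
    "B3": {"name": "bob", "dob": "1975-05-05", "area": "Z"},
}
passes = [("name", "dob"), ("name", "dob", "area")]
all_links = run_passes(file_a, file_b, passes)
```

In this toy example the first pass links only the "ann" records; the two "bob" records are returned to the pool and linked in the second pass.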

In probabilistic linking, once record pairs are generated, a decision rule determines whether each record pair is linked, not linked or considered further as a possible link. In the first phase of this process, which is automated, each record is assigned to its best possible pairing. This process is known as one-to-one assignment. Ideally (and often in practice) each record has a single, obvious best pairing, which is its true match.

Probabilistic linking projects in the ABS have typically used an auction algorithm to optimally assign one record on the first dataset to one record on the second dataset. The auction algorithm maximises the sum of all the record pair comparison weights across alternative assignment choices. For example, if record A1 on File A links well to records B1 and B2 on File B, but record A2 links well to B2 only, the auction algorithm will assign A1 to B1 and A2 to B2, to maximise the overall comparison weights for all record pairs.
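The A1/A2 example can be worked through with a small sketch. Note that this is not an auction algorithm: for illustration it uses an exhaustive search over assignments, which pursues the same objective (the largest total comparison weight) but is only practical for a handful of records. The weights are hypothetical, and the sketch assumes that candidate weights are positive and that File B has at least as many records as File A.

```python
from itertools import permutations


def one_to_one(weights, ids_a, ids_b):
    """Return the one-to-one assignment with the largest total weight.

    `weights` maps (a_id, b_id) to a comparison weight; pairs absent from
    `weights` are not candidate links and are never assigned.
    """
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(ids_b, len(ids_a)):
        pairs = [(a, b) for a, b in zip(ids_a, perm) if (a, b) in weights]
        total = sum(weights[p] for p in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs, best_total


# Hypothetical weights: A1 links well to both B1 and B2, A2 to B2 only.
weights = {("A1", "B1"): 20.0, ("A1", "B2"): 22.0, ("A2", "B2"): 18.0}
pairs, total = one_to_one(weights, ["A1", "A2"], ["B1", "B2"])
```

Assigning each record greedily to its single highest-weight partner would pair A1 with B2 and leave A2 unlinked (total weight 22.0), whereas the assignment that maximises the total pairs A1 with B1 and A2 with B2 (total weight 38.0).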

The second phase of the probabilistic decision rule stage takes the output of one-to-one assignment and decides which pairs should be retained as links, and which should be rejected as non-links. This is done by defining cut-off weights against which record pair comparison weights are evaluated. The simplest decision rule uses a single cut-off such that all record pairs with a weight greater than or equal to the cut-off are assigned as links, and all those pairs with a weight less than the cut-off are assigned as non-links. To establish the cut-off value, a sample of the record pairs is clerically reviewed. This provides the opportunity to ascertain the level of quality at each link weight and enables an estimate of the number of false links.
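The single cut-off rule amounts to one comparison per record pair. A minimal sketch, with hypothetical record identifiers, weights and cut-off value:

```python
def apply_single_cutoff(pairs, cutoff):
    """Split (a_id, b_id, weight) tuples into links and non-links.

    Pairs with a weight greater than or equal to the cut-off are links;
    pairs with a weight below the cut-off are non-links.
    """
    links = [p for p in pairs if p[2] >= cutoff]
    non_links = [p for p in pairs if p[2] < cutoff]
    return links, non_links


# Hypothetical output of one-to-one assignment.
assigned = [
    ("A1", "B1", 31.5),
    ("A2", "B2", 24.0),
    ("A3", "B3", 17.2),
    ("A4", "B4", 24.1),
]
links, non_links = apply_single_cutoff(assigned, cutoff=24.0)
```

Here the pair with a weight exactly equal to the cut-off (A2, B2) is retained as a link, and only (A3, B3) is rejected.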

A more sophisticated decision rule employs lower and upper cut-off weights. Record pairs with a link weight above the upper cut-off are declared links, while those with a weight below the lower cut-off are declared non-links. Record pairs with weights between the lower and upper cut-offs are not automatically assigned a status. Instead, they are designated for clerical review, in which a judgement about the link status is made manually for each record pair.
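A minimal sketch of the two cut-off rule follows. The cut-off values are hypothetical, and the sketch assumes that a weight exactly equal to either cut-off is sent to clerical review, since only weights strictly above the upper cut-off or strictly below the lower cut-off are decided automatically.

```python
def classify(weight, lower, upper):
    """Assign a status to a record pair under lower and upper cut-offs."""
    if weight > upper:
        return "link"
    if weight < lower:
        return "non-link"
    return "clerical review"


# Hypothetical cut-offs and weights.
statuses = [classify(w, lower=15.0, upper=25.0) for w in (30.0, 25.0, 20.0, 10.0)]
```

The width of the band between the two cut-offs determines how many record pairs require manual inspection.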

As clerical review is a time- and labour-intensive element of data linkage projects, not all record pairs can be individually reviewed to determine their match status. While it is critical to examine a selection of record pairs manually to assess the quality of the automated linkage process and prepare for the next pass, it is also important to manage the resource load so as to achieve the best value for effort.

For the ACLD project, a sample of record pairs was clerically reviewed to set a single cut-off in each pass. The single cut-off weight was set at a point where the review showed this was adequate to assign a high proportion of links with high accuracy. In this case, no further clerical review would be performed and unlinked records proceeded to the next pass.
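One way a clerically reviewed sample could inform the choice of a single cut-off is sketched below. This is an assumption for illustration only: the paper does not state the rule used to pick the cut-off point, and the accuracy target, sample weights and review outcomes here are hypothetical. The sketch returns the lowest weight at which the sampled pairs at or above that weight still meet the target proportion of true matches.

```python
def choose_cutoff(sample, min_accuracy):
    """Pick a cut-off from clerically reviewed (weight, is_match) pairs.

    Returns the lowest weight w such that the proportion of true matches
    among sampled pairs with weight >= w is at least `min_accuracy`,
    or None if no weight meets the target.
    """
    ordered = sorted(sample, key=lambda s: s[0], reverse=True)
    chosen = None
    reviewed = matches = 0
    i = 0
    while i < len(ordered):
        weight = ordered[i][0]
        # Take all sampled pairs tied at this weight together.
        while i < len(ordered) and ordered[i][0] == weight:
            reviewed += 1
            matches += ordered[i][1]
            i += 1
        if matches / reviewed >= min_accuracy:
            chosen = weight
    return chosen


# Hypothetical clerical review outcomes.
sample = [
    (30, True), (28, True), (26, True), (24, True),
    (22, False), (20, True), (18, False), (16, False),
]
cutoff = choose_cutoff(sample, min_accuracy=0.9)
```

With these outcomes the cut-off is 24: including the pairs at weight 22 or lower would bring the sampled accuracy below the 0.9 target.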

There are some limitations with using a single cut-off. For example, adults for whom a wide range of Census characteristics (such as occupation or educational attainment) is collected will generally have a higher linkage weight than children (for whom there is limited information). Thus, an adult record pair could be positioned above, and a child record pair below, the clerical cut-off, even when the adult link is false and the child link is true. Linkage weights for children may also give a false estimate of quality when compared with adult records at the same linkage weight. This could lead to an output bias and an under-representation of children in the final output file. The same considerations apply to adult records that have different amounts of missing information, for example, a record pair for an employed person (for whom there is a range of employment-related information) compared with a record pair for someone who was not in the labour force. To mitigate these factors, specifically targeted passes were conducted throughout the process, and separate record samples were drawn to determine accurate cut-offs for various sub-populations.
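The effect of separate cut-offs for sub-populations can be illustrated with a small sketch. The group labels, weights and cut-off values are hypothetical and are not taken from the ACLD; the sketch only shows how a child record pair that falls below a single cut-off can be retained when its own group has a lower cut-off.

```python
def apply_group_cutoffs(pairs, cutoffs):
    """Keep pairs whose weight meets the cut-off for their own group.

    `pairs` is a list of dicts with 'id', 'group' and 'weight' keys;
    `cutoffs` maps each group label to its cut-off weight.
    """
    return [p for p in pairs if p["weight"] >= cutoffs[p["group"]]]


# Hypothetical record pairs: children have fewer variables to agree on,
# so their weights are lower overall.
candidates = [
    {"id": "P1", "group": "adult", "weight": 26.0},
    {"id": "P2", "group": "child", "weight": 15.0},
    {"id": "P3", "group": "adult", "weight": 19.0},
    {"id": "P4", "group": "child", "weight": 9.0},
]

single = apply_group_cutoffs(candidates, {"adult": 20.0, "child": 20.0})
by_group = apply_group_cutoffs(candidates, {"adult": 22.0, "child": 12.0})
```

Under the single cut-off only P1 is retained and no child pair is linked; with group-specific cut-offs, P1 and the child pair P2 are both retained.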

In clerical review, each sampled record pair was manually inspected to resolve its match status. A clerical reviewer is often able to use information which cannot be captured in the automated comparison process, such as common transcription errors (e.g. 1 and 7) or transposed information, such as the day of birth reported as the month or vice versa.

Along with the linking fields, supplementary information was also used to confirm a match. This included:

  • Non-linking fields such as Ancestry.
  • Frequency counts of personal characteristics, i.e. the number of people with the same date of birth in the same Mesh Block.
  • The dates of birth and ages of other members of the household.
  • Information on other members of the household who were listed on the Census form as temporarily absent.
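
A frequency count of the kind listed above could be produced along the following lines. This is a sketch under stated assumptions: the field names (`mesh_block`, `dob`) and the records are hypothetical, and the paper does not describe how the counts were actually derived.

```python
from collections import Counter


def dob_frequency(people):
    """Count people sharing the same date of birth within a Mesh Block."""
    return Counter((p["mesh_block"], p["dob"]) for p in people)


# Hypothetical person records.
people = [
    {"id": 1, "mesh_block": "MB001", "dob": "2004-03-07"},
    {"id": 2, "mesh_block": "MB001", "dob": "2004-03-07"},
    {"id": 3, "mesh_block": "MB001", "dob": "1961-11-30"},
    {"id": 4, "mesh_block": "MB002", "dob": "2004-03-07"},
]
counts = dob_frequency(people)
```

A count of one suggests that agreement on date of birth and Mesh Block is strong evidence for a match, whereas a count above one (as for the two people born on the same day in MB001) signals that other information is needed to tell the individuals apart.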

These supplementary fields helped to clarify difficult decisions, especially on record pairs belonging to children, allowing for greater insight into whether a record pair was an actual match or just contained similar demographic and personal characteristics for two different individuals.


