1504.0 - Methodological News, Sep 2007  
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 21/09/2007   

Acceptance Sampling Based Clerical Review in Probabilistic Matching

During the Census processing period, several quality studies are being conducted as part of the Census Data Enhancement (CDE) project. There are two types of quality study: the first assesses the feasibility and quality of linking without names and addresses, and the second helps improve ABS statistical outputs. As part of these studies, names and addresses as well as other variables are being used to link Census data with other selected data sets.

The method used is probabilistic linking where the aim is to link records that are believed to belong to the same person from two different data sets. This method has its foundations in the method proposed by Fellegi and Sunter (1969). In this method candidate record pairs are given a weight based on the degree of agreement between fields on the two records. Record pairs with a weight above some upper cut-off are declared links while those with a weight below some lower cut-off are declared non-links. However, there are many record pairs that cannot be automatically assigned a status and are designated for clerical review. Clerical review involves human assessment of each record pair to resolve match status.
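The weighting and cut-off step described above can be sketched as follows. This is a minimal illustration, not the CDE implementation: it assumes the common log-ratio form of Fellegi-Sunter weights, and the field names, the m- and u-probabilities, and the cut-off values are all hypothetical.

```python
import math

# Hypothetical m- and u-probabilities per field (illustrative values only):
#   m = P(field agrees | the two records belong to the same person)
#   u = P(field agrees | the two records belong to different people)
FIELD_PROBS = {
    "sex": (0.98, 0.50),
    "birth_year": (0.95, 0.02),
    "postcode": (0.90, 0.01),
}


def pair_weight(agreements):
    """Sum the log2 agreement or disagreement weight over all compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if agreements[field]:
            total += math.log2(m / u)              # agreement adds weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts weight
    return total


def classify(weight, lower_cutoff, upper_cutoff):
    """Assign a status from the pair weight and the two cut-offs."""
    if weight > upper_cutoff:
        return "link"
    if weight < lower_cutoff:
        return "non-link"
    return "clerical review"


if __name__ == "__main__":
    w = pair_weight({"sex": True, "birth_year": True, "postcode": False})
    print(round(w, 2), classify(w, lower_cutoff=0.0, upper_cutoff=10.0))
```

Pairs whose weight falls between the two cut-offs are the ones that go on to clerical review.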

Clerical review is a time-intensive stage of the data linking process, requiring a high level of VDU and keyboard equipment use. Some linkages can generate thousands, tens of thousands, or potentially even hundreds of thousands of record pairs for clerical review. The CDE project has implemented an acceptance sampling based approach to dramatically reduce the amount of clerical inspection.

Acceptance sampling is a well-established statistical method that replaces 100% inspection with inspection of samples selected from batches. The clerical review pairs are ordered by linkage weight and divided into batches. A sample is selected from each batch and each selected record pair is inspected. The number of matched and non-matched pairs in this selection is compared against a set of critical values. The entire batch is sentenced on the basis of these comparisons. If the number of matches observed is less than the lower critical value then the batch is assigned as non-links. If the number of matches observed is greater than the higher critical value then the batch is assigned as links. Otherwise the batch is assigned for complete manual clerical review.
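The batching and sentencing rule can be sketched as below. This is a minimal illustration under stated assumptions, not the software used in the CDE project: the function names, the record layout (a dict with a "weight" key), the use of simple random sampling within a batch, and the `inspect` callback standing in for the human reviewer are all hypothetical.

```python
import random


def make_batches(pairs, batch_size):
    """Order record pairs by linkage weight (highest first) and split into batches."""
    ordered = sorted(pairs, key=lambda p: p["weight"], reverse=True)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def sentence_batch(batch, sample_size, lower_crit, upper_crit, inspect, rng=random):
    """Sentence a whole batch from the inspection of a random sample.

    `inspect(pair)` stands in for the clerical reviewer and returns True
    when the reviewer judges the pair to be a match.
    """
    sample = rng.sample(batch, min(sample_size, len(batch)))
    matches = sum(1 for pair in sample if inspect(pair))
    if matches < lower_crit:
        return "non-link"
    if matches > upper_crit:
        return "link"
    return "full clerical review"


if __name__ == "__main__":
    # Toy data: the "is_match" flag plays the role of the reviewer's decision.
    rng = random.Random(0)
    pairs = [{"weight": rng.uniform(0, 10), "is_match": rng.random() < 0.5}
             for _ in range(200)]
    for batch in make_batches(pairs, batch_size=50):
        print(sentence_batch(batch, 10, 2, 8, lambda p: p["is_match"], rng))
```

Only batches sentenced as "full clerical review" need every pair inspected, which is where the saving in clerical effort comes from.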

Because sampling is used, there are a number of risks. Batches containing a high proportion of actual matches that would be assigned as links may, by chance, be sent for clerical review. Similarly, batches containing a low proportion of actual matches that would be assigned as non-links may also be sent for clerical review. Both these cases would result in wasted effort. Lastly, batches that should be clerically reviewed because they contain a relatively high proportion of both true matches and true non-matches may, by chance, be assigned as either links or non-links and not reviewed in full. However, careful selection of a clerical review threshold and sample size enables these risks to be quantified and controlled. Acceptance sampling based clerical review has provided an accurate and reliable means of assessing and setting the most appropriate clerical review bounds.
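One way to quantify these risks is to compute, for a batch with a given true number of matches, the probability of each sentencing outcome. The sketch below assumes simple random sampling without replacement within a batch, so the number of matches in the sample follows a hypergeometric distribution. The function name and all parameter values are hypothetical, and the article does not state which calculation the CDE project used.

```python
from math import comb


def sentence_probabilities(batch_size, true_matches, sample_size, lower_crit, upper_crit):
    """Probability of each outcome for a batch with a known number of true matches.

    Assumes the sample is drawn without replacement, so the count of matches
    in the sample is hypergeometric.
    """
    denom = comb(batch_size, sample_size)
    p_non_link = 0.0
    p_link = 0.0
    for k in range(sample_size + 1):
        # P(exactly k matches appear in the sample)
        p_k = (comb(true_matches, k)
               * comb(batch_size - true_matches, sample_size - k)) / denom
        if k < lower_crit:
            p_non_link += p_k
        elif k > upper_crit:
            p_link += p_k
    return {
        "non-link": p_non_link,
        "link": p_link,
        "full clerical review": 1.0 - p_non_link - p_link,
    }


if __name__ == "__main__":
    # Risk profile across batches of differing quality (illustrative settings).
    for true_matches in (5, 50, 95):
        probs = sentence_probabilities(100, true_matches, 10, 2, 8)
        print(true_matches, {name: round(p, 4) for name, p in probs.items()})
```

Running this over a range of batch compositions shows how the choice of sample size and critical values trades wasted review effort against the chance of sentencing a mixed batch without full review.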

The acceptance sampling software, incorporated into the linkage software, FEBRL (http://datamining.anu.edu.au/software/febrl/febrldoc/) by ABS's Statistical IT Facilities, is flexible and user-friendly. It allows the operator to set sampling parameters, move freely through the batches, override the automatically assigned batch status and manually assign the status of a batch or group of batches.

Using this method we have been able to reduce the number of record pairs for clerical review from 11,000 to 4,000 in one linkage.

For further information, please contact Tenniel Guiver on (02) 6252 7310.