2.2 BLOCKING
Once data files have been standardised, record pairs (consisting of one record from each file) can be compared to see whether they are likely to be a match, i.e. belong to the same person. However, if the files are even moderately large, comparing every record on File A with every record on File B is computationally infeasible. Blocking reduces the number of comparisons by only comparing record pairs where matches are likely to be found – namely, records which agree on a set of blocking variables. Blocking variables are selected based on their reliability and discriminatory power. For instance, sex is typically well reported and therefore reliable, but it has little discriminatory power because it divides a dataset into only two blocks. It is therefore used in conjunction with other variables.
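The effect of blocking on the number of comparisons can be illustrated with a minimal Python sketch. It is not the linkage software used for this project: the function name block_pairs, the record layout and the sample values are assumptions made purely for illustration. Records from File B are indexed by their blocking key, and each File A record is paired only with File B records that share that key. Records with a missing blocking value are skipped, which is the limitation that motivates repeated passes.

```python
from collections import defaultdict


def block_pairs(file_a, file_b, blocking_vars):
    """Return (a_id, b_id) pairs whose records agree on every blocking variable.

    Hypothetical helper: records are plain dicts with an "id" field.
    """
    # Index File B by blocking key so each File A record is only
    # compared with File B records in the same block.
    index = defaultdict(list)
    for rec in file_b:
        key = tuple(rec.get(v) for v in blocking_vars)
        if None not in key:  # a missing value cannot agree with anything
            index[key].append(rec["id"])

    pairs = []
    for rec in file_a:
        key = tuple(rec.get(v) for v in blocking_vars)
        if None in key:
            continue
        for b_id in index.get(key, []):
            pairs.append((rec["id"], b_id))
    return pairs


# Illustrative records only; the values are invented.
file_a = [
    {"id": "A1", "sex": "F", "mesh_block": "100"},
    {"id": "A2", "sex": "M", "mesh_block": "100"},
    {"id": "A3", "sex": "F", "mesh_block": "200"},
]
file_b = [
    {"id": "B1", "sex": "F", "mesh_block": "100"},
    {"id": "B2", "sex": "M", "mesh_block": "200"},
    {"id": "B3", "sex": "F", "mesh_block": None},
]

all_pairs = len(file_a) * len(file_b)  # every record against every record
by_sex = block_pairs(file_a, file_b, ["sex"])
by_sex_and_mesh_block = block_pairs(file_a, file_b, ["sex", "mesh_block"])
```

In this toy example, blocking on sex alone removes fewer than half of the nine possible comparisons, while adding Mesh Block leaves a single candidate pair. It also shows the cost: record B3 has no Mesh Block and is never compared under the second strategy.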
The process of blocking reduces the computational intensity of data linking. However, comparing only records that agree on a particular set of blocking variables means a record will not be compared with its match if it has missing, invalid or legitimately different information on a blocking variable. To mitigate this, the linking process is repeated a number of times ('passes'), using a range of different blocking strategies. For example, on the first pass, a block using a low level of geography (Mesh Block) was used to capture the majority of 2006 Census records that had matching information with their corresponding 2011 Census record. This approach meant, however, that persons who did not have the same Mesh Block on both files were not compared, and that a potentially better match for a record may have existed in another Mesh Block. To mitigate this, a conservative approach to identifying and confirming links was adopted in the early passes. This ensured that records which failed to link in the first pass proceeded to the next pass, in which a different set of blocking variables was used. Each pass used a different combination of blocking and linking variables to ensure each record pair had the highest possible chance of being linked. The blocking variables used for each pass are outlined in section 2.4.
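The multi-pass idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the method used for this linkage: the function run_passes, the variables birth_year and first_name, and the sample records are all hypothetical, and the actual variables for each pass are those listed in section 2.4. Exact agreement on the linking variables, combined with a requirement that exactly one candidate qualifies, stands in for the conservative approach to confirming links. Records that are not linked in one pass are carried forward to the next.

```python
from collections import defaultdict


def run_passes(file_a, file_b, passes):
    """Link records over several passes, each with its own blocking variables.

    Hypothetical sketch: `passes` is a list of (blocking_vars, linking_vars).
    Returns {a_id: (b_id, pass_number)}.
    """
    links = {}
    unlinked_a, unlinked_b = list(file_a), list(file_b)

    for pass_no, (blocking_vars, linking_vars) in enumerate(passes, start=1):
        # Build blocks from the File B records still unlinked.
        index = defaultdict(list)
        for rec in unlinked_b:
            key = tuple(rec.get(v) for v in blocking_vars)
            if None not in key:
                index[key].append(rec)

        used_b = set()
        for rec in unlinked_a:
            key = tuple(rec.get(v) for v in blocking_vars)
            if None in key:
                continue
            candidates = [
                b for b in index.get(key, [])
                if b["id"] not in used_b
                and all(rec.get(v) is not None and rec.get(v) == b.get(v)
                        for v in linking_vars)
            ]
            # Conservative stand-in rule (an assumption): accept a link
            # only when exactly one candidate in the block qualifies.
            if len(candidates) == 1:
                links[rec["id"]] = (candidates[0]["id"], pass_no)
                used_b.add(candidates[0]["id"])

        # Records that failed to link proceed to the next pass.
        unlinked_a = [r for r in unlinked_a if r["id"] not in links]
        unlinked_b = [r for r in unlinked_b if r["id"] not in used_b]

    return links


# Illustrative records only: A2/B2 is a person who changed Mesh Block.
census_2006 = [
    {"id": "A1", "mesh_block": "100", "sex": "F", "birth_year": 1980, "first_name": "ANNA"},
    {"id": "A2", "mesh_block": "100", "sex": "M", "birth_year": 1975, "first_name": "BEN"},
]
census_2011 = [
    {"id": "B1", "mesh_block": "100", "sex": "F", "birth_year": 1980, "first_name": "ANNA"},
    {"id": "B2", "mesh_block": "300", "sex": "M", "birth_year": 1975, "first_name": "BEN"},
]
passes = [
    (["mesh_block"], ["sex", "birth_year", "first_name"]),  # pass 1: fine geography
    (["sex", "birth_year"], ["first_name"]),                # pass 2: no geography
]
links = run_passes(census_2006, census_2011, passes)
```

In this example the first record links on pass 1 within its Mesh Block. The second record is never compared with its match on pass 1 because the Mesh Block differs, and links only on pass 2 once geography is dropped from the blocking variables.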