3.1.1 LINKAGE RATES, TRUE AND FALSE LINKS
Not all record pairs assigned as links in a data linkage exercise are matches, that is, record pairs belonging to the same individual. While the methodology is designed to ensure that the vast majority of links are true, some are nevertheless false. The linkage strategy used for the ACLD was designed both to achieve a high number of links and to ensure a high level of accuracy, to enable longitudinal research. Accordingly, the strategy was restrictive and conservative, especially in the early passes.
The results of clerical review were analysed to determine the quality of the linkage process and to estimate the number of true links in the linked ACLD file. This process involved calculating the proportion of rejected record pairs at each linkage weight and determining the number of false links this would represent in the final output file.
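The calculation described above can be sketched as follows. This is a minimal illustration only, not the ABS implementation: the `WeightBand` structure, the weights and all counts are hypothetical, and it assumes that the rejection proportion observed in the reviewed sample at a given weight applies to every link assigned at that weight.

```python
from dataclasses import dataclass


@dataclass
class WeightBand:
    weight: float        # linkage weight (hypothetical value)
    links_assigned: int  # links accepted at this weight
    sampled: int         # record pairs sent to clerical review
    rejected: int        # sampled pairs judged not to be a match


def estimate_false_links(bands):
    """Scale each band's sample rejection proportion up to its assigned links."""
    total = 0.0
    for band in bands:
        if band.sampled == 0:
            continue  # no review evidence at this weight, so it adds nothing
        total += band.links_assigned * band.rejected / band.sampled
    return total


# Hypothetical review results for two linkage weights.
bands = [
    WeightBand(weight=20.0, links_assigned=1000, sampled=50, rejected=10),
    WeightBand(weight=25.0, links_assigned=4000, sampled=100, rejected=5),
]
estimated = estimate_false_links(bands)
rate = estimated / sum(band.links_assigned for band in bands)
print(f"Estimated false links: {estimated:.0f} ({rate:.1%} of assigned links)")
```

Weighting each band by its number of assigned links, rather than averaging the sample proportions directly, keeps heavily sampled but small bands from dominating the overall estimate.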
Table 3 summarises the results of clerical review, including an estimate of the number of false links accepted in each pass. Due to the nature of deterministic linking and the way in which linked records were retained, no false links were identified in Passes 1 and 2. While it is assumed that all links assigned in these passes were true, as they contained consistent information across all key linking fields, in reality there may have been a small but unquantifiable number of false links.
TABLE 3 - LINKAGE RESULTS, By pass number
| Pass number(a) | Unit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 11 | 12 | Total(b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Links created | No. | 559 182 | 131 575 | 11 131 | 182 285 | 212 071 | 57 713 | 10 489 | 10 156 | 236 180 | 133 555 | 29 911 | 1 574 248 |
| Sampled in clerical review | No. | 30 | 30 | 240 | 400 | 400 | 345 | 206 | 120 | 411 | 201 | 200 | 2 583 |
| Links assigned | No. | 544 925 | 10 919 | 10 489 | 62 570 | 87 248 | 18 988 | 1 723 | 159 | 50 007 | 9 827 | 3 904 | 800 759 |
| Total false links | No. | 0 | 0 | 997 | 9 929 | 17 274 | 1 832 | 237 | 29 | 10 712 | 1 051 | 731 | 42 792 |
| False link rate | % | 0 | 0 | 9.5 | 15.9 | 19.8 | 9.6 | 13.7 | 18.4 | 21.4 | 10.7 | 18.7 | 5.3 |

(a) The results of Pass 10 were used to identify the blocking field to be used in Pass 11. As a result, there were no records output from Pass 10.
(b) Data presented in the table have been confidentialised. As a result the sum of individual categories may not align with totals.
The combined clerical review results indicate that the proportion of false links in the final ACLD file could be as low as 5%. By including a tolerance around these results and assuming a small false link rate for the deterministic passes, the false link rate for the ACLD is estimated to be about 5-10%. The passes that contained the highest proportion of false links were Pass 9 (21.4%), where family information was used to try to resolve unlinked records, and Pass 5 (19.8%), which used a broad geography (SA4) as the blocking field.
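As a rough check on these figures, the sketch below recomputes the false link rates from the "Links assigned" and "Total false links" rows of Table 3. The variable names are illustrative only. Because the published data have been confidentialised, a recomputed per-pass rate can differ slightly from the published rate (for example, for Passes 7 and 8), so the result should be read as indicative.

```python
# Counts copied from Table 3, keyed by pass number (Pass 10 produced no output).
links_assigned = {
    1: 544_925, 2: 10_919, 3: 10_489, 4: 62_570, 5: 87_248, 6: 18_988,
    7: 1_723, 8: 159, 9: 50_007, 11: 9_827, 12: 3_904,
}
false_links = {
    1: 0, 2: 0, 3: 997, 4: 9_929, 5: 17_274, 6: 1_832,
    7: 237, 8: 29, 9: 10_712, 11: 1_051, 12: 731,
}

# False link rate per pass, as a percentage of links assigned in that pass.
rate_by_pass = {
    p: 100 * false_links[p] / links_assigned[p] for p in links_assigned
}

# Overall rate: total estimated false links over total links assigned.
overall_rate = 100 * sum(false_links.values()) / sum(links_assigned.values())

# Pass with the highest recomputed false link rate.
worst_pass = max(rate_by_pass, key=rate_by_pass.get)

for p, r in rate_by_pass.items():
    print(f"Pass {p:>2}: {r:5.1f}%")
print(f"Overall: {overall_rate:.1f}% (highest in Pass {worst_pass})")
```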
While this is only an approximate estimate, based on a review of a sample of over 2,500 record pairs, it does indicate a high level of overall quality.
The linkage rate of 82% with a false link rate of 5% was broadly consistent with, or better than, the results of other ABS Census linkage projects that did not use name and address as linkage variables (see Assessing the Likely Quality of the Statistical Longitudinal Census Dataset (cat. no. 1351.0.55.026)).
The conservative and restrictive nature of the blocking and linking strategy, together with the quality controls implemented during clerical review, helped to minimise the estimated number of false links throughout the linkage process.
About two-thirds (68%) of all links were achieved in the first pass of the project, which used a deterministic linking methodology to identify and filter matches. In Pass 1, a tight geographic and demographic restriction was implemented to maximise the number of high-quality links assigned and to limit the number of alternative comparisons required. Using this approach, links were only accepted if a single record pair was identified.
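The acceptance rule for a deterministic pass of this kind can be illustrated with a short sketch. It is not the ACLD linkage code: the function `deterministic_links`, the record layout and the key fields (`area`, `sex`, `birth_year`) are all hypothetical. The only element taken from the description above is the rule that a link is accepted only when exactly one candidate record pair agrees on all key fields.

```python
from collections import defaultdict


def deterministic_links(file_a, file_b, key_fields):
    """Link records that agree exactly on key_fields, keeping unique pairs only."""

    def index(records):
        # Group record ids by their tuple of key field values.
        groups = defaultdict(list)
        for rec in records:
            groups[tuple(rec[f] for f in key_fields)].append(rec["id"])
        return groups

    a_index, b_index = index(file_a), index(file_b)
    links = []
    for key, a_ids in a_index.items():
        b_ids = b_index.get(key, [])
        # Accept only when the key identifies one record on each file.
        if len(a_ids) == 1 and len(b_ids) == 1:
            links.append((a_ids[0], b_ids[0]))
    return links


# Hypothetical records: a1 has one candidate, a2 has two, a3 has none.
file_a = [
    {"id": "a1", "area": "X1", "sex": "F", "birth_year": 1980},
    {"id": "a2", "area": "X2", "sex": "M", "birth_year": 1975},
    {"id": "a3", "area": "X3", "sex": "F", "birth_year": 1990},
]
file_b = [
    {"id": "b1", "area": "X1", "sex": "F", "birth_year": 1980},
    {"id": "b2", "area": "X2", "sex": "M", "birth_year": 1975},
    {"id": "b3", "area": "X2", "sex": "M", "birth_year": 1975},
]
links = deterministic_links(file_a, file_b, ["area", "sex", "birth_year"])
print(links)
```

In this example only `a1` is linked: `a2` is left unlinked because two records on the second file share its key, which is the kind of ambiguity a later, less restrictive pass would need to resolve.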