Within a blocking pass, records on the two files that agree on the specified blocking variables are compared on a set of linking variables. Each linking variable has associated field weights, which are calculated prior to comparison. Field weights indicate how much information the comparison outcome for a linking variable (agreement, disagreement, or a missing value) provides about whether or not the two records belong to the same person (match status). Field weights are based on two probabilities associated with each linking variable: first, the probability that the field values agree given that the two records belong to the same person (match); and second, the probability that the field values agree given that the two records belong to different persons (non-match). These are called m and u probabilities (or match and non-match probabilities) and are defined as:
m = P(fields agree | records belong to the same person)
u = P(fields agree | records belong to different people)
Because the m and u probabilities require knowledge of the true match status of record pairs, they cannot be known exactly and must instead be estimated. The ABS calculated the m and u probabilities based on the training dataset, under the assumption that each deterministic link on the dataset was a match. The deterministic links used in this phase included (1) the highest quality links accepted in the deterministic linking passes, and (2) additional slightly lower quality links expected to be confirmed as accurate in the probabilistic linking phase. When estimating the m and u probabilities, this method took deaths and net overseas migration into account to estimate the likelihood that a record would have a match. This method also generated probabilities for disagreement, which can be referred to as md and ud probabilities:
md = P(fields disagree | records belong to the same person)
ud = P(fields disagree | records belong to different people)
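As a rough illustration of the four definitions above, the following minimal sketch estimates m, u, md and ud for one linking variable from record pairs whose match status is assumed known (as with the deterministic links treated as matches). The function name, the input layout and the toy counts are hypothetical. The sketch also assumes that pairs with a missing value stay in the denominator, and it does not reproduce the adjustment for deaths and net overseas migration described above.

```python
def estimate_probabilities(pairs):
    """Estimate m, u, md, ud for one linking variable.

    `pairs` is an iterable of (is_match, outcome) tuples, where outcome is
    "agree", "disagree" or "missing". This layout is a hypothetical one
    chosen for illustration.
    """
    counts = {
        True: {"agree": 0, "disagree": 0, "missing": 0},
        False: {"agree": 0, "disagree": 0, "missing": 0},
    }
    for is_match, outcome in pairs:
        counts[is_match][outcome] += 1

    # Assumption: missing values remain in the denominator, so m + md
    # (and u + ud) can sum to less than 1.
    n_match = sum(counts[True].values())
    n_non_match = sum(counts[False].values())

    return {
        "m": counts[True]["agree"] / n_match,
        "md": counts[True]["disagree"] / n_match,
        "u": counts[False]["agree"] / n_non_match,
        "ud": counts[False]["disagree"] / n_non_match,
    }


# Toy data (made up): 10 assumed matches and 20 assumed non-matches.
toy_pairs = (
    [(True, "agree")] * 8 + [(True, "disagree")] + [(True, "missing")]
    + [(False, "agree")] + [(False, "disagree")] * 18 + [(False, "missing")]
)
probs = estimate_probabilities(toy_pairs)
print(probs)
```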
Note that m and u probabilities were calculated separately for each pass, as the probabilities depend upon the characteristics of the pass's blocking variables. For example, the m probability for country of birth when blocking on mesh block will be different to the m probability for country of birth when blocking on sex.
The match (m, md) and non-match (u, ud) probabilities are then converted to agreement and disagreement field weights as follows:
Agree = log2(m/u)
Disagree = log2(md/ud)
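The two formulas above can be evaluated directly. The sketch below is illustrative only: the function name and all probability values are invented for the example and are not ABS estimates.

```python
import math


def field_weights(m, u, md, ud):
    """Return (agreement weight, disagreement weight) as base-2 log ratios."""
    agree = math.log2(m / u)
    disagree = math.log2(md / ud)
    return agree, disagree


# Hypothetical probabilities for two linking variables.
# Date of birth: chance agreement between different people is rare (small u).
dob_agree, dob_disagree = field_weights(m=0.95, u=0.001, md=0.04, ud=0.998)
# Sex: chance agreement is common (u near 0.5) but the variable is reliable
# (small md).
sex_agree, sex_disagree = field_weights(m=0.99, u=0.5, md=0.005, ud=0.5)

print(f"date of birth: agree={dob_agree:.2f}, disagree={dob_disagree:.2f}")
print(f"sex:           agree={sex_agree:.2f}, disagree={sex_disagree:.2f}")
```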
These equations give rise to a number of intuitive properties of the Fellegi–Sunter framework (Fellegi & Sunter, 1969). First, in practice, agreement weights are always positive and disagreement weights are always negative, because agreement is more likely among matches than non-matches (m > u) and disagreement is less likely (md < ud). Second, the magnitude of the agreement weight is driven primarily by the likelihood of chance agreement. That is, a low probability of two random people agreeing on a variable (for example, date of birth) will result in a large agreement weight being applied when two records do agree.
The magnitude of the disagreement weight is driven by the stability and reliability of a variable. That is, if a variable is well reported and stable over time (for example, sex) then disagreement on the variable will yield a large negative weight. For each record pair comparison, the field weights from each linking variable are summed to form an overall record pair comparison weight or 'linkage weight'.
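The summation of field weights into a linkage weight can be sketched as follows. The weight table, field names and function name are hypothetical, and the sketch assumes that a missing value contributes a weight of zero, which the text does not specify.

```python
def linkage_weight(outcomes, weights):
    """Sum the field weights for one record pair.

    `outcomes` maps each linking variable to "agree", "disagree" or
    "missing"; `weights` maps each variable to its agreement and
    disagreement weights.
    """
    total = 0.0
    for field, outcome in outcomes.items():
        if outcome in ("agree", "disagree"):
            total += weights[field][outcome]
        # Assumption: a missing value adds nothing to the total.
    return total


# Hypothetical field weights (not ABS values).
example_weights = {
    "date_of_birth": {"agree": 9.5, "disagree": -4.5},
    "sex": {"agree": 1.0, "disagree": -6.5},
    "first_name": {"agree": 6.0, "disagree": -3.0},
}

# One record pair: agrees on date of birth and sex, disagrees on first name.
example_outcomes = {
    "date_of_birth": "agree",
    "sex": "agree",
    "first_name": "disagree",
}

total_weight = linkage_weight(example_outcomes, example_weights)
print(total_weight)  # 9.5 + 1.0 - 3.0 = 7.5
```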
Before calculating m and u probabilities for some variables it is first necessary to define what constitutes agreement. Typical comparison functions used in the linkage include:
- Exact match (e.g. Sex). Agreement occurs only when the two variable values are identical. This criterion is used for most linking variables;
- Numeric difference (e.g. Age). A pair may be defined to agree if their variable values differ by an amount less than or equal to a specified maximum difference; and
- Approximate string comparison (e.g. First name). Two strings may be said to agree in spite of a certain proportion of missing, differing, or transposed characters, allowing for misspellings, transcriptions of poor handwriting, etc. Approximate string comparators, such as the Winkler comparator, allow for partial agreement if the strings being compared are similar but do not exactly match, and can be used to ensure that both identical and similar string pairs are defined to agree.
For further details on comparison functions used for probabilistic linkage, see Christen & Churches (2005).
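The three kinds of comparison function listed above can be sketched as simple predicates. This is an illustration only: the function names and thresholds are hypothetical, and the string comparison uses the similarity ratio from Python's standard `difflib` module as a stand-in, not the Winkler comparator named in the text.

```python
from difflib import SequenceMatcher


def exact_agree(a, b):
    """Exact match: agreement only when the values are identical."""
    return a == b


def numeric_agree(a, b, max_diff):
    """Numeric difference: agreement when values differ by at most max_diff."""
    return abs(a - b) <= max_diff


def string_agree(a, b, threshold=0.85):
    """Approximate string comparison (stand-in for the Winkler comparator).

    Uses difflib's similarity ratio in [0, 1]; the 0.85 threshold is an
    arbitrary illustrative choice.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold


print(exact_agree("F", "F"))
print(numeric_agree(34, 35, max_diff=1))
print(string_agree("Katherine", "Catherine"))
print(string_agree("Katherine", "Michael"))
```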
Near or partial agreement may also be factored into the linking process through calculation of m and u probabilities accounting for such agreement. For example, a person’s age on equivalent records will frequently be an exact match, and the m and u probabilities were calculated on the basis of exact agreement. During linkage, however, a partial agreement weight was given where ages differed by one or two years, to cater for persons whose age may have been reported incorrectly for a variety of reasons.
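A tiered weight for age of the kind described above might look like the following sketch. The source does not state how the partial agreement weight was derived or how large it was, so the function name, the three weight values and the treatment of missing ages are all assumptions made for illustration.

```python
def age_field_weight(age_a, age_b,
                     agree_weight=4.0,     # hypothetical exact-agreement weight
                     partial_weight=2.0,   # hypothetical partial-agreement weight
                     disagree_weight=-3.0, # hypothetical disagreement weight
                     tolerance=2):
    """Return a field weight for age with a partial-agreement tier."""
    if age_a is None or age_b is None:
        # Assumption: a missing age contributes no weight.
        return 0.0
    diff = abs(age_a - age_b)
    if diff == 0:
        return agree_weight
    if diff <= tolerance:
        # Ages within one or two years receive the partial weight.
        return partial_weight
    return disagree_weight


print(age_field_weight(40, 40))  # exact agreement
print(age_field_weight(40, 42))  # partial agreement
print(age_field_weight(40, 45))  # disagreement
```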