1504.0 - Methodological News, Mar 2018  
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 22/03/2018   

AN EMPIRICAL BAYESIAN APPROACH TO ENTITY-BASED DATA LINKAGE

The ABS’s involvement in cross-government data integration initiatives such as the Multi-Agency Data Integration Project (MADIP) and the Data Integration Partnerships for Australia (DIPA) is presenting new opportunities to improve the ABS’s data linkage methodology. An important pursuit within these initiatives is the development of an entity-based approach to data linkage, where records in datasets are linked to entities in a population spine. This represents a paradigm shift from the ABS’s existing linkage methodology, which is tailored to conventional record linkage across pairs of datasets. While the existing methodology can be adapted to the population spine, an empirical Bayesian approach has the potential to give more statistical validity to inferences about which records constitute an entity in the spine. To create a linkage spine, existing methods such as probabilistic linkage (Fellegi & Sunter 1969) and deterministic linkage (ABS 2016) need to be combined with a post hoc conflict resolution procedure. A separate issue with these methods is their inability to provide principled estimates of linkage uncertainty for analysis. If such estimates were available, they could be propagated through analyses of the spine-linked data to provide more statistically sound inferences.
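For readers unfamiliar with probabilistic linkage, the following is a minimal sketch of the Fellegi & Sunter style of scoring a record pair, in which each compared field contributes a log likelihood-ratio weight. The field names, the m- and u-probabilities and the two thresholds are all hypothetical values chosen for illustration; they are not ABS parameters.

```python
import math

# Hypothetical parameters for three comparison fields (illustrative only).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.02),
    "postcode": (0.90, 0.05),
}


def match_weight(agreements):
    """Sum the log2 likelihood-ratio weights over the compared fields."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PARAMS[field]
        if agrees:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Apply two (hypothetical) thresholds to a total match weight."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


pair = {"surname": True, "birth_year": False, "postcode": True}
print(classify(match_weight(pair)))
```

Note that this scoring yields a decision for each record pair in isolation, which is why a separate conflict resolution step is needed when pairwise decisions disagree about which records belong to the same entity.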

With these areas for improvement in mind, the ABS has been investigating alternative, state-of-the-art approaches to entity-based linkage. One method that is well suited to a number of important ABS requirements is the empirical Bayesian approach described by Steorts (2015), known as ebLink. Unlike many alternative methods, ebLink directly models the entities in the domain (e.g. Australian residents) and the links from records to entities. This makes it a good fit for the population spine, since it can provide a statistical measure of the association between spine entities and dataset records. It also offers the usual benefits of a Bayesian framework, namely: accounting for uncertainty through the posterior distribution, the ability to incorporate prior information, and the facilitation of complex hierarchical models.
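To illustrate what a posterior distribution over links offers in practice, the sketch below estimates the probability that two records refer to the same entity from a set of posterior samples of the linkage structure. It is not the ebLink sampler itself: the samples are hypothetical hand-written values, and the function name `coreference_probabilities` is an assumption made for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical posterior samples of the linkage structure. Each sample maps
# a record ID to the ID of the latent entity it is linked to.
samples = [
    {"r1": 0, "r2": 0, "r3": 1},
    {"r1": 0, "r2": 0, "r3": 0},
    {"r1": 2, "r2": 2, "r3": 5},
    {"r1": 1, "r2": 3, "r3": 3},
]


def coreference_probabilities(samples):
    """Estimate P(two records share an entity) as a fraction of samples.

    Entity labels are arbitrary and may differ between samples, so only
    co-membership within each sample is compared, never the labels themselves.
    """
    counts = Counter()
    for linkage in samples:
        for a, b in combinations(sorted(linkage), 2):
            if linkage[a] == linkage[b]:
                counts[(a, b)] += 1
    return {pair: count / len(samples) for pair, count in counts.items()}


probs = coreference_probabilities(samples)
print(probs.get(("r1", "r2"), 0.0))
```

Probabilities of this kind are the sort of linkage uncertainty estimate that could, in principle, be carried forward into downstream analysis rather than being discarded after a hard link decision.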

To assess the feasibility of ebLink, the ABS is collaborating with the University of Melbourne through the APR.Intern programme. The poor scalability of ebLink was quickly identified as an obstacle, but it has been somewhat mitigated by a re-parametrisation of the model that incorporates blocking ideas and enables the inference to be distributed across a compute cluster. As part of the collaboration, a prototype is being implemented in Apache Spark (a distributed computing framework). Early experiments indicate that ebLink slightly outperforms the ABS’s established methods in terms of linkage accuracy, while also providing a full posterior distribution over the linkage structure. However, computational efficiency and scalability remain a challenge for future work.
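The general idea of blocking can be sketched as follows: records are partitioned by a cheap key so that only records within the same block are considered together, and each block can then be processed independently. This is conventional blocking in plain Python, shown only to convey the intuition; it is not the re-parametrisation used in the prototype, and the records, the field names and the `blocking_key` rule are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records drawn from two datasets (IDs prefixed "a" and "b").
records = [
    {"id": "a1", "surname": "Nguyen", "birth_year": 1980},
    {"id": "b7", "surname": "Nguyen", "birth_year": 1980},
    {"id": "a2", "surname": "Smith", "birth_year": 1975},
    {"id": "b3", "surname": "Smyth", "birth_year": 1975},
    {"id": "a9", "surname": "Jones", "birth_year": 1990},
]


def blocking_key(record):
    """Hypothetical rule: first letter of surname plus year of birth."""
    return (record["surname"][0].upper(), record["birth_year"])


def candidate_pairs(records):
    """Return only the record pairs that fall within the same block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    pairs = []
    for ids in blocks.values():
        # Each block is independent of the others, so this loop body is the
        # unit of work that could be handed to a separate cluster worker.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


all_pairs = len(records) * (len(records) - 1) // 2
print(len(candidate_pairs(records)), "of", all_pairs, "pairs retained")
```

The trade-off is that a coarse key retains more candidate pairs, while a fine key risks separating records that truly belong together, for example when a blocking field is recorded with an error.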

References
Fellegi, Ivan; Sunter, Alan (1969). “A Theory for Record Linkage”. Journal of the American Statistical Association. 64 (328): pp. 1183–1210.

Steorts, Rebecca C. (2015). “Entity Resolution with Empirically-Motivated Priors”. Bayesian Analysis. 10 (4): pp. 849–875.

ABS (2016). “Personal Income Tax and Migrants Integrated Dataset (PITMID) 2011-12 Quality Assessment”. ABS Research Paper. cat. no. 1351.0.055.060.

Further information
For more information, please contact Neil Marchant (Neil.G.Marchant@unimelb.edu.au) or Daniel Elazar (Daniel.Elazar@abs.gov.au).

The ABS Privacy Policy outlines how the ABS will handle any personal information that you provide to us.