1504.0 - Methodological News, Mar 2014  
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 26/03/2014   

Treatment of Missing Data in Statistical Data Integration

Statistical data integration of micro-level data from different administrative and/or survey sources is an emerging priority for the ABS and the wider National Statistical Service (NSS), as a means of investigating more complex and wide-ranging policy and research questions than would be possible using separate, unintegrated data sources. Linking micro-level data from these source files often produces records with missing data in the linked files. This missing data must be addressed before reliable analyses can be performed on the linked files.

The two standard methods for treating missing data are weighting and imputation. Weighting is currently the preferred approach in statistical data integration, for three reasons: it automatically maintains the relationships between the data items within the various source files; the theoretical framework for calculating variances is much more advanced; and a weighting system usually requires fewer resources to build and less time to execute.

A two-phase approach has been proposed to handle weighting for the missing data, with the sampling mechanism as the first phase and the linking mechanism as the second phase. In some circumstances one of the datasets to be linked will conceptually represent the population of interest completely. In this situation the sampling mechanism in the first phase equates to a complete enumeration of the population.
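
In notation that is an assumption of this note rather than taken from the proposal itself, write \pi_i for the first-phase (sampling) inclusion probability of record i, p_i for its second-phase (linking) probability, and s_L for the set of linked records. A standard two-phase weight and the corresponding estimator of a population total would then take roughly the following form:

```latex
w_i = \frac{1}{\pi_i \, p_i}, \qquad
\hat{Y} = \sum_{i \in s_L} w_i \, y_i
```

When one dataset completely enumerates the population, \pi_i = 1 for every record and the weight reduces to 1 / p_i.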

The proposed weighting method is a two-step linking propensity calibration estimator. In the first step, estimated linking probabilities are obtained by fitting a logistic linking propensity model. The inverses of these estimated linking probabilities are used as intermediate weights that feed into the second step. In the second step, the intermediate weights are calibrated to known population auxiliary totals.
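
The two steps can be illustrated with a minimal sketch. It is not the ABS implementation: the record layout (`d`, `x`, `cat`, `linked`), the toy data, the population totals and all function names are hypothetical. The second step uses post-stratification (a ratio adjustment within categories), which is a simple special case of calibration, in place of whatever calibration method the proposal actually uses. The propensity model is fitted unweighted, which is also an assumption.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, iters=25):
    """Fit an (unweighted) logistic model by Newton-Raphson."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        probs = [sigmoid(sum(b * v for b, v in zip(beta, row))) for row in X]
        grad = [sum((yi - pi) * row[j] for row, yi, pi in zip(X, y, probs))
                for j in range(p)]
        hess = [[sum(pi * (1 - pi) * row[j] * row[k]
                     for row, pi in zip(X, probs))
                 for k in range(p)] for j in range(p)]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def two_step_weights(records, totals):
    """Return (beta, weights) where weights maps record index -> final weight."""
    # Step 1: model the probability of linking, then invert it.
    X = [[1.0, float(r["x"])] for r in records]
    y = [r["linked"] for r in records]
    beta = fit_logistic(X, y)
    weights = {}
    for i, (r, row) in enumerate(zip(records, X)):
        if r["linked"]:
            p_link = sigmoid(sum(b * v for b, v in zip(beta, row)))
            weights[i] = r["d"] / p_link  # intermediate weight
    # Step 2: scale intermediate weights to known totals (post-stratification).
    for cat, total in totals.items():
        members = [i for i in weights if records[i]["cat"] == cat]
        factor = total / sum(weights[i] for i in members)
        for i in members:
            weights[i] *= factor
    return beta, weights

# Toy data: d = first-phase weight, x = covariate explaining linkage,
# cat = calibration category, linked = 1 if the record was linked.
records = [
    {"d": 10.0, "x": 0, "cat": "A", "linked": 1},
    {"d": 10.0, "x": 0, "cat": "A", "linked": 1},
    {"d": 10.0, "x": 0, "cat": "B", "linked": 1},
    {"d": 10.0, "x": 0, "cat": "B", "linked": 0},
    {"d": 10.0, "x": 1, "cat": "A", "linked": 1},
    {"d": 10.0, "x": 1, "cat": "A", "linked": 0},
    {"d": 10.0, "x": 1, "cat": "B", "linked": 1},
    {"d": 10.0, "x": 1, "cat": "B", "linked": 0},
]
totals = {"A": 45.0, "B": 35.0}  # hypothetical known population totals

beta, weights = two_step_weights(records, totals)
for i in sorted(weights):
    print(i, records[i]["cat"], round(weights[i], 3))
```

Only linked records receive a final weight. After the second step, the weights within each category sum to that category's known total.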

The auxiliary variables included in the first step of the two-step linking propensity calibration procedure should be those which help explain the probability of linking. The auxiliary variables included in the second step should be those which can improve the accuracy of the estimates (i.e. those highly correlated with the survey variables) and/or those needed to ensure consistency with known population totals.

This particular two-step weighting procedure was applied to the data linked between the 2011 Census and the Department of Immigration and Border Protection's Settlement Data Base (SDB). A paper on the proposed two-step linking propensity calibration procedure was recently presented to the Methodology Advisory Committee.

Further investigations are being conducted into the interplay between setting linking cut-off values and weighting linked records. The Bronze Low linked files (which use a low threshold value to accept or reject links) are currently used, rather than the Bronze High linked files (which use a high threshold value to accept or reject links). However, a major drawback of using the Bronze Low linked files is that they contain more incorrectly linked records (i.e. accepted links to similar persons or businesses rather than to the same person or business). The Bronze Low linked files also require substantially more resources to produce than the Bronze High linked files.
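
The trade-off between the two cut-offs can be illustrated with a toy sketch. The match scores, the true-match flags and the cut-off values 0.85 and 0.60 are all invented for illustration; they are not the thresholds or data behind the Bronze High and Bronze Low files.

```python
def accept_links(pairs, cutoff):
    """Accept candidate links scoring at or above the cutoff.

    Returns (number of accepted links, number of accepted false links).
    """
    accepted = [is_match for score, is_match in pairs if score >= cutoff]
    return len(accepted), accepted.count(False)

# Hypothetical candidate pairs: (match score, is it truly the same unit?)
toy_pairs = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.65, True), (0.55, False), (0.40, False),
]

HIGH_CUTOFF = 0.85  # hypothetical stand-in for a high threshold
LOW_CUTOFF = 0.60   # hypothetical stand-in for a low threshold

print("high:", accept_links(toy_pairs, HIGH_CUTOFF))
print("low: ", accept_links(toy_pairs, LOW_CUTOFF))
```

In this toy example the lower cut-off accepts more links overall, so fewer records are left unlinked and in need of weighting, but one of the additional links is a false one.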


Further Information
For more information, please contact John Preston (07 3222 6229, john.preston@abs.gov.au)
