1504.0 - Methodological News, Mar 2018  
ARCHIVED ISSUE Released at 11:30 AM (CANBERRA TIME) 22/03/2018   

AN EVALUATION OF THE FEASIBILITY OF PRODUCING A PROTOTYPE SYNTHETIC DATASET USING INNOVATIVE METHODOLOGIES

The ABS is committed to increasing the accessibility of its valuable microdata while ensuring data confidentiality. In addition to existing microdata protection techniques, the ABS is undertaking exploratory work into synthetic microdata as an alternative approach that allows researchers to analyse the microdata while the ABS continues to meet its legislative obligations. An important benefit will be an expansion of the ABS’s confidentialised product suite to better accommodate the needs of the growing market niche of sophisticated users.

In a partially synthetic dataset, some of the original data values, especially sensitive ones, are treated as ‘missing values’ and replaced by synthetic values generated from an imputation model. A high quality synthetic dataset disguises the confidential information so that disclosure risk is kept to a minimum, while preserving valid inferences from the original data for the estimates of interest.
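The replace-by-imputation idea described above can be sketched in a few lines of Python. This is an illustration only, not the ABS method: the field names `employees` and `turnover`, the `is_sensitive` rule and the simple linear imputation model are all assumptions made for the example. A fuller synthesiser would also draw the model parameters from their posterior distribution rather than fixing them at their estimates.

```python
import random
from statistics import mean, stdev


def partially_synthesise(records, is_sensitive, seed=0):
    """Return a copy of records in which sensitive 'turnover' values are
    replaced by draws from a simple imputation model.

    records: list of dicts with hypothetical keys 'employees' and 'turnover'.
    is_sensitive: function that returns True when a record's turnover
        must be replaced (an assumed disclosure rule).
    """
    rng = random.Random(seed)
    x = [r["employees"] for r in records]
    y = [r["turnover"] for r in records]

    # Imputation model: least-squares line of turnover on employees.
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx

    # Spread of the residuals, used as the noise scale for synthetic draws.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = stdev(residuals)

    synthetic = []
    for r in records:
        new = dict(r)  # copy, so the original records are left untouched
        if is_sensitive(r):
            # Treat the value as missing and draw a replacement from the model.
            fitted = intercept + slope * r["employees"]
            new["turnover"] = fitted + rng.gauss(0, sigma)
            new["synthetic"] = True
        else:
            new["synthetic"] = False
        synthetic.append(new)
    return synthetic
```

For example, `partially_synthesise(records, lambda r: r["employees"] > 30)` would leave the turnover of smaller businesses as observed and replace only the values of the larger ones.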

A case study examining the feasibility of a partially synthetic dataset has recently been carried out using the 2006-07 to 2010-11 Business Longitudinal Database (BLD). A good imputation model for the BLD should mimic the data generating process well, capturing both the complex dependencies among variables of mixed type and the longitudinal structure. Three classes of multivariate imputation model were investigated: sequential regression, sequential random forest and a Bayesian copula model, each with random effects added to account for business-specific effects. In the sequential regression and sequential random forest approaches, each variable to be synthesised is modelled with a generalised linear mixed model or a mixed effects random forest model, respectively. In the copula-based model, all the variables are modelled jointly: the observed data are transformed to latent variables, which are assumed to follow a Gaussian distribution.
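To make the latent Gaussian idea behind the copula approach concrete, the sketch below synthesises two variables jointly. It is a deliberately simplified stand-in, not the model used in the case study: it is rank-based rather than Bayesian, it ignores the longitudinal structure and the business-specific random effects, and every function name is hypothetical. It also back-transforms to observed values, so a real release would need a smoothed marginal model instead.

```python
import math
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal, used for both transforms


def normal_scores(values):
    """Map values to latent normal scores via their ranks.

    Ties are broken by position, which is only an approximation for
    discrete variables.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def empirical_quantile(sorted_values, u):
    """Observed value at probability u (nearest rank, no interpolation)."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def copula_synthesise(x, y, n_draws, seed=0):
    """Draw (x, y) pairs from a Gaussian copula fitted to the observed pairs."""
    rng = random.Random(seed)
    # Dependence is estimated on the latent (normal score) scale.
    rho = correlation(normal_scores(x), normal_scores(y))
    scale = math.sqrt(max(0.0, 1.0 - rho ** 2))  # guard against rounding
    xs, ys = sorted(x), sorted(y)
    draws = []
    for _ in range(n_draws):
        # Correlated latent Gaussian pair.
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + scale * rng.gauss(0, 1)
        # Back-transform each latent value through the empirical margins.
        draws.append((empirical_quantile(xs, STD_NORMAL.cdf(z1)),
                      empirical_quantile(ys, STD_NORMAL.cdf(z2))))
    return draws, rho
```

Because the dependence is captured on the latent scale and the margins are handled separately, the same structure works when one variable is a count and the other is continuous, which is the appeal of a copula for data of mixed type.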

The utility of the synthetic datasets is evaluated by fitting models taken from published research to both the synthetic data and the original data, and comparing the resulting parameter estimates. Simulation results suggest that the Bayesian copula model gives the best utility in some longitudinal data analyses on the BLD, followed by the sequential random forest approach.
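One way such a comparison of parameter estimates might be scored is sketched below. The article does not say which utility measure was used, so the confidence interval overlap shown here is an assumed choice, and the simple one-predictor regression stands in for the published models. The function names are hypothetical.

```python
import math
from statistics import mean


def fit_slope(x, y):
    """Least-squares slope of y on x, with its standard error."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def interval_overlap(est_a, se_a, est_b, se_b, z=1.96):
    """Average share of each approximate 95% interval covered by the other.

    Returns 1.0 for identical intervals and a negative value when the
    intervals do not overlap at all. Both standard errors must be positive.
    """
    lo_a, hi_a = est_a - z * se_a, est_a + z * se_a
    lo_b, hi_b = est_b - z * se_b, est_b + z * se_b
    shared = min(hi_a, hi_b) - max(lo_a, lo_b)
    return 0.5 * (shared / (hi_a - lo_a) + shared / (hi_b - lo_b))
```

In use, `fit_slope` would be applied once to the original data and once to the synthetic data, and the two results passed to `interval_overlap`; values close to 1 indicate that an analyst would reach much the same conclusion from either dataset.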

The ABS intends to build upon this research by exploring the potential of synthetic data as a dissemination tool that enhances public access to microdata while ensuring confidentiality. Considerations include the computational demands of large datasets, the cost of building more complicated imputation models, the need for evidence that synthetic datasets meet the ABS’s confidentiality requirements, and the trade-off between utility and confidentiality.

Further Information

For more information, please contact Jiali Wang (u5298171@anu.edu.au) or Bernadette Fox (Bernadette.Fox@abs.gov.au).

The ABS Privacy Policy outlines how the ABS will handle any personal information that you provide to us.