Methodological News, December Quarter 2023

Features important work and developments in ABS methodologies

Released
12/12/2023

This issue contains three articles:

  • Modernising Probabilistic Record Linkage at the ABS
  • Modelling Estimates of Job Vacancies for Micro Businesses
  • Simulated Data for Enhancing Data Access

Modernising Probabilistic Record Linkage at the ABS

Record linkage is a computationally intensive activity and National Statistical Offices are increasingly relying on it to support the production of official statistics and data products.  For the past 18 years, the ABS has built and maintained capability in both deterministic and probabilistic record linkage.  Where feasible, our preference is to use probabilistic linking due to better and more defensible linkage outcomes. However, the size and complexity of the data being linked mean that the linkage run time is often prohibitively long, and the model assumptions are difficult to maintain.  For these reasons, deterministic linkage is being used for most production linkages in the ABS.

For deterministic linking, the ABS has been using an internally developed SAS macro called ‘D-MAC’.  For probabilistic linking, we have been using the open-source software ‘Febrl’.  The ABS has also recently been evaluating the implementation of probabilistic record linkage in the software ‘Splink’ as a computationally feasible and methodologically defensible alternative to current methods and processes.

Splink is a new generation software package developed by the UK Ministry of Justice.  Splink, written in Python, implements the Fellegi-Sunter model of record linking and uses Apache Spark to take full advantage of distributed computing infrastructure and hence greatly reduce computing run times.  The package is fully open source.
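To make the Fellegi-Sunter model concrete, the following is a minimal sketch in plain Python of how a record pair can be scored. It does not use Splink's API. The field names, the m- and u-probabilities and the prior are hypothetical values chosen for illustration only; in practice these parameters are estimated from the data.

```python
import math

# Hypothetical parameters for three linking variables (illustrative only).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
PARAMS = {
    "surname": {"m": 0.95, "u": 0.01},
    "date_of_birth": {"m": 0.98, "u": 0.003},
    "postcode": {"m": 0.85, "u": 0.05},
}


def match_weight(agreements):
    """Sum the log2 likelihood ratios over all fields for one record pair."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = PARAMS[field]["m"], PARAMS[field]["u"]
        if agrees:
            total += math.log2(m / u)              # agreement adds evidence for a match
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts evidence
    return total


def match_probability(weight, prior=0.001):
    """Convert a total match weight into a posterior match probability.

    `prior` is an assumed probability that a randomly chosen pair is a match.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


full = match_weight({"surname": True, "date_of_birth": True, "postcode": True})
partial = match_weight({"surname": True, "date_of_birth": True, "postcode": False})
print(round(match_probability(full), 4), round(match_probability(partial), 4))
```

Pairs whose probability exceeds a chosen threshold are accepted as links. The sketch assumes the fields are conditionally independent given match status, which is the standard simplifying assumption of the model.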

The objectives of the ABS’s evaluation of Splink include:

  • Investigating the ability to simplify linkage processes.
  • Assessing the quality of the linked data against that produced by current methods. 
  • Evaluating the costs and run times for some typical ABS linkages, using a secure analytical cloud environment. 
  • Assessing how well Splink can handle data structures frequently encountered in ABS linkages, while maintaining the model assumptions and producing sensible results.  More specifically, data item histories from longitudinal administrative datasets need to be incorporated effectively into the model.  This has recently been a major impediment to using probabilistic linking on a larger scale at the ABS.  Splink enables greater flexibility when defining how linking variables (and associated histories) are compared, offering hope of successful implementation.
  • Assessing the ease with which staff can develop and maintain capability in using the software.
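As an illustration of what comparing a linking variable against its history could look like, the sketch below assigns a graded comparison level rather than a simple agree/disagree outcome. This is not Splink code and not the ABS's configuration: the function name, the level numbering and the example postcodes are all hypothetical.

```python
def compare_with_history(value, current, history):
    """Return a hypothetical comparison level for one linking variable.

     2 = agrees with the current value on the other record
     1 = agrees with one of the other record's historical values
     0 = no agreement
    -1 = value is missing, so the field contributes no evidence
    """
    if value is None:
        return -1
    if value == current:
        return 2
    if value in history:
        return 1
    return 0


# A survey postcode compared with an administrative record's current postcode
# and its previously recorded postcodes.
level = compare_with_history("2617", current="2600", history=["2617", "2601"])
print(level)  # 1
```

In a probabilistic model, each level would carry its own pair of m- and u-probabilities, so agreement with a historical value can count as weaker evidence than agreement with the current value.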

While the evaluation of Splink is still underway, early results are promising. 

For further information, please contact Daniel Fearnley.

Modelling Estimates of Job Vacancies for Micro Businesses

As part of our ongoing commitment to reducing the burden on data providers, the ABS has recently developed a modelling approach to estimate the job vacancies for micro businesses (i.e., those with four or fewer employees) for the quarterly Job Vacancies Survey.

The model is an ordinary least squares regression trained on the eight most recent historical cycles of job vacancies data to estimate the proportion of total job vacancies from micro businesses within each state and ANZSIC industry division. The predictors of this model are:

  • the state and industry division;
  • the total population of live in-scope micro-sized businesses in the quarter;
  • the sampled estimates of the non-micro-sized businesses selected in the survey; and
  • the applicable quarter of the calendar year (to account for any potential seasonal effect).

The modelled proportion is then converted to the number of job vacancies for these micro businesses and added to the total estimate.
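The approach described above can be sketched as follows, using NumPy and made-up data. This is an assumption-laden illustration, not the ABS's production model: the number of states and industries, the data-generating values, the variable names and the dummy coding are all hypothetical. The conversion step assumes that total vacancies equal micro plus non-micro vacancies, so a micro share p implies micro vacancies of N × p / (1 − p), where N is the non-micro estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_INDUSTRIES, N_CYCLES = 3, 4, 8  # hypothetical sizes


def one_hot(codes, n_levels):
    """Dummy-code integer categories, dropping the first level as the baseline."""
    return np.eye(n_levels)[codes][:, 1:]


def design_matrix(state, industry, quarter, micro_pop, non_micro_est):
    """Intercept, state, industry and quarter dummies, plus two numeric predictors."""
    return np.column_stack([
        np.ones(len(state)),
        one_hot(state, N_STATES),
        one_hot(industry, N_INDUSTRIES),
        one_hot(quarter, 4),
        micro_pop,
        non_micro_est,
    ])


# Made-up training data: one row per cycle x state x industry cell.
grids = np.meshgrid(np.arange(N_CYCLES), np.arange(N_STATES),
                    np.arange(N_INDUSTRIES), indexing="ij")
cycle, state, industry = (g.ravel() for g in grids)
quarter = cycle % 4
micro_pop = rng.uniform(1, 50, size=state.size)        # '000 live micro businesses
non_micro_est = rng.uniform(0.5, 20, size=state.size)  # '000 vacancies, non-micro
proportion = (0.10 + 0.02 * state + 0.01 * industry
              + 0.001 * micro_pop + rng.normal(0, 0.01, state.size))

# Ordinary least squares fit of the micro-business share of vacancies.
X = design_matrix(state, industry, quarter, micro_pop, non_micro_est)
beta, *_ = np.linalg.lstsq(X, proportion, rcond=None)

# Predict the share for one new cell, then convert it to a vacancy count.
new_non_micro = 8.0
x_new = design_matrix(np.array([1]), np.array([2]), np.array([3]),
                      np.array([30.0]), np.array([new_non_micro]))
p_hat = float(np.clip(x_new @ beta, 0.0, 0.99)[0])
micro_vacancies = new_non_micro * p_hat / (1 - p_hat)
total_vacancies = new_non_micro + micro_vacancies
print(round(p_hat, 3), round(micro_vacancies, 3), round(total_vacancies, 3))
```

Because a linear model can predict shares outside the unit interval, the sketch clips the prediction before converting it; whether and how the ABS constrains the modelled proportion is not stated in the article.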

A thorough retrospective analysis of the statistical impacts of implementing this model has been conducted on historical data. The analysis indicated that the model remains accurate and robust for up to eight cycles, so the impact on estimates over that period is expected to be minimal. To minimise any potential risk to data quality beyond eight cycles, the ABS will re-collect job vacancies information from these micro businesses every eight quarters to monitor and sustain the model’s performance.

The modelling approach is set to be implemented for the next Job Vacancies release with November 2023 as its reference period. Supplementing the existing surveyed job vacancies estimates with this model is expected to reduce direct data collection from approximately 1,200 businesses each quarter. Across these providers, this amounts to a collective reduction of about 167 hours of response time each quarter.

The ABS will investigate the feasibility of applying this methodology to other surveys to further reduce burden on data providers.

For more information, please contact Jacob Ryan or Summer Wang.

Simulated Data for Enhancing Data Access

In recent years there has been an increase in interest in synthetic microdata. Synthetic data is an artificial recreation of a real dataset that is generated using statistical modelling.

Synthetic data can enable research and development by:

  • Expediting access to microdata where governance processes would otherwise delay work commencing,
  • Providing access to realistic data for modelling and code development for researchers who do not have access to the original microdata,
  • Providing an alternative to Confidentialised Unit Record Files (CURFs), which may be subject to significant utility loss as part of the confidentialisation process.

The ABS has been investigating generating a type of synthetic data called simulated data. Unlike other forms of synthetic data, simulated data is produced without access to the underlying microdata. Instead, a model is fitted that uses various aggregate statistics and distributional properties of the original data to generate the simulated dataset. This simulated dataset can be generated to capture the univariate and at least some of the multivariate distributional properties of the original microdata.

Simulated data offers a high degree of protection as it only uses safe aggregate statistics as inputs, rather than the original microdata. With other forms of synthetic data, we must assess the disclosure risks by comparing the synthetic microdata to the original microdata. This risk assessment process can be computationally intensive depending on the size of the dataset, and at times subjective. The ABS has also found that non-simulated methods such as the iterative classification and regression trees (CART) approach occasionally produce replicates of records from the original data, potentially breaching confidentiality. Typically, such issues do not arise in the simulated data methods currently being explored.
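One simple component of such a risk assessment is checking whether any synthetic record exactly reproduces an original record. The sketch below shows that check on toy data; it is an illustration only, not the ABS's disclosure risk procedure, and the records and function name are made up.

```python
def count_replicates(original, synthetic):
    """Count synthetic records that exactly match at least one original record."""
    original_set = {tuple(record) for record in original}
    return sum(1 for record in synthetic if tuple(record) in original_set)


# Toy records: (age, sex, postcode). All values are made up.
original = [(34, "F", "2617"), (51, "M", "2600"), (29, "F", "2913")]
synthetic = [(34, "F", "2617"), (47, "M", "2600"), (30, "F", "2913")]

print(count_replicates(original, synthetic))  # 1
```

An exact-match check like this requires access to the original microdata and scales with the size of both files, which is part of why the assessment can be costly for large datasets. It also does not detect near-replicates, which need a distance-based comparison.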

One example of a method for producing simulated data is the copula method. Copulas provide a way of simulating draws from a multivariate probability distribution with arbitrary marginal distributions and some known intercorrelation structure. The inputs for this method are the correlation matrix and the parameters for each of the marginal distributions (e.g. the proportion of males if modelling gender with a binomial distribution). As long as safe aggregates are used as inputs, the resulting dataset will also be safe. The utility of the data, while not necessarily as high as other forms of synthetic data, remains suitable for many applications that are not trying to infer population characteristics. Simulated data therefore holds potential for any future ABS synthetic data applications.
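A minimal sketch of the copula idea for two variables is given below, using only the Python standard library. It assumes a Gaussian copula, which is one common choice and not necessarily the one the ABS is exploring. The variables (a binary sex indicator and an exponentially distributed weekly income) and all parameter values are hypothetical aggregates chosen for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the probability transform


def simulate(n, rho, p_male, mean_income, seed=0):
    """Draw n records of (male, income) using a two-variable Gaussian copula.

    rho         : correlation of the latent normal variables (assumed input)
    p_male      : marginal proportion of males (Bernoulli marginal)
    mean_income : marginal mean of income (exponential marginal)
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Step 1: correlated standard normals (Cholesky factor of a 2x2 matrix).
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # Step 2: map to correlated uniforms with the normal CDF.
        u1 = STD_NORMAL.cdf(z1)
        u2 = min(STD_NORMAL.cdf(z2), 1 - 1e-12)  # guard against log(0) below
        # Step 3: apply the inverse CDF of each marginal distribution.
        male = 1 if u1 > 1 - p_male else 0
        income = -mean_income * math.log(1 - u2)
        rows.append((male, income))
    return rows


sample = simulate(20000, rho=0.5, p_male=0.49, mean_income=1200.0)
share_male = sum(male for male, _ in sample) / len(sample)
avg_income = sum(income for _, income in sample) / len(sample)
print(round(share_male, 3), round(avg_income, 1))
```

Only the aggregates passed as arguments enter the simulation, so no original record is ever read. Note that rho is the correlation of the latent normal variables; the correlation observed between the simulated variables will generally differ from it, particularly for discrete marginals.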

For more information, please contact Isaac Norden.

Contact us

Please email methodology@abs.gov.au to:

  • contact authors for further information
  • provide comments or feedback
  • be added to or removed from our electronic mailing list

Alternatively, you can post to:

Methodological News Editor
Methodology Division
Australian Bureau of Statistics
Locked Bag No. 10
Belconnen ACT 2617

The ABS Privacy Policy outlines how the ABS will handle any personal information that you provide to us.

Previous releases

Releases from June 2021 onwards can be accessed under research.

Releases up to March 2021 can be accessed under past releases.
