This issue contains three articles:
- Modernising Probabilistic Record Linkage at the ABS
- Modelling Estimates of Job Vacancies for Micro Businesses
- Simulated Data for Enhancing Data Access
Features important work and developments in ABS methodologies
Modernising Probabilistic Record Linkage at the ABS
Record linkage is a computationally intensive activity, and National Statistical Offices are increasingly relying on it to support the production of official statistics and data products. For the past 18 years, the ABS has built and maintained capability in both deterministic and probabilistic record linkage. Where feasible, our preference is to use probabilistic linking because it produces better and more defensible linkage outcomes. However, the size and complexity of the data being linked mean that the linkage run time is often prohibitively long, and the model assumptions are difficult to maintain. For these reasons, deterministic linkage is being used for most production linkages in the ABS.
For deterministic linking, the ABS has been using an internally developed SAS macro called ‘D-MAC’. For probabilistic linking, we have been using the open-source software ‘Febrl’. The ABS has also recently been evaluating the implementation of probabilistic record linkage in the software ‘Splink’ as a computationally feasible and methodologically defensible alternative to current methods and processes.
Splink is a new-generation software package developed by the UK Ministry of Justice. Splink, written in Python, implements the Fellegi-Sunter model of record linkage and uses Apache Spark to take full advantage of distributed computing infrastructure and hence greatly reduce computing run times. The package is fully open source.
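To make the Fellegi-Sunter model more concrete, the following is a minimal sketch of how a match weight is built up for one pair of records. It is written in plain Python and does not use Splink's API. The field names, the m- and u-probabilities and the prior are hypothetical values chosen for illustration only.

```python
import math

# Hypothetical m- and u-probabilities for three comparison fields.
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
PARAMS = {
    "surname": {"m": 0.95, "u": 0.01},
    "birth_year": {"m": 0.98, "u": 0.05},
    "postcode": {"m": 0.90, "u": 0.02},
}


def match_weight(agreement, params=PARAMS):
    """Sum the log2 likelihood ratios over all comparison fields."""
    total = 0.0
    for field, agrees in agreement.items():
        m, u = params[field]["m"], params[field]["u"]
        if agrees:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def match_probability(weight, prior):
    """Convert a match weight to a posterior match probability (Bayes' rule)."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


# One record pair agreeing on every field, another disagreeing on birth year.
all_agree = {"surname": True, "birth_year": True, "postcode": True}
mixed = {"surname": True, "birth_year": False, "postcode": True}

for label, pattern in [("all agree", all_agree), ("mixed", mixed)]:
    w = match_weight(pattern)
    # The prior of 0.001 is an assumed share of true matches among candidate pairs.
    print(label, round(w, 2), round(match_probability(w, prior=0.001), 4))
```

In practice the m- and u-probabilities are estimated from the data rather than fixed by hand, and the pairs compared are restricted by blocking rules to keep the computation manageable.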
The objectives of the ABS’s evaluation of Splink include:
While the evaluation of Splink is still underway, early results are promising.
For further information, please contact Daniel Fearnley.
Modelling Estimates of Job Vacancies for Micro Businesses
As part of our ongoing commitment to reducing the burden on data providers, the ABS has recently developed a modelling approach to estimate job vacancies for micro businesses (i.e., those with four or fewer employees) in the quarterly Job Vacancies Survey.
The model is an ordinary least squares regression trained on the eight most recent historical cycles of job vacancies data to estimate the proportion of total job vacancies from micro businesses within each state and ANZSIC industry division. The predictors of this model were:
The modelled proportion is then converted to the number of job vacancies for these micro businesses and added to the total estimate.
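The two steps described above, fitting an ordinary least squares regression on eight cycles and converting the modelled proportion into a number of vacancies, can be sketched as follows. This is an illustration under stated assumptions, not the ABS implementation: the two predictors are hypothetical placeholders, all data values are made up, and the conversion formula assumes the modelled proportion p is the micro business share of the total, so that micro vacancies equal V × p / (1 − p) when V is the surveyed estimate for all other businesses.

```python
def ols_fit(X, y):
    """Fit OLS coefficients by solving the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    """Linear prediction for one row of predictor values."""
    return sum(c * x for c, x in zip(coefs, row))


def micro_vacancies(proportion, surveyed_vacancies):
    """Assumed conversion: micro share p of the total, surveyed estimate V
    for all other businesses, so micro = V * p / (1 - p)."""
    return surveyed_vacancies * proportion / (1 - proportion)


# Eight made-up historical cycles for a single state-by-industry cell.
# Columns: intercept, hypothetical predictor 1, hypothetical predictor 2.
X = [
    [1.0, 0.10, 0.0], [1.0, 0.12, 1.0], [1.0, 0.11, 0.0], [1.0, 0.13, 1.0],
    [1.0, 0.12, 0.0], [1.0, 0.14, 1.0], [1.0, 0.13, 0.0], [1.0, 0.15, 1.0],
]
# Made-up proportions of total job vacancies held by micro businesses.
y = [0.101, 0.128, 0.109, 0.133, 0.117, 0.142, 0.124, 0.151]

coefs = ols_fit(X, y)
p_hat = predict(coefs, [1.0, 0.14, 0.0])     # modelled proportion, next cycle
surveyed = 800                                # made-up surveyed estimate
micro = micro_vacancies(p_hat, surveyed)
print(round(p_hat, 4), round(micro, 1), round(surveyed + micro, 1))
```

Solving the normal equations directly keeps the sketch dependency-free. A production implementation would more likely rely on a statistical library, which also provides diagnostics such as standard errors.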
A thorough retrospective analysis of the statistical impacts of implementing this model has been conducted on historical data. The analysis indicated that the impact will be minimal for eight cycles, as the model remains accurate and robust over that period. To minimise any potential risk to data quality after eight cycles, the ABS will re-collect job vacancies information from these micro businesses every eight quarters to monitor and sustain the model’s performance.
The modelling approach is set to be implemented in the next Job Vacancies release, which has November 2023 as its reference period. Supplementing the existing surveyed job vacancies estimates with this model is expected to reduce direct data collection from approximately 1,200 businesses each quarter. Collectively, this saves these providers about 167 hours of response time each quarter.
The ABS will investigate the feasibility of applying this methodology to other surveys to further reduce burden on data providers.
For more information, please contact Jacob Ryan or Summer Wang.
Simulated Data for Enhancing Data Access
In recent years there has been an increase in interest in synthetic microdata. Synthetic data is an artificial recreation of a real dataset that is generated using statistical modelling.
Synthetic data can enable research and development by:
The ABS has been investigating the generation of a type of synthetic data called simulated data. Unlike other forms of synthetic data, simulated data is produced without access to the underlying microdata. Instead, a model is fitted that uses various aggregate statistics and distributional properties of the original data to generate the simulated dataset. This simulated dataset can be generated to capture the univariate and at least some of the multivariate distributional properties of the original microdata.
Simulated data offers a high degree of protection as it only uses safe aggregate statistics as inputs, rather than the original microdata. With other forms of synthetic data, we must assess the disclosure risks by comparing the synthetic microdata to the original microdata. This risk assessment process can be computationally intensive depending on the size of the dataset, and at times subjective. The ABS has also found that non-simulated methods such as the iterative classification and regression trees (CART) approach occasionally produce replicates of records from the original data, potentially breaching confidentiality. Typically, such issues do not arise in the simulated data methods currently being explored.
One example of a method for producing simulated data is the copula method. Copulas provide a way of simulating draws from a multivariate probability distribution with arbitrary marginal distributions and some known intercorrelation structure. The inputs for this method are the correlation matrix and the parameters for each of the marginal distributions (e.g. the proportion of males if modelling gender with a binomial distribution). As long as safe aggregates are used as inputs, the resulting dataset will also be safe. The utility of the data, while not necessarily as high as other forms of synthetic data, remains suitable for many applications that are not trying to infer population characteristics. Simulated data therefore holds potential for any future ABS synthetic data applications.
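As an illustration of the copula method, the sketch below uses a Gaussian copula to simulate two correlated variables from aggregate inputs only. It is a simplified example rather than the ABS method: the choice of a Gaussian copula, the variable names, the exponential marginal for the continuous variable and all parameter values are assumptions made for this example.

```python
import math
import random
from statistics import NormalDist


def simulate(n, rho, p_male, mean_amount, seed=0):
    """Simulate n records with a binary and a continuous variable.

    Inputs are aggregates only: a latent correlation (rho), the proportion
    of males (p_male) and the mean of a hypothetical continuous variable
    assumed to follow an exponential distribution (mean_amount).
    """
    rng = random.Random(seed)
    phi = NormalDist().cdf  # standard normal CDF
    records = []
    for _ in range(n):
        # Step 1: draw two standard normals with correlation rho
        # (a 2x2 Cholesky factor written out by hand).
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = e1
        z2 = rho * e1 + math.sqrt(1 - rho ** 2) * e2
        # Step 2: map each normal to a uniform on (0, 1).
        u1, u2 = phi(z1), phi(z2)
        u2 = min(u2, 1 - 1e-12)  # guard against log(0) below
        # Step 3: apply the inverse CDF of each marginal distribution.
        sex = "male" if u1 > 1 - p_male else "female"
        amount = -mean_amount * math.log(1 - u2)
        records.append((sex, amount))
    return records


data = simulate(n=10000, rho=0.5, p_male=0.49, mean_amount=100.0, seed=1)
share_male = sum(1 for sex, _ in data if sex == "male") / len(data)
mean_amount = sum(amount for _, amount in data) / len(data)
print(round(share_male, 3), round(mean_amount, 1))
```

Note that rho is the correlation of the latent normal variables. The correlation between the simulated variables themselves will generally differ, so in practice the latent correlation may need to be calibrated to reproduce a target aggregate.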
For more information, please contact Isaac Norden.
Please email methodology@abs.gov.au to:
Alternatively, you can post to:
Methodological News Editor
Methodology Division
Australian Bureau of Statistics
Locked Bag No. 10
Belconnen ACT 2617
The ABS Privacy Policy outlines how the ABS will handle any personal information that you provide to us.
Releases from June 2021 onwards can be accessed under research.
Releases up to March 2021 can be accessed under past releases.