DATA INTEGRATION @ THE NMSU
At the NMSU we are getting excited about data integration. Data integration, commonly known as DI or data linkage, aims to gain more information from the combination of datasets than is available from the datasets separately, without increasing the burden on providers through further survey collections. Linked datasets are particularly appealing because they are often very large, enabling cross tabulations that may not be possible with survey data due to the sample size. Furthermore, where multiple years of data can be linked, cohort analysis can be undertaken to establish common pathways.
The ABS is in a good position to integrate sensitive data from administrative sources because we are governed by the Census and Statistics Act 1905, which prevents the release of information that could be attributed to a specific individual. So, the public can rest assured that their data is in safe hands.
You may notice in our project descriptions the terminology 'Gold Standard' and 'Bronze Standard'. Gold Standard and Bronze Standard probabilistic data integration use exactly the same methodology; the only difference is that Gold Standard uses name and address as linking variables, whilst Bronze Standard does not.
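To illustrate the distinction, here is a minimal sketch of how a probabilistic match weight might be computed under each standard. It assumes a generic Fellegi-Sunter style agreement weighting, which is a common approach to probabilistic linkage and not necessarily the method used by the ABS. The variable names, the `MU` table and all probability values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical m/u probabilities for each linking variable (illustrative only).
# m = P(values agree | the two records belong to the same person)
# u = P(values agree | the two records belong to different people)
MU = {
    "name": (0.95, 0.001),
    "address": (0.90, 0.002),
    "age": (0.97, 0.02),
    "sex": (0.99, 0.5),
    "country_of_birth": (0.98, 0.05),
    "year_of_arrival": (0.95, 0.08),
}

# Gold Standard uses every variable; Bronze Standard drops name and address.
GOLD_VARS = list(MU)
BRONZE_VARS = [v for v in MU if v not in ("name", "address")]


def match_weight(rec_a, rec_b, variables):
    """Sum log2 agreement/disagreement weights over the chosen variables."""
    total = 0.0
    for var in variables:
        m, u = MU[var]
        if rec_a[var] == rec_b[var]:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total


# Two made-up records that agree on every variable.
settlement_rec = {
    "name": "A Example", "address": "1 Sample St", "age": 34,
    "sex": "F", "country_of_birth": "NZ", "year_of_arrival": 2005,
}
census_rec = dict(settlement_rec)

gold_weight = match_weight(settlement_rec, census_rec, GOLD_VARS)
bronze_weight = match_weight(settlement_rec, census_rec, BRONZE_VARS)
print(f"Gold weight: {gold_weight:.2f}, Bronze weight: {bronze_weight:.2f}")
```

In this sketch the same function is used for both standards, and only the list of linking variables changes. Because name and address are highly discriminating, removing them lowers the weight available to separate true matches from chance agreements.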
The NMSU is currently working on two data integration projects, both using extracts from the Department of Immigration and Citizenship's Settlement Database.
Migrants Census Data Enhancement (CDE) Project
The 2011 Migrants Census Data Enhancement (CDE) Project uses both Gold Standard probabilistic linking and Bronze Standard probabilistic linking to combine the 2011 Census of Population and Housing with the Department of Immigration and Citizenship's (DIAC) Settlement Database (SDB). The integration of this data will enhance the statistical and research value of both datasets by enabling the settlement outcomes of migrants who have arrived in Australia since 1 January 2000 to be analysed in the context of their entry conditions (i.e. their visa type, whether a primary or secondary applicant and onshore/offshore status).
The NMSU has now completed the Gold Standard linking (with names and addresses) of the SDB and Census files. In accordance with ABS policy the names and addresses were deleted prior to the shutdown of the 2011 Census Data Processing Centre on 7 December 2012. Bronze Standard linking (without names and addresses) is currently underway, using variables such as age, sex, geographic variables, country of birth and year of arrival. The Gold Standard linked file serves only as a benchmark and is used to assess the quality of the Bronze Standard linking.
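As a rough illustration of how a benchmark file could be used to assess a second set of links, the sketch below treats the Gold Standard links as the reference and reports two simple measures for the Bronze Standard links. This is an assumed approach, not the ABS's documented quality assessment. The function name `link_quality` and the record identifiers are hypothetical.

```python
def link_quality(gold_links, bronze_links):
    """Compare Bronze links against Gold links treated as the benchmark.

    Each link is a (settlement_id, census_id) pair.
    Returns (precision, recall):
      precision = share of Bronze links that also appear in the Gold file
      recall    = share of Gold links that the Bronze linking recovered
    """
    gold = set(gold_links)
    bronze = set(bronze_links)
    agreed = gold & bronze
    precision = len(agreed) / len(bronze) if bronze else 0.0
    recall = len(agreed) / len(gold) if gold else 0.0
    return precision, recall


# Made-up identifiers for illustration only.
gold = {("sdb1", "c1"), ("sdb2", "c2"), ("sdb3", "c3"), ("sdb4", "c4")}
bronze = {("sdb1", "c1"), ("sdb2", "c2"), ("sdb3", "c9")}

precision, recall = link_quality(gold, bronze)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here two of the three Bronze links agree with the benchmark, and two of the four benchmark links were recovered.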
The NMSU is now focussing on the output side of the project. An information paper assessing the quality of the linking is expected to be available on the ABS website in July 2013. It is also planned that a publication and associated data cubes from the Bronze Standard linked file will be published at the beginning of August 2013.
The ABS is also linking a 5% sample of the 2011 Census data to the 2006 Census data using Bronze Standard probabilistic linking to create a 5% Statistical Longitudinal Census Dataset (SLCD). The NMSU will then be able to enhance the SLCD with information from the Migrants CDE Project Bronze Standard linked dataset. At this stage we anticipate output from this longitudinal linkage to be available in early 2014; however, we will keep you updated about our progress in future newsletters.
Migrant Personal Income Tax (PIT) Data Integration (DI) project - Feasibility phase
The Migrant PIT DI project seeks to establish whether an extract of the Department of Immigration and Citizenship (DIAC) Settlement Database (SDB) can be integrated with Personal Income Tax (PIT) data from the Australian Taxation Office (ATO) using Gold Standard probabilistic methods. It has been a long haul to get this project rolling (it was first conceptualised back in 2006), so we are really pleased that we are now all set to commence this phase of the project. The linking process is set to begin in the next month, with a research paper scheduled for release via the ABS website later this year.
A successful linkage of these two data sources could lead to the production of new longitudinal statistics on the economic outcomes of recent permanent migrants who arrived on or after 1 January 2000. The linked dataset will be unique in that it will contain many disaggregated income variables not collected elsewhere for these recent migrants, including own unincorporated business income, investment income, and superannuation and annuity income. For more information see the project listing on the Public Register of Data Integration Projects.