Understanding re-identification
Re-identification in aggregate data and microdata, managing re-identification risk
What is re-identification
Re-identification occurs when the identity of a person or organisation is determined even though directly identifying information has been removed. This may be done using other publicly or privately held information about the individual or organisation. It is a type of disclosure, or breach of confidentiality, that can occur when someone has access to either aggregate data (such as tables) or microdata (unit record data). This section considers the risk of re-identification; the other main disclosure risk is attribute disclosure. The risk of re-identifying an individual is likely to increase if an attribute about them is revealed, for example a particular level of income that is common to a group of 15-18 year olds.
It is important that data, and especially unit record data, can be accessed securely and used effectively for research and policy making. Providing data in open environments is an important part of the Australian Government Public Data Policy Statement. However, open data may not always be the most appropriate manner for providing data for research, particularly when the requirements for utility conflict with confidentiality.
For datasets that cannot be made open and accessible, strategies to manage confidentiality and disclosure risks when providing access to the data should consider:
- how the dataset could be used to re-identify an individual or organisation
- whether information available elsewhere could be combined with the dataset to re-identify a person or organisation
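One way to start assessing how a dataset could be used to re-identify someone is to count the records that are unique on a combination of characteristics an outsider might already know. The sketch below is illustrative only: the records, the field names and the helper `unique_records` are hypothetical, and a real assessment would also consider the release context.

```python
from collections import Counter

# Hypothetical de-identified unit records: no names, but three
# characteristics that an outsider might already know.
records = [
    {"age_group": "15-18", "postcode": "2600", "occupation": "student"},
    {"age_group": "15-18", "postcode": "2600", "occupation": "student"},
    {"age_group": "35-44", "postcode": "2600", "occupation": "surgeon"},
    {"age_group": "35-44", "postcode": "2601", "occupation": "teacher"},
    {"age_group": "35-44", "postcode": "2601", "occupation": "teacher"},
]

def unique_records(rows, keys):
    """Return the rows whose combination of `keys` values occurs only once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]

at_risk = unique_records(records, ["age_group", "postcode", "occupation"])
print(len(at_risk), "of", len(records), "records are unique on these characteristics")
```

A record that is unique in the dataset is not necessarily unique in the population, so a count like this is a starting point for review rather than a measure of actual risk.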
Data sources and methods
Data sources and analytical methods may also increase disclosure risk, and this risk needs to be carefully managed.
Administrative data
- contains direct identifiers such as name, address and Tax File Number that allow an agency to identify the people accessing a government service or program
- is usually collected from everyone who accesses a service and may cover a large proportion of the population
- even when the directly identifying information is removed, people are still at higher risk of being re-identified from other information held about them when they are known to be in a dataset, or when the dataset is large
Integrated datasets
- multiple information sources about people and organisations can be combined (integrated), forming rich and deep repositories of information and presenting opportunities for detailed analysis
- carry re-identification risks similar to those of administrative datasets (the larger range of information for each record may increase the risk of re-identification)
Customer information
- businesses collect customer information through registration processes and reward schemes, holding databases containing detailed information on user characteristics and behaviour
- knowledge of these characteristics may be combined with information in a released dataset to re-identify an individual or business
Social media
- many people are willing to share their private information for social purposes, with vast and increasing amounts of personal information available online
- publicly available information may be combined with information in a released dataset to re-identify an individual or business
Big Data analytics
- while new technologies make it possible to produce and store vast amounts of transactional data, advanced techniques also enable Big Data to be summarised, analysed and presented in new ways
- computer systems are increasingly able to draw together disparate data to discover patterns and trends
- research is being conducted into how new technologies can also create modern data treatment processes that match the scale of Big Data and balance the dual goals of privacy protection and analytical utility
Data custodians have ethical and legal responsibilities to actively manage the re-identification risks of their data collections.
Managing the risk of re-identification
Re-identification may occur through a deliberate attack (where a user consciously tries to determine the identity of an individual or organisation) or it may occur spontaneously (where a user inadvertently thinks they recognise an individual or organisation without a deliberate attempt to identify them). As the amount of data collected and released by government increases and technologies advance, re-identification risk management should be an iterative process of assessment and evaluation.
Two broad complementary approaches exist for managing re-identification risks:
- control the context of the data release - important when managing re-identification risks as it allows for more detailed data to be made available to approved researchers in a safe manner
- treat the data - decisions about the level of data treatment required can only be made after determining the release context
The release context includes:
- the audience who will have access to the data
- the purpose for which the data will be used
- the release environment
The level of data treatment appropriate for authorised access in a controlled environment is unlikely to be sufficient for open and unrestricted public access. If one or more aspects of the context change, the disclosure risks should be reassessed to ensure data subjects remain unlikely to be re-identified.
Re-identification in aggregate data
There can be a risk of disclosure even though data is aggregated (grouped into categories or with combined values). This is because publicly or privately held information may be used to identify one or more contributors to a cell in a table.
Established techniques such as cell suppression and data perturbation exist to protect the confidentiality of aggregate (or tabular) data and prevent re-identification. However, with the increased volume of aggregate data available through electronic channels (such as machine-to-machine distribution) and at finer levels of geography, the risk of re-identification is increased and poses challenges for data custodians.
Although commonly used by many agencies, the application of cell suppression may be insufficient to prevent re-identification. As a response to this challenge, the ABS' TableBuilder service applies a perturbation algorithm to automatically protect privacy in user-specified tables. This perturbation algorithm leads to some loss of utility, but maintains a very high level of confidentiality.
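The difference between the two techniques can be sketched on a small table. The example below is a simplified illustration, not the TableBuilder algorithm: the counts, the threshold of 5, the noise range and the helper names are all assumptions. It also shows one reason suppression alone may be insufficient: when a total is published, a single suppressed cell can be recovered by subtraction.

```python
import random

# Hypothetical table of counts by region; values are made up for illustration.
counts = {"North": 120, "South": 2, "East": 45, "West": 33}
total = sum(counts.values())  # assumed to be published alongside the table

THRESHOLD = 5  # assumed minimum publishable cell count

def suppress(table, threshold):
    """Hide any cell whose count is below the threshold."""
    return {k: (v if v >= threshold else None) for k, v in table.items()}

def perturb(table, max_noise, rng):
    """Add a small random integer to every cell, never going below zero."""
    return {k: max(0, v + rng.randint(-max_noise, max_noise)) for k, v in table.items()}

suppressed = suppress(counts, THRESHOLD)

# With one suppressed cell and a published total, the hidden value
# can be recovered by subtracting the visible cells from the total.
recovered = total - sum(v for v in suppressed.values() if v is not None)

perturbed = perturb(counts, max_noise=2, rng=random.Random(0))
print(suppressed, recovered, perturbed)
```

In the perturbed table every cell is published, but no cell can be taken as exact, which is the loss of utility traded for confidentiality.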
Re-identification in microdata
Preventing re-identification of people or organisations from microdata requires one or both of:
- controlling the context:
  - the manner in which data is released (on a continuum ranging from open data to highly controlled situations such as access in a locked room)
- treating the data:
  - at a minimum, removing direct identifiers such as name and address
  - in most cases, applying further statistical treatment depending on the release context
  - for open data, appropriate and sufficient data treatment eliminates the need to control the context (but this is at the expense of data utility)
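As a minimal sketch of treating the data, the example below removes the direct identifiers named above and then applies one simple further treatment, replacing exact age with an age band. The record, the field names, the 10-year band width and the helper `treat` are hypothetical; the treatment actually required depends on the release context.

```python
DIRECT_IDENTIFIERS = {"name", "address"}  # the identifiers named in the text

def treat(record, band_width=10):
    """Drop direct identifiers and replace exact age with an age band."""
    treated = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    lower = (treated.pop("age") // band_width) * band_width
    treated["age_band"] = f"{lower}-{lower + band_width - 1}"
    return treated

# Hypothetical unit record.
raw = {"name": "A. Citizen", "address": "1 Example St", "age": 37, "income": 82000}
print(treat(raw))
```

Banding reduces detail for every record, which illustrates why heavier treatment comes at the expense of data utility.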
The following factors should be considered when deciding whether and under what contextual controls data will be released.
Private knowledge
Users looking at a dataset are likely to possess private information about individuals or organisations represented in the dataset (such as a neighbour or family member). In these cases, the private information could enable them to re-identify someone in the dataset.
Strategies to manage this risk include:
- releasing a sample, rather than the entire dataset
- providing access only to authorised users who give a binding undertaking not to re-identify any individual or organisation
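Releasing a sample means a user cannot be certain that a person they know is in the released data at all. A simple random sample without replacement could be drawn as sketched below; the 10% fraction, the seed and the helper `release_sample` are assumptions for illustration, and real sample designs are often more complex.

```python
import random

def release_sample(rows, fraction, seed=None):
    """Return a simple random sample of `rows`, drawn without replacement."""
    rng = random.Random(seed)
    k = round(len(rows) * fraction)
    return rng.sample(rows, k)

population = [{"record": i} for i in range(1000)]  # hypothetical unit records
sample = release_sample(population, fraction=0.1, seed=42)
print(len(sample), "records released out of", len(population))
```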
Public knowledge
Users may draw on publicly available information (such as information about a well-known person or business) when examining a dataset. For example, if a dataset containing information on businesses with very high turnover is released (even to a restricted group of researchers), the researchers may be able to re-identify large public companies that hold monopolies in certain industries.
Strategies to manage this risk include:
- releasing a sample, rather than the entire dataset
- providing access only to authorised users who give a binding undertaking not to re-identify any individual or organisation
- modifying the data to mask high-profile publicly-known individuals or organisations
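One possible way to mask a high-profile record is top-coding, where values above a cap are replaced by the cap so that an extreme value no longer stands out. The text does not name a specific technique, so this is an assumed example; the data, the field names, the cap of 100 and the helper `top_code` are hypothetical.

```python
def top_code(rows, field, cap):
    """Return copies of `rows` with `field` capped at `cap`."""
    return [{**row, field: min(row[field], cap)} for row in rows]

# Hypothetical business records, turnover in millions.
businesses = [
    {"industry": "retail", "turnover_m": 12},
    {"industry": "retail", "turnover_m": 48},
    {"industry": "telecom", "turnover_m": 9500},  # extreme value stands out
]
masked = top_code(businesses, "turnover_m", cap=100)
print(masked)
```

Top-coding alone may not be enough: in this example the only telecom business is still distinctive by industry, so other fields may also need treatment.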
List matching
List matching refers to a user linking records in a dataset with information from other datasets. This is done by matching either identifiers or characteristics that are common to both datasets. There is a potentially increased risk of re-identification simply because the combined data increases the amount of detail available for each unit record (a person or organisation).
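The mechanics of list matching on common characteristics can be sketched as follows. Both datasets, the field names and the helper `match` are hypothetical; the point is only that a released record without a name can acquire one when it agrees with an external list on shared characteristics.

```python
# Hypothetical released microdata (no names) and an external list (with names).
released = [
    {"rec": "r1", "birth_year": 1980, "postcode": "2600", "diagnosis": "asthma"},
    {"rec": "r2", "birth_year": 1975, "postcode": "2612", "diagnosis": "diabetes"},
]
external = [
    {"name": "Person A", "birth_year": 1975, "postcode": "2612"},
    {"name": "Person B", "birth_year": 1990, "postcode": "2600"},
]

def match(left, right, keys):
    """Pair each left row with every right row that agrees on all `keys`."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    return [
        {**a, **b}
        for a in left
        for b in index.get(tuple(a[k] for k in keys), [])
    ]

linked = match(released, external, ["birth_year", "postcode"])
print(linked)  # record r2 is now attached to a name
```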
Strategies to manage this risk include:
- using secure data facilities to control which datasets are available to authorised researchers at any one time
- extracting subsets of the microdata to provide users with only the data they require
- using unique randomised record identifiers for each published dataset
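The last strategy can be illustrated with a short sketch: each published dataset receives fresh random record identifiers, so the identifier itself cannot be used to match the same unit across releases. The field name `record_id` and the helper `assign_release_ids` are assumptions for illustration; it uses Python's standard `secrets` module to generate the identifiers.

```python
import secrets

def assign_release_ids(rows, id_field="record_id"):
    """Return copies of `rows` with `id_field` replaced by a random identifier."""
    out = []
    for row in rows:
        new_row = dict(row)
        # 8 random bytes as 16 hex characters, unrelated to the source identifier
        new_row[id_field] = secrets.token_hex(8)
        out.append(new_row)
    return out

# Hypothetical source records held by the data custodian.
source = [{"record_id": 1001, "state": "ACT"}, {"record_id": 1002, "state": "NSW"}]
release_a = assign_release_ids(source)
release_b = assign_release_ids(source)
print(release_a[0]["record_id"], release_b[0]["record_id"])
```

Random identifiers only remove one matching key; records may still be matched on shared characteristics, which is why this strategy sits alongside the others in the list.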
The process of matching characteristics that are common to datasets for linking purposes is undertaken legitimately as part of securely managed data integration processes. The ABS, the Australian Institute of Health and Welfare (AIHW) and the Australian Institute of Family Studies (AIFS) are formally accredited Commonwealth Data Integrating Authorities. Bringing data together in this way is an important method of extending and enhancing research. Accredited Data Integrating Authorities have procedures and controls in place to perform this linking function safely.