Treating microdata
Assessing and treating microdata disclosure risks
Microdata and disclosure risks
Microdata files are datasets of unit records, where each record contains information about a person, organisation or other type of unit. This information can include individual responses to questions on surveys, censuses or administrative forms. Microdata files are potentially valuable resources for researchers and policy makers because they contain detailed information about each record. The challenge for data custodians is to strike the right balance between maximising the availability of information for statistical and research purposes and fulfilling their obligations to maintain confidentiality by:
- assessing the context in which the data will be released
- treating the data appropriately for that context
Assessing disclosure risk
The two key risks when releasing microdata are disclosure through:
- spontaneous recognition - where, in the normal course of their research analysis, a data user recognises an individual or organisation without deliberately attempting to identify them (for example, when checking for outliers in a population)
- deliberate attempts at re-identification - looking for a specific individual in the data, or using other research to confirm the identity of an individual who stands out because of their characteristics
As with aggregate data, any analysis output published from the microdata also carries a disclosure risk.
Several methods for assessing microdata disclosure risk can be used:
- cross-tabulate the variables (e.g. look at age by income or marital status) to identify records with unique or remarkable characteristics
- compare sample data with population data to determine whether records with unique characteristics in the sample are in fact unique in the population
- compare potentially risky records to see how similar they are to other records that may provide some protection (a unique 30 year old with certain characteristics may be considered similar to a 31 year old with the same characteristics)
- identify high-profile individuals or organisations known to be in the dataset and who may be easily recognisable
- consider other datasets and publicly available information that could be used to re-identify records, such as through list matching
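The first of these checks, cross-tabulating variables to find records with unique combinations of characteristics, can be sketched in a few lines. This is a minimal illustration only: the records, the field names and the choice of key variables are all hypothetical, and it finds uniqueness in the file itself, not in the population.

```python
from collections import Counter

# Hypothetical unit records; field names and values are illustrative only.
records = [
    {"age": 19, "marital_status": "widowed", "income_band": "low"},
    {"age": 45, "marital_status": "married", "income_band": "mid"},
    {"age": 45, "marital_status": "married", "income_band": "mid"},
    {"age": 79, "marital_status": "widowed", "income_band": "low"},
    {"age": 79, "marital_status": "widowed", "income_band": "low"},
]


def find_unique_records(records, keys):
    """Return records whose combination of `keys` values occurs exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if combos[tuple(r[k] for k in keys)] == 1]


# Cross-tabulate age by marital status and list the cells holding one record.
risky = find_unique_records(records, ["age", "marital_status"])
print(risky)
```

A record flagged this way is only a candidate for treatment. Whether it is genuinely risky still depends on the other checks in the list, such as comparison with population data.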
Factors contributing to the risk of disclosure should also be considered. These factors carry different weight in different contexts. For example, if releasing microdata publicly, data custodians should carefully consider each of the factors below. If the microdata are only to be released in a secure data facility to authorised researchers, some factors may not be applicable.
Level of detail
The more detailed a unit record, the more likely re-identification becomes. Microdata files containing detailed categories or many data items could, through unique combinations of characteristics, reveal enough to enable re-identification.
With microdata output (or aggregate data), the main risk is attribute disclosure which may in turn increase risks of re-identification. In addition, with detailed tables, there is increased risk of disclosure due to differencing attacks or mathematical techniques that undo some or all of the data protections.
Data sensitivity
Some variables may require additional treatment if they are sensitive, such as health, ancestry or criminal information. This treatment may be dictated by legislation and policy as well as confidentiality obligations. This can be a significant balancing act as often the variables that are of most interest to researchers are also sensitive.
Rare characteristics
A disclosure risk may exist if the data contains a rare and remarkable characteristic (or combination of characteristics). This can happen even if there are few data items or categories. This risk depends on how remarkable the characteristic is. For example, a widow aged 19 years is more likely to be identifiable than one aged 79 years. In addition, it is important to consider the rarity of a record from a population perspective. For example, there may only be one 79 year old widow in a sample, but they are not unique in the entire population. The sampling process is a significant contributor to protection of the confidentiality of that individual (a user is unlikely to know which 79 year old widow was selected). It is advisable, however, to protect that single individual in any subsequent outputs that may be publicly released.
Data accuracy
Data accuracy can increase the risk of disclosure. While it is not recommended to produce data with low accuracy as a method to manage this risk, data custodians should be aware that datasets subject to reporting errors or containing out-of-date information may present a lower disclosure risk.
Data age
As a general rule accessing older data is less likely to enable re-identification of an individual or organisation than accessing up to date information. This is particularly true for variables that change over time, such as area of residence or marital status.
Data coverage (completeness)
Individuals or organisations are more easily identifiable if they are known to be in the dataset. Datasets that cover the complete population increase the risk of disclosure because a user knows that all individuals are represented in the dataset. This risk applies to administrative data and population censuses.
Sample data
Data based on surveys or samples taken from a population are generally of lower disclosure risk than full population datasets. This is because there is inherent uncertainty about whether a record belongs to a particular individual or organisation. The risk is not reduced to zero, particularly for records with rare characteristics, which may be re-identifiable in a sample as well as in the population, so sampling should not be the only method of protection.
Data structure
In some cases, how the dataset is structured increases the disclosure risk. Longitudinal datasets (those where individuals are tracked over time, as opposed to datasets that are time-based snapshots of different samples of the population) may have significant disclosure issues. Individuals or organisations that have changes in their characteristics over time are much more likely to be re-identified than those that don't (and in reality very few individuals or organisations don't change characteristics over time). For example, a business that has a relatively constant income over five years but then triples its income for the next three years is more likely to be re-identified than a business with a constant income over the same time frame.
Another structural aspect of datasets is their hierarchical nature. This is where datasets have information at more than one level, such as a person level as well as a family level. The information may be non-disclosive at one level, but be disclosive at a higher level. For example, a count of people with household income of $801-$1,000 per week may be 6. However, the 6 may refer to a single household (2 parents and 4 children), which has effectively disclosed information about all the people in the household.
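The household example can be checked mechanically by counting, for each cell, both the people and the distinct higher-level units that contribute to it. The sketch below is illustrative only: the person records, the identifiers and the minimum of two households per cell are assumptions, not a recommended rule.

```python
from collections import defaultdict

# Hypothetical person-level records: (person_id, household_id, household income band).
people = [
    (1, "H1", "$801-$1,000"), (2, "H1", "$801-$1,000"), (3, "H1", "$801-$1,000"),
    (4, "H1", "$801-$1,000"), (5, "H1", "$801-$1,000"), (6, "H1", "$801-$1,000"),
    (7, "H2", "$1,001-$1,250"), (8, "H3", "$1,001-$1,250"),
]


def households_per_cell(people):
    """Map each income band to (number of people, number of distinct households)."""
    person_counts = defaultdict(int)
    households = defaultdict(set)
    for _person_id, household_id, band in people:
        person_counts[band] += 1
        households[band].add(household_id)
    return {band: (person_counts[band], len(households[band])) for band in person_counts}


summary = households_per_cell(people)

# Assumed rule for illustration: flag cells fed by fewer than two households.
min_households = 2
flagged = [band for band, (_n_people, n_hh) in summary.items() if n_hh < min_households]
print(summary)
print(flagged)
```

Here the $801-$1,000 cell looks safe at the person level (6 people) but is flagged because all 6 belong to one household.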
Incentive
The more an individual or organisation is likely to gain from re-identifying a record, the greater the risk of disclosure. Conversely, the risk of attack is lower when the gains are lower. This is the fundamental principle of trusted access, where researchers share accountability for protecting data confidentiality and where the incentive for them is ongoing authorisation to access information.
Software for assessing disclosure risks
Various software packages can help data custodians assess, detect and treat disclosure risks in microdata. These include:
- Mu-ARGUS: Developed by Statistics Netherlands to protect against spontaneous recognition only (not against list matching).
- SDC-Micro: An R-based open source package, developed by the International Household Survey Network. This program calculates disclosure risk for whole datasets and individual records, and applies treatments.
- SUDA (Special Uniques Detection Algorithm): Developed by the University of Manchester to identify unit records which, due to rare or unique combinations of characteristics, pose a re-identification risk. SUDA looks for uniqueness in the dataset but does not consider whether a particular record is unique in the population as a whole.
Microdata treatment methods
Once a re-identification or disclosure risk is known, it can be addressed through a number of data modification and reduction techniques. These techniques should be applied to only those records or variables judged to be a risk - a judgement that should consider the specific release context:
- detailed microdata may require few of these treatments because it is accessed in a context where people, projects, settings and outputs are controlled (see the Five Safes framework)
- publicly available files may require many of these treatments.
Usually the minimum level of protection for any microdata to be used for statistical or research purposes is removal of direct identifiers such as name and address. Depending on the legal obligations of data custodians and the controls on access (e.g. user authorisation, project assessment, security of access environment, output checking), the removal of direct identifiers alone may be sufficient to protect confidentiality. Further disclosure controls may be required depending on the data release context, especially for public access or open data. Data custodians must carefully assess the microdata to identify records posing a disclosure risk and treat them to prevent re-identification.
Limit the number of variables
This means reducing the number of variables in the dataset. For example, you could remove detailed geographic variables.
Modify cell values
This can be done through rounding or perturbation. The amount of rounding should be relative to the magnitude of the original value. For example, rounding could vary from $1,000 (for personal income) to $1 million (for business income). Perturbation in the context of microdata means adding 'noise' to the values for individual records. For example, someone's true income of $1,000,000 might be perturbed to $1,237,000. In order to maintain totals, that $237,000 could be removed from one or more other records.
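Both ideas can be sketched briefly. The function names, the sample incomes and the choice to spread the offsetting adjustment evenly across the other records are assumptions made for illustration. The text only says the amount could be removed from one or more other records. Whole-dollar integer values are assumed so the total is preserved exactly.

```python
def round_to_base(value, base):
    """Round `value` to the nearest multiple of `base`."""
    return base * round(value / base)


def perturb_preserving_total(values, index, noise):
    """Add `noise` to values[index] and remove it evenly from the other records.

    Assumes whole-dollar integers and at least one other record.
    """
    others = [i for i in range(len(values)) if i != index]
    share, remainder = divmod(noise, len(others))
    result = list(values)
    result[index] += noise
    for pos, i in enumerate(others):
        # The first `remainder` records absorb one extra dollar each.
        result[i] -= share + (1 if pos < remainder else 0)
    return result


# Rounding base chosen relative to magnitude (illustrative values).
personal = round_to_base(52_340, 1_000)          # personal income
business = round_to_base(7_650_000, 1_000_000)   # business income

# Perturb the first record by $237,000 and offset it across the others.
incomes = [1_000_000, 400_000, 350_000, 500_000]
treated = perturb_preserving_total(incomes, index=0, noise=237_000)
print(personal, business, treated)
```

Spreading the offset over several records keeps each individual adjustment smaller, at the cost of altering more records. In practice the noise would be drawn at random, not fixed.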
Rounding or perturbation may also be applied to exact dollar amounts that may otherwise be at risk of list matching. If a record in a population is the only one with an exact income value of $97,899.21, then this record may be at risk if a user also has access to another dataset with income variables. The user could combine both datasets based on exact income values and learn new information about the records. Adjusting all dollar amounts in a dataset by a small amount provides protection. This can be done by grouping the records into clusters and adjusting records within each cluster so that the mean for each cluster remains the same.
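One simple way to adjust values within clusters while keeping each cluster mean unchanged is to replace every value with the mean of its cluster, an approach commonly called microaggregation. The sketch below is one possible instance, not necessarily the method a given custodian uses. The cluster size of 3, the grouping of neighbouring values after sorting, and the sample incomes are all assumptions.

```python
def microaggregate(values, k=3):
    """Replace each value with the mean of its cluster of about `k` similar values."""
    # Sort record positions by value so each cluster holds similar amounts.
    order = sorted(range(len(values)), key=lambda i: values[i])
    clusters = [order[s:s + k] for s in range(0, len(order), k)]
    # Fold a short final cluster into the previous one so none is smaller than k.
    if len(clusters) > 1 and len(clusters[-1]) < k:
        last = clusters.pop()
        clusters[-1].extend(last)
    result = [0.0] * len(values)
    for cluster in clusters:
        mean = sum(values[i] for i in cluster) / len(cluster)
        for i in cluster:
            result[i] = mean
    return result


# Hypothetical incomes, including one exact value at risk of list matching.
incomes = [97_899.21, 52_000.00, 51_500.50, 98_100.79, 53_000.00, 97_000.00]
treated = microaggregate(incomes, k=3)
print(treated)
```

Because each value is replaced by its cluster mean, cluster means and the overall total are preserved (up to floating-point rounding), while no exact original amount remains available for matching.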
Combine categories
Combine categories that are likely to enable re-identification, such as:
- using age ranges rather than single years
- collapsing hierarchical classifications such as industry at higher levels (e.g. mining rather than the more detailed coal mining or nickel ore mining)
- combining small territories with larger ones (e.g. ACT into NSW)
You can also combine categories containing a small number of records so that the identities of individuals in those groups remain protected (e.g. combine use of electric wheelchairs and use of manual wheelchairs). See also the Treating aggregate data section.
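Combining categories amounts to recoding each detailed value to a broader one. A minimal sketch follows. The 10-year range width, the helper names and the mapping tables are assumptions chosen to mirror the examples above.

```python
def age_range(age, width=10):
    """Recode a single year of age to a range such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def combine_category(value, mapping):
    """Return the broader category for `value`, or `value` itself if unmapped."""
    return mapping.get(value, value)


# Hypothetical mappings from detailed to combined categories.
industry_map = {"coal mining": "mining", "nickel ore mining": "mining"}
territory_map = {"ACT": "NSW/ACT", "NSW": "NSW/ACT"}

record = {"age": 34, "industry": "coal mining", "state": "ACT"}
treated = {
    "age": age_range(record["age"]),
    "industry": combine_category(record["industry"], industry_map),
    "state": combine_category(record["state"], territory_map),
}
print(treated)
```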
Top/bottom coding
Collapse top or bottom categories containing small populations. For example, survey respondents aged 85 and over could be coded to an 85+ category as opposed to having individual categories for 85-89, 90-94 and 95-100 years, which might be very sparse.
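Top coding is a one-line recode. In this sketch the threshold of 85 follows the example above, and the function name is an assumption.

```python
def top_code_age(age, threshold=85):
    """Recode ages at or above `threshold` to a single open-ended category."""
    return f"{threshold}+" if age >= threshold else str(age)


ages = [23, 84, 85, 97]
print([top_code_age(a) for a in ages])
```

Bottom coding works the same way with the comparison reversed.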
Data swapping
To hide a record that may be identifiable by its unique combination of characteristics, swap it for another record where some other characteristics are shared. For example, someone in NSW who speaks an uncommon language could have their record moved to Victoria, where the language may be more commonly spoken. This allows characteristics to be reflected in the data without the risk of re-identification.
As a consequence of applying this method, additional changes may also be required. In the previous example, after moving the record from NSW to Victoria, family-related information would also need to be adjusted so that both the original record and the records of family members remain consistent.
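One way to implement a swap is to exchange the geographic value between the risky record and a donor record that shares other characteristics, which leaves the count of records in each state unchanged. The sketch below makes several assumptions: the records, the field names, the matching variables and the rule of taking the first matching donor are all hypothetical, and it does not handle the follow-on adjustments to family members' records described above.

```python
def swap_variable(records, risky_index, variable, match_keys):
    """Swap `variable` between the risky record and the first matching donor.

    A donor must equal the risky record on every key in `match_keys` and
    differ on `variable`. Returns treated copies and leaves the input untouched.
    """
    treated = [dict(r) for r in records]
    risky = records[risky_index]
    for i, donor in enumerate(records):
        if i == risky_index:
            continue
        same_profile = all(donor[k] == risky[k] for k in match_keys)
        if same_profile and donor[variable] != risky[variable]:
            treated[risky_index][variable] = donor[variable]
            treated[i][variable] = risky[variable]
            break
    return treated


# Hypothetical records; "Language X" stands in for an uncommon language.
records = [
    {"state": "NSW", "age_group": "30-39", "sex": "F", "language": "Language X"},
    {"state": "NSW", "age_group": "50-59", "sex": "M", "language": "English"},
    {"state": "VIC", "age_group": "30-39", "sex": "F", "language": "English"},
]
treated = swap_variable(records, risky_index=0, variable="state",
                        match_keys=["age_group", "sex"])
print(treated)
```

If no donor is found, the records are returned unchanged and another treatment would be needed.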
Suppression
If the above methods are insufficient, suppress particular values or remove records that cannot otherwise be protected from the risk of re-identification.
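Suppressing a particular value for records that remain unique can be sketched as follows. The records, the field names, the key variables and the use of None as the suppression marker are assumptions for illustration.

```python
from collections import Counter


def suppress_unique(records, keys, target):
    """Blank out `target` on records whose combination of `keys` is unique."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    treated = []
    for r in records:
        copy = dict(r)
        if combos[tuple(r[k] for k in keys)] == 1:
            copy[target] = None  # suppressed value
        treated.append(copy)
    return treated


# Hypothetical records; the first is unique on age group and marital status.
records = [
    {"age_group": "15-19", "marital_status": "widowed", "income": 31_000},
    {"age_group": "75-79", "marital_status": "widowed", "income": 28_000},
    {"age_group": "75-79", "marital_status": "widowed", "income": 26_500},
]
treated = suppress_unique(records, ["age_group", "marital_status"], "income")
print(treated)
```

Removing the whole record instead of one value would be the stronger alternative mentioned above.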
Understand the relationship between microdata and aggregate data
To ensure that hierarchical data cannot be used to identify higher level contributors (i.e. in aggregate data), the above methods may need to be applied to a greater degree. Alternatively, removing variables or other information relating to the higher levels may be effective. See also Treating aggregate data.