Glossary
Data confidentiality guide
Explanation of statistical and confidentiality terms used in this guide
Released
8/11/2021
Administrative data
- information (including personal information) collected by agencies for the administration of programs, policies or services
- can be microdata (unit-record data) or macrodata (aggregate data)
- may be used for statistical or research purposes
Aggregate data
- produced by grouping information into categories and combining values within these categories
- example: a count of the number of people of a particular age (obtained from the question 'In what year were you born?').
- also known as tabular data or macrodata
- aggregate data is often presented in tables
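As an illustration only, the birth-year example above can be sketched in Python. The function name `count_by_age`, the reference year and the sample birth years are hypothetical and not taken from this guide.

```python
from collections import Counter

def count_by_age(birth_years, reference_year):
    """Aggregate unit records (birth years) into a count of people per age."""
    # Simplification: age is reference_year minus birth year, ignoring birthdays.
    return Counter(reference_year - year for year in birth_years)

# Four hypothetical responses to 'In what year were you born?'
counts = count_by_age([1990, 1990, 1985, 2000], reference_year=2021)
```

Each key of `counts` is an age category and each value is the number of people in that category, which is the aggregate (tabular) form of the original unit records.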
Attribute disclosure
- occurs when previously unknown information is revealed about an individual, group or organisation (without necessarily formally re-identifying them)
Big Data
- extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions
Cell dominance rule
- used to assess whether a table cell may enable re-identification or attribute disclosure
- also called the cell concentration rule
- finds cells where a small number of data providers contribute a large percentage of the cell total
- if a cell fails this rule, further investigation or data treatment is needed to ensure the attributes of predominant data providers are not disclosed
- aggregate data treatment method
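A minimal sketch of one common way to express this rule, often called an (n, k) rule: a cell fails if its n largest contributors account for more than k% of the cell total. The function name and the default values n=2 and k=90 are assumptions for illustration; this guide does not prescribe them.

```python
def fails_dominance_rule(contributions, n=2, k=90.0):
    """Return True if the n largest contributors exceed k% of the cell total.

    Assumes contributions are non-negative magnitudes (for example, turnover).
    The values of n and k are illustrative; the data custodian sets them.
    """
    total = sum(contributions)
    if total <= 0:
        return False  # an empty or zero cell has no dominant contributor
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total > k

dominated = fails_dominance_rule([900, 50, 30, 20])    # top two hold 95%
balanced = fails_dominance_rule([300, 300, 200, 200])  # top two hold 60%
```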
Confidentiality
- protecting the secrecy and privacy of information collected from individuals and organisations, and ensuring that no data is released in a manner likely to enable their identification
Confidentiality rules
- applied to all cells in an aggregate dataset in order to identify elements that pose a risk of disclosure
- frequency rule and cell dominance rule are two common rules applied to aggregate data
- microdata treatment rules may also be applied to unit record data
Data access context
- the environment and manner in which data is released
- data custodians need to consider who will have access to the data, the purpose for which the data will be used and the release environment itself (whether physical, IT or legal)
Data custodian
- organisation or agency responsible for the collection, management and release of data
- they have legal and ethical obligations to keep the information they are entrusted with confidential
Data laboratory
- secure data environment where researchers can perform detailed analysis of microdata
- also known as secure research centres
- can be accessed virtually (remotely) or on-site
- ABS data laboratory is called the DataLab
Data modification
- technique used to treat data to limit re-identification or other disclosure
- changes all non-zero cells by a small amount while aiming to maintain the table's overall usefulness
- examples include rounding and perturbation
Data provider
- an individual, household, business or other entity that supplies data for statistical or administrative purposes
- also known as a respondent
Data reduction
- a technique for statistical disclosure control
- methods to control or limit the amount of detail available in a table to prevent individuals or organisations from being re-identified
- methods include combining variables or categories, or suppressing (removing) information in unsafe cells
- can be applied to aggregate data or microdata
Data rounding
- slightly altering cells in a table to make them all divisible by the same number
- common numbers used for rounding are 3, 5 or 10
- may be random or controlled
- prevents the original data values from being known with certainty while ensuring the usefulness of the data is not significantly affected
- aggregate data treatment method
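The sketch below shows one possible form of random rounding, assuming an unbiased scheme in which a value is rounded up with probability proportional to its remainder. The function name `random_round` and the choice of scheme are assumptions, not a method specified in this guide.

```python
import random

def random_round(value, base=3, rng=random):
    """Randomly round a non-negative integer count to a multiple of `base`."""
    remainder = value % base
    if remainder == 0:
        return value  # already a multiple of the base, left unchanged
    lower = value - remainder
    # Round up with probability remainder/base so the expected value is unchanged.
    if rng.random() < remainder / base:
        return lower + base
    return lower

rng = random.Random(0)  # seeded only to make the illustration repeatable
rounded = [random_round(v, base=3, rng=rng) for v in [1, 2, 4, 7, 9]]
```

Every rounded value is a multiple of 3 and differs from the original by less than 3, so a user cannot tell with certainty which original value produced it.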
Data swapping
- process of moving the values of one or more variables from one microdata record to another record, so it no longer poses a disclosure risk
- microdata treatment method
Differencing or differencing attack
- where someone with access to multiple tables can deduce the true values of cells that had been modified or suppressed
- individual tables may be non-disclosive, but when the tables are compared, the difference between cells across the tables may be disclosive
- example: if a user accessed a table with information on 20-25 year olds and then accessed a subsequent table with information on 20-24 year olds, the difference between the two tables will reveal information about 25 year olds only
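The age-group example above can be sketched as a simple subtraction of matching cells. The category names and counts are invented for illustration.

```python
def difference_tables(wider, narrower):
    """Subtract matching cells of two tables held as {category: count} dicts."""
    return {key: wider[key] - narrower[key] for key in wider}

# Hypothetical counts from two separately released tables.
ages_20_to_25 = {"employed": 120, "unemployed": 14}
ages_20_to_24 = {"employed": 101, "unemployed": 13}

age_25_only = difference_tables(ages_20_to_25, ages_20_to_24)
```

Neither table contains a small cell, yet the difference shows that exactly one 25 year old is unemployed, which may be disclosive.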
Direct identification
- when the data includes an identifier (such as name or address) that can be used, without any additional information, to establish the identity of a person, group or organisation
Disclosure or disclosive
- a breach of confidentiality, where a person, group or organisation is identified or has previously unknown characteristics (attributes) associated with them as a result of releasing data
Disclosure control
- the process of limiting the risk of an individual or organisation being directly or indirectly identified
- can be via statistical (data focused) or non-statistical (data context-focused) techniques or processes
Disclosure risk management
- in the context of confidentiality, determining whether released datasets (or sections of released datasets) constitute a risk of disclosure or re-identification, and then putting in place controlling mechanisms to mitigate those risks
- the Five Safes framework provides a way of assessing risk within the constraints provided by policies and legislation
Five Safes framework
- multi-dimensional approach to managing disclosure risk, consisting of safe people, safe projects, safe settings, safe data and safe outputs
- each safe is considered both individually and in combination to determine disclosure risks and to put in place mitigation strategies for releasing and accessing data
Frequency rule
- sets a particular value for the minimum number of unweighted contributors (such as people, households or businesses) to any cell in the table
- cells with very few contributors (small cells) may pose a disclosure risk
- common threshold values are 3, 5 or 10
- if a cell fails this rule, further investigation or action is needed to ensure the cell is adequately protected
- also called the threshold rule
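A minimal sketch of the frequency rule, assuming a table held as a mapping from cell label to unweighted contributor count. The function name, the threshold of 5 and the decision to ignore empty cells are assumptions for illustration; a data custodian may treat zero cells differently.

```python
def cells_failing_frequency_rule(table, threshold=5):
    """Return cells with at least one but fewer than `threshold` contributors."""
    return {cell: n for cell, n in table.items() if 0 < n < threshold}

# Hypothetical unweighted counts per cell.
table = {"A": 12, "B": 3, "C": 0, "D": 5}
flagged = cells_failing_frequency_rule(table, threshold=5)
```

Only cell "B" is flagged here; it would then need further investigation or treatment before release.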
Hierarchical data
- datasets that contain more than one level
- example: a dataset containing unit records with information about individual people (such as personal income) may also contain information about the families these people are part of (such as household income)
Identified data
- data that includes information that refers directly to an individual or organisation, such as name, address, ABN or Medicare number
Identifier(s)
- information that directly establishes the identity of an individual or organisation
- examples include name, address, driver's licence number, Medicare number and ABN
- also known as direct identifiers
Indirect identification
- occurs when the identity of an individual, group or organisation is disclosed due to a unique combination of characteristics (that are not direct identifiers) in a dataset
- example: a famous individual may be identifiable on the basis of their age, sex, occupation, geography and income
List matching
- where a user compares records from one dataset with records from another in an attempt to find records that have corresponding information, so that it may be concluded that the two records belong to the same individual
- this is a clear breach of the Privacy Act and other legislation governing data access where this is done in an attempt to re-identify that individual
Macrodata
- see aggregate data
Microdata
- datasets of unit records where each record contains information about a person, organisation or other type of unit
- can include individual responses to a census, survey or administrative form
Open data
- Data that is made available with no restriction on access or use (excluding possible copyright or licensing requirements). In terms of the Five Safes framework, the only control is on safe data.
- Data on data.gov.au is open data as any researcher can download files
- Data underlying ABS TableBuilder is not considered open data as there is a safe setting control - users cannot directly access the underlying microdata
- Aggregate output (tables, graphs or maps) from TableBuilder is open data
Outlier
- an unusual record that stands out from the rest of the population or sample because it has an extreme value for one or more data items
- outliers are potentially risky for confidentiality
P% rule
- statistical disclosure control rule that prevents any user from estimating the value of a cell contributor to within P% (where P is defined by the data custodian)
- aggregate data treatment method
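One common formulation of this rule, sketched below, assumes the second-largest contributor tries to estimate the largest by subtracting its own value from the published total. The cell fails if the remaining contributions are smaller than P% of the largest contribution. The function name and the default P of 10 are assumptions; this guide does not specify the formula.

```python
def fails_p_percent_rule(contributions, p=10.0):
    """Return True if the largest contributor could be estimated to within p%.

    Assumes non-negative contributions. The value of p is illustrative and
    would be defined by the data custodian.
    """
    ordered = sorted(contributions, reverse=True)
    if not ordered:
        return False
    largest = ordered[0]
    second = ordered[1] if len(ordered) > 1 else 0
    # What remains once the two largest contributors are removed from the total.
    remainder = sum(ordered) - largest - second
    return remainder < (p / 100.0) * largest

risky = fails_p_percent_rule([1000, 100, 5])    # remainder 5 is below 100
safe = fails_p_percent_rule([500, 400, 300])    # remainder 300 is above 50
```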
Personal information
- information that identifies, or could identify, a person
- can include not only names and addresses, but also medical records, bank account details, photos, videos, and even information about what a person likes or where they work
- information can still be personal without having a name attached to it
- example: date of birth and postcode may be enough to identify someone
- see also Sensitive information
In the Privacy Act 1988, personal information is "information or an opinion about an identified individual, or an individual who is reasonably identifiable:
- whether the information or opinion is true or not true; and
- whether the information or opinion is recorded in a material form or not."
Perturbation
- a statistical disclosure control technique used for count or magnitude data (aggregate data) or for microdata
- data modification method that involves changing the data slightly to reduce the risk of disclosure while retaining as much data content and structure as possible
- data rounding is a type of perturbation
Privacy
- not specifically defined in the Privacy Act
- an individual's right to have their personal information kept confidential unless informed consent has been given to release the information, or a legal authority exists - this is in accordance with the requirements of the Privacy Act 1988
Re-identification
- the act of determining the identity of a person or organisation using publicly or privately held information about that individual or organisation
Remote analysis facility
- remote access facilities are used by agencies around the world
- enables approved researchers to submit data queries from their desktops through a secure online interface
- requests are run against microdata that is securely stored within the data custodian's control
Remarkable characteristics
- rare characteristics or attributes in the data that can pose an identification risk, depending on how extraordinary or noticeable they are
- may include unusual jobs, very large families or very high income
- remarkable characteristics (or remarkable combinations of characteristics) can lead to re-identification of individuals, households or organisations
Respondent
- see data provider
Response knowledge
- information that is publicly or privately known about a respondent
- may be used to breach confidentiality
Rounding
- see data rounding
Safe data
- one of the Five Safes, safe data poses the question: has appropriate and sufficient protection been applied to the data?
- at a minimum, direct identifiers such as name and address must be removed or encrypted
- further statistical disclosure control may be needed depending on the context in which data is released
Safe outputs
- one of the Five Safes, safe outputs poses the question: are the statistical results non-disclosive?
- the final check, aiming for negligible risk of disclosure
- all data made available outside of the data custodian's IT environment must be checked for disclosure
- example: statistical experts may check all outputs for inadvertent disclosure before the data leave a secure data centre
Safe people
- one of the Five Safes, safe people poses the question: is the researcher appropriately authorised to access and use the data?
- by placing controls on the way data is accessed, the data custodian requires the researcher to take some responsibility for preventing re-identification
- as the detail in the data increases, so should the level of user authorisation required
Safe projects
- one of the Five Safes, safe projects poses the question: is the data to be used for an appropriate purpose?
- before users can access detailed microdata, they may need to demonstrate to the data custodian that their project has a valid research aim, public benefit and/or statistical purpose
- depends on the context in which the data is accessed
Safe settings
- one of the Five Safes, safe settings poses the question: does the access environment prevent unauthorised use?
- can be considered in terms of both the IT and physical environment
- in some data access contexts, such as open data, safe settings are not applicable
- at the other end of the spectrum, sensitive information is accessed through secure research centres
Secure Research Centre
- see data laboratory
Security
- safe storage of, and access to, data held by organisations or individuals
- covers both IT security and the physical security of buildings
Sensitive information (data)
- sensitive information is considered a subset of personal information
- under the Privacy Act, is of greater importance in terms of confidentiality (in particular where it leads to worse consequences for a re-identified individual)
- the Office of the Australian Information Commissioner lists a number of characteristics about an individual that are defined as sensitive [link]
- community and ethical expectations may not consider this list to be exhaustive (example: financial information is not present)
- all personal information can be potentially sensitive depending on the context and the individual concerned
- businesses may consider much of their information to be sensitive, but the Privacy Act applies only to personal data
Statistical or research purposes
- purposes which support the collection, storage, compilation, analysis and transformation of data for the production of statistical outputs, the dissemination of those outputs and the information describing them
- statistical or research purposes may be distinguished from administrative, regulatory, compliance, law enforcement or other purposes that affect the rights, privileges or benefits of particular individuals or organisations
Suppression
- not releasing information that is considered a disclosure risk
- aggregate data:
- removing specific values from a table so that people and organisations cannot be re-identified from the released data
- initial suppression is known as primary suppression
- additional cells needing suppression are known as consequential or secondary suppression
- microdata:
- removing specific records from the microdata file
- removing specific data items for all records on the microdata file
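The sketch below illustrates why consequential (secondary) suppression can be needed for aggregate data: if only one cell in a row is suppressed and the row total is published, the suppressed value can be recovered by subtraction. The function name and the figures are hypothetical.

```python
def recover_suppressed(row, total):
    """Recover a single suppressed cell (None) from a published row total.

    Returns {cell: value} if exactly one cell is suppressed, otherwise None.
    """
    missing = [cell for cell, value in row.items() if value is None]
    if len(missing) != 1:
        return None  # nothing to recover, or too many unknowns
    known = sum(value for value in row.values() if value is not None)
    return {missing[0]: total - known}

# Primary suppression only: cell "B" is hidden but the total is published.
exposed = recover_suppressed({"A": 40, "B": None, "C": 25}, total=67)

# With a second cell suppressed, simple subtraction no longer works.
protected = recover_suppressed({"A": 40, "B": None, "C": None}, total=67)
```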
Tabular data
- see aggregate data
Threshold rule
- see frequency rule
Uniqueness
- where an individual has a characteristic or combination of characteristics that are different to all other members in a population or sample
- determined by the size of the population or sample, the degree to which it is segmented (for example by geographic information), and the number and detail of characteristics provided for each unit in the dataset
- records that are unique are not necessarily re-identifiable, as this also depends on the remarkability of the characteristics and the availability of other information or knowledge held by the researcher (response knowledge)
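As a rough illustration, unique records in a microdata file can be found by counting how often each combination of characteristics occurs. The function name, field names and records below are invented for this sketch.

```python
from collections import Counter

def unique_records(records, keys):
    """Return records whose combination of `keys` occurs exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if combos[tuple(r[k] for k in keys)] == 1]

# Hypothetical unit records.
people = [
    {"age": 34, "sex": "F", "occupation": "teacher"},
    {"age": 34, "sex": "F", "occupation": "teacher"},
    {"age": 71, "sex": "M", "occupation": "astronaut"},
]
uniques = unique_records(people, keys=["age", "sex", "occupation"])
```

Adding more keys, or more detailed categories, generally increases the number of unique records. Whether a unique record is actually re-identifiable still depends on how remarkable its characteristics are and on response knowledge.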
Unit record data
- see microdata