CONFIDENTIALITY
Introduction
Confidentiality is the protection of respondents' privacy by ensuring that the information collected is not revealed to a third party. This includes ensuring that a third party does not see the form filled out by a respondent, and that published or disclosed data cannot be traced back to individual respondents. Assurances of confidentiality can be provided in an advance letter to the respondent, on the form itself or by the interviewer. Informing respondents that the collection agency is taking steps to ensure confidentiality can improve response rates.
Factors to Consider
Agencies involved in the collection of data have a responsibility to ensure the protection of the information that they have been supplied with. Collectors should implement procedures to ensure confidentiality. There are a number of ways in which the privacy of respondents can be maintained:
- Code numbers, instead of names, can be used on the questionnaires to minimise the links that can be made between questionnaires and respondents.
- Suppression of personal identification information such as names, addresses and telephone numbers should be undertaken at the data processing or analysis stage.
- Questionnaires and data sources can also be destroyed as early as possible in the process.
- When creating tables, broad cross-classifications can be used to avoid cells with only a small number of contributing units.
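The first two measures in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the helper names (`assign_code_numbers`, `strip_identifiers`) and the field names (`name`, `address`, `phone`) are assumptions made for the example, and the lookup table linking names to codes would itself need to be stored securely and separately.

```python
import secrets


def assign_code_numbers(names):
    """Map each respondent name to a unique random code (hypothetical helper)."""
    key = {}
    used = set()
    for name in names:
        if name in key:
            continue  # a respondent already has a code
        code = secrets.token_hex(4)
        while code in used:  # extremely unlikely, but keep codes unique
            code = secrets.token_hex(4)
        used.add(code)
        key[name] = code
    return key


def strip_identifiers(record, key, id_fields=("name", "address", "phone")):
    """Return a copy of the record with identifying fields replaced by the code."""
    cleaned = {k: v for k, v in record.items() if k not in id_fields}
    cleaned["code"] = key[record["name"]]
    return cleaned


# Illustrative record only; the field names are assumed, not taken from any real form.
key = assign_code_numbers(["A. Smith", "B. Jones"])
record = {"name": "A. Smith", "address": "1 High St", "phone": "555-0100", "income": 52000}
processed = strip_identifiers(record, key)
```

After processing, the record carries only the code number and the survey variables, so the questionnaire can no longer be linked to the respondent without access to the key.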
Securing of Forms
Forms should be locked up in a safe place and destroyed soon after the project is finished. If unit record tapes (magnetic tapes containing the complete survey records for each individual respondent) are to be released, two conditions must hold: a third party must not be able to recognise an individual or an organisation from that release; and that third party must not learn something that they did not already know about the individual or organisation.
There is an argument for keeping forms for a long period of time and then releasing the complete information kept in them when the respondents are no longer alive. Consideration in this case must be given to descendants of respondents.
Publishing Results
When presenting or publishing data in the form of tables, confidentiality considerations need to be taken into account, especially when there are cells with small counts (for example, cells representing only two observations). Many techniques can be used for confidentialising tabulations, some of which are described below.
TYPES OF DISCLOSURE
The aim of confidentialising data prior to publication is to prevent the disclosure of information where that information would identify the respondents who provided the data. There are several types of disclosure that can occur.
Direct Disclosure
Direct disclosure occurs as a direct result of a given cell value in a statistical table. Cells leading to direct disclosure are called sensitive cells. Two rules are generally used to determine which cells are sensitive: the threshold rule for count data and the (n,k) dominance rule for quantities.
Inadvertent Disclosure
Although a cell value in a statistical table may be suppressed, the value can be worked out through the addition or subtraction of other cell values and the totals. Consideration of this kind of disclosure leads to the problem of consequential (or complementary) confidentiality.
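The arithmetic behind this kind of disclosure is simple enough to sketch. The example below is illustrative only: the function name `recover_suppressed` and the figures are assumptions, and a suppressed cell is represented by `None`.

```python
def recover_suppressed(row, row_total):
    """Fill in a single suppressed cell (None) by subtracting known cells from the total.

    Returns None when the row has no suppressed cell or more than one,
    because the row total alone then does not determine the missing values.
    """
    missing = [i for i, v in enumerate(row) if v is None]
    if len(missing) != 1:
        return None
    known = sum(v for v in row if v is not None)
    filled = list(row)
    filled[missing[0]] = row_total - known
    return filled


# One suppressed cell and a published total of 44: the cell must be 44 - 12 - 30 = 2.
print(recover_suppressed([12, None, 30], 44))    # [12, 2, 30]

# Suppressing a second cell in the same row blocks this particular calculation.
print(recover_suppressed([None, None, 30], 44))  # None
```

The second call shows why additional (consequential) suppressions are needed: a single suppressed cell in a row or column with a published total offers no protection. Column totals would need the same check.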
Residual Disclosure
Residual disclosure results from a comparison of a set of tables. In this situation, the differences between the tables may provide details on a particular information source. Consideration of this type of disclosure leads to complex confidentiality problems.
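As a hedged illustration of how a comparison of tables can disclose information, suppose two tables are published separately, one covering all firms and one covering only small firms. The figures and the function name `difference_tables` below are invented for the example.

```python
def difference_tables(larger, smaller):
    """Subtract one published table from another, cell by cell."""
    return {cell: larger[cell] - smaller.get(cell, 0) for cell in larger}


# Hypothetical counts of firms by industry from two separate releases.
all_firms = {"mining": 120, "retail": 340}
small_firms = {"mining": 119, "retail": 310}

residual = difference_tables(all_firms, small_firms)
print(residual)  # {'mining': 1, 'retail': 30}
```

Neither table contains a small cell, yet the difference isolates a single large mining firm. Any quantity published on the same two bases could then be attributed to that firm.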
Disclosure from External Information
This type of disclosure occurs from a comparison with information obtained from a source which is independent of that from which the tables are obtained (that is, an external source). There is no systematic method available to protect against such disclosure.
TYPES OF CONFIDENTIALITY ASSURANCE TECHNIQUES
There are several techniques that have been developed to minimise the risk of disclosure of information that can be traced back to the responding units. These techniques fall into three main categories, as listed below.
Data Suppression
This technique simply involves not releasing information that may identify individuals. It is probably the oldest method, but it still offers interesting avenues of investigation in terms of automation. To date, some simple automated suppression algorithms have been developed for two-dimensional tables. The far more complex problems associated with tabulations of higher dimensions are still being tackled. Producing an efficient and practical automated system requires substantial resources and much fine tuning.
Threshold Rules and Cell Concentration Rules
A threshold rule specifies the minimum number of units that must contribute to the value of a cell. Where the number of units contributing to a cell is less than a pre-specified threshold value, the cell is suppressed in order to prevent disclosure.
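A threshold rule is straightforward to express in code. The sketch below assumes a threshold of 3 purely for illustration (the text does not fix a value), represents a suppressed cell as `None`, and uses a hypothetical function name.

```python
def apply_threshold_rule(counts, threshold=3):
    """Suppress (set to None) every cell with fewer than `threshold` contributing units.

    `threshold=3` is an assumed value for illustration; an agency would set its own.
    """
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


# Hypothetical number of contributing units per cell.
cell_counts = {"region A": 25, "region B": 2, "region C": 7}
print(apply_threshold_rule(cell_counts))
# {'region A': 25, 'region B': None, 'region C': 7}
```

Read literally, the rule also suppresses empty cells; whether zero counts should be treated differently is a policy choice that this sketch does not address.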
The cell concentration rule (also called a cell dominance rule) prevents the publication of cells where a small number of respondents contribute a large percentage to the cell total. For example, it may be decided that if two respondents contribute 85 per cent to a cell total, the cell will not be published.
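The example in the paragraph above (two respondents, 85 per cent) can be written as a short check. This is a sketch under stated assumptions: the function name is hypothetical, contributions are assumed to be non-negative, and a share exactly equal to the limit is treated as sensitive, which the text does not specify.

```python
def is_dominated(contributions, n=2, k=85.0):
    """Return True if the n largest contributions make up at least k per cent of the total.

    Defaults follow the example in the text (n=2, k=85). Treating a share
    exactly equal to k as sensitive is an assumption of this sketch.
    """
    total = sum(contributions)
    if total <= 0:
        return False  # empty or zero-valued cell: nothing to dominate
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total >= k


# Two respondents hold 90 per cent of the cell total, so the cell is withheld.
print(is_dominated([50, 40, 5, 5]))    # True

# Contributions are spread more evenly (top two hold 60 per cent).
print(is_dominated([30, 30, 20, 20]))  # False
```

The rule targets quantity cells, where a large contributor's value could be estimated closely from the published total even when many units contribute.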
Data Rounding
Random rounding replaces small values that are to appear in a table with other small random numbers. Because random rounding distorts the data, the rounded table is generally not additive (additivity means that table totals, whether between or within tables, are equal to the sum of the relevant cell values or subtotals). The technique can be unbiased if done in an appropriate manner (a technique is biased if the expected value of the data after the confidentiality technique has been applied does not equal the value of the original entry it is replacing).
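One common way of making random rounding unbiased is to round each count up or down to a multiple of a fixed base, choosing the direction with probabilities proportional to the distance from each multiple. The sketch below takes that approach; the base of 3, the function name and the restriction to non-negative integer counts are all assumptions made for illustration, not a description of any particular agency's method.

```python
import random


def random_round(value, base=3, rng=random):
    """Randomly round a non-negative integer count to a multiple of `base`.

    The count is rounded up with probability remainder/base and down otherwise,
    so the expected value of the result equals the original count (unbiased).
    """
    remainder = value % base
    if remainder == 0:
        return value  # already a multiple of the base
    lower = value - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower


# A count of 4 becomes 3 with probability 2/3 and 6 with probability 1/3,
# giving an expected value of 3 * (2/3) + 6 * (1/3) = 4.
rng = random.Random(0)  # seeded only to make the illustration repeatable
samples = [random_round(4, base=3, rng=rng) for _ in range(10000)]
average = sum(samples) / len(samples)
```

Because each cell and each total is rounded separately, the rounded cells need not sum to the rounded total, which is the loss of additivity described above.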
Controlled rounding is a combination of conventional rounding and random rounding. Controlled rounding may result in additivity, unbiasedness, and reduction in data distortion (when compared to other rounding methods). However, this method may not give consistency among tables.
Conclusion
Since any release of statistical information results in something being learnt about the units in that release, a requirement that a third party learn nothing new about any individual or organisation could, if interpreted literally, mean that no statistics should be published. A reasonable confidentiality policy therefore involves establishing an acceptable level of risk of identifying individuals and balancing this against the amount of legitimate statistical information contained in a release.