Confidentiality in ABS business data using Pufferfish differential privacy

This paper investigates a potential use of Pufferfish differential privacy to maintain data confidentiality in ABS business data.

Released
24/10/2022

Introduction

As part of the ABS’s broader strategic direction, the Data Access and Confidentiality Methodology Unit (DACMU) aims to design and implement a confidentiality method that achieves a better risk-utility trade-off subject to data and output requirements. This ensures the ABS can enhance data utility in statistical collections while providing state-of-the-art privacy protection to data providers. Furthermore, it is important that the new confidentiality method can be applied generally to a broad range of household and business statistics publications across the ABS for consistency purposes. Under the Census and Statistics (Information Release and Access) Determination 2018, the ABS provides passive confidentiality, which protects data providers (within particular industries) who can demonstrate that the release of an ABS statistical output would likely enable their identification or accurate estimation of one or more of their attributes. These data providers are called passive claimants. This effectively puts responsibility on data providers to notify the ABS for privacy protection, subject to checking by the ABS. The work outlined in this paper investigates an instantiation of Pufferfish differential privacy (DP) through log-Laplace multiplicative perturbation to protect sugarcane production estimates sourced from an administrative dataset. If successful, this could be applied to data that the ABS collects from households and businesses, as well as to statistics produced from administrative data sources. This would greatly improve the utility of the information.

The current confidentiality method for protecting passive claimants is consequential suppression: an aggregate statistic (e.g. a total) is not published if a passive claimant’s value that contributes to the statistic is sensitive. Suppressing that statistic is referred to as primary suppression. Additional suppressions are then required to prevent the calculation of the primary suppressed value from related statistics; these are referred to as consequential suppression. Suppression is an output confidentiality method which focuses on protecting aggregate outputs, i.e. suppression is applied to the aggregate outputs but not to the true values of unit-level data.

Consequential suppression is no longer viable for the following reasons:
I. It limits the ABS’s capability to meet increasing user demand for more detailed statistics, due to the complexity of applying consequential suppression across multiple small geographic areas. There is demand for this level of detail in the ABS Agriculture Statistics collections.
II. It restricts the ABS's ability to produce timely, flexible and user-specified outputs as geospatial differencing exposes units resulting in further suppression.
III. Research has shown that suppression is not as effective as perturbation with privacy protection.
IV. It does not help meet user demand for more detailed unit-level analysis using alternative data sources.
V. It is extremely resource intensive to apply; an automated privacy approach is needed to reduce the resources required to produce ABS statistics.
VI. As the ABS moves towards increasing use of administrative datasets, direct engagement with data providers through surveys will be reduced.

Log-Laplace multiplicative perturbation improves data utility by enabling more detailed statistics to be safely published compared to suppression. This is because log-Laplace multiplicative perturbation is an input confidentiality method that perturbs a passive claimant’s unit-level value before it is used to produce aggregate statistics. This implies that the output aggregate statistics are naturally protected. Unlike suppression, additional processes are not required to protect the final statistical outputs. As a result, log-Laplace multiplicative perturbation is easier to implement than suppression, even as datasets and statistical outputs become more complex. An example is publishing integrated datasets from other sources such as the Business Longitudinal Analysis Data Environment (BLADE). In addition, this approach protects against geospatial differencing risks, where a passive claimant’s value may be recovered by differencing aggregate outputs from overlapping geographic regions.

Another advantage of log-Laplace multiplicative perturbation is that it satisfies an instantiation of Pufferfish DP. Pufferfish DP is a framework for generating privacy definitions which are variations of differential privacy. These privacy definitions protect “secrets” by limiting the amount of information that users of a statistical output can learn about these “secrets” in the data. Pufferfish DP provides flexibility in customising the “secrets” in a statistical collection that the ABS wants to protect. This property aligns with the ABS’s passive confidentiality policy: the sensitive variables or values of a passive claimant are the “secrets” the ABS wants to protect.

Our instantiation of Pufferfish offers a privacy protection guarantee by connecting the p% rule with the Pufferfish DP framework. The p% rule is used in the ABS and other national statistical offices to determine if a passive claimant’s value requires privacy protection. The p% rule is defined as follows: if a passive claimant’s value can be estimated to within p% of its reported value, then it requires protection. In our instantiation, the “secrets” are statements that take a form of “passive claimant A’s reported value is within p% of the value y”. Log-Laplace multiplicative perturbation protects these “secrets” by ensuring users of our statistical outputs cannot confidently determine a passive claimant’s sensitive value to within p% of its reported value, unless a user was already quite confident before observing the statistical outputs.

We have chosen the SRA sugarcane dataset as a test case and examined the data utility loss and disclosure risk from log-Laplace multiplicative perturbation. The structure of the paper is as follows: We will first briefly discuss other confidentiality methods we have considered as part of the investigation in section 2. We then provide the mathematical proof of our Pufferfish DP instantiation which connects the p% rule in section 3. Finally, we will present the case study results in section 4.

Background

The ABS investigated two other input confidentiality methods as alternatives to suppression: the data imputation method and the removal of passive claimant units. However, we have found that log-Laplace multiplicative perturbation is more effective than these methods. The following provides a brief description of each method and its suitability for the ABS Agriculture Statistics collections.

The data imputation method replaces a passive claimant’s sensitive value with one that is imputed using values of similar records. A key step is building an appropriate model from the observed data. However, it is challenging to define a set of criteria for “similar” records because it is difficult to balance privacy and utility. The donor records will need to be similar to the passive claimants but not so similar that they could reveal sensitive information about the passive claimants. A second complication is one of practicality. Due to the sparsity of Agriculture data, particularly Agriculture Census, some passive claimants might not have donor records that are similar enough to produce a satisfactory imputed value. This means that a different set of criteria is needed for obtaining a sufficiently large pool of possible donors. An example of the data imputation method is data smearing where a passive claimant’s sensitive value is replaced with an average value calculated from the subpopulation of records which are similar to the passive claimant’s record.

Removal of passive claimant units protects passive claimants’ information by removing their records before publication estimates are calculated. The primary argument against this method is that it will result in negative bias in both cell estimates and cell counts. This method was implemented and tested with the Agriculture Census 2015-2016 publication, which highlighted some more pertinent arguments against it. For example, all chicken production in TAS and WA is estimated as 0, nursery production in ACT is 0 and mushroom production in WA is 0. This could affect public trust in the ABS because it is public knowledge that there are chicken farms in TAS and WA but the ABS would be releasing data suggesting otherwise. The key commodities at the national level are largely unaffected. If this method were implemented for the Agriculture Census 2020-2021 publication, it would change the economic story because all estimates would be artificially deflated compared to the previous publication. In contrast, methods such as log-Laplace multiplicative perturbation are unbiased, do not suffer from this problem, and are straightforward to implement. In the next section, we will detail our instantiation of Pufferfish DP that incorporates the p% rule, but we will first describe the general form of the DP framework.

2.1. Definition of differential privacy

The following is the mathematical definition of \(\epsilon\)-differential privacy (Dwork and Roth, 2014).

Definition (\(\epsilon\)-differential privacy): Given a privacy parameter \(\epsilon\), a (potentially randomised) algorithm \(\mathcal{M}\) satisfies \(\epsilon\)-differential privacy if for all \(W\subseteq Range(\mathcal{M})\) and for all neighbouring databases \(D\) and \(D'\) that differ in only one record (i.e. one has one more record than the other), the following holds:

\(P(\mathcal{M}(D)\in W)\le e^\epsilon P(\mathcal{M}(D')\in W)\tag{2.1.1}\)

The original concept of \(\epsilon\)-differential privacy aims to ensure that the presence or absence of an individual record in a microdata set does not significantly affect statistical outputs produced from the microdata. Since statistical outputs produced from an \(\epsilon\)-differentially private method are insensitive to the presence or absence of individual records in the microdata set, differential privacy limits how much information data users can learn from the statistical outputs about any individual record. The privacy parameter \(\epsilon\) controls the upper bound on the amount of information users can gain. Variations of \(\epsilon\)-differential privacy exist with similar aims.

Desfontaines and Pejó (2020) describe the key contribution of DP as defining anonymity as a property of the process of generating confidential outputs from a dataset, rather than as a property of the dataset itself. Bambauer et al. (2013) use several examples to demonstrate how the use of the strict DP definition in (2.1.1) can have a significant impact on data utility and lead to significant errors in data analysis. There has been a considerable amount of research on variants or extensions of DP that adapt the definition to different contexts and assumptions in order to enhance data utility while maintaining confidentiality. Desfontaines and Pejó (2020) highlight that approximately 200 different definitions, inspired by DP, have been proposed in the last 15 years. Table 2.2 summarises seven key dimensions for describing variants or extensions of DP. According to these dimensions, Pufferfish DP is an extension of DP that allows different definitions of neighbourhood (2) and background knowledge (4): the “neighbourhoods” are the pairs of secrets in a discriminative pair, and the background knowledge is the set of data evolution scenarios. Section 3.1 elaborates on this when describing Pufferfish privacy for q% intervals.

Table 2.2: The seven dimensions and their usual motivation

Dimensions | Descriptions
(1) quantification of privacy loss | how is the privacy loss quantified across outputs?
(2) definition of neighborhood | which properties are protected from the attacker?
(3) variation of privacy loss | can the privacy loss vary across inputs?
(4) background information | how much prior knowledge does the attacker have?
(5) formalism of knowledge gain | how to define the attacker's knowledge gain?
(6) relative knowledge gain | how to measure relative knowledge gain?
(7) computing power | how much computational power can the attacker use?

Source: Adapted from Desfontaines and Pejó (2020, p. 290).

Our proposed instantiation of Pufferfish differential privacy for q% intervals

Key notation:
Notation | Definition
\(q\) | A parameter that controls the size of the interval that is protected by Pufferfish privacy for q% intervals (see Definition 3.1.4). To make notation cleaner in definitions, theorems and proofs, let \(q\in(0, 1)\). E.g. Pufferfish privacy for 15% intervals means \(q=0.15\).
\(p\) | A parameter that controls the definition of disclosure based on the p% rule. To make notation cleaner in definitions, theorems and proofs, let \(p\in(0, 1)\). E.g. 15% rule means \(p=0.15\).
\(\mathbb{S}\) | The set of potential secrets that is protected by Pufferfish privacy for q% intervals (see Definition 3.1.4).
\(\mathbb{S}_{pairs}\) | The set of discriminative pairs of secrets in Pufferfish privacy for q% intervals (see Definition 3.1.4).
\(s_{i}\) | The statement that record \(i\)'s true value is in some pre-specified q% interval.
\(\sigma_{[I]}\) | The statement that record \(i\)'s true value is in interval \(I\).
\(\mathbb{D}\) | The set of data evolution scenarios for Pufferfish privacy for q% intervals (see Definition 3.1.4). Data evolution scenarios represent assumptions about how the data was generated.
\(\theta\) | A prior probability distribution in \(\mathbb{D}\).
\(\epsilon\) | The privacy parameter in Pufferfish privacy for q% intervals (see Definition 3.1.4).
\(\mathcal{M}\) | A perturbation mechanism for protecting privacy in data.
\(\omega\) | An output of a perturbation mechanism \(\mathcal{M}\).
\(W\) | The set of all possible outputs from a perturbation mechanism \(\mathcal{M}\).
\(X\) | Laplace distributed random variable.
\(e^{X}\) | Log-Laplace distributed random variable, where \(X\) is a Laplace distributed random variable.
\(b\) | The dispersion parameter for the Laplace distributed random variable \(X\).
\(c\) | Bias correction factor for the log-Laplace multiplicative perturbation mechanism.

Kifer & Machanavajjhala (2014) offered a Pufferfish DP instantiation that protects intervals formed by a multiplicative factor, and proposed a log-Laplace multiplicative perturbation mechanism that satisfies that instantiation. We prove that the log-Laplace multiplicative perturbation mechanism also protects q% intervals. In section 3.1, we describe our Pufferfish DP instantiation for q% intervals. In section 3.2, we provide the mathematical proof that log-Laplace multiplicative perturbation satisfies our instantiation of Pufferfish DP for q% intervals, i.e. it guarantees privacy protection based on the p% rule. Given that the q% interval is larger than the p% interval specified by the p% rule, protecting the q% interval ensures that a data user cannot confidently determine whether a passive claimant’s true value is within p% of its reported value.

3.1. Pufferfish privacy for q% intervals

There are three essential components that form the Pufferfish DP framework,

  • A set of potential secrets \(\mathbb{S}\)
  • A set of discriminative pairs of secrets, \(\mathbb{S}_{pairs}\subseteq\mathbb{S}\times\mathbb{S}\)
  • A collection of data evolution scenarios \(\mathbb{D}\)

The set of potential secrets is what a data custodian wants to protect in statistical outputs. In our instantiation, it takes the form of a statement such as “record \(i\)’s true value is in this q% interval”. This forms the domain for the set of discriminative pairs of secrets. The ABS wants to ensure data users cannot distinguish, in a probabilistic sense, which of statements \(s_i\) or \(s_j\) in a discriminative pair \(\left(s_i, s_j\right)\) is true. For example, consider the discriminative pair (record \(i\)’s true value lies in the q% interval around \(y\), record \(i\)’s true value lies in the q% interval around \(\frac{1+q}{1-q}y\)). Note that these are adjacent q% intervals, which means they are non-overlapping but end-to-end. This choice of discriminative pair means that upon observing the statistical outputs, a data user cannot significantly improve their prior knowledge about which of two adjacent q% intervals is more likely to contain record \(i\)’s true value. The data evolution scenarios in \(\mathbb{D}\) describe a data user’s prior knowledge about the data generation process for the underlying data from which statistical outputs are produced.
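As a numerical illustration (our own example): with \(q=0.15\) and \(y=100\), the 15% interval around \(y\) is \(\left[85,\ 115\right)\), and the adjacent 15% interval is the one around \(\frac{1+q}{1-q}y\approx135.3\), namely \(\left[115,\ 155.6\right)\). The two intervals meet at \(115\) but do not overlap, and the guarantee is that observing the statistical outputs does not let a data user become much more confident about which of the two contains record \(i\)’s true value.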

Kifer & Machanavajjhala (2014) introduced an instantiation of Pufferfish DP to offer privacy protection for intervals of the form \(\left[\alpha y, \frac{y}{\alpha}\right)\) where \(y>0\) and \(\alpha\in\left(0, 1\right)\). For brevity, we call this the “interval around \(y\) formed by factor \(\alpha\)” hereafter. Protection is achieved by multiplying a record’s value by perturbation noise drawn from a log-Laplace distribution. Using the Pufferfish DP framework definition, we define our instantiation of Pufferfish DP for the p% rule as follows,

Choose a fixed \(q\in\left(0, 1\right)\). Define the set of secrets as,

\(\mathbb{S}=\left\{\sigma_{\left[\left(1-q\right)y,\ \left(1+q\right)y\right)}:y>0\ \right\}\cup\left\{\sigma_{\left(\left(1+q\right)y,\ \left(1-q\right)y\right]}:y<0\right\}\tag{3.1.1}\)

where \(\sigma_{\left[\left(1-q\right)y,\ \left(1+q\right)y\right)}\) is the statement that a record’s value is in the interval \(\left[\left(1-q\right)y,\ \left(1+q\right)y\right)\) and \(\sigma_{\left(\left(1+q\right)y,\ \left(1-q\right)y\right]}\) is the statement that a record’s value is in the interval \(\left(\left(1+q\right)y,\ \left(1-q\right)y\right]\).

Define the set of discriminative pairs as,

\(\mathbb{S}_{pairs}=\left\{\left(\sigma_{\left[\left(1-q\right)y,\ \left(1+q\right)y\right)},\ \sigma_{\left[\left(1+q\right)y,\frac{\left(1+q\right)^2}{1-q}y\right)}\right):y>0\right\}\\ \cup \left\{\left(\sigma_{\left(\frac{\left(1+q\right)^2}{1-q}y,\ \left(1+q\right)y\right]},\ \sigma_{\left(\left(1+q\right)y,\ \left(1-q\right)y\right]}\right):y<0\right\}\tag{3.1.2}\)

Define the set of data evolution scenarios \(\mathbb{D}\) as the set of probability distributions where

\(\theta\in\mathbb{D}\ \text{if and only if } P\left(y>0\middle|\theta\right)+P\left(y<0\middle|\theta\right)=1\tag{3.1.3}\)

That is, \(\mathbb{D}\) is the set of probability distributions whose support is contained in \(\mathbb{R}\setminus\left\{0\right\}\). \(\theta\) is a specific probability distribution that corresponds to a data user’s prior knowledge about the data generation process.

Definition 3.1.4 (Pufferfish privacy for q% intervals): Given the set of potential secrets \(\mathbb{S}\) in (3.1.1), the set of discriminative pairs \(\mathbb{S}_{pairs}\) in (3.1.2), the set of data evolution scenarios \(\mathbb{D}\) in (3.1.3), and privacy parameter \(\epsilon>0\), a (potentially randomised) algorithm \(\mathcal{M}\) satisfies \(\left(\mathbb{S},\ \mathbb{S}_{pairs},\ \mathbb{D},\ \epsilon\right)\)-Pufferfish if

(i) for all possible outputs \(\omega\in range\left(\mathcal{M}\right)\),
(ii) for all pairs \(\left(s_i, s_j\right)\in\mathbb{S}_{pairs}\) of potential secrets,
(iii) for all distributions \(\theta\in\mathbb{D}\) for which \(P\left(s_i\middle|\theta\right)\neq0\) and \(P\left(s_j\middle|\theta\right)\neq0\),

the following holds:

\(P\left(\mathcal{M}\left(Data\right)=\omega\middle| s_i,\ \theta\right)\le e^\epsilon P\left(\mathcal{M}\left(Data\right)=\omega\middle| s_j,\ \theta\right)\tag{3.1.5}\)

\(P\left(\mathcal{M}\left(Data\right)=\omega\middle| s_j,\ \theta\right)\le e^\epsilon P\left(\mathcal{M}\left(Data\right)=\omega\middle| s_i,\ \theta\right)\tag{3.1.6}\)

Remark: (i) in definition 3.1.4 implies that the output of \(\mathcal{M}\) is discrete, because if it were continuous, (3.1.5) and (3.1.6) would be trivially satisfied since \(P\left(\mathcal{M}\left(Data\right)=\omega\middle|\ldots\right)=0\) for all \(\omega\in range\left(\mathcal{M}\right)\). It would be more general to write (i) as ‘for all possible sets \(W\subseteq range\left(\mathcal{M}\right)\)’ and rewrite (3.1.5) and (3.1.6) with \(P\left(\mathcal{M}\left(Data\right)\in W\middle|\ldots\right)\) instead of \(P\left(\mathcal{M}\left(Data\right)=\omega\middle|\ldots\right)\). However, we leave definition 3.1.4 as it is because Kifer & Machanavajjhala used this convention and the change would have an insignificant impact on this section. For example, the proof of theorem 3.2.2 in section 3.2 would simply gain an outer integral \(\int_{\omega\in W}{\ldots d\omega}\), but nothing else in the proof would be adversely affected.

3.2. Proof: log-Laplace multiplicative perturbation satisfies Pufferfish differential privacy for q% intervals

We first introduce a lemma that describes the relative positions of adjacent q% intervals and adjacent intervals formed by factor \(1-q\) (\(\alpha=1-q\) in our case). Written out explicitly, the adjacent intervals formed by factor \(1-q\) are

\(\left[\left(1-q\right)y,\ \frac{y}{1-q}\right)\ \text{and}\ \left[\frac{y}{1-q},\ \frac{y}{\left(1-q\right)^3}\right)\)

and the adjacent q% intervals are

\(\left[\left(1-q\right)y,\ \left(1+q\right)y\right)\ \text{and}\ \left[\left(1+q\right)y,\ \frac{\left(1+q\right)^2}{1-q}y\right)\)

All four intervals lie end-to-end on the real number line, and the adjacent intervals formed by factor \(1-q\) (taken together) contain the adjacent q% intervals, a fact which we prove in Lemma 3.2.1.

Lemma 3.2.1: The following inequalities, which describe the relative positions of adjacent q% intervals and adjacent intervals formed by factor \(1-q\), hold for all \(q\in\left(0, 1\right)\).

\(1+q\le\frac{1}{1-q}\\ \frac{1}{1-q}\le\frac{\left(1+q\right)^2}{1-q}\\ \frac{\left(1+q\right)^2}{1-q}\le\frac{1}{\left(1-q\right)^3}\)

Proof is provided in Appendix C.

We now prove that the log-Laplace multiplicative perturbation mechanism satisfies Pufferfish privacy for q% intervals. We follow the method of proof used in Appendix H of “Pufferfish: A Framework for Mathematical Privacy Definitions” (Kifer & Machanavajjhala, 2014).

Theorem 3.2.2: Given the set of potential secrets \(\mathbb{S}\) in (3.1.1), the set of discriminative pairs \(\mathbb{S}_{pairs}\) in (3.1.2), the set of data evolution scenarios \(\mathbb{D}\) in (3.1.3), and privacy parameter \(\epsilon>0\), the log-Laplace multiplicative perturbation mechanism

\(\mathcal{M}\left(Data=y\right)=ce^Xy\tag{3.2.3}\)

satisfies \(\left(\mathbb{S},\ \mathbb{S}_{pairs},\ \mathbb{D},\ \epsilon\right)\)-Pufferfish, where \(X\) is distributed as \(Laplace\left(0,b\right)\) with \(b=-\frac{4}{\epsilon}\ln{\left(1-q\right)}\) and \(c=1-b^2\) (bias correction factor, equal to \(\frac{1}{E\left(e^X\right)}\), which can be obtained from Appendix A.2 Proposition 4). Note that \(e^X\) has a log-Laplace distribution if \(X\) has a Laplace distribution.

Proof

Let \(f\left(t\right)\) be the probability density function (pdf) of \(t\). Let \(f\left(t\in A\right)\) be the probability that \(t\) is in the set \(A\), and \(f\left(t\middle| A\right)\) be the conditional pdf of \(t\) given \(t\in A\). Assume \(support\left(f\right)\subset\left[0,\infty\right)\) (we deal with the case \(support\left(f\right)\subset\left(-\infty,0\right]\) later). Let \(\theta=f\) and assume \(f\left(t\in\left[\left(1-q\right)y,\left(1+q\right)y\right)\right)\neq0\) and \(f\left(t\in\left[\left(1+q\right)y,\frac{\left(1+q\right)^2}{1-q}y\right)\right)\neq0\). Note \(t\) is the true value of a record from the intruder’s perspective. \(t\) is random because we assume the intruder does not know what \(t\) is.

Remarks are provided throughout the derivation of the proof to improve readability.

Regarding the q% interval:

\(P\left(\mathcal{M}\left(t\right)=\omega\middle| t\in\left[\left(1-q\right)y,\ \left(1+q\right)y\right),\ \theta\right)\\ =P\left(\ln{\mathcal{M}(t)}=\ln{\omega}\middle|\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right),\ \theta\right)\\ =P\left(\ln{c}+\ln{t}+X=\ln{\omega}\middle|\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right),\ \theta\right)\\ =P\left(\ln{t}+X=\ln{\left(\frac{\omega}{c}\right)}\middle|\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right),\ \theta\right)\)

Remark: In \(\ln{t}+X\), where \(X\) is distributed as \(Laplace\left(0,b\right)\) with \(b=-\frac{4}{\epsilon}\ln{\left(1-q\right)}\), both \(\ln{t}\) and \(X\) are random. The pdf of a sum of two random variables is given by convolution. The terms before \(f\) in the integrand below come from the Laplace pdf for \(X\) evaluated at \(\ln{\left(\frac{\omega}{c}\right)}-\ln{t}\).

\(P\left(\ln{t}+X=\ln{\left(\frac{\omega}{c}\right)}\middle|\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right),\ \theta\right)\)

\(=\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{t}\right|}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right),\theta\right)}dt\\ =\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}+\ln{y}-\ln{t}\right|}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right),\theta\right)}dt\)

Remark: Apply \(\left|a+b\right|\le\left|a\right|+\left|b\right|\) (triangle inequality) to the (negative) exponent of the Laplace density.

\(\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}+\ln{y}-\ln{t}\right|}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right),\theta\right)}dt\)

\(\geq\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}+\frac{\epsilon\left|\ln{y}-\ln{t}\right|}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right),\theta\right)}dt\)

Remark: Now \(\left|\ln{y}-\ln{t}\right|\le\ln{\left(\frac{1}{1-q}\right)}\) because \(\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right)\subseteq\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(\frac{1}{1-q}\right)}\right)\) from Lemma 3.2.1.

\(\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}+\frac{\epsilon\left|\ln{y}-\ln{t}\right|}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right),\theta\right)}dt\)

\(\geq\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}+\frac{\epsilon\ln{\left(\frac{1}{1-q}\right)}}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right),\theta\right)}dt\)

Remark: Only \(f\) in the integrand depends on \(t\). Since \(f\) is a density, its integral equals \(1\).

\(\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}+\frac{\epsilon\ln{\left(\frac{1}{1-q}\right)}}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+\ln{\left(1+q\right)}\right),\theta\right)}dt\)

\(=-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}+\frac{\epsilon\ln{\left(\frac{1}{1-q}\right)}}{4\ln{\left(1-q\right)}}\right)}\)

\(=-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}\right)}\exp{\left(-\frac{\epsilon}{4}\right)} \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(3.2.4)\)

Regarding the adjacent q% interval:

\(P\left(\mathcal{M}\left(t\right)=\omega\middle| t\in\left[\left(1+q\right)y,\frac{\left(1+q\right)^2}{1-q}y\right),\ \theta\right)\\ =P\left(\ln{\mathcal{M}\left(t\right)}=\ln{\omega}\middle|\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ \ln{y}+\ln{\frac{\left(1+q\right)^2}{1-q}}\right),\ \theta\right)\\ =P\left(\ln{c}+\ln{t}+X=\ln{\omega}\middle|\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ln{y}+\ln{\frac{\left(1+q\right)^2}{1-q}}\right),\ \theta\right)\\ =P\left(\ln{t}+X=\ln{\left(\frac{\omega}{c}\right)}\middle|\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ \ln{y}+\ln{\frac{\left(1+q\right)^2}{1-q}}\right),\ \theta\right)\)

Remark: In \(\ln{t}+X\), where \(X\) is distributed as \(Laplace\left(0,b\right)\) with \(b=-\frac{4}{\epsilon}\ln{\left(1-q\right)}\), both \(\ln{t}\) and \(X\) are random. The pdf of a sum of two random variables is given by convolution. The terms before \(f\) in the integrand below come from the Laplace pdf for \(X\) evaluated at \(\ln{\left(\frac{\omega}{c}\right)}-\ln{t}\).

\(P\left(\ln{t}+X=\ln{\left(\frac{\omega}{c}\right)}\middle|\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ \ln{y}+\ln{\frac{\left(1+q\right)^2}{1-q}}\right),\ \theta\right)\)

\(=\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{t}\right|}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ln{y}+\ln{\frac{\left(1+q\right)^2}{1-q}}\right),\theta\right)}dt\\ =\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}+\ln{y}-\ln{t}\right|}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ln{y}+\ln{\frac{\left(1+q\right)^2}{1-q}}\right),\theta\right)}dt\)

Remark: Apply \(\left|a+b\right|\geq\left|a\right|-\left|b\right|\) to the (negative) exponent of the Laplace density.

\(\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}+\ln{y}-\ln{t}\right|}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ln{y}+\ln{\frac{\left(1+q\right)^2}{1-q}}\right),\theta\right)}dt\)

\(\le\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}-\frac{\epsilon\left|\ln{y}-\ln{t}\right|}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ln{y}+\ln{\frac{\left(1+q\right)^2}{1-q}}\right),\theta\right)}dt\)

Remark: Now \(\left|\ln{y}-\ln{t}\right|\le3\ln{\left(\frac{1}{1-q}\right)}\) because \(\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ln{y}+\ln{\left(\frac{\left(1+q\right)^2}{1-q}\right)}\right) \subseteq\left[\ln{y}+\ln{\left(1-q\right)},\ln{y}+\ln{\left(\frac{1}{\left(1-q\right)^3}\right)}\right) \\=\left[\ln{y}-\ln{\left(\frac{1}{1-q}\right)},\ln{y}+3\ln{\left(\frac{1}{1-q}\right)}\right)\) 

from Lemma 3.2.1.

\(\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}-\frac{\epsilon\left|\ln{y}-\ln{t}\right|}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ln{y}+\ln{\frac{\left(1+q\right)^2}{1-q}}\right),\theta\right)}dt\)

\(\le\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}-\frac{3\epsilon\ln{\left(\frac{1}{1-q}\right)}}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ln{y}+\ln{\frac{\left(1+q\right)^2}{1-q}}\right),\theta\right)}dt\)

Remark: Only \(f\) in the integrand depends on \(t\). Since \(f\) is a density, its integral equals \(1\).

\(\int_{0}^{\infty}{-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}-\frac{3\epsilon\ln{\left(\frac{1}{1-q}\right)}}{4\ln{\left(1-q\right)}}\right)}f\left(\ln{t}\middle|\ln{t}\in\left[\ln{y}+\ln{\left(1+q\right)},\ln{y}+\ln{\frac{\left(1+q\right)^2}{1-q}}\right),\theta\right)}dt\)

\(=-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}-\frac{3\epsilon\ln{\left(\frac{1}{1-q}\right)}}{4\ln{\left(1-q\right)}}\right)}\)

\(=-\frac{\epsilon}{8\ln{\left(1-q\right)}}\exp{\left(\frac{\epsilon\left|\ln{\left(\frac{\omega}{c}\right)}-\ln{y}\right|}{4\ln{\left(1-q\right)}}\right)}\exp{\left(\frac{3\epsilon}{4}\right)} \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(3.2.5)\)

Compare (3.2.4) and (3.2.5).

\(RHS\ of\ \left(3.2.5\right)=e^\epsilon\times RHS\ of\ \left(3.2.4\right)\\ LHS\ of\ \left(3.2.5\right)\le e^\epsilon\times LHS\ of\ \left(3.2.4\right)\\ P\left(\mathcal{M}\left(t\right)=\omega\middle| t\in\left[\left(1+q\right)y,\frac{\left(1+q\right)^2}{1-q}y\right),\ \theta\right)\le e^\epsilon P\left(\mathcal{M}\left(t\right)=\omega\middle| t\in\left[\left(1-q\right)y,\ \left(1+q\right)y\right),\ \theta\right)\)

A similar derivation results in

\(P\left(\mathcal{M}\left(t\right)=\omega\middle| t\in\left[\left(1-q\right)y,\ \left(1+q\right)y\right),\ \theta\right)\le e^\epsilon P\left(\mathcal{M}\left(t\right)=\omega\middle| t\in\left[\left(1+q\right)y,\frac{\left(1+q\right)^2}{1-q}y\right),\ \theta\right)\)

The case where \(support\left(f\right)\subset\left(-\infty,0\right]\) is proven by virtue of symmetry: the pdf for \(ce^Xy\), where \(y<0\), is the reflection of the pdf for \(-ce^Xy\) about \(y=0\), and every pair of adjacent q% intervals in \(\mathbb{R}^{<0}\) is the reflection (about \(y=0\)) of a pair of adjacent q% intervals in \(\mathbb{R}^{>0}\). In the case where \(y<0\), we can view the mechanism as applying three transformations sequentially: given some \(y<0\), multiply \(y\) by \(-1\) (reflection), multiply the result by \(e^X\) where \(X\) is distributed as \(Laplace\left(0,b\right)\) (perturbation), then multiply the result by \(-1\) (reflection). The first and last transformations are deterministic. Thus, the log-Laplace multiplicative perturbation mechanism applied to negative scalars satisfies Definition 3.1.4 just as it does for positive scalars.
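To make the mechanism in (3.2.3) concrete before moving to the case study, the following is a minimal sketch in Python of applying the log-Laplace multiplicative perturbation to a single value. The function name, the use of NumPy and the random seed are our own illustrative choices and are not part of the formal definition.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not part of the mechanism

def log_laplace_perturb(y, epsilon, q):
    """Sketch of M(Data = y) = c * exp(X) * y from (3.2.3)."""
    b = (-4.0 / epsilon) * np.log(1.0 - q)   # dispersion of X ~ Laplace(0, b)
    c = 1.0 - b ** 2                         # bias correction factor, 1 / E[exp(X)] (requires b < 1)
    x = rng.laplace(loc=0.0, scale=b)        # draw the Laplace noise X
    return c * np.exp(x) * y                 # multiplicative perturbation

# e.g. perturbing a reported value of 1,000 tonnes with epsilon = 1.5 and q = 0.10
perturbed_value = log_laplace_perturb(1000.0, epsilon=1.5, q=0.10)
```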

Case study: agriculture statistics with sugarcane data

The mathematical proof in the previous section shows that the log-Laplace multiplicative perturbation mechanism protects q% intervals. In this case study, we use the SRA sugarcane administrative data to explore the utility and risk trade-off under different privacy parameter settings. We need to consider two privacy parameters in the log-Laplace multiplicative perturbation mechanism, ϵ and q. Both parameters control the dispersion of the log-Laplace perturbation distribution. The privacy parameter ϵ controls the bounds on the amount of information that a potential intruder can gain about a secret from the perturbed outputs. The secrets are statements that passive claimants’ true values lie within particular q% intervals.

4.1. Case study design

As part of the ABS privacy policy, we are required to protect a passive claimant’s value when it violates the p% rule. For the case study, we set an arbitrary threshold of 15% as the p% rule (i.e. p=0.15 means that p% rule=15%). We then consider a disclosure risk scenario where there are only three contributors in a small area from the sugarcane dataset. We assume the largest contributor is interested in estimating the second largest contributor’s true value. We set the second largest contributor to be our passive claimant.  The third contributor is small relative to the size of the first and second largest contributors. Our study tests the effectiveness of the multiplicative perturbation mechanism under this scenario with a high risk of p% rule violation. Our case study only considers perturbing a single passive claimant.

4.2. Data structure

At the time of writing this paper, the Agriculture Statistics section within the ABS Physical and Environmental Accounts Statistics Branch has developed a roadmap to optimise the use of administrative data in the production of ABS Agriculture Statistics collections. We want to show that log-Laplace multiplicative perturbation mechanism will protect individual business data when producing statistics sourced from an administrative dataset.

We currently have access to the sugarcane data from the Levy Payer Register administrative source. For the purpose of our case study, we treat the sugarcane dataset as a census. While the sugarcane data does not contain “identified” passive claimants, users are interested in sugarcane production at fine geographical levels. This provides a similar disclosure risk scenario, where the Agriculture Statistics section needs to protect the passive claimants’ data in small areas.

There are 3,865 observations in the sugarcane dataset with 4 main variables: Australian Business Number (ABN), sugarcane production (tonnes), Statistical Area Level 1 (SA1) and Statistical Area Level 2 (SA2). For the purposes of this case study, records with missing values in ABN, SA1 or SA2 were excluded from the analysis. Sugarcane production is the variable that we want to perturb for a passive claimant.

4.3. Utility loss and disclosure risk measures

We now present our empirical and analytical estimates of utility loss and disclosure risk. The empirical estimates are derived by simulation given a particular data scenario; for example, the passive claimant is the second largest contributor in a dataset of 3 units and the largest contributor is interested in estimating the passive claimant’s true value. The purpose of deriving the analytical estimates is twofold. Firstly, we can compare the empirical estimates with the analytical estimates, as they should align; this helps us to verify our empirical estimates. Secondly, we can produce utility loss and disclosure risk estimates under any data scenario (regardless of which contributor the passive claimant is or which contributor is interested in estimating the passive claimant’s true value) without running separate simulations. This can be done by plotting the analytical formulas and examining the theoretical level of utility loss and disclosure risk. However, note that the analytical formulas are only applicable to perturbing one passive claimant, as our case study only considers perturbing a single passive claimant. The derivation becomes more complex as we perturb more passive claimants, so simulation is the less complex option for examining the level of utility loss and disclosure risk in that setting. We will examine the effects of perturbing multiple passive claimants in our future work.

Utility loss measure assessment

There is inevitably a degree of data utility loss when perturbing the true values of observations. We derive the empirical and analytical relative standard error (RSE) from the log-Laplace multiplicative perturbation for a single passive claimant to measure the level of data utility loss.

Algorithm for empirical estimation of utility loss (RSE)                                                                                                (4.3.1)

Note: q = 0.1 -> q% = 10%.

Require: input unit file
Require: A set of privacy parameters: ϵ and q
Require: Number of simulation runs M
       \(\begin{align*} \mu &= 0 &&\text{Mean of Laplace distribution} \\ b &= \left(\frac{-4}{\epsilon}\right)\ln{\left(1-q\right)} &&\text{Dispersion of Laplace distribution} \\ c &= 1 - b^2 &&\text{Bias correction factor} \end{align*}\)

for m = 1,…,M  do

       \(\begin{align*} z_m &=ce^{X_m} &&X_m \sim Laplace\left(\mu,b\right),\ \\ &&&z_m \ \text{is the multiplicative perturbation factor for each simulation run m } \end{align*} \)

       if passive claimant j then do

               \(\begin{align*} {\widetilde{y}}_{j,m}=z_my_j &&y_j \ \text{is the true value of the passive claimant j} \end{align*} \) 

       end

       /* Calculate the total of the units including the perturbed value for each simulation run m */

       \(\begin{align*} {\hat{Y}}_m=\sum_{h=1,h\neq j}^{n}y_h+{\widetilde{y}}_{j,m} &&\text{n is the total number of observations} \end{align*} \)

end

/* Calculate root mean squared error */

\(\begin{align*} RMSE=\sqrt{\frac{\sum_{m=1}^{M}{{(\hat{Y}}_m-Y)}^2}{M}} &&\text{Y is the true total} \end{align*} \)

/* Calculate RSE */

\(RSE=\frac{RMSE}{Y}\)

return RSE

return A data frame\(\ \hat{D}\)  with the perturbed value of the passive claimant \({\widetilde{y}}_{j,m}\) from each simulation m and the true values of all other units.
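A minimal Python sketch of algorithm (4.3.1) is given below, assuming the unit file is held as a NumPy array of true values; the function and variable names are our own and are illustrative only.

```python
import numpy as np

def empirical_rse(y, j, epsilon, q, runs=1000, seed=0):
    """Sketch of algorithm (4.3.1): empirical RSE of a total when unit j is perturbed.

    y : array of true unit values, j : index of the passive claimant.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    true_total = y.sum()                               # Y, the true total
    b = (-4.0 / epsilon) * np.log(1.0 - q)             # dispersion of the Laplace noise
    c = 1.0 - b ** 2                                   # bias correction factor
    z = c * np.exp(rng.laplace(0.0, b, size=runs))     # perturbation factors z_m
    perturbed_totals = true_total - y[j] + z * y[j]    # \hat{Y}_m for each simulation run
    rmse = np.sqrt(np.mean((perturbed_totals - true_total) ** 2))
    return rmse / true_total                           # RSE
```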

The analytical solution of RSE is as follows,

\(RSE_{analytical}=\frac{\sqrt{Var\left(\hat{Y}\ \right)}}{Y} \)

where,

\(Var\left(\hat{Y}\right)=\sum_{h=1}^{n}{\left(1-a_hb_h^2\right)^{2a_h}y_h^2\left[\frac{1}{1-4a_hb_h^2}-\frac{1}{\left(1-a_hb_h^2\right)^2}\right]} \)

and 

\( a_h= \begin{cases} 1, & \text{if unit h is perturbed} \\ 0, & \text{otherwise} \end{cases} \)

and

\(b_h=\left(\frac{-4}{\epsilon}\right)\ln{\left(1-q\right)} \)

The above formula is derived by assuming a total Y is estimated using the Horvitz-Thompson estimator for a probability sampling method without replacement. We modify the formula to incorporate the effect of log-Laplace multiplicative perturbation on some units’ values \(y_h\). The Horvitz-Thompson estimator was chosen since it is the simplest estimator for calculating weighted totals and is used in ABS surveys. We assume that the sampling method and perturbation are independent in order to derive an expression for the variance of our estimate subject to sampling error and perturbation noise. Note that under the assumptions of the design-based framework for survey sample designs, the true values \(y_h\) are constant and only the sample and perturbation factors for these values are randomly determined. The expression above is for the specific case of a completely enumerated population, meaning the only source of variance is the perturbation. This is applicable to the sugarcane dataset as we are assuming it is a census. Full details of this derivation are presented in Appendix A.
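For completeness, the following is a sketch of how the analytical RSE above could be evaluated in Python; the function and array names are our own, and it assumes \(b_h<\frac{1}{2}\) for perturbed units so that the variance is finite.

```python
import numpy as np

def analytical_rse(y, perturbed, epsilon, q):
    """Analytical RSE of a total under log-Laplace multiplicative perturbation.

    y : array of true unit values, perturbed : boolean array (a_h = 1 where perturbed).
    Assumes b < 1/2 for perturbed units so the variance is finite.
    """
    y = np.asarray(y, dtype=float)
    a = np.asarray(perturbed, dtype=float)
    b = (-4.0 / epsilon) * np.log(1.0 - q)
    # per-unit contribution to Var(Y hat); the term is zero whenever a_h = 0
    var_terms = (1.0 - a * b**2) ** (2.0 * a) * y**2 * (
        1.0 / (1.0 - 4.0 * a * b**2) - 1.0 / (1.0 - a * b**2) ** 2
    )
    return np.sqrt(var_terms.sum()) / y.sum()
```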

Disclosure risk measure assessment

We derive the empirical and analytical probability of p% rule violation to examine the level of disclosure risk from log-Laplace multiplicative perturbation for a single passive claimant. We assume a contributor k to the sugarcane production in a particular area is interested in estimating passive claimant j’s contribution. This can be done by subtracting unit k’s value from the total and examining whether the remainder is within p% of unit j’s true value. If it is, then the p% rule is violated, i.e. there is a disclosure. Note that this is the definition of disclosure used in our case study.

Algorithm for empirical estimation of disclosure risk probability                                                   (4.3.2)

Note: This algorithm uses the output data frame \(\hat{D}\). We use p for the input p% rule parameter instead of q in this algorithm because q in algorithm (4.3.1) is the parameter that describes the q% intervals we want to protect in Pufferfish DP. p and q can take different values. It is important to keep in mind that p is the threshold that defines disclosure (p% rule violation) and q is the intervals that Pufferfish DP protects (proved in section 3). 

Require: \(\hat{D}\)  

Require: input p% rule parameter: p [p=0.15 -> p% rule =15%]

for m = 1,......,M do

       \({\hat{Y}}_m=\sum_{h=1,h\neq j}^{n}y_h+{\widetilde{y}}_{j,m}\)

       /* Subtract the true value of contributor k, \(y_k \) from \(\hat{Y}_m\)*/

       \({\ddot{Y}}_m={\hat{Y}}_m-y_k\)

       if \({\ddot{Y}}_m\in\left[y_j\left(1-p\right),y_j\left(1+p\right)\right]\) then do

           \(P_m=1\)

       end else do

           \(P_m=0\)

       end

end

\(\begin{align*} \gamma=\frac{\sum_{m=1}^{M}P_m}{M} &&\text{Probability of p% rule violation} \end{align*}\)

return \(\gamma\)
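A minimal Python sketch of algorithm (4.3.2) follows; rather than reading the stored data frame \(\hat{D}\), it regenerates the perturbation factors as in algorithm (4.3.1). The function and variable names are our own and are illustrative only.

```python
import numpy as np

def empirical_disclosure_risk(y, j, k, epsilon, q, p, runs=1000, seed=0):
    """Sketch of algorithm (4.3.2): empirical probability of a p% rule violation.

    y : array of true unit values, j : passive claimant index,
    k : index of the contributor attempting the estimate.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    b = (-4.0 / epsilon) * np.log(1.0 - q)
    c = 1.0 - b ** 2
    z = c * np.exp(rng.laplace(0.0, b, size=runs))      # perturbation factors z_m
    perturbed_totals = y.sum() - y[j] + z * y[j]        # \hat{Y}_m
    remainders = perturbed_totals - y[k]                # \ddot{Y}_m after removing y_k
    violations = (remainders >= (1 - p) * y[j]) & (remainders <= (1 + p) * y[j])
    return violations.mean()                            # gamma
```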

The analytical solution of disclosure risk (the probability of a p% rule violation) is given by,

\(\begin{equation*} P\left(\left(1-p\right)y_j<\ ce^{X_j}y_j+\sum_{i=1,\ i\neq j,k}^{n}y_i<\left(1+p\right)y_j\ |\ p,b,\mu,y_1,\ldots y_n\right) \\= \begin{cases} \frac{1}{2}\left[\left(\frac{1-p-R}{c}\right)^{-\frac{1}{b}}-\left(\frac{1+p-R}{c}\right)^{-\frac{1}{b}}\right], &if \ R\le1-p-c\ \\ 1-\frac{1}{2}\left[\left(\frac{1+p-R}{c}\right)^{-\frac{1}{b}}+\left(\frac{1-p-R}{c}\right)^\frac{1}{b}\right], &if \ 1-p-c<R\le\min{\left(1-p,1+p-c\right)} \\ \frac{1}{2}\left[\left(\frac{1+p-R}{c}\right)^\frac{1}{b}-\left(\frac{1-p-R}{c}\right)^\frac{1}{b}\right], &if \ {1+p-c<R<1-p} \\ 1-\frac{1}{2}\left(\frac{1+p-R}{c}\right)^{-\frac{1}{b}}, &if \ {1-p\le R\le1+p-c} \\ \frac{1}{2}\left(\frac{1+p-R}{c}\right)^\frac{1}{b}, &if \ \max{\left(1+p-c,1-p\right)}\le R<1+p\\ 0, &if \ R\geq1+p \end{cases} \end{equation*}\)

Note that the third and fourth cases cannot both occur for a particular value of \(p\in(0,1)\).

where,

\(R=\frac{\sum_{h=1,\ h\neq j,k}^{n}y_h}{y_j} \)

and 

\(b=\left(\frac{-4}{\epsilon}\right)\ln{\left(1-q\right)}\)

and

\(c=1-b^2\)

The above formula is derived by assuming a data user wishes to estimate passive claimant j’s contribution \(y_j\) to some total within p% (specified by the p% rule), assuming they know contribution \(y_k\) with certainty. The formula holds for any total containing a single passive claimant perturbed with log-Laplace multiplicative perturbation of the form \(ce^{X_j}\) where \(X_j\sim Laplace(0,b)\). The result is piecewise due to the piecewise definition of the Laplace distribution and the conditions on R required for the p% interval to be valid. This formula is presented in Appendix B.1.2 Corollary (equation B.1.3), and the derivation is provided in Appendix B.1. We also derive the upper bound of disclosure risk for more than one passive claimant in Appendix B.2.
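An equivalent way to evaluate the same probability, which is less error-prone in code than transcribing the piecewise cases, is through the cumulative distribution function (CDF) of the log-Laplace factor \(e^{X_j}\): the violation event is \(\frac{1-p-R}{c}<e^{X_j}<\frac{1+p-R}{c}\). The following Python sketch (our own function names) uses this route.

```python
import numpy as np

def log_laplace_cdf(z, b):
    """CDF of exp(X) where X ~ Laplace(0, b)."""
    if z <= 0.0:
        return 0.0
    return 0.5 * z ** (1.0 / b) if z < 1.0 else 1.0 - 0.5 * z ** (-1.0 / b)

def analytical_disclosure_risk(R, epsilon, q, p):
    """P(p% rule violation) for a single perturbed claimant, with R as defined above."""
    b = (-4.0 / epsilon) * np.log(1.0 - q)
    c = 1.0 - b ** 2
    # probability that the remainder lands inside the p% interval around y_j
    return log_laplace_cdf((1 + p - R) / c, b) - log_laplace_cdf((1 - p - R) / c, b)
```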

4.4. Sugarcane data case study results

We have chosen a list of ϵ and q values and perturbed the passive claimant’s value 1,000 times (M=1000). We obtained the empirical estimates of RSE and disclosure risk probability by running algorithms (4.3.1) and (4.3.2) respectively, and verified that 1,000 replicates were sufficient to derive adequate empirical estimates. We have also derived the analytical solutions for RSE and disclosure risk. We chose a fixed p in algorithm (4.3.2) and varied q in algorithm (4.3.1), because q changes the definition of Pufferfish privacy for q% intervals through the dispersion of the Laplace distribution (b); we therefore want a fixed threshold for disclosure that is separate from the parameter defining the level of perturbation. p is set to 0.15 (p% rule = 15%) and defines when a disclosure has occurred. It is important to keep in mind that we define disclosure risk as the probability of a p% rule violation. Even when a violation occurs, it does not necessarily mean there is a meaningful disclosure, because Pufferfish DP guarantees that data users cannot significantly improve their confidence in determining which q% interval a passive claimant’s true value lies within.

Figures 4.4.1 and 4.4.2 depict utility loss versus disclosure risk for a given set of ϵ and q values. The results are as expected: a higher RSE (utility loss) corresponds to a lower disclosure risk probability. This is driven by the dispersion of the Laplace distribution, \(b=\left(\frac{-4}{\epsilon}\right)\ln{\left(1-q\right)}\), which is determined by ϵ and q.

With q fixed, a higher ϵ results in a lower b, which means a lower RSE and a higher disclosure risk probability, and vice versa. With ϵ fixed, a higher q leads to a higher b, which means a higher RSE and a lower disclosure risk probability, and vice versa. This is consistent with what we observe in Figures 4.4.1 and 4.4.2. The graphs show some discrepancy between empirical and analytical results for small ϵ and large q. This is because b becomes large for small ϵ and large q, and if \(b>\frac{1}{2}\), the variance of \(e^X\), where \(X\sim Laplace(0,b)\), is unbounded (more details are presented in Appendix A.1 Proposition 2). Hence, only the empirical estimates are shown in the graphs for these parameter combinations.
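For example (our own arithmetic from the formula for b): with ϵ=1.5 and q=0.10, \(b=-\frac{4}{1.5}\ln{\left(0.90\right)}\approx0.28<\frac{1}{2}\), so the analytical variance is finite; with ϵ=1.1 and q=0.14, \(b=-\frac{4}{1.1}\ln{\left(0.86\right)}\approx0.55>\frac{1}{2}\), so the analytical variance is unbounded and only the empirical estimate can be plotted.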

Figure 4.4.1: Utility loss (RSE) vs Disclosure risk probability, p=0.15 (p% rule=15%) (ϵ panel)

Utility loss is measured by relative standard error (RSE) on the y axis and disclosure risk is measured by disclosure risk probability on the x axis. There are two types of estimates for RSE and disclosure risk probability, analytical and empirical. The graph is divided into 9 panels, each indicating a specific value of ϵ (ranging from 1.1 to 1.9). Within each panel, the graph shows the estimates of RSE and disclosure risk probability at the given ϵ, with q ranging from 0.06 to 0.14. There is an inverse relationship between RSE and disclosure risk probability: a higher RSE (utility loss) leads to a lower disclosure risk probability and vice versa.

Figure 4.4.2: Utility loss (RSE) vs Disclosure risk probability, p=0.15 (p% rule=15%) (q panel)

Utility loss is measured by relative standard error (RSE) on the y axis and disclosure risk is measured by disclosure risk probability on the x axis. There are two types of estimates for RSE and disclosure risk probability, analytical and empirical. The graph is divided into 9 panels, each indicating a specific value of q (ranging from 0.06 to 0.14). Within each panel, the graph shows the estimates of RSE and disclosure risk probability at the given q, with ϵ ranging from 1.1 to 1.9. There is an inverse relationship between RSE and disclosure risk probability: a higher RSE (utility loss) leads to a lower disclosure risk probability and vice versa.

An interesting finding is that in certain scenarios from the sugarcane dataset we tested (results not shown), perturbing a passive claimant that contributes to a particular cell potentially increases the disclosure risk of another cell that a passive claimant also contributes to. A general example is given as follows,

Passive claimant j contributes to cell A (state=QLD, Goods=sugarcane) and cell B (state=QLD, Business type=Sole Proprietor). Suppose unit j violates the p% rule in cell A but not cell B. We perturb unit j’s true value because it violates the p% rule in cell A.

   Pre-perturbation:

        Disclosure risk of unit j in cell A = 100%

        Disclosure risk of unit j in cell B = 0%
   Post-perturbation:

        Disclosure risk of unit j in cell A = 40%

        Disclosure risk of unit j in cell B = 35%

This finding means that a decrease in disclosure risk in cell A via perturbation does not always come for free, as it could increase the disclosure risk of cell B. The important aspect to keep in mind is that Pufferfish DP via the log-Laplace mechanism offers a different type of protection from an absolute guarantee (zero disclosure risk, i.e. a 0% chance of a p% rule violation). Instead, Pufferfish DP guarantees that a data user cannot significantly improve their confidence in determining whether the true value lies within a q% interval or an adjacent q% interval.

Conclusion

We have demonstrated that Kifer & Machanavajjhala’s (2014) Pufferfish DP instantiation offers privacy protection for our q% intervals via log-Laplace multiplicative perturbation. This means that it also offers the privacy guarantees specified by the p% rule. As expected, our case study results from perturbing a single passive claimant show that there is an inverse relationship between utility loss and disclosure risk. Pufferfish DP offers a form of privacy protection that ensures data users cannot become significantly more confident in determining whether a passive claimant’s true value lies within a q% interval or an adjacent q% interval. Our case study results help us to understand the effects of different privacy parameter values on utility loss and disclosure risk. For our future work, we will use these results and the analytical formulas we have derived for utility loss and disclosure risk to determine an appropriate set of privacy parameters \(\epsilon\) and \(q\) for a broader suite of ABS Agriculture Statistics collections. In addition, we will assess the utility loss and disclosure risk trade-off from perturbing two or more passive claimants, because our case study only focused on perturbing a single passive claimant. We will also consider a utility and disclosure risk assessment with unit-level data, as the case study results are based on aggregated outputs. An important task ahead is to investigate a relaxed form of Pufferfish DP in which the multiplicative perturbation factor is bounded. This would avoid having to post-process outputs when the mechanism produces an extreme perturbed value, which undermines user trust in the statistics. In other words, a relaxed form of Pufferfish DP can bound the level of utility loss.

Post release changes

18/11/2022 - Relabelled mathematical definitions and equations with continuous numbering.

Bibliography

Dwork, C. and Roth, A., 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4), pp.211-407.

Kifer, D. and Machanavajjhala, A., 2014. Pufferfish: A Framework for Mathematical Privacy Definitions. ACM Transactions on Database Systems, 39(1), Article 3.

Desfontaines, D. and Pejó, B., 2020. SoK: Differential Privacies. Proceedings on Privacy Enhancing Technologies, 2020(2), pp.288-313.

Bambauer, J., Muralidhar, K. and Sarathy, R., 2013. Fool's Gold: An Illustrated Critique of Differential Privacy. Vanderbilt Journal of Entertainment & Technology Law, 16, p.701.

Census and Statistics (Information Release and Access) Determination 2018

Appendices - Mathematical proofs and derivations

Appendix A - Analytical formula for the variance of Horvitz-Thompson estimator

Appendix B - Analytical derivation of disclosure risk

Appendix C - Relative positions of q% intervals proof
