From Privacy-Preserving Machine Learning by J. Morris Chang, Di Zhuang, and G. Dumindu Samaraweera
This article introduces the concept and definition of differential privacy.
Read it if you’re a machine learning engineer, or a developer building around machine learning.
Take 25% off Privacy-Preserving Machine Learning by entering fcczhuang into the discount code box at checkout at manning.com.
There are many privacy-related threats and vulnerabilities in machine learning, and numerous concepts of privacy-enhancing technologies. In this article, we will introduce a very important privacy protection concept: differential privacy. Differential privacy is one of the most popular privacy protection schemes in use today. It introduces a very interesting concept on how to make a dataset robust enough to changes of any single sample in the dataset, while computing the data statistics (e.g. machine learning models could be considered as certain very complex statistics to describe the distribution of its training data).
What is Differential Privacy?
Data explosion in the Big Data Era results in tons of data that are generated and held by the individual (e.g. a person) and entities (e.g. organizations, companies, and governments), such as the personal images, financial records, medical records, transaction logs and census data. Meanwhile, there is a serious privacy concern when the data leaves the hands of the data owners and participates in certain applications. The AOL search engine log data leak (Shen, Tan, & Zhai, 2007) and Netflix recommendation contest privacy lawsuit (Narayanan & Shmatikov, 2008) show the threats and the need of rigorous privacy-enhancing techniques for machine learning algorithms. Differential privacy (DP) (Cynthia, 2006; Dwork & Roth, 2014), as a promising privacy-enhancing technique, provides a rigorous definition of the privacy and how it can be protected.
The Concept of Differential Privacy
Before introducing the definition of deferential privacy, let us glance through an example.
Figure 1 The problem of personal information leakage.
As shown in Figure 1, Alice and Bob are two news reporters from two different news agencies, that both would like to report the average salary of a private company. This company has a database that contains the personal information (e.g. position, salaries, contacts, etc.) of all its employees. Since the database contains the personal information subject to privacy concerns, any direct access to the database is restricted. To obtain the access, Alice and Bob were required to demonstrate that they planned to follow the company’s protocols for handling personal data, by undergoing confidentiality training and signing data use agreements proscribing their use and disclosure of personal information obtained from the database.
To accommodate the requests from Alice and Bob to report salary statistics of the company on their articles, the company agrees to grant the access to query certain aggregated statistics from the company’s database, including the total number of employees and the average salaries; but not having the permission to access any of the personal information such as Name or Age.
Now let us consider the following scenario. In January 2020, Alice writes an article according to the information obtained from that database, reporting that the private company has 100 employees, and the average salaries is $55,000. Meantime, in February 2020, Bob also writes another article according to the information obtained from that database in the same way, reporting that the private company has 101 employees, and the average salaries is $56,000. The only difference is Alice accessed the database in January whereas Bob accessed it in the February.
Eve is another news reporter working at the third news agency. After reading both articles from Alice and Bob, Eve concludes that between January and February, one new employee has joined that company, and his/her salary is $156,000 (i.e. $56,000 X 101 – $55,000 X 100). Eve then interviews certain employees in that company incognito, and someone told Eve that Mallory joins the company during that time period. Eve then writes another article reporting that Mallory joins the private company and earns $156,000 a year, which is much higher than the average salary of that company.
Now at this point, Mallory’s private information (i.e. salary) has been compromised. Thus, Mallory lodged a complaint with the relevant authorities and planning to sue the company and the reporters. However, on the contrary both Alice and Bob did nothing against the policies, where they just reported the aggregated information which does not contain any personal records.
This story shows a typical privacy leakage scenario, where even though the study, analysis or computation only release the aggregated/statistical information of a database or a dataset, it could lead to certain meaningful but sensitive conclusions about the individuals. So, how can we take care of these kind of issues in a straight but rigorous way? Here comes the differential privacy.
Differential Privacy (DP) attracts lots of attention in the privacy research community in the last decades, which provides a measurement of the information leakage of individual sample from the computation over an underlying dataset. Let us consider the following Figure 2. In DP there is a trusted data curator who gathers the data from multiple data owners to form a dataset. As we discussed in our previous example, the private company is the data curator, and the data owners are the employees of that company. The idea is to perform some computation or an analysis on this collected dataset, such as finding the mean value (e.g. the average salary) so that the data users can access this information without disclosing privacy of the data owners.
Figure 2 Differential Privacy framework.
Informally, DP aims to strengthen the protection by releasing the aggregated or statistical information of a database or dataset without revealing the presence of an individual in that database. As discussed in the story above, we can consider that the databases in January and February are identical to each other, other than the February database carries Mallory’s information. DP ensures that, no matter on which database the analysis is conducted, the probabilities to get the same aggregated or statistical information or reach to a specific conclusion would be the same (as shown in Figure 3).
In other words, if the data owner’s private data won’t significantly affect the calculation of certain aggregated or statistical information or reach to a specific conclusion from the database, he/she should be less worried about sharing his/her data in the database, since the analysis results could not distinguish whether he/she participates or not (i.e., nearly the same probabilities). This also well explains the “differential” word in the name of DP.
By far, we discussed a general idea about DP. Next, we will introduce how does DP works in real-world scenarios.
Figure 3 Use of Differential Privacy in terms of protecting personal data.
How Differential Privacy Works
As shown in Figure 2, to ensure that no one can reliably infer any individual’s information from the computation result, the curator adds random noise (i.e. DP sanitizer) to the result such that the released result would not change if any individual’s information of the underlying data changes. Since no single individual’s information can significantly affect the distribution, adversaries cannot infer such information corresponding to any individual confidently.
Let us continue our previous scenario and see what would happen if the private company had added random noise to the query results (i.e. the total number of employees and the average salaries), before sending them to the news reporters Alice and Bob.
Suppose in our previous scenario, the private company had utilized another advanced private information sharing protocol, as shown in Figure 3, where certain random noises would be added to the aggregated statistics calculated from the company’s database before sharing them to any news reporters.
As such, in January, Alice would write an article according to the information obtained from that database (accessed in January), reporting that the private company has 103 employees (where 100 is the real number and 3 is the added noise), and the average salary is $55,500 (where $55,000 is the real value and $500 is the added noise).
In February, Bob would write another article according to the information obtained from that database in the same way (but accessed in February), reporting that the private company has 99 employees (where 101 is the real number and -2 is the added noise), and the average salary is $55,600 (where $56,000 is the real value and -$400 is the added noise).
At that point, the noisy version of the total number of employees and average salary doesn’t have much effect on the fact of the private company’s information appearing in the public news reports (e.g. the number of employees is around 100, and the average salary is around $55,000-$56,000). However, this could effectively prevent Eve from concluding that between January and February, one new employee joins that private company (since 99-104=-5) and figuring out his/her salary, thus reducing the risk that Mallory’s personal information being revealed by such news reports.
This example enlightens us the idea about how differential privacy works by adding random noise to the aggregated data before publishing it. The next question would be how much noise we should add for each DP application. To answer this question, we would like to introduce the “sensitivity” and “privacy budget” within the DP applications.
THE “SENSITIVITY” OF DP APPLICATIONS
One of the core technical problems in DP is to determine the amount of random noise to be added to the aggregated data before publishing it. The random noise cannot come from an arbitrary random variable.
If the random noise is too small, it cannot provide enough protection to each individual’s private information. For example, if Alice has reported that the company is having 100.1 employees (i.e. +0.1 noise) and the average salary is $55,000.1 (i.e. +$0.1 noise), while in February, if Bob has reported the private company is having 101.2 employees (i.e. +0.2 noise) and the average salary is $55,999.9 (i.e. -$0.1 noise), still Eve could infer that most likely one new employee had joined that company, and his/her salary is around $155,979.9, which is identical to $156,000, the true value.
On the contrary, if the random noise is too large, the published aggregated data would be distorted and meaningless, thus providing no utility. For instance, if Alice reported the private company has 200 employees (i.e. +100 noise) and the average salary is $65,000 (i.e. +$10,000 noise), while Bob reported the private company has 51 employees (i.e. -50 noise) and the average salary is $50,000 (i.e. -$6,000 noise), even though nearly none of the employee’s private information would be revealed, such reports do not carry any meaningful information of the real situation.
So, how could we decide the amount of random noise to be added to the aggregated data before publishing it in a meaningful and scientific way? Roughly speaking, given an application or analysis that need to publish aggregated data on a dataset, the amount of random noise to be added to such aggregated data should be proportional to the largest possible difference that one individual’s private information (e.g. one row within a database table) could make to that aggregated data.
In DP, we call “the largest possible difference that one individual’s private information could make” as the “sensitivity” of the DP application. The “sensitivity” usually measures the maximum possible effect of each individual’s information on the output of the analysis.
For example, in our “private company” scenario, we have two aggregated data, the total number of employees and the average salary, to be published. Since an old employee leaving or a new employee joining the company could at most make “+1” or “-1” difference to the total number of employees, thus, its sensitivity is “1”. For the average salary, since different employees (having different salaries) leaving/joining the company could make different influence to the average salary, and “the largest possible difference” should come from the employee that has the highest possible salary, thus, its sensitivity should be proportional to the “highest salary”.
Usually, it would be difficult to calculate the exact sensitivity for an arbitrary application, and sometimes we need to estimate the approximation of certain sensitivities.
THERE’S NO FREE LUNCH FOR DP – INTRODUCING THE “PRIVACY BUDGET”
As we just discussed, it is important to determine the appropriate amount of random noise to be added while applying DP and such random noise should be proportional to the sensitivity of the applications (e.g. mean estimation, frequency estimation, regression, classification, etc.). But “proportional” is a fuzzy word, that could be a small proportion or a large proportion. So, is there anything else we should consider determining the actual amount of random noise to be added?
Let us first revisit our “private company” scenario that applies DP, where a random noise has been added to the query results (before and after Mallory joining the company) of the database, and the news reporters (i.e. Alice and Bob) would like to estimate the average salary of that company by looking at the “noisy” query results they retrieved. Ideally, in the context of applying DP, such estimation of the average salary should remain the same whether or not an employee, say Mallory, is leaving or joining the company. However, to ensure such property “exactly the same”, it requires the total exclusion of Mallory’s information from this study. If so, we would continue with the same argument and exclude the personal information of every employee in that company’s database. In such case, the estimated average salary cannot rely on any employee’s information, and hence it would be meaningless.
To avoid such dilemma, DP requires that the output of the analysis remain “approximately the same” but not “exactly the same”, with or without Mallory’s information. In other words, DP permits a slight deviation between the output of the analysis with or without any individual’s information. The “permitted deviation” is also considered as the “privacy budget” for DP. If one could tolerate more permitted deviation with or without his/her data, he/she could tolerate more privacy leakage, thus has more privacy budget to spend.
The Greek letter 𝜖 (epsilon) has been utilized to represent such “privacy budget” to quantify the extent of the permitted deviation. As discussed, the privacy budget 𝜖 is usually set by the data owners to tune the level of privacy protection required. A smaller value of 𝜖 results in a smaller permitted deviation (i.e., less privacy budget), and thus is associated with stronger privacy protection but less accuracy for utility. There is no free lunch for DP. For instance, we can set 𝜖 to zero, which permits zero privacy budget and provides perfect privacy protection meaning zero privacy leakage under the definition of DP. The output of the analysis would always be the same no matter whose information has been added or removed. However, as we discussed previously, this also requires ignoring all the information available, thus would not provide any meaningful results. What if we set 𝜖 to 0.1, a small number but larger than zero? The permitted deviation with or without any individual’s information would be small, providing stronger privacy protection while also enabling the data users (e.g., news reporters) to learn something meaningful.
In practice, 𝜖 is usually a small number. For statistical analysis tasks (e.g. mean/frequency estimation), 𝜖 is usually set between approximately 0.001 and 1.0. For machine learning or deep learning tasks, 𝜖 is usually set between somewhere around 0.1 and 10.0.
FORMULATING THE DP SOLUTION FOR THE “PRIVATE COMPANY” SCENARIO
Now that, you probably have a general idea about the concept of DP and how to derive the random noise based on the sensitivity of the application and the data owner determined privacy budget. Let us see how we can apply such techniques in our previous “private company” scenario mathematically.
Figure 4 An example of private company information database.
Go is an ideal language to use to learn about concurrent programming because its creators designed it with high-performance concurrency in mind. The aim was to produce a language that was efficient at runtime, readable, and easy to use. This means that Go has many tools for concurrent programming. Let’s take a look at some of the advantages to using Go for concurrent programs.
Goroutines at a glance
As illustrated in Figure 4, on the left-hand side is the private company’s January database (i.e. before Mallory joining the company) that is queried by Alice, and on the right-hand side is the private company’s February database (i.e. after Mallory joining the company) that is queried by Bob. Now, we would like to derive the differentially private solutions for the private company to share two aggregated data to the public (e.g. Alice and Bob): 1) the total number of employees and 2) the average salary.
Let us start from the easier task, publishing the total number of employees. First, let us, derive the sensitivity. As discussed above, since whether an employee leaving or joining the company, the influence on the total number of employees is 1, thus the sensitivity of this task is 1. Second, we should design the random noise to be added to the total number of employees before publishing it. As we discussed above, the amount of noise should be positively correlated to the sensitivity, and negatively correlated to the privacy budget (since less privacy budget means stronger privacy protection, thus requires more noise). As such, the random noise should be proportional to △f/𝜖, where △f is the sensitivity and 𝜖 is the privacy budget.
In this example, we could draw our random noise from the zero mean Laplace distribution, which is “double-sided” exponential distribution, defined as below.
The Laplace Distribution (centered at μ, usually we use μ = 0) with scale b is defined by the probability density function as:
where the variance of such Laplace Distribution is σ2 = 2b2. We can also consider Laplace distribution as a symmetric version of the exponential distribution.
Figure 5 The Laplace distribution.
Figure 6 The tradeoff between privacy and utility.
With that, we can calculate b = △f/𝜖, and draw random noise from Lap(x|△f/𝜖) to be added to the total number of employees. Figure 6 illustrates the published total number of employees of January and February, respectively, while applying different privacy budgets (i.e. 𝜖). In this case, based on the definition of DP, better privacy protection usually means the two published total number of employees of January and February are closer to each other with a higher probability. In other words, it has lower probability to reveal Mallory’s information, by just seeing those two published values. The utility performance usually means the accuracy of the computation of certain aggregated data or statistics from the database or dataset. In this case, it is the difference between the published perturbed value and its real value, and lower difference means better utility performance. In DP, it is usually a trade-off between privacy protection and utility performance. For instance, as 𝜖 decreasing, more noise would be added, and the difference between the two published values would be closer to each other, thus leading to stronger privacy protection. On the contrary, specifying larger 𝜖 could provide better utility but less privacy protection.
Now, let us consider the case of publishing the differentially private average salary. As we discussed above, while any individual employee leaving or joining the company, “the largest possible difference” should come from the employee that has the highest possible salary (i.e. CEO – Jack, $290,000). For instance, in the extreme situation, there is only one employee in the database, and that happens to be the CEO – Jack, who has the highest possible salary, $290,000, and his leaving would result in the “the largest possible difference” of the company’s average salary. Hence, the sensitivity should be $290,000. Then, we could follow the same procedure as dealing with the total number of employees, by adding random noise drawn from lap(x|△f/𝜖).
Learn more about the book here.
Cynthia, D. (2006). Differential privacy. Automata, languages and programming, 1-12.
Dwork, C., & Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4), 211-407.
Narayanan, A., & Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. Paper presented at the 2008 IEEE Symposium on Security and Privacy (sp 2008).
Shen, X., Tan, B., & Zhai, C. (2007). Privacy protection in personalized search. Paper presented at the ACM SIGIR Forum.