Differential Privacy
Privacy and confidentiality are of great concern in our era of Big Data. In this post, I want to discuss one formal approach to privacy, called differential privacy. The idea was invented by Dwork, McSherry, Nissim and Smith (2006). A nice review is Dwork (2006), listed in the references below.
1. What Is It?
There is a long history of people proposing methods for releasing information while preserving privacy. But whenever someone released information, some smart cryptographer would find a way to crack the privacy.
A famous example is the Netflix Prize. Quoting from Wikipedia:
“Netflix has offered $1,000,000 prize for a 10% improvement in its recommendation system. Netflix has also released a training dataset for the competing developers to train their systems. While releasing this dataset they had provided a disclaimer: To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. Netflix is not the only available movie rating portal on the web; there are many including IMDB. In IMDB also individuals can register and rate movies, moreover they have the option of not keeping their details anonymous. Narayanan and Shmatikov had cleverly linked the Netflix anonymized training database with the Imdb database (using the date of rating by a user) to partly de-anonymize the Netflix training database.”
The problem is that you have to protect against all other information that a malicious agent might have access to.
Suppose we have a database $D = (X_1, \ldots, X_n)$ consisting of $n$ data points, where $X_i \in \mathcal{X}$. We want to release some sort of statistical summary of the database, but we do not want to compromise any individual in the database. Consider an algorithm $\mathcal{A}$ that takes a database $D$ as input and then outputs some random quantity $Y$. We say that two databases $D$ and $D'$ are neighboring databases if they differ in one element. In that case we write $D \sim D'$.

The algorithm $\mathcal{A}$ satisfies $\epsilon$-differential privacy if, for all sets $B$ and all pairs $D \sim D'$,

$$P(Y \in B \mid D) \le e^{\epsilon}\, P(Y \in B \mid D').$$
The intuition is this. Suppose you are considering whether or not to allow yourself to be included in a database. If differential privacy holds, then the output barely depends on whether or not you agree to be in the database. More precisely, the distribution of the output has low dependence on any one individual.
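Since the neighboring relation is symmetric, the definition is equivalent to a bound on the log-likelihood ratio of the output (whenever the probabilities involved are nonzero):

$$\left| \log \frac{P(Y \in B \mid D)}{P(Y \in B \mid D')} \right| \le \epsilon,$$

so for small $\epsilon$ the output distribution changes by at most a factor of roughly $1 + \epsilon$ when any single individual's data changes.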
2. Example
Suppose, to be concrete, that $\mathcal{X} = [0,1]$. Let $f(D)$ be a real-valued function of the database, such as the sample mean. Define the sensitivity

$$\Delta = \sup_{D \sim D'} |f(D) - f(D')|,$$

where the supremum is over all neighboring pairs of databases. In the case of the mean, $\Delta = 1/n$.
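To see why $\Delta = 1/n$ for the mean: if $D$ and $D'$ differ only in their $i$th entries $X_i$ and $X_i'$, and all data points lie in $[0,1]$, then

$$|f(D) - f(D')| = \frac{|X_i - X_i'|}{n} \le \frac{1}{n}.$$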
Now we release

$$Y = f(D) + Z,$$

where $Z$ has a Laplace distribution with density $g(z) = \frac{\epsilon}{2\Delta}\, e^{-\epsilon |z|/\Delta}$. It can then be shown easily that $\epsilon$-differential privacy holds. (There are other ways to achieve differential privacy but this is the simplest.)
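Here is a minimal Python sketch of this mechanism for the mean on $[0,1]$ (the function name and the parameter values in the usage example are illustrative, not from the post):

```python
import numpy as np

def private_mean(x, epsilon, rng=None):
    """Release the mean of data in [0, 1] with epsilon-differential privacy
    by adding Laplace noise scaled to the sensitivity Delta = 1/n."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    sensitivity = 1.0 / n            # Delta for the mean of n points in [0, 1]
    scale = sensitivity / epsilon    # Laplace scale b = Delta / epsilon
    return np.mean(x) + rng.laplace(loc=0.0, scale=scale)

# Illustrative usage
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
print(np.mean(x), private_mean(x, epsilon=0.5, rng=rng))
```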
It is also possible to release a differentially private database. To do this, we first construct a differentially private density estimator $\hat{p}$. Then we draw a sample $X_1^*, \ldots, X_N^* \sim \hat{p}$. The resulting database $X_1^*, \ldots, X_N^*$ is also differentially private, because further processing the output of a differentially private algorithm cannot break differential privacy. See the literature on differentially private density estimation for examples.
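As a rough sketch of one simple construction (assuming data in $[0,1]$; this is illustrative and not necessarily the method used in the papers alluded to above): perturb histogram counts with Laplace noise, renormalize, and sample from the result. The noisy counts are $\epsilon$-differentially private, and the remaining steps are post-processing, so the synthetic sample is too.

```python
import numpy as np

def private_synthetic_sample(x, epsilon, n_bins=20, n_release=None, rng=None):
    """Sketch: sample a synthetic dataset from a perturbed-histogram density
    estimator on [0, 1]. Changing one data point changes at most two bin
    counts by 1 each, so the L1 sensitivity of the count vector is 2."""
    rng = np.random.default_rng() if rng is None else rng
    n_release = len(x) if n_release is None else n_release

    counts, edges = np.histogram(x, bins=n_bins, range=(0.0, 1.0))
    noisy = counts + rng.laplace(scale=2.0 / epsilon, size=n_bins)
    noisy = np.clip(noisy, 0.0, None)                    # drop negative mass
    if noisy.sum() == 0:                                 # degenerate case: fall back to uniform
        noisy = np.ones(n_bins)
    probs = noisy / noisy.sum()

    bins = rng.choice(n_bins, size=n_release, p=probs)   # pick bins
    return rng.uniform(edges[bins], edges[bins + 1])     # uniform within each chosen bin

# Illustrative usage
rng = np.random.default_rng(1)
x = rng.beta(2, 5, size=1000)
z = private_synthetic_sample(x, epsilon=1.0, rng=rng)
```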
3. What’s the Catch?
The field of differential privacy has grown very quickly. Do a Google search and you will find a goldmine of interesting papers.
But there is one problem. If the data are high-dimensional and you want to release a differentially private database, you will inevitably end up adding lots of noise. (You can see this by considering the simple but common situation of a multinomial distribution with many categories. The real data will have many empty categories but differential privacy forces you to fill in these zeroes with non-trivial probability.)
In short, differential privacy does not work well with high-dimensional data.
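A small numerical sketch of this effect (the sample size, number of categories, and $\epsilon$ are made up for illustration): most multinomial cells are empty, and the Laplace noise needed for differential privacy ends up dominating the released distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, epsilon = 1000, 100_000, 1.0       # n observations, d categories, privacy level

# True counts: the data live in only 50 of the d categories, so most cells are zero.
counts = np.zeros(d)
counts[:50] = rng.multinomial(n, np.full(50, 1 / 50))

# Laplace mechanism on the count vector (L1 sensitivity 2 for neighboring databases),
# followed by clipping and renormalization to get a probability vector.
noisy = np.clip(counts + rng.laplace(scale=2 / epsilon, size=d), 0, None)

p_true = counts / counts.sum()
p_noisy = noisy / noisy.sum()
print("L1 distance:", np.abs(p_true - p_noisy).sum())   # close to its maximum value of 2
```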
I attended a workshop on privacy at IPAM. The workshop was interesting because it brought together two very different groups of people. One group was computer scientists whose background was mostly in cryptography. The other group was applied statisticians. There was an interesting intellectual tension. The cryptographers wanted the strongest possible privacy guarantees. The statisticians wanted practical methods that worked well with large, messy datasets. I had sympathy for both positions.
I’m not an expert in this area but my current feeling is that differential privacy is too strong for many realistic settings. You end up outputting mostly noise. But I don’t yet know a good way to weaken differential privacy without giving up too much.
This is a fascinating area because it is of great practical interest, there is lots of interesting theory and it brings together theoretical computer science and statistics.
If you’re interested, I reiterate my suggestion to do a Google search on differential privacy. There’s a lot of stuff out there.
4. References
Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith (2006). Calibrating Noise to Sensitivity in Private Data Analysis. Theory of Cryptography Conference (TCC), Springer.
Cynthia Dwork (2006). Differential Privacy. International Colloquium on Automata, Languages and Programming (ICALP), pp. 1-12.
—Larry Wasserman
6 Comments
I want to make more explicit the problems of differential privacy being “too strong”, because it seems to be a worse situation than you’re letting on.
A few years ago Daniel Kifer and Ashwin Machanavajjhala wrote a paper called “No Free Lunch in Data Privacy”, which uses the eponymous No Free Lunch Theorem to argue that it is not possible to provide both privacy and utility without making assumptions about how the data are generated, at least as those things are (were?) usually defined.
One big impact here is that this seems to weaken the claim that we can provide privacy without making any assumptions about the data, because what is shown here is that if that’s the case, you will necessarily be compromising the utility of your data.
So, is DP too strong to do anything useful with? This seems to support that notion, though there may still be practical applications of the theory. Also, this does not exclude other possible notions of privacy, from which we can possibly recover routines for anonymizing data.
I should also say that I’m not an expert in the area either, so corrections are welcome.
Well, there are lots of examples where you can achieve differential privacy and still have utility. The simple example of estimating a mean on a bounded domain is such an example, and there is no assumption about how the data were generated in that case. Similarly, there are lots of papers on private data summaries, private classifiers, etc. So it clearly does preserve utility in many cases. But then there are cases where it clearly does not work; a high-dimensional contingency table with many zero counts is an example. So I think it is very application dependent.

I did read the paper by Kifer and Machanavajjhala and it was interesting, but I don't remember the details right now.
Hi Larry! I just found this blog post. I'd like to understand better your point that differential privacy necessitates adding too much noise for high-dimensional data. For example, suppose your database consists of d-dimensional boolean-valued data, and what you are interested in is a contingency table containing the ~ (d choose k) k-way marginals. We know how to provide these on any dataset, so that the relative error is not worse than ~ O(d * k / sqrt{n}) for any of these marginals, i.e. if you want relative error 1%, then you need your database to be of size Omega(k * d). In the theory community, we think of this as quite reasonable — the amount of data you need only grows linearly in the dimension of the data! Of course, as you say, we are necessarily talking about additive error, since it is necessary to hide the zeros. (When I talk about relative error, I mean dividing the additive error by n). It's true that some kinds of additive error are unavoidable when we are dealing with differential privacy — is it additive, rather than multiplicative, error that is the problem?
Of course, what we can do in a computationally efficient way is another story: we do not have polynomial time algorithms that achieve the above bound. But this is a different issue than that differential privacy is too strong as an information theoretic constraint. The computational feasibility of differentially private analysis is still wide open for the most part, but the information theoretic bounds are what I would have considered extremely positive.
Hi Aaron,

Interesting point. But how well can you preserve the entries of the table (not just the marginals)? Isn't the L_1 distance between the multinomials (the original and the released table) going to be large when d is large?

–Larry
It's true that if what you want to release is the histogram of the data itself (i.e. “1 copy of element A, 0 copies of element B, 0 copies of element C, 2 copies of element D”, etc.) then you are in trouble: this can't be done privately without huge amounts of noise. This is by design — in fact, any method that lets you reconstruct this histogram too accurately is called “blatantly non-private” (see, e.g., this paper, which pre-dated differential privacy: http://dl.acm.org/citation.cfm?id=773173) — this is something that differential privacy protects against. What the lower bounds in that paper show is that any algorithm that answers too many queries (of a type that includes marginals) to relative error o(1/sqrt{n}) allows an adversary to reconstruct the database almost exactly — and hence is not “private.”

So that's fair enough — if you want a summary of the data that captures everything you could compute from the data itself, you cannot do this with differential privacy. But often, you can still privately do whatever you would have used that data for. So in addition to computing all of the marginals, you can, e.g., do empirical risk minimization, combinatorial optimization, compute the singular values and singular vectors of matrix-valued data, and plenty of other things on high-dimensional data. This is not to say that privacy is without cost — if you want to do these things to the same level of accuracy you could have done them non-privately, you will need more data. So I think one useful way of talking about this cost is in terms of how much larger your data set has to be before you can feasibly perform a task privately, as opposed to non-privately. In the “good” cases, this factor is only linear in the dimension of the data, but of course, in some settings, this may be unacceptably large.
Yes I think we are in agreement.
–LW