Yes, I think we are in agreement.

–LW

It's true that if what you want to release is the histogram of the data itself (i.e. "1 copy of element A, 0 copies of element B, 0 copies of element C, 2 copies of element D", etc.), then you are in trouble: this cannot be done privately without huge amounts of noise. This is by design. In fact, any method that lets you reconstruct this histogram too accurately is called "blatantly non-private" (see, e.g., this paper, which pre-dated differential privacy: http://dl.acm.org/citation.cfm?id=773173); this is exactly what differential privacy protects against. What the lower bounds in that paper show is that any algorithm that answers too many queries (of a type that includes marginals) to relative error o(1/sqrt(n)) allows an adversary to reconstruct the database almost exactly, and hence is not "private."
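
To make "huge amounts of noise" concrete, here is a minimal sketch of the standard Laplace mechanism for histogram release; the function name and the choice of epsilon are illustrative, not taken from the discussion:

```python
import numpy as np

rng = np.random.default_rng(0)

def private_histogram(counts, epsilon):
    # Adding or removing one record changes a single cell by 1, so
    # per-cell Laplace noise with scale 1/epsilon gives epsilon-DP.
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    return np.asarray(counts, dtype=float) + noise

# The tiny histogram from the example: true counts of 0-2 are buried
# under noise with standard deviation sqrt(2)/epsilon (~2.8 at eps=0.5).
released = private_histogram([1, 0, 0, 2], epsilon=0.5)
```

The released cells are individually useless here, which is the point: any mechanism that reported such a sparse histogram cell-by-cell with high accuracy would be blatantly non-private.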

So that's fair enough: if you want a summary of the data that captures everything you could compute from the data itself, you cannot get one with differential privacy. But often you can still privately do whatever you would have used that data for. So in addition to computing all of the marginals, you can, e.g., do empirical risk minimization, combinatorial optimization, compute the singular values and singular vectors of matrix-valued data, and plenty of other things on high-dimensional data. This is not to say that privacy is without cost: if you want to do these things to the same level of accuracy you could have achieved non-privately, you will need more data. So I think one useful way of talking about this cost is in terms of how much larger your data set has to be before you can feasibly perform a task privately, as opposed to non-privately. In the "good" cases, this factor is only linear in the dimension of the data, but of course, in some settings, this may be unacceptably large.
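
As a back-of-the-envelope illustration of that factor (my own numbers, not from the comment above): answering all d one-way marginals with the Laplace mechanism under basic composition gives each query a budget of epsilon/d, so the additive error per marginal is on the order of d/epsilon and the relative error is d/(epsilon*n):

```python
def rows_needed(d, epsilon, alpha):
    # Under basic composition, Laplace noise per marginal has scale
    # d/epsilon, so the relative error is roughly d/(epsilon * n);
    # solve for the n that drives it below the target accuracy alpha.
    return d / (epsilon * alpha)

# e.g. d = 100 binary attributes, epsilon = 1, 1% relative error:
n = rows_needed(100, 1.0, 0.01)   # 10000.0 rows
```

The required n grows linearly in d, matching the "linear in the dimension" cost; more sophisticated mechanisms do better, but this is the simplest accounting.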

Hi Aaron,

Interesting point.

But how well can you preserve the entries of the table (not just the marginals)? Isn't the L_1 distance between the multinomials (the original and the released table) going to be large when d is large?

–Larry

Of course, what we can do in a computationally efficient way is another story: we do not have polynomial-time algorithms that achieve the above bound. But this is a different issue from whether differential privacy is too strong as an information-theoretic constraint. The computational feasibility of differentially private analysis is still wide open for the most part, but the information-theoretic bounds are what I would consider extremely positive.

Well, there are lots of examples where you can achieve differential privacy and still have utility. The simple example of estimating a mean on a bounded domain is one such case. And there is no assumption about how the data were generated in that case.
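
For concreteness, here is a minimal sketch of that mean example, assuming the data are known to lie in [lo, hi]; the function name and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def private_mean(x, lo, hi, epsilon):
    # One record moves the sum by at most (hi - lo), so Laplace noise
    # of scale (hi - lo)/epsilon on the sum gives epsilon-DP.
    x = np.clip(np.asarray(x, dtype=float), lo, hi)
    noisy_sum = x.sum() + rng.laplace(scale=(hi - lo) / epsilon)
    return noisy_sum / len(x)

# The error is O((hi - lo) / (epsilon * n)): it vanishes as n grows,
# with no assumption on how the data were generated.
```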

Similarly, there are lots of papers on private data summaries, private classifiers, etc. So it clearly does preserve utility in many cases. But then there are cases where it clearly does not work; a high-dimensional contingency table with many zero counts is an example. So I think it is very application dependent.

I did read the paper by Kifer and Machanavajjhala, and it was interesting, but I don't remember the details right now.

A few years ago, Daniel Kifer and Ashwin Machanavajjhala wrote a paper called "No Free Lunch in Data Privacy", which uses its namesake No Free Lunch theorem to argue that it is not possible to provide both privacy and utility without making assumptions about how the data are generated, at least as those things are (were?) usually defined.

One big impact is that this seems to weaken the claim that we can provide privacy without making any assumptions about the data: what is shown here is that if you make no such assumptions, you will necessarily be compromising the utility of your data.

So, is DP too strong to do anything useful with? This seems to support that notion, though there may still be practical applications of the theory. Also, this does not exclude other possible notions of privacy, from which we might still recover routines for anonymizing data.

I should also say that I’m not an expert in the area either, so corrections are welcome.
