Minimax theory is the best thing in statistical machine learning—or the worst—depending on your point of view.
1. The Basics of Minimaxity
Let ${\cal P}$ be a set of distributions. We observe $X_1,\ldots,X_n \sim P \in {\cal P}$. Let $\theta = T(P)$ be a function of the distribution (like the mean of $P$). The minimax risk is

$$
R_n = \inf_{\hat\theta}\sup_{P\in {\cal P}} \mathbb{E}_P\bigl[d(\hat\theta,\theta)\bigr]
$$

where the infimum is over all estimators $\hat\theta$ (an estimator is any function of the observed data) and $d$ is some distance (more generally, a loss function). In machine learning, one often reports the sample complexity instead of the minimax risk. The sample complexity is the smallest sample size $n$ needed to make $R_n$ less than some small number $\epsilon$.
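To make the bookkeeping concrete, here is a toy sketch: suppose the minimax risk happens to be $\sigma^2/n$ (as it is for estimating a Normal mean under squared error). Then the sample complexity is just the smallest $n$ with $\sigma^2/n \le \epsilon$. The helper name below is hypothetical, purely for illustration:

```python
import math

def sample_complexity(eps, sigma2=1.0):
    """Hypothetical helper: smallest n with minimax risk sigma2 / n <= eps
    (the squared-error minimax risk for a Normal mean with variance sigma2)."""
    return math.ceil(sigma2 / eps)

print(sample_complexity(0.25))  # -> 4, since 1/4 <= 0.25 but 1/3 > 0.25
```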
The challenge is to compute (or approximate) the minimax risk and to find an estimator that achieves it. An estimator $\hat\theta$ whose maximum risk

$$
\sup_{P\in {\cal P}} \mathbb{E}_P\bigl[d(\hat\theta,\theta)\bigr]
$$

is equal to $R_n$ is a minimax estimator. It has the smallest possible maximum (worst case) risk.
Here is a famous example. Let $X_1,\ldots,X_n \sim N(\theta,1)$. Suppose that $d$ has the form $d(\hat\theta,\theta) = L(\hat\theta - \theta)$ where $L$ is any bowl-shaped function. This means that the sub-level sets $\{x:\ L(x) \le c\}$ are convex and symmetric about the origin. Then Wolfowitz (1950) proved that the unique estimator that is minimax for all bowl-shaped $L$ is the sample mean $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$.
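A quick simulation sketch of this example: under squared error the risk of $\overline{X}_n$ is $1/n$ for every $\theta$, so its maximum risk over any grid of candidate means should match the minimax risk $1/n$ up to simulation noise. The function name and the grid of $\theta$ values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_risk_of_mean(n, thetas, reps=100_000):
    """Monte Carlo estimate of the worst-case squared-error risk of the
    sample mean over a (hypothetical) grid of candidate values of theta."""
    risks = []
    for theta in thetas:
        x = rng.normal(theta, 1.0, size=(reps, n))
        risks.append(np.mean((x.mean(axis=1) - theta) ** 2))
    return max(risks)

# The risk is 1/n for *every* theta, so this is close to 1/50 = 0.02.
print(max_risk_of_mean(50, [0.0, 3.0, -10.0]))
```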
Here is a more sophisticated example: density estimation. Let $X_1,\ldots,X_n \sim P$ be random variables where $X_i \in \mathbb{R}$ has density $p$. We want to know the minimax risk

$$
R_n = \inf_{\hat p}\sup_{p\in {\cal P}} \mathbb{E}_p \int \bigl(\hat p(x) - p(x)\bigr)^2\, dx
$$

where ${\cal P}$ is the set of densities $p$ such that $\int (p''(x))^2\, dx \le C$. It turns out that $R_n = c\, n^{-4/5}$ for some constant $c$. The kernel density estimator

$$
\hat p(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h} K\!\left(\frac{x - X_i}{h}\right)
$$

achieves this rate of convergence (for an appropriate kernel $K$ and bandwidth $h$). Histograms, however, do not; they converge more slowly. So we have discovered that, for the class ${\cal P}$, kernel density estimators are better than histograms, and we cannot do much better than kernel estimators.
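Here is a minimal sketch of this estimator with a Gaussian kernel $K$ and a bandwidth of the rate-optimal order $h \asymp n^{-1/5}$. The grid, seed, and helper names are illustrative choices, not canonical ones; the point is only that the integrated squared error shrinks as $n$ grows:

```python
import numpy as np

def kde(x, data, h):
    """Kernel density estimate (1/n) sum_i (1/h) K((x - X_i)/h)
    with a Gaussian kernel K, evaluated at the points in x."""
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
grid = np.linspace(-4.0, 4.0, 401)
truth = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)  # true N(0,1) density

def ise(n):
    """Integrated squared error of the KDE on one sample of size n,
    with bandwidth of order n^(-1/5)."""
    sample = rng.normal(size=n)
    err = kde(grid, sample, n ** (-1 / 5)) - truth
    return float(err @ err) * (grid[1] - grid[0])

ise_small, ise_large = ise(100), ise(10_000)
print(ise_small, ise_large)  # the larger sample gives the smaller error
```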
Now consider a set of spaces $\{{\cal P}_\beta:\ \beta \in B\}$ indexed by a parameter $\beta$. Think of $\beta$ as a measure of smoothness. These spaces could have different minimax rates. Sometimes it is possible to construct estimators that achieve the minimax rate of convergence without knowing the true value of $\beta$. In other words, these estimators automatically adapt to the right amount of smoothness; they are called adaptive estimators. A famous example is wavelet estimation (Donoho, Johnstone, Kerkyacharian and Picard 1995).
2. Technical Digression
To find $R_n$, we usually find an upper bound $U_n$ and a lower bound $L_n$ and then, if we are lucky, $L_n = U_n$.
To find an upper bound, pick any estimator $\hat\theta$. The maximum risk of any estimator is an upper bound:

$$
R_n \le \sup_{P\in {\cal P}} \mathbb{E}_P\bigl[d(\hat\theta,\theta)\bigr].
$$
Finding a lower bound is trickier. Typically one uses an information-theoretic inequality like Fano's inequality. In particular, we pick a finite set of distributions $P_1,\ldots,P_N \in {\cal P}$ with $\theta_j = T(P_j)$. Then

$$
R_n \ge \frac{\alpha}{2}\left(1 - \frac{n\beta + \log 2}{\log N}\right)
$$

where $\alpha = \min_{j\ne k} d(\theta_j,\theta_k)$, $\beta = \max_{j\ne k} K(P_j,P_k)$, and $K(P_j,P_k)$ is the Kullback-Leibler distance between $P_j$ and $P_k$. (Sometimes you need to prune the original finite set of distributions using the Varshamov-Gilbert lemma; see Tsybakov 2009.)
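To see the bound in action, here is a small numerical sketch. The hypothesis set and the numbers ($n$, $N$, $\alpha$) are made up for illustration; I take $\beta = \alpha^2/2$, which is the Kullback-Leibler distance between unit-variance Normals whose means differ by $\alpha$:

```python
import math

def fano_bound(alpha, beta, n, N):
    """Fano lower bound (alpha/2) * (1 - (n*beta + log 2) / log N), where
    alpha = min distance between the theta_j and
    beta  = max pairwise KL distance between the P_j."""
    return (alpha / 2) * (1 - (n * beta + math.log(2)) / math.log(N))

# Illustrative choices: N = 16 hypotheses, separation alpha = 0.1, n = 100.
n, N, alpha = 100, 16, 0.1
beta = alpha**2 / 2
print(fano_bound(alpha, beta, n, N))  # a positive lower bound on R_n
```

Note that if $n\beta$ grows while $\log N$ stays fixed, the bound becomes vacuous: this is why choosing the finite set carefully matters.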
Choosing $N$ and the finite set of distributions is an art. If you don’t choose wisely, the bounds will not be tight.
3. In Favor Of Minimaxity
The purpose of minimax analysis is to understand the difficulty of a problem. Here is a small sample of the types of questions that minimax theory has cast light on recently:
- How hard is it to find the non-zero variables in a high-dimensional regression?
- How hard is it to estimate an undirected graph?
- How hard is it to estimate the homology of a manifold?
- How hard is it to test a hypothesis about a high-dimensional Normal mean?
- How hard is it to estimate a sparse signal from compressed measurements?
The virtue of minimax theory is that it gives a precise notion of the intrinsic difficulty of a problem. And it allows you to evaluate an estimator against a benchmark: if your estimator is not minimax then you can search for a better estimator.
4. Against Minimaxity
There are some valid criticisms of minimax theory. You have to specify a loss function. You have to specify a class of distributions . And you are using the worst case to define the value of a procedure.
Having to specify a loss and a class does not seem like a real problem to me. Minimax theory—or for that matter, all decision theory—is relative to some loss and some class. When we say that a problem is hard, we mean it is hard under certain assumptions. A problem may be easier or harder relative to different assumptions.
The claim that minimax theory is driven by the worst case is a more substantial criticism. I happen to think worst case analysis is a good thing. I want an estimator that does reasonably well under any circumstances.
And for nonparametric problems, where $\theta$ is a complicated infinite-dimensional object, I would be skeptical of anything other than worst case analysis.
Think of it this way: if I buy a car, and I know that the car was tested and found to perform well even under extremely difficult circumstances, then I’ll feel better about buying the car.
Having said that, there are plenty of people who are uncomfortable with the worst case nature of minimax theory.
5. Minimaxity Gets the Last Word?
Minimax theory was important in statistics for a while, then it sort of faded. But it had a revival, partly due to the work of Donoho et al.

But what is really interesting is that I see more and more minimax analyses in ML outlets such as NIPS, JMLR, etc. So perhaps the marketplace of ideas is telling us that minimax theory is gaining in importance?
Donoho, D.L., Johnstone, I.M., Kerkyacharian, G. and Picard, D. (1995). Wavelet shrinkage: Asymptopia? Journal of the Royal Statistical Society, Series B, 57, 301–369.
Tsybakov, A.B. (2009). Introduction to Nonparametric Estimation. Springer.
Wasserman, L. (2006). All of Nonparametric Statistics. Springer.
Wolfowitz, J. (1950). Minimax estimates of the mean of a normal distribution with known variance. The Annals of Mathematical Statistics, 21, 218-230.