Minimax Theory Saves The World

Minimax theory is the best thing in statistical machine learning—or the worst—depending on your point of view.

1. The Basics of Minimaxity

Let {{\cal P}} be a set of distributions. We observe {Y_1,\ldots, Y_n \sim P \in {\cal P}}. Let {\theta = \theta(P)} be a function of the distribution (like the mean of {P}). The minimax risk is

\displaystyle  R_n = \inf_{\hat \theta}\sup_{P\in {\cal P}} \mathbb{E}_P [ d(\hat\theta,\theta(P))]

where the infimum is over all estimators {\hat\theta} (an estimator is any function of the observed data) and {d(\cdot,\cdot)} is some distance (more generally, a loss function). In machine learning, one often reports the sample complexity instead of the minimax risk. The sample complexity is the smallest sample size {n} needed to make {R_n} smaller than some prescribed tolerance {\epsilon}.
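
To make this concrete, here is a minimal Monte Carlo sketch (my own illustration, not from the post): it approximates the maximum risk of an estimator over a finite grid of {{\rm Normal}(\theta,1)} models, standing in for the supremum over {{\cal P}}, and then searches for the sample complexity at a tolerance {\epsilon}. The helper names and the grid of {\theta} values are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_risk(estimator, thetas, n, reps=5000):
    """Monte Carlo estimate of sup_theta E[(theta_hat - theta)^2] when
    Y_1, ..., Y_n ~ Normal(theta, 1), with the supremum taken over a finite
    grid of theta values standing in for the class cal P."""
    worst = 0.0
    for theta in thetas:
        Y = rng.normal(theta, 1.0, size=(reps, n))  # reps simulated data sets
        est = estimator(Y)                          # one estimate per data set
        worst = max(worst, np.mean((est - theta) ** 2))
    return worst

def sample_complexity(estimator, thetas, eps, n_max=500):
    """Smallest n whose (estimated) maximum risk falls below the tolerance eps."""
    for n in range(1, n_max + 1):
        if max_risk(estimator, thetas, n) <= eps:
            return n
    return None

thetas = np.linspace(-3, 3, 13)
sample_mean = lambda Y: Y.mean(axis=1)
print(max_risk(sample_mean, thetas, n=50))               # close to 1/50
print(sample_complexity(sample_mean, thetas, eps=0.05))  # roughly 1/eps = 20
```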

The challenge is to compute (or approximate) the minimax risk {R_n} and to find an estimator {\hat\theta} that achieves the minimax risk. An estimator {\hat\theta} whose maximum risk

\displaystyle  \sup_{P\in {\cal P}} \mathbb{E}_P [ d(\hat\theta,\theta(P))]

is equal to {R_n} is called a minimax estimator. It has the smallest possible maximum (worst-case) risk.

Here is a famous example. Let {Y_1,\ldots, Y_n \sim {\rm Normal}(\theta,1)}. Suppose that {d} has the form {d(a,b) = g(a-b)} where {g} is any bowl-shaped function. This means that the sub-level sets {\{ y:\ g(y) \leq c\}} are convex and symmetric about the origin. Then Wolfowitz (1950) proved that the unique estimator that is minimax for all bowl-shaped {g} is the sample mean {\overline{Y}_n}.
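
A quick simulation sketch (again my own, taking squared error as one particular bowl-shaped {g}) contrasting the worst-case risk of the sample mean with that of another sensible estimator, the sample median:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 20000
thetas = np.linspace(-5, 5, 11)

def worst_case_mse(estimator):
    """Worst-case mean squared error over a grid of Normal(theta, 1) models."""
    return max(np.mean((estimator(rng.normal(t, 1.0, size=(reps, n))) - t) ** 2)
               for t in thetas)

print(worst_case_mse(lambda Y: Y.mean(axis=1)))        # about 1/n = 0.01, the minimax value
print(worst_case_mse(lambda Y: np.median(Y, axis=1)))  # about pi/(2n) = 0.0157, strictly larger
```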

Here is a more sophisticated example: density estimation. Let {Y_1,\ldots, Y_n \sim P} be random variables where {P} has density {p}. We want to know the minimax risk

\displaystyle  R_n = \inf_{\hat p}\sup_{P\in {\cal P}} \mathbb{E}_P \int (\hat p(y) - p(y))^2 dy

where {{\cal P}} is the set of densities such that {\int (p''(y))^2 dy \leq L^2}. It turns out that {R_n = c n^{-\frac{4}{5}}} for some constant {c>0}. The kernel density estimator

\displaystyle  \hat p(y) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\left( \frac{|y-Y_i|}{h}\right)

achieves this rate of convergence (for appropriate {K} and {h}). Histograms, however, do not. They converge more slowly. So we have discovered that, for the class {{\cal P}}, kernel density estimators are better than histograms and we cannot do much better than kernel estimators.
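
Here is a small numerical illustration (my own sketch, using a Gaussian kernel and a standard normal as the "unknown" density, neither of which comes from the post): it estimates the integrated squared error of a kernel estimator with bandwidth {h \propto n^{-1/5}} and of a histogram with roughly {n^{1/3}} bins as {n} grows. The kernel estimator's error should shrink roughly like {n^{-4/5}}, the histogram's more slowly.

```python
import numpy as np

rng = np.random.default_rng(2)
grid = np.linspace(-4, 4, 401)
dy = grid[1] - grid[0]
true_p = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi)  # the "unknown" density: standard normal

def kde(Y, h):
    """Gaussian kernel density estimate evaluated on the grid."""
    u = (grid[:, None] - Y[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(Y) * h * np.sqrt(2 * np.pi))

def hist_density(Y, bins):
    """Histogram density estimate on the grid (zero outside the data range)."""
    heights, edges = np.histogram(Y, bins=bins, density=True)
    idx = np.clip(np.searchsorted(edges, grid, side="right") - 1, 0, len(heights) - 1)
    inside = (grid >= edges[0]) & (grid <= edges[-1])
    return np.where(inside, heights[idx], 0.0)

for n in (100, 1000, 10000):
    Y = rng.normal(size=n)
    h = 1.06 * n ** (-0.2)                   # the n^{-1/5} bandwidth behind the n^{-4/5} rate
    bins = max(5, int(round(n ** (1 / 3))))  # the usual n^{1/3} binning for histograms
    ise_kde = np.sum((kde(Y, h) - true_p) ** 2) * dy
    ise_hist = np.sum((hist_density(Y, bins) - true_p) ** 2) * dy
    print(n, round(ise_kde, 5), round(ise_hist, 5))
```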

Now consider a set of spaces {\{ {\cal P}_\alpha \}} indexed by a parameter {\alpha}. Think of {\alpha} as a measure of smoothness. These spaces could have different minimax rates. Sometimes it is possible to construct estimators that achieve the minimax rate of convergence without knowing the true value of {\alpha}. In other words, these estimators automatically adapt to the right amount of smoothness; they are called adaptive estimators. A famous example is wavelet estimation (Donoho, Johnstone, Kerkyacharian and Picard 1995).
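
As a toy illustration of the thresholding idea behind such adaptive estimators (my own sketch, with a hand-coded Haar transform and a made-up test signal, not the estimator of Donoho et al.): soft-threshold the empirical wavelet coefficients at the universal threshold {\sigma\sqrt{2\log n}}, which requires no knowledge of how smooth the truth is.

```python
import numpy as np

rng = np.random.default_rng(3)

def haar_forward(x):
    """Orthonormal Haar transform; assumes len(x) is a power of two."""
    approx, details = x.astype(float), []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))  # detail (wavelet) coefficients
        approx = (even + odd) / np.sqrt(2)         # coarser approximation
    return approx, details

def haar_inverse(approx, details):
    for d in reversed(details):
        even, odd = (approx + d) / np.sqrt(2), (approx - d) / np.sqrt(2)
        approx = np.empty(2 * len(approx))
        approx[0::2], approx[1::2] = even, odd
    return approx

n = 1024
t = np.linspace(0, 1, n)
f = np.where(t < 0.5, np.sin(4 * np.pi * t), 0.3)  # piecewise-smooth "truth"
sigma = 0.2                                        # noise level, treated as known here
y = f + sigma * rng.normal(size=n)

approx, details = haar_forward(y)
lam = sigma * np.sqrt(2 * np.log(n))               # universal threshold
details = [np.sign(d) * np.maximum(np.abs(d) - lam, 0.0) for d in details]  # soft thresholding
f_hat = haar_inverse(approx, details)

print("MSE of the raw data:    ", np.mean((y - f) ** 2))      # about sigma^2
print("MSE after thresholding: ", np.mean((f_hat - f) ** 2))  # much smaller here
```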

2. Technical Digression

To find {R_n}, we usually find an upper bound {R_n^*} and a lower bound {R_{n*}} and then, if we are lucky, {R_{n*} \approx R_n^*}.

To find an upper bound, pick any estimator {\hat\theta}. The maximum risk of any estimator is an upper bound:

\displaystyle  R_n \leq \sup_{P\in {\cal P}} \mathbb{E}_P [ d(\hat\theta,\theta(P))] \equiv R_n^*.

Finding a lower bound is trickier. Typically one uses an information-theoretic inequality like Fano’s inequality. In particular, we pick a finite set of distributions {P_1,\ldots, P_N \in {\cal P}}. Then

\displaystyle  R_n \geq \frac{\alpha}{2} \left( 1 - \frac{ n \beta + \log 2}{\log N}\right) \equiv R_{n*}

where

\displaystyle  \alpha = \min_{j\neq k}d(\theta(P_j),\theta(P_k)),

\displaystyle  \beta = \max_{j\neq k} {\sf KL}(P_j,P_k)

and {{\sf KL}(P_j,P_k)} is the Kullback-Leibler distance between {P_j} and {P_k}. (Sometimes you need to prune the original finite set of distributions using the Varshamov-Gilbert lemma. See Tsybakov 2009).

Choosing {\hat\theta} and the finite set {P_1,\ldots, P_N} is an art. If you don’t choose wisely, the bounds will not be tight.
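
As a concrete (and deliberately crude) instance of the recipe, here is a sketch, entirely my own, that evaluates the bound above for a finite family of {d}-dimensional Normal means with identity covariance, taking Euclidean distance as the loss; for such Gaussians, {{\sf KL}(P_j,P_k) = \|\theta_j-\theta_k\|^2/2} per observation. The hypercube family and the scaling of {\delta} with {n} are illustrative choices.

```python
import numpy as np
from itertools import combinations, product

def fano_lower_bound(thetas, n):
    """Fano bound R_{n*} = (alpha/2) * (1 - (n*beta + log 2) / log N) for estimating
    the mean of Normal(theta, I_d) from n i.i.d. observations, using the finite
    family `thetas` and d(a, b) = ||a - b|| as the loss."""
    pairs = list(combinations(thetas, 2))
    alpha = min(np.linalg.norm(a - b) for a, b in pairs)
    beta = max(0.5 * np.linalg.norm(a - b) ** 2 for a, b in pairs)  # KL between two of the Gaussians
    N = len(thetas)
    return 0.5 * alpha * (1.0 - (n * beta + np.log(2)) / np.log(N))

d, n = 8, 100
delta = 0.5 / np.sqrt(n)                     # separation tuned to the sample size
thetas = [delta * np.array(v) for v in product([0.0, 1.0], repeat=d)]  # 2^d hypercube vertices
print(fano_lower_bound(thetas, n))           # a positive, if loose, lower bound
```

Pruning the hypercube with the Varshamov-Gilbert lemma, as mentioned above, improves the separation-to-KL trade-off and gives a tighter bound.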

3. In Favor Of Minimaxity

The purpose of minimax analysis is to understand the difficulty of a problem. Here is a small sample of the types of questions that minimax theory has cast light on recently:

  1. How hard is it to find the non-zero variables in a high-dimensional regression?
  2. How hard is it to estimate an undirected graph?
  3. How hard is it to estimate the homology of a manifold?
  4. How hard is it to test a hypothesis about a high-dimensional Normal mean?
  5. How hard is it to estimate a sparse signal from compressed measurements?

The virtue of minimax theory is that it gives a precise notion of the intrinsic difficulty of a problem. And it allows you to evaluate an estimator against a benchmark: if your estimator is not minimax then you can search for a better estimator.

4. Against Minimaxity

There are some valid criticisms of minimax theory. You have to specify a loss function. You have to specify a class of distributions {{\cal P}}. And you are using the worst case to define the value of a procedure.

Having to specify a loss and a class {{\cal P}} does not seem like a real problem to me. Minimax theory—or, for that matter, all decision theory—is relative to some loss and some class. When we say that a problem is hard, we mean it is hard under certain assumptions. A problem may be easier or harder relative to different assumptions.

The claim that minimax theory is driven by the worst case is a more substantial criticism. I happen to think worst case analysis is a good thing. I want an estimator that does reasonably well under any circumstances.

And for nonparametric problems, where {{\cal P}} is a complicated infinite dimensional object, I would be skeptical of anything other than worst case analysis.

Think of it this way: if I buy a car, and I know that the car was tested and found to perform well even under extremely difficult circumstances, then I’ll feel better about buying the car.

Having said that, there are plenty of people who are uncomfortable with the worst case nature of minimax theory.

5. Minimaxity Gets the Last Word?

Minimax theory was important in statistics for a while, and then it sort of faded. But it had a revival, partly due to the work of Donoho et al.

What is really interesting is that I see more and more minimax analyses in ML outlets such as NIPS and JMLR. So perhaps the marketplace of ideas is telling us that minimax theory is gaining in importance?

6. References

Donoho, D.L., Johnstone, I.M., Kerkyacharian, G. and Picard, D. (1995). Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society, Series B, 57, 301–369.

Tsybakov, A.B. (2009). Introduction to Nonparametric Estimation. Springer.

Wasserman, L. (2006). All of Nonparametric Statistics. Springer.

Wolfowitz, J. (1950). Minimax estimates of the mean of a normal distribution with known variance. The Annals of Mathematical Statistics, 21, 218–230.

—Larry Wasserman

12 Comments

  1. Posted July 17, 2012 at 10:16 pm

    “if I buy a car, and I know that the car was tested and found to perform well even under extremely difficult circumstances, then I’ll feel better about buying the car.” Yes, but if the tests are maximally stringent, you might have to conclude it’s “unsafe at any speed”.

    • Posted July 18, 2012 at 7:38 am

      Indeed. This emphasizes that the choice of {\cal P} matters.

  2. Zach
    Posted July 17, 2012 at 10:43 pm

    In simple situations for which we know good low risk estimators, how much higher does the risk of the minimax estimator tend to be?

    • Zach
      Posted July 17, 2012 at 10:49 pm

      Sorry, I’m talking about Bayes risk here. How much worse is the Bayes risk of the minimax estimator than that of the Bayes estimator, typically? If there were a result, for instance, that under certain regular models the minimax estimator is not too far from the Bayes estimator, that would be very strong support for using minimax.

  3. Matías
    Posted July 18, 2012 at 1:10 pm

    Thanks for these clear and brief expositions of statistical theory.
    Just yesterday I was going to meet with my statistics professor to ask him some questions; we couldn’t, so forgive me if I pose here the question related to minimax estimators:

    How do you deal with the dependence on irrelevant alternatives? In the words of Chernoff (“Rational selection of decision functions”):

    “In some example the minimax regret criterion may select a strategy [estimator] d3 among the available strategies d1, d2, d3, d4. […] if for some reason d4 is made unavailable, the minimax regret criterion will select d2 among d1, d2 and d3.”

    It puzzles me how this incoherence arises (or else how not to see this as an incoherence).

    By the way, in reference to the Bayes risk, may I mention that it supposes a prior distribution on the parameter in question; in that context the minimax risk can be understood as the Bayes risk under the worst-case prior. (This last comment may be wrong; I’m not sure.)

    Nice (and helpful) posts : )

    • Posted July 18, 2012 at 1:29 pm

      Yes. You are absolutely correct. The minimax rule is a Bayes rule under a least favorable prior. (I don’t remember enough to answer your first question.)

    • Paul
      Posted July 20, 2012 at 12:04 pm

      Why is the minimax regret criterion ‘incoherent’? Because it requires simultaneous optimization with respect to multiple objectives. … Not such a great answer, but maybe it’s a start?

    • Keith O'Rourke
      Posted July 20, 2012 at 12:20 pm

      Do you have a reference or further clarification for the d1, d2, d3, d4 comment?

      I remember the opposite in informal decision making – amongst d1, d2, d3 – I prefer d2.
      Ok, but did you know you could also choose d4?
      No, in that case I prefer d3.
      (Explanation is that d4 brought other information to mind that led to the change in preference.)

      • Matías
        Posted August 29, 2012 at 8:13 pm

        @O’Rourke: I suppose this answer comes way too late, but yes, there is a reference: Chernoff’s “Rational selection of decision functions,” a 1954 Econometrica paper. You can find it as the first hit on Google for the author and the title (http://dido.econ.yale.edu/P/cp/p00b/p0091.pdf).

        @Paul: thanks for the humble start, just have to see where it takes me : )

  4. Posted July 24, 2012 at 4:21 am

    As to Larry’s mentioning that

    “The claim that minimax theory is driven by the worst case is a more substantial criticism. I happen to think worst case analysis is a good thing. I want an estimator that does reasonably well under any circumstances.”

    and Zach’s remark ‘how much higher does the risk of a Bayes estimator tend to be?’

    I think it’s important to mention the following: in some settings, whether or not minimax theory is useful (and the worst-case tends not to be too far from ‘most’ average cases) depends on whether one adopts the right loss function for the problem.

    For example, in the individual-sequence prediction setting, on which Larry posted two weeks ago, if one looks for the algorithm that achieves the minimax *standard loss* (e.g., absolute loss as in Larry’s example, or log loss), one ends up with a trivial algorithm that predicts 1/2 no matter what happened in the past.
    The prior-expected performance of a Bayesian algorithm will almost always be substantially better.

    If one looks for the algorithm that achieves the minimax *regret* (the regret is itself a meta-loss function), one gets something very interesting – some learning algorithms have small minimax regret in the worst-case over all sequences; and in many circumstances the expected regret of a Bayesian predictor, assuming the Bayesian’s prior is correct, will be only slightly smaller than the minimax regret.

    Think about data-compression: the worst-case optimal data compressor of binary sequences simply encodes 0 as 0 and 1 as 1, and will never achieve any compression. Algorithms like gzip try to minimize the compression *regret* (additional rather than absolute number of bits) compared to the best finite state predictor. These algorithms have worst-case guarantees and they work well in practice (we all use them).

    • Zach
      Posted July 24, 2012 at 10:38 am

      Great response, thanks.

  5. Posted July 24, 2012 at 8:45 am

    That’s a great point, Peter. When writing my post, I had in mind nonparametric regression etc. (in batch mode) with standard loss functions.
    The setting (and the loss function) really do matter.
    –LW
