The Tyranny of Tuning Parameters

We all know about the curse of dimensionality. The curse creates serious practical and theoretical difficulties in many aspects of statistics. But there is another pernicious problem that gets less attention. I call it: The Tyranny of Tuning Parameters.

Many (perhaps most) data analysis methods involve one or more tuning parameters. Examples are: the {k} in {k}-means clustering, the regularization parameter in the lasso, the bandwidth in density estimation, the number of neighbors in {k}-nearest neighbors, the number of eigenvectors in PCA, and so on.

For some of these problems we have good data-driven methods to choose the parameters. But for others we don’t. For example, the optimal choice of tuning parameter for the lasso requires knowledge that we would never have in practice.

1. Regression: A Success

An example of a success is cross-validation in regression. Here is one version. Suppose the data are {(X_1,Y_1),\ldots, (X_{2n},Y_{2n})}. Split the {2n} data points into a training set {{\cal T}} of size {n} and a test set {{\cal U}} of size {n}. From the training set we create regression estimators {\{\hat m_h:\ h\in {\cal H}\}} where {h} is a tuning parameter and {{\cal H}} is a finite set of values for {h}. For example, {h} could be the bandwidth for kernel regression, or the regularization parameter for the lasso.

If {(X,Y)} is a new observation, the mean prediction error is

\displaystyle  \mathbb{E} (Y- \hat m_h(X))^2.

Minimizing the prediction error is the same as minimizing the {L_2} risk:

\displaystyle  R(h) = \int (\hat m_h(x) - m(x))^2 dP(x)

where {m(x) = \mathbb{E}(Y|X=x)} is the true regression function.

Estimate the risk using the test set:

\displaystyle  \hat R(h) = \frac{1}{n}\sum_{(X_i,Y_i)\in {\cal U}} (Y_i - \hat m_h(X_i))^2.

Choose {\hat h} to minimize {\hat R(h)} and let {\hat m = \hat m_{\hat h}}. To avoid technical complications, assume all the random variables and estimators are bounded by {L < \infty}. We then have the following theorem (Theorem 7.1, Györfi, Kohler, Krzyżak and Walk, 2002):

Let {m_* = \hat m_{h_*}} be the oracle, that is, {h_*} minimizes {\int (\hat m_h(x) - m(x))^2 dP(x)} over {h\in {\cal H}}. Let {N} be the size of {{\cal H}}. For any {\delta>0},

\displaystyle  R(\hat m) \leq (1+\delta) R(m_*) + L^2 \left(\frac{16}{\delta} + 35 + 19\delta\right) \left(\frac{ 1 + \log N}{n}\right)

where

\displaystyle  R(\hat m) = \mathbb{E} \int (\hat{m}(x) - m(x))^2 dP(x)

and

\displaystyle  R(m_*)=\mathbb{E} \int (m_*(x) - m(x))^2 dP(x).

In words: if you use cross-validation to choose the tuning parameter {h}, you lose at most an extra term of size {O(\log N/n)}, which is pretty small.

(Note: I focused on sample splitting but of course there are lots of other versions of cross-validation: leave-one-out, 10-fold, etc.)
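For concreteness, here is a minimal sketch of the sample-splitting procedure above, using a Nadaraya-Watson kernel regression estimator. The simulated regression function, noise level, and bandwidth grid are all my own invented choices for illustration, not part of the original discussion.

```python
import numpy as np

# Sketch of sample-splitting cross-validation for a kernel regression bandwidth.
# The data-generating process and the grid H are arbitrary illustrative choices.

def nw_estimator(x_train, y_train, h):
    """Nadaraya-Watson regression estimator with a Gaussian kernel."""
    def m_hat(x):
        w = np.exp(-0.5 * ((x - x_train) / h) ** 2)     # kernel weights at x
        return np.sum(w * y_train) / (np.sum(w) + 1e-300)
    return np.vectorize(m_hat)

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0, 1, 2 * n)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(2 * n)

# Split the 2n points into a training set T and a test set U, each of size n.
X_tr, Y_tr, X_te, Y_te = X[:n], Y[:n], X[n:], Y[n:]

# Finite grid H of candidate bandwidths.
H = [0.02, 0.05, 0.1, 0.2, 0.5]

# Estimated risk \hat R(h): average squared prediction error on the test set.
R_hat = [np.mean((Y_te - nw_estimator(X_tr, Y_tr, h)(X_te)) ** 2) for h in H]

h_hat = H[int(np.argmin(R_hat))]
print("estimated risks:", np.round(R_hat, 4))
print("selected bandwidth:", h_hat)
```

The same pattern applies to any estimator indexed by a tuning parameter: fit {\hat m_h} on the training half, estimate the risk on the test half, and minimize over the grid.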

This is an example of a successful data-driven method for choosing the tuning parameter. But there are caveats. The result refers only to prediction error. There are other properties we might want; I’ll save that for a future post.

2. Density Estimation: A Failure

Now consider density estimation. Let {Y_1,\ldots, Y_n \sim P} where {P} has density {p} and {Y_i\in \mathbb{R}^d}. The usual kernel density estimator is

\displaystyle  \hat p_h(y) =\frac{1}{n}\sum_{i=1}^n \frac{1}{h^d} K\left(\frac{||y-Y_i||}{h}\right)

where {K} is a kernel and {h>0} is a bandwidth.

We want to choose {h\in {\cal H}} where {{\cal H}} has {N} elements. To choose {h}, one usually focuses on minimizing the {L_2} loss

\displaystyle  \int (\hat p_h(y) - p(y))^2 dy = \int \hat p_h^2(y)dy - 2 \int \hat p_h(y) p(y) dy + \int p^2(y) dy

which is equivalent to minimizing

\displaystyle  \int \hat p_h^2(y)dy - 2 \int \hat p_h(y) p(y) dy.

The latter can be estimated using test data {Y_1^*,\ldots, Y_n^*} by

\displaystyle  \int \hat p_h^2(y)dy - \frac{2}{n}\sum_{i=1}^n \hat p_h(Y_i^*).
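Here is a minimal numerical sketch of this criterion in one dimension, with the integral {\int \hat p_h^2} approximated on a grid. The simulated data, the evaluation grid, and the bandwidth set are invented for illustration.

```python
import numpy as np

def kde(points, data, h):
    """Gaussian kernel density estimate evaluated at the given points (d = 1)."""
    u = (points[:, None] - data[None, :]) / h
    return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi), axis=1) / h

def estimated_risk(train, test, h, grid):
    """int p_h^2 dy - (2/n) sum_i p_h(Y_i^*), with the integral done numerically."""
    p_hat = kde(grid, train, h)
    int_p2 = np.sum(p_hat ** 2) * (grid[1] - grid[0])   # Riemann approximation
    return int_p2 - 2.0 * np.mean(kde(test, train, h))

rng = np.random.default_rng(1)
Y = rng.standard_normal(300)        # training sample
Y_star = rng.standard_normal(300)   # independent test sample
grid = np.linspace(-6, 6, 2000)

H = [0.05, 0.1, 0.2, 0.4, 0.8]
scores = [estimated_risk(Y, Y_star, h, grid) for h in H]
print("selected bandwidth:", H[int(np.argmin(scores))])
```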

If we choose {h} to minimize this estimated risk, then the resulting estimator {\hat p} satisfies the following (Wegkamp, 1999): for any {\epsilon >0},

\displaystyle  \mathbb{E} \int (\hat p - p)^2 \leq (1+\epsilon) \int (p_* - p)^2 + \frac{C \log N}{n}

where {C} is a constant and {p_*} is the oracle, that is, the estimator based on the bandwidth that minimizes the true {L_2} loss over {h\in {\cal H}}.

This is a nice result. Like the regression result above, it gives us assurance that cross-validation (data-splitting) chooses a good tuning parameter.

But, I don’t think cross-validation really does solve the bandwidth selection problem in density estimation. The reason is that I don’t think {L_2} loss is the right loss function.

Look at these three densities:

In this case, {\int (p_0 - p_1)^2 = \int (p_0 - p_2)^2} so, in the {L_2} sense, {p_1} is just as good an approximation to {p_0} as {p_2} is. And yet, {p_2} is in many respects a lousy approximation. The problem is that {L_2} is insensitive to shape information.

Another example is shown in the following plot:

The top left plot shows the true distribution, which is a mixture of a bimodal density and a point mass at 0. Of course, this distribution is singular; it does not really have a density. The top right shows the density estimator with a small bandwidth, the bottom left shows the estimator with a medium bandwidth, and the bottom right shows the estimator with a large bandwidth. I would call the estimator in the bottom left plot a success; it shows the structure of the data nicely. Yet cross-validation leads to the choice {h=0}. The problem isn't cross-validation. The problem is to find a suitable loss function and then find a way to estimate the loss.
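To see the failure mode numerically, here is a small simulation in the spirit of the example above (the mixture weights, sample size, and bandwidth grid are arbitrary choices of mine, not taken from the original figure). Because a fraction of the observations are exact copies of one another, the estimated {L_2} risk keeps decreasing as {h} shrinks, so cross-validation is driven toward the smallest bandwidth on the grid.

```python
import numpy as np

def estimated_risk(train, test, h):
    """int p_h^2 dy - (2/n) sum_i p_h(Y_i^*); the integral has a closed form
    for the Gaussian kernel: (1/n^2) sum_{i,j} N(Y_i - Y_j; 0, 2 h^2)."""
    d = train[:, None] - train[None, :]
    int_p2 = np.mean(np.exp(-d ** 2 / (4 * h ** 2))) / np.sqrt(4 * np.pi * h ** 2)
    u = (test[:, None] - train[None, :]) / h
    p_at_test = np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi), axis=1) / h
    return int_p2 - 2.0 * np.mean(p_at_test)

rng = np.random.default_rng(2)

def sample(m):
    """Mixture: a point mass at 0 (weight 0.3) plus a bimodal continuous part."""
    z = rng.uniform(size=m)
    return np.where(z < 0.3, 0.0,
           np.where(z < 0.65, rng.normal(-2, 0.5, m), rng.normal(2, 0.5, m)))

train, test = sample(500), sample(500)
for h in [1.0, 0.3, 0.1, 0.03, 0.01, 0.003]:
    print(f"h = {h:6.3f}   estimated risk = {estimated_risk(train, test, h):10.3f}")
```

The estimated risk is dominated by the exact duplicates at 0, so the numbers keep dropping as {h \rightarrow 0}; the culprit is the {L_2} loss itself, not the cross-validation machinery.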

Despite years of research, I don’t think we have a good way to choose the bandwidth in a simple problem like density estimation.

3. Scale Space

Steve Marron takes a different point of view (Chaudhuri and Marron, 2000). Borrowing ideas from computer vision, he and his co-authors have argued that we should not choose a bandwidth. Instead, we should look at all the estimates {\{\hat p_h:\ h\geq 0\}}. This is called the scale-space view. The idea is that different bandwidths provide different information.
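In code, the scale-space view simply amounts to keeping the whole family of estimates rather than selecting one member. A minimal sketch, with invented data and an arbitrary bandwidth grid:

```python
import numpy as np
import matplotlib.pyplot as plt

def kde(points, data, h):
    """Gaussian kernel density estimate (d = 1)."""
    u = (points[:, None] - data[None, :]) / h
    return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi), axis=1) / h

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(2, 0.5, 150)])
grid = np.linspace(-5, 5, 500)

# Plot every member of the family {p_h : h in H} instead of choosing one h.
for h in [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]:
    plt.plot(grid, kde(grid, data, h), label=f"h = {h}")
plt.legend()
plt.show()
```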

I have much sympathy for this viewpoint. And it is similar to Christian Hennig’s views on clustering.

Nevertheless, I think in many cases it is useful to have a data-driven default value and I don’t think we have one yet.

4. Conclusion

There are many other examples where we don’t have convincing methods for choosing tuning parameters. In many journal and conference papers, the tuning parameters have been chosen in a very ad hoc way to make the proposed method look successful.

In my humble opinion, we need to put more effort into finding data-driven methods for choosing these annoying tuning parameters.

5. References

Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.

Chaudhuri, P. and Marron, J.S. (2000). Scale space view of curve estimation. The Annals of Statistics, 28, 408-428.

Wegkamp, M.H. (1999). Quasi-universal bandwidth selection for kernel density estimators. Canadian Journal of Statistics, 27, 409-420.

— Larry Wasserman

12 Comments

  1. Claus
    Posted July 8, 2012 at 11:55 am | Permalink

    > The result refers only to prediction error. There are other properties we might want; I’ll save that for a future post.
    This would be interesting. Can there be something that goes beyond empirical prediction?

    • Posted July 8, 2012 at 12:16 pm | Permalink

      sure. like support recovery.
      I’ll post about that soon
      –LW

  2. Posted July 8, 2012 at 3:06 pm | Permalink

    Very good post! For me, one of the reasons for the recent popularity of shape constrained density estimation, like log-concave densities, is that the choice of a tuning parameter becomes unnecessary!

    This area is very promising, at least as I see it.

    Also, whenever dealing with testing procedures, the choice of the tuning parameter becomes even more serious since there is no clear way of choosing an appropriate loss function.

    I quite like the post! Very good and didactic.

    • Posted July 8, 2012 at 3:25 pm | Permalink

      That’s true. Shape-constrained density estimation is often tuning-parameter free, which is nice.

  3. Christian Hennig
    Posted July 9, 2012 at 7:57 am | Permalink

    Nice that you mention my view on cluster analysis; if people want to read about it, something is here (still in “submitted” state):

    Click to access mixedsocialcluster1111.pdf

    The essence of my view is that I think that scientists want to be “objective”, which far too often tempts them into not deciding things that need to be decided. Not everything can be done data-driven. If you have a loss function, OK, the data can help you to optimise it. But the data can’t tell you which loss function to choose, and so the data really can’t tell you whether you should be interested in squared loss or some other loss, which actually depends on what you want to do with the result. In density estimation, really the scientist needs to decide how much smoothness he or she wants. The data can’t decide this because smoothness of a density is strictly not observable.
    In cluster analysis, if you have two nicely separated Gaussian mixture components and move them closer and closer to each other, at some point this will look like a single cluster, not two. And the researcher has to decide from which cutoff downwards this should be treated as a single cluster. There is no way the data can do this (OK, you could test unimodality but in several applications this is not the cluster concept you’re after). So there is no way to have the data properly decide the number of clusters without any kind of tuning. We should actually *want* to tune things so that they are of use to us (this assumes that it is well understood what tuning constants do, so I’m still in favour of getting rid of them where it’s not).
    Something about loss function in this spirit is
    C. Hennig and M. Kutlukaya: Some thoughts about the design of loss functions. REVSTAT 5 (2007), 19-39 (freely available online).

  4. Zach
    Posted July 10, 2012 at 9:10 am | Permalink

    Thanks for another great thought provoking post. I don’t see why the density estimation failure is an example of the tyranny of tuning parameters, though. It seems to stem purely from the loss function. Are there examples where cross validation is known to not necessarily choose a good tuning parameter asymptotically or in practice given a loss function?

  5. Posted July 10, 2012 at 10:38 am | Permalink

    Thanks for a nice post! One interesting question that it brings to mind is how to compare a plain training / validation-set split vs. cross-validation. One would hope that k-fold cross-validation would admit tighter generalization bounds than just using some fraction of the data as a validation set for picking the tuning parameter.

    Another interesting question is the common practice of selecting the tuning parameter using cross-validation, then re-training the selected estimator using the entire data set. Again, one would hope to be able to prove tighter bounds vs. using the version that’s trained on only (k-1)/k of the data.

    Both of the above questions seem like they might require additional assumptions on the estimation procedure in order to make progress. Do you know of any work in these directions?

  6. Posted February 8, 2013 at 5:24 pm | Permalink

    Dear Prof. Wasserman,

    Is it true that an adaptive density estimator does not suffer from the tuning-parameter problem? Like the adaptive wavelet estimators of Donoho and Johnstone? How does an adaptive estimator compare to a kernel estimator with a tuned bandwidth?

    Thank you!

    • Posted February 8, 2013 at 7:48 pm | Permalink

      If you have regression at equally spaced values with constant variance and normal error then, yes, wavelet estimators give you a precise rule for choosing the tuning parameter. But this situation is very special. For ordinary regression or density estimation, cross-validation is much more popular.