The Value of Adding Randomness

In computer science it is common to use randomized algorithms. The same is true in statistics: there are many ways that adding randomness can make things easier. But the way that randomness enters varies quite a bit from method to method. I thought it might be interesting to collect some specific examples of statistical procedures where added randomness plays a role. (I am not referring to the randomness inherent in the original data but, rather, to randomness introduced by the statistical method itself.)

(1) Randomization in causal inference. The mean difference {\alpha} between a treated group and an untreated group is not, in general, equal to the causal effect {\theta}. (Correlation is not causation.) Moreover, from observational data alone, {\theta} is not identifiable. But if we randomly assign people to the two groups then, magically, {\theta =\alpha}. This is easily proved using either the directed graph approach to causation or the counterfactual approach: see here for example. This fact is so elementary that we tend to forget how amazing it is. Of course, this is the reason we spend millions of dollars doing randomized trials.

(As an aside, some people say that there is no role for randomization in Bayesian inference. In other words, the randomization mechanism plays no role in Bayes’ theorem. But this is not really true. Without randomization, we can indeed derive a posterior for {\theta}, but it is highly sensitive to the prior. This is just a restatement of the non-identifiability of {\theta}. With randomization, the posterior is much less sensitive to the prior. And I think most practical Bayesians would consider it valuable to increase the robustness of the posterior.)
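
To see the magic concretely, here is a small simulation sketch; the linear outcome model, the effect size, and the confounder are all made up for illustration. With observational assignment the naive mean difference overstates {\theta}, while with random assignment it recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
theta = 2.0                       # made-up "true" causal effect

# A confounder U affects both who gets treated and the outcome.
U = rng.normal(size=n)

# Observational assignment: treatment uptake depends on U.
A_obs = (U + rng.normal(size=n) > 0).astype(float)
Y_obs = theta * A_obs + 3.0 * U + rng.normal(size=n)
alpha_obs = Y_obs[A_obs == 1].mean() - Y_obs[A_obs == 0].mean()

# Randomized assignment: treatment is independent of U by construction.
A_rct = rng.integers(0, 2, size=n).astype(float)
Y_rct = theta * A_rct + 3.0 * U + rng.normal(size=n)
alpha_rct = Y_rct[A_rct == 1].mean() - Y_rct[A_rct == 0].mean()

print(alpha_obs, alpha_rct)   # the first is badly biased; the second is close to theta
```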

(2) Permutation Tests. If {X_1,\ldots, X_n \sim P} and {Y_1,\ldots, Y_m \sim Q} and you want to test {H_0: P=Q} versus {H_1: P\neq Q}, you can get an exact, distribution-free test by using the permutation method. See here. In practice we rarely search over all permutations. Instead, we randomly select a large number of permutations. The result is still exact (i.e., the p-value is sub-uniform under {H_0}).
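
For concreteness, a minimal sketch of the randomized version, using the difference in means as the test statistic (any statistic would do):

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=9999, rng=None):
    """Two-sample permutation test with the difference in means as the test statistic."""
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    observed = x.mean() - y.mean()
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                 # one random permutation of the pooled data
        stat = pooled[:len(x)].mean() - pooled[len(x):].mean()
        if abs(stat) >= abs(observed):
            exceed += 1
    # The +1 counts the identity permutation and keeps the p-value sub-uniform under H0.
    return (exceed + 1) / (n_perm + 1)
```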

(3) The Bootstrap. I discussed the bootstrap here. Basically, to compute a confidence interval, we approximate the distribution

\displaystyle  L_n=\mathbb{P}( \sqrt{n}(\hat\theta - \theta) \leq t)

with the conditional distribution

\displaystyle  \hat L_n=\mathbb{P}( \sqrt{n}(\hat\theta^* - \hat\theta) \leq t\ | X_1,\ldots, X_n)

where {\hat\theta = g(X_1,\ldots, X_n)}, {\hat\theta^* = g(X_1^*,\ldots, X_n^*)} and {X_1^*,\ldots, X_n^*} is a sample from the empirical distribution. But the distribution {\hat L_n} is itself intractable, so we approximate it by Monte Carlo: repeatedly draw samples from the empirical distribution function and recompute {\hat\theta^*}. This makes otherwise intractable confidence intervals trivial to compute.
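
A rough Monte Carlo sketch of the resulting interval, in its basic (pivotal) form, with the mean as the default choice of {g}:

```python
import numpy as np

def bootstrap_ci(x, g=np.mean, B=5000, alpha=0.05, rng=None):
    """Basic (pivotal) bootstrap interval, approximating L_n by hat L_n via Monte Carlo."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta_hat = g(x)
    # Draw B samples of size n from the empirical distribution and recompute the statistic.
    boot = np.array([g(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    # Quantiles of (theta* - theta_hat) stand in for quantiles of (theta_hat - theta).
    lo, hi = np.quantile(boot - theta_hat, [alpha / 2, 1 - alpha / 2])
    return theta_hat - hi, theta_hat - lo
```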

(4) {k}-means++. Minimizing the objective function in {k}-means clustering is NP-hard. Remarkably, as I discussed here, the {k}-means++ algorithm uses a careful randomization method for choosing the starting values and, in expectation, gets within an {O(\log k)} factor of the minimum.
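
A bare-bones sketch of the seeding step (the Lloyd iterations that follow are the standard ones):

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """k-means++ seeding: each new center is sampled with probability proportional
    to its squared distance from the nearest center chosen so far."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]            # first center: uniform at random
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)
```

(Scikit-learn’s KMeans, for instance, uses this seeding by default via init="k-means++".)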

(5) Cross-Validation. Some forms of cross-validation involve splitting the data randomly into two or more groups. We use some of the groups for fitting and the rest for testing. Some people seem bothered by the randomness this introduces. But it makes risk estimation easy and accurate.
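
Something like the following sketch; `fit` and `predict` are hypothetical stand-ins for whatever estimator is being evaluated, and squared error is just one choice of loss:

```python
import numpy as np

def cv_risk(X, y, fit, predict, n_folds=5, rng=None):
    """Estimate prediction risk by randomly splitting the data into folds."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(y))         # the added randomness: shuffle before splitting
    errors = []
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        model = fit(X[train], y[train])
        errors.append(np.mean((y[test] - predict(model, X[test])) ** 2))
    return float(np.mean(errors))
```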

(6) MCMC. An obvious and common use of randomness is random sampling from a posterior distribution, usually by way of Markov Chain Monte Carlo. This can dramatically simplify Bayesian inference.
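
The simplest such sampler, random-walk Metropolis, can be sketched in a few lines for a single parameter; `log_post` is any function returning the log of an unnormalized posterior:

```python
import numpy as np

def metropolis(log_post, x0, n_samples=10_000, step=1.0, rng=None):
    """Random-walk Metropolis for a one-dimensional parameter."""
    rng = np.random.default_rng(rng)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + step * rng.normal()
        # Accept with probability min(1, post(proposal) / post(x)).
        if np.log(rng.uniform()) < log_post(proposal) - log_post(x):
            x = proposal
        samples.append(x)
    return np.array(samples)
```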

These are the first few things that came to my mind. Are there others I should add to the list?

20 Comments

  1. Posted June 9, 2013 at 11:51 am | Permalink

    In the linked post on causal inference, you mention a future post on Simpson’s paradox and the case in which p(y|Set X = x) \neq p(y|x). The huddled masses (or maybe it’s just me) eagerly await this post!

    • Posted June 9, 2013 at 11:52 am | Permalink

      Thanks for reminding me about that.
      I will post on it soon.

  2. Posted June 9, 2013 at 1:16 pm | Permalink

    Maybe ABC should be mentioned as a multi-level randomisation: the data y is replaced with a randomised version d(X,y)<\epsilon in the Bayes conditioning, and the Bayesian inference only remains exact if the data is indeed replaced with its (once) randomised version. This comes in addition to the Monte Carlo randomisation that underlies your examples (2), (3), (4), and (6). Even though I would have argued about the classification of those as randomisation methods, per se, had we had the luck to meet in Roma last week…!
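
    For readers who have not met ABC, the rejection version can be sketched as follows; `prior_sample`, `simulate`, and `distance` are placeholders to be supplied:

    ```python
    import numpy as np

    def abc_rejection(y_obs, prior_sample, simulate, distance, eps, n_draws=100_000, rng=None):
        """Rejection ABC: keep a prior draw theta only when simulated data X satisfy d(X, y_obs) < eps."""
        rng = np.random.default_rng(rng)
        accepted = []
        for _ in range(n_draws):
            theta = prior_sample(rng)
            X = simulate(theta, rng)
            if distance(X, y_obs) < eps:   # the randomised conditioning event d(X, y) < eps
                accepted.append(theta)
        return np.array(accepted)
    ```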

    • Posted June 9, 2013 at 1:23 pm | Permalink

      I knew I would regret missing our meeting in Rome!

      • Posted June 11, 2013 at 6:23 am | Permalink

        Btw, I loved your biosketch in amstat news: “In his spare time, he enjoys mountain climbing, parachuting, and big game hunting.”

      • Posted June 11, 2013 at 6:25 am | Permalink

        Yes I was able to sneak that by

  3. Posted June 9, 2013 at 1:18 pm | Permalink

    Randomness can also be an effective regularizer during training. Random Forest randomly decides to ignore a large subset of features at every split decision. Geoffrey Hinton’s dropout technique for regularization in training neural networks does something similar.
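
    As a quick illustration, in scikit-learn the per-split feature subsampling is controlled by max_features (synthetic data below, just for the sketch):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=50, random_state=0)
    clf = RandomForestClassifier(
        n_estimators=200,
        max_features="sqrt",   # each split considers only ~sqrt(50) randomly chosen features
        random_state=0,
    ).fit(X, y)
    ```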

  4. george
    Posted June 9, 2013 at 2:54 pm | Permalink

    Multiple Imputation.

  5. Nicolas Le Roux
    Posted June 10, 2013 at 2:55 am | Permalink

    Randomly shuffling the examples can speed up optimization in the finite sample case. Here are a couple of papers mentioning this (as a disclaimer, I’m an author of one of them):
    http://arxiv.org/abs/1209.1873
    http://arxiv.org/abs/1202.6258
    There was at least another one I knew of but I can’t seem to find it. I’m pretty sure it was by Léon Bottou.

  6. Subhash Chandra
    Posted June 10, 2013 at 4:25 am | Permalink

    This is an attention-catching, but misleading, title! An appropriate title would have been “The Value of Using Randomization”

  7. Posted June 10, 2013 at 4:28 am | Permalink

    There are a few further techniques that crossed my mind, including optimization algorithms such as SA and GA (I assume that if you count MCMC you should count them as well), and various re-sampling or simulation-based null models.

    Maybe it would be useful to classify these into 1) Monte Carlo algorithms, including MCMC, SA, GA, and heaps of other ML algorithms; 2) algorithms that re-sample or permute data, including bootstrapping and permutation null models; 3) stochastic simulations, including simulation-based null models, ABC, etc.? Not sure whether that covers everything …

  8. Posted June 10, 2013 at 4:55 pm | Permalink

    Excellent post. (I take it (1) would be qualified for “the” causal effect in the population randomly sampled.) But my main point, rather than adding examples, is to draw attention to the deep implications of these. Notably, they immediately scotch the oft repeated charge that sampling distributions and associated frequentist methods are valuable only in some long run: it is precisely the counterfactual reasoning that enables them to be relevant to the process, or data generating source, at hand. A causal question, say, entails counterfactual claims about “what it would be like were…” and the sampling distribution lets us infer “what it would be like were…” using the sample. Therein lies the key that opens the door to ampliative inference (whichever school takes advantage of them)—or so I have been claiming.

  9. Posted June 11, 2013 at 12:48 am | Permalink

    I recently came across the “Pushed” confidence intervals of Lorden (as described in a PhD thesis by T. V. Asparouhov). These are randomized confidence intervals on the Bernoulli parameter. You take the number of successes observed in the experiment, add a uniform random variate in [-0.5,+0.5] and then use that value to construct the confidence interval. Asparouhov claims these intervals are more efficient than non-randomized ones.

  10. Posted June 11, 2013 at 12:53 am | Permalink

    Speaking of randomized algorithms, a big theoretical question in computer science is whether randomization plays any ESSENTIAL role in allowing some things to be computed in polynomial time. There is a whole field of “de-randomization”, which seeks to convert randomized algorithms into deterministic ones. There are some formal results where randomness helps, but there are many open questions!

  11. Posted June 12, 2013 at 8:16 pm | Permalink

    A simple example: jittering.

  12. apdawid
    Posted June 17, 2013 at 4:17 pm | Permalink

    I’d appreciate some clarification of your assertion: “This [the effectiveness of randomisation for causal inference] is easily proved using either the directed graph approach to causation or the counterfactual approach”. However, approaches such as these can do no more than provide a system for inferring conclusions from assumptions, either explicit or implicit. Exactly what assumptions are required to infer this specific conclusion?

    • Posted June 17, 2013 at 4:20 pm | Permalink

      I’m not sure what you are asking, Phil.
      Randomization turns a non-identifiable parameter
      into an identifiable parameter.
      But I said this in the post, so perhaps I am not understanding your question.

  13. Z
    Posted June 18, 2013 at 1:19 pm | Permalink

    Random Forests!
    And Meinshausen’s stability subsampling method for model selection.

  14. Christian Hennig
    Posted June 19, 2013 at 7:53 am | Permalink

    In
    “C. Hennig and T. F. Liao, How to find an appropriate clustering for mixed type variables with application to socioeconomic stratification. Journal of the Royal Statistical Society, Series C (Applied Statistics) 62, 309-369 (2013), with discussion”
    we observed that maximising certain cluster validation indexes (average silhouette width, Calinski/Harabasz; the phenomenon may apply to others), as recommended in the literature, led to an estimate of the number of clusters for which the clustering is not significantly better than random data simulated from a null model for “no clustering but some other realistic structure”. However, other supposedly non-optimal numbers of clusters led to significantly better clusterings.
    Comparing the validation index values of clusterings to what happens under such null models is generally helpful for cluster validation.
    Generally, when one is interested in some kind of structure, it may be worthwhile to compare with a null model that incorporates other kinds of structure that one could legitimately expect in such data, instead of testing against an unrealistically simplistic null model such as iid normal.
    Obviously, though, adding randomness here is only required because I am (“we all are?”) too stupid to compute such things from sufficiently flexible null models theoretically. (But this applies to some other examples listed here, too.)

  15. Mike Meyer
    Posted July 8, 2013 at 5:22 pm | Permalink

    Randomized response — that old trick for getting people in aggregate to reveal their preferences for sensitive questions.