The Normalizing Constant Paradox

Recently, there was a discussion on Stack Exchange about an example in my book. The example is a paradox about estimating normalizing constants. The analysis of the problem in my book is wrong; more precisely, the analysis in my book is meant to show that just blindly applying Bayes’ rule does not always yield a correct posterior distribution.

This point was correctly noted by a commenter at Stack Exchange who uses the name “Zen.” I don’t know who Zen is, but he or she correctly identified the problem with the analysis in my book.

However, it is still an open question how to do the analysis properly. Another commenter, David Rohde, identified the usual proposal which I’ll review below. But as I’ll explain, I don’t think the usual answer is satisfactory.

The purpose of this post is to explain the paradox and then to ask: does anyone know how to solve the problem correctly?

The example, by the way, is due to my friend Ed George.

1. Problem Description

The posterior for a parameter {\theta} given data {Y_1,\ldots, Y_n} is

\displaystyle  p(\theta|Y_1,\ldots, Y_n) =\frac{L(\theta)\pi(\theta)}{c}

where {L(\theta)} is the likelihood function, {\pi(\theta)} is the prior and {c= \int L(\theta)\pi(\theta) d\theta} is the normalizing constant. Notice that the function {L(\theta)\pi(\theta)} is known.

In complicated models, especially where {\theta} is a high-dimensional vector, it is not possible to do the integral {c= \int L(\theta)\pi(\theta) d\theta}. Fortunately, we may not need to know the normalizing constant. However, there are occasions where we do need to know it. So how can we compute {c} when we can’t do the integral?

In many cases we can use simulation methods (such as MCMC) to draw a sample {\theta_1,\ldots, \theta_n} from the posterior. The question is: how can we use the sample {\theta_1,\ldots, \theta_n} from the posterior to estimate {c}?
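
To illustrate why sampling does not require {c}, here is a minimal sketch of a random-walk Metropolis sampler. It is only an illustration under stated assumptions: the target {g(\theta) = e^{-\theta^2/2}}, the step size, the seed and the function names are all chosen here for the example and are not from the book or the Stack Exchange thread. The acceptance step uses only the ratio of {g} at two points, so the constant cancels.

```python
import math
import random

def log_g(theta):
    # Assumed unnormalized target for illustration: g(theta) = exp(-theta^2 / 2).
    return -0.5 * theta * theta

def metropolis(log_g, n_draws, step=2.0, start=0.0, seed=0):
    """Random-walk Metropolis using only the unnormalized density g."""
    rng = random.Random(seed)
    theta = start
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        # log of g(proposal)/g(theta); the unknown constant c cancels in this ratio.
        log_ratio = log_g(proposal) - log_g(theta)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            theta = proposal
        draws.append(theta)
    return draws

draws = metropolis(log_g, n_draws=20000)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, var)
```

For this choice of {g} the target is a standard Normal, so the printed mean and variance should be roughly 0 and 1, even though the sampler never sees {c}.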

More generally, suppose that

\displaystyle  f(\theta) = \frac{g(\theta)}{c}

where {g(\theta)} is known but we cannot compute the integral {c = \int g(\theta) d \theta}. Given a sample {\theta_1,\ldots, \theta_n \sim f}, how do we estimate {c}?

2. Frequentist Estimator

We can use the sample to compute a density estimator {\hat f(\theta)} of {f(\theta)}. Note that {c = g(\theta)/f(\theta)} for all {\theta}. This suggests the estimator

\displaystyle  \hat c = \frac{g(\theta_0)}{\hat f(\theta_0)}

where {\theta_0} is an arbitrary value of {\theta}.
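
The estimator above can be sketched in a few lines. This is a rough illustration, not a recommended method: it assumes {g(\theta) = e^{-\theta^2/2}} (so the true constant is {\sqrt{2\pi}}), draws the sample directly from a standard Normal instead of using MCMC, takes {\theta_0 = 0}, and uses a Gaussian kernel density estimator with a rule-of-thumb bandwidth. The function names are hypothetical.

```python
import math
import random
import statistics

def g(theta):
    # Assumed unnormalized density for illustration; true c = sqrt(2*pi).
    return math.exp(-0.5 * theta * theta)

def kde_at(point, sample):
    """Gaussian kernel density estimate at one point (rule-of-thumb bandwidth)."""
    n = len(sample)
    h = 1.06 * statistics.stdev(sample) * n ** (-0.2)
    total = sum(math.exp(-0.5 * ((point - x) / h) ** 2) for x in sample)
    return total / (n * h * math.sqrt(2.0 * math.pi))

def estimate_c(sample, theta0):
    """c_hat = g(theta0) / f_hat(theta0)."""
    return g(theta0) / kde_at(theta0, sample)

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # stands in for draws from f
c_hat = estimate_c(sample, theta0=0.0)
c_true = math.sqrt(2.0 * math.pi)
print(c_hat, c_true)
```

The estimate inherits the bias and variance of the density estimator at {\theta_0}, so it is usually better to pick {\theta_0} where the sample is dense than far out in the tails.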

This is only one possible estimator. In fact, there is much research on the problem of finding good estimators of {c} from the sample. As far as I know, all of them are frequentist.

As David Rohde notes on Stack Exchange, there is a certain irony to the fact that Bayesians use frequentist methods to estimate the normalizing constant of their posterior distributions.

3. A Bogus Bayesian Analysis

Let’s restate the problem. We have a sample {\theta_1,\ldots, \theta_n} from {f(\theta)=g(\theta)/c}. The function {g(\theta)} is known but we don’t know the constant {c = \int g(\theta) d\theta} and it is not feasible to do the integral.

In my book, I consider the following Bayesian analysis. The analysis is wrong, as I’ll explain in a minute.

We have an unknown quantity {c} and some data {\theta_1,\ldots, \theta_n}. We should be able to do Bayesian inference for {c}. We start by placing a prior {h(c)} on {c}. The posterior is obtained by multiplying the prior and the likelihood:

\displaystyle  h(c|\theta_1,\ldots, \theta_n) = h(c) \prod_{i=1}^n \frac{g(\theta_i)}{c} \propto h(c) c^{-n}

where we dropped the terms {g(\theta_i)} since they are known.

The “posterior” {h(c|\theta_1,\ldots, \theta_n) \propto h(c) c^{-n}} is useless. It does not depend on the data. And it may not even be integrable.
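
A small numerical check makes the first complaint concrete. The sketch below assumes {g(\theta) = e^{-\theta^2/2}} and a hypothetical exponential prior {h(c) = e^{-c}}, both chosen here only for illustration. It normalizes {h(c)\prod_i g(\theta_i)/c} over a grid of {c} values for two different samples of the same size and compares the results.

```python
import math
import random

def log_g(theta):
    # Assumed unnormalized density for illustration: g(theta) = exp(-theta^2 / 2).
    return -0.5 * theta * theta

def log_prior(c):
    # Hypothetical prior h(c): exponential with rate 1 on c > 0.
    return -c

def bogus_posterior(sample, grid):
    """Normalize h(c) * prod_i g(theta_i) / c over a grid of candidate c values."""
    n = len(sample)
    data_term = sum(log_g(t) for t in sample)  # does not involve c
    log_post = [log_prior(c) + data_term - n * math.log(c) for c in grid]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

grid = [0.5 + 0.1 * k for k in range(96)]  # c from 0.5 to 10.0
rng = random.Random(1)
sample_a = [rng.gauss(0.0, 1.0) for _ in range(20)]
sample_b = [rng.gauss(0.0, 1.0) for _ in range(20)]

post_a = bogus_posterior(sample_a, grid)
post_b = bogus_posterior(sample_b, grid)
max_gap = max(abs(a - b) for a, b in zip(post_a, post_b))
print(max_gap)
```

The two normalized curves agree up to floating-point error because the data enter only through a factor that is constant in {c}. Both put the most mass on the smallest {c} in the grid, wherever the grid happens to start.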

The point of the example was to show that blindly applying Bayes’ rule is not always wise. As I mentioned earlier, Zen correctly notes that my application of Bayes’ rule is not valid. The reason is that I acted as if we had a family of densities {f(\theta|c)} indexed by {c}. But we don’t: {f(\theta)=g(\theta)/c} is a valid density only for one value of {c}, namely, {c = \int g(\theta)d\theta}. (To get a valid posterior from Bayes’ rule, we need a family {f(x|\psi)} which is a valid distribution for {x}, for each value of {\psi}.)
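
To spell out the step (the symbol {c^\ast} is introduced here only for illustration and denotes the true constant {\int g(\theta) d\theta}): for any candidate value {c > 0},

\displaystyle  \int \frac{g(\theta)}{c}\, d\theta = \frac{c^\ast}{c}

which equals 1 only when {c = c^\ast}. For every other value of {c}, the function {g(\theta)/c} does not integrate to 1, so the product {\prod_i g(\theta_i)/c} is not a likelihood for {c}.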

4. A Correct Bayesian Analysis?

The usual Bayesian approach that I have seen is to pretend that the function {g} is unknown. Then we place a prior on {g} (such as a Gaussian process prior) and proceed with a Bayesian analysis. However, this seems unsatisfactory. It seems to me that we should be able to get a valid Bayesian estimator for {c} without pretending not to know {g}.

Christian Robert discussed the problem on his blog. If I understand what Christian has written, he claims that this cannot be considered a statistical problem and that we can’t even put a prior on {c} because it is a constant. I don’t find this point of view convincing. Isn’t the whole point of Bayesian inference that we can put distributions on fixed but unknown constants? Christian says that this is a numerical problem not a statistical problem. But we have data sampled from a distribution. To me, that makes it a statistical problem.

5. The Answer Is …

So what is a valid Bayes estimator of {c}? Pretending I don’t know {g} or simply declaring it to be a non-statistical problem seems like giving up.

I want to emphasize that this is not meant in any way as a critique of Bayes. I really think there should be a good Bayesian estimator here but I don’t know what it is.

Anyone have any good ideas?