Friday, 6 February 2009

Data Analysis and Model Selection

As a sort of aside from my main PhD work, I thought I'd better write up my notes from an interesting seminar I attended yesterday on Bayes' theorem, which managed to clear my head about a few things as well as launching me into 1000 ideas for research plans. The lecture was given by Prof. Andrew Liddle.

The purpose of data analysis is to work out whether the values of parameters are compatible with the data. By parameters, I mean quantities whose values are suggested by theory or by past experiment.

During model selection, one asks instead which model (which set of parameters) is indicated by the data.

Combining these two analyses creates a feedback loop that simultaneously tests how well a theory fits a set of observational data AND how well the observational data obey that theory. To do this effectively we need to demand certain criteria of both data and models:

Useful models:
  • Fit the present data acceptably.
  • Have the ability to make predictions for future data (this is computationally expensive).
Useful data:
  • Will be predictable by the models we aim to test
  • Have modellable intrinsic randomness (e.g. cosmic variance, sample variance) & experimental error (instrumental accuracy, noise)
Bayesian Inference

Bayesian inference is a system of logical deduction which assigns probabilities to all quantities of interest. The marked difference between this and regular (frequentist) statistics is that you now take into account how reliable or unreliable a quantity is. Frequentist statistics only really allows for an "absolute" truth and is not as well suited to making predictions.

P(A|B) = P(B|A) P(A) / P(B)

where A and B can be anything. So let's make P(A) the probability that you are wearing a hat and P(B) the probability you're going to a casino. The above equation then reads: the probability that you are wearing a hat, given that you are going to a casino, is equal to the probability you are going to a casino given that you're wearing a hat, multiplied by the probability you are wearing a hat, divided by the probability you are going to a casino. The key thing here: if you're unlikely to be wearing a hat in general, you're probably hatless even at the casino, UNLESS it's more than likely you wear a hat whenever you go to a casino! Random bits aside, Bayes' theorem is a clever doodle for testing whether something is related to something else. It may even be a good way of finding that something isn't related to something else! Back on topic though: for data analysis the concept stays the same but the variables change thus:

P(θ|D) = P(D|θ) P(θ) / P(D)


Prior P(θ) has no data dependence - more on this follows.
Likelihood P(D|θ) involves making a new calculation to evaluate the data against the model.
Posterior P(θ|D) is then the probability of the parameters in light of the data: how well the model reflects the data.
P(D) is commonly ignored: it is simply a number that normalizes the probabilities to unity.
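
To make the hat/casino doodle concrete, here's a quick numerical check (a toy sketch in Python; every probability below is invented purely for illustration):

    # Toy numerical check of Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B).
    # All numbers are made up purely for illustration.
    p_hat = 0.05               # P(A): probability you're wearing a hat
    p_casino = 0.02            # P(B): probability you're going to a casino
    p_casino_given_hat = 0.30  # P(B|A): probability of a casino trip, given the hat

    # P(A|B): probability you're wearing a hat, given you're casino-bound
    p_hat_given_casino = p_casino_given_hat * p_hat / p_casino
    print(f"P(hat | casino) = {p_hat_given_casino:.2f}")  # -> 0.75

An unlikely hat (5%) becomes very likely (75%) once you know about the casino trip, exactly the UNLESS clause above.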

Now, priors. They can be a source of disagreement in a calculation, since how much you trust parameters (from previous results or theory) depends on how you feel about them. The upshot is that you have to lay your priors down on paper, so that every assumption you make is exposed and clear in the calculations that follow. This is where a little bravery, and physical intuition, comes into play. The trick is: don't seek a single "right" prior. Rather, test the robustness of your conclusions under a reasonable variation of the priors. Bayesian statistics ensures that, eventually, good data will overturn an incorrect choice of prior. Bravery aside, hardcore Bayesianists (is that a word?!) simply comment: "if you don't know enough to set a prior, why did you bother getting the data?"
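
As a minimal sketch of that robustness test (a made-up Gaussian likelihood, nothing from the lecture itself): combine the same likelihood with two different priors on a parameter θ and see how far the answer moves.

    import numpy as np

    # Prior-robustness sketch: one made-up Gaussian likelihood (the data
    # prefer theta ~ 1) combined with two different priors on theta.
    theta = np.linspace(-5.0, 5.0, 1001)
    likelihood = np.exp(-0.5 * ((theta - 1.0) / 0.5) ** 2)

    flat_prior = np.ones_like(theta)                 # know-nothing prior
    tight_prior = np.exp(-0.5 * (theta / 0.3) ** 2)  # strong prior belief: theta ~ 0

    for name, prior in [("flat", flat_prior), ("tight", tight_prior)]:
        posterior = likelihood * prior
        posterior /= posterior.sum()                 # normalise (the P(D) step)
        mean = (theta * posterior).sum()
        print(f"{name:5s} prior -> posterior mean = {mean:.2f}")

With the flat prior the posterior mean sits at ~1.0; the tight prior drags it to ~0.26. In this toy case the conclusion is not robust, so you would flag the prior dependence (or wait for better data to overwhelm it).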

Bayesian Parameter Estimation

So with the above knowledge in tow, how do you use the thing?

1. Choose a model (a choice of a set of parameters to be varied to fit a dataset):
  • Set parameters to be varied
  • Set prior ranges for these parameters
2. Choose datasets
  • Having the most powerful/complete/accurate dataset is all well and good, but adding further independent data will only improve your result.
3. Compute a likelihood function... that is to say, the probability of obtaining the data given a particular set of parameter values. The reason it's a function rather than a number is that we are now playing with a whole bunch of parameters (the model) and not just one.

4. Obtain the posterior parameter distribution... basically plug in and play!
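
As a minimal end-to-end sketch of those four steps (a toy problem of my own invention, not from the lecture): fit the mean of some fake Gaussian data, with a flat prior, on a grid.

    import numpy as np

    # Steps 1-4 for a toy model: one parameter theta, the mean of
    # Gaussian data with known noise (sigma = 1). All numbers invented.
    rng = np.random.default_rng(42)
    data = rng.normal(loc=2.0, scale=1.0, size=50)  # step 2: choose a dataset

    theta_grid = np.linspace(0.0, 4.0, 401)         # step 1: parameter + prior range
    log_prior = np.zeros_like(theta_grid)           # flat prior over that range

    def log_likelihood(theta):                      # step 3: the likelihood function
        return -0.5 * np.sum((data - theta) ** 2)

    log_post = np.array([log_likelihood(t) for t in theta_grid]) + log_prior
    posterior = np.exp(log_post - log_post.max())   # step 4: plug in and play
    posterior /= posterior.sum()

    # Should peak close to the true value of 2.0
    print(f"posterior peak at theta = {theta_grid[np.argmax(posterior)]:.2f}")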

In this way, a huge dataset can be summarized by a handful of parameters. Taking a cosmological example: from the WMAP satellite we receive a series of time-ordered images covering the whole sky, which are composited into a map of ~10 million pixels. Make an assumption about this map (such as Gaussianity of the cosmic microwave background [CMB], which gives the likelihood of obtaining this particular map) and we can obtain the power spectrum of the CMB, which reduces those 10 million pixels to ~1000 numbers: the measured power at each angular scale. Then take a model, compute the likelihood of obtaining the observed power spectrum, and we end up with a handful of parameters describing, say, Ωb, ΩDM and the Hubble parameter.

Briefly...

I only want to touch on Markov Chain Monte Carlo (MCMC) simulations here. Exploring the probabilities of multiple parameters (probability-space) is computationally expensive. Take, say, 10 values of a single parameter: that costs 10 likelihood evaluations. Two parameters cost 100 evaluations. Eight parameters cost 100,000,000 evaluations, and each evaluation may itself be expensive, so a brute-force grid quickly outgrows even a supercomputer. And so MCMC comes into play.

Monte Carlo -> any calculation that involves an element of randomness, e.g. making a hop of a random size (normally drawn from a Gaussian).
Markov chain -> where the next hop depends only on the current position; in the simplest scheme, a hop is only accepted if it lands at a position of greater likelihood in probability-space.

So an MCMC run will gradually converge on the point in probability-space where the most likely combination of parameter values (i.e. model) is found. To prevent false (local) maxima from being found, you can launch several Markov chains: most will end up at the true maximum.

Computationally, MCMC scales roughly linearly with additional parameters, as opposed to the multiplicative grid effect. Worth a mention is the Metropolis-Hastings algorithm, a Markov-chain scheme which sometimes allows steps to positions of lower likelihood, allowing exploration of the shape of the maximum.
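
Here's a minimal Metropolis-Hastings sketch (my own toy code, reusing the Gaussian-mean problem from above, not the lecturer's): hops are Gaussian, uphill hops are always accepted, and downhill hops are sometimes accepted, which is what lets the chain map out the shape of the peak.

    import numpy as np

    # Minimal Metropolis-Hastings sampler for the toy Gaussian-mean problem.
    rng = np.random.default_rng(1)
    data = rng.normal(loc=2.0, scale=1.0, size=50)

    def log_likelihood(theta):
        return -0.5 * np.sum((data - theta) ** 2)

    theta = 0.0                                  # arbitrary starting point
    chain = []
    for _ in range(10000):
        proposal = theta + rng.normal(scale=0.2)  # Monte Carlo: a random hop
        # Markov: acceptance depends only on the current position. Downhill
        # hops are accepted with probability L(proposal) / L(current).
        if np.log(rng.uniform()) < log_likelihood(proposal) - log_likelihood(theta):
            theta = proposal
        chain.append(theta)

    samples = np.array(chain[2000:])             # discard the burn-in hops
    print(f"theta = {samples.mean():.2f} +/- {samples.std():.2f}")

Note the cost here is one likelihood evaluation per hop, regardless of how many parameters the model has: that is the linear scaling mentioned above.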

Publication bias

Something else I got from this seminar was the mention of publication bias, which is almost common sense, but if you can characterize it quantitatively - excellent! The simple statement is that things that are found in data are more likely to be published than things that are not found. For example, if a paper finds (cosmology hit again) non-gaussianity in the CMB, it is more likely to be published than one that doesn't find non-gaussianity (judging by non-gaussianity alone). A rule of thumb is that if a (new) effect is found at only ~95% (2σ) confidence, don't believe it. To prove something new you need a certainty of around 5σ (something like 99.99994%).
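
Those rule-of-thumb numbers are easy to check (assuming Gaussian statistics and a two-sided confidence interval):

    from scipy.stats import norm

    # Convert an n-sigma detection into a (two-sided) confidence level.
    for n_sigma in (2, 3, 5):
        confidence = 1.0 - 2.0 * norm.sf(n_sigma)  # sf(x) = 1 - cdf(x)
        print(f"{n_sigma} sigma -> {confidence:.6%}")
    # 2 sigma -> 95.449974%, 3 sigma -> 99.730020%, 5 sigma -> 99.999943%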
