Saturday, 28 February 2009

Trouble with the neighbours

With the DES cluster workshop meeting fast approaching, I have uncovered a wealth of hiccups in processing the simulated catalogues with C4. Program-wise, almost everything works to a degree, then fails spectacularly. I've managed to process the lowest-redshift galaxy catalogue up to the point of the nearest-neighbour fitting. The code for this was in fact written by Alexander Gray for Chris Miller (who wrote the original C4 algorithm). Alex briefly describes the algorithm on his webpage, explaining:




The Proximity Project itself is based on taking the best clustering algorithms currently known and comparing them to clustering algorithms from the last three decades. However, whilst they say the analysis is done (2004), 'the tedious write up is not'. Now, the algorithm I'm using seems to be fairly well tuned to what I'm doing with C4. I'm sure Chris realised this already, which is why he supplied me with the source code. However, it will be a long, long time before I can figure out what's going on with this algorithm without a deeper understanding of it. Compiling the source code straight out brings up a few errors, so I need to work on those first, which necessitates some investigation.

The algorithm problems don't stop there, though. After a successful run of the first redshift catalogue, every other catalogue disastrously dumps most of its galaxies once it has gone through around 25% of them. Having run other catalogues at higher redshifts, with both more and fewer galaxies, it can't be a problem with the formats of the catalogues, and it is definitely not a problem with handling an excess number of galaxies. Something is going on that I can't figure out: it isn't in the source data (I've checked that by hand), it isn't in the pre-processed C4 galaxy information (the same algorithm is used for the first catalogue, which did run), and it isn't in the way the galaxy information is read in.

What the problem looks like is this: each catalogue loops through for some indefinite period, happily categorizing galaxies and assigning them to the appropriate C4 gridpoint, until it gives up and dumps all the remaining galaxies into one bin before carrying on as normal. The problem is that that particular bin is then used to set up another grid of pointers, with an extra dimension to correlate to each particular galaxy. 'The grid' has to be cuboidal in nature, so one really heavy bin suddenly introduces a HUGE dimension on an already large data set. This is the source of the segmentation fault. What I can't figure out is why it suddenly decides to dump all of the galaxies into one bin.
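To put rough numbers on why one heavy bin is fatal, here's a back-of-envelope sketch; the grid sizes and bin counts below are invented for illustration, not taken from the C4 source:

```c
#include <stdio.h>

/* Back-of-envelope for the cuboidal pointer grid: its extra
 * dimension is set by the heaviest bin, so one pathological bin
 * inflates the whole allocation. All numbers here are invented. */
int main(void)
{
    const long long nx = 512, ny = 512;     /* gridpoints on the sky          */
    const long long typical_bin = 50;       /* galaxies per bin, healthy run  */
    const long long heavy_bin = 400000;     /* one bin that ate the catalogue */
    const long long ptr = sizeof(void *);   /* bytes per pointer              */

    printf("healthy grid: %lld MB\n", (nx * ny * typical_bin * ptr) >> 20);
    printf("broken grid:  %lld MB\n", (nx * ny * heavy_bin * ptr) >> 20);
    return 0;
}
```

With those made-up numbers, the healthy grid costs ~100 MB but the broken one asks for hundreds of gigabytes, so the allocation falls over long before the analysis does anything useful.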

Actually, the 'why' is partially solved. Having printed the input to screen at various intervals, it's possible to spot that at some point C4 decides to read the same line in, over and over and over. And because this is the same galaxy, it slots into one gridpoint again and again and again, leading to one monumentally large bin and thereby causing the pointer array to fail. So what's making it stop reading in new values? Or, more to the point, why, after a seemingly random point in the file, does C4 stop reading in new values and simply reuse the same value as before? I aim to answer this particular question before the workshop, so I am at least one roadblock behind rather than two (*touch wood*).
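I haven't pinned down the offending line in C4 yet, but one classic C pattern that produces exactly this symptom is looping on feof() instead of on the fscanf() return value: once a record fails to parse, the stream never advances, end-of-file is never reached, and the variables keep their last good values. A minimal sketch of the failure mode and a fix (hypothetical variable and file names, not the actual C4 source):

```c
#include <stdio.h>

/* The failure mode: if a record fails to parse, fscanf() returns
 * without advancing the stream, feof() stays false, and ra/dec/z
 * keep their old values -- the same galaxy gets binned forever. */
void read_loop_buggy(FILE *fp)
{
    double ra, dec, z;
    while (!feof(fp)) {                 /* BUG: feof() is the wrong test */
        fscanf(fp, "%lf %lf %lf", &ra, &dec, &z);
        /* bin_galaxy(ra, dec, z);  -- same values repeat after a bad record */
    }
}

/* The fix: loop on the fscanf() return value and skip bad records. */
void read_loop_fixed(FILE *fp)
{
    double ra, dec, z;
    int n, c;
    while ((n = fscanf(fp, "%lf %lf %lf", &ra, &dec, &z)) != EOF) {
        if (n != 3) {                   /* malformed record: skip to next line */
            while ((c = fgetc(fp)) != '\n' && c != EOF)
                ;
            if (c == EOF)
                break;
            continue;
        }
        /* bin_galaxy(ra, dec, z); */
    }
}

int main(void)
{
    FILE *fp = fopen("galaxies.cat", "r");  /* hypothetical catalogue file */
    if (fp) {
        read_loop_fixed(fp);
        fclose(fp);
    }
    return 0;
}
```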

Thursday, 19 February 2009

Weeds

So I'm back working on the C4 algorithm... it's now getting downright icky, as I'm finding variables with helpful names such as 'a', 'b', 'c', etc., so tracing these back to the inputs has been taking a lot longer than I originally envisaged. Even better, some of the helpfully named ones like 'newRA' get declared and then lead absolutely nowhere. Duplicate variables abound, some active and some dormant. I will be trying to make at least two runs of C4 after finishing going through the program fully. I'm going to leave tidying up the code until after these runs; it'd be useful to see what parameters C4 can live without... slim it down, then bulk it up. Anyway, the outputs are still an issue, so to clarify, I aim to:
  1. Produce what I think is the best cluster catalogue, in the sense that higher-ranked clusters correspond to more massive halos, with the highest possible rates of completeness and purity. I shan't be concerned with precise coordinates of cluster centres... I'll use the halo lookup table.
  2. The catalogue comparisons will run a uniform algorithm to assign galaxies to clusters, given a list of cluster positions. However, I believe all of the algorithms we discussed at the OSU meeting tracked individual galaxies, and so will have different membership-assignment algorithms. This effectively means that we will have two slightly different versions of most cluster catalogues in the comparison project.

In addition, assignment of the cluster rank can be arbitrary. The comparison project will assign ranks according to the richnesses obtained with the coordinators' membership-assignment algorithm, but an individual cluster catalogue might be more accurately ranked by some other cluster property.

Monday, 9 February 2009

Housekeeping...

I need to keep track of some of my DES luminous red galaxy (LRG) programs, so I'm dumping a bunch of filenames here for later reference:

idl/DES/
CatSimLRG.pro
Main program that does everything to everything. Outputs are: colour plot comparisons, galaxy-redshift histograms and comoving number densities, organized by model, photometry and version (of the DES catalogue simulation). An I-band cut option is included for the Cannon plot to investigate the cannonhole. Outputs to ~/Documents/PhD/LRGplots/new/

quickLRGvar.pro
Program that looks at the individual colour-space distribution for a selected simulation and model and compares it to a selected permutation of the SDSS sample (photometric, spectroscopic and DES-modified photometric). Nominally done in CatSimLRG, but the quickLRGvar output is also organized by redshift slice. Useful for varying parameters beyond those in the above program... includes a g-band cut option as well, to explore faint photometry. Outputs to ~/Documents/PhD/LRGplots/Cannonz/

plotDEScomp.pro
Plots comoving number densities for a given model distribution that has already been processed by CatSimLRG. Serves to correct plot outputs and also gives the option of excluding the DES-modified SDSS cut (in a really stupid way), since I think that serves more as a pseudo-cut than any realistic expectation from data, simulated or otherwise. Outputs to the same folder as CatSimLRG, superseding the plots contained therein. Plots excluding the DES-modified cut have the suffix 'woDES'.

All outputs are in both PostScript and JPEG formats. JPEGs are web-ready (mogrify'd) and PostScripts are paper-quality.

I'm actually surprised it's only these three programs I use; I have about 30 in that directory. There is the Cannonhole.pro program to separate out the cannonhole objects from the rest of the Cannon sample... but that's an artifact I presented last November, so it's mostly just a relic.

Friday, 6 February 2009

Data Analysis and Model Selection

As a sort of aside from the main PhD work, I thought I'd better write up my notes from an interesting seminar I attended yesterday on Bayes' theorem, which managed to clear my head about a few things as well as launching me into 1000 ideas for research plans. The lecture was given by Prof. Andrew Liddle.

The purpose of data analysis is to work out whether the values of parameters are compatible with data. By parameters, I mean some quantity (or quantities) that has been predicted by theory or measured by past experiment.

During model selection, one asks which set of parameters (i.e. which model) is indicated by the data.

In combining these two analyses, one can create a feedback loop that simultaneously tests how well a theory fits a set of observational data AND how well the observational data obey a theory. To do this effectively we need to demand certain criteria of both data and models:

Useful models:
  • Fit the present data acceptably.
  • Have the ability to make predictions for future data (this is computationally expensive).
Useful data:
  • Will be predictable by the models we aim to test.
  • Have modellable intrinsic randomness (e.g. cosmic variance, sample variance) & experimental error (instrumental accuracy, noise)
Bayesian Inference

Bayesian Inference is a system of logical deduction which assigns probabilities to all quantities of interest. The marked difference between this and regular (frequentist) statistics is that you now take into account how reliable or unreliable a quantity is. Frequentist statistics only really allows for an "absolute" truth and is not as well suited to making predictions. Bayes' theorem states:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$
where A and B can be anything. So let's make P(A) the probability that you are wearing a hat, versus P(B), the probability you're going to a casino. The above equation then reads as: the probability that you are wearing a hat, given that you are going to a casino, is equal to the probability you are going to a casino given that you're wearing a hat, multiplied by the probability you are wearing a hat, divided by the probability you are going to a casino. The key thing here is: if you're unlikely to be wearing a hat, you're unlikely to be going to a casino, UNLESS it's more than likely you wear a hat whenever you go to a casino! Random bits aside, Bayes' theorem is a clever doodle to test whether something is related to something else. It may even be a good way to find that something isn't related to something else!
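To make that concrete with some invented numbers (mine, not the lecturer's): suppose one person in ten wears a hat, P(A) = 0.1; one in five is heading to a casino, P(B) = 0.2; and half of all hat-wearers are heading to a casino, P(B|A) = 0.5. Then

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} = \frac{0.5 \times 0.1}{0.2} = 0.25,$$

so a quarter of casino-goers are in hats, even though only a tenth of people overall are.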


Back on topic, though: for data analysis the concept stays but the variables change, thus:

$$P(\theta|D) = \frac{P(D|\theta)\,P(\theta)}{P(D)}$$

where θ stands for the model parameters and D for the data.
  • The prior P(θ) has no data dependence; more on this follows.
  • The likelihood P(D|θ) involves making a new calculation to evaluate the data against the model.
  • The posterior P(θ|D) is then the ability of the model to reflect the data.
  • P(D) is commonly ignored: it is simply a number that normalizes the probabilities to unity.

Now, priors. They can be a source of disagreement in a calculation, since how much you trust parameters (from previous results or theory) depends on how you feel about them. The upshot is that you have to lay your priors down on paper, so that any assumptions behind your calculations are made explicit. This is also where a little bravery, and physical intuition, come into play: you have to commit to your assumptions and expose them. The trick is not to seek a single "right" prior; rather, test the robustness of your conclusions under a reasonable variation of the priors. Bayesian statistics ensures that, eventually, good data will overturn an incorrect choice of prior. Bravery aside, hardcore Bayesianists (is that a word?!) simply comment: "if you don't know enough to set a prior, why did you bother getting the data?"

Bayesian Parameter Estimation

So with the above knowledge in tow, how do you use the thing?

1. Choose a model (a choice of a set of parameters to be varied to fit a dataset):
  • Set parameters to be varied
  • Set prior ranges for these parameters
2. Choose datasets
  • Having the most powerful/complete/accurate dataset is all well and good, but adding more data will only improve your result.
3. Compute a likelihood function... that is to say, the probability of obtaining the data given a particular choice of the parameters. The reason it's a function rather than a number is that we are now playing with a whole set of parameters (the model) and not just one.

4. Obtain the posterior parameter distribution... basically plug in and play!
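As a toy version of those four steps, here is a sketch with made-up data and a one-parameter "model" (the mean of Gaussian measurements with known scatter), just to show where the prior, likelihood and posterior each enter:

```c
#include <stdio.h>
#include <math.h>

/* Toy Bayesian parameter estimation: infer the mean mu of Gaussian
 * data with known sigma. Data values and prior range are invented. */
int main(void)
{
    /* 2. choose the dataset */
    const double data[] = {4.8, 5.3, 4.9, 5.6, 5.1};
    const int ndata = sizeof(data) / sizeof(data[0]);
    const double sigma = 0.5;

    /* 1. choose the model and its prior range: flat prior, mu in [0, 10] */
    const int ngrid = 1000;
    const double mu_min = 0.0, mu_max = 10.0;

    double post[1000];
    double norm = 0.0;

    for (int i = 0; i < ngrid; i++) {
        double mu = mu_min + (mu_max - mu_min) * i / (ngrid - 1);

        /* 3. the likelihood P(D|mu): a product of Gaussians, done in logs */
        double lnlike = 0.0;
        for (int j = 0; j < ndata; j++) {
            double r = (data[j] - mu) / sigma;
            lnlike += -0.5 * r * r;
        }

        /* 4. posterior = prior x likelihood (flat prior: just exponentiate) */
        post[i] = exp(lnlike);
        norm += post[i];
    }

    /* summarize the posterior by its mean */
    double mean = 0.0;
    for (int i = 0; i < ngrid; i++) {
        double mu = mu_min + (mu_max - mu_min) * i / (ngrid - 1);
        mean += mu * post[i] / norm;
    }
    printf("posterior mean of mu = %.3f\n", mean);
    return 0;
}
```

With a flat prior the posterior mean lands on the sample mean of the data, which is the "plug in and play" step in miniature.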

In this way, a huge dataset can be summarized in a handful of parameters. Taking a cosmological example: from the WMAP satellite we receive a series of time-ordered images covering the whole sky; these are then composited into a map of ~10 million pixels. Make an assumption about this map (such as the likelihood of getting this map given Gaussianity of the cosmic microwave background [CMB]) and we can obtain the power spectrum of the CMB, which reduces those 10 million pixels to ~1000 parameters, one for each measurement of the angular power. Then compare a model to the likelihood of obtaining the observed power spectrum and we get a handful of parameters describing, say, Ωb, ΩDM and the Hubble parameter.

Briefly...

I only want to touch on Markov Chain Monte Carlo (MCMC) methods here. Exploring the probabilities of multiple parameters (probability-space) is computationally expensive: if you take, say, 10 values of a single parameter, that will cost 10 likelihood evaluations; two parameters will cost 100; eight parameters will cost 100,000,000 evaluations. Supercomputers typically push through a million evaluations... beyond that is pushing it. And so MCMC comes into play.

Monte Carlo -> any calculation that involves an element of randomness, e.g. making a hop of a random size (normally drawn from a Gaussian).
Markov chain -> where the next hop depends only on where you currently are; in the simplest scheme, a hop is only taken to a position of greater likelihood in probability-space.

So an MCMC run will gradually converge on the point in probability-space where the most likely combination of parameter values (i.e. model) is found. To prevent the chain settling on a false (local) maximum, you can launch several Markov chains: most will end up at the true maximum.

Computationally, MCMC scales linearly with additional parameters, as opposed to the multiplicative grid effect. Worth a mention is the Metropolis-Hastings algorithm, which resembles a plain Markov chain but sometimes allows steps to positions of lower likelihood... allowing exploration of the shape of the maximum.
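A stripped-down sketch of a Metropolis-Hastings chain on a made-up one-parameter likelihood (a Gaussian centred on 5), just to show the hop-and-accept logic; in a real run lnlike() would evaluate the model against the data, and all the names here are mine:

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Toy target: a Gaussian log-likelihood centred on 5 with width 0.5. */
static double lnlike(double x)
{
    double r = (x - 5.0) / 0.5;
    return -0.5 * r * r;
}

/* crude uniform deviate in (0,1) and a Box-Muller Gaussian hop */
static double uniform01(void) { return (rand() + 1.0) / (RAND_MAX + 2.0); }
static double gauss_hop(double step)
{
    double u = uniform01(), v = uniform01();
    return step * sqrt(-2.0 * log(u)) * cos(6.283185307179586 * v);
}

int main(void)
{
    double x = 0.0;                       /* arbitrary starting point */
    double lx = lnlike(x);
    double sum = 0.0;
    const int nsteps = 100000, burn = 1000;

    for (int i = 0; i < nsteps; i++) {
        double y = x + gauss_hop(0.3);    /* the Monte Carlo random hop */
        double ly = lnlike(y);
        /* Metropolis-Hastings rule: always accept uphill moves, and
         * accept downhill ones with probability exp(ly - lx) -- this
         * is what lets the chain explore the shape of the maximum. */
        if (ly > lx || uniform01() < exp(ly - lx)) {
            x = y;
            lx = ly;
        }
        if (i >= burn) sum += x;          /* average after burn-in */
    }
    printf("chain mean = %.3f (target 5.0)\n", sum / (nsteps - burn));
    return 0;
}
```

The cost of adding a ninth parameter here is just one more coordinate in the hop, rather than another factor of 10 in grid points.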

Publication bias

Something else I got from this seminar was the mention of publication bias, which is almost common sense, but if you can characterize it by a value, excellent! The simple statement is that things that are found in data are more likely to be published than things that are not found. For example (cosmology hit again), if a paper finds non-Gaussianity in the CMB it is more likely to be published than anything that doesn't find non-Gaussianity (judging by non-Gaussianity alone). A rule of thumb is that if a (new) effect is found at ~95% (2σ) confidence, don't believe it. To prove something new you need a certainty of around 5σ (something like 99.99994%).
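For reference, the two-sided conversion from an n-sigma detection to a confidence level is just erf(n/√2); a few lines of C tabulate why 2σ is "don't believe it" territory while 5σ is discovery grade:

```c
#include <stdio.h>
#include <math.h>

/* Two-sided confidence level for an n-sigma Gaussian detection:
 * confidence = erf(n / sqrt(2)). */
int main(void)
{
    for (int n = 1; n <= 5; n++)
        printf("%d sigma -> %.5f%% confidence\n",
               n, 100.0 * erf(n / sqrt(2.0)));
    return 0;
}
```

This prints ~68.3% at 1σ, ~95.45% at 2σ and ~99.99994% at 5σ.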

Wednesday, 4 February 2009

Ow, my head...

OK, the undergrad lab I've taken on teaching this term is way more work than any previous TAing I've done, so now I must reorganise a few things to accommodate this course. Which is a shame, really, because all the cosmology projects I'm on seem to be firing on all four cylinders, minor hiccups aside. The Stackbuild.pro routine needs its pixel variables fixed, and of the 16 machines running it, one has fallen over and half of the others seem to be dawdling; I think there's a loop problem to investigate.

C4 is almost cracked. I've managed to plough through the main program and most of the smaller subroutines. That said, those only make up half of the program; the other half is taken up by 3 huge subroutines. I'm neatening the code up and adding comments for future reference, but since those 3 subroutines are critical, I'll have to go through them with a fine-tooth comb. TestCount appears to replicate Count, so analyses may go ahead simultaneously on both. I will have to download a fraction of Stripe 82 from SDSS and have a test run sometime soon; I'm aiming for Monday. If I can also set up the algorithms for inputting the catalogue data on Monday, next week could be pretty exciting PhD-wise.