Saturday, 28 February 2009

Trouble with the neighbours

With the DES cluster workshop meeting fast approaching, I have uncovered a wealth of hiccups in processing the simulated catalogues with C4. Program-wise, almost everything works to a degree, then fails spectacularly. I've managed to process the lowest-redshift galaxy catalogue up to the point of the nearest-neighbour fitting. The code for this was in fact written by Alexander Gray for Chris Miller (who wrote the original C4 algorithm). Alex briefly describes the algorithm on his webpage, explaining:




The Proximity Project itself compares the best clustering algorithms currently known with clustering algorithms from the last three decades. However, whilst they say the analysis is done (2004), 'the tedious write up is not'. Now, the algorithm I'm using seems to be fairly well tuned to what I'm doing with C4. I'm sure Chris has already realised this, and he has supplied me with the source code. However, it will be a long, long time before I can figure out what's going on with this algorithm without a deeper understanding of it. Compiling the source code as supplied brings up a few errors, so I need to work through those, which will take some investigation.

The algorithm problems don't stop there though. After a successful run of the first redshift catalogue, every other catalogue disastrously dumps most of its galaxies once it has gone through around 25% of them. Having run other catalogues at higher redshifts, some with more galaxies and some with fewer, I can rule out a problem with the format of the catalogues, and it is definitely not a problem with handling an excess number of galaxies. There is something going on that I can't figure out. It isn't in the source data (I checked it by hand), it isn't in the pre-processed C4 galaxy information (the same algorithm is used for the first catalogue, which did run), and it isn't in the way the galaxy information is read in. What seems to happen is that the code loops through a catalogue for some indefinite period, happily categorizing galaxies and assigning them to the appropriate C4 gridpoint, until it gives up and dumps all the remaining galaxies in one bin before carrying on as normal. The trouble is that the occupancy of that particular bin is then used to set up another grid of pointers, with an extra dimension that indexes each individual galaxy. 'The grid' has to be cuboidal in nature, so one really heavy bin suddenly introduces a HUGE dimension on an already large data set. This is the source of the segmentation fault. What I can't explain is why it suddenly decides to dump all of the galaxies in one bin.
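To put rough numbers on why one heavy bin is so costly, here is a minimal Python sketch. It is not the C4 code: the function name, the flat list of bin ids and the toy bin counts are all made up for illustration. The only thing it takes from the description above is that a cuboidal grid has to reserve as many slots in every bin as the fullest bin needs.

```python
from collections import Counter


def dense_grid_slots(bin_of_galaxy):
    """Return (dense_slots, used_slots) for a rectangular per-bin grid.

    bin_of_galaxy is a hypothetical flat list holding one bin id per
    galaxy. A cuboidal grid must give every bin as many slots as the
    fullest bin, so its size is (number of bins) * (deepest bin).
    Only occupied bins are counted here; a real grid would also carry
    its empty bins, which makes the picture worse, not better.
    """
    counts = Counter(bin_of_galaxy)
    n_bins = len(counts)
    deepest = max(counts.values(), default=0)
    return n_bins * deepest, sum(counts.values())


# Toy case: 1000 bins holding 10 galaxies each.
even = [b for b in range(1000) for _ in range(10)]

# Same catalogue, but 50,000 extra entries all land in bin 0.
skewed = even + [0] * 50_000

print(dense_grid_slots(even))    # (10000, 10000)
print(dense_grid_slots(skewed))  # (50010000, 60000)
```

In the toy case the galaxy count grows sixfold, but the rectangular grid grows by a factor of about five thousand, which is the kind of blow-up that would plausibly exhaust memory on an already large data set.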

Actually, the why is partially solved. With the input printed to screen at various intervals, it's possible to spot that at some point C4 decides to read the same line in over and over and over. And because this is the same galaxy, it slots into one gridpoint again and again and again, leading to one monumentally large bin and thereby causing the pointer array to fail. So what's making it read in values differently anyway? Or, more to the point, why, after a seemingly random point in the file, does C4 stop reading in new values and simply reuse the same value as before? I aim to answer this particular question before the workshop so I am at least one roadblock behind rather than two (*touch wood*).
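Rather than eyeballing screen output, the point where the reader gets stuck could be located automatically. The following is a small Python sketch under my own assumptions: `first_repeated_run` and the `min_run` threshold are hypothetical, and the records would be whatever values the reader actually produced (for instance the values printed to screen), not the raw file lines.

```python
def first_repeated_run(records, min_run=3):
    """Return (start_index, run_length) of the first run of at least
    min_run identical consecutive records, or None if there is none."""
    records = list(records)
    start = 0
    for i in range(1, len(records) + 1):
        # A run ends at the end of the data or when the record changes.
        if i == len(records) or records[i] != records[start]:
            if i - start >= min_run:
                return start, i - start
            start = i
    return None


# Toy example: the reader gets stuck on the third record.
parsed = [(1, 20.1), (2, 19.8), (3, 21.0), (3, 21.0), (3, 21.0), (4, 18.9)]
print(first_repeated_run(parsed))  # (2, 3)
```

The start index points at the first stuck record, so the input line just after it is the natural place to look for something the reader cannot digest. One possibility, and it is only a guess about how C4 reads its input: in C, a failed `fscanf` conversion leaves the target variables holding their previous values, so a reader that does not check the return value would keep reusing the last good galaxy.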

1 comment:

  1. Problem one: solved! -ish. The input data is faulty and contains a varying number of errors in u-band magnitude. These errors particularly affect my cluster finder because C4 is somewhat dependent on these colours to determine clustering. So the preprocessing step is now simply to omit the galaxies without u-band magnitudes. Well, it'll do for an answer until the catalogues get fixed. But preprocessing is streamlined to about 15 minutes now so it should be no problem :)
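A rough Python sketch of that kind of filter is below. It is not the actual preprocessing code: the whitespace-separated layout, the u-band column index `u_col` and the sentinel values are all assumptions made for illustration, since the real catalogue format isn't described here.

```python
import math


def has_valid_u(line, u_col=2, sentinels=(99.0, -99.0, -9999.0)):
    """True if the line has a finite, non-sentinel u-band magnitude.

    u_col and sentinels are hypothetical; adjust to the real catalogue.
    """
    fields = line.split()
    if len(fields) <= u_col:
        return False          # u-band column missing altogether
    try:
        u = float(fields[u_col])
    except ValueError:
        return False          # unparseable entry, e.g. '****'
    return math.isfinite(u) and u not in sentinels


def filter_catalogue(lines, u_col=2):
    """Return (kept_lines, number_dropped)."""
    kept = [ln for ln in lines if has_valid_u(ln, u_col)]
    return kept, len(lines) - len(kept)


# Toy rows: id, some other column, u-band magnitude.
rows = ["1 10.0 20.1", "2 10.5 nan", "3 11.0",
        "4 11.5 99.0", "5 12.0 19.7", "6 12.5 ****"]
kept, dropped = filter_catalogue(rows)
print(kept, dropped)  # ['1 10.0 20.1', '5 12.0 19.7'] 4
```

Reporting the number of dropped galaxies alongside the kept ones makes it easy to see how badly each catalogue is affected.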
