Tuesday, 19 May 2009

C4 overview and work

So this blog has been a long time in the making. Distractions abound, mostly in developing and redeveloping the C4 cluster finding algorithm (Miller et al 2005 [1]) so that it becomes a useful piece of the cluster finding strategy for DES. I've currently set it off to explore part of parameter space so I can find the "sweet spot" and get it working as close to 100% as possible. So here's a rundown of what C4 does at the moment, and then I'll summarize where to go from here.

A Rough Guide to C4

Step 1 - Data collection.

C4 in its current incarnation is very rigid. I'm trying to soften the hard coding involved, but this is going to take a lot of time and, for now, results are more important than coding implementation. The data I'm using comes from the Dark Energy Survey (DES) simulations group, who produce a galaxy catalogue based on the Hubble Volume simulations using their own (to be published) AddGals routine. This is the basis for my PhD, and refining/testing C4 on this simulation will prove useful once I have an architecture set up to deal with similar data (i.e. from DES itself).

Now, the data has to be a good approximation of the universe in order for cluster finding to mean anything. The problem here is that since we don't know what the far reaches of the universe look like yet, it's a bit difficult to say whether the further limits of the simulation are correct or not. This will naturally lead to deficiencies in the data, where my cluster finding algorithm (or cluster finding algorithms in general), if too tuned to the simulation, will fail when looking at real data. My previous work on testing the simulation for luminous red galaxy distributions shows that the current incarnation of these simulations is comparable to observation, and thereby a reasonable picture of the universe on more local scales. The simulations stretch from the local universe to z~1.3, which is not an unheard-of limit for astronomy, but doing a survey at that depth is what will provide the new science.

Step 2 - Feeding C4

So after formatting the data into a plain ascii file, C4 reads in Right Ascension, Declination, redshift, and then the five waveband magnitudes and their associated errors. With astrometry the way it is, there should be no need to apply errors to RA and Dec, so their errors are ignored.
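To make the input stage concrete, here's a minimal sketch of what the reader boils down to, assuming a whitespace-delimited file with columns RA, Dec, z, then magnitude/error pairs. The struct and exact column order here are illustrative, not C4's actual internals:

    // Minimal sketch of the input stage: a whitespace-delimited ascii
    // catalogue with columns RA, Dec, z, then five magnitude/error pairs.
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct Galaxy {
        double ra, dec, z;           // sky position plus redshift
        double mag[5], magErr[5];    // five waveband magnitudes and errors
    };

    std::vector<Galaxy> readCatalogue(const std::string& path) {
        std::vector<Galaxy> gals;
        std::ifstream in(path.c_str());
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream ss(line);
            Galaxy g;
            ss >> g.ra >> g.dec >> g.z;
            for (int i = 0; i < 5; ++i)
                ss >> g.mag[i] >> g.magErr[i];
            if (ss) gals.push_back(g);    // silently skip malformed lines
        }
        return gals;
    }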

However, the redshifts (radial distance measurements) from DES will be photometric, and thus less accurate than spectroscopic ones (non-astronomers: the reason for this is simply that it is far faster and far cheaper to obtain a photometric redshift (photo-z) than a spectroscopic one). C4's current incarnation does little to remedy this (we simply take a top hat in the redshift dimension to find neighbours), but it should be taken into consideration. A nice idea would be to somehow feed in the redshift probability distributions output by the photo-z estimators. However, doing so would be fairly taxing, and a milder alternative would be to fit Gaussians to the peaks of any redshift distribution above some probability threshold. I've yet to explore this option, but it's on the table of things to do in the coming year.

Optionally, we can feed in seeing, reddening, errors on magnitudes, edge flags and other flags. I've currently set up C4 to specifically keep simulation information, such as galaxy ID and halo ID, for comparison later. Dummy values are used for the remaining numbers so that they don't affect the simulated sample whilst I'm tuning it up.

Step 3 - Binning

So we are reading in a lot of data and trying to find clustering. To do this effectively, I have to bin the data up into meaningful segments in which to evaluate clustering. These segments should be larger than any cluster I wish to find, but smaller than the survey itself. The reason for this is simple: when I evaluate clustering for a single object (galaxy), I do not want to see how clustered it is with an object on the other side of the sky, because the answer is that it won't be. A galaxy cluster tends not to extend beyond one square degree once we go beyond a redshift of ~0.1, so these are the grid dimensions in RA and dec. Redshift is split into 30 bins over 0 < z < 1.4. Finally (although redundant in these simulations), the galaxies are also separated by r-band reddening and seeing, since these will strongly affect observations and thus clustering estimates.
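As a rough illustration of the gridding (ignoring the cos(dec) shrinking of cells towards the poles and the reddening/seeing split, so very much a sketch rather than C4's real grid code):

    // Illustrative gridding: one-degree cells in RA and dec, 30 even
    // redshift bins over 0 < z < 1.4.
    #include <cmath>

    const int    NZ   = 30;
    const double ZMAX = 1.4;

    struct Cell { int ira, idec, iz; };

    Cell binGalaxy(double ra, double dec, double z) {
        Cell c;
        c.ira  = static_cast<int>(std::floor(ra));          // one-degree RA cells
        c.idec = static_cast<int>(std::floor(dec + 90.0));  // shift dec into [0, 180)
        c.iz   = static_cast<int>(z / ZMAX * NZ);           // even redshift bins
        if (c.iz >= NZ) c.iz = NZ - 1;                      // clamp z = ZMAX
        return c;
    }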

Step 4 - Measuring clustering

Now that the galaxies sit in their boxes, how do we figure out which ones are in clusters and which ones are not? This is basically the heart of C4. C4 creates a box around a given galaxy in RA, dec and redshift and then applies colour apertures created from the errors on the magnitudes. These colour apertures have Gaussian edges, such that when a neighbouring galaxy is "counted", a probability is assigned to it (i.e. the likelihood that the colour is in fact the same [within error], given that it is within a reasonable distance in 3D real space AND 4D colour space of our given galaxy). These probabilities are summed over the colours, eventually giving a likelihood of being a neighbour. These likelihoods are then summed over all galaxies falling within the spatial+colour aperture to give an approximate "neighbour count".
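In code terms, the Gaussian-edged weighting might look something like the sketch below. How C4 actually combines the per-colour probabilities is glossed over here; I've simply averaged them, which is an assumption on my part:

    // Gaussian-edged colour apertures: a galaxy inside the spatial box gets
    // a per-colour weight that is 1 at zero colour difference and falls off
    // with the difference scaled by the two galaxies' combined errors.
    #include <cmath>

    double colourWeight(double c1, double c2, double sig1, double sig2) {
        double sigma = std::sqrt(sig1 * sig1 + sig2 * sig2);
        double dc = (c1 - c2) / sigma;
        return std::exp(-0.5 * dc * dc);
    }

    // Likelihood of being a colour neighbour, averaged over the four
    // colours (how C4 really combines them is glossed over here).
    double neighbourLikelihood(const double col1[4], const double col2[4],
                               const double err1[4], const double err2[4]) {
        double p = 0.0;
        for (int i = 0; i < 4; ++i)
            p += colourWeight(col1[i], col2[i], err1[i], err2[i]);
        return p / 4.0;
    }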

So far so good, but we only know how many neighbours this particular galaxy has. It may be that this is an extremely common galaxy anyway, so comparing it to any other galaxy in the universe would be a bit of a wild goose chase. We may also find that our particular galaxy is rare or has peculiar clustering behaviour, meaning comparisons to its peers may also give false results. So I do the obvious thing... I take our galaxy and clone it. Then I place the clone at a random position in the sky and count up its neighbours as described above. I repeat this 30 times and take the median number of neighbours. I now have an effective number of neighbours for the galaxy, as well as the probable number of neighbours from any random position in the sky. With some Poissonian-to-Gaussian jiggery pokery (I'll have to look at this properly in a future blog perhaps), I come up with a p-value for the likelihood of the galaxy being in a clustered environment. I do this for every galaxy in every box. If a galaxy lies close to the edge of a box and the aperture overlaps it, then I start counting neighbours in the next box as well.
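For the jiggery pokery, a plausible minimal version (not necessarily what C4 does internally) is the usual Gaussian approximation to a Poisson count:

    // Gaussian approximation to a Poisson background count: given the
    // median count Nbg from the 30 random placements, the significance of
    // the observed count Nobs is roughly (Nobs - Nbg)/sqrt(Nbg), which the
    // complementary error function turns into a one-sided p-value.
    #include <cmath>

    double clusteringPValue(double nObs, double nBg) {
        if (nBg <= 0.0) return 1.0;   // no background estimate: punt
        double z = (nObs - nBg) / std::sqrt(nBg);
        return 0.5 * std::erfc(z / std::sqrt(2.0));   // P(>= Nobs | background)
    }

So a galaxy with 40 counted neighbours against a median background of 20 would sit at roughly 4.5 sigma, i.e. a p-value small enough to survive even the strictest FDR cut below.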

Step 5 - Extracting the most clustered galaxies.

A false discovery rate algorithm (Miller et al 2001 [2]) is applied to the output galaxy catalogue and reduces the sample, selecting galaxies that are clustered (as determined by their p-value) under a preselected error rate, i.e. how impure you will allow your sample to be. I've selected 4 false discovery rate (FDR) values (1%, 2.5%, 5% and 10%) to explore this parameter space. Miller et al (2005) applied an FDR of 10% to the SDSS cluster sample and retained 90% completeness AND near 90% purity above 10^14.5 solar masses. Since I'm not getting this from the DES sample, I'll need to figure out some harsher cuts to get this level of accuracy whilst retaining completeness.
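The FDR selection itself is short enough to sketch. This follows the Benjamini-Hochberg step-up procedure that Miller et al [2] build on; whether C4's implementation matches this exactly is an assumption:

    // Benjamini-Hochberg style step-up: sort the p-values and find the
    // largest rank k with p_(k) <= (k/N) * alpha; everything up to that
    // p-value is kept as "clustered" at false discovery rate alpha.
    #include <algorithm>
    #include <vector>

    double fdrThreshold(std::vector<double> pvals, double alpha) {
        std::sort(pvals.begin(), pvals.end());
        const std::size_t n = pvals.size();
        double thresh = 0.0;                      // reject nothing by default
        for (std::size_t k = 1; k <= n; ++k)
            if (pvals[k - 1] <= alpha * k / n)
                thresh = pvals[k - 1];
        return thresh;   // keep galaxies with p-value <= thresh
    }

Running this four times with alpha = 0.01, 0.025, 0.05 and 0.1 gives the four samples mentioned above.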

Step 6 - kNN clustering

Our catalogue of clustered galaxies is then put through a kNN clustering algorithm which calculates the distance (in Euclidean 3D space) to each galaxy's 6th nearest neighbour. The algorithm being used is part of the FAST library distribution created at Georgia Tech by Alexander Gray [3] and further developed by his students. Now, the 3D metric space is a fairly decent parameter to cluster things by, but if I can figure out a way to add colour space into the process, I can only guess at the improvement I'd make in terms of clustering... I will discuss this further later.
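For intuition, a brute-force version of the 6th-nearest-neighbour distance looks like the following. At 25 million galaxies this O(N^2) loop is hopeless, which is exactly why the tree-based FAST library is used instead:

    // Brute-force kth-nearest-neighbour distance in 3D. Fine for intuition,
    // hopeless at survey scale -- hence the tree-based FAST library.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct P3 { double x, y, z; };

    // Requires pts.size() > k; k = 6 gives the 6th nearest neighbour.
    double kthNeighbourDist(const std::vector<P3>& pts, std::size_t i, int k) {
        std::vector<double> d2;
        for (std::size_t j = 0; j < pts.size(); ++j) {
            if (j == i) continue;
            double dx = pts[i].x - pts[j].x;
            double dy = pts[i].y - pts[j].y;
            double dz = pts[i].z - pts[j].z;
            d2.push_back(dx * dx + dy * dy + dz * dz);
        }
        std::nth_element(d2.begin(), d2.begin() + (k - 1), d2.end());
        return std::sqrt(d2[k - 1]);
    }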

Step 7 - Creating a cluster catalogue.

So we have a catalogue of clustered galaxies and a catalogue of their nearest neighbours. I take the 6th nearest neighbour distance (a completely arbitrary choice) and rank the galaxies by distance to their 6th neighbour (smallest to largest). I then iterate through the list, taking the first galaxy as a cluster centre. Then I look for its neighbours out to a 3D background density limit of my choosing - this is also a fairly arbitrary parameter, but it will have significant impact on any result I produce. Any neighbours that lie in environments denser than this limit are assigned to the cluster and then removed from my ranked list. The top-most galaxy remaining in the ranked list is my second cluster centre, and so on. I remove clusters with fewer than 3 galaxies, and the remainder is output as my C4 cluster catalogue, with the number of associated members and a radius for each cluster. Further analysis is performed to gauge the effectiveness of the algorithm, which I will discuss in another post.
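Schematically, the catalogue construction is the loop below. I've expressed "denser than the background limit" as a ceiling on the 6NN distance, and used a fixed linking radius around each centre; both are simplifications of what the real code does:

    // Schematic cluster assembly: rank by 6NN distance, then peel off
    // clusters from the top of the list. Radius bookkeeping is omitted.
    #include <algorithm>
    #include <vector>

    struct ClusteredGalaxy {
        int id;
        double x, y, z;   // 3D comoving position
        double d6;        // distance to 6th nearest neighbour
        bool assigned;
    };

    static bool byD6(const ClusteredGalaxy& a, const ClusteredGalaxy& b) {
        return a.d6 < b.d6;
    }

    // linkDist: linking radius around each centre; d6Cut: the background
    // density limit, expressed as a ceiling on the 6NN distance. Both are
    // tuning knobs, not values from the real code.
    std::vector<std::vector<int> > buildClusters(std::vector<ClusteredGalaxy>& g,
                                                 double linkDist, double d6Cut) {
        std::sort(g.begin(), g.end(), byD6);
        std::vector<std::vector<int> > clusters;
        for (std::size_t i = 0; i < g.size(); ++i) {
            if (g[i].assigned) continue;
            g[i].assigned = true;
            std::vector<int> members(1, g[i].id);      // g[i] is the centre
            for (std::size_t j = i + 1; j < g.size(); ++j) {
                if (g[j].assigned || g[j].d6 > d6Cut) continue;
                double dx = g[i].x - g[j].x;
                double dy = g[i].y - g[j].y;
                double dz = g[i].z - g[j].z;
                if (dx * dx + dy * dy + dz * dz <= linkDist * linkDist) {
                    g[j].assigned = true;
                    members.push_back(g[j].id);
                }
            }
            if (members.size() >= 3) clusters.push_back(members);  // drop pairs
        }
        return clusters;
    }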

Current Work

I am currently working to get C4 performing effectively, as I've now passed the point of simply getting it to run. I refer to the sections above for what I'm doing/have done.

Step 1

Putting data into C4 properly is vital. C4 cannot preserve variables unless it is told to, and it also has trouble reading variables if they aren't delivered in the correct manner. I've adjusted C4 to cope with the simulated data and preserve values for galaxy ID and halo ID for comparison with the raw simulation. These runs are currently being analyzed and I expect results shortly (I'm promising too many future blogs now, aren't I?).

Also, the simulations are huge, dealing with some 25 million individual galaxies. Evaluating each one is taxing, and chunking the data up into manageable portions has been a difficult task. I've opted to rearrange the supplied input such that I probe the entire redshift range over a smaller area in RA and dec, with overlapping segments. Since I'm already not expecting clusters to exceed 1 degree in diameter, a one degree overlap should suffice for any cluster smaller than 2 degrees in diameter, which C4 should be able to handle well. There is some wasted computational time, but without the luxury of infinite memory this is the approach I've taken. The default size for these wedges is 36 square degrees (including overlaps), or 25 square degrees near the edges. If the area can't be fit by these, then a smaller segment is produced on which C4 will run again.
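The tiling itself is simple to express. The sketch below roughly reproduces the default 6x6 degree (36 square degree) wedges by stepping the cores 5 degrees at a time for the one-degree overlap; the smaller fallback segments are left out:

    // Tile a rectangular RA/dec footprint into wedges whose cores step by
    // 5 degrees while each wedge extends 6 degrees, giving the one-degree
    // overlap; wedges are clipped at the survey edge.
    #include <algorithm>
    #include <vector>

    struct Wedge { double raMin, raMax, decMin, decMax; };

    std::vector<Wedge> tileFootprint(double ra0, double ra1,
                                     double dec0, double dec1) {
        const double core = 5.0, pad = 1.0;   // 5 + 1 = 6 degrees per side
        std::vector<Wedge> wedges;
        for (double ra = ra0; ra < ra1; ra += core) {
            for (double dec = dec0; dec < dec1; dec += core) {
                Wedge w;
                w.raMin  = ra;
                w.raMax  = std::min(ra + core + pad, ra1);
                w.decMin = dec;
                w.decMax = std::min(dec + core + pad, dec1);
                wedges.push_back(w);
            }
        }
        return wedges;
    }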

For the time being I am working on one single segment and trying to explore the various parameters on that simulation segment before continuing on to process all the simulation segments in the same way.

Step 2

C4 is inflexible and this needs to be changed. Of the many things involved in changing C4, there are 2 key issues I want to address:
(a) C4 takes in an input minimum and maximum RA and dec; it needs these to prevent the counting algorithm from going outside "the grid". However, they are not used in producing the grid itself, and the grid dimensions have to be hardwired into the code proper. It should be possible for these arrays to be dynamically allocated at runtime rather than at compile time, allowing the grid to be created to cover whatever area you fancy (a toy sketch of the idea follows this list). This is a bit beyond my current C++ skills and I'm ploughing through "teach yourself" books to get to this stage.
(b) The number of colours is rigidly set to 4, hence C4 (Colour 4). This shouldn't be so, and the lack of one waveband shouldn't mean the data be abandoned altogether. This can be overcome by intelligently overloading the main galaxy class, and then setting up flags during Step 1 to signal how the input is to be handled.
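For (a), the fix amounts to swapping compile-time array sizes for containers sized at runtime. A toy version of the idea (not C4's actual grid code), where the cell counts come from the survey limits read at startup:

    // Toy version of a runtime-allocated grid: nRa x nDec x nZ cells,
    // sized from whatever min/max RA and dec arrive in Step 1.
    #include <vector>

    struct CellContents { /* whatever lives in a grid cell */ };

    typedef std::vector<std::vector<std::vector<CellContents> > > Grid;

    Grid makeGrid(int nRa, int nDec, int nZ) {
        return Grid(nRa,
                    std::vector<std::vector<CellContents> >(
                        nDec, std::vector<CellContents>(nZ)));
    }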

Step 3

Binning in redshift is taken in approximately 100 Mpc steps. A cluster scales anywhere up to around 5 Mpc, so this is a fair divide. However, photometric redshifts, or more accurately their associated errors, span of order 50 Mpc or more at best once we begin to reach the further reaches of space around z ~ 0.5. Whilst I'm sure it's wise to scale bins by Mpc for accurate galaxy positions, with photo-z errors I'd hazard a guess that this may need a complete rethink.
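To put a rough number on that: taking an optimistic photo-z scatter of sigma_z ~ 0.03(1+z) (my assumption, not a DES figure), the comoving distance error is roughly Delta_d ~ c*sigma_z / H(z). At z = 0.5, with H(z) ~ 90 km/s/Mpc for a standard LCDM cosmology, that gives Delta_d ~ (3x10^5 km/s x 0.045) / (90 km/s/Mpc) ~ 150 Mpc, i.e. larger than the 100 Mpc bins themselves.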

Step 4

Once again, redshift measurements are the weak spot in this process. In looking for spatial cluster members, the current technique is to simply slam a box around your target galaxy and hope for the best. This isn't such a brilliant idea and needs a rethink; I'll discuss this in "Further Work".

Step 5

So far, testing of the FDR has given me some fairly poor results, although I suspect this may have more to do with my background thresholding than my FDR setting. I will have to go through Miller et al [2] and get a better grip on what I can fiddle with to improve things.

Step 6

If I can figure out a reasonable metric to scale colour with distance, this will undoubtedly be a far more powerful clustering tool. As it is, I all but abandon using colours from the moment I have a p-value assigned to a galaxy. But is the current implementation already the correct way to go about things? This is also currently being explored.

Step 7

C4 currently produces too many clusters, not all of which relate to a halo. The current idea is that the distribution is affected by 1) the FDR and 2) the background threshold estimate. So far, I've been testing background densities ranging from 5x to 2000x the density of the clustered galaxies. Whilst this falls fairly happily into ideal cluster size ranges for the 10% FDR, it is still a bit too low for the lower FDR limits. I will therefore calculate background thresholds scaled to the FDR and then run the cluster finder on them to see if this helps. The idea is to scale the percentage over to an actual background estimate or, more ideally, take the original number of galaxies (clustered and non-clustered) and scale the background to that. This is my most immediate task.

Further work

The RA/dec dynamics issue raised in Step 2 part a will be dealt with as soon as I learn how!

Current work on Step 2 part (b) leads naturally to a fairly interesting idea: combining anywhere from 3 up to any number of wavebands (2 to, say, 8 colours) for extended wavelength coverage. If this can be done, then source-matched catalogues from differing surveys covering the same area become usable. Cluster finding on joint wavelengths is pretty much an unexplored field, and with more developed IR data just on the horizon (Herschel [4] and VISTA [5]), joint cluster finding may give insight into not only cluster distributions, but possibly cluster formation and, crucially, cluster formation histories. So I will hopefully be employing more than one survey to build a cluster catalogue - assigning probabilities to source matching within C4 may come out of this, which will again lead to RA and dec uncertainties should the astrometry or resolution be an issue.

The redshift measurement issue I raised in Step 4 may have an interesting (if slightly convoluted) solution. Delivered redshifts are based on galaxy spectra models and the likelihood of a set of magnitudes/colours appearing at some point in redshift distance. This generally delivers a redshift likelihood distribution which could be exploited. If I set some arbitrary threshold, such as 20%, and search for peaks in the probability distribution above that, I think I can engineer some very basic Gaussian fitting to the peaks and thus preserve each galaxy's likelihood of actually being at a given position in redshift space. This has the upshot of retaining problem galaxies' photo-z measurements while rejecting any photo-zs too poor to make any reasonable clustering estimate. If there are indeed strange new galaxies to be seen, I think this will be a problem for the photo-z estimators rather than for C4, which will deal with properly identified galaxy redshifts.
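As a first stab, the machinery could look something like the sketch below. The threshold, the sampled p(z) input, and the moment-based stand-in for a proper Gaussian fit are all placeholders for whatever turns out to work:

    // Find peaks in a sampled redshift probability distribution p(z) above
    // a threshold, then characterise each with a mean and width taken from
    // local moments -- a cheap stand-in for a proper Gaussian fit.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct ZPeak { double zMean, zSigma, weight; };

    std::vector<ZPeak> fitPeaks(const std::vector<double>& z,
                                const std::vector<double>& p,
                                double threshold) {
        std::vector<ZPeak> peaks;
        for (std::size_t i = 1; i + 1 < z.size(); ++i) {
            // Keep only local maxima that clear the probability threshold.
            if (p[i] < threshold || p[i] < p[i - 1] || p[i] < p[i + 1])
                continue;
            // Expand to the contiguous above-threshold region round the peak.
            std::size_t lo = i, hi = i;
            while (lo > 0 && p[lo - 1] >= threshold) --lo;
            while (hi + 1 < z.size() && p[hi + 1] >= threshold) ++hi;
            double w = 0.0, m1 = 0.0, m2 = 0.0;
            for (std::size_t j = lo; j <= hi; ++j) {
                w  += p[j];
                m1 += p[j] * z[j];
                m2 += p[j] * z[j] * z[j];
            }
            ZPeak pk;
            pk.zMean  = m1 / w;
            pk.zSigma = std::sqrt(std::max(m2 / w - pk.zMean * pk.zMean, 0.0));
            pk.weight = w;
            peaks.push_back(pk);
        }
        return peaks;
    }

A flat-topped distribution can register the same region twice here, so a real version would need to merge overlapping peaks, but it illustrates the shape of the idea: each galaxy comes out as one or more (zMean, zSigma, weight) triples rather than a single brittle photo-z.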

References

[1] The C4 Clustering Algorithm: Clusters of Galaxies in the Sloan Digital Sky Survey (Miller et al, 2005)
http://arxiv.org/abs/astro-ph/0503713v1

[2] Controlling the False Discovery Rate in Astrophysical Data Analysis (Miller et al, 2001)
http://arxiv.org/abs/astro-ph/0107034v1

[3] 'N-Body' Problems in Statistical Learning (Gray and Moore, 2000)
http://www.cs.cmu.edu/~agray/nips-final.pdf

[4] Herschel ESA page
http://sci.esa.int/science-e/www/area/index.cfm?fareaid=16

[5] VISTA homepage
http://www.vista.ac.uk/