Tuesday, 19 May 2009
So this blog has been a long time in the making. Distractions abound in terms of developing and redeveloping the C4 cluster-finding algorithm (Miller et al. 2005 [1]) so that it becomes a useful piece of the cluster-finding strategy for DES. I've currently set it off to explore some part of parameter space so I can find the "sweet spot" to get it working as close to 100% as possible. So here's the rundown on what C4 does at the moment, and then I'll summarize where to go from here.
A Rough Guide to C4
Step 1 - Data collection.
C4 in its current incarnation is very rigid. I'm trying to soften the hard coding involved, but this is going to take a lot of time and, for now, results are more important than coding implementation. The data I'm using comes from the Dark Energy Survey (DES) simulations group, who produce a galaxy catalogue based on the Hubble Volume simulations using their own (to-be-published) AddGals routine. This is the basis for my PhD, and refining/testing C4 against this simulation will prove useful once I have an architecture set up to deal with similar data (i.e. from DES itself).
Now the data has to be a good approximation of the universe in order for cluster finding to mean anything. The problem here is that since we don't know what the far reaches of the universe look like yet, it's a bit difficult to say whether the further limits of the simulation are correct or not. This will naturally lead to deficiencies in the data where my cluster-finding algorithm (or cluster-finding algorithms in general), if too tuned to the simulation, will fail when looking at real data. My previous work on testing the simulation for luminous red galaxy distributions shows that the current incarnation of these simulations is comparable to observation, and is thereby a reasonable picture of the universe on more local scales. The simulations stretch from the local universe to z ~ 1.3, which is not an unheard-of limit for astronomy, but doing a survey at that depth is what will provide the new science.
Step 2 - Feeding C4
So after formatting the data into a plain ASCII file, C4 reads in right ascension (RA), declination (Dec), redshift and then the five waveband magnitudes and their associated errors. With astrometry the way it is, there should be no need to apply errors to RA and Dec, so their errors are ignored.
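To make that concrete, here's a rough C++ sketch of reading such a catalogue; the column order and the hard-coded five bands are illustrative assumptions, not the actual C4 reader.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// One row of the ASCII catalogue: position, redshift, and per-band photometry.
struct Galaxy {
    double ra, dec, z;
    std::vector<double> mag, magerr;
};

std::vector<Galaxy> readCatalogue(const std::string& path, int nbands = 5) {
    std::vector<Galaxy> gals;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        Galaxy g;
        if (!(ss >> g.ra >> g.dec >> g.z)) continue;   // skip blank or bad rows
        g.mag.resize(nbands);
        g.magerr.resize(nbands);
        bool ok = true;
        for (int b = 0; b < nbands && ok; ++b)
            ok = static_cast<bool>(ss >> g.mag[b] >> g.magerr[b]);
        if (ok) gals.push_back(g);
    }
    return gals;
}
```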
However, the redshifts (radial distance measurements) from DES will be photometric, and thus less accurate than spectroscopic ones (non-astronomers: the reason for this is simply that it is far faster and far cheaper to obtain a photometric redshift (photo-z) than a spectroscopic one). C4's current incarnation does little to remedy this (we simply take a top hat in the redshift dimension to find neighbours), but it should be taken into consideration. A nice idea would be to somehow input the redshift probability distributions output by the photo-z estimators. However, doing so would be fairly taxing, and a milder alternative would be to fit Gaussians to the peaks of any redshift distribution above some probability threshold. I've yet to explore this option, but it's on the table of things to do in the coming year.
Optionally, we can feed in seeing, reddening, errors on magnitudes, edge flags and other flags. I've currently set up C4 to specifically keep simulation information such as galaxy ID and halo ID for comparison later. Placeholder values are used for the remaining fields so that they don't affect the simulated sample whilst I'm tuning it up.
Step 3 - Binning
So we are reading in a lot of data and trying to find clustering. To do this effectively, I have to bin up the data into meaningful segments in which to evaluate clustering. These segments should be larger than any cluster I wish to find, but smaller than the actual survey itself. The reason for this is simple: when I evaluate clustering for a single object (galaxy), I do not want to see how clustered it is with an object on the other side of the sky - because the answer is: it won't be. A galaxy cluster tends not to extend beyond one square degree once we go beyond a redshift of ~0.1, so these are the grid dimensions in RA and Dec. Redshift is split into 30 bins over 0 < z < 1.4. Finally (although redundant in these simulations), the galaxies are also separated by r-band reddening and seeing, since these will highly affect observations and thus clustering estimates.
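As a sketch of what the binning boils down to (1 deg cells in RA/Dec, 30 redshift bins over 0 < z < 1.4, with the seeing/reddening split left out), something like this assigns each galaxy its grid indices:

```cpp
#include <cmath>

struct GridIndex { int ira, idec, iz; };

// Bin a galaxy into the RA/Dec/redshift grid described above.
GridIndex binGalaxy(double ra, double dec, double z,
                    double raMin, double decMin,
                    double cellDeg = 1.0, int nz = 30, double zMax = 1.4) {
    GridIndex idx;
    idx.ira  = static_cast<int>(std::floor((ra  - raMin)  / cellDeg));
    idx.idec = static_cast<int>(std::floor((dec - decMin) / cellDeg));
    idx.iz   = static_cast<int>(std::floor(z / (zMax / nz)));
    if (idx.iz >= nz) idx.iz = nz - 1;   // clamp galaxies sitting on the upper edge
    return idx;
}
```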
Step 4 - Measuring clustering
Now that the galaxies sit in their boxes, how do we figure out which ones are in clusters and which ones are not? This is basically the heart of C4. C4 creates a box around a given galaxy in RA, Dec and redshift and then applies colour apertures created from the errors on the magnitudes. These colour apertures have Gaussian edges, such that when a neighbouring galaxy is "counted", a probability is assigned to it (i.e. the likelihood that the colour is in fact the same [within error], given that it is within a reasonable distance in 3D real space AND 4D colour space of our given galaxy). These probabilities are summed over the colours, eventually giving a likelihood of being a neighbour. These likelihoods are then summed over all galaxies which fall within the spatial+colour aperture to give an approximate "neighbour count".
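My reading of that step, sketched in C++ (this is the spirit of the calculation rather than the literal C4 code; the Gaussian weighting and the equal-weight sum over colours are assumptions):

```cpp
#include <cmath>

// Probability that two galaxies share the same colour within their errors,
// using a Gaussian falloff whose width is the combined magnitude error.
double colourMatchProb(double colourA, double errA,
                       double colourB, double errB) {
    double sigma = std::sqrt(errA * errA + errB * errB);
    if (sigma <= 0.0) sigma = 1e-6;                 // guard against zero errors
    double d = (colourA - colourB) / sigma;
    return std::exp(-0.5 * d * d);
}

// Combine the four colours into a single "likelihood of being a neighbour"
// for one candidate galaxy (normalized so a perfect match gives 1).
double neighbourLikelihood(const double cA[4], const double eA[4],
                           const double cB[4], const double eB[4]) {
    double sum = 0.0;
    for (int c = 0; c < 4; ++c)
        sum += colourMatchProb(cA[c], eA[c], cB[c], eB[c]);
    return sum / 4.0;
}
```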
So far so good, but we only know how many neighbours this particular galaxy has. The possibility is that this is an extremely common galaxy anyway, so comparing to any other galaxy in the universe would be a bit of a wild goose chase. Also, we may find that our particular galaxy is rare or has peculiar clustering behaviour, meaning comparisons to its peers may also give false results. So I do the obvious thing... I take our galaxy and clone it. Then I place it at a random position in the sky and count up its neighbours as described above. I repeat this 30 times and take the median number of neighbours. I now have an effective number of neighbours for the galaxy as well as the probable number of neighbours from any random position in the sky. With some Poissonian-to-Gaussian jiggery pokery (I'll have to look at this properly in a future blog perhaps) I come up with a p-value for the likelihood of the galaxy being in a clustered environment. I do this for every galaxy in every box. If a galaxy lies close to the edge of a box and the aperture overlaps it, then I start counting neighbours in the next box as well.
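For what it's worth, here's a guess at what the Poissonian-to-Gaussian step amounts to: treat the median count from the 30 random placements as a Poisson expectation, approximate it as a Gaussian with width sqrt(mu), and quote the one-sided tail probability of the observed count. Purely a sketch of the idea, not the actual C4 calculation.

```cpp
#include <cmath>

// p-value for the hypothesis "this galaxy's neighbour count is just background",
// using a Gaussian approximation to the Poisson background distribution.
double clusteringPValue(double observed, double medianBackground) {
    double mu = medianBackground;
    double sigma = std::sqrt(mu > 0.0 ? mu : 1.0);
    double zscore = (observed - mu) / sigma;
    return 0.5 * std::erfc(zscore / std::sqrt(2.0));   // one-sided tail probability
}
```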
Step 5 - Extracting the most clustered galaxies.
A false discovery rate algorithm (Miller et al 2001 [2]) is applied to the output galaxy catalogue and reduces the sample, selecting galaxies that are clustered (as determined by their p-value) and applying a preselected error rate, i.e. how impure you will allow your sample to be. I've selected 4 false discovery rate (FDR) values (1%, 2.5%, 5% and 10%) to explore this parameter space. Miller et al (2005) applied an FDR of 10% to the SDSS cluster sample and retained 90% completeness AND near 90% purity above 10^14.5 solar masses. Since I'm not getting this from the DES sample, I'll need to figure out some harsher cuts to the sample to get this level of accuracy whilst retaining completeness.
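The FDR step is, as I understand it, in the Benjamini-Hochberg spirit: sort the p-values, find the largest rank k with p_(k) <= (k/N)*alpha, and keep everything at or below that p-value, where alpha is the allowed contamination (e.g. 0.10 for 10%). A sketch:

```cpp
#include <algorithm>
#include <vector>

// Return the p-value threshold for a given false discovery rate alpha.
// Galaxies with p <= threshold are kept as "clustered".
double fdrThreshold(std::vector<double> pvals, double alpha) {
    std::sort(pvals.begin(), pvals.end());
    const std::size_t N = pvals.size();
    double threshold = 0.0;                            // nothing selected by default
    for (std::size_t k = 1; k <= N; ++k)
        if (pvals[k - 1] <= alpha * double(k) / double(N))
            threshold = pvals[k - 1];                  // largest qualifying p-value
    return threshold;
}
```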
Step 6 - kNN clustering
Our catalogue of clustered galaxies is then put through a kNN algorithm which calculates the distance (in Euclidean 3D space) to each galaxy's 6th nearest neighbour. The algorithm being used is part of the FAST library distribution created at Georgia Tech by Alexander Gray [3] and further developed by his students. Now, the 3D metric space is a fairly decent parameter to cluster things by, but if I can figure out a way to add colour space into the process I can only guess at the improvement I'd make in terms of clustering... I will discuss this further later.
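The quantity being computed is just the distance to each galaxy's 6th nearest neighbour. A brute-force O(N^2) stand-in (the FAST library does this with tree-based dual recursion and is vastly quicker) looks like:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Pos { double x, y, z; };   // comoving Cartesian coordinates (assumed)

std::vector<double> kthNeighbourDistance(const std::vector<Pos>& pts, int k = 6) {
    std::vector<double> dk(pts.size(), 0.0);
    for (std::size_t i = 0; i < pts.size(); ++i) {
        std::vector<double> d;
        d.reserve(pts.size());
        for (std::size_t j = 0; j < pts.size(); ++j) {
            if (i == j) continue;
            double dx = pts[i].x - pts[j].x;
            double dy = pts[i].y - pts[j].y;
            double dz = pts[i].z - pts[j].z;
            d.push_back(std::sqrt(dx * dx + dy * dy + dz * dz));
        }
        std::nth_element(d.begin(), d.begin() + (k - 1), d.end());
        dk[i] = d[k - 1];          // distance to the kth nearest neighbour
    }
    return dk;
}
```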
Step 7 - Creating a cluster catalogue.
So we have a catalogue of clustered galaxies and a catalogue of their nearest neighbours. I take the 6th nearest neighbour distance (a completely arbitrary choice) and rank the galaxies by distance to their 6th neighbour (smallest to largest). I then iterate through the list of galaxies, taking the first galaxy in the list as the first cluster centre. Then I look for its neighbours out to a 3D background density limit of my choosing - this is also a fairly arbitrary parameter, but it will have significant impact on any result I produce. Any neighbours that lie in environments denser than this limit are assigned to this cluster and then removed from my ranked list. The top-most galaxy remaining in the ranked list is then my second cluster centre, and so on. I remove clusters with fewer than 3 galaxies, and the remainder is output as my C4 cluster catalogue with the number of associated members and a radius for each cluster. Further analysis is performed to gauge the effectiveness of the algorithm, which I will discuss in another blog.
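The assembly loop, roughly sketched below (the density proxy, search radius and thresholds are stand-ins for whatever I end up tuning):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

struct Member { double x, y, z; double d6; };   // position plus 6th-neighbour distance

std::vector<std::vector<int>> buildClusters(const std::vector<Member>& g,
                                            double densityThreshold,
                                            double searchRadius) {
    // Rank galaxies by 6th-neighbour distance: smallest (densest) first.
    std::vector<int> order(g.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return g[a].d6 < g[b].d6; });

    std::vector<bool> assigned(g.size(), false);
    std::vector<std::vector<int>> clusters;
    for (int c : order) {
        if (assigned[c]) continue;               // next unassigned galaxy = new centre
        std::vector<int> members{c};
        assigned[c] = true;
        for (int j : order) {
            if (assigned[j]) continue;
            double dx = g[c].x - g[j].x, dy = g[c].y - g[j].y, dz = g[c].z - g[j].z;
            double r = std::sqrt(dx * dx + dy * dy + dz * dz);
            double localDensity = 1.0 / std::pow(g[j].d6, 3);   // crude density proxy
            if (r < searchRadius && localDensity > densityThreshold) {
                members.push_back(j);            // absorb and remove from the ranked list
                assigned[j] = true;
            }
        }
        if (members.size() >= 3) clusters.push_back(members);   // drop tiny groups
    }
    return clusters;
}
```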
Current Work
I am currently working to get C4 working effectively, having now passed the point of simply getting it to work. I refer to the sections above for what I'm doing/have done.
Step 1
Putting data into C4 properly is vital. C4 cannot preserve variables unless it is told to preserve them, and it also has trouble reading variables if they aren't delivered to it in the correct manner. I've adjusted C4 to cope with the simulated data and to preserve values for galaxy ID and halo ID for comparison with the raw simulation. These runs are currently being analyzed and I expect results shortly (I'm promising too many future blogs now, aren't I?).
Also, the simulations are huge, dealing with some 25 million individual galaxies. Evaluating each one is taxing, and chunking the data up into manageable portions has been a difficult task. I've opted to rearrange the supplied input such that I probe the entire redshift range over a smaller area in RA and Dec, with overlapping segments. Since I'm already not expecting clusters to exceed 1 degree in diameter, a one-degree overlap should suffice for any cluster smaller than 2 degrees in diameter, which C4 should be able to handle well. There is some amount of wasted computational time, but without the luxury of infinite memory this is the approach I've taken. The default setting for these wedges is 36 square degrees (including overlaps), or 25 square degrees near the edges. If the area can't be fit by these, then a smaller segment is produced on which C4 will run again.
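Something like the following is all the wedge arrangement amounts to - 6x6 degree tiles stepped by 5 degrees so that adjacent tiles share a 1 degree overlap (the edge handling here is simplified compared to what I actually do):

```cpp
#include <algorithm>
#include <vector>

struct Wedge { double raLo, raHi, decLo, decHi; };

// Tile a rectangular footprint into overlapping RA/Dec wedges.
std::vector<Wedge> tileFootprint(double raMin, double raMax,
                                 double decMin, double decMax,
                                 double size = 6.0, double overlap = 1.0) {
    std::vector<Wedge> wedges;
    double step = size - overlap;                 // 5 degrees of unique area per tile
    for (double ra = raMin; ra < raMax; ra += step)
        for (double dec = decMin; dec < decMax; dec += step)
            wedges.push_back({ra, std::min(ra + size, raMax),
                              dec, std::min(dec + size, decMax)});
    return wedges;
}
```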
For the time being I am working on one single segment and trying to explore the various parameters on that simulation segment before continuing on to process all the simulation segments in the same way.
Step 2
C4 is inflexible and this needs to be changed. Of the many things involved in changing C4, there are 2 key issues I want to address:
(a) Since C4 takes in an input minimum and maximum RA and Dec, it needs these to prevent the counting algorithm from going outside "the grid". However, they are not employed in producing the grid itself, and those values have to be hardwired in. It should be possible for these arrays to be dynamically allocated at runtime rather than at compile time, allowing the grid to be created to cover whatever area you fancy. This is a bit beyond my current C++ skills and I'm ploughing through "teach yourself" books to get to this stage.
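For the record (mostly for future me), this is all the change requires: compute the grid dimensions from the input min/max at runtime and size a std::vector accordingly, rather than fixing array sizes at compile time. The numbers below are made up.

```cpp
#include <vector>

struct Cell { /* per-cell galaxy bookkeeping would live here */ };

int main() {
    // Example footprint limits; in practice these come from the input catalogue.
    double raMin = 10.0, raMax = 46.0, decMin = -30.0, decMax = -24.0;
    double cellDeg = 1.0;
    int nRA  = static_cast<int>((raMax  - raMin)  / cellDeg) + 1;
    int nDec = static_cast<int>((decMax - decMin) / cellDeg) + 1;

    // A 2D grid whose size is decided entirely at runtime.
    std::vector<std::vector<Cell>> grid(nRA, std::vector<Cell>(nDec));
    return 0;
}
```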
(b) The number of colours is rigidly set to 4, hence C4 (Colour 4). This shouldn't be so, and the lack of one waveband shouldn't mean that the data are abandoned altogether. This can be overcome by intelligently overloading the main galaxy class, and then setting up flags during Step 1 to signal how the input is to be handled.
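Sketching what I mean (class and member names are made up, not the real C4 ones): store the magnitudes in a vector and keep a per-band flag recording whether the band was actually observed.

```cpp
#include <cstddef>
#include <vector>

class FlexGalaxy {
public:
    explicit FlexGalaxy(std::size_t nbands)
        : mag_(nbands, 99.0), magerr_(nbands, 99.0), observed_(nbands, false) {}

    void setBand(std::size_t b, double mag, double err) {
        mag_[b] = mag;
        magerr_[b] = err;
        observed_[b] = true;
    }

    // Simplest useful definition: usable colours = observed bands minus one.
    int usableColours() const {
        int nObs = 0;
        for (std::size_t b = 0; b < observed_.size(); ++b)
            if (observed_[b]) ++nObs;
        return nObs > 0 ? nObs - 1 : 0;
    }

private:
    std::vector<double> mag_, magerr_;
    std::vector<bool>   observed_;
};
```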
Step 3
Binning in redshift is taken in approximately 100 Mpc steps. Since a cluster scales anywhere up to around 5 Mpc, this is a fair divide. However, photometric redshifts, or more accurately their associated errors, span of order 50 Mpc or more at best once we begin to reach the further reaches of space around z ~ 0.5. Whilst I'm sure it's wise to scale bins by Mpc when galaxy positions are accurate, I'd hazard a guess that this may need a complete rethink when dealing with photo-zs.
Step 4
Once again, redshift measurements are the weak spot in this process. In looking for spatial cluster members, the current technique is to simply slam a box around your target galaxy and hope for the best. This isn't such a brilliant idea and needs a rethink; I'll discuss this in "Further Work" below.
Step 5
So far, testing of the FDR has given me some fairly poor results, although I suspect this may be more to do with my background thresholding than my FDR setting. I will have to go through Miller et al [2] and get a better grip on what I can fiddle about with to improve things.
Step 6
If I can figure out a reasonable metric to scale colour with distance, this will undoubtedly be a far more powerful clustering tool. As it is, I all but abandon using colours from the moment I have a p-value assigned to a galaxy. But is the current implementation already the correct way to go about things? This is also currently being explored.
Step 7
C4 currently produces too many clusters, not all of which relate to a halo. The current idea is that the distribution is affected by 1) the FDR and 2) the background threshold estimate. So far, I've been testing background densities ranging from 5x to 2000x the density of the clustered galaxies. Whilst this falls fairly happily into ideal cluster size ranges for a 10% FDR, it is still a bit too low for the lower FDR limits. I will therefore calculate background thresholds scaled to the FDR and then run the cluster finder on them to see if this helps. The idea is to scale the percentage over to an actual background estimate or, more ideally, take the original number of galaxies (clustered and non-clustered) and scale the background to that. This is my most immediate task.
Further work
The RA/Dec dynamic allocation issue raised in Step 2, part (a), will be dealt with as soon as I learn how!
Current work on Step 2, part (b), leads naturally to a fairly interesting idea: combining anywhere from 3 wavebands up to any number (2 to, say, 8 colours) for extended wavelength coverage. If this can be done, then source-matched catalogues from differing surveys covering the same area will result. Cluster finding on joint wavelengths is pretty much an unexplored field, and with more developed IR data just on the horizon (Herschel [4] and VISTA [5]), joint cluster finding may give insight into not only cluster distributions, but possibly cluster formation and, crucially, cluster formation histories. So I will hopefully be employing more than one survey to build a cluster catalogue - assigning probabilities to source matching within C4 may come out of this, which will again lead to RA and Dec uncertainties should the astrometry or resolution be an issue.
The redshift measurement issue I raised in Step 4 may have an interesting (if slightly convoluted) solution. Redshifts are delivered on the basis of galaxy spectral models and the likelihood of a set of magnitudes/colours appearing at some point in redshift distance. This generally delivers a redshift likelihood distribution, which has the possibility of being exploited. If I set some arbitrary threshold, such as 20%, and search for peaks in the probability distribution above that, I think I can engineer some very basic Gaussian fitting to the peaks and thus preserve each galaxy's likelihood of actually being there in redshift space. This has the upshot of retaining the problem galaxies' photo-z measurements while rejecting any photo-zs too poor to make any reasonable clustering estimate. If there are indeed strange new galaxies to be seen, I think this will be a problem for photo-z estimators rather than C4, which will deal with properly identified galaxy redshifts.
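As a very rough sketch of that idea: scan the photo-z probability distribution for local peaks above the threshold, and characterize each one with a centre, height and a width read off at half maximum (a proper Gaussian fit would do better, but this shows what would be kept per galaxy):

```cpp
#include <vector>

struct ZPeak { double z, amplitude, sigma; };

std::vector<ZPeak> findRedshiftPeaks(const std::vector<double>& z,
                                     const std::vector<double>& p,
                                     double threshold = 0.2) {
    std::vector<ZPeak> peaks;
    for (std::size_t i = 1; i + 1 < p.size(); ++i) {
        // Keep only local maxima that clear the probability threshold.
        if (p[i] < threshold || p[i] < p[i - 1] || p[i] < p[i + 1]) continue;
        // Walk outwards to the half-maximum points to estimate a width.
        std::size_t lo = i, hi = i;
        while (lo > 0 && p[lo] > 0.5 * p[i]) --lo;
        while (hi + 1 < p.size() && p[hi] > 0.5 * p[i]) ++hi;
        double fwhm = z[hi] - z[lo];
        peaks.push_back({z[i], p[i], fwhm / 2.355});   // FWHM -> Gaussian sigma
    }
    return peaks;
}
```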
References
[1] The C4 Clustering Algorithm: Clusters of Galaxies in the Sloan Digital Sky Survey (Miller et al, 2005)
http://arxiv.org/abs/astro-ph/0503713v1
[2] Controlling the False Discovery Rate in Astrophysical Data Analysis (Miller et al, 2001)
http://arxiv.org/abs/astro-ph/0107034v1
[3] 'N-Body' Problems in Statistical Learning (Gray and Moore, 2000)
http://www.cs.cmu.edu/~agray/nips-final.pdf
[4] Herschel ESA page
http://sci.esa.int/science-e/www/area/index.cfm?fareaid=16
[5] VISTA homepage
http://www.vista.ac.uk/
Monday, 2 March 2009
Tune in, turn on, drop out.
The first hurdle explained in my previous blog was only really semi-solved. In order to get C4 to run, I simply ignored things which didn't have all the information I needed from them. The thing is, I'm not sure I need all the information from them.
This is a redshift-dependent problem: source detection in catalogues is run on individual wavebands, and colour information is simply crossmatched across the various catalogues. Whilst this is an effective way to detect sources, high-redshift objects tend to drop out of the shorter wavebands (e.g. u-band and g-band in SDSS), so the galaxy information lacks these magnitudes. What should ideally happen is that the flux at that position is measured anyway, so that a meaningful colour measurement can be made. A good reason for the current approach is that a confirmed object which doesn't appear in certain wavebands will certainly be of some degree of interest. Either way, high-redshift objects will undoubtedly lack this colour information.
So I'm going to make a minimum threshold of 2 colours per object (3 wavebands). I think the trick is to fool C4 into thinking this is a decent way to work and tune the clustering algorithm (I'm resigned to the fact I'm going to have to get into this deep and have a really messy time with it) to consciously decide on clustering the C4 galaxies with limited information. I think I can do this. And if I can do this, I can turn the method round and incorporate more colours. If I can incorporate more colours, I can combine multiple surveys and thus, I have better constraints. And if I do that... well... no-one's ever done that with clusters before.
Saturday, 28 February 2009
Trouble with the neighbours
With the DES cluster workshop meeting fast approaching, I have uncovered a wealth of hiccups in processing the simulated catalogues with C4. Program-wise, almost everything works to a degree, then fails spectacularly. I've managed to process the lowest-redshift galaxy catalogue up to the point of the nearest-neighbour fitting. The code for this was in fact written by Alexander Gray for Chris Miller (who wrote the original C4 algorithm). Alex briefly describes the algorithm on his webpage, explaining:
The all-nearest-neighbors problem asks, for each of a set of query points, what is its nearest neighbor among a set of reference points. In the more general bichromatic version, the query set and reference set may be different. Aside from being a classical challenge problem of computational geometry, this is often the real problem of interest in practice, not the single-query problem. Our simple dual-recursive depth-first-search algorithm turns out to be faster than the best previous algorithms in practice, including Vaidya's algorithm and the well-separated pair decomposition. It is exact, solves the more general bichromatic problem, works for general k, and as with all of our algorithms on this page, it works with virtually any space-partitioning tree. (First described in the NIPS 2000 paper. Full experimental results and detailed description will appear in the paper associated with the Proximity Project hopefully sometime in 2005.)
The Proximity Project itself is based on comparing the best clustering algorithms currently known and comparing them to clustering algorithms from the last 3 decades. However, whilst they say the analysis is done (2004), 'the tedious write up is not'. Now, the algorithm I'm using seems to be fairly tuned in to what I'm doing with C4. I'm sure Chris has already realised this and has supplied me with the source code. However, it will be a long, long time before I can figure out what's going on with this algorithm without a deeper understanding of it. Straight out compiling the source code brings up a few errors so I need to work on those which necessitates investigation.
The algorithm problems don't stop there though. After a successful run of the first redshift catalogue, every other catalogue disastrously dumps most of its galaxies once it's gone through around 25% of them. Having run other catalogues at higher redshifts but with more or fewer galaxies, it can't be a problem with the formats of the catalogues, and it is definitely not a problem with dealing with an excess number of galaxies. There is something going on that I can't figure out, and it isn't in the source data (having checked it by hand), it isn't in the pre-processed C4 galaxy information (as the same algorithm is used for the first catalogue, which did run), nor is it in the way the galaxy information is read in. What the problem looks like is that the catalogue loops through for some indefinite period, happily categorizing galaxies and assigning them to the appropriate C4 gridpoint, until it gives up and dumps all the remaining galaxies in one bin before carrying on as normal. The problem with this is that that particular bin is then used to set up another grid of pointers, with an extra dimension to correlate to each particular galaxy. 'The grid' has to be cuboidal in nature, so by having one really heavy bin we suddenly introduce a HUGE dimension on an already large data set. This is the source of the segmentation fault. The problem I'm having is why it suddenly decides to dump all of the galaxies in one bin.
Actually, the why is partially solved. Having the input printed to screen at various intervals, it's possible to spot that at some point C4 decides to read the same line in, over and over and over. And because this is the same galaxy, it slots into one gridpoint again and again and again, leading to one monumentally large bin and thereby causing the pointer array to fail. So what's making it read in values differently anyway? Or, more to the point, why, after a random point in the file, does C4 stop reading in new values and simply put in the same value as before? I aim to answer this particular question before the workshop so I am at least one roadblock behind rather than two (*touch wood*).
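One hypothesis (and it is only that) for the repeated-line symptom: a C++ read loop that tests eof() rather than checking whether the extraction itself succeeded will, after one malformed field, spin forever on the same stale values. The sketch below shows the fragile pattern and a safer line-by-line alternative; the filename is made up.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::ifstream in("catalogue.txt");   // hypothetical input file

    // Fragile pattern: if one row has a bad field, the extraction fails,
    // ra/dec/z keep their old values, eof() is never reached, and the loop
    // keeps "reading" the same galaxy over and over:
    //
    //   double ra, dec, z;
    //   while (!in.eof()) { in >> ra >> dec >> z; /* bin the galaxy */ }

    // Safer: read whole lines and only use values when parsing succeeds.
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        double ra, dec, z;
        if (!(ss >> ra >> dec >> z)) {
            std::cerr << "Skipping malformed line: " << line << '\n';
            continue;
        }
        // ... bin the galaxy as usual ...
    }
    return 0;
}
```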
Thursday, 19 February 2009
Weeds
So I'm back working on the C4 algorithm.. it's now getting downright icky as I'm finding variables with helpful names such as 'a', 'b', 'c', etc... so tracing these back to the inputs has been taking a lot longer than I originally envisaged. Even better, some of the helpful names like 'newRA' get declared and then lead absolutely nowhere. Duplicate variables abound with varying degrees of latency. I will be trying to make at least two runs of C4 after finishing going through the program fully. I'm going to leave tidying up the code for after these runs. It'd be useful to see what parameters C4 can live without... slim it down then bulk it up. Anyway, the outputs are still an issue so to clarify, I aim to:
- Produce what I think is the best cluster catalog, in the sense that higher-ranked clusters correspond to more massive halos, with the highest possible rates of completeness and purity. I shan't be concerned with precise coordinates of cluster centers... I'll use the halo lookup table.
- The catalogue comparisons will be run with a uniform algorithm to assign galaxies to clusters, given a list of cluster positions. However, I believe all of the algorithms we discussed at the OSU meeting tracked individual galaxies, so they will have different membership assignment algorithms. This effectively means that we will have two slightly different versions of most cluster catalogs in the comparison project.
In addition, assignment of the cluster rank can be arbitrary. The comparison project will be assigning ranks according to the richnesses obtained with the coordinators' membership-assignment algorithm, but an individual cluster catalog might be more accurately ranked by some other cluster property.
Monday, 9 February 2009
Housekeeping...
I need to keep a track of some of my DES luminous red galaxy (LRG) programs so dumping a bunch of filenames here for later reference:
idl/DES/
CatSimLRG.pro
Main program that does everything to everything. Outputs are: colour plot comparisons, galaxy-redshift histograms, and comoving number densities. Organized by model, photometry and version (of the DES catalogue simulation). The i-band cut option is chosen for the Cannon plot to investigate the cannonhole. Outputs to ~/Documents/PhD/LRGplots/new/
quickLRGvar.pro
Program that will look at the individual colour space distribution for a selected simulation for a selected model and compare it to a selected permutation of the SDSS sample (photometric, spectroscopic and DES-modified photometric). Nominally done in CatSimLRG but quickLRGvar output is also organized by redshift slice. Useful for varying parameters beyond those in the above program... includes a g-band cut option as well, to explore faint photometry. Outputs to ~/Documents/PhD/LRGplots/Cannonz/
plotDEScomp.pro
Will plot comoving number densities for a given model distribution that has already been processed by CatSimLRG. Serves to correct plot outputs and also gives the option of excluding the DES-modified SDSS cut (in a really stupid way), since I think that serves more as a pseudo-cut than any realistic expectation from data, simulated or otherwise. Outputs to same folder as CatSimLRG, superseding plots contained therein. Plots excluding the DES-modified cut have the suffix 'woDES'.
All outputs are in both postscript and jpeg formats. Jpegs are web-ready (mogrify'd) and postscripts are paper-quality.
I'm actually surprised it's only these three programs I use - I have about 30 in that directory. There is the Cannonhole.pro program to separate out the cannonhole objects from the rest of the Cannon sample... but that's an artifact I presented last November, so it's mostly just a relic.
Friday, 6 February 2009
Data Analysis and Model Selection
As a sort of aside from my main PhD work, I thought I'd better write up my notes from an interesting seminar I attended yesterday on Bayes' theorem, which managed to clear my head about a few things as well as launching me into a thousand ideas for research plans. The lecture was given by Prof. Andrew Liddle.
The purpose of data analysis is to work out whether the values of parameters are compatible with data. By parameters, I mean some measurement(s) that has been created by theory or past experiment.
During model selection, one may ask what parameters are indicated by the data.
In combining these two analyses, one can create a feedback loop that simultaneously tests how well a theory fits a set of observational data AND see how well observational data obeys a theory. To do this effectively we need to demand certain criteria from both data and models:
Useful models:
- Fit the present data acceptably.
- Have the ability to make predictions for future data (this is computationally expensive).
Useful data:
- Will be predictable by the models we aim to test
- Have modellable intrinsic randomness (e.g. cosmic variance, sample variance) & experimental error (instrumental accuracy, noise)
Bayesian Inference is a system of logical deduction which assigns probabilities to all quantities of interest. The marked difference with this and regular (frequentist) statistics is that you now take into account how reliable /unreliable a quantity is. Frequentist statistics only really allows for an "absolute" truth and is not as suited to making predictions.
Bayes' theorem states that P(A|B) = P(B|A) P(A) / P(B), where A and B can be anything. So let's make P(A) the probability that you are wearing a hat versus P(B), the probability you're going to a casino. The equation then reads as: the probability that you are wearing a hat, given the fact you are going to a casino, is equal to the probability you are going to a casino given that you're wearing a hat, multiplied by the probability you are wearing a hat, divided by the probability you are going to a casino. The key thing here is that if you're unlikely to be wearing a hat, you're unlikely to be going to a casino, UNLESS it's more than likely you wear a hat whenever you go to a casino! Random bits aside, Bayes' theorem is a clever doodle to test whether something is related to something else. It may even be a good thing to find that something isn't related to something else! Back on topic though, for data analysis the concept stays but the variables change thus:
Prior P(θ) has no data dependence - more on this follows.
Likelihood P(D|θ) involves making a new calculation to evaluate the data against the model.
Posterior P(θ|D) then is the ability of the model to reflect the data.
P(D) is commonly ignored: it is simply a number that normalizes the probabilities to unity.
Now, priors. They can be a source of disagreement in a calculation, since how much you trust parameters (from previous results/theory) depends on how you feel about them. But the upshot is that you have to lay your priors down on paper, so that any point you make about them is clear in the calculations. This is also where a little bravery, and physical intuition, comes into play: you have to state your assumptions, exposing them. The trick is, don't seek a single "right" prior. Rather, test the robustness of your conclusions under a reasonable variation of the priors. Bayesian statistics ensures that, eventually, good data will overturn an incorrect choice of prior. Bravery aside, hardcore bayesianists (is that a word?!) simply comment: "if you don't know enough to set a prior, why did you bother getting the data?"
Bayesian Parameter Estimation
So with the above knowledge in tow, how do you use the thing?
1. Choose a model (a choice of a set of parameters to be varied to fit a dataset):
- Set parameters to be varied
- Set prior ranges for these parameters
- Having the most powerful/complete/accurate dataset is all well and good, but adding more data will only improve your result.
4. Obtain the posterior parameter distribution... basically plug in and play!
In this way, a huge dataset can be summarized into a handful of parameters. Taking a cosmological example: from the WMAP satellite we receive a series of time-ordered images covering the whole sky; these are then composited into a map of ~10 million pixels. Make an assumption about this map (such as the likelihood of getting this map given Gaussianity of the cosmic microwave background [CMB]) and we can obtain the power spectrum of the CMB, which reduces those 10 million pixels to ~1000 parameters, one for each measurement of angular power. Then take a model, compare it to the likelihood of obtaining the observed power spectrum, and we get a handful of parameters describing, say, Ωb, ΩDM and the Hubble parameter.
Briefly...
I only want to touch on Markov Chain Monte Carlo (MCMC) simulations here. Exploring the probabilities of multiple parameters (probability space) is computationally expensive: if you take, say, 10 variations of a single parameter, that will cost 10 computations; two parameters will cost 100 computations; and 8 parameters will cost 100,000,000 evaluations. Supercomputers typically push a million computations... but that is pushing it. And so MCMC comes into play.
Monte Carlo -> any calculation that involves an element of randomness e.g. making a hop of a random size (normally drawn from a gaussian).
Markov chain -> the next hop depends only on where you currently are; in the simplest scheme, only a move to a position of greater likelihood in probability space is taken.
So an MCMC run will gradually converge to a point in probability space where the most likely combination of parameter values (i.e. model) is found. To prevent false maxima from being mistaken for the true one, you can launch several Markov chains: most will end up at the true maximum.
Computationally, MCMC scales linearly with additional parameters, as opposed to the multiplicative grid effect. Worth a mention is the Metropolis-Hastings algorithm, which resembles a Markov chain but sometimes allows steps to positions of lower likelihood, allowing exploration of the shape of the maximum.
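To make the hopping idea concrete, here's a toy Metropolis-Hastings sampler for a single parameter with a made-up Gaussian log-likelihood; everything about it is illustrative.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Made-up target: a Gaussian posterior centred on 1.0 with width 0.5.
double logLikelihood(double theta) {
    double mu = 1.0, sigma = 0.5;
    return -0.5 * std::pow((theta - mu) / sigma, 2);
}

std::vector<double> metropolisHastings(double start, double stepSize, int nSteps) {
    std::mt19937 rng(42);
    std::normal_distribution<double> hop(0.0, stepSize);       // random-sized hops
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    std::vector<double> chain{start};
    double current = start, currentLogL = logLikelihood(start);
    for (int i = 0; i < nSteps; ++i) {
        double proposal = current + hop(rng);
        double propLogL = logLikelihood(proposal);
        // Always accept uphill moves; accept downhill moves with probability
        // exp(delta logL) -- this is what lets the chain map out the peak's shape.
        if (std::log(unif(rng)) < propLogL - currentLogL) {
            current = proposal;
            currentLogL = propLogL;
        }
        chain.push_back(current);
    }
    return chain;
}
```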
Publication bias
Something else I got from this seminar was the mention of publication bias, which is almost common sense, but if you can characterize it by value - excellent! The simple statement is that things that are found in data are more likely to be published than things that are not found. For example, if a paper finds (cosmology again) non-Gaussianity in the CMB, it is more likely to be published than anything that doesn't find non-Gaussianity (judging by non-Gaussianity alone). A rule of thumb is that if a (new) effect is found to only ~95% (2σ) confidence, don't believe it. To prove something new you need a certainty of around 5σ (something like 99.99997%).
Wednesday, 4 February 2009
Ow my head.
OK, the undergrad lab I've taken on teaching this term is way more work than previous TAing I've done, so I must now reorganise a few things to accommodate this course. Which is a shame really, because all the cosmology projects I'm on seem to be firing on all four cylinders, minor hiccups here and there aside. The Stackbuild.pro routine needs the pixel variables fixed. Of the 16 machines running it, one has fallen over and half of the others seem to be dawdling. I think there's a loop problem to investigate.
C4 is almost cracked. I've managed to plough through the main program and most of the smaller subroutines. That said, those only make up half of the program and the other half is taken up by 3 huge subroutines. I'm neatening the code up and adding comments for future reference but since those 3 subroutines are critical, I'll have to go through them with a fine-tooth comb. TestCount appears to replicate Count, so analyses may go ahead simultaneously on both. Will have to download a fraction of Stripe 82 from SDSS and have a test run sometime soon. I'm aiming for Monday. If I can set up the algorithms for inputting the catalogue data as well on Monday next week could be pretty exciting PhD-wise.
Friday, 30 January 2009
GOODS progress
Implemented a new code that should hopefully be able to deal with the seemingly fragile network running here. Have modularized the initialization section of my rastermap v7 code. There are now two code blocks that run independently and I'll need to document them properly at some point in the future, but I need to keep track of files that come out of them first.
AORBUILD.pro
Outputs:
"meanmap.fits" - hashed up map of the summed images normalized by weight.
"meanmap.idl" - stores data used to construct the initial mean map:
- "file[]" - list of filenames of files that contain fair images to work with.
- "mean" - the map written out in the meanmap.fits file
- "hd" - the header for the map
- "wt" - the summed weight of all the images in the map plane
- "usedfiles" - number of files in "file"
- "filesperaor[]" - number of (useful) images contained within the AOR
- "medsky[]" - median image produced by simply combining the images without interpolating; serves as the sky background employed in the initial sky subtraction. Good for spotting detector failures.
- "aornames[]" - what it says on the tin
- "stack[]" which holds stack.wt and stack.map which are the individual weight and image files interpolated onto the map plane.
- "wtaor" which is simply the summation of the individual weights of the image files in the map plane for each particular AOR.
STACKBUILD.pro
Outputs:
"meta/STKx####y&&&&.idl" is a stack of pixels in the map plane where #### and &&&& represent the x and y coordinates:
- "pxfil[]" - array containing full path of the file containing originating pixel.
- "pxsrc[]" - the key part of the filename used to describe the above file, allowing a full AOR/file/image/location search of the file in question.
- "pxval []" - the actual value of the pixel.
Sunday, 25 January 2009
Lost in Translation
So, having gotten back to work after the New Year, a trip to India and a touch of lurgy, things seem to have been happening! The GoogleSky problem I was faced with seems to have magically been solved using Google App Engine. My Python code, instead of instantiating a web page and calling methods within it, is now delivered in JavaScript, which then calls Python from the JavaScript handler. I think avoiding JavaScript is no longer a choice now, and I'm going to have to face learning yet another programming language for this specific project, else I'm going to be fairly lost in translation.
In the meantime, I've been ploughing through the C4 algorithm and I think I'm happy with what the component parts do in terms of input and output. I still need to figure out what's actually going on inside the subroutines though, especially the count and testcount functions. They have a lot more going on than I initially thought. And the net output of the program is still gibberish to me. I need to see if I can come up with some sensible output. Or maybe find some standardized naming of the actual variables. I'll look up the NVO to see if they have anything useful I can call on later. Still haven't geared it up for dynamic array setting... the problem is when I have to deal with an observing footprint, specifically the DES observing footprint, I'm going to get a lot of junk from the testcount function. I'll have to dig a bit more and see if this is something that's previously been solved or whether I'm going to have to start from scratch. If I do have to start from scratch, I think I'll try and use the SQL syntax used in the NVO. This might mean extra work having a C++ parser for an SQL query... a risk of being lost in translation all over again. I think perhaps a lookup table might be the fastest/easiest way to do this.
Thursday, 22 January 2009
It begins...
Seeing as the department webserver is down for now, and I have no useful way of tracking my academic thoughts at present, I may as well kick this off. The naming of this blog comes from what seems to be more of a hobby than my actual PhD research but here goes: I'm trying to make super high quality maps from deep infrared data taken from the Spitzer space telescope as part of the GOODS programme. These will hopefully be the highest quality maps produced from this set of data. And as well as that, I met some excellent folks at Santa Fe last year and we've begun setting up scientific tools and interface for the web version of GoogleSky: sort of a GoogleMap for astronomers. Tons of potential for growth in the future of these!
My PhD work, in stark contrast, is directed towards finding galaxy clusters in the Dark Energy Survey. Hmm.. this might take some explaining. Dark Energy is this "stuff" present in the universe that
(a) we can't see because it's made up of something weird,
(b) we can't find because we can't see it,
(c) but know it's all over the place and blowing the universe apart.
Intriguing, no? So in the infinite wisdom of a bunch of cosmology geeks, the Dark Energy Survey (or DES) was set up to solve problems (a) and (b). Or at least get a better answer than shrugging our shoulders and mumbling something about scalar fields, maybe. I'll indulge with how I fit in further on down the line... I do stuff faaaarr less exciting!
So, for building maps of the sky, searching the heavens, and hunting for clusters buried in the quintessence of the universe; cosmic cartography begins thus.