The all-nearest-neighbors problem asks, for each of a set of query points, what is its nearest neighbor among a set of reference points. In the more general bichromatic version, the query set and reference set may be different. Aside from being a classical challenge problem of computational geometry, this is often the real problem of interest in practice, not the single-query problem. Our simple dual-recursive depth-first-search algorithm turns out to be faster than the best previous algorithms in practice, including Vaidya's algorithm and the well-separated pair decomposition. It is exact, solves the more general bichromatic problem, works for general k, and, as with all of our algorithms on this page, it works with virtually any space-partitioning tree. (First described in the NIPS 2000 paper. Full experimental results and a detailed description will appear in the paper associated with the Proximity Project, hopefully sometime in 2005.)
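To make the dual recursion concrete, here's a minimal 1-D sketch in C. This is my own illustration, not the project's code: Node, dual, node_bound, LEAF_SIZE and the interval bounds are all invented, and a real implementation works on kd-trees or metric trees in d dimensions and caches each query node's bound instead of rescanning it. The key step is the prune: if the closest any reference in one node can possibly be to any query in another node is still worse than every query's current best, the whole pair of subtrees is skipped at once.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>

#define LEAF_SIZE 2

typedef struct Node {
    int lo, hi;               /* index range [lo, hi) into the point array */
    double min, max;          /* bounding interval of the node's points    */
    struct Node *left, *right;
} Node;

/* Build a balanced tree over a sorted 1-D point array (memory is leaked
 * for brevity). */
static Node *build(const double *pts, int lo, int hi) {
    Node *n = malloc(sizeof *n);
    n->lo = lo; n->hi = hi;
    n->min = pts[lo]; n->max = pts[hi - 1];
    n->left = n->right = NULL;
    if (hi - lo > LEAF_SIZE) {
        int mid = (lo + hi) / 2;
        n->left  = build(pts, lo, mid);
        n->right = build(pts, mid, hi);
    }
    return n;
}

/* Lower bound on the distance between any point in q and any point in r. */
static double min_node_dist(const Node *q, const Node *r) {
    if (r->min > q->max) return r->min - q->max;
    if (q->min > r->max) return q->min - r->max;
    return 0.0;
}

/* Largest current nearest-neighbor distance over the queries in q. */
static double node_bound(const double *best, const Node *q) {
    double b = 0.0;
    for (int i = q->lo; i < q->hi; i++)
        if (best[i] > b) b = best[i];
    return b;
}

static void dual(const Node *q, const Node *r,
                 const double *qp, const double *rp, double *best) {
    /* Prune: no reference in r can improve any query in q. */
    if (min_node_dist(q, r) > node_bound(best, q)) return;
    if (!q->left && !r->left) {        /* leaf-leaf: brute force */
        for (int i = q->lo; i < q->hi; i++)
            for (int j = r->lo; j < r->hi; j++) {
                double d = fabs(qp[i] - rp[j]);
                if (d < best[i]) best[i] = d;
            }
    } else if (!q->left) {             /* only r can split */
        dual(q, r->left, qp, rp, best);
        dual(q, r->right, qp, rp, best);
    } else if (!r->left) {             /* only q can split */
        dual(q->left, r, qp, rp, best);
        dual(q->right, r, qp, rp, best);
    } else {                           /* split both */
        dual(q->left,  r->left,  qp, rp, best);
        dual(q->left,  r->right, qp, rp, best);
        dual(q->right, r->left,  qp, rp, best);
        dual(q->right, r->right, qp, rp, best);
    }
}

int main(void) {
    double qp[] = {0.5, 2.5, 7.0, 9.5};              /* queries, sorted   */
    double rp[] = {1.0, 3.0, 4.0, 8.0, 9.0, 10.0};   /* references, sorted */
    double best[4];
    for (int i = 0; i < 4; i++) best[i] = DBL_MAX;
    dual(build(qp, 0, 4), build(rp, 0, 6), qp, rp, best);
    for (int i = 0; i < 4; i++)
        printf("query %4.1f -> nearest reference distance %.1f\n",
               qp[i], best[i]);
    return 0;
}
```

The reason this beats running a single-query search once per galaxy is that nearby queries share traversal work: one prune test retires a whole block of query-reference pairs rather than one pair at a time.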
The Proximity Project itself is based on taking the best clustering algorithms currently known and comparing them against clustering algorithms from the last three decades. However, whilst they say the analysis is done (2004), 'the tedious write up is not'. Now, the algorithm I'm using seems fairly well matched to what I'm doing with C4. I'm sure Chris has already realised this, and he has supplied me with the source code. However, it will be a long, long time before I can figure out what's going on with this algorithm without a deeper understanding of it. Compiling the source code straight out brings up a few errors, so I need to work on those, which necessitates some investigation.
The algorithm problems don't stop there, though. After a successful run of the first redshift catalogue, every other catalogue disastrously dumps most of its galaxies once it has gone through around 25% of them. Having run other catalogues at higher redshifts but with more OR fewer galaxies, it can't be a problem with the formats of the catalogues, and it is definitely not a problem with handling an excess number of galaxies. There is something going on that I can't figure out. It isn't in the source data (having checked it by hand), it isn't in the pre-processed C4 galaxy information (as the same algorithm is used for the first catalogue, which did run), nor is it in the way the galaxy information is read in. What the problem looks like is this: the catalogue loops through for some indefinite period, happily categorising galaxies and assigning them to the appropriate C4 gridpoint, until it gives up and dumps all the remaining galaxies in one bin before carrying on as normal. The trouble is that that particular bin is then used to set up another grid of pointers, with an extra dimension to correlate to each particular galaxy. Now, 'the grid' has to be cuboidal in nature, so by having one really heavy bin we suddenly introduce a HUGE dimension on an already large data set. This is the source of the segmentation fault. What I can't yet explain is why it suddenly decides to dump all of the galaxies in one bin.
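To see why one overloaded bin is fatal, here's a rough sketch (not C4's actual code: the grid dimensions, the Galaxy struct and the bin counts are all made up for illustration). Because the pointer grid is cuboidal, its third dimension is set by the single heaviest bin, so one bin swallowing most of a catalogue multiplies the whole allocation:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical galaxy record; the real C4 structures will differ. */
typedef struct { double ra, dec, z; } Galaxy;

int main(void) {
    int nx = 128, ny = 128;          /* invented grid dimensions          */
    long ok_max  = 200;              /* typical heaviest bin, a good run  */
    long bad_max = 400000;           /* one bin holding a whole catalogue */

    /* The cuboidal grid is nx * ny * (heaviest bin) pointers, so its size
     * is dictated entirely by the single fullest bin. */
    size_t ok  = (size_t)nx * ny * ok_max  * sizeof(Galaxy *);
    size_t bad = (size_t)nx * ny * bad_max * sizeof(Galaxy *);
    printf("normal run: %.2f GB\n", ok  / 1e9);   /* ~0.03 GB */
    printf("broken run: %.2f GB\n", bad / 1e9);   /* ~52 GB   */

    /* On the broken run, malloc returns NULL (or the machine thrashes),
     * and any unchecked dereference of the grid is a segmentation fault. */
    Galaxy **grid = malloc(bad);
    if (grid == NULL)
        fprintf(stderr, "allocation failed: grid depth set by one bin\n");
    free(grid);
    return 0;
}
```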
Actually, the 'why' is partially solved. Having the input printed to screen at various intervals, it's possible to spot that at some point C4 decides to read the same line in, over and over and over. And because this is the same galaxy, it slots into one gridpoint again and again and again, leading to one monumentally large bin and thereby causing the pointer array to fail. So what's making it read in values differently anyway? Or, more to the point, why, after a random point in the file, does C4 stop reading in new values and simply put in the same value as before? I aim to answer this particular question before the workshop, so that I am at least one roadblock behind rather than two (*touch wood*).
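I haven't seen inside C4's reader yet, so this is only a guess, but one classic C failure mode produces exactly this symptom: a scanf-style loop that checks only for EOF. When fscanf hits a malformed field partway through the file, it returns a short count without consuming the offending characters, the loop keeps spinning, and the variables keep their previous values, so the same galaxy gets binned forever. A sketch of the broken pattern and a safer one (the field names and file name are invented):

```c
#include <stdio.h>

int main(void) {
    FILE *fp = fopen("catalogue.dat", "r");   /* hypothetical file name */
    if (!fp) return 1;

    double ra = 0.0, dec = 0.0, z = 0.0;

    /* BROKEN: only stops at EOF. On a malformed line, fscanf returns a
     * short count, leaves the bad characters unread and the variables
     * untouched, so every iteration "reads" the same galaxy again:
     *
     *   while (fscanf(fp, "%lf %lf %lf", &ra, &dec, &z) != EOF) {
     *       ... bin the galaxy ...
     *   }
     */

    /* SAFER: demand all three fields, and skip to the next line whenever
     * the parse fails, so a bad record can't be re-binned forever. */
    int nfields;
    while ((nfields = fscanf(fp, "%lf %lf %lf", &ra, &dec, &z)) != EOF) {
        if (nfields != 3) {                        /* malformed record  */
            if (fscanf(fp, "%*[^\n]") == EOF)      /* discard the rest  */
                break;                             /* of the line       */
            continue;
        }
        printf("galaxy at ra=%g dec=%g z=%g\n", ra, dec, z);
    }
    fclose(fp);
    return 0;
}
```

Reading whole lines with fgets and parsing them with sscanf would make this kind of stuck-reader bug even easier to spot, since the raw line can be printed right next to the parsed values.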


