Skip to content
low-sky edited this page Aug 23, 2012 · 19 revisions

About

This page describes the requirements and specification for the core dendrogram generation code.

Input

  • The code should accept data in FITS format

  • Input data can be 1-, 2-, or 3-dimensional (in all cases, data can be reshaped to 3-d internally). The value of 1-d is that it can be useful for simple experimentation.

    • EWR: There is some utility in allowing extensibility to N dimensions where N should be at least 6 for phase space applications.
    • BWM: For reference, my latest python code can create dendrograms of N-dimensional data , although only 2/3D data has been extensively tested. I would encourage a dimensionality-agnostic design if possible; there is a loss of simplicity and efficiency, but the end product is much more useful and can be more readily adapted to novel problems. One can use something like C++ template specialization to keep such code as efficient as N≤3-dimensional code, albeit more complicated.
  • If multiple HDUs are present, the code should raise an error asking to specify the HDU to use with an -h option

  • The code should handle common data types: float, double, int, long, ???

  • Blanked data should be a fixed, specified value (NaN?) and regions for analysis should be premasked.

Options/Parameters

  • Option to include the input dataset in the output file to have a self-contained data + dendrogram file. Since it takes more disk space, it could be off by default (CNB votes that it be on by default, since disk space is cheap and self contained files are more convenient - Tom: I agree, I actually did something similar for my radiative transfer code).

  • Minimum number of pixels a leaf can contain to be considered an independent entity (currently minpix)

  • Minimum peak flux for leaves (currently minpeak)

  • The peak of each leaf must be at least delta above its first merger

  • The minimum distance to other peaks for a peak to be considered independent (aka friends). The issue is that the naming of friends/specfriends doesn't make sense for a 3-d ppp cube. Maybe we could allow for such an option, but allow it to be either a single value (which applies to all dimensions), or the values to use in each direction. For a 3-d cube, -f 10 would use a value of 10 in each direction, while -f 10 5 2 would use 10, 5 and 2 in each direction. Then it doesn't matter whether the third dimension is spectral or not. Also, can we think of a better name than friends? e.g. separation? (-s). neighborhood? support?

  • An optional kernel file (currently -k) that manually specifies what pixels should be used for the dendrogram leaves. This is a CSV file with two columns: the 1D index of the pixel (in the IDL convention), and the intensity at that pixel. The intensity is a consistency check to make sure whoever generated the kernel list used the proper index convention.

  • A flag that controls the connectivity of pixels in the data cube. How many neighbors does a pixel in three dimensions have -- 6, 18 or 26? For N dimensions there are N choices of how connected pixels are though we tend to only consider the extremal cases: (1) neighbors are offset by ±1 in only one dimension or (2) neighbors are offset by 0, ±1 in all dimensions.

Output

The output file should be in FITS file, and should contain:

  • The input data (optional - see above) - HDU 0

  • The input options, and path/checksum of input file - added as header comments in HDU 0 (EWR: I could also see merit in these being in the header for HDU 1 since they are more tied to the output. HDU 0 is more accessible).

  • A version number for the output format, added as a header comment. Will be useful in case we ever modify this specification.

  • An 'index map' giving for each pixel/voxel the ID of the structure (leaf, branch) that the pixel/voxel is part of. HDU 1 (or 0, if input data not included). Convention for structure IDs should be established.

  • The relationship between the structures (leaf, branch). I propose an N row x 3 column array. Each row describes a single structure. The first two columns give the IDs of the two substrutures within this object (entries are -1 for leaves). The last column gives the intensity of the contour level at which these structures merge (for leaves, gives the maximum pixel intensity). HDU 2

    • BWM: I would suggest using 0 instead of -1 to represent empty IDs, because it avoid issues with signed vs. unsigned values. (If you use a signed integer type for IDs, with -1 representing 'no ID', it may happen that someone mistakenly casts that value to an unsigned int, in which case it will give a large positive ID that is seemingly valid but non-existent. This also makes import/export more straightforward and lets one write if (id0) rather than if (id0 != -1). The catch is you have to be sure IDs are defined to always be ≥ 1.)
    • EWR: Using a FITS bintable as this extension may be preferable since the different columns have different meanings.
  • The 1D index (in IDL convention) of the kernel positions in the input data. That is, the individual local maxima that seed each leaf. HDU 3 (EWR: This could be encoded in HDU2 as another column, -1 for non-leaves).

Alternate Specification

  • Instead of HDU2 (which requires binary trees), we could include a description of the tree as a Newick string (which allows arbitrary tree topologies). The question is where to put it, as these strings will be long (hundreds to thousands of characters). Do FITS header keywords impose length limits?

  • The proposed convention doesn't support more-than-binary merging. While practically, all mergers will be binary in real-values data, we should anticipate pruning a tree of insignificant branching. Thus, it would suffice to have a row contain ID of object, ID of parent, contour level at which object merges into parent.

Issues with current C++ Implementation

  • It does not implement the delta filtering. Is the python code able to do delta pruning on the fly? If so, we should incorporate that approach.

  • It only handles floating point fits files at the moment. With C++ templates, we should be able to generalize this to any data type.

  • HDU 2 currently doesn't include the last column describing the intensity.

  • It doesn't support N>3 dimensions (this may or may not be an issue).

Terminology

BWM: We should decide on terminology to be used, and apply it consistently in the code, API, and documentation.

  • What to call each data value: "intensity", "flux", or something else?
  • pixel vs. voxel? (pixel is simpler; voxel is more accurate in 3-D space only)
  • Probably best to avoid assuming data is observational; i.e. don't refer to "spectral axis"

Open questions

  • Do we restrict ourselves to binary trees, or allow any number of structures to merge at any point? The latter might make more sense, because physically we might want to say 'these four cores are part of this one clump, and these three clumps for a cloud'. Enforcing binarity might lead to trees that are more complicated and deep than they need to be.

    • (CNB) There are two major obstacles to non-binary trees. First is that much of the existing analysis code is built around the assumption of binarity. Second is that we have not been able to figure out a good way of generating non-binary trees. For observational data, the dendrogram is virtually always binary, since its extremely unlikely that 3 structures merge together at a single pixel. We haven't been able to figure out a well-defined, non-arbitrary way to merge branches from one of these trees to accomodate multi-way splits.

    • Tom: just for info, the Python code handles up to 6-way merges in 3-d. The reason I had to implement it was that if delta was very small then it actually did happen in a real dataset (the L1448 cube) that three or more structures were joined by the same pixel. We could allow for both modes though, with the default being binary? We don't have to implement the non-binary one now, but we could always add it as a lower priority feature? (This is probably worth discussing on skype at some point, since I'm not sure we all have the same idea of how a dendrogram is defined)

Wish list

  • In future, it would be nice to also provide a mechanism to be able to call this from Python without having the I/O overhead.

    • BWM: I strongly suggest that from the get-go we deliberately separate (1) dendrogram creation algorithm/specification/code, (2) FITS input/output, and (3) the command-line program that end users will use. If you put (1) and (2) into a library and create a simple accompanying command-line program that uses the library, things will be a lot more flexible. In other words, make sure that FITS input/output is always optional. Doing so lets users avoid I/O bottlenecks if their output doesn't need to be saved (e.g. during development / for unit tests / for benchmarks) and makes it fairly easy to create fast bindings to languages like Python.
  • It would also be nice to have different options for what happens at the edges of files. For instance, dendrograms should connect opposite sides of cubes for simulations with periodic boundary conditions. All-sky maps also have some periodicity.

Requirements

  • Since we need to read/write FITS files, it makes sense to rely on cfitsio/CCfits, but it might be sensible to bundle these and link to them directly rather than installing them as libraries. This will make it a lot easier for users. The licenses of those packages is compatible with this.