Skip to content
ChrisBeaumont edited this page Sep 11, 2012 · 19 revisions

BWM: here is an overview of the core parts of the current python astrodendro API, as well as my comments on what I think should be included/discard/changed for the new repository:

Dendrogram

The Dendrogram class represents a dendrogram or soon-to-be-dendrogram in memory.

Methods

  • __init__(data=None, minimum_flux=-np.inf, minimum_npix=0, minimum_delta=0, verbose=True, compute=True):
    Construct a new Dendrogram with the given parameters and data. Compute the dendrogram if compute=True

    • BWM: Include, but I propose changing parameters to __init__(data=None, min_intensity=-np.inf, min_npix=0, min_delta=0, verbose=True, compute=True)

    • TR: as you discuss below, it's ambiguous whether Dendrogram(filename) is computing or loading a dendrogam. One clean way out of it is to just make the computation a 'factory' function, so that one would do:

          >>> d = compute_dendrogram('mycube.fits', minimum_flux=1., ...)
          >>> type(d)
          dendro.Dendrogram
      

      that is, the Dendrogam class itself is agnostic about the computation, which might actually be cleaner. Then Dendrogram could either take an argument at initialization to load a dendogram:

      >>> d = Dendrogram('my_dendrogram.fits')
      

      and if the file is not a valid dendrogram, we raise an error saying 'to compute a dendrogram, use: compute_dendrogam(...)'. Anyway, this is just a thought, but it would allow us to disambiguate the API, and separate generation from representation. I find it more elegant than having a compute=True/False argument. Another way to do this is to use a class method:

      >>> d = Dendrogram.compute('mycube.fits', ...)
      

      and just to clarify, this is not the same as a method where one would do:

      >>> d = Dendrogram()
      >>> d.compute('mycube.fits')
      
    • CB: I'm in favor of requiring a specific call to compute, and a constructor that only sets paramaters. One could also use a utility function or class/static method to read files from disk. Something like d = Dendrogram.read_hd5(...) or d = load_dendrogram_hd5(...). The latter approach is used by Numpy and Pandas.

    • TR: from my point of view, the parameters are specific to the computation, so I would set those in the call to compute. I like the suggestion of class methods for reading in. I guess this means that we would have the following:

          # Intialize empty dendrogram
          d = Dendrogram()
      
          # Read dendrogram from file
          d = Dendrogram.from_hdf5(...)
      
          # Compute dendrogram from data
          d = Dendrogram.compute('data.fits', delta=3., ...)
      

      so that __init__ basically does not take any options. Both reading and computing are done explicitly via class methods. The latter two can also be functions instead of class methods, though I have a slight preference for class methods.

  • compute(minimum_flux, minimum_npix, minimum_delta, verbose=False):
    [Re-]compute the dendrogram with the given parameters

    • BWM: I'm not sure if this needs to be separate from the above, at least in the public API.
  • item_at(coords): Returns the Branch or Leaf at the given coordinates, if any.

    • TR: I presume this would be pixel coordinates (i.e. no WCS awareness) - right?
      • BWM: Yes, exactly. Pixel coordinates only. coords is an integer tuple.
    • CB: This should accept lists/arrays of values, and return the appropriate collection of objects.
      • BWM: Nice idea.

Properties

  • trunk: a list of the trunk nodes (the nodes with no parent). A collection of Branch and/or Leaf objects.
    • BWM: May be better to rename this to 'items' or 'children' for consistency with the Branch API.
  • all_items: returns a flat list of all Leaf and Branch objects in the dendrogram (all nodes)
  • min_flux: get the value of the minimum_flux when the dendrogram was computed.
  • min_delta: get the value of the minimum_delta when the dendrogram was computed.
  • min_npix: get the value of the minimum_npix when the dendrogram was computed.
    • TR: it might be sensible to use the same names for the keyword arguments as for the attributes
      • BWM: agreed. Having typed these out many types, I'm strongly in favor of the shorter form.
  • items_dict: internal dictionary containing references to all nodes in the dendrogram, indexed by idx. Should probably have a _ in front of it and be made private.

Other stuff

These are things I have working in the current Python implementation that can easily be included if we want.

  • get_leaves: Return a flattened list of all leaves in the dendrogram
    • TR: sounds reasonable
  • to_newick: Create a string containing the Newick representation of the dendrogram
  • to_hdf5(filename) / from_hdf5(filename): HDF5 import/export.
    • BWM: I don't really like that to create a dendrogram from data in memory, one goes d = Dendrogram(my_data) but to load from and HDF5 file, one goes d = Dendrogram(); d.from_hdf5(my_filename); - a consistent API would be nice if we include any file import/export.
    • TR: one option is to simply not allow any filenames to be passed at initialization. So we'd have a 'compute' method, and 'from_hdf5'/'to_hdf5', and 'from_fits/to_fits'?
    • CB: See my comments above for another option for reading dendrograms from disk

Node

Each node of the dendrogram is either a Branch or a Leaf. A Leaf is a node with no children, and represents a set of associated pixels. A branch consists of two or more children which may be branches or leaves, as well as pixels associated with the branch itself.

Each Branch/Leaf has the following API:

  • coords: array of coordinate tuples that correspond with the flux values in f

    • TR: I think this should be allowed to be either a list of tuples, or a 2-dimensional Numpy array with dimensions NxM, where N is the number of flux values, and M is the number of dimensions.
    • CB: The output should definitely be something that can index the original data array (e.g. a boolean mask or tuple of numpy arrays)
    • BWM: both coords and f practically have to be Python lists during computation, due to the way they are dynamically resized. I think Tom's original version used numpy arrays, and changing it to use python lists was one of the most significant optimizations I made. However, we can certainly provide methods to return numpy arrays after the dendrogram has been fully computed.
    • CB: Fair enough -- we just definitely need some equivalent of fluxes = original_data[branch.indexable_coords] (f also provides this, but being able to index another dataset spatially registered to the input is crucial). Perhaps, like you say, coords and f should be converted at the end of compute.
  • f: list of flux values.

    • TR: this could be a list or an array (CB: +1 for arrays)
  • fmin, fmax: the minimum and maximum flux values associated with this node.

  • parent: The parent Branch object, or None

  • idx: a unique, arbitrary ID number to represent this node.

  • height: (read only) the difference in flux value between fmax and the parent's merge level (or, if no parent exists, height gives (fmax-fmin)

  • add_point(coord, f) Add a pixel to the current leaf. Coord is the coordinate tuple and f is the flux value of the pixel.

  • merge(leaf): Add all of the pixels associated with leaf to this leaf/branch.

    • CB: How much care needs to be taken when using add_point and merge? I'm guessing these are used for computing the dendrogram - could the user corrupt the tree by calling them inappropriately after compute? If so, these methods should be hidden in some way. However, it would be very cool if these methods could safely be used to tweak the dendrogram after it's constructed.
      • BWM: At the moment, they are actually meant for internal use only, so should probably be prefixed with an underscore and kept private. The reason is that the dendrogram has an internal index_map structure that must be kept up-to-date, and these methods do not update it. In fact, the dendrogram Leaf and Branch objects don't even have a reference to the Dendrogram object so they cannot modify properties global to the whole dendrogram. In future, we could add a force_merge(node1, node2) or force_pixel(coord, node) method at the Dendrogram level.

Each Branch has the following additional properties:

  • items: a list of the child nodes of this branch. (typically two Branch or Leaf objects)
    • CB: the label items might be too ambiguous (it's used above to refer to all the leaves/branches in the dendrogram). Maybe children?
  • merge_level: the flux value that triggered the merger of the child items (required for plotting).

The constructors are unique to each type:

  • Leaf(coord, f, idx=None): Create a new leaf, associated with the pixel at coordinates coord (a tuple) which has the value f. idx is an integer ID number for the new leaf.
  • Branch(items, coord, f, idx=None): Create a new branch, which connects all of the nodes passed as items. Add the pixel with flux f, coordinates coord, to the new branch.
    • CB: Again, can we rename items to something like children?
      • BWM: agreed, 'children' would be better

Optional API: These are methods that exist in the current Python version, but which are probably not required for computation: (CB: I'm in favor with including many of these)

  • add_footprint(image, level): image is an array; use the coordinates associated with the current node to set every corresponding pixel in image to the value level.
    • CB: Maybe rename to something like fill_footprint or fill_image?
      • BWM: agreed.
  • npix, npix_self: These read-only properties provide the number of pixels in the node itself or in the node and its children
  • f_sum, f_sum_self: These read-only properties provide the sum of the flux values associated with the node itself or with the node and its children
    • TR: I think this is too ambiguous to include - I talked to someone the other day who was using it for a leaf thinking it was the sum of f - merge_level (i.e. the 'background-subtracted' flux). At least initially, I think we should leave this out.
    • CB: What about signatures like npix(self=False) instead of two methods? This signature should be used for npix, f_sum, coords, and peak (and maybe fmin and fmax, though what is the difference between fmax and peak)? Also, I'd like an intensities(self=False) method that returns a numpy array of intensities (in the original data) for this structure.
      • BWM: That is nicer. I like it. However, self is a Python keyword, so we should do something like children=True or descend=False or self_only=False perhaps?
      • BWM: FYI, the difference between fmax and peak is that the former returns only the peak flux value, whereas the latter also returns the coordinates at which the peak flux value is found.
  • level: read-only property that gives 0 for items in the trunk, 1 for their immediate children, etc.
  • ancestor: read-only property giving the ancestor (the only connected node that has no parent) of this leaf/branch recursively. Always returns an object (may return self if the object has no parent). Results are partially memoized/cached to reduce recursion depth.
    • BWM: I found that an efficient, memoized implementation of this property was necessary for my analysis code to run quickly. I think it would be worth including.
    • CB: Let's also add an ancestors method which returns all nodes visited from tracing this node to the trunk.
  • newick: read only property - returns a string with a Newick representation of the node
    • TR: I wonder whether to_newick() would make more sense for consistency with the Dendrogram method?
  • peak: read-only property. Return tuple (coordinates, flux) for the pixel with maximum flux value
    • TR: in the item, or all its children? Or is this a Leaf-only attribute?
    • CB: See above. Maybe a (self=True) option to distinguish between the two cases. A name like argmax would be more consistent with the Numpy API.
    • BWM: It was not a leaf-only value; for a Branch it would return the peak flux value associated with the branch itself, not looking at pixels contained by its children. I think the best would be to do what Chris is suggesting and have a peak(self=False) method that always returns (node, coordinate, flux) (if self is True, then the node return value is simply the current node itself.)
  • get_peak_recursive: Return a tuple with the (node, coordinate, and flux) value of the pixel with highest flux value found within this leaf.
    • CB: Seems too specialized
    • BWM: Hmmm... in my work this came up often enough that node,coord,flux = some_node.peak(self=False) was preferable but it is true that this could be replaced by coord,flux = some_node.peak(self=False); node=dendrogram.item_at(coord);
  • descendants: Read only. Returns a flattened list of all child leaves and branches. Non-recursive. Useful.

Unit Tests

Existing Unit tests

test_compute.Test2DimensionalData

  • test_dendrogramWithNan: Uses an 8x4 array of values (including many 'NaN' values) to compute a dendrogram, and checks basic properties of the resulting dendrogram, such as # of leaves and branches in the trunk. See implementation.
  • test_dendrogramWithConstBackground: Test a highly artificial 14x14 array containing a lot of equal flux values. Only checks the total # of nodes in the resulting dendrogram. See implementation.
    • This test does exhibit a potential issue, which is commented upon in the code.

test_compute.Test3DimensionalData

  • test_dendrogramComputation: Computes a dendrogram from a 107x107x602 pixel data cube where all the flux values greater than 1.4 represent real data from L1448 13co, but all other (lesser) values are randomly generated. Computes a dendrogram using minimum_npix=8, minimum_delta=0.3, minimum_flux=1.4. Checks that the number of leaves in the resulting dendrogram matches a result known to be correct. Iterates through every pixel in the data cube and double-checks that the item_at(coord) method returns the correct node object for each coordinate. See implementation.

test_compute.TestNDimensionalData

  • test_4dim: Programmatically construct a 5x5x5x5 data cube such that we can understand what the resulting dendrogram should be, then compute the dendrogram and check that the Leaf peaks are in the right places and the nodes are connected as expected. See implementation.

test_hdf5.TestHDF5

  • test_write: Compute a dendrogam and write it to an HDF5 file. Only fails if an exception occurs.
  • test_read: Construct a dendrogram from an 8x4x3 data cube. Write it to an HDF5 file. Load the HDF5 into memory as a new dendrogram. Iterate through every node in the original dendrogram and ensure the corresponding node in the dendrogram read from the HDF5 file has the same type (Branch vs. Leaf), pixel coordinates, pixel flux values, peak flux value/position, and (if applicable) merge level.

test_recursion.TestRecursionLimit

Test that we can efficiently compute deep dendrogram trees without hitting the recursion limit. Note that computation of dendrograms deliberately avoids recursion, but plotting and HDF5 import/export currently are recursive, so will not work with deep dendrogram trees.

  • test_compute: Artificially lower the python recursion limit, then compute a dendrogram with 1,000 levels.
  • test_computing_level: Ensure that the level property of each node can be used without hitting recursion limits.
    • BWM: Note to self: there is currently an old line in the code (raising the recursion limit) that needs to be removed.

benchmark

Uses the python timeit module to time the following (using same data as Test3DimensionalData):

  • Computation of a 3-D dendrogram
  • HDF5 import/export
  • plotting of a dendrogram

Other Stuff Braden Has

(not to be included in main repository)

  • Plotting (Matplotlib-based, backend-agnostic):
    • DendrogramPlot (base class - abstracts highlighting of nodes)
    • RecursiveSortPlot
    • SpatialCoordPlot
    • SpatialMeanCoordPlot
  • PyGTK Viewer