-
Notifications
You must be signed in to change notification settings - Fork 39
API Proposal
BWM: here is an overview of the core parts of the current python astrodendro API, as well as my comments on what I think should be included/discard/changed for the new repository:
The Dendrogram
class represents a dendrogram or soon-to-be-dendrogram in memory.
-
__init__(data=None, minimum_flux=-np.inf, minimum_npix=0, minimum_delta=0, verbose=True, compute=True)
:
Construct a newDendrogram
with the given parameters and data. Compute the dendrogram ifcompute=True
-
BWM: Include, but I propose changing parameters to
__init__(data=None, min_intensity=-np.inf, min_npix=0, min_delta=0, verbose=True, compute=True)
-
TR: as you discuss below, it's ambiguous whether Dendrogram(filename) is computing or loading a dendrogam. One clean way out of it is to just make the computation a 'factory' function, so that one would do:
>>> d = compute_dendrogram('mycube.fits', minimum_flux=1., ...) >>> type(d) dendro.Dendrogram
that is, the Dendrogam class itself is agnostic about the computation, which might actually be cleaner. Then Dendrogram could either take an argument at initialization to load a dendogram:
>>> d = Dendrogram('my_dendrogram.fits')
and if the file is not a valid dendrogram, we raise an error saying 'to compute a dendrogram, use: compute_dendrogam(...)'. Anyway, this is just a thought, but it would allow us to disambiguate the API, and separate generation from representation. I find it more elegant than having a
compute=True/False
argument. Another way to do this is to use a class method:>>> d = Dendrogram.compute('mycube.fits', ...)
and just to clarify, this is not the same as a method where one would do:
>>> d = Dendrogram() >>> d.compute('mycube.fits')
-
CB: I'm in favor of requiring a specific call to
compute
, and a constructor that only sets paramaters. One could also use a utility function or class/static method to read files from disk. Something liked = Dendrogram.read_hd5(...)
ord = load_dendrogram_hd5(...)
. The latter approach is used by Numpy and Pandas. -
TR: from my point of view, the parameters are specific to the computation, so I would set those in the call to compute. I like the suggestion of class methods for reading in. I guess this means that we would have the following:
# Intialize empty dendrogram d = Dendrogram() # Read dendrogram from file d = Dendrogram.from_hdf5(...) # Compute dendrogram from data d = Dendrogram.compute('data.fits', delta=3., ...)
so that
__init__
basically does not take any options. Both reading and computing are done explicitly via class methods. The latter two can also be functions instead of class methods, though I have a slight preference for class methods.
-
-
compute(minimum_flux, minimum_npix, minimum_delta, verbose=False)
:
[Re-]compute the dendrogram with the given parameters- BWM: I'm not sure if this needs to be separate from the above, at least in the public API.
-
item_at(coords)
: Returns theBranch
orLeaf
at the given coordinates, if any.- TR: I presume this would be pixel coordinates (i.e. no WCS awareness) - right?
- BWM: Yes, exactly. Pixel coordinates only.
coords
is an integer tuple.
- BWM: Yes, exactly. Pixel coordinates only.
- CB: This should accept lists/arrays of values, and return the appropriate collection of objects.
- BWM: Nice idea.
- TR: I presume this would be pixel coordinates (i.e. no WCS awareness) - right?
-
trunk
: a list of the trunk nodes (the nodes with no parent). A collection ofBranch
and/orLeaf
objects.- BWM: May be better to rename this to 'items' or 'children' for consistency with the
Branch
API.
- BWM: May be better to rename this to 'items' or 'children' for consistency with the
-
all_items
: returns a flat list of allLeaf
andBranch
objects in the dendrogram (all nodes) -
min_flux
: get the value of theminimum_flux
when the dendrogram was computed. -
min_delta
: get the value of theminimum_delta
when the dendrogram was computed. -
min_npix
: get the value of theminimum_npix
when the dendrogram was computed.- TR: it might be sensible to use the same names for the keyword arguments as for the attributes
- BWM: agreed. Having typed these out many types, I'm strongly in favor of the shorter form.
- TR: it might be sensible to use the same names for the keyword arguments as for the attributes
-
items_dict
: internal dictionary containing references to all nodes in the dendrogram, indexed byidx
. Should probably have a_
in front of it and be made private.
These are things I have working in the current Python implementation that can easily be included if we want.
-
get_leaves
: Return a flattened list of all leaves in the dendrogram- TR: sounds reasonable
-
to_newick
: Create a string containing the Newick representation of the dendrogram -
to_hdf5(filename)
/from_hdf5(filename)
: HDF5 import/export.- BWM: I don't really like that to create a dendrogram from data in memory, one goes
d = Dendrogram(my_data)
but to load from and HDF5 file, one goesd = Dendrogram(); d.from_hdf5(my_filename);
- a consistent API would be nice if we include any file import/export. - TR: one option is to simply not allow any filenames to be passed at initialization. So we'd have a 'compute' method, and 'from_hdf5'/'to_hdf5', and 'from_fits/to_fits'?
- CB: See my comments above for another option for reading dendrograms from disk
- BWM: I don't really like that to create a dendrogram from data in memory, one goes
Each node of the dendrogram is either a Branch
or a Leaf
. A Leaf
is a node with no children, and represents a set of associated pixels. A branch consists of two or more children which may be branches or leaves, as well as pixels associated with the branch itself.
Each Branch
/Leaf
has the following API:
-
coords
: array of coordinate tuples that correspond with the flux values inf
- TR: I think this should be allowed to be either a list of tuples, or a 2-dimensional Numpy array with dimensions NxM, where N is the number of flux values, and M is the number of dimensions.
- CB: The output should definitely be something that can index the original data array (e.g. a boolean mask or tuple of numpy arrays)
- BWM: both
coords
andf
practically have to be Python lists during computation, due to the way they are dynamically resized. I think Tom's original version used numpy arrays, and changing it to use python lists was one of the most significant optimizations I made. However, we can certainly provide methods to return numpy arrays after the dendrogram has been fully computed. - CB: Fair enough -- we just definitely need some equivalent of
fluxes = original_data[branch.indexable_coords]
(f
also provides this, but being able to index another dataset spatially registered to the input is crucial). Perhaps, like you say, coords and f should be converted at the end ofcompute
.
-
f
: list of flux values.- TR: this could be a list or an array (CB: +1 for arrays)
-
fmin
,fmax
: the minimum and maximum flux values associated with this node. -
parent
: The parentBranch
object, orNone
-
idx
: a unique, arbitrary ID number to represent this node. -
height
: (read only) the difference in flux value betweenfmax
and the parent's merge level (or, if no parent exists,height
gives (fmax
-fmin
) -
add_point(coord, f)
Add a pixel to the current leaf. Coord is the coordinate tuple andf
is the flux value of the pixel. -
merge(leaf)
: Add all of the pixels associated withleaf
to this leaf/branch.- CB: How much care needs to be taken when using
add_point
andmerge
? I'm guessing these are used for computing the dendrogram - could the user corrupt the tree by calling them inappropriately aftercompute
? If so, these methods should be hidden in some way. However, it would be very cool if these methods could safely be used to tweak the dendrogram after it's constructed.- BWM: At the moment, they are actually meant for internal use only, so should probably be prefixed with an underscore and kept private. The reason is that the dendrogram has an internal
index_map
structure that must be kept up-to-date, and these methods do not update it. In fact, the dendrogramLeaf
andBranch
objects don't even have a reference to theDendrogram
object so they cannot modify properties global to the whole dendrogram. In future, we could add aforce_merge(node1, node2)
orforce_pixel(coord, node)
method at theDendrogram
level.
- BWM: At the moment, they are actually meant for internal use only, so should probably be prefixed with an underscore and kept private. The reason is that the dendrogram has an internal
- CB: How much care needs to be taken when using
Each Branch
has the following additional properties:
-
items
: a list of the child nodes of this branch. (typically twoBranch
orLeaf
objects)- CB: the label
items
might be too ambiguous (it's used above to refer to all the leaves/branches in the dendrogram). Maybechildren
?
- CB: the label
-
merge_level
: the flux value that triggered the merger of the child items (required for plotting).
The constructors are unique to each type:
-
Leaf(coord, f, idx=None)
: Create a new leaf, associated with the pixel at coordinatescoord
(a tuple) which has the valuef
.idx
is an integer ID number for the new leaf. -
Branch(items, coord, f, idx=None)
: Create a new branch, which connects all of the nodes passed asitems
. Add the pixel with fluxf
, coordinatescoord
, to the new branch.- CB: Again, can we rename
items
to something likechildren
?- BWM: agreed, 'children' would be better
- CB: Again, can we rename
Optional API: These are methods that exist in the current Python version, but which are probably not required for computation: (CB: I'm in favor with including many of these)
-
add_footprint(image, level)
:image
is an array; use the coordinates associated with the current node to set every corresponding pixel inimage
to the valuelevel
.- CB: Maybe rename to something like
fill_footprint
orfill_image
?- BWM: agreed.
- CB: Maybe rename to something like
-
npix
,npix_self
: These read-only properties provide the number of pixels in the node itself or in the node and its children -
f_sum
,f_sum_self
: These read-only properties provide the sum of the flux values associated with the node itself or with the node and its children- TR: I think this is too ambiguous to include - I talked to someone the other day who was using it for a leaf thinking it was the sum of
f - merge_level
(i.e. the 'background-subtracted' flux). At least initially, I think we should leave this out. - CB: What about signatures like
npix(self=False)
instead of two methods? This signature should be used fornpix
,f_sum
,coords
, andpeak
(and maybefmin
andfmax
, though what is the difference betweenfmax
andpeak
)? Also, I'd like anintensities(self=False)
method that returns a numpy array of intensities (in the original data) for this structure.- BWM: That is nicer. I like it. However,
self
is a Python keyword, so we should do something likechildren=True
ordescend=False
orself_only=False
perhaps? - BWM: FYI, the difference between
fmax
andpeak
is that the former returns only the peak flux value, whereas the latter also returns the coordinates at which the peak flux value is found.
- BWM: That is nicer. I like it. However,
- TR: I think this is too ambiguous to include - I talked to someone the other day who was using it for a leaf thinking it was the sum of
-
level
: read-only property that gives 0 for items in the trunk, 1 for their immediate children, etc. -
ancestor
: read-only property giving the ancestor (the only connected node that has no parent) of this leaf/branch recursively. Always returns an object (may return self if the object has no parent). Results are partially memoized/cached to reduce recursion depth.- BWM: I found that an efficient, memoized implementation of this property was necessary for my analysis code to run quickly. I think it would be worth including.
- CB: Let's also add an
ancestors
method which returns all nodes visited from tracing this node to the trunk.
-
newick
: read only property - returns a string with a Newick representation of the node- TR: I wonder whether
to_newick()
would make more sense for consistency with the Dendrogram method?
- TR: I wonder whether
-
peak
: read-only property. Return tuple (coordinates, flux) for the pixel with maximum flux value- TR: in the item, or all its children? Or is this a Leaf-only attribute?
- CB: See above. Maybe a
(self=True) option
to distinguish between the two cases. A name likeargmax
would be more consistent with the Numpy API. - BWM: It was not a leaf-only value; for a
Branch
it would return the peak flux value associated with the branch itself, not looking at pixels contained by its children. I think the best would be to do what Chris is suggesting and have apeak(self=False)
method that always returns(node, coordinate, flux)
(ifself
isTrue
, then thenode
return value is simply the current node itself.)
-
get_peak_recursive
: Return a tuple with the (node, coordinate, and flux) value of the pixel with highest flux value found within this leaf.- CB: Seems too specialized
- BWM: Hmmm... in my work this came up often enough that
node,coord,flux = some_node.peak(self=False)
was preferable but it is true that this could be replaced bycoord,flux = some_node.peak(self=False); node=dendrogram.item_at(coord);
-
descendants
: Read only. Returns a flattened list of all child leaves and branches. Non-recursive. Useful.
- test_dendrogramWithNan: Uses an 8x4 array of values (including many 'NaN' values) to compute a dendrogram, and checks basic properties of the resulting dendrogram, such as # of leaves and branches in the trunk. See implementation.
-
test_dendrogramWithConstBackground: Test a highly artificial 14x14 array containing a lot of equal flux values. Only checks the total # of nodes in the resulting dendrogram. See implementation.
- This test does exhibit a potential issue, which is commented upon in the code.
-
test_dendrogramComputation: Computes a dendrogram from a 107x107x602 pixel data cube where all the flux values greater than 1.4 represent real data from L1448 13co, but all other (lesser) values are randomly generated. Computes a dendrogram using minimum_npix=8, minimum_delta=0.3, minimum_flux=1.4. Checks that the number of leaves in the resulting dendrogram matches a result known to be correct. Iterates through every pixel in the data cube and double-checks that the
item_at(coord)
method returns the correct node object for each coordinate. See implementation.
-
test_4dim: Programmatically construct a 5x5x5x5 data cube such that we can understand what the resulting dendrogram should be, then compute the dendrogram and check that the
Leaf
peaks are in the right places and the nodes are connected as expected. See implementation.
- test_write: Compute a dendrogam and write it to an HDF5 file. Only fails if an exception occurs.
- test_read: Construct a dendrogram from an 8x4x3 data cube. Write it to an HDF5 file. Load the HDF5 into memory as a new dendrogram. Iterate through every node in the original dendrogram and ensure the corresponding node in the dendrogram read from the HDF5 file has the same type (Branch vs. Leaf), pixel coordinates, pixel flux values, peak flux value/position, and (if applicable) merge level.
Test that we can efficiently compute deep dendrogram trees without hitting the recursion limit. Note that computation of dendrograms deliberately avoids recursion, but plotting and HDF5 import/export currently are recursive, so will not work with deep dendrogram trees.
- test_compute: Artificially lower the python recursion limit, then compute a dendrogram with 1,000 levels.
-
test_computing_level: Ensure that the
level
property of each node can be used without hitting recursion limits.- BWM: Note to self: there is currently an old line in the code (raising the recursion limit) that needs to be removed.
Uses the python timeit
module to time the following (using same data as Test3DimensionalData
):
- Computation of a 3-D dendrogram
- HDF5 import/export
- plotting of a dendrogram
(not to be included in main repository)
- Plotting (Matplotlib-based, backend-agnostic):
-
DendrogramPlot
(base class - abstracts highlighting of nodes) RecursiveSortPlot
SpatialCoordPlot
SpatialMeanCoordPlot
-
- PyGTK Viewer