Assorted bugs and possibly undefined behavior in closest #167

Open
@endrebak

I am using property-based testing with Hypothesis to ensure that pyranges and bioframe return exactly the same results.

This has led me to discover many trifling but annoying bugs.

  1. When no closest interval is found, bf.closest raises an IndexError:
import bioframe as bf
df = bf.from_any([['chr1', 100, 110]], name_col='chrom')
bf.closest(df, df.copy(), ignore_overlaps=True)
~/anaconda3/lib/python3.8/site-packages/bioframe/core/arrops.py in closest_intervals(starts1, ends1, starts2, ends2, k, tie_arr, ignore_overlaps, ignore_upstream, ignore_downstream, direction)
    734     interval1_run_starts = interval1_run_borders[:-1]
    735     interval1_run_ends = interval1_run_borders[1:]
--> 736     closest_ids = closest_ids[
    737         arange_multi(
    738             interval1_run_starts,

IndexError: index 0 is out of bounds for axis 0 with size 0

Suggested solution (this is how you handle the case where df2 has no overlapping chromosomes with df1):

  chrom  start  end chrom_  start_  end_  distance
0  chr1    100  110   <NA>    <NA>  <NA>      <NA>
  2. bf.closest does not handle empty dataframes:
import pandas as pd
# Empty frame with the same columns and dtypes as df
df2 = pd.DataFrame({c: pd.Series([], dtype=t) for c, t in df.dtypes.items()})
bf.closest(df2, df)
~/anaconda3/lib/python3.8/site-packages/bioframe/ops.py in _closest_intidxs(df1, df2, k, ignore_overlaps, ignore_upstream, ignore_downstream, direction_col, tie_breaking_col, cols1, cols2)
   1020
   1021     if len(closest_intidxs) == 0:
-> 1022         return np.ndarray(shape=(0, 2), dtype=np.int)
   1023     closest_intidxs = np.vstack(closest_intidxs)
   1024

~/anaconda3/lib/python3.8/site-packages/numpy/__init__.py in __getattr__(attr)
    282             return Tester
    283
--> 284         raise AttributeError("module {!r} has no attribute "
    285                              "{!r}".format(__name__, attr))
    286

AttributeError: module 'numpy' has no attribute 'int'

Suggested solution: return an empty dataframe with the columns from df2 added.
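
To make both suggestions concrete, here is a minimal sketch of the output shape I have in mind. It uses only pandas and NumPy. `closest_fallback` is a hypothetical helper name, not part of bioframe, and the `_` suffix and `distance` column are simply copied from the example output above. The last line shows one possible replacement for the `np.int` call in the traceback, since that alias was deprecated in NumPy 1.20 and removed in 1.24.

```python
import numpy as np
import pandas as pd


def closest_fallback(df1, df2, suffix="_"):
    """Hypothetical helper: pair every row of df1 with an all-NA match.

    Works for an empty df1 too, in which case the result is an empty
    frame that still carries the suffixed df2 columns and `distance`.
    """
    out = df1.reset_index(drop=True).copy()
    n = len(out)
    for col in df2.columns:
        # Nullable Int64 keeps integer columns integer-typed while holding <NA>.
        is_int = pd.api.types.is_integer_dtype(df2[col].dtype)
        dtype = "Int64" if is_int else "object"
        out[col + suffix] = pd.Series([pd.NA] * n, index=out.index, dtype=dtype)
    out["distance"] = pd.Series([pd.NA] * n, index=out.index, dtype="Int64")
    return out


df = pd.DataFrame({"chrom": ["chr1"], "start": [100], "end": [110]})

# Case 1: one query interval, no closest interval found.
result = closest_fallback(df, df)

# Case 2: empty query frame.
empty = closest_fallback(df.iloc[:0], df)

# An empty (0, 2) index array without the removed np.int alias.
no_pairs = np.empty((0, 2), dtype=np.int64)
```

This is only an illustration of the expected result, not a patch against the actual code paths in ops.py or arrops.py.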


This isn't critical, but it would be nice if you could fix these eventually. Hypothesis stops at the first error it finds, so these bugs prevent me from testing properly.

I made the title general because I might update the issue with more bugs as I find them.
