Adding pd.query() style selections raises index error #52

@axiomcura

Description

This issue is centered around implementing pd.query()-based selection for the run_pipeline() function. I've encountered this challenge while working on a notebook, and I've documented the problem in a corresponding PR for visibility and collaboration.

To reproduce the error, I've created a small dataset and a standalone notebook. You can find the test data here, and the reproduce_error.ipynb notebook provides the code to recreate the issue.

The error that I receive is this:

KeyError                                  Traceback (most recent call last)
/home/erikserrano/Development/Mitocheck-MAP-analysis/notebooks/mitocheck-map-analysis.ipynb Cell 7 line 3
     33 # execute pipeline on negative control with trianing dataset with cp features
     34 logging.info(f"Running pipeline on CP features using {phenotype} phenotype")
---> 35 cp_negative_training_result = run_pipeline(meta=negative_training_cp_meta,
     36                                         feats=negative_training_cp_feats,
     37                                         pos_sameby=pos_sameby,
     38                                         pos_diffby=pos_diffby,
     39                                         neg_sameby=neg_sameby,
     40                                         neg_diffby=neg_diffby,
     41                                         batch_size=batch_size,
     42                                         null_size=null_size)
     43 map_results_neg_cp.append(cp_negative_training_result)                                       
     45 # execute pipeline on negative control with trianing dataset with dp features

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:115, in run_pipeline(meta, feats, pos_sameby, pos_diffby, neg_sameby, neg_diffby, null_size, batch_size, seed)
    105 def run_pipeline(meta,
    106                  feats,
    107                  pos_sameby,
   (...)
    112                  batch_size=20000,
    113                  seed=0) -> pd.DataFrame:
    114     columns = flatten_str_list(pos_sameby, pos_diffby, neg_sameby, neg_diffby)
--> 115     validate_pipeline_input(meta, feats, columns)
    117     # Critical!, otherwise the indexing wont work
    118     meta = meta.reset_index(drop=True).copy()

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:99, in validate_pipeline_input(meta, feats, columns)
     98 def validate_pipeline_input(meta, feats, columns):
---> 99     if meta[columns].isna().any(axis=None):
    100         raise ValueError('metadata columns should not have null values.')
    101     if len(meta) != len(feats):

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/frame.py:3899, in DataFrame.__getitem__(self, key)
   3897     if is_iterator(key):
   3898         key = list(key)
-> 3899     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3901 # take() does not accept boolean indexers
   3902 if getattr(indexer, "dtype", None) == bool:

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/indexes/base.py:6114, in Index._get_indexer_strict(self, key, axis_name)
   6111 else:
   6112     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6114 self._raise_if_missing(keyarr, indexer, axis_name)
   6116 keyarr = self.take(indexer)
   6117 if isinstance(key, Index):
   6118     # GH 42790 - Preserve name from an Index

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/indexes/base.py:6178, in Index._raise_if_missing(self, key, indexer, axis_name)
   6175     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   6177 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6178 raise KeyError(f"{not_found} not in index")

KeyError: "['Metadata_is_control == 0'] not in index"
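The same KeyError can be reproduced without copairs. The sketch below is a minimal illustration, assuming a toy metadata table (the values, the Metadata_gene column, and the feats array are hypothetical; only Metadata_is_control comes from the traceback). It shows that label-based selection treats the query string as a column name. It also shows one possible workaround, which may not match the semantics intended for run_pipeline(): apply the condition up front and pass only plain column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy metadata; only the column name Metadata_is_control
# is taken from the traceback above.
meta = pd.DataFrame({
    "Metadata_is_control": [0, 0, 1, 1],
    "Metadata_gene": ["a", "b", "a", "b"],
})
# Hypothetical feature matrix with one row per metadata row.
feats = np.arange(8, dtype=float).reshape(4, 2)

# A column list that mixes a real label with a query-style expression.
columns = ["Metadata_gene", "Metadata_is_control == 0"]

# Label-based selection looks the expression up as a column name and fails,
# mirroring the meta[columns] call in validate_pipeline_input().
try:
    meta[columns]
    error_message = None
except KeyError as err:
    error_message = str(err)
print(error_message)

# Possible workaround (an assumption, not the library's documented usage):
# evaluate the condition first, then keep metadata and features aligned.
mask = meta.eval("Metadata_is_control == 0").to_numpy()
meta_subset = meta.loc[mask].reset_index(drop=True)
feats_subset = feats[mask]
```

After this step, only plain column names such as Metadata_gene would need to be passed in the sameby/diffby lists.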

The root cause is traced to the validate_pipeline_input() function, which indexes the metadata with the column list directly and therefore treats a pd.query()-style expression as a column name. I attempted to bypass this issue by commenting out the validation, but that leads to a subsequent problem, as shown below.

ValueError                                Traceback (most recent call last)
/home/erikserrano/Development/Mitocheck-MAP-analysis/notebooks/mitocheck-map-analysis.ipynb Cell 7 line 3
     33 # execute pipeline on negative control with trianing dataset with cp features
     34 logging.info(f"Running pipeline on CP features using {phenotype} phenotype")
---> 35 cp_negative_training_result = run_pipeline(meta=negative_training_cp_meta,
     36                                         feats=negative_training_cp_feats,
     37                                         pos_sameby=pos_sameby,
     38                                         pos_diffby=pos_diffby,
     39                                         neg_sameby=neg_sameby,
     40                                         neg_diffby=neg_diffby,
     41                                         batch_size=batch_size,
     42                                         null_size=null_size)
     43 map_results_neg_cp.append(cp_negative_training_result)                                       
     45 # execute pipeline on negative control with trianing dataset with dp features

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:120, in run_pipeline(meta, feats, pos_sameby, pos_diffby, neg_sameby, neg_diffby, null_size, batch_size, seed)
    118 meta = meta.reset_index(drop=True).copy()
    119 logger.info('Indexing metadata...')
--> 120 matcher = create_matcher(meta, pos_sameby, pos_diffby, neg_sameby,
    121                          neg_diffby)
    123 logger.info('Finding positive pairs...')
    124 pos_pairs = matcher.get_all_pairs(sameby=pos_sameby, diffby=pos_diffby)

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:61, in create_matcher(obs, pos_sameby, pos_diffby, neg_sameby, neg_diffby, multilabel_col)
     59 if multilabel_col:
     60     return MatcherMultilabel(obs, columns, multilabel_col, seed=0)
---> 61 return Matcher(obs, columns, seed=0)

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/matching.py:77, in Matcher.__init__(self, dframe, columns, seed, max_size)
     73         elems = rng.choice(elems, max_size)
     74     return elems
     76 mappers = [
---> 77     reverse_index(dframe[col]).apply(clip_list) for col in dframe
     78 ]
     80 # Create a column order based on the number of potential row matches
     81 # Useful to solve queries with more than one sameby
     82 n_pairs = {}

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/matching.py:22, in reverse_index(col)
     20 def reverse_index(col: pd.Series) -> pd.Series:
     21     '''Build a reverse_index for a given column in the DataFrame'''
---> 22     return pd.Series(col.groupby(col).indices, name=col.name)

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/frame.py:8869, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, observed, dropna)
   8866 if level is None and by is None:
   8867     raise TypeError("You have to supply one of 'by' and 'level'")
-> 8869 return DataFrameGroupBy(
   8870     obj=self,
   8871     keys=by,
   8872     axis=axis,
   8873     level=level,
   8874     as_index=as_index,
   8875     sort=sort,
   8876     group_keys=group_keys,
   8877     observed=observed,
   8878     dropna=dropna,
   8879 )

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/groupby.py:1278, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, observed, dropna)
   1275 self.dropna = dropna
   1277 if grouper is None:
-> 1278     grouper, exclusions, obj = get_grouper(
   1279         obj,
   1280         keys,
   1281         axis=axis,
   1282         level=level,
   1283         sort=sort,
   1284         observed=False if observed is lib.no_default else observed,
   1285         dropna=self.dropna,
   1286     )
   1288 if observed is lib.no_default:
   1289     if any(ping._passed_categorical for ping in grouper.groupings):

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/grouper.py:1020, in get_grouper(obj, key, axis, level, sort, observed, validate, dropna)
   1015         in_axis = False
   1017     # create the Grouping
   1018     # allow us to passing the actual Grouping as the gpr
   1019     ping = (
-> 1020         Grouping(
   1021             group_axis,
   1022             gpr,
   1023             obj=obj,
   1024             level=level,
   1025             sort=sort,
   1026             observed=observed,
   1027             in_axis=in_axis,
   1028             dropna=dropna,
   1029         )
   1030         if not isinstance(gpr, Grouping)
   1031         else gpr
   1032     )
   1034     groupings.append(ping)
   1036 if len(groupings) == 0 and len(obj):

File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/grouper.py:601, in Grouping.__init__(self, index, grouper, obj, level, sort, observed, in_axis, dropna, uniques)
    599 if getattr(grouping_vector, "ndim", 1) != 1:
    600     t = str(type(grouping_vector))
--> 601     raise ValueError(f"Grouper for '{t}' not 1-dimensional")
    603 grouping_vector = index.map(grouping_vector)
    605 if not (
    606     hasattr(grouping_vector, "__len__")
    607     and len(grouping_vector) == len(index)
    608 ):

ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional

It appears that bypassing the validation steps results in a failure to construct the Grouper class.
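The traceback shows DataFrame.groupby being called from reverse_index(dframe[col]), so dframe[col] must have returned a DataFrame rather than a Series. One situation in which pandas does this is a repeated column label. Whether that is what happened here is an assumption, and the frame below is hypothetical. The sketch only demonstrates the pandas behavior and one way to restore 1-D selection.

```python
import pandas as pd

# Hypothetical frame in which one column label appears twice.
dframe = pd.DataFrame(
    [[0, 0, "a"], [1, 1, "b"]],
    columns=["Metadata_is_control", "Metadata_is_control", "Metadata_gene"],
)

# Selecting a repeated label returns every matching column as a DataFrame.
selected = dframe["Metadata_is_control"]
# Selecting a unique label returns a Series.
unique_selected = dframe["Metadata_gene"]

print(type(selected).__name__, selected.ndim)
print(type(unique_selected).__name__, unique_selected.ndim)

# Keeping only the first occurrence of each label restores 1-D selection.
deduped = dframe.loc[:, ~dframe.columns.duplicated()]
print(type(deduped["Metadata_is_control"]).__name__)
```

If the column list handed to the matcher contains the same name more than once, deduplicating it before indexing would avoid the 2-D selection.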

Using the repo

If you would like to explore and test the issue, please feel free to use the dedicated repository I've set up. Here are the steps to get started:

git clone https://github.com/WayScience/Mitocheck-MAP-analysis.git && cd Mitocheck-MAP-analysis
conda env create -f map_env.yaml
conda activate map

These commands will clone the repository, set up the required conda environment using the provided map_env.yaml file, and activate the environment, respectively.

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working
