Description
This issue concerns adding support for pd.query()-style selection in the run_pipeline() function. I encountered this problem while working on a notebook, and I've documented it in a corresponding PR for visibility and collaboration.
To reproduce the error, I created a small dataset and a standalone notebook. You can find the test data here, which I used to reproduce the error in the referenced notebook. Additionally, the reproduce_error.ipynb notebook provides the code to recreate the issue.
The error that I receive is this:
KeyError Traceback (most recent call last)
/home/erikserrano/Development/Mitocheck-MAP-analysis/notebooks/mitocheck-map-analysis.ipynb Cell 7 line 3
33 # execute pipeline on negative control with trianing dataset with cp features
34 logging.info(f"Running pipeline on CP features using {phenotype} phenotype")
---> 35 cp_negative_training_result = run_pipeline(meta=negative_training_cp_meta,
36 feats=negative_training_cp_feats,
37 pos_sameby=pos_sameby,
38 pos_diffby=pos_diffby,
39 neg_sameby=neg_sameby,
40 neg_diffby=neg_diffby,
41 batch_size=batch_size,
42 null_size=null_size)
43 map_results_neg_cp.append(cp_negative_training_result)
45 # execute pipeline on negative control with trianing dataset with dp features
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:115, in run_pipeline(meta, feats, pos_sameby, pos_diffby, neg_sameby, neg_diffby, null_size, batch_size, seed)
105 def run_pipeline(meta,
106 feats,
107 pos_sameby,
(...)
112 batch_size=20000,
113 seed=0) -> pd.DataFrame:
114 columns = flatten_str_list(pos_sameby, pos_diffby, neg_sameby, neg_diffby)
--> 115 validate_pipeline_input(meta, feats, columns)
117 # Critical!, otherwise the indexing wont work
118 meta = meta.reset_index(drop=True).copy()
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:99, in validate_pipeline_input(meta, feats, columns)
98 def validate_pipeline_input(meta, feats, columns):
---> 99 if meta[columns].isna().any(axis=None):
100 raise ValueError('metadata columns should not have null values.')
101 if len(meta) != len(feats):
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/frame.py:3899, in DataFrame.__getitem__(self, key)
3897 if is_iterator(key):
3898 key = list(key)
-> 3899 indexer = self.columns._get_indexer_strict(key, "columns")[1]
3901 # take() does not accept boolean indexers
3902 if getattr(indexer, "dtype", None) == bool:
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/indexes/base.py:6114, in Index._get_indexer_strict(self, key, axis_name)
6111 else:
6112 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 6114 self._raise_if_missing(keyarr, indexer, axis_name)
6116 keyarr = self.take(indexer)
6117 if isinstance(key, Index):
6118 # GH 42790 - Preserve name from an Index
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/indexes/base.py:6178, in Index._raise_if_missing(self, key, indexer, axis_name)
6175 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
6177 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 6178 raise KeyError(f"{not_found} not in index")
KeyError: "['Metadata_is_control == 0'] not in index"
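The failure can be reproduced outside the pipeline in a few lines. This is a minimal sketch (the column name comes from the error message above; everything else is assumed): validate_pipeline_input() indexes meta[columns], so a pd.query()-style expression in the grouping arguments is treated as a missing column label.

```python
import pandas as pd

# Hypothetical minimal metadata frame with one real column
meta = pd.DataFrame({"Metadata_is_control": [0, 1, 0]})

# A mix of a real column label and a query expression, as run_pipeline()
# would see when a pd.query()-style string is passed in pos_sameby et al.
columns = ["Metadata_is_control", "Metadata_is_control == 0"]

try:
    meta[columns]  # label-based indexing, not query evaluation
except KeyError as err:
    print(err)  # the query string is reported as a missing column
```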
The root cause is in the validate_pipeline_input() function, which does not recognize pd.query()-style expressions and treats them as column names. I attempted to bypass the issue by commenting out the validation, but that leads to a subsequent problem, as shown below.
ValueError Traceback (most recent call last)
/home/erikserrano/Development/Mitocheck-MAP-analysis/notebooks/mitocheck-map-analysis.ipynb Cell 7 line 3
33 # execute pipeline on negative control with trianing dataset with cp features
34 logging.info(f"Running pipeline on CP features using {phenotype} phenotype")
---> 35 cp_negative_training_result = run_pipeline(meta=negative_training_cp_meta,
36 feats=negative_training_cp_feats,
37 pos_sameby=pos_sameby,
38 pos_diffby=pos_diffby,
39 neg_sameby=neg_sameby,
40 neg_diffby=neg_diffby,
41 batch_size=batch_size,
42 null_size=null_size)
43 map_results_neg_cp.append(cp_negative_training_result)
45 # execute pipeline on negative control with trianing dataset with dp features
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:120, in run_pipeline(meta, feats, pos_sameby, pos_diffby, neg_sameby, neg_diffby, null_size, batch_size, seed)
118 meta = meta.reset_index(drop=True).copy()
119 logger.info('Indexing metadata...')
--> 120 matcher = create_matcher(meta, pos_sameby, pos_diffby, neg_sameby,
121 neg_diffby)
123 logger.info('Finding positive pairs...')
124 pos_pairs = matcher.get_all_pairs(sameby=pos_sameby, diffby=pos_diffby)
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/map.py:61, in create_matcher(obs, pos_sameby, pos_diffby, neg_sameby, neg_diffby, multilabel_col)
59 if multilabel_col:
60 return MatcherMultilabel(obs, columns, multilabel_col, seed=0)
---> 61 return Matcher(obs, columns, seed=0)
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/matching.py:77, in Matcher.__init__(self, dframe, columns, seed, max_size)
73 elems = rng.choice(elems, max_size)
74 return elems
76 mappers = [
---> 77 reverse_index(dframe[col]).apply(clip_list) for col in dframe
78 ]
80 # Create a column order based on the number of potential row matches
81 # Useful to solve queries with more than one sameby
82 n_pairs = {}
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/copairs/matching.py:22, in reverse_index(col)
20 def reverse_index(col: pd.Series) -> pd.Series:
21 '''Build a reverse_index for a given column in the DataFrame'''
---> 22 return pd.Series(col.groupby(col).indices, name=col.name)
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/frame.py:8869, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, observed, dropna)
8866 if level is None and by is None:
8867 raise TypeError("You have to supply one of 'by' and 'level'")
-> 8869 return DataFrameGroupBy(
8870 obj=self,
8871 keys=by,
8872 axis=axis,
8873 level=level,
8874 as_index=as_index,
8875 sort=sort,
8876 group_keys=group_keys,
8877 observed=observed,
8878 dropna=dropna,
8879 )
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/groupby.py:1278, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, observed, dropna)
1275 self.dropna = dropna
1277 if grouper is None:
-> 1278 grouper, exclusions, obj = get_grouper(
1279 obj,
1280 keys,
1281 axis=axis,
1282 level=level,
1283 sort=sort,
1284 observed=False if observed is lib.no_default else observed,
1285 dropna=self.dropna,
1286 )
1288 if observed is lib.no_default:
1289 if any(ping._passed_categorical for ping in grouper.groupings):
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/grouper.py:1020, in get_grouper(obj, key, axis, level, sort, observed, validate, dropna)
1015 in_axis = False
1017 # create the Grouping
1018 # allow us to passing the actual Grouping as the gpr
1019 ping = (
-> 1020 Grouping(
1021 group_axis,
1022 gpr,
1023 obj=obj,
1024 level=level,
1025 sort=sort,
1026 observed=observed,
1027 in_axis=in_axis,
1028 dropna=dropna,
1029 )
1030 if not isinstance(gpr, Grouping)
1031 else gpr
1032 )
1034 groupings.append(ping)
1036 if len(groupings) == 0 and len(obj):
File ~/Programs/miniconda3/envs/map/lib/python3.12/site-packages/pandas/core/groupby/grouper.py:601, in Grouping.__init__(self, index, grouper, obj, level, sort, observed, in_axis, dropna, uniques)
599 if getattr(grouping_vector, "ndim", 1) != 1:
600 t = str(type(grouping_vector))
--> 601 raise ValueError(f"Grouper for '{t}' not 1-dimensional")
603 grouping_vector = index.map(grouping_vector)
605 if not (
606 hasattr(grouping_vector, "__len__")
607 and len(grouping_vector) == len(index)
608 ):
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
It appears that bypassing the validation step only pushes the failure downstream, where the query expression breaks construction of the Grouper.
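Until query expressions are supported inside run_pipeline(), one possible workaround is to evaluate the query up front and subset both the metadata and the features before calling the pipeline, passing only plain column names in pos_sameby and friends. This is an untested sketch; the frame contents and the Metadata_gene column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Assumed stand-ins for negative_training_cp_meta / negative_training_cp_feats
meta = pd.DataFrame({"Metadata_is_control": [0, 1, 0],
                     "Metadata_gene": ["a", "b", "c"]})
feats = np.arange(6).reshape(3, 2)

# Evaluate the query once, then subset metadata and features consistently
mask = meta.eval("Metadata_is_control == 0")
meta_sub = meta[mask].reset_index(drop=True)
feats_sub = feats[mask.to_numpy()]

# run_pipeline(meta=meta_sub, feats=feats_sub,
#              pos_sameby=["Metadata_gene"], ...)  # plain column names only
```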
Using the repo
If you would like to explore and test the issue, please feel free to use the dedicated repository I've set up. Here are the steps to get started:
git clone https://github.com/WayScience/Mitocheck-MAP-analysis.git && cd Mitocheck-MAP-analysis
conda env create -f map_env.yaml
conda activate map
These commands clone the repository, create the required conda environment from the provided map_env.yaml file, and activate it, respectively.