Add hierarchical FDR correction for grouped hypotheses #115

## Problem

When testing compound × dose combinations, the current Benjamini-Hochberg (BH) FDR correction treats all tests as one flat family of unrelated hypotheses. For example, with 1000 compounds × 5 doses = 5000 tests, BH ranks every p-value against all 5000 tests, so the correction is very stringent.

But doses of the same compound are **not independent**: they test the same underlying biological hypothesis ("does compound X have a phenotype?"). Correcting them as if they were unrelated leads to over-correction.

## Proposed Solution

Implement **hierarchical FDR** as an option:

```
Stage 1: Test at group level (e.g., compound)
  - Use minimum p-value within group as the group p-value
  - Apply BH at level q
  - A compound passes if ANY dose is significant

Stage 2: Test within groups (only for groups that passed Stage 1)
  - For each significant group, apply BH at level q to its members
  - Report member-level results
```
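The two stages above can be sketched in Python as follows. This is a minimal sketch, not the project's implementation: `bh_reject` and `hierarchical_fdr` are hypothetical helper names, the column names `compound` and `p_value` are assumptions, and BH is written out in NumPy only to keep the example self-contained. It follows the proposal literally and uses the raw minimum p-value per group.

```python
import numpy as np
import pandas as pd


def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    if m == 0:
        return reject
    order = np.argsort(p)
    # Rank i (1-based) is compared against q * i / m.
    passed = p[order] <= q * np.arange(1, m + 1) / m
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject


def hierarchical_fdr(df, group_col, p_col="p_value", q=0.05):
    """Two-stage correction (hypothetical helper, assumed column names)."""
    out = df.reset_index(drop=True)

    # Stage 1: raw minimum p-value per group, then BH across groups.
    group_p = out.groupby(group_col)[p_col].min()
    group_pass = pd.Series(bh_reject(group_p.to_numpy(), q), index=group_p.index)
    out["group_significant"] = out[group_col].map(group_pass).astype(bool)

    # Stage 2: BH within each group that passed Stage 1.
    out["significant"] = False
    for _, members in out[out["group_significant"]].groupby(group_col):
        out.loc[members.index, "significant"] = bh_reject(
            members[p_col].to_numpy(), q
        )
    return out


# Toy data: 3 compounds x 3 doses.
toy = pd.DataFrame(
    {
        "compound": list("AAABBBCCC"),
        "p_value": [0.001, 0.2, 0.9, 0.4, 0.5, 0.6, 0.01, 0.012, 0.8],
    }
)
result = hierarchical_fdr(toy, group_col="compound", q=0.05)
# A and C pass Stage 1; B does not, so none of its doses are tested in Stage 2.
print(result)
```

Doses in groups that fail Stage 1 are simply reported as not significant here; whether they should instead carry a missing value is a design choice left open.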

### Why min p-value instead of Simes?

For dose-response data, low doses are *expected* to be inactive. Simes' method penalizes compounds for having inactive low doses, which is biologically normal. Min p-value is more appropriate: **a compound passes Stage 1 if ANY dose shows activity**.
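To make the difference concrete, the sketch below computes both group-level statistics for one compound with a single active dose. The function names `simes_p` and `min_p` and the p-values are illustrative assumptions. For n doses the Simes value always lies between min p and n × min p, so the penalty for inactive doses is at most a factor of n. One caveat worth keeping in mind: the raw minimum of n p-values is not uniformly distributed under the null, so using it unadjusted at Stage 1 is more liberal than Simes.

```python
import numpy as np


def simes_p(pvals):
    """Simes combined p-value: min over ranks i of n * p_(i) / i."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = p.size
    return float(np.min(n * p / np.arange(1, n + 1)))


def min_p(pvals):
    """Raw minimum p-value, as proposed for Stage 1."""
    return float(np.min(pvals))


# One compound, five doses: only the highest dose is active.
doses = [0.9, 0.7, 0.6, 0.4, 0.001]
print("min-p:", min_p(doses))
print("Simes:", simes_p(doses))
```

With these numbers the Simes value is five times the minimum p-value, because only the rank-1 term is small.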

### Benefits
- Provides dose-level (or other sub-group) inference
- Much less harsh correction than treating all tests as independent
- Users specify grouping structure via metadata columns

### Example
- 1000 compounds × 5 doses = 5000 raw tests
- Stage 1: 1000 compound tests → suppose 50 pass (any dose active)
- Stage 2: 50 × 5 = 250 dose tests, corrected in groups of 5
- Result: dose-level significance with appropriate correction
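One way to see where the gain comes from is to compare the strictest BH threshold (the one applied to the rank-1 p-value, q / m) in each setting. The snippet below is plain arithmetic on the numbers in this example, assuming q = 0.05; the variable names are illustrative.

```python
q = 0.05
n_compounds, n_doses = 1000, 5

# Rank-1 BH threshold is q / m, where m is the size of the tested family.
flat_first = q / (n_compounds * n_doses)  # flat BH over 5000 tests
stage1_first = q / n_compounds            # Stage 1 over 1000 compounds
stage2_first = q / n_doses                # Stage 2 within one compound

print(f"flat BH : {flat_first:.1e}")
print(f"stage 1 : {stage1_first:.1e}")
print(f"stage 2 : {stage2_first:.1e}")
```

The Stage 1 threshold is 5 times looser than the flat one, and the Stage 2 threshold is 1000 times looser, which is why dose-level calls become easier once a compound has passed.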

## API Design

Add parameter to `mean_average_precision()`:

```python
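from typing import List, Optional
import pandas as pd

def mean_average_precision(
    ap_scores: pd.DataFrame,
    sameby: List[str],  # e.g., ['compound', 'dose']
    hierarchical_by: Optional[List[str]] = None,  # NEW: e.g., ['compound']
    # ... remaining existing parameters unchanged
):
    ...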
```

When `hierarchical_by` is specified:
1. `sameby` defines the granularity of mAP calculation (e.g., per compound×dose)
2. `hierarchical_by` defines the grouping for Stage 1 correction (e.g., per compound)
3. Stage 2 correction happens within each group that passed Stage 1
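Since Stage 1 groups the per-`sameby` results, it seems reasonable to require that every column in `hierarchical_by` also appears in `sameby`. The check below is a sketch of that assumption only; `validate_hierarchical_by` is a hypothetical helper and not part of the existing API.

```python
from typing import List, Optional


def validate_hierarchical_by(
    sameby: List[str], hierarchical_by: Optional[List[str]]
) -> List[str]:
    """Return the Stage 1 grouping columns, or [] for flat correction."""
    if not hierarchical_by:
        return []  # hierarchical correction not requested
    # Assumption: grouping columns must be a subset of the sameby columns.
    missing = [col for col in hierarchical_by if col not in sameby]
    if missing:
        raise ValueError(
            f"hierarchical_by columns {missing} are not in sameby {sameby}"
        )
    return list(hierarchical_by)


print(validate_hierarchical_by(["compound", "dose"], ["compound"]))
```

Passing `hierarchical_by=None` keeps the current flat behaviour in this sketch, so existing callers would be unaffected.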

## Benchmark (LINCS data)

On LINCS data (4 plates, 58 compounds × 6 doses = 348 tests):
- Flat BH: 26 significant doses
- Hierarchical with min-p: 49 significant doses (**88% more discoveries**, (49 - 26) / 26)

## Additional Context

Related bug to fix: `silent_thread_map` in `map.py` does not accept the `leave` keyword argument, which raises a `TypeError` when `progress_bar=False`.
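The current body of `silent_thread_map` is not shown in this issue, so the following is only a guess at one possible fix. It assumes the function is a progress-bar-free stand-in for `tqdm.contrib.concurrent.thread_map` and should therefore accept, and ignore, tqdm-only keyword arguments such as `leave`. The signature here is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def silent_thread_map(fn, *iterables, max_workers=None, **tqdm_kwargs):
    """Map fn over iterables in threads without showing a progress bar.

    Assumed signature. tqdm-only options (e.g. leave, desc) are accepted
    through **tqdm_kwargs and deliberately ignored.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fn, *iterables))


# The call that previously failed now works; results keep input order.
print(silent_thread_map(lambda x: x * 2, [1, 2, 3], leave=False))
```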
