Updates docs for geneformer training, inference, and cellxclassification (#823)
### Description
Updates the documentation for the Geneformer 10M and 106M models, including their respective training curves and MLM loss benchmark scores.
All data is also tracked in this Google Sheet:
https://docs.google.com/spreadsheets/d/1OB28ArwR_-huNyfi4M2I_Q8jKEpvcINNqLhd-LGqKBY/edit?gid=521924651#gid=521924651
### Type of changes
<!-- Mark the relevant option with an [x] -->
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [X] Documentation update
- [ ] Other (please describe):
### CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:
- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing
> [!NOTE]
> By default, the notebook validation tests are skipped unless explicitly enabled.
#### Authorizing CI Runs
We use
[copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation)
to manage authorization of CI
runs on NVIDIA's compute resources.
* If a pull request is opened by a trusted user and contains only
trusted changes, the pull request's code will
automatically be copied to a pull-request/ prefixed branch in the source
repository (e.g. pull-request/123)
* If a pull request is opened by an untrusted user or contains untrusted
changes, an NVIDIA org member must leave an
`/ok to test` comment on the pull request to trigger CI. This will need
to be done for each new commit.
### Usage
<!--- How does a user interact with the changed code -->
```python
TODO: Add code snippet
```
### Pre-submit Checklist
<!--- Ensure all items are completed before submitting -->
- [ ] I have tested these changes locally
- [X] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully
Signed-off-by: Jonathan Mitchell <[email protected]>
docs/docs/models/geneformer.md (16 additions, 34 deletions)
@@ -1,11 +1,4 @@
 # Geneformer
-!!! note "Current checkpoints trained in BioNeMo1"
-
-    This document references performance numbers and runtime engines that are from the bionemo v1 variant of the model.
-    These numbers will be updated in a coming release to reflect the new bionemo v2 codebase. The model architecture and
-    training information will be the same, as checkpoints are converted from bionemo v1 format to v2 format. Benchmarks below
-    are annotated with which version of bionemo generated them. Accuracy should be the same within a small epsilon
-    since we have tests in place showing model equivalency between the two versions.

 ## Model Overview
@@ -155,32 +148,21 @@ NVIDIA believes Trustworthy AI is a shared responsibility and we have established

 ## Training diagnostics

-### geneformer-10M-240530
-
-This checkpoint was trained for approximately 11 epochs through the CELLxGENE split. Training was performed on 8 servers with 8 A100 GPUs each for a total of 115430 steps of per-gpu micro batch size 32 and global batch size of 2048. Training took a total of 1 day, 20 hours and 19 minutes of wallclock time. As can be seen in the following image, training and validation curves both decreased fairly smoothly throughout the course of training. In fact validation (blue) and training (orange) loss were both still decreasing at the end of 11 epochs through the dataset. The model could likely be trained for more epochs without overfitting.
-
-
-!!! note "Training curves from BioNeMo1"
-
-    Note that these curves were generated on BioNeMo1. We see the same general training curves in our initial testing of
-    BioNeMo2, however. In the following figure the blue line is the previous training run of the 10M model and the
-    red curve is an equivalent training run on BioNeMo2. As we release new checkpoints they will be trained on BioNeMo2.
 Training was performed on 8 servers with 8 A100 GPUs each for a total of 81485 steps using the CELLxGENE split with a per-gpu micro batch size 32 and global batch size of 2048. Training took a total of 4 days, 8 hours of wallclock time. As can be seen in the following images, training and validation curves both decreased fairly smoothly throughout the course of training.

-This checkpoint was trained for approximately 11 epochs through the CELLxGENE split. Training was performed on 16 servers with 8 A100 GPUs each for a total of 115430 steps of per-gpu micro batch size 16 and global batch size of 2048. Training took a total of 3 days, 18 hours and 55 minutes of wallclock time. As can be seen in the following image, training and validation curves both decreased fairly smoothly throughout the course of training. In fact validation (blue) and training (orange) loss were both still decreasing at the end of 11 epochs through the dataset. The model could likely be trained for more epochs without overfitting.
-
+
+

-Additionally, validation loss decreased both faster and continued to decrease at the same improved rate throughout training in the 106M parameter model (red) as compared to the 10M parameter model (blue). It would be interesting to test even larger models to see if we continue to observe improved performance in larger models.
-
 This checkpoint was trained for approximately 35,650 steps using the CELLxGENE split. Training was performed on 16 servers with 8 A100 GPUs each for a total of 35,650 steps using the CELLxGENE split with a per-gpu micro batch size 16 and global batch size of 2,048. Training took a total of 8 hours of wallclock time. As can be seen in the following image, training and validation curves both decreased fairly smoothly throughout the course of training.

-As stated in the previous section, the figures are from our BioNeMo1 code base where these checkpoints were originally
-trained. As we release new checkpoints they will be trained on BioNeMo2.
+
+

 ## Benchmarking
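The training configurations quoted in the hunk above can be sanity-checked with a little arithmetic: the global batch size is the per-GPU micro batch size multiplied by the number of data-parallel GPUs (and any gradient-accumulation factor), and steps times global batch size gives the number of cells drawn from the CELLxGENE split. The sketch below is illustrative only; it assumes no gradient accumulation, and the node counts, micro batch sizes, and step counts are the ones quoted above.

```python
# Illustrative sanity check of the training configurations quoted above.
# Assumption: pure data parallelism with a gradient-accumulation factor of 1.

def global_batch_size(micro_batch: int, nodes: int, gpus_per_node: int, grad_accum: int = 1) -> int:
    """Global batch = micro batch * number of data-parallel GPUs * gradient accumulation."""
    return micro_batch * nodes * gpus_per_node * grad_accum

# geneformer-10M-240530: 8 servers x 8 A100 GPUs, per-GPU micro batch size 32
assert global_batch_size(micro_batch=32, nodes=8, gpus_per_node=8) == 2048

# geneformer-106M-240530: 16 servers x 8 A100 GPUs, per-GPU micro batch size 16
assert global_batch_size(micro_batch=16, nodes=16, gpus_per_node=8) == 2048

# Cells processed in the 81485-step run at global batch size 2048:
print(f"{81_485 * 2048:,} cells sampled")  # ~167 million samples from the CELLxGENE split
```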
@@ -192,9 +174,9 @@ The following describes the bert MLM token loss. Like in the original BERT paper

 | Model Description | Token Loss (lower is better) |
 !!! bug "Baseline Geneformer was recently updated on huggingface making loss comparisons challenging."
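For readers interpreting the token-loss column, a BERT-style MLM token loss is the cross-entropy averaged over masked positions only; unmasked positions are excluded from the average. The snippet below is a generic PyTorch illustration of that definition, not the exact BioNeMo or Hugging Face evaluation code.

```python
import torch
import torch.nn.functional as F

def mlm_token_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Mean cross-entropy over masked token positions (BERT-style MLM loss).

    logits: [batch, seq_len, vocab_size] model outputs.
    labels: [batch, seq_len] holding the original token id at masked positions
            and `ignore_index` everywhere else, so only masked tokens are scored.
    """
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
    )

# Toy example: batch of 2, sequence length 4, vocabulary of 10, two masked positions.
logits = torch.randn(2, 4, 10)
labels = torch.full((2, 4), -100)
labels[0, 1] = 3
labels[1, 2] = 7
print(mlm_token_loss(logits, labels))
```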
@@ -222,8 +204,8 @@ Elmentaite et al. (2020), Developmental Cell. This dataset contains approximately

 For more details see the example notebook titled Geneformer-celltype-classification-example.ipynb

-
-
+
+