
Commit cf6c89e

Add debugging tips to markdown files
1 parent 6cdc4cc commit cf6c89e

2 files changed: 26 additions & 0 deletions

README.md

Lines changed: 4 additions & 0 deletions
@@ -206,6 +206,10 @@ Note: passing CLI overrides like `--reset_trainer_state` is only necessary if yo

Additional tools for evaluating OLMo models are available at the [OLMo Eval](https://github.com/allenai/ai2-olmo-eval) repo.

## Debugging

See [Debugging](https://github.com/allenai/OLMo/blob/main/docs/NOTES.md#debugging).

## Citing

docs/NOTES.md

Lines changed: 22 additions & 0 deletions
@@ -102,3 +102,25 @@ outputs = model.generate(input_tensor, max_steps=3, beam_size=3)

best_generation = outputs.token_ids[0][0].tolist()
print(tokenizer.decode(best_generation))

## Debugging

### Finding the cause of hangs

Hangs in distributed training can be due to several different causes, including bad user code, AMD/Nvidia memory-allocation issues, or problems with the hardware setup. These issues can be difficult to root-cause and even harder to fix.

One approach we use to find the cause of a hang in distributed training is to first identify which processes/nodes are hanging. The [scripts/pyspy_all_processes.sh](https://github.com/allenai/OLMo/blob/main/scripts/pyspy_all_processes.sh) script retrieves the Python state of the relevant Python processes using `py-spy`. A process/node whose state differs from the others may be the one experiencing the hang.

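As an illustration of that first step, here is a minimal Python sketch (not the repo's `pyspy_all_processes.sh`) that dumps the current Python stack of every local process whose command line matches a pattern, so the stacks of different ranks can be compared. The `train.py` pattern is a placeholder, and it assumes `py-spy` is installed and ptrace is permitted.

```python
# Minimal sketch, not scripts/pyspy_all_processes.sh: dump the Python stack of
# every local process matching a pattern so hung ranks can be spotted by eye.
# Assumes Linux, `py-spy` on PATH, and permission to ptrace the target processes.
import subprocess

PATTERN = "train.py"  # placeholder: match whatever your training entrypoint is


def dump_python_stacks(pattern: str = PATTERN) -> None:
    result = subprocess.run(["pgrep", "-f", pattern], capture_output=True, text=True)
    for pid in result.stdout.split():
        print(f"===== PID {pid} =====")
        # `py-spy dump --pid <pid>` prints the current Python call stack of that process.
        subprocess.run(["py-spy", "dump", "--pid", pid])


if __name__ == "__main__":
    dump_python_stacks()
```
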
If a hang is suspected to be in GPU code, you can run `gcore <pid>` on a hanging process to get a core dump, then open the dump with `gdb <corefile>` and check where the code is stuck from a C++ perspective. Code that is stuck in a GPU memory allocation (malloc) may indicate a hardware/setup issue rather than a problem in the training code.

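The same core-dump-and-inspect workflow can be scripted. The following is a rough sketch rather than an OLMo utility; it assumes `gcore` and `gdb` are installed and that attaching to the process is permitted.

```python
# Rough sketch of the gcore/gdb workflow described above (not an OLMo script).
# Assumes Linux, `gcore` and `gdb` installed, and permission to ptrace the process.
import subprocess
import sys


def backtrace_hanging_process(pid: int) -> None:
    # `gcore -o core <pid>` writes a core dump to core.<pid> without killing the process.
    subprocess.run(["gcore", "-o", "core", str(pid)], check=True)
    # Print a C/C++ backtrace for every thread in the dump. Frames stuck inside a
    # CUDA/HIP memory allocation can point at a hardware/setup issue.
    subprocess.run(
        ["gdb", "--batch", "-ex", "thread apply all bt",
         f"/proc/{pid}/exe", f"core.{pid}"],
        check=True,
    )


if __name__ == "__main__":
    backtrace_hanging_process(int(sys.argv[1]))
```
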
### Comparing two models that should be identical

There are scenarios in which one might want to investigate why two models/setups that should be identical yield different results. A naive solution is to run both setups side by side and compare results manually, but this may not be possible if you have just 1 GPU.

An alternative for comparing OLMo models is to run the training of both models with the `--module_outputs_save_steps=[<list of steps>]` config option. This causes OLMo to save a portion of the inputs & outputs of each OLMo submodule into a `traces/` folder at the model step's save location. Then [scripts/compare_module_outputs.py](https://github.com/allenai/OLMo/blob/main/scripts/compare_module_outputs.py) can be used to compare these inputs & outputs, hopefully isolating the issue to a subset of model modules. See that script for more details on its usage.

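To make the comparison step concrete, here is a minimal sketch of the idea. It is not `scripts/compare_module_outputs.py`, and it assumes (possibly incorrectly) that each file under `traces/` holds a single tensor loadable with `torch.load`; see the actual script for the real trace format and options.

```python
# Minimal sketch of comparing per-module traces from two runs; this is NOT
# scripts/compare_module_outputs.py. The assumption that each trace file is a
# single tensor readable by torch.load may not match the real format.
from pathlib import Path

import torch


def compare_traces(dir_a: str, dir_b: str, atol: float = 1e-6) -> None:
    files_a = {p.name: p for p in Path(dir_a).iterdir() if p.is_file()}
    files_b = {p.name: p for p in Path(dir_b).iterdir() if p.is_file()}
    for name in sorted(files_a.keys() & files_b.keys()):
        t_a = torch.load(files_a[name], map_location="cpu").float()
        t_b = torch.load(files_b[name], map_location="cpu").float()
        max_diff = (t_a - t_b).abs().max().item()
        marker = "  <-- diverges" if max_diff > atol else ""
        print(f"{name}: max abs diff {max_diff:.3e}{marker}")


if __name__ == "__main__":
    # Hypothetical locations of the traces/ folders saved at the same step of two runs.
    compare_traces("run_a/step1/traces", "run_b/step1/traces")
```
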
When comparing different hardware or dependency setups, it is possible that model state gets corrupted before the first forward pass of training. One can check for this by running training with `--force_save_unsharded --dry_run --load_path=<original_model_path>` to save a checkpoint after the original model has loaded but before training has started. Then [scripts/compare_model_state.py](https://github.com/allenai/OLMo/blob/main/scripts/compare_model_state.py) can be used to see if parameters differ between the two models.
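
As with the trace comparison, the parameter check can be sketched in a few lines. This is not `scripts/compare_model_state.py`, and the `model.pt` filename and checkpoint paths below are assumptions about the unsharded checkpoint layout.

```python
# Minimal sketch of comparing parameters between two checkpoints; this is NOT
# scripts/compare_model_state.py. The model.pt filename and paths are assumptions
# about the unsharded checkpoint layout and may not match the real one.
import torch


def compare_state_dicts(path_a: str, path_b: str) -> None:
    state_a = torch.load(path_a, map_location="cpu")
    state_b = torch.load(path_b, map_location="cpu")
    for name in sorted(set(state_a) | set(state_b)):
        if name not in state_a or name not in state_b:
            print(f"{name}: present in only one checkpoint")
            continue
        if state_a[name].shape != state_b[name].shape:
            print(f"{name}: shape mismatch {tuple(state_a[name].shape)} vs {tuple(state_b[name].shape)}")
            continue
        diff = (state_a[name].float() - state_b[name].float()).abs().max().item()
        if diff > 0:
            print(f"{name}: max abs diff {diff:.3e}")


if __name__ == "__main__":
    # Hypothetical paths to the unsharded checkpoints saved by the two dry runs.
    compare_state_dicts("run_a/step0-unsharded/model.pt", "run_b/step0-unsharded/model.pt")
```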
