Conversation
⛈️ Required checks status: Has failure 🔴

Groups summary (checks required after the changes):
- 🔴 pytorch_lightning: Tests workflow
- 🟡 pytorch_lightning: Azure GPU
- 🟡 pytorch_lightning: Benchmarks
- 🔴 pytorch_lightning: Docs
- 🟢 mypy
- 🟡 install

Thank you for your contribution! 💜
Lothiraldan
left a comment
The following example fails with this branch but passes with the latest version of Lightning.
Lightning 2.4.0, experiment: https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
Output:
CometLogger will be initialized in online mode
COMET INFO: Experiment is live on comet.com https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
COMET INFO: Couldn't find a Git repository in '/tmp' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
COMET INFO: Experiment is live on comet.com https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
| Name | Type | Params | Mode
----------------------------------------
0 | l1 | Linear | 7.9 K | train
----------------------------------------
7.9 K Trainable params
0 Non-trainable params
7.9 K Total params
0.031 Total estimated model params size (MB)
1 Modules in train mode
0 Modules in eval mode
Sanity Checking: | | 0/? [00:00<?, ?it/s]/home/lothiraldan/.virtualenvs/tempenv-60a6200361ab/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
Sanity Checking DataLoader 0: 100%|██████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 38.05it/s]/home/lothiraldan/.virtualenvs/tempenv-60a6200361ab/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
/home/lothiraldan/.virtualenvs/tempenv-60a6200361ab/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
Epoch 2: 100%|███████████████████████████████████████████████████████████████████| 469/469 [00:31<00:00, 14.73it/s, v_num=8958]`Trainer.fit` stopped: `max_epochs=3` reached.
Epoch 2: 100%|███████████████████████████████████████████████████████████████████| 469/469 [00:31<00:00, 14.73it/s, v_num=8958]
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml ExistingExperiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: name : upset_soil_1490
COMET INFO: url : https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
COMET INFO: Metrics [count] (min, max):
COMET INFO: train_loss [28] : (0.4863688051700592, 1.2028049230575562)
COMET INFO: val_loss [3] : (0.9357529878616333, 0.9526914358139038)
COMET INFO: Others:
COMET INFO: Created from : pytorch-lightning
COMET INFO: Parameters:
COMET INFO: layer_size : 784
COMET INFO: Uploads:
COMET INFO: model graph : 1
COMET INFO:
COMET INFO: Please wait for metadata to finish uploading (timeout is 3600 seconds)
COMET INFO: Uploading 1651 metrics, params and output messages
True
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: name : upset_soil_1490
COMET INFO: url : https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/64e6b0df893b435c93f54f1bc48a8958
COMET INFO: Others:
COMET INFO: Created from : pytorch-lightning
COMET INFO: Parameters:
COMET INFO: batch_size : 64
COMET INFO: Uploads:
COMET INFO: environment details : 1
COMET INFO: filename : 1
COMET INFO: installed packages : 1
COMET INFO: source_code : 2 (17.51 KB)
COMET INFO:
This branch, experiment: https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/26baa02c5c7244b4a5dc48a72e84392e
Output:
COMET INFO: Experiment is live on comet.com https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/26baa02c5c7244b4a5dc48a72e84392e
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
COMET INFO: Couldn't find a Git repository in '/tmp' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
| Name | Type | Params | Mode
----------------------------------------
0 | l1 | Linear | 7.9 K | train
----------------------------------------
7.9 K Trainable params
0 Non-trainable params
7.9 K Total params
0.031 Total estimated model params size (MB)
1 Modules in train mode
0 Modules in eval mode
W0906 18:27:21.134000 140399680829248 torch/multiprocessing/spawn.py:146] Terminating process 4052339 via signal SIGTERM
Traceback (most recent call last):
File "/tmp/Comet_and_Pytorch_Lightning.py", line 86, in <module>
main()
File "/tmp/Comet_and_Pytorch_Lightning.py", line 76, in main
trainer.fit(model, train_loader, eval_loader)
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/trainer.py", line 538, in fit
call._call_and_handle_interrupt(
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/strategies/launchers/multiprocessing.py", line 144, in launch
while not process_context.join():
^^^^^^^^^^^^^^^^^^^^^^
File "/home/lothiraldan/.virtualenvs/tempenv-5fbd1040246d4/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 189, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/lothiraldan/.virtualenvs/tempenv-5fbd1040246d4/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 76, in _wrap
fn(i, *args)
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/strategies/launchers/multiprocessing.py", line 173, in _wrapping_function
results = function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/trainer/trainer.py", line 964, in _run
_log_hyperparams(self)
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/loggers/utilities.py", line 93, in _log_hyperparams
logger.log_hyperparams(hparams_initial)
File "/home/lothiraldan/.virtualenvs/tempenv-5fbd1040246d4/lib/python3.12/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/lothiraldan/project/cometml/pytorch-lightning/src/lightning/pytorch/loggers/comet.py", line 282, in log_hyperparams
self.experiment.__internal_api__log_parameters__(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute '__internal_api__log_parameters__'
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------------------------------------------------------------------
COMET INFO: Data:
COMET INFO: display_summary_level : 1
COMET INFO: name : sleepy_monastery_3541
COMET INFO: url : https://www.comet.com/lothiraldan/comet-example-pytorch-lightning/26baa02c5c7244b4a5dc48a72e84392e
COMET INFO: Parameters:
COMET INFO: batch_size : 64
COMET INFO: Uploads:
COMET INFO: environment details : 1
COMET INFO: filename : 1
COMET INFO: installed packages : 1
COMET INFO: source_code : 2 (14.93 KB)
COMET INFO:
Please investigate what is happening
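The `AttributeError` above points at a classic DDP-spawn hazard: the logger is pickled into the worker processes, the live Comet experiment handle does not survive pickling, and `log_hyperparams` then dereferences `None`. A minimal, self-contained sketch of the failure mode and the usual defensive fix — all names here are illustrative stand-ins, not the actual Lightning or Comet internals:

```python
import pickle

class FakeExperiment:
    """Stand-in for a live Comet experiment handle."""
    def log_parameters(self, params):
        return dict(params)

class CometLikeLogger:
    def __init__(self):
        self._experiment = FakeExperiment()

    def __getstate__(self):
        # Live SDK handles are not picklable, so they are dropped when the
        # DDP-spawn launcher ships the logger to a worker process.
        state = self.__dict__.copy()
        state["_experiment"] = None
        return state

    @property
    def experiment(self):
        # Defensive pattern: lazily recreate the handle in the worker
        # instead of handing back None (which is what the traceback shows).
        if self._experiment is None:
            self._experiment = FakeExperiment()
        return self._experiment

# Simulate the spawn round-trip: pickle in the parent, unpickle in the worker.
worker_logger = pickle.loads(pickle.dumps(CometLikeLogger()))
worker_logger.experiment.log_parameters({"batch_size": 64})  # no AttributeError
```

If the branch drops the lazy-recreation guard (or returns `self._experiment` directly somewhere), the worker sees `None` and fails exactly as in the traceback.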
update tutorials to `3f8a254d` Co-authored-by: Borda <Borda@users.noreply.github.com>
Did some testing with the following Trainer() params:
- CPU
- GPU
- MULTI-NODE (two VM nodes, each with one CUDA device)

With or without the current PR, everything works the same.
@japdubengsub very nice job on testing, Sasha!
update tutorials to `d5273534` Co-authored-by: Borda <Borda@users.noreply.github.com>
…ning-AI#20267) * build(deps): bump Lightning-AI/utilities from 0.11.6 to 0.11.7 Bumps [Lightning-AI/utilities](https://github.com/lightning-ai/utilities) from 0.11.6 to 0.11.7. - [Release notes](https://github.com/lightning-ai/utilities/releases) - [Changelog](https://github.com/Lightning-AI/utilities/blob/main/CHANGELOG.md) - [Commits](Lightning-AI/utilities@v0.11.6...v0.11.7) --- updated-dependencies: - dependency-name: Lightning-AI/utilities dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * Apply suggestions from code review --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
…ing-AI#20266) Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 6 to 7. - [Release notes](https://github.com/peter-evans/create-pull-request/releases) - [Commits](peter-evans/create-pull-request@v6...v7) --- updated-dependencies: - dependency-name: peter-evans/create-pull-request dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
When loading a pytorch-lightning model from MLFlow, I get `TypeError: Type parameter +_R_co without a default follows type parameter with a default`. This happens whenever doing `import pytorch_lightning as pl` which is done by packages like MLFlow. Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com> Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
* Fix TBPTT example * Make example self-contained * Update imports * Add test
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com> Co-authored-by: Jirka B <j.borovec+github@gmail.com>
* test: flaky terminated with signal SIGABRT * str
* Update twine to 6.0.1 for Python 3.13 * Pin pkginfo * Go with twine 6.0.1
…ning-AI#20569) Bumps [Lightning-AI/utilities](https://github.com/lightning-ai/utilities) from 0.11.9 to 0.12.0. - [Release notes](https://github.com/lightning-ai/utilities/releases) - [Changelog](https://github.com/Lightning-AI/utilities/blob/main/CHANGELOG.md) - [Commits](Lightning-AI/utilities@v0.11.9...v0.12.0) --- updated-dependencies: - dependency-name: Lightning-AI/utilities dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>
…#20574) Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>
…latency significantly. (Lightning-AI#20594) * Move save_hparams_to_yaml to log_hparams instead of auto save with metric * Fix params to be optional * Adjust test * Fix test_csv, test_no_name --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… to ensure `tensorboard` logs can sync to `wandb` (Lightning-AI#20610)
) * Add checkpoint artifact path prefix to MLflow logger Add a new `checkpoint_artifact_path_prefix` parameter to the MLflow logger. * Modify `src/lightning/pytorch/loggers/mlflow.py` to include the new parameter in the `MLFlowLogger` class constructor and use it in the `after_save_checkpoint` method. * Update the documentation in `docs/source-pytorch/visualize/loggers.rst` to include the new `checkpoint_artifact_path_prefix` parameter. * Add a new test in `tests/tests_pytorch/loggers/test_mlflow.py` to verify the functionality of the `checkpoint_artifact_path_prefix` parameter and ensure it is used in the artifact path. * Add CHANGELOG * Fix MLflow logger test for `checkpoint_path_prefix` * Update stale documentation --------- Co-authored-by: Luca Antiga <luca.antiga@gmail.com>
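The MLflow prefix described above can be pictured as simple path composition. This sketch only assumes, per the commit message, that a non-empty prefix is prepended to the checkpoint artifact path; the helper and directory layout here are hypothetical, not Lightning's actual code:

```python
from pathlib import PurePosixPath

def checkpoint_artifact_path(filename: str, prefix: str = "") -> str:
    # Hypothetical layout: checkpoints live under "model/checkpoints/<name>";
    # a non-empty prefix nests them one directory deeper so separate runs
    # (or checkpoint families) do not collide in the artifact store.
    base = PurePosixPath("model", "checkpoints")
    if prefix:
        base = PurePosixPath(prefix) / base
    return str(base / PurePosixPath(filename).stem)

print(checkpoint_artifact_path("epoch=2-step=900.ckpt"))
# model/checkpoints/epoch=2-step=900
print(checkpoint_artifact_path("epoch=2-step=900.ckpt", prefix="run-a"))
# run-a/model/checkpoints/epoch=2-step=900
```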
…ning-AI#20631) * build(deps): bump Lightning-AI/utilities from 0.12.0 to 0.14.0 Bumps [Lightning-AI/utilities](https://github.com/lightning-ai/utilities) from 0.12.0 to 0.14.0. - [Release notes](https://github.com/lightning-ai/utilities/releases) - [Changelog](https://github.com/Lightning-AI/utilities/blob/main/CHANGELOG.md) - [Commits](Lightning-AI/utilities@v0.12.0...v0.14.0) --- updated-dependencies: - dependency-name: Lightning-AI/utilities dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Apply suggestions from code review --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Jirka Borovec <6035284+Borda@users.noreply.github.com>
Update test_results.py
…if `sync_tensorboard=True` (Lightning-AI#20611)
* Allow a custom parser class when using LightningCLI * Update changelog
* ci: resolve standalone testing * faster * merge * printenv * here * list * prune * process * printf * stdout * ./ * -e * .coverage * all * rev * notes * notes * notes
* bump: testing with future torch 2.6 * bump `typing-extensions` * TORCHINDUCTOR_CACHE_DIR * bitsandbytes * Apply suggestions from code review * _TORCH_LESS_EQUAL_2_6 --------- Co-authored-by: Luca Antiga <luca.antiga@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Luca Antiga <luca@lightning.ai>
…ger-update # Conflicts: # src/lightning/pytorch/CHANGELOG.md
In this pull request, the CometML logger was updated to support the latest Comet SDK. It has been unified with the `comet_ml.start()` entry point for ease of use, and the unit tests have been updated accordingly.

📚 Documentation preview 📚: https://pytorch-lightning--2.org.readthedocs.build/en/2/
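The "unified with `comet_ml.start()`" idea can be sketched with a tiny stand-in: a single entry point that reuses an already-running experiment instead of each caller constructing its own. The names and behavior below are illustrative only, not the real Comet SDK:

```python
_running_experiment = None

def start(**experiment_config):
    """Return the running experiment if one exists, else create it
    (a toy model of a start()-style entry point)."""
    global _running_experiment
    if _running_experiment is None:
        _running_experiment = {"live": True, **experiment_config}
    return _running_experiment

first = start(project_name="comet-example-pytorch-lightning")
second = start()  # a later caller (e.g. the logger) reuses the same experiment
assert first is second
```

Routing all experiment creation through one idempotent entry point is what avoids the duplicated or dangling experiment handles that per-caller construction can produce.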