Merge branch 'develop' into ml-evs/generalize-integration-tests
ml-evs committed Oct 24, 2024
2 parents 22b269a + d075299 commit df29570
Showing 62 changed files with 4,453 additions and 252 deletions.
10 changes: 8 additions & 2 deletions .github/workflows/testing.yml
@@ -98,10 +98,16 @@ jobs:
           pip install .[tests]

       - name: Unit tests
-        run: pytest --cov=jobflow_remote --cov-report=xml --cov-config pyproject.toml --ignore tests/integration
+        run: COVERAGE_FILE=.coverage.1 pytest --cov=jobflow_remote --cov-report= --cov-config pyproject.toml --ignore tests/integration

       - name: Integration tests
-        run: pytest --cov=jobflow_remote --cov-append --cov-report=xml --cov-config pyproject.toml tests/integration
+        run: COVERAGE_FILE=.coverage.2 pytest --cov=jobflow_remote --cov-report= --cov-config pyproject.toml tests/integration

+      # combining the reports with --cov-append did not seem to work
+      - name: Generate coverage report
+        run: |
+          coverage combine .coverage.1 .coverage.2
+          coverage xml
+
       - name: Upload coverage reports to Codecov
         uses: codecov/codecov-action@v4
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -38,6 +38,7 @@ repos:
- tokenize-rt==4.1.0
- types-paramiko
- pydantic~=2.0
- types-python-dateutil
- repo: https://github.com/codespell-project/codespell
rev: v2.3.0
hooks:
1 change: 1 addition & 0 deletions doc/source/conf.py
@@ -57,6 +57,7 @@
"sphinx_copybutton",
"sphinxcontrib.autodoc_pydantic",
"sphinxcontrib.mermaid",
"sphinxcontrib.typer",
]

# Add any paths that contain templates here, relative to this directory.
57 changes: 54 additions & 3 deletions doc/source/user/advancedoptions.rst
@@ -119,6 +119,11 @@ An minimal configuration for a *batch* worker would thus be:
    work_dir: /home/guido/software/python/test_jfr/
    host: hpc_host
    max_jobs: 5
    resources:
      partition: debug
      ntasks: 128
      nodes: 1
      time: "24:00:00"
    batch:
      jobs_handle_dir: /remote/path/jfr_handle_dir
      work_dir: /remote/path/jfr_batch_jobs
@@ -133,8 +138,54 @@ end of one Job and the availability of a new one the *batch* job in the queue wi

.. warning::

-    The ``batch`` section of a worker's configuration also has a ``max_jobs`` option.
-    It allows for the definition of the maximum number of jobflow Jobs that will be executed in a single
-    process submitted to the queue (e.g. a SLURM job). This should not be confused with
+    The ``batch`` section of a worker's configuration has a ``max_jobs_per_batch`` option.
+    It allows for the definition of the maximum number of jobflow Jobs that will be executed
+    in a single *batch* process. This should not be confused with
     the ``max_jobs`` value mentioned above, that defines the number of submitted *batch*
     processes (e.g. the maximum number of SLURM Jobs simultaneously in the queue).
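
As a rough illustration of how the two limits combine, the following Python sketch computes an upper bound on the number of jobflow Jobs that the *batch* processes queued at one time can execute. The function name is hypothetical and not part of jobflow-remote; the sketch only assumes that each *batch* process runs at most ``max_jobs_per_batch`` Jobs.

```python
def batch_capacity(max_jobs: int, max_jobs_per_batch: int) -> int:
    """Upper bound on the Jobs executed by the batch processes queued at one time.

    max_jobs: number of batch processes allowed in the queue simultaneously.
    max_jobs_per_batch: number of jobflow Jobs a single batch process may run.
    Both names mirror the configuration options; the function itself is
    hypothetical and only illustrates how the limits multiply.
    """
    if max_jobs < 1 or max_jobs_per_batch < 1:
        raise ValueError("both limits must be positive integers")
    return max_jobs * max_jobs_per_batch


# Five batch processes in the queue, each allowed to run up to ten Jobs.
print(batch_capacity(max_jobs=5, max_jobs_per_batch=10))  # 50
```

New *batch* processes may be submitted as earlier ones finish, so this bounds only the processes queued at a given moment, not the whole run.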

.. _advancedoptions paralbatch:

Parallel batch
--------------

Another potential use case is executing multiple Jobs in parallel inside the
same process submitted to the queue manager: for example, requesting multiple nodes for a
single job of the worker (e.g. a SLURM job) and running a different Job on each of the nodes.

This can be achieved by setting ``parallel_jobs`` to a value larger than 1 in the
``batch`` section.
An example of a configuration for a parallel *batch* worker is:

.. code-block:: yaml

    worker_name:
      scheduler_type: slurm
      work_dir: /home/guido/software/python/test_jfr/
      host: hpc_host
      max_jobs: 5
      resources:
        partition: debug
        ntasks: 512
        nodes: 4
        time: "24:00:00"
      batch:
        jobs_handle_dir: /remote/path/jfr_handle_dir
        work_dir: /remote/path/jfr_batch_jobs
        parallel_jobs: 4

Note that, depending on how the cluster is configured and how the Job is implemented,
it will probably be necessary to specify the number of processors used by each of the Jobs.
For example, for a Job running an MPI-parallelized code under SLURM, it may be
necessary to run the code with the command::

    srun --nodes 1 -n 128 --exclusive EXECUTABLE

Additional options may need to be set. It is advisable to first verify, outside
jobflow-remote, what is required to execute multiple processes in parallel on the cluster.
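
Since a Job is a Python function, one way to issue such a command from inside a Job is to assemble it as an argument list. The sketch below is only an illustration: the helper name and its defaults are assumptions taken from the example command above, and the right ``srun`` options depend on the cluster.

```python
import shlex


def srun_command(executable: str, nodes: int = 1, ntasks: int = 128) -> list[str]:
    """Build the argument list for the srun call shown above.

    The helper is hypothetical; the defaults simply reproduce the example
    (one node, 128 tasks, exclusive use of the allocated resources).
    """
    return [
        "srun",
        "--nodes", str(nodes),
        "-n", str(ntasks),
        "--exclusive",
        *shlex.split(executable),  # allows an executable followed by arguments
    ]


cmd = srun_command("EXECUTABLE")
print(shlex.join(cmd))  # srun --nodes 1 -n 128 --exclusive EXECUTABLE

# Inside a Job the command could then be launched with, for example:
# subprocess.run(cmd, check=True)
```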

.. note::

    There is currently no way of obtaining from jobflow-remote the list of nodes/cores
    assigned to each Job. If you need this information to run in parallel batch mode,
    consider opening an issue on `GitHub <https://github.com/Matgenix/jobflow-remote/issues>`_.
26 changes: 26 additions & 0 deletions doc/source/user/cli.rst
@@ -0,0 +1,26 @@
.. _cli:

===
CLI
===

Jobflow-remote allows Jobs and Flows to be managed through the ``jf`` command line
interface (CLI). The most useful commands are already discussed in the
specific sections. A list of all the available commands can be obtained
by running::

    jf --tree

or, for the commands available in a subsection, with for example::

    jf job --tree

All the commands have an associated help that can be shown with the
``--help`` flag. The help for all the commands available in ``jf``
is reported below.

.. typer:: jobflow_remote.cli:app
    :preferred: html
    :width: 65
    :show-nested:
    :make-sections:
2 changes: 2 additions & 0 deletions doc/source/user/errors.rst
@@ -333,6 +333,8 @@ rerun a Job. In particular
inconsistencies.


.. _errors runner:

Runner errors and Locked jobs
=============================

2 changes: 2 additions & 0 deletions doc/source/user/index.rst
@@ -18,8 +18,10 @@ details are found in :ref:`reference`.
tuning
errors
states
runner
advancedoptions
backup
cli

.. toctree::
:hidden:
58 changes: 55 additions & 3 deletions doc/source/user/install.rst
@@ -1,8 +1,8 @@
.. _install:

-**********************
-Setup and installation
-**********************
+*******************************
+Setup, installation and upgrade
+*******************************

Introduction
============
@@ -25,6 +25,8 @@ All of these should have a python environment with at least jobflow-remote insta
However, only **USER** and **RUNNER** need to have access to the database. If not overlapping
with the other **RUNNER** only needs ``jobflow-remote`` and its dependencies to be installed.

.. _setup options:

Setup options
=============

@@ -97,6 +99,50 @@ or, for the development version::

pip install git+https://github.com/Matgenix/jobflow-remote.git

.. _upgrade:

Upgrade
=======

If you upgrade ``jobflow-remote`` to a new version and plan to use it with an
existing project, the existing database or project configuration may be
incompatible with the upgraded version. To smooth the upgrade procedure, a tool
that upgrades the configuration is exposed through the ``jf`` command line tool::

    jf admin upgrade

This performs the following steps:

* Compare the version of the installed ``jobflow-remote`` with the one stored
  in the database (set when executing a ``jf admin reset``) and use this as a
  reference to determine which upgrades will be applied.
* Check the version of ``jobflow`` installed and compare it with the version stored
  in the database. Optionally compare the versions of all the other packages
  installed (use the ``--check-env`` option).
* Provide a list of upgrades that will be performed.
* Ask the user for confirmation.
* Sequentially apply the required upgrades.
* Update the version information in the database.

This will resolve potential incompatibilities and make the configuration compatible
with the current version of ``jobflow-remote``.

.. warning::
    It is advisable to back up the content of the queue database
    before performing the upgrade. See the :ref:`backup` section for more details.

.. note::
    The version is upgraded in steps: if multiple versions have been skipped
    before the current upgrade, the code upgrades between subsequent versions,
    one at a time.

.. note::
    A difference in the installed packages does not necessarily imply issues for
    the upgrade. Checking them may still help to spot a problematic or unexpected
    difference.
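
The stepwise behaviour described in the note above can be sketched as follows. This is an illustration only and not the code used by ``jf admin upgrade``: the function names and the version numbers are hypothetical, and versions are assumed to be plain dotted integers.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as "0.1.4" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def select_upgrades(db_version: str, installed_version: str, registered: list[str]) -> list[str]:
    """Return the registered upgrade targets to apply, oldest first.

    An upgrade is selected when its target version is newer than the version
    stored in the database and not newer than the installed package.
    """
    low = parse_version(db_version)
    high = parse_version(installed_version)
    selected = [v for v in registered if low < parse_version(v) <= high]
    return sorted(selected, key=parse_version)


# Hypothetical version numbers, chosen only for illustration.
steps = select_upgrades("0.1.2", "0.1.5", ["0.1.5", "0.1.3", "0.1.4", "0.2.0"])
print(steps)  # ['0.1.3', '0.1.4', '0.1.5']
```

Applying the selected steps in order means that each upgrade only has to handle the transition from the version immediately before it.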


Environments
============

@@ -216,4 +262,10 @@ As a last step you should reset the database with the command::
This will also delete the content of the database. If you are reusing an existing database
and do not want to erase your data, skip this step.

.. note::

This will also set the information about the ``jobflow-remote`` version and
python environment in the database. This will be used during the :ref:`upgrade`
procedure.

You are now ready to start running workflows with jobflow-remote!
3 changes: 2 additions & 1 deletion doc/source/user/introduction.rst
@@ -57,7 +57,7 @@ equivalent for the other kinds of setup.

 .. image:: ../_static/img/daemon_schema.svg
     :width: 50%
-    :alt: All-in-one configuration
+    :alt: Runner daemon
     :align: center

Once the daemon is started, the runner loops over the different actions that it can
@@ -68,6 +68,7 @@ perform and updates the state of Jobs in the database performing some actions on
- resolving all the references of the Job from the database (including everything in additional stores)
- using those data to generate a JSON representation of the Job without external references
- uploading a JSON file with this information on the runner

Once this is done, the state of the Job is ``UPLOADED``.
* The runner generates a submission script suitable for the type of chosen worker.
Uploads it and submits the job. The Job is now ``SUBMITTED``.