Replies: 2 comments 4 replies
Thank you for your interest in our code, Pezhman. Development currently happens in a private repository, with updates pushed to the public one soon and an eventual transition to doing development in the public repository as well. We are currently working on improving the infrastructure for launching training, both directly (as is done currently) and through AiiDA. The idea is that the "direct" option would be suitable for smaller runs that fit on a single machine, while AiiDA will offer more generality and better use of HPC resources. I can't give you a specific timeline, but it is something we are working on at the moment. As for using …
It depends a bit on how … In that case, you only really need to modify Line 123 in 9fc12fd to something like … If you're dealing with a more complicated situation, where you need to work out the pinning/mapping yourself, I can also share more details from a modification done for …
Hi,
Thanks for making the code public. It seems very promising and efficient.
I'm wondering if you have a short-term plan to make it scheduler-friendly. More specifically, I would be interested in training the potential on HPC systems with a Slurm/SGE scheduler, where we launch calculations through wrappers such as `srun`.

I gave it a try by directly calling `mpirun`, but the jobs are crashing. I haven't investigated the failure in depth yet, but it seems that training is being carried out while the `scaling.dat` file is not yet available, so the code crashes. Any hint or suggestion would be highly appreciated.
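As a stopgap while debugging, one generic workaround for this kind of race (a dependent step starting before `scaling.dat` has been written) is to poll for the file before launching whatever reads it. This is only a sketch of that general technique, not code from this project; `wait_for_file` and the timeout values are hypothetical names I'm introducing for illustration.

```python
import os
import time


def wait_for_file(path, timeout=600.0, poll=1.0):
    """Block until `path` exists and is non-empty, or raise TimeoutError.

    Hypothetical helper: poll the filesystem every `poll` seconds for up
    to `timeout` seconds before giving up.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return True
        time.sleep(poll)
    raise TimeoutError(f"{path} did not appear within {timeout} s")


if __name__ == "__main__":
    # E.g., wait for the scaling data before starting the step that needs it.
    wait_for_file("scaling.dat", timeout=300.0)
```

In a Slurm job script, a call like this could sit between the training launch and the step that crashes, so the second step only starts once the file is actually on disk (NFS/Lustre visibility delays can also make a freshly written file appear late on other nodes).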
Best regards,
Pezhman