Hamilton orchestrating slurm #586

skrawcz · 2023-12-12T18:23:37Z

skrawcz
Dec 12, 2023
Maintainer

This is a brainstorming discussion on what the shape of the API should be to support some decorators/functionality that would submit to slurm. The catalyst for this issue is @Roy-Kid !

For example, given a sketch of an API:

@work_at('./test_submit') 
@submit('sleep 1000', 
           type='slurm', 
           is_block=True,
           config={
            '-A': 'snic2022-5-658',
            '-n': 4,
            '-J': 'JOB_NAME',
            '-t': '07-00:00:00',
    })
def foo(foo_input:str)->str:
    return foo_input

@work_at('./second_submit') 
@submit('sleep 1000', 
           type='slurm', 
           is_block=True,
           config={
            '-A': 'snic2022-5-658',
            '-n': 4,
            '-J': 'JOB_NAME',
            '-t': '07-00:00:00',
    })
def bar(foo: str) -> str:
    return foo

what do we want it to do and how do we want users to think?

skrawcz · 2023-12-12T18:24:02Z

skrawcz
Dec 12, 2023
Maintainer Author

Question: should the function body be executed before or after the decorators?

9 replies

Roy-Kid Dec 12, 2023

Yes, yield is confusing. Let me write down a process I use in a computational chemistry task:

@work_at('./calc_energy') 
@submit('source calc_energy.sh', 
           type='slurm', 
           is_block=True,
           config={
            '-A': 'snic2022-5-658',
            '-n': 4,
            '-J': 'JOB_NAME',
            '-t': '07-00:00:00',
    })
def calc_energy(num_molecules: int) -> float:
    # before:
    # here we can code or render a script
    #     w.r.t to input
    with open('calc_energy.sh', 'w') as f:
        f.write(f"mpprun lmp -in run.in -var nmol {num_molecules}")
    yield
    # after:
    # extract result from `energy.txt`
    result = fn('energy.txt')
    return result

You are right, the code is ugly and even raises StopIteration (although I can accept this because I always use generators). Maybe we can find a way to determine which code runs before and which runs after. I have no better idea right now.

skrawcz Dec 12, 2023
Maintainer Author

Oh interesting -- a yield could work -- that would make it explicit what is before and what's after.

Also how much value is there in the decorator -- versus putting those values in a dictionary/TypedDict/object to return as part of the yield?

@work_at('./calc_energy') 
@slurm
def calc_energy(num_molecules: int) -> Generator[dict, dict, float]:
    # before:
    # here we can code or render a script
    #     w.r.t to input
    with open('calc_energy.sh', 'w') as f:
        f.write(f"mpprun lmp -in run.in -var nmol {num_molecules}")
    submit_dict = dict(cmd='source calc_energy.sh', 
           type='slurm', 
           is_block=True,
           config={
            '-A': 'snic2022-5-658',
            '-n': 4,
            '-J': 'JOB_NAME',
            '-t': '07-00:00:00',
        } 
    )
    sub_result = yield  submit_dict # note we could send the std out/error and status code back in here.
    # after:
    # extract result from `energy.txt`
    result = fn('energy.txt')
    return result

Otherwise question on the work_at -- would that map 1-1 with a function? or? If so, then we could (a) make that a default requirement -- and (b) make that something that could be passed to @slurm too.

Roy-Kid Dec 12, 2023

a yield could work -- that would make it explicit what is before and what's after

Brilliant! Completely avoid top-heavy, I like it!

How much value is there in the decorator -- versus putting those values in a dictionary/TypedDict/object to return as part of the yield?

My tentative idea is that either all the arguments go in the yield, or type (if we use @submit for all kinds of submitters) and is_block go in the decorator. If all the args are in the function body, we can configure everything from inputs, and I can't think of any downsides. If we move type / is_block from the yield to the decorator, the semantics are clearer: only the requested resources are dynamic, while the executor and the execution method are static.

would that map 1-1 with a function? or? If so, then we could (a) make that a default requirement -- and (b) make that something that could be passed to @slurm too.

Yes, it maps 1-1 with a function and acts as a context manager. I don't think we could make that a default requirement, because we don't know in which directory the user wants to submit the task or, more generally, execute a command.

I need work_at and submit to be separated, because many tools, for example antechamber, cannot be told where to write their output and intermediate files. If I cannot use work_at to cd into another directory temporarily, tons of files will be generated in the root directory, or I need to call os.chdir or something similar manually (work_at is really syntactic sugar for os.chdir: change directory before the node starts and change it back after the node ends).

skrawcz Dec 12, 2023
Maintainer Author

Yes, it maps 1-1 with a function and acts as a context manager. I don't think we could make that a default requirement, because we don't know in which directory the user wants to submit the task or, more generally, execute a command.

Right this is where we could by convention make it use that by default to match, unless provided otherwise. This could help force people to use a similar structure/approach.

I need work_at and submit to be separated, because many tools, for example antechamber, cannot be told where to write their output and intermediate files. If I cannot use work_at to cd into another directory temporarily, tons of files will be generated in the root directory, or I need to call os.chdir or something similar manually (work_at is really syntactic sugar for os.chdir: change directory before the node starts and change it back after the node ends).

My assumption was that for each submit() you'd want to specify the working directory. Is that assumption incorrect? That is, could there be cases you want submit() but without changing the current working directory?

Roy-Kid Dec 15, 2023

Sorry for the late reply, I was sick for a few days. You are right: sometimes, especially in smaller projects, there is no need to change the directory, particularly when there is only one submission. Following the principle that an API should do one thing, I think submit() and work_at() should be separate, especially since I will use work_at() in other tasks.
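To make the "work_at is syntactic sugar for os.chdir" idea concrete, here is a minimal sketch built on contextlib.contextmanager, whose context managers can also be applied as decorators. This is not the implementation from the thread; creating the directory when it is missing is an assumption added for the example.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def work_at(workdir):
    """Temporarily change into `workdir`, restoring the previous directory on exit."""
    previous = Path.cwd()
    # Assumption for this sketch: create the directory if it does not exist yet.
    Path(workdir).mkdir(parents=True, exist_ok=True)
    os.chdir(workdir)
    try:
        yield Path.cwd()
    finally:
        # Runs even if the wrapped code raises, so the caller's cwd is not lost.
        os.chdir(previous)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "calc_energy"

    # Used as a context manager inside a function body.
    with work_at(target):
        print(Path.cwd().name)

    # Used as a decorator: a fresh context is entered on every call.
    @work_at(target)
    def run() -> str:
        return Path.cwd().name

    print(run())
```

Keeping work_at independent of any submit logic, as argued above, means the same helper can wrap tools that have nothing to do with slurm.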

skrawcz · 2023-12-12T18:25:16Z

skrawcz
Dec 12, 2023
Maintainer Author

Question: what customization would we want to enable with respect to decorator parameters? It could be possible to do the following.

@work_at(source("bar_dir"))  # <---
@submit('sleep 1000', 
           type='slurm', 
           is_block=True,
           config=source("bar_config")) # <---
def bar(foo: str) -> str:
    return foo

so that we could do the following:

result = dr.execute([...], inputs={"bar_config": dict(...), "bar_dir": "some/path/to/folder"})

1 reply

Roy-Kid Dec 12, 2023

Sounds good! I'm not familiar with Hamilton yet; does this mean we need to implement a new driver or GraphAdapter? If not, and the user can still replace source("bar_dir") with a plain path string, then I think it is perfect!
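The "literal or source(...)" behaviour asked about here can be illustrated without Hamilton. The sketch below is not Hamilton's implementation: `Source` and `resolve` are hypothetical names that only show how a decorator parameter could be either a plain value or a reference looked up in the runtime inputs.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Source:
    """Hypothetical marker meaning 'look this value up in the runtime inputs'."""
    name: str


def resolve(value: Any, inputs: Mapping[str, Any]) -> Any:
    """Return literals unchanged; replace Source markers with the matching input."""
    if isinstance(value, Source):
        if value.name not in inputs:
            raise KeyError(f"missing runtime input: {value.name!r}")
        return inputs[value.name]
    return value


inputs = {"bar_dir": "some/path/to/folder", "bar_config": {"-n": 4}}
print(resolve(Source("bar_dir"), inputs))  # value taken from the inputs
print(resolve("./second_submit", inputs))  # plain string passes through untouched
```

With this shape, a decorator would call resolve on each of its parameters just before running, so users who do not need runtime configuration can keep passing strings and dicts directly.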

skrawcz · 2023-12-12T18:27:35Z

skrawcz
Dec 12, 2023
Maintainer Author

What would be some example procedural code that we'd want the Hamilton code to map to? -- just to ensure it's clear what Hamilton is replacing/helping with.

TODO: provide example

3 replies

Roy-Kid Dec 12, 2023

Here is an example of the code behind the submit API, decoupled from Hamilton. I hope these functions can be handled by Hamilton and integrated into a DAG:

import logging
import subprocess


class Monitor:
    # This code could be abstracted into a `watcher` class that monitors external status. For example, some programs are not handled by Python, so we can watch whether a file is generated or a new line is appended to a file.
    def __init__(self):

        self.job_pool = {}
        self.logger = logging.getLogger(self.__class__.__name__)

    def add_job(self, job_id, job_name):
        self.job_pool[job_id] = job_name

    def remove_job(self, job_id):
        self.job_pool.pop(job_id)

    def status(self, job_id):
        out = subprocess.check_output(f'squeue -j {job_id}', shell=True)
        out = out.decode('utf-8')
        lines = out.split('\n')
        # keep only rows whose first column is this job id (this skips the header row)
        rows = [x for x in lines if x.strip().startswith(str(job_id))]
        if not rows:
            # the job is no longer listed in the queue
            return None
        assert len(rows) == 1, 'multiple jobs found, please check!'
        # 29417953 tetralith tg_mil1_  x_jicli  R      23:51      2 n[1790,1795]
        job_id, partition, name, user, status, time, nodes, node_info = rows[0].split(None, 7)
        self.logger.info(f'Job {job_id} status: {status}')
        return status
        

import time
from pathlib import Path


# `Script` is not defined in this snippet; it provides append(), save(), full_name and _content_list.
class SlurmSubmitor(Script):
    # This class submits a task to the queue system.
    # All tasks submitted to one system should be monitored by a watcher.
    # By polling, we can see the status of all tasks.
    monitor = Monitor()

    def __init__(self, config:dict, name:str='submit', ext:str='sh', is_block:bool=False):
        super().__init__(name, ext)
        self._config = config
        self.is_block = is_block
        self.logger = logging.getLogger(self.__class__.__name__)

    def submit(self, path: Path | str | None = None):
        # write the #SBATCH header in front of the script body
        config_lines = []
        config_lines.append('#!/bin/bash')
        for k, v in self._config.items():
            config_lines.append(f'#SBATCH {k} {v}')
        config_lines.append('\n')

        self._content_list = config_lines + self._content_list

        # resolve the default at call time so it follows os.chdir / work_at
        path = Path(path) if path is not None else Path.cwd()
        self.save(path)

        # Here we need to orchestrate with external code,
        # Maybe need a robust way to take care of submission
        # subprocess.run (unlike subprocess.call) returns the captured stdout/stderr
        proc = subprocess.run(f'sbatch {self.full_name}', shell=True, cwd=path, capture_output=True)
        stdout = proc.stdout.decode('utf-8')
        stderr = proc.stderr.decode('utf-8')
        if stderr:
            self.logger.error(f'Submit {self.full_name} to {path} failed!')
            self.logger.error(stderr)
            return  # no job id to wait on

        # example: Submitted batch job 123456
        job_id = stdout.split()[-1]
        self.logger.info(f'{self.full_name}({job_id}) submitted')

        if self.is_block:
            self.monitor.add_job(job_id, self.full_name)
            # wait while the job is pending (PD) or running (R)
            while self.monitor.status(job_id) in ('PD', 'R'):
                time.sleep(60)

import os
from functools import wraps


def work_at(workdir:Path|str=Path.cwd()):
    # This decorator matters because many tools cannot be told where to run; they always execute in the current directory.
    def _cd(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            olddir = os.getcwd()
            os.chdir(workdir)
            try:
                results = func(*args, **kwargs)
            finally:
                # restore the previous directory even if func raises
                os.chdir(olddir)
            return results
        return wrapper
    return _cd

def submit(cmd:str, type:str, is_block:bool, config:dict, name:str='submit', ext:str='sh'):
    def _submit(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if type == 'slurm':
                submitor = SlurmSubmitor(config, name, ext, is_block)
                submitor.append(cmd)
                submitor.submit(os.getcwd())
            else:
                raise NotImplementedError(f'submit type {type} not implemented!')
            return result
        return wrapper
    return _submit
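The string handling in the classes above (building the #SBATCH header, reading the job id from sbatch, reading the state from squeue) can be pulled into pure functions that are testable without a cluster. This is a sketch, not the author's code: the function names are hypothetical, and the squeue column order is assumed from the sample line in the Monitor comment.

```python
from typing import Mapping, Optional, Sequence


def render_sbatch_script(config: Mapping[str, object], commands: Sequence[str]) -> str:
    """Turn a {'-n': 4, ...} mapping plus shell commands into a batch script."""
    lines = ["#!/bin/bash"]
    lines += [f"#SBATCH {flag} {value}" for flag, value in config.items()]
    lines.append("")
    lines += commands
    return "\n".join(lines) + "\n"


def parse_job_id(sbatch_stdout: str) -> str:
    """Extract the id from a line such as 'Submitted batch job 123456'."""
    return sbatch_stdout.strip().split()[-1]


def parse_status(squeue_stdout: str, job_id: str) -> Optional[str]:
    """Return the state column for `job_id`, or None if the job is not listed."""
    for line in squeue_stdout.splitlines():
        fields = line.split()
        if fields and fields[0] == str(job_id):
            # Assumed column order: JOBID PARTITION NAME USER ST TIME NODES NODELIST
            return fields[4]
    return None


print(render_sbatch_script({"-n": 4, "-J": "JOB_NAME"}, ["sleep 1000"]))
print(parse_job_id("Submitted batch job 123456\n"))
```

Separating parsing from the subprocess calls also makes it easier to swap in another scheduler later, since only these helpers depend on slurm's text formats.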

Ref code:

skrawcz Dec 12, 2023
Maintainer Author

@Roy-Kid yep thanks -- does this map to what you're thinking in terms of how things would be executed?

1. Create driver -- that creates the DAG.
2. Specify what to execute -- the graph is walked, and then:
3. foo() is executed.
4. Result of (3) is used to run a submit command in the specified working directory.
5. We wait for (4) to finish.
6. bar() is then executed -- what is passed in to bar from foo?
7. Result of (6) is used to run a submit command in the specified working directory.
8. We wait for (7) to finish.
9. The driver returns a result.

Roy-Kid Dec 12, 2023

Yes, exactly! The point of discussion narrowed down to whether submit should be executed before or after the function. If we can introduce a yield keyword... That would be awesome!

elijahbenizzy · 2024-01-05T17:44:35Z

elijahbenizzy
Jan 5, 2024
Maintainer

OK, @skrawcz and I thought this over. I think this is a nice solution. What we do:

Use task based execution
Define an execution manager to choose the executor for tasks
This delegates to the function
This runs in a new thread (and thus doesn't block), only if decorated
Otherwise it runs locally

Then, the function you have with the @slurm decorator does all the meat of what you have in your execution class -- submit, poll, and parse the results/return it.

# module
def some_var() -> str:
    ...

@slurm(...)
def slurm_job_1(some_var: str) -> dict:
    return _make_config(...)

@slurm(...) 
@tag(slurm_execute=True)
def slurm_job_2(some_var: str) -> dict:
    return _make_config(...)

def results(slurm_job_1: dict, slurm_job_2: dict) -> dict:
    return ...


# 1. break into four tasks
# 2. run through each task
#    2.a if not decorated with slurm -- execute as normal
#    2.b if decorated with slurm -- pass to SlurmExecutor
# 3. SlurmExecutor should wait and pass some indicator of the results, interpreted as node output


# execution manager, this is custom but let's push this back to hamilton
class RemoteDelegatingExecutionManager(DefaultExecutionManager):
    def __init__(self, local_executor, remote_executor, indicator_tag="should_execute_remotely"):
        super().__init__(local_executor, remote_executor)
        self.indicator_tag = indicator_tag

    def get_executor_for_task(self, task: TaskImplementation) -> TaskExecutor:
        is_single_node_task = len(task.nodes) == 1
        if not is_single_node_task:
            raise ValueError("Only single node tasks supported")
        node, = task.nodes
        # tagged nodes go to the remote executor, everything else runs locally
        if self.indicator_tag in node.tags:
            return self.remote_executor
        return self.local_executor


def slurm(params):
    # Decorator that:
    # 0. Tags with the indicator above, with fn = tag(is_slurm_fn=True)(fn)
    # 1. takes a function, 
    # 2. uses it to evaluate a config
    # 3. launches a task
    # 4. Polls that task on a loop
    # 5. Returns the result or errors out (depending on how you want to do failure management)
    # Call out to your class for these?


dr = (
    driver
        .Builder()
        .enable_dynamic_execution(...)
        .with_execution_manager(
            RemoteDelegatingExecutionManager(
                SynchronousLocalTaskExecutor(),
                MultiThreadingExecutor(),
                indicator_tag="is_slurm_fn")
        )
)

dr.execute(["results"])
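Steps 1 to 5 of the slurm decorator outlined above could look roughly like the following. This is a sketch under stated assumptions: `remote_job`, `submit`, and `status` are hypothetical names, the two callables are injected so the example runs without a cluster, and the state codes ("PD", "R", "CD") are assumed to mirror squeue's short codes. Step 0 (tagging the function for the execution manager) is left out.

```python
import functools
import time
from typing import Callable


def remote_job(submit: Callable[[dict], str],
               status: Callable[[str], str],
               poll_interval: float = 60.0,
               running_states: tuple = ("PD", "R")):
    """Hypothetical decorator factory: run the function's config as a job and wait."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            config = fn(*args, **kwargs)        # 2. evaluate the config
            job_id = submit(config)             # 3. launch the task
            state = status(job_id)
            while state in running_states:      # 4. poll on a loop
                time.sleep(poll_interval)
                state = status(job_id)
            if state != "CD":                   # 5. error out on failure
                raise RuntimeError(f"job {job_id} ended in state {state}")
            return {"job_id": job_id, "state": state}
        return wrapper
    return decorator


# Fake scheduler for illustration: pending, then running, then completed.
states = iter(["PD", "R", "CD"])


@remote_job(submit=lambda cfg: "123456",
            status=lambda job_id: next(states),
            poll_interval=0.0)
def slurm_job_1(some_var: str) -> dict:
    return {"-n": 4, "cmd": f"echo {some_var}"}


print(slurm_job_1("hello"))
```

Raising on a non-completed state is only one option for failure management; returning the state and letting a downstream node decide would fit the same structure.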

3 replies

skrawcz Jan 5, 2024
Maintainer Author

Here's a simple proof of concept that works:

# cmdline.py
import functools
import subprocess

from hamilton.execution.executors import DefaultExecutionManager, TaskExecutor
from hamilton.execution.grouping import TaskImplementation


class CMDLineExecutionManager(DefaultExecutionManager):

    def get_executor_for_task(self, task: TaskImplementation) -> TaskExecutor:
        """Simple implementation that routes single-node tasks by tag.

        :param task: Task to get executor for
        :return: The remote executor if the node is tagged "cmdline", the local executor otherwise
        """
        is_single_node_task = len(task.nodes) == 1
        if not is_single_node_task:
            raise ValueError("Only single node tasks supported")
        node, = task.nodes
        if "cmdline" in node.tags:  # hard coded for now
            return self.remote_executor
        return self.local_executor



def cmdline_decorator(func):
    """Decorator to run the result of a function as a command line command."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Get the command from the function
        cmd = func(*args, **kwargs)

        # Run the command and capture the output
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

        # Return the output
        return result.stdout

    return wrapper

# funcs.py --- hamilton functions
from cmdline import cmdline_decorator
from hamilton.function_modifiers import tag

@tag(cmdline="yes")
@cmdline_decorator
def echo_1(start: str) -> str:
    return f'echo "1: {start}"'

@tag(cmdline="yes")
@cmdline_decorator
def echo_2(echo_1: str) -> str:
    return f'echo "2: {echo_1}"'

@tag(cmdline="yes")
@cmdline_decorator
def echo_2b(echo_1: str) -> str:
    return f'echo "2b: {echo_1}"'

@tag(cmdline="yes")
@cmdline_decorator
def echo_3(echo_2: str, echo_2b: str) -> str:
    return f'echo "3: {echo_2 + ":::" + echo_2b}"'

# -- run.py
from hamilton.execution.executors import SynchronousLocalTaskExecutor, MultiThreadingExecutor

if __name__ == '__main__':
    from hamilton import driver
    from cmdline import CMDLineExecutionManager
    import funcs

    dr = (
        driver
        .Builder()
        .enable_dynamic_execution(allow_experimental_mode=True)
        .with_execution_manager(
            CMDLineExecutionManager(
                SynchronousLocalTaskExecutor(),
                MultiThreadingExecutor(5))
        )
        .with_modules(funcs)
        .build()
    )

    print(dr.list_available_variables())
    # for var in dr.list_available_variables():
    #     print(dr.execute([var.name], inputs={"start": "hello"}))
    print(dr.execute(["echo_3"], inputs={"start": "hello"}))

So the "decorator" just needs to know how to handle slurm.

And creating the DAG works nicely too:

The result with this design is then:

Write functions that return what should be run for the slurm decorator to use.
You then write a separate post processing function from the output of that slurm task.
you can interleave other python code as you like.

I haven't prototyped the decorator wrapping a generator like we discussed above, but I think that's also a design possibility with this design.

The work_at decorator could still work I think, but the order of the decorators would matter here.
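A small experiment shows why the decorator order matters. The sketch below uses simplified stand-ins for work_at and the command line decorator (not the exact code from this thread), and passes the command as an argument list rather than a shell string so it runs on any platform.

```python
import functools
import os
import subprocess
import sys
import tempfile


def work_at(workdir):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            previous = os.getcwd()
            os.chdir(workdir)
            try:
                return fn(*args, **kwargs)
            finally:
                os.chdir(previous)
        return wrapper
    return decorator


def cmdline(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        cmd = fn(*args, **kwargs)
        done = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return done.stdout.strip()
    return wrapper


# Child process that reports the directory it was started in.
PRINT_CWD = [sys.executable, "-c", "import os; print(os.getcwd())"]


def make_jobs(workdir):
    @work_at(workdir)   # outermost: the directory change wraps the subprocess call
    @cmdline
    def outer():
        return PRINT_CWD

    @cmdline
    @work_at(workdir)   # innermost: only the command-building body runs in workdir
    def inner():
        return PRINT_CWD

    return outer, inner


with tempfile.TemporaryDirectory() as tmp:
    outer, inner = make_jobs(tmp)
    print("outer:", outer())  # the temporary directory
    print("inner:", inner())  # the directory the script was started from
```

So for the command itself to run in the target directory, work_at has to sit above the command decorator; placed below it, the directory is already restored by the time the command is launched.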

skrawcz Jan 6, 2024
Maintainer Author

Here's the decorator supporting a generator function to sandwich pre and post processing for the command:

import inspect  # needed in addition to the imports already in cmdline.py


def cmdline_decorator(func):
    """Decorator to run the result of a function as a command line command."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if inspect.isgeneratorfunction(func):
            # If the function is a generator, then we need to run it and capture the output
            # in order to return it
            gen = func(*args, **kwargs)
            cmd = next(gen)
            # Run the command and capture the output
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            try:
                gen.send(result)
                raise ValueError("Generator cannot have multiple yields.")
            except StopIteration as e:
                return e.value
        else:
            # Get the command from the function
            cmd = func(*args, **kwargs)

            # Run the command and capture the output
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

            # Return the output
            return result.stdout

    if inspect.isgeneratorfunction(func):
        # get the return type and set it as the return type of the wrapper
        wrapper.__annotations__["return"] = inspect.signature(func).return_annotation[2]
    return wrapper

Then a function could look like this:

@tag(cmdline="yes")
@cmdline_decorator
def echo_2b(echo_1: str) -> [str, CompletedProcess, str]:
    # preprocess
    print("preprocess")
    msg = f'echo "2b: {echo_1}"'
    completed_process = yield msg
    # postprocess
    print("postprocess")
    output = completed_process.stdout + "!!!"
    return output
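The example above annotates the return type with a plain list so that indexing with [2] works. An alternative, sketched here as an assumption rather than as what Hamilton requires, is to use typing.Generator and read the return type with typing.get_args. The decorator below is a generator-only variant written for this illustration, and it takes the command as an argument list.

```python
import collections.abc
import functools
import inspect
import subprocess
import sys
import typing
from subprocess import CompletedProcess
from typing import Generator


def cmdline_decorator(func):
    """Variant that expects generator functions annotated with typing.Generator."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        gen = func(*args, **kwargs)
        cmd = next(gen)  # run the pre-processing up to the single yield
        result = subprocess.run(cmd, capture_output=True, text=True)
        try:
            gen.send(result)  # resume with the finished process for post-processing
        except StopIteration as stop:
            return stop.value
        raise ValueError("Generator cannot have multiple yields.")

    annotation = inspect.signature(func).return_annotation
    if typing.get_origin(annotation) is collections.abc.Generator:
        # Generator[YieldType, SendType, ReturnType] -> keep only ReturnType
        wrapper.__annotations__["return"] = typing.get_args(annotation)[2]
    return wrapper


@cmdline_decorator
def echo_2b(echo_1: str) -> Generator[list, CompletedProcess, str]:
    # preprocess: build the command
    cmd = [sys.executable, "-c", f"print('2b: ' + {echo_1!r})"]
    completed_process = yield cmd
    # postprocess: decorate the captured output
    return completed_process.stdout.strip() + "!!!"


print(echo_2b("hi"))
```

The benefit is that type checkers understand the signature, while the wrapper still advertises the final return type through its annotations.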

Roy-Kid Jan 13, 2024

What a nice design, it is quite general! I can immediately use this code in my project to invoke other scientific binaries. I think slurm submission can be built on top of the cmd decorator plus the monitoring code.

skrawcz · 2024-01-15T19:37:08Z

skrawcz
Jan 15, 2024
Maintainer Author

Things to figure out:

Submission mechanics. Polling vs fire & forget.

Example user flow - 1 with polling:
Assumption: -- jobs don't take > 24 hours. Point of Hamilton DAG is preparing and running everything.

Write out Hamilton code to mirror what should be orchestrated.
Execute Hamilton code. (python run.py)
If "job" function is encountered. Submit job, and poll for result.
This blocks until job is completed.
Complete rest of Hamilton graph execution.

Example user flow - 2 with fire & forget:
Assumption -- there is a really long running job. Point of Hamilton DAG is to prepare everything to run that job.

Write out Hamilton code to mirror all the preprocessing that's required.
There could be intermediate slurm jobs that are "fast" and we'd want to wait on.
Once the "final" slurm function is encountered. We would fire & forget, and then complete Hamilton execution -- provide the ID, or something for someone to follow up on.
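The two user flows differ only in what happens after submission, which suggests a single entry point with a switch. The following is a sketch, not a proposed Hamilton API: `run_job`, `JobHandle`, and the state codes are hypothetical, and the 24-hour default simply mirrors the assumption stated for the polling flow.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class JobHandle:
    job_id: str
    state: str  # "SUBMITTED" for fire & forget, otherwise the last polled state


def run_job(submit: Callable[[], str],
            status: Callable[[str], str],
            wait: bool,
            poll_interval: float = 60.0,
            max_wait: float = 24 * 3600) -> JobHandle:
    """Submit a job; either return immediately or poll until it leaves the queue."""
    job_id = submit()
    if not wait:
        # Flow 2: hand back the id so someone can follow up later.
        return JobHandle(job_id, "SUBMITTED")

    # Flow 1: block until the job is neither pending nor running.
    deadline = time.monotonic() + max_wait
    state = status(job_id)
    while state in ("PD", "R"):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state} after {max_wait} seconds")
        time.sleep(poll_interval)
        state = status(job_id)
    return JobHandle(job_id, state)


# Fake scheduler for illustration.
states = iter(["PD", "R", "CD"])
print(run_job(lambda: "1", lambda job_id: next(states), wait=True, poll_interval=0.0))
print(run_job(lambda: "2", lambda job_id: "R", wait=False))
```

Intermediate "fast" jobs would use wait=True, and only the final long-running job would use wait=False, so the rest of the graph can finish and report the job id.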

0 replies
