
pipelines stuck on remote #1328

Open
lolpa1n opened this issue Sep 16, 2024 · 6 comments

Comments

@lolpa1n

lolpa1n commented Sep 16, 2024

Hello,

I deployed a ClearML server on my machine and wanted to build pipelines.

My code:

from clearml import PipelineController

def read_data(csv_path):
    import pandas as pd

    df = pd.read_csv(csv_path)
    return df

def log_data(df):
    from clearml import Logger

    logger = Logger.current_logger()
    logger.report_table(title='Dataframe FULL', series='pandas DataFrame', iteration=1, table_plot=df)
    print("Logged data successfully")
    return True

if __name__ == '__main__':
    pipe = PipelineController(
        project='ASMBT', 
        name='test_pipe',
        version='1.0',
        add_pipeline_tags=True,
        # working_dir='.'
    )
    pipe.add_parameter(
        name='csv_path',
        description='path to csv file', 
        default='../data/data.csv'
    )
    pipe.add_function_step(
        name='read_data',
        function=read_data,
        function_kwargs=dict(csv_path='${pipeline.csv_path}'),
        function_return=['data_frame'],
        cache_executed_step=False,
        execution_queue='test_gpu'
    )
    pipe.add_function_step(
        name='log_data',
        function=log_data,
        function_kwargs=dict(df='${read_data.data_frame}'),
        cache_executed_step=False,
        execution_queue='test_gpu'
    )
    # pipe.start_locally(run_pipeline_steps_locally=True)
    pipe.start(queue='test_gpu')
    print('pipeline completed')

If I execute pipe.start_locally(run_pipeline_steps_locally=True), everything works.
But if I change it to pipe.start(queue='test_gpu') after running:
clearml-agent daemon --detached --queue test_gpu --gpus 0

then nothing happens and the step stays green with status QUEUED.
logs:

Launching the next 1 steps
Launching step [read_data]
2024-09-12 17:16:01
Launching step: read_data
Parameters:
{'kwargs/csv_path': '${pipeline.csv_path}'}
Configurations:
{}
Overrides:
{} 

Please tell me how to do this correctly, for example if I want to select a specific GPU for the launch.

@ainoam
Collaborator

ainoam commented Sep 16, 2024

@lolpa1n As you'll see in this example, the queue specified in PipelineController.start is the one through which the controller itself will be executed.
The queue through which the pipeline steps will be executed is controlled through PipelineController.set_default_execution_queue.
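A minimal sketch of this split, based on the snippet from this thread. The 'test_gpu' queue name comes from the original script; using a separate 'services' queue for the controller is an assumption, not something stated above:

```python
from clearml import PipelineController

pipe = PipelineController(
    project='ASMBT',
    name='test_pipe',
    version='1.0',
)

# Steps that do not specify their own execution_queue are enqueued here
pipe.set_default_execution_queue('test_gpu')

# ... pipe.add_function_step(...) calls as in the original script ...

# The queue passed to start() only runs the lightweight controller task;
# it should be served by a different agent than the one serving the steps.
pipe.start(queue='services')
```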

@lolpa1n
Author

lolpa1n commented Sep 16, 2024

> @lolpa1n As you'll see in this example, the queue specified in PipelineController.start is the one through which the controller itself will be executed. The queue through which the pipeline steps will be executed is controlled through PipelineController.set_default_execution_queue

I added pipe.set_default_execution_queue(default_execution_queue='test_gpu'), but nothing happens:
[screenshot]

Maybe this is because I'm running on the same machine?

@suparshukov

Hello,
I have a similar problem.
I run pipelines in remote-execution mode. The pipeline gets into the queue and the agent starts the container. Sometimes the first stage works but the second stage never starts (stuck at "Launching step..."), and sometimes even the first stage doesn't launch: the "Launching step [stage name]" message just hangs with no errors.
When run_locally() is executed, the pipeline completes fully.

@ainoam
Collaborator

ainoam commented Oct 10, 2024

@lolpa1n Sounds like the issue is rather that you are using the same agent (you can deploy multiple agents on the same machine): it can't take care of the steps since it's busy handling the pipeline controller, which in turn is waiting for the steps to complete.

@suparshukov Not sure the same applies to your use case? I think you'll need to take a look at the logs inside the container that appears as if it's not doing anything, to better isolate the issue.
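A sketch of the two-agent setup suggested above, on a single machine. The 'test_gpu' queue and the --gpus 0 flag come from the commands earlier in this thread; the 'services' queue name for the controller is an assumption:

```shell
# Agent 1: serves only the pipeline controller; no GPU needed
clearml-agent daemon --detached --queue services --cpu-only

# Agent 2: serves the pipeline steps on GPU 0 (as in the original command)
clearml-agent daemon --detached --queue test_gpu --gpus 0
```

With pipe.start(queue='services') and pipe.set_default_execution_queue('test_gpu'), the controller and the steps are handled by different agents, so neither blocks the other.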

@kiranzo

kiranzo commented Oct 15, 2024

Experimenting with steps from functions and the draft=True option, I'm getting the same result: the first step of my pipeline just hangs indefinitely. I have 2 ClearML agents, each with its own queue: one is --cpu-only and handles the pipeline controller, and the other uses the GPU and its queue is set as the default.

Also, I'm using a freshly built Docker image that I didn't push to our artifactory. I set it in the ClearML UI, and during execution it said:
Error response from daemon: pull access denied for clearml_worker_etl_test, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
If it went ahead, tried to pull it from the artifactory, couldn't find it, and then continued with the execution anyway, where does it actually execute the pipeline?
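One common way to make a locally built image pullable by an agent is to tag and push it to a registry the agent machine can reach. This is a generic Docker sketch, not ClearML-specific; the registry URL below is hypothetical, and the image name comes from the error message above:

```shell
# Tag the local image with the registry path (registry.example.com is hypothetical)
docker tag clearml_worker_etl_test registry.example.com/clearml_worker_etl_test:latest

# Authenticate and push so the agent's "docker pull" can succeed
docker login registry.example.com
docker push registry.example.com/clearml_worker_etl_test:latest
```

The image set in the ClearML UI would then need to be the full registry path rather than the bare local name.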

@jkhenning
Member

Hi @kiranzo, can you include the complete log?
