
pipelines stuck on remote #1328

Open
lolpa1n opened this issue Sep 16, 2024 · 6 comments

Comments

@lolpa1n

lolpa1n commented Sep 16, 2024

Hello,

I deployed a ClearML server on my machine and wanted to build pipelines.

My code:

from clearml import PipelineController

def read_data(csv_path):
    import pandas as pd

    df = pd.read_csv(csv_path)
    return df

def log_data(df):
    from clearml import Logger

    logger = Logger.current_logger()
    logger.report_table(title='Dataframe FULL', series='pandas DataFrame', iteration=1, table_plot=df)
    print("Logged data successfully")
    return True

if __name__ == '__main__':
    pipe = PipelineController(
        project='ASMBT', 
        name='test_pipe',
        version='1.0',
        add_pipeline_tags=True,
        # working_dir='.'
    )
    pipe.add_parameter(
        name='csv_path',
        description='path to csv file', 
        default='../data/data.csv'
    )
    pipe.add_function_step(
        name='read_data',
        function=read_data,
        function_kwargs=dict(csv_path='${pipeline.csv_path}'),
        function_return=['data_frame'],
        cache_executed_step=False,
        execution_queue='test_gpu'
    )
    pipe.add_function_step(
        name='log_data',
        function=log_data,
        function_kwargs=dict(df='${read_data.data_frame}'),
        cache_executed_step=False,
        execution_queue='test_gpu'
    )
    # pipe.start_locally(run_pipeline_steps_locally=True)
    pipe.start(queue='test_gpu')
    print('pipeline completed')

If I execute pipe.start_locally(run_pipeline_steps_locally=True), everything works.
But if I change it to pipe.start(queue='test_gpu') after running:
clearml-agent daemon --detached --queue test_gpu --gpus 0

then nothing happens and the step stays green with status QUEUED.
logs:

Launching the next 1 steps
Launching step [read_data]
2024-09-12 17:16:01
Launching step: read_data
Parameters:
{'kwargs/csv_path': '${pipeline.csv_path}'}
Configurations:
{}
Overrides:
{} 

Please tell me how to do this correctly, for example if I want to select a specific GPU for the launch.

@ainoam
Collaborator

ainoam commented Sep 16, 2024

@lolpa1n As you'll see in this example, the queue specified in PipelineController.start is the one through which the controller itself will be executed.
The queue through which the pipeline steps will be executed is controlled through PipelineController.set_default_execution_queue.
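A minimal sketch of this split, based on the snippet from this thread. The 'test_gpu' queue name comes from the original script; using a separate 'services' queue for the controller is an assumption, not something stated above:

```python
from clearml import PipelineController

pipe = PipelineController(
    project='ASMBT',
    name='test_pipe',
    version='1.0',
)

# Steps that do not specify their own execution_queue are enqueued here
pipe.set_default_execution_queue('test_gpu')

# ... pipe.add_function_step(...) calls as in the original script ...

# The queue passed to start() only runs the lightweight controller task;
# it should be served by a different agent than the one serving the steps.
pipe.start(queue='services')
```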

@lolpa1n
Author

lolpa1n commented Sep 16, 2024

> @lolpa1n As you'll see in this example, the queue specified in PipelineController.start is the one through which the controller itself will be executed. The queue through which the pipeline steps will be executed is controlled through PipelineController.set_default_execution_queue

I added pipe.set_default_execution_queue(default_execution_queue='test_gpu'), but nothing happens:
[screenshot]

Maybe this is because I'm running on the same machine?

@suparshukov

Hello,
I have a similar problem.
I run pipelines in remote-execution mode. The pipeline gets into the queue and the agent starts the container. Sometimes the first stage works but the second stage never starts (stuck at "Launching step..."), and sometimes even the first stage doesn't launch: the "Launching step [stage name]" message just hangs with no errors.
When run_locally() is executed, the pipeline completes fully.

@ainoam
Collaborator

ainoam commented Oct 10, 2024

@lolpa1n Sounds like the issue is rather that you are using the same agent (you can deploy multiple agents on the same machine): it can't take care of the steps since it's busy handling the pipeline controller, which in turn is waiting for the steps to complete.

@suparshukov Not sure the same applies to your use case? I think you'll need to take a look at the logs inside the container that appears as if it's not doing anything, to better isolate the issue.
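A sketch of the two-agent setup suggested above, on a single machine. The 'test_gpu' queue and the --gpus 0 flag come from the commands earlier in this thread; the 'services' queue name for the controller is an assumption:

```shell
# Agent 1: serves only the pipeline controller; no GPU needed
clearml-agent daemon --detached --queue services --cpu-only

# Agent 2: serves the pipeline steps on GPU 0 (as in the original command)
clearml-agent daemon --detached --queue test_gpu --gpus 0
```

With pipe.start(queue='services') and pipe.set_default_execution_queue('test_gpu'), the controller and the steps are handled by different agents, so neither blocks the other.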

@kiranzo

kiranzo commented Oct 15, 2024

Experimenting with steps from functions and the draft=True option, I'm getting the same result: the first step of my pipeline just hangs indefinitely. I have 2 ClearML agents, each with its own queue: one is --cpu-only and handles the pipeline controller, and the other uses the GPU and its queue is set as the default.

Also, I'm using a freshly built Docker image that I didn't push to our artifactory. I set it in the ClearML UI, and during execution it said:
Error response from daemon: pull access denied for clearml_worker_etl_test, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
If it went ahead, tried to pull it from the artifactory, couldn't find it, and then continued with the execution anyway, where does it actually execute the pipeline?
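One common way to make a locally built image pullable by an agent is to tag and push it to a registry the agent machine can reach. This is a generic Docker sketch, not ClearML-specific; the registry URL below is hypothetical, and the image name comes from the error message above:

```shell
# Tag the local image with the registry path (registry.example.com is hypothetical)
docker tag clearml_worker_etl_test registry.example.com/clearml_worker_etl_test:latest

# Authenticate and push so the agent's "docker pull" can succeed
docker login registry.example.com
docker push registry.example.com/clearml_worker_etl_test:latest
```

The image set in the ClearML UI would then need to be the full registry path rather than the bare local name.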

@jkhenning
Member

Hi @kiranzo, can you include the complete log?
