Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propery locate podman images #1891

Merged
merged 8 commits into from
Aug 28, 2023

Conversation

jmfernandez
Copy link
Contributor

Use docker/podman/udocker inspect to check container images are already in the local registry.

Additionally, removed superfluous udocker/podman pull, which was done even when --disable-pull parameter is passed.

This should fix issue #1884

@codecov
Copy link

codecov bot commented Aug 21, 2023

Codecov Report

Merging #1891 (60fd77c) into main (bd89c56) will decrease coverage by 0.04%.
The diff coverage is 57.57%.

@@            Coverage Diff             @@
##             main    #1891      +/-   ##
==========================================
- Coverage   83.99%   83.95%   -0.04%     
==========================================
  Files          46       46              
  Lines        8152     8152              
  Branches     2170     2168       -2     
==========================================
- Hits         6847     6844       -3     
- Misses        836      838       +2     
- Partials      469      470       +1     
Files Changed Coverage Δ
cwltool/job.py 81.67% <51.85%> (-1.50%) ⬇️
cwltool/docker.py 76.62% <83.33%> (+3.67%) ⬆️

... and 1 file with indirect coverage changes

cwltool/docker.py Outdated Show resolved Hide resolved
@jmfernandez
Copy link
Contributor Author

jmfernandez commented Aug 22, 2023

I have just seen at https://github.com/common-workflow-language/cwltool/actions/runs/5937318373/job/16099520096?pr=1891 that there are several tests which are not passed now, due the error messages are apparently different. But I don't know why, because I have tried to reproduce it from command line, and it returned what it was expected:

(.cwltool) jmfernandez@pavonis[47]:~/projects/cwltool_podman_images> cwltool --debug --user-space-docker-cmd=quaquioN --default-container=docker.io/debian:stable-slim tests/wf/wc-tool.cwl tests/wf/wc-job.json &
[1] 12218
INFO /home/jmfernandez/projects/cwltool_podman_images/.cwltool/bin/cwltool 3.1
INFO Resolved 'tests/wf/wc-tool.cwl' to 'file:///home/jmfernandez/projects/cwltool_podman_images/tests/wf/wc-tool.cwl'
DEBUG [job wc-tool.cwl] initializing from file:///home/jmfernandez/projects/cwltool_podman_images/tests/wf/wc-tool.cwl
DEBUG [job wc-tool.cwl] {
    "file1": {
        "class": "File",
        "location": "file:///home/jmfernandez/projects/cwltool_podman_images/tests/wf/whale.txt",
        "size": 1111,
        "basename": "whale.txt",
        "nameroot": "whale",
        "nameext": ".txt"
    }
}
DEBUG [job wc-tool.cwl] path mappings is {
    "file:///home/jmfernandez/projects/cwltool_podman_images/tests/wf/whale.txt": [
        "/home/jmfernandez/projects/cwltool_podman_images/tests/wf/whale.txt",
        "/var/lib/cwl/stg59928477-76a6-42f4-aedb-1cbd7a861c7f/whale.txt",
        "File",
        true
    ]
}
DEBUG [job wc-tool.cwl] command line bindings is [
    {
        "position": [
            -1000000,
            0
        ],
        "datum": "wc"
    },
    {
        "position": [
            -1000000,
            1
        ],
        "datum": "-l"
    }
]
INFO ['quaquioN', 'pull', 'docker.io/debian:stable-slim']
ERROR Workflow error:
tests/wf/wc-tool.cwl:8:4: Either Docker container docker.io/debian:stable-slim is not available
                          with user space docker implementation quaquioN  or quaquioN is missing or
                          broken.
Traceback (most recent call last):
  File "/home/jmfernandez/projects/cwltool_podman_images/cwltool/job.py", line 754, in run
    subprocess.check_call(cmd, stdout=sys.stderr)  # nosec
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 408, in check_call
    retcode = call(*popenargs, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 389, in call
    with Popen(*popenargs, **kwargs) as p:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.11/subprocess.py", line 1950, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'quaquioN'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jmfernandez/projects/cwltool_podman_images/cwltool/main.py", line 1298, in main
    (out, status) = real_executor(
                    ^^^^^^^^^^^^^^
  File "/home/jmfernandez/projects/cwltool_podman_images/cwltool/executors.py", line 61, in __call__
    return self.execute(process, job_order_object, runtime_context, logger)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jmfernandez/projects/cwltool_podman_images/cwltool/executors.py", line 144, in execute
    self.run_jobs(process, job_order_object, logger, runtime_context)
  File "/home/jmfernandez/projects/cwltool_podman_images/cwltool/executors.py", line 240, in run_jobs
    job.run(runtime_context)
  File "/home/jmfernandez/projects/cwltool_podman_images/cwltool/job.py", line 756, in run
    raise SourceLine(docker_req, None, WorkflowException, debug).makeError(
cwltool.errors.WorkflowException: tests/wf/wc-tool.cwl:8:4: Either Docker container docker.io/debian:stable-slim is not available with user space docker implementation quaquioN  or quaquioN is missing or broken.

[1]+  Exit 1                  cwltool --debug --user-space-docker-cmd=quaquioN --default-container=docker.io/debian:stable-slim tests/wf/wc-tool.cwl tests/wf/wc-job.json

@mr-c
Copy link
Member

mr-c commented Aug 22, 2023

@jmfernandez
See https://github.com/common-workflow-language/cwltool/actions/runs/5937318373/job/16099520096?pr=1891#step:8:96

The test is looking for the exact phrase or quaquioN is missing or broken (the old behavior) but now we get 'quaquioN' not found: [Errno 2] No such file or directory: 'quaquioN' (which is more useful, thanks!)

@jmfernandez
Copy link
Contributor Author

I still don't understand the reason of the "irreproducibility" of the test failure. Along today I got the test failure at least twice when all the tests were run, but not always. And never when only the failing test is run (through tox -- "-k test_bad_userspace_runtime").

In any case, I have updated the failing test case, but I'm afraid it is not the real workaround / solution.

This approach avoids the issue about podman using "canonical"
image tags just using the "inspect" subcommand.
This udocker/podman pull command was being done even when
`--disable-pull` parameter is passed to `cwltool`, and it
should not be needed, as all container image pulls should
have happened at this point.
Also, cosmetic changes to fix some flake8 complaints.
@jmfernandez
Copy link
Contributor Author

The failing test shows that there seems to be some kind of unpredictable slowness when udocker is used. As you can see at https://github.com/common-workflow-language/cwltool/actions/runs/5962544324/job/16174584481?pr=1891 a timeout is raised, but it doesn't happen at https://github.com/common-workflow-language/cwltool/actions/runs/5962544324/job/16174585196?pr=1891

When I was debugging the source of this condition I was tracing the source. Next line calls cwltool code:

rc = main(argsl=args, stdout=stdout, stderr=stderr)

which ends leading to next code snippet at _job_popen:

cwltool/cwltool/job.py

Lines 967 to 985 in bd89c56

if timelimit is not None and timelimit > 0:
def terminate(): # type: () -> None
try:
_logger.warning(
"[job %s] exceeded time limit of %d seconds and will be terminated",
name,
timelimit,
)
sproc.terminate()
except OSError:
pass
tm = Timer(timelimit, terminate)
tm.daemon = True
tm.start()
if monitor_function:
monitor_function(sproc)
rcode = sproc.wait()

I detected it was remaining stuck at line 984, whose monitoring body for this use case is at

cwltool/cwltool/job.py

Lines 507 to 537 in bd89c56

def process_monitor(self, sproc: "subprocess.Popen[str]") -> None:
"""Watch a process, logging its max memory usage."""
monitor = psutil.Process(sproc.pid)
# Value must be list rather than integer to utilise pass-by-reference in python
memory_usage: MutableSequence[Optional[int]] = [None]
def get_tree_mem_usage(memory_usage: MutableSequence[Optional[int]]) -> None:
children = monitor.children()
try:
rss = monitor.memory_info().rss
while len(children):
rss += sum(process.memory_info().rss for process in children)
children = list(itertools.chain(*(process.children() for process in children)))
if memory_usage[0] is None or rss > memory_usage[0]:
memory_usage[0] = rss
except psutil.NoSuchProcess:
mem_tm.cancel()
mem_tm = Timer(interval=1, function=get_tree_mem_usage, args=(memory_usage,))
mem_tm.daemon = True
mem_tm.start()
sproc.wait()
mem_tm.cancel()
if memory_usage[0] is not None:
_logger.info(
"[job %s] Max memory used: %iMiB",
self.name,
round(memory_usage[0] / (2**20)),
)
else:
_logger.debug("Could not collect memory usage, job ended before monitoring began.")

and then it is waiting just at line 528 until the timeout happens

sproc.wait()

I did a test where I removed the process termination, and it finished properly later than the timelimit. So, I raised the timelimit at:

timelimit: 15

to 20, and it started working!

@jmfernandez
Copy link
Contributor Author

Also, I realized that the max memory usage monitoring being done was happening only once, after the first second of the process running. And that both the timelimit and monitoring features were not implemented for the cases where a shelled popen was used.

@mr-c mr-c enabled auto-merge (rebase) August 28, 2023 14:27
Copy link
Member

@mr-c mr-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jmfernandez !

@mr-c mr-c disabled auto-merge August 28, 2023 14:27
@mr-c mr-c merged commit 509ffb9 into common-workflow-language:main Aug 28, 2023
40 of 42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants