Simple way to output a directory artifact with Pydantic I/O? #1482

@alicederyn


What's the intended vision for a Pydantic I/O script that needs to output a directory artifact (e.g. to be tarred)?

Hybrid

I ended up using an odd hybrid; here's a simplified example:

from pathlib import Path

_DOWNLOAD_PATH = Path("/tmp/repo")  # must be a Path, since .as_posix() is called on it below

@script(outputs=Artifact(name="repo", path=_DOWNLOAD_PATH.as_posix(), archive=TarArchiveStrategy()))
def download_repository(input: DownloadRepositoryInput) -> DownloadRepositoryOutput:
  extract_repo(input.repo, _DOWNLOAD_PATH)  # Download to the right place to end up in the artifact
  details = extract_repo_details(_DOWNLOAD_PATH)
  return DownloadRepositoryOutput(details=details)

Output in input

I believe there is supposed to be a way to mark an artifact as an output in the input class to get handed a Path, but I couldn't get that to work.

class DownloadRepositoryInput:
  ...
  repo: Annotated[Path, Artifact(name="repo", archive=TarArchiveStrategy(), output=True)]

I tried this, but I can't create the workflow template any more:

  File "workflow_template.py", line 90, in get_workflow_template
    repo_artifact = download_output.get_artifact("repo")
  File "hera/workflows/_mixins.py", line 1001, in get_artifact
    return self._get_artifact(name=name, subtype=self._subtype)
  File "hera/workflows/_mixins.py", line 974, in _get_artifact
    raise ValueError(f"Cannot get output artifacts when the template has no outputs: {template}")
ValueError: Cannot get output artifacts when the template has no outputs:

It's also surprising to declare an output on an input object, and I'm not sure how this would work with decorator syntax, where there would be no way to access the output artifact.

Path in output

Ideally I'd write something like this:

class DownloadRepositoryOutput(Output):
  repo: Annotated[Path, Artifact(name="repo", archive=TarArchiveStrategy())]
  ...

@script()
def download_repository(input: DownloadRepositoryInput) -> DownloadRepositoryOutput:
  repo_path = Path("./repo")
  extract_repo(input.repo, repo_path)
  details = extract_repo_details(repo_path)
  return DownloadRepositoryOutput(repo=repo_path, details=details)

However, the path needs to be in the YAML, and unsurprisingly it's not the path I returned:

    outputs:
      artifacts:
      - name: repo
        path: /tmp/hera-outputs/artifacts/repo
        archive:
          tar: {}

Also, the code fails at runtime:

  File "hera/workflows/_runner/util.py", line 259, in _runner
    output = _save_annotated_return_outputs(function(**function_kwargs), output_annotations)
  File "hera/workflows/_runner/script_annotations_util.py", line 234, in _save_annotated_return_outputs
    _write_to_path(path, value, _get_dumper_function(matching_output))
  File "hera/workflows/_runner/script_annotations_util.py", line 303, in _write_to_path
    dumped_output = dumper(output_value)
...
  File "hera/shared/serialization.py", line 47, in serialize
    return json.dumps(value, cls=PydanticEncoder)  # None serialized as `null`
...
TypeError: Object of type PosixPath is not JSON serializable
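
The final error can be reproduced without Hera, which suggests the returned Path is being treated as a value to serialize rather than as a location to collect. A minimal sketch using only the standard library (the variable names are just for illustration, and this says nothing about what PydanticEncoder does beyond what the traceback shows):

```python
import json
from pathlib import Path

repo_path = Path("./repo")

# The default JSON encoder has no handler for Path objects, so this raises.
try:
    json.dumps(repo_path)
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__, error)

# Converting to a string first serializes fine, but that would only write the
# path *text* into the artifact file, not the directory contents.
serialized = json.dumps(repo_path.as_posix())
print(serialized)
```

So making the value serializable would not be enough on its own; the runner would need to handle Path-typed artifact outputs specially.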

Would it be reasonable to get this solution working? It seems like the right approach to support decorator syntax. We'd need to ensure Argo finds the right directory -- a symlink might work? If not, we can do a recursive copy. (A move sounds sensible, except a user might expect to be able to provide the same Path multiple times, or even nested Paths.)
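
To make the symlink/copy idea concrete, here is a rough sketch of what the runner could do with a returned directory Path. `publish_directory` is a hypothetical helper, not an existing Hera function, and whether Argo's archiver follows a symlink at the artifact path is an untested assumption:

```python
import shutil
from pathlib import Path


def publish_directory(src: Path, artifact_path: Path, use_symlink: bool = False) -> Path:
    """Expose the directory `src` at `artifact_path` (hypothetical helper)."""
    src = src.resolve()
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a directory")

    # Ensure the parent (e.g. /tmp/hera-outputs/artifacts) exists.
    artifact_path.parent.mkdir(parents=True, exist_ok=True)

    if use_symlink:
        # Cheap, but only works if the archiver follows the link (assumption).
        artifact_path.symlink_to(src, target_is_directory=True)
    else:
        # Copy rather than move, so the same source Path (or a nested one)
        # can be returned for several outputs.
        shutil.copytree(src, artifact_path)
    return artifact_path
```

With the copy variant, calling it twice with the same `src` and two different artifact paths leaves the source in place and produces two independent copies, which is the case a move would break.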
