Use of s3() within parallel_map() #2069

hasush · 2024-09-30T23:15:27Z

Consider this call:

with metaflow.S3() as s3i:
    result = s3i.info_many(s3_path, return_missing=True)

Can this be put inside metaflow.multicore_utils.parallel_map?

i.e. parallel_map(wrapper_for_s3_info_many, s3_paths)

When I try, I get this error:

2024-09-30 23:08:29.391 [261693/start/3201226 (pid 1400063)] metaflow.plugins.datatools.s3.s3.MetaflowS3URLException: Specify S3(run=self) when you use S3 inside a running flow. Otherwise you have to use S3 with full s3:// urls.
2024-09-30 23:08:29.391 [261693/start/3201226 (pid 1400063)] Internal error

However, s3_paths = ["s3://path/to/something.jpg", "s3://path/to/something_else.jpg", ...], and I am certain that every path in s3_paths starts with "s3://".

Putting run=self in the S3 instantiation within the wrapper yields:

2024-09-30 23:21:21.832 [261699/start/3201250 (pid 1405459)] S3 non-transient error (attempt #1): s3op failed:
2024-09-30 23:21:21.913 [261699/start/3201250 (pid 1405459)] Invalid url: /

savingoyal · 2024-10-06T14:59:41Z

@hasush the s3.xx_many calls are already parallelized behind the scenes, so one shouldn't necessarily need parallel_map. Regardless, the error that you highlighted looks like a bug that we will address.
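
As a rough sketch of the suggestion above (not the maintainers' fix for the bug), the whole list could be passed to a single info_many call instead of wrapping it in parallel_map. The helper names find_non_s3_urls and info_for_paths are hypothetical and exist only for illustration. The sketch assumes the code runs where Metaflow is installed and S3 credentials are configured. The URL check is only a debugging aid for errors like "Invalid url: /".

```python
from typing import Dict, Iterable, List


def find_non_s3_urls(paths: Iterable[str]) -> List[str]:
    """Return every entry that is not a full s3:// URL (hypothetical helper)."""
    return [p for p in paths if not isinstance(p, str) or not p.startswith("s3://")]


def info_for_paths(paths: Iterable[str]) -> Dict[str, bool]:
    """Map each URL to whether the object exists, using one info_many call."""
    paths = list(paths)

    # Fail early with a readable message if anything is not a full s3:// URL.
    bad = find_non_s3_urls(paths)
    if bad:
        raise ValueError(f"Expected full s3:// URLs, got: {bad[:5]}")

    # Imported lazily so the validation helper works without Metaflow installed.
    from metaflow import S3

    # Per the reply above, the *_many calls are parallelized internally,
    # so no parallel_map wrapper is used here.
    with S3() as s3:
        objs = s3.info_many(paths, return_missing=True)
        # Read the properties while the S3 context is still open.
        return {obj.url: obj.exists for obj in objs}
```

Inside a step, this would be called as info_for_paths(s3_paths) with the same list that was previously handed to parallel_map.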
