How to do a simple cat with s3 #7

@rabernat

Thanks for working on this Martin!

I'm trying to do a simple cat operation to look at performance. Here's what I tried:

import rfsspec

bucket = "cmip6-pds"
# 53 MB object
key = "CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225/hfls/0.0.0"
url = f"s3://{bucket}/{key}"

rs3.cat(url)
# b'S3 ERRROR: <?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>InvalidBucketName</Code><Message>The specified bucket is not valid.</Message><BucketName>s3:</BucketName><RequestId>QR3RC7B9RMRTCTNN</RequestId><HostId>QoqRLQ03ZkncmistNC7OIEY8McgnijkP1j25CyHHLOON1MmlD0Xp5NuXdWBk5O6scsy0P1yjJ8w=</HostId></Error>'

rs3.cat(f"{bucket}/{key}")
# b'S3 ERRROR: <?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Endpoint>cmip6-pds.s3-us-west-2.amazonaws.com</Endpoint><Bucket>cmip6-pds</Bucket><RequestId>3XWMHZ78EJ409A4J</RequestId><HostId>XIrupd9bzk4GqrN0TBsr44itH3BovkQ5WmKyljcyF6ka8fjBBvRRSUdxcm4nYMetst66bTKr8aU=</HostId></Error>'

I'm stuck! Can you help me figure out what to do?
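
Reading the two errors side by side: the first reports `<BucketName>s3:</BucketName>`, which suggests the `s3://` scheme was treated as part of the path and `s3:` was taken as the bucket. The second gets the bucket right but is redirected to `cmip6-pds.s3-us-west-2.amazonaws.com`, so the request apparently went to an endpoint outside the bucket's region. As a minimal sketch of the first point only, here is how the URL could be split into bucket and key before the call. `split_s3_url` is a hypothetical helper, not part of rfsspec, and whether rfsspec wants the path in this form is an assumption; how to point it at a region is not something shown here.

```python
def split_s3_url(url):
    """Split 's3://bucket/key' (or 'bucket/key') into (bucket, key).

    Hypothetical helper for illustration; not an rfsspec function.
    """
    # Drop the scheme if present, so "s3:" is never mistaken for the bucket.
    prefix = "s3://"
    path = url[len(prefix):] if url.startswith(prefix) else url
    # Everything before the first "/" is the bucket, the rest is the key.
    bucket, _, key = path.partition("/")
    return bucket, key


url = "s3://cmip6-pds/CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225/hfls/0.0.0"
bucket, key = split_s3_url(url)
print(bucket)
print(key)
```

Both spellings used above (`url` and `f"{bucket}/{key}"`) give the same pair with this helper, which makes it easier to see that the remaining failure is about the endpoint rather than the path.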


For reference, here is what I am comparing with:

import s3fs
import boto3

s3 = s3fs.S3FileSystem()
s3_client = boto3.client('s3')

%timeit _ = s3.cat(url)
# 1.1 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit _ = s3_client.get_object(Bucket=bucket, Key=key)
# 326 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I've found that boto3 is 3-5x faster than s3fs for this basic operation, and this performance difference propagates through our whole stack. In trying to get to the bottom of this, I decided to try rfsspec for the first time.
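
One caveat on the comparison: `get_object` returns a response whose `Body` is a stream, so the boto3 timing above may not include transferring the full 53 MB, whereas `s3.cat(url)` returns the bytes. A sketch of a closer like-for-like call follows. `fetch_object_bytes` is a hypothetical helper name, and I have not measured how much this changes the numbers.

```python
def fetch_object_bytes(client, bucket, key):
    """Download a whole object through a boto3-style S3 client.

    Hypothetical helper for illustration. `client` is assumed to be the
    result of boto3.client("s3"), or anything with the same get_object API.
    """
    response = client.get_object(Bucket=bucket, Key=key)
    # `Body` is a streaming object; the payload is only fully
    # transferred once it is read.
    return response["Body"].read()


# Intended use in the notebook above (not run here):
#   %timeit _ = fetch_object_bytes(s3_client, bucket, key)
```

Timing this instead of the bare `get_object` call would compare two operations that both end with the object's bytes in memory.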
