Thanks for working on this, Martin!
I'm trying to do a simple cat operation to look at performance. Here's what I tried:
import rfsspec
bucket = "cmip6-pds"
# 53 MB object
key = "CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225/hfls/0.0.0"
url = f"s3://{bucket}/{key}"
rs3.cat(url)
# b'S3 ERRROR: <?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>InvalidBucketName</Code><Message>The specified bucket is not valid.</Message><BucketName>s3:</BucketName><RequestId>QR3RC7B9RMRTCTNN</RequestId><HostId>QoqRLQ03ZkncmistNC7OIEY8McgnijkP1j25CyHHLOON1MmlD0Xp5NuXdWBk5O6scsy0P1yjJ8w=</HostId></Error>'
rs3.cat(f"{bucket}/{key}")
# b'S3 ERRROR: <?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><Endpoint>cmip6-pds.s3-us-west-2.amazonaws.com</Endpoint><Bucket>cmip6-pds</Bucket><RequestId>3XWMHZ78EJ409A4J</RequestId><HostId>XIrupd9bzk4GqrN0TBsr44itH3BovkQ5WmKyljcyF6ka8fjBBvRRSUdxcm4nYMetst66bTKr8aU=</HostId></Error>'
I'm stuck! Can you help me figure out what to do?
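For what it's worth, the `<BucketName>s3:</BucketName>` in the first error suggests that the `s3://` prefix is not being stripped before the path is split into bucket and key. That is only my guess at what happens inside rfsspec. Here is a minimal sketch of the splitting I expected, using a hypothetical helper `split_s3_url` that is not part of rfsspec or s3fs:

```python
from urllib.parse import urlsplit


def split_s3_url(url):
    """Split 's3://bucket/key' or 'bucket/key' into (bucket, key).

    Hypothetical helper for illustration only; not an rfsspec or s3fs API.
    """
    parts = urlsplit(url)
    if parts.scheme == "s3":
        # For s3:// URLs the bucket is the netloc and the key is the path.
        return parts.netloc, parts.path.lstrip("/")
    if parts.scheme:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # No scheme: treat everything before the first "/" as the bucket.
    bucket, _, key = url.partition("/")
    return bucket, key


example_url = "s3://cmip6-pds/CMIP6/CMIP/example-key"  # shortened, made-up key

# Splitting on the first "/" without stripping the scheme gives the
# same bogus bucket name that appears in the error message.
naive_bucket = example_url.split("/", 1)[0]

bucket_name, object_key = split_s3_url(example_url)
print(naive_bucket, bucket_name, object_key)
```

The second error looks like a separate problem: the request appears to go to an endpoint outside the bucket's region (the response points at `cmip6-pds.s3-us-west-2.amazonaws.com`).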
For reference, here is what I am comparing with:
import s3fs
import boto3
s3 = s3fs.S3FileSystem()
s3_client = boto3.client('s3')
%timeit _ = s3.cat(url)
# 1.1 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit _ = s3_client.get_object(Bucket=bucket, Key=key)
# 326 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I've found that boto3 is 3-5x faster than s3fs for this basic operation, and this performance difference propagates through our whole stack. To get to the bottom of this, I decided to try rfsspec for the first time.
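One caveat on my own numbers: as far as I understand, boto3's `get_object` returns once the response headers arrive and exposes the payload as a streaming `Body`, so the 326 ms may not include downloading the 53 MB. A fairer comparison would force both calls to return the full bytes. Below is a rough sketch of a timing helper for that. `time_full_read` and `fake_fetch` are names I made up, and the stand-in fetch exists only so the snippet runs without network access:

```python
import statistics
import time


def time_full_read(fetch, repeat=7):
    """Call fetch() `repeat` times and return (mean_s, stdev_s, nbytes).

    `fetch` must return the complete object as bytes, so the download
    itself is included in the measured time.
    """
    durations = []
    nbytes = 0
    for _ in range(repeat):
        start = time.perf_counter()
        data = fetch()
        durations.append(time.perf_counter() - start)
        nbytes = len(data)
    # stdev needs at least two samples
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return statistics.mean(durations), spread, nbytes


def fake_fetch():
    """Local stand-in for a real download (illustration only)."""
    return b"x" * 1024


mean_s, stdev_s, size = time_full_read(fake_fetch)
print(f"{mean_s:.6f} s ± {stdev_s:.6f} s for {size} bytes")

# With the objects from the snippet above, the real calls would be:
# time_full_read(lambda: s3.cat(url))
# time_full_read(lambda: s3_client.get_object(Bucket=bucket, Key=key)["Body"].read())
```

Comparing the returned byte counts is a quick sanity check that both paths actually fetched the same object.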