In this little guide I will describe how to set up a project-specific bucket in Google Cloud Storage.
For this you need to either have your own account or be invited to a project (like the pangeo project in this example).
Open the Google Cloud Console and navigate to the Storage tab.
Now you can create a bucket. The form that opens up will ask you a few things. You have to choose a unique name for the bucket first. Then you can decide where to store the data. Since the pangeo cloud instances run on us-central1, I opted for the Region option and then selected us-central1 under the Location drop-down.
Leave all other settings as the defaults.
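If you prefer to work from a notebook or script instead of clicking through the console, the same bucket can also be created with the google-cloud-storage client library. This is only a minimal sketch with placeholder names, assuming you already have credentials that are allowed to create buckets in the project:

from google.cloud import storage

# Placeholders: replace <project-id> and <bucketname> with your own values
client = storage.Client(project='<project-id>')
bucket = client.create_bucket('<bucketname>', location='us-central1')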
Go to IAM & Admin and create a new service account. For now you only need to give it a name and a description. Leave the other settings as they are. Copy the generated email, which will be something like <service-account-name>@<project-id>.iam.gserviceaccount.com.
I think there is a way to set permissions (the next step) directly here now, but I have not figured out how to do that.
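For completeness, the service account itself can apparently also be created programmatically through the IAM API. This is an untested sketch based on the documented client pattern, with <project-id> and <account-name> as placeholders, assuming your default credentials are allowed to manage service accounts:

import googleapiclient.discovery

# Placeholders: <project-id> and <account-name> are up to you
iam = googleapiclient.discovery.build('iam', 'v1')
account = iam.projects().serviceAccounts().create(
    name='projects/<project-id>',
    body={
        'accountId': '<account-name>',
        'serviceAccount': {'displayName': 'Access to my pangeo bucket'},
    },
).execute()
print(account['email'])  # this is the email you need in the next step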
Now navigate back to your bucket, click the Permissions tab, add a member, and paste the email from above. Choose Storage Admin (you might have to search for it) as the Role.
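If you would rather grant the role from code, the google-cloud-storage client can modify the bucket's IAM policy as well. Again a rough sketch with placeholders, assuming your own credentials are allowed to change bucket permissions (roles/storage.admin is the ID behind the Storage Admin role):

from google.cloud import storage

client = storage.Client(project='<project-id>')
bucket = client.bucket('<bucketname>')
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    'role': 'roles/storage.admin',  # the "Storage Admin" role
    'members': {'serviceAccount:<service-account-email>'},
})
bucket.set_iam_policy(policy)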
The final step is to create a key for the service account. So return to the service account you just created, navigate to the Keys tab, click Add Key, and choose JSON as the key type. This will download the key file to your computer.
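I believe the key can also be created through the IAM API instead of the console; the returned privateKeyData field should be the base64-encoded JSON key file. A sketch with a <service-account-email> placeholder:

import base64
import googleapiclient.discovery

iam = googleapiclient.discovery.build('iam', 'v1')
key = iam.projects().serviceAccounts().keys().create(
    name='projects/-/serviceAccounts/<service-account-email>', body={},
).execute()
# privateKeyData holds the base64-encoded contents of the key file
with open('key.json', 'wb') as f:
    f.write(base64.b64decode(key['privateKeyData']))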
You can upload the key via drag and drop to your JupyterLab interface and then use it in your notebooks.
I am not sure whether this is a secure best practice, but since the key only affects the data within this bucket, I will not worry too much about it. If you know a better way to do this, please let me know.
import json
import gcsfs
with open('/home/jovyan/keys/<keyname>.json') as token_file:
    token = json.load(token_file)
fs = gcsfs.GCSFileSystem(token=token)
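If I read the gcsfs documentation correctly, you can also skip the json step and hand the path of the key file directly to the token argument:

fs = gcsfs.GCSFileSystem(token='/home/jovyan/keys/<keyname>.json')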
To connect anonymously instead (for public buckets):
fs = gcsfs.GCSFileSystem(anon=True)
AWS has its own variation for this.
import s3fs
fs = s3fs.S3FileSystem(key='<aws-access-key-id>', secret='<aws-secret-access-key>')
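s3fs also supports anonymous access to public buckets in the same way:

fs = s3fs.S3FileSystem(anon=True)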
This should give you an idea of how to read from and write to the bucket using xarray. Zarr is the preferred format when working in the cloud.
import xarray as xr
import numpy as np
from tempfile import NamedTemporaryFile
da = xr.DataArray(np.random.rand(4)).to_dataset(name='data')
# Write/Read Zarr
mapper = fs.get_mapper('<bucketname>/something.zarr')
da.to_zarr(mapper, mode='w')
da_zarr = xr.open_zarr(mapper)
# Write/Read other data objects
f = NamedTemporaryFile()
da.to_netcdf(f.name) # this could be any object, this example uses netCDF
fs.put(f.name, '<bucketname>/something.nc') # same as fs.upload(...)
open_f = fs.open('<bucketname>/something.nc')
ds = xr.open_dataset(open_f)
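To double-check that both objects actually arrived in the bucket (and to clean up after this test), the usual fsspec-style methods on fs should work:

fs.ls('<bucketname>')                                 # list the bucket contents
fs.rm('<bucketname>/something.nc')                    # remove the netCDF file
fs.rm('<bucketname>/something.zarr', recursive=True)  # a zarr store is a directory of objects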