Skip to content

Opening s3 file with rasterio 1.4 immediately loads the data to the machine #824

@phofl

Description

@phofl

Code Sample, a copy-pastable example if possible

A "Minimal, Complete and Verifiable Example" will make it much easier for maintainers to help you:
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

We used to open the file connection to the file stored on s3 and then passing it to open_rasterio. A change in 1.4 now immediately loads all the data to the local machine and then puts it completely into the dask graph, making it explode in size

import s3fs
import rioxarray

s3 = s3fs.S3FileSystem()
counties = rioxarray.open_rasterio(
    s3.open("s3://nwm-250m-us-counties/Counties_on_250m_grid.tif"),
    chunks="auto",
)

This now hangs and downloads the whole 1GB to my local machine, before it completely, i.e. it doesn't keep things lazy.

See rasterio/rasterio#3232 for context

Problem description

This behavior is sub-optimal, it slows us down on the client and then create a huge dask graph. The new recommended way of dealing with this is as follows:

import s3fs
import rioxarray
import fsspec
import rasterio

s3 = s3fs.S3FileSystem()
counties = rioxarray.open_rasterio(
    "s3://nwm-250m-us-counties/Counties_on_250m_grid.tif",
    chunks="auto",
    opener=fsspec.open
)

Expected Output

Ideally, the opener would be set automatically whenever fsspec is installed and potentially raise a warning if rasterio >=1.4 and a remote /local file is passed that runs into this condition so that users now what to do

Environment Information

rioxarray (0.17.1.dev0) deps:
rasterio: 1.4.2
xarray: 2024.10.0
GDAL: 3.9.3
GEOS: 0.0.0
PROJ: 9.4.1
PROJ DATA: /Users/patrick/mambaforge/envs/rioxarray/share/proj
GDAL DATA: /Users/patrick/mambaforge/envs/rioxarray/lib/python3.12/site-packages/rasterio/gdal_data

Other python deps:
scipy: 1.14.1
pyproj: 3.7.0

System:
python: 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 15:57:01) [Clang 17.0.6 ]
executable: /Users/patrick/mambaforge/envs/rioxarray/bin/python
machine: macOS-14.3.1-arm64-arm-64bit

Installation method

  • conda, pypi, from source, etc...

Conda environment information (if you installed with conda):


Environment (conda list):
$ conda list | grep -E "rasterio|xarray|gdal"


Details about conda and system ( conda info ):
$ conda info

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionFurther information is requested

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions