Read a single file from an archive #271

Open
@PietrH

Bart sent us a message with an example in which he was able to read a single events.csv from a 10 GB archive very quickly.

Hi Pieter & Peter, I thought this might interest you. I just tried reading partial files a bit more and used a camera trap Zenodo repository by Julian as an example. There I can read the events.csv from a 10 GB archive within a second. That is kind of cool for applications where you are only interested in a subset of the data (say, all tiger images for camera traps, or only summer radar data).

system.time(a <- vroom::vroom(
  archive::archive_read(
    "https://zenodo.org/records/10671148/files/pilot1.zip?download=1",
    file = "pilot1/events.csv"
  )
))
#> Rows: 30506 Columns: 5
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (3): eventID, deploymentID, mediaID
#> dttm (2): eventStart, eventEnd
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#>    user  system elapsed
#>   0.412   0.046   0.984
tibble::glimpse(a)
#> Rows: 30,506
#> Columns: 5
#> $ eventID      <chr> "42d09be5-1b91-49e1-a154-864eb557c0a4", "42d09be5-1b91-49…
#> $ deploymentID <chr> "AWD_2_13082021_pilot bc34dfce-8ee3-4e97-870e-d53079b80ce…
#> $ eventStart   <dttm> 2022-08-20 04:27:38, 2022-08-20 04:27:38, 2022-08-20 04:…
#> $ eventEnd     <dttm> 2022-08-20 04:27:44, 2022-08-20 04:27:44, 2022-08-20 04:…
#> $ mediaID      <chr> "d919bdd2-35e0-4219-b74d-45f2201d5ba1", "77522548-6728-45…

However, two other files took longer:

Maybe I have been a bit premature, as how quick the read is seems to depend on the position in the file. Two other CSV files (media and observations) that are about equally sized take much longer, while locally they are about as quick...

system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "https://zenodo.org/records/10671148/files/pilot2.zip?download=1",
    file = "pilot2/events.csv"
  )
))
#>    user  system elapsed 
#>   0.334   0.017   0.712
system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "https://zenodo.org/records/10671148/files/pilot2.zip?download=1",
    file = "pilot2/observations.csv"
  )
))
#>    user  system elapsed 
#>   1.416   1.319  30.998
system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "https://zenodo.org/records/10671148/files/pilot2.zip?download=1",
    file = "pilot2/media.csv"
  )
))
#>    user  system elapsed 
#>   1.458   1.406  32.188


system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "~/Downloads/pilot2.zip",
    file = "pilot2/events.csv"
  )
))
#>    user  system elapsed 
#>   0.012   0.002   0.014
system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "~/Downloads/pilot2.zip",
    file = "pilot2/observations.csv"
  )
))
#>    user  system elapsed 
#>   0.044   0.013   0.058
system.time(a <- vroom::vroom(show_col_types = FALSE,
  archive::archive_read(
    "~/Downloads/pilot2.zip",
    file = "pilot2/media.csv"
  )
))
#>    user  system elapsed 
#>   0.036   0.017   0.052
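
One way to check whether a member's position in the archive matters is to list where each entry starts. The sketch below uses Python's standard `zipfile` module on a small in-memory archive and reports each entry's `header_offset` as recorded in the ZIP central directory. The member names and contents are made up for illustration and are not taken from pilot2.zip. Whether this offset actually explains the slow remote reads above is an assumption that has not been confirmed in this thread.

```python
import io
import zipfile

# Build a small in-memory archive; names and contents are hypothetical.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("demo/events.csv", "eventID,deploymentID\n" * 100)
    zf.writestr("demo/media.csv", "mediaID,filePath\n" * 100)
    zf.writestr("demo/observations.csv", "observationID,eventID\n" * 100)


def member_offsets(archive):
    """Return (name, header_offset, compress_size) per entry, sorted by offset."""
    with zipfile.ZipFile(archive) as zf:
        rows = [(i.filename, i.header_offset, i.compress_size) for i in zf.infolist()]
    return sorted(rows, key=lambda row: row[1])


offsets = member_offsets(buf)
for name, offset, size in offsets:
    print(f"{offset:>8}  {size:>8}  {name}")
```

Running the same function on a local copy of the archive (for example `~/Downloads/pilot2.zip`) would show whether observations.csv and media.csv sit much further into the file than events.csv.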

In Python he had better luck:

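import unzip_http
import pandas  # imported in the original message; not used below

# Open the remote archive and list its members
rzf = unzip_http.RemoteZipFile("https://zenodo.org/records/10671148/files/pilot2.zip")
rzf.namelist()
# Open a single member and print its lines
binfp = rzf.open('pilot2/observations.csv')
print(binfp.readlines())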
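
As far as I understand, tools such as `unzip_http` rely on random access (HTTP range requests) rather than streaming the archive from the start. The sketch below imitates that idea locally with the standard library only: `CountingReader` is a hypothetical stand-in for a remote file that tallies how many bytes are actually read, and the archive contents are made up. It is not how `unzip_http` or `archive::archive_read()` are implemented, just an illustration of why a seekable reader needs only the central directory plus the one member.

```python
import io
import zipfile


class CountingReader(io.BytesIO):
    """In-memory stand-in for a remote file that tallies bytes actually read."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


# Build an archive with three equally sized, uncompressed members (hypothetical data).
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w", compression=zipfile.ZIP_STORED) as zf:
    for name in ("events.csv", "media.csv", "observations.csv"):
        zf.writestr(f"demo/{name}", b"x" * 100_000)
archive_bytes = raw.getvalue()

# Read only the last member through the counting reader.
remote = CountingReader(archive_bytes)
with zipfile.ZipFile(remote) as zf:
    data = zf.read("demo/observations.csv")

print("archive size:", len(archive_bytes))
print("bytes read:  ", remote.bytes_read)
```

With seeking available, the number of bytes read stays close to the size of the requested member, regardless of where it sits in the archive.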
