
Conversation

@valeriupredoi (Collaborator) commented Sep 17, 2025

Description

Latest draft PDF: https://github.com/NCAS-CMS/pyfive/blob/joss_paper/paper.pdf

Summary

This PR contains the JOSS paper we are writing. It is DNM (do not merge), since we don't want the paper in our code base, but keeping it as a PR makes it easier to contribute to the paper via code review, and it shows the JOSS folks that we are writing the paper off an up-to-date main branch that is tested and has coverage measured.

How to work on it

  • check out this joss_paper branch as normal, make changes, commit and push
  • on push, the PDF compiler GitHub Action runs; it builds the paper PDF and commits it to this branch, and also creates a paper.zip artifact (containing only the paper PDF) that can be downloaded
  • the latest draft PDF can be viewed at https://github.com/NCAS-CMS/pyfive/blob/joss_paper/paper.pdf
  • the latest draft paper.zip artifact is available on the Action page, e.g. https://github.com/NCAS-CMS/pyfive/actions/runs/17856845646
  • IMPORTANT: since the bot pushes the latest draft here, you will always need to pull the latest branch locally, i.e. run git pull origin joss_paper, to bring your HEAD to the latest state; if you don't and get stuck on git conflicts, allow a merge with no rebase, i.e. run git config pull.rebase false, then pull again

Closes #94

Before you get started

Checklist

  • This pull request has a descriptive title and labels
  • This pull request has a minimal description (most was discussed in the issue, but a two-liner description is still desirable)
  • [ ] Unit tests have been added (if codecov test fails)
  • [ ] Any changed dependencies have been added or removed correctly (if need be)
  • [ ] If you are working on the documentation, please ensure the current build passes
  • All tests pass

@codecov bot commented Sep 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.58%. Comparing base (85ba6aa) to head (dbc24f2).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #95   +/-   ##
=======================================
  Coverage   72.58%   72.58%           
=======================================
  Files          11       11           
  Lines        2499     2499           
  Branches      379      379           
=======================================
  Hits         1814     1814           
  Misses        583      583           
  Partials      102      102           

☔ View full report in Codecov by Sentry.

@valeriupredoi marked this pull request as draft September 17, 2025 14:44
@valeriupredoi (Collaborator, Author) commented

@bnlawrence had a very good point about running the JOSS converter on a draft, so I asked the JOSS folks at openjournals/joss#1456


@valeriupredoi (Collaborator, Author) commented

Turns out there are multiple ways, described in the JOSS docs https://joss.readthedocs.io/en/latest/paper.html#checking-that-your-paper-compiles (duh!) - and the GHA one is perfect for us here; I'll implement it tomorrow 🥳

paper.md Outdated
HDF5 is probably the most important data format in physical science, used across the piste. It is particularly important in environmental science, particularly given the fact that NetCDF4 is HDF5 under the hood. From satellite missions, to climate models and radar systems, the default binary format has been HDF5 for decades. While newer formats are starting to get mindshare, there are petabytes, if not exabytes of existing HDF5, and there are still many good use-cases for creating new data in HDF5. However, despite the history, there are few libraries for reading HDF5 file data that do not depend on the official HDF5 library maintained by the HDFGroup, and in particular, there are none that can be used with Python.
While the HDF5 c-library is reliable and performant, and battle-tested over decades, there are some caveats to depending upon it: Firstly, it is not thread-safe, and secondly, the code is large and complex, and should anything happen to the financial stability of The HDF5group, it is not obvious the C-code could be maintained. From a long-term curation perspective this last constraint is a concern.

The original implementation of pyfive (by JH and BM), which included all the low-level functionality to deal with the internals of an HDF5 file was developed with POSIX access in mind. The recent upgrades were developed with the use-case of performant remote access to curated data as the primary motivation, but with additional motivations of having a lightweight HDF5 reader capable of deploying in resource or operating-system constrained environments (such as mobile), and one that could be maintained long-term as a reference reader for curation purposes. The lightweight deployment consequences of a pure-python HDF5 reader need no further introduction, but as additional motivation we now expand on the issues around remote access and curation.
@bmaranville (Collaborator) commented
JH is the original implementer - this paragraph shouldn't include me (BM)

@valeriupredoi (Collaborator, Author) replied
many thanks @bmaranville 🍺

@davidhassell (Collaborator) left a comment

Great - a first pass from me - mainly pedantry :)


# Summary

Pyfive (<https://pyfive.readthedocs.io/en/latest/>) is an open-source thread-safe pure Python package for reading data stored in HDF5. While it is not a complete implementation of all the specifications and capabilities of HDF5, it includes all the core functionality necessary to read gridded datasets, whether stored contiguously or with chunks, and to carry out the necessary decompression for the standard options (INCLUDE OPTIONS).
Suggested change
Pyfive (<https://pyfive.readthedocs.io/en/latest/>) is an open-source thread-safe pure Python package for reading data stored in HDF5. While it is not a complete implementation of all the specifications and capabilities of HDF5, it includes all the core functionality necessary to read gridded datasets, whether stored contiguously or with chunks, and to carry out the necessary decompression for the standard options (INCLUDE OPTIONS).
Pyfive (<https://pyfive.readthedocs.io>) is an open-source thread-safe pure Python package for reading data stored in HDF5. While it is not a complete implementation of all the specifications and capabilities of HDF5, it includes all the core functionality necessary to read gridded datasets, whether stored contiguously or with chunks, and to carry out the necessary decompression for the standard options (INCLUDE OPTIONS).
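For orientation, a minimal sketch of the h5py-style read path the Summary describes; this is not code from the paper, and the file and dataset names ("sample.h5", "temperature") are hypothetical placeholders:

```python
# Minimal sketch of pyfive's h5py-style read path (names are hypothetical).
import pyfive

f = pyfive.File('sample.h5')   # pure-Python open; no HDF5 C library involved
ds = f['temperature']          # groups and datasets are addressed as in h5py
print(ds.shape, ds.dtype)      # metadata only; no array data read yet
block = ds[0:10]               # slicing reads (and, if needed, decompresses)
                               # just the chunks covering the request
```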


# Statement of need

HDF5 is probably the most important data format in physical science, used across the piste. It is particularly important in environmental science, particularly given the fact that NetCDF4 is HDF5 under the hood.
Suggested change
HDF5 is probably the most important data format in physical science, used across the piste. It is particularly important in environmental science, particularly given the fact that NetCDF4 is HDF5 under the hood.
HDF5 is probably the most important data format in physical science, used across the piste. It is particularly important in environmental science, particularly given the fact that netCDF4 is HDF5 under the hood.

HDF5 is probably the most important data format in physical science, used across the piste. It is particularly important in environmental science, particularly given the fact that NetCDF4 is HDF5 under the hood.
From satellite missions, to climate models and radar systems, the default binary format has been HDF5 for decades. While newer formats are starting to get mindshare, there are petabytes, if not exabytes of existing HDF5, and there are still many good use-cases for creating new data in HDF5.
However, despite the history, there are few libraries for reading HDF5 file data that do not depend on the official HDF5 library maintained by the HDFGroup, and in particular, apart from pyfive, in Python there are none that cover the needs of environmental science.
While the HDF5 c-library is reliable and performant, and battle-tested over decades, there are some caveats to depending upon it: Firstly, it is not thread-safe, secondly, the code is large and complex, and should anything happen to the financial stability of The HDF5group, it is not obvious the C-code could be maintained. Finally, the code complexity also meant that it was not suitable for developing bespoke code for data recovery in the case of partially corrupt data. From a long-term curation perspective both of these last two constraints are a concern.
Suggested change
While the HDF5 c-library is reliable and performant, and battle-tested over decades, there are some caveats to depending upon it: Firstly, it is not thread-safe, secondly, the code is large and complex, and should anything happen to the financial stability of The HDF5group, it is not obvious the C-code could be maintained. Finally, the code complexity also meant that it was not suitable for developing bespoke code for data recovery in the case of partially corrupt data. From a long-term curation perspective both of these last two constraints are a concern.
While the HDF5 c-library is reliable and performant, and battle-tested over decades, there are some caveats to depending upon it: Firstly, it is not thread-safe, secondly, the code is large and complex, and should anything happen to the financial stability of The HDF5 Group, it is not obvious the C-code could be maintained. Finally, the code complexity also meant that it was not suitable for developing bespoke code for data recovery in the case of partially corrupt data. From a long-term curation perspective both of these last two constraints are a concern.
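Since this paragraph contrasts pyfive's thread safety with the C library's lack of it, a hedged sketch of the kind of concurrent read that claim enables (again not from the paper; file and dataset names are hypothetical):

```python
# Sketch of concurrent reads, which pyfive's thread safety is meant to allow.
# The file name and dataset name are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor
import pyfive

f = pyfive.File('sample.h5')
ds = f['precip']                       # hypothetical dataset

def read_slice(i):
    return ds[i]                       # each task reads one hyperslab

with ThreadPoolExecutor(max_workers=4) as pool:
    slices = list(pool.map(read_slice, range(8)))
```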


HDF5 is probably the most important data format in physical science, used across the piste. It is particularly important in environmental science, particularly given the fact that NetCDF4 is HDF5 under the hood.
From satellite missions, to climate models and radar systems, the default binary format has been HDF5 for decades. While newer formats are starting to get mindshare, there are petabytes, if not exabytes of existing HDF5, and there are still many good use-cases for creating new data in HDF5.
However, despite the history, there are few libraries for reading HDF5 file data that do not depend on the official HDF5 library maintained by the HDFGroup, and in particular, apart from pyfive, in Python there are none that cover the needs of environmental science.
Suggested change
However, despite the history, there are few libraries for reading HDF5 file data that do not depend on the official HDF5 library maintained by the HDFGroup, and in particular, apart from pyfive, in Python there are none that cover the needs of environmental science.
However, despite the history, there are few libraries for reading HDF5 file data that do not depend on the official HDF5 library maintained by the HDF Group <CAN/SHOULD WE NAME ONES WE KNOW OF?>, and in particular, apart from pyfive, in Python there are none that cover the needs of environmental science.

However, in practice, for many use cases, b-tree extraction with pyfive will be comparable in performance to obtaining a kerchunk index, and completely opaque to the user.

The issues of the dependency on a complex code maintained by one private company in the context of maintaining data access (over decades, and potentially centuries), can only be mitigated by ensuring that the data format is well documented, that data writers use only the documented features, and that public code exists which can be relatively easily maintained.
The HDF5group have provided good documentation for the core features of HDF5 which include all those of interest to the weather and climate community who motivated this reboot of pyfive, and while there is a community of developers beyond the HDF5 group (including some at the publicly funded Unidata institution), recent events suggest that given most of those developers and their existing funding are US based, some spreading of risk would be desirable.
Suggested change
The HDF5group have provided good documentation for the core features of HDF5 which include all those of interest to the weather and climate community who motivated this reboot of pyfive, and while there is a community of developers beyond the HDF5 group (including some at the publicly funded Unidata institution), recent events suggest that given most of those developers and their existing funding are US based, some spreading of risk would be desirable.
The HDF5 Group have provided good documentation for the core features of HDF5 which include all those of interest to the weather and climate community who motivated this reboot of pyfive, and while there is a community of developers beyond the HDF5 Group (including some at the publicly funded Unidata institution), recent events suggest that given most of those developers and their existing funding are US based, some spreading of risk would be desirable.


Pyfive (<https://pyfive.readthedocs.io/en/latest/>) is an open-source thread-safe pure Python package for reading data stored in HDF5. While it is not a complete implementation of all the specifications and capabilities of HDF5, it includes all the core functionality necessary to read gridded datasets, whether stored contiguously or with chunks, and to carry out the necessary decompression for the standard options (INCLUDE OPTIONS).
It is designed to address the current challenges of the standard HDF5 library in accessing data remotely, where data is transferred over the Internet from a storage server to the computation platform. Furthermore, it aims to demonstrate the untapped capabilities of the HDF5 format in the context of remote data access, countering the criticism that it is unsuitable for cloud storage.
All data access is fully lazy, the data is only read from storage when the numpy data arrays are manipulated. Originally developed some years ago, the package has recently been upgraded to support
Suggested change
All data access is fully lazy, the data is only read from storage when the numpy data arrays are manipulated. Originally developed some years ago, the package has recently been upgraded to support
All data access is fully lazy, the data is only read from storage when the data arrays are manipulated in memory with numpy (REF). Originally developed some years ago, the package has recently been upgraded to support
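The lazy, remote-access use case described here pairs naturally with fsspec-style file-like objects; a hedged sketch, assuming pyfive.File accepts an open file-like object (the S3 URL and variable name are hypothetical):

```python
# Hedged sketch of lazy remote access: open a remote object with fsspec and
# hand the file-like object to pyfive. Assumes pyfive.File accepts file-like
# objects; the S3 URL and variable name are hypothetical.
import fsspec
import pyfive

with fsspec.open('s3://some-bucket/sample.nc', 'rb', anon=True) as fobj:
    f = pyfive.File(fobj)
    ds = f['t2m']                  # lazy: only metadata has been read
    subset = ds[0, 100:200, :]     # now only the byte ranges for the
                                   # chunks covering this slice are fetched
```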

# Statement of need

HDF5 is probably the most important data format in physical science, used across the piste. It is particularly important in environmental science, particularly given the fact that NetCDF4 is HDF5 under the hood.
From satellite missions, to climate models and radar systems, the default binary format has been HDF5 for decades. While newer formats are starting to get mindshare, there are petabytes, if not exabytes of existing HDF5, and there are still many good use-cases for creating new data in HDF5.
Suggested change
From satellite missions, to climate models and radar systems, the default binary format has been HDF5 for decades. While newer formats are starting to get mindshare, there are petabytes, if not exabytes of existing HDF5, and there are still many good use-cases for creating new data in HDF5.
From satellite missions, to climate models and radar systems, the default binary format has been HDF5 for decades. While newer formats are starting to get mindshare, there are petabytes, if not exabytes of existing HDF5 datasets, and there are still many good use-cases for creating new data in HDF5.

This trend has accelerated with the increasing adoption of cloud platforms for storing environmental and climate data, which provide more scalable storage capabilities than those available to many research centers that produce the original datasets.
The combination of remote data access and cloud storage has opened a new paradigm for data access; however, the technological stack must be carefully analyzed and evaluated to fully assess and exploit the performance offered by these platforms.

In this context, HDF5 has faced challenges in providing users with the performance and capabilities required for accessing data remotely in the cloud, showing relatively slow performance when accessed from cloud storage in a remote data access setting. However, the specific aspects of the HDF5 library responsible for this performance limitation have not been thoroughly investigated.
A collaborator commented:
Is this entirely fair? Non-parallel access aside, HDF5 can access well-written (in terms of chunks and internal metadata structures) HDF5 files very nicely. I think pyfive also struggles with not-well-written HDF5.

The combination of remote data access and cloud storage has opened a new paradigm for data access; however, the technological stack must be carefully analyzed and evaluated to fully assess and exploit the performance offered by these platforms.

In this context, HDF5 has faced challenges in providing users with the performance and capabilities required for accessing data remotely in the cloud, showing relatively slow performance when accessed from cloud storage in a remote data access setting. However, the specific aspects of the HDF5 library responsible for this performance limitation have not been thoroughly investigated.
Instead, the perceived inadequacy of HDF5 has often been superficially justified based on empirical observations of performance when accessing test files.
A collaborator commented:
... when accessing poorly written test files.

@valeriupredoi (Collaborator, Author) commented

I have now finally added the examples in 602afaa, but I was just chatting to @bnlawrence about maybe adding stuff about the OrthogonalIndexer, so this is a placeholder for it 🍺

@bnlawrence mentioned this pull request on Oct 7, 2025
