[DNM][JOSS] JOSS paper #95
Draft
valeriupredoi
wants to merge
38
commits into
main
from
joss_paper
+235
−1
Changes from 19 commits
Commits
b3b291e
add JOSS paper example to be edited in place
valeriupredoi 23115fb
started editing manuscript
valeriupredoi dbc24f2
Merge branch 'main' into joss_paper
valeriupredoi dc000f9
commit draft pdf paper to here
valeriupredoi 4fc1f15
add first dummy pdf draft
valeriupredoi a67308d
(auto) Paper PDF Draft
valeriupredoi 8183708
add Zeki as author
valeriupredoi e9276d2
Merge branch 'joss_paper' of https://github.com/NCAS-CMS/pyfive into …
valeriupredoi da8075b
(auto) Paper PDF Draft
valeriupredoi bb5b7d2
example trigger
valeriupredoi cb82ba5
Merge branch 'joss_paper' of https://github.com/NCAS-CMS/pyfive into …
valeriupredoi 3bc9e52
(auto) Paper PDF Draft
valeriupredoi bfead76
remove bit I added for demo
valeriupredoi d20190e
(auto) Paper PDF Draft
valeriupredoi 474bf86
Starting to add authors, and fix some of the lazy wording of the firs…
ae6710b
(auto) Paper PDF Draft
bnlawrence e88b0a9
it's not all about environmental science
6980e5d
Merge remote-tracking branch 'refs/remotes/origin/joss_paper' into jo…
0c4959a
(auto) Paper PDF Draft
bnlawrence 56c597b
Merge branch 'main' into joss_paper
valeriupredoi c698ba7
(auto) Paper PDF Draft
valeriupredoi a9cc682
Predoi correct ORCID number
valeriupredoi 5bf3868
(auto) Paper PDF Draft
valeriupredoi 7abfa3c
Added Wout De Nolf's details and use case, and addressed Brian's requ…
821e288
Merge remote-tracking branch 'refs/remotes/origin/joss_paper' into jo…
f07f89b
(auto) Paper PDF Draft
bnlawrence a405059
Merge branch 'main' into joss_paper
valeriupredoi aebf949
Merge branch 'joss_paper' of https://github.com/NCAS-CMS/pyfive into …
valeriupredoi 861a68d
(auto) Paper PDF Draft
valeriupredoi 59f33f3
fix affiliation and add orcid for Kai
kmuehlbauer 69a4081
(auto) Paper PDF Draft
kmuehlbauer 9e08340
added couple of paragraphs about remote data access
zequihg50 bee9b3d
Merge pull request #106 from zequihg50/joss_paper
valeriupredoi 6ab0fd8
(auto) Paper PDF Draft
valeriupredoi cd1788d
Merge branch 'main' into joss_paper
valeriupredoi 9d185f8
(auto) Paper PDF Draft
valeriupredoi 602afaa
add use examples
valeriupredoi 383fb06
(auto) Paper PDF Draft
valeriupredoi
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,59 @@ | ||
| @article{Pearson:2017, | ||
| url = {http://adsabs.harvard.edu/abs/2017arXiv170304627P}, | ||
| Archiveprefix = {arXiv}, | ||
| Author = {{Pearson}, S. and {Price-Whelan}, A.~M. and {Johnston}, K.~V.}, | ||
| Eprint = {1703.04627}, | ||
| Journal = {ArXiv e-prints}, | ||
| Keywords = {Astrophysics - Astrophysics of Galaxies}, | ||
| Month = mar, | ||
| Title = {{Gaps in Globular Cluster Streams: Pal 5 and the Galactic Bar}}, | ||
| Year = 2017 | ||
| } | ||
|
|
||
| @book{Binney:2008, | ||
| url = {http://adsabs.harvard.edu/abs/2008gady.book.....B}, | ||
| Author = {{Binney}, J. and {Tremaine}, S.}, | ||
| Booktitle = {Galactic Dynamics: Second Edition, by James Binney and Scott Tremaine.~ISBN 978-0-691-13026-2 (HB).~Published by Princeton University Press, Princeton, NJ USA, 2008.}, | ||
| Publisher = {Princeton University Press}, | ||
| Title = {{Galactic Dynamics: Second Edition}}, | ||
| Year = 2008 | ||
| } | ||
|
|
||
| @article{gaia, | ||
| author = {{Gaia Collaboration}}, | ||
| title = "{The Gaia mission}", | ||
| journal = {Astronomy and Astrophysics}, | ||
| archivePrefix = "arXiv", | ||
| eprint = {1609.04153}, | ||
| primaryClass = "astro-ph.IM", | ||
| keywords = {space vehicles: instruments, Galaxy: structure, astrometry, parallaxes, proper motions, telescopes}, | ||
| year = 2016, | ||
| month = nov, | ||
| volume = 595, | ||
| doi = {10.1051/0004-6361/201629272}, | ||
| url = {http://adsabs.harvard.edu/abs/2016A%26A...595A...1G}, | ||
| } | ||
|
|
||
| @article{astropy, | ||
| author = {{Astropy Collaboration}}, | ||
| title = "{Astropy: A community Python package for astronomy}", | ||
| journal = {Astronomy and Astrophysics}, | ||
| archivePrefix = "arXiv", | ||
| eprint = {1307.6212}, | ||
| primaryClass = "astro-ph.IM", | ||
| keywords = {methods: data analysis, methods: miscellaneous, virtual observatory tools}, | ||
| year = 2013, | ||
| month = oct, | ||
| volume = 558, | ||
| doi = {10.1051/0004-6361/201322068}, | ||
| url = {http://adsabs.harvard.edu/abs/2013A%26A...558A..33A} | ||
| } | ||
|
|
||
| @misc{fidgit, | ||
| author = {A. M. Smith and K. Thaney and M. Hahnel}, | ||
| title = {Fidgit: An ungodly union of GitHub and Figshare}, | ||
| year = {2020}, | ||
| publisher = {GitHub}, | ||
| journal = {GitHub repository}, | ||
| url = {https://github.com/arfon/fidgit} | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,112 @@ | ||
| --- | ||
| title: 'Pyfive: A pure-python HDF5 reader' | ||
| tags: | ||
| - Python | ||
| - Atmospheric Science | ||
| - Physics | ||
| - Climate Model Data | ||
| - Engineering | ||
| authors: | ||
| - name: Bryan Lawrence | ||
| orcid: 0000-0001-9262-7860 | ||
| affiliation: 1 # (Multiple affiliations must be quoted) | ||
| - name: Ezequiel Cimadevilla | ||
| affiliation: 2 | ||
| - name: David Hassell | ||
| orcid: 0000-0002-5312-4950 | ||
| affiliation: 1 | ||
| - name: Jonathan Helmus | ||
| affiliation: 3 | ||
| - name: Brian Maranville | ||
| orcid: 0000-0002-6105-8789 | ||
| affiliation: 4 | ||
| - name: Kai Mühlbauer | ||
| affiliation: 5 | ||
| - name: Valeriu Predoi | ||
| orcid: 0000-0002-9729-657 | ||
| affiliation: 1 | ||
| affiliations: | ||
| - name: NCAS-CMS, Meteorology Department, University of Reading, Reading, UK | ||
| index: 1 | ||
| ror: 00hx57361 | ||
| - name: Institution Name, Spain | ||
| index: 2 | ||
| - name: TBD | ||
| index: 3 | ||
| - name: NIST Center for Neutron Research | ||
| index: 4 | ||
| - name: Institute for Geophysics, University of Bonn | ||
| index: 5 | ||
| date: 21 September 2025 | ||
| bibliography: paper.bib | ||
|
|
||
| --- | ||
|
|
||
| # Summary | ||
|
|
||
| Pyfive (<https://pyfive.readthedocs.io/en/latest/>) is an open-source, thread-safe, pure-Python package for reading data stored in HDF5. While it is not a complete implementation of all the capabilities of HDF5, it includes all the core functionality necessary to read gridded datasets, whether stored contiguously or in chunks, and to carry out the necessary decompression for the standard options (INCLUDE OPTIONS). All data access is fully lazy: data is only read from storage when the numpy data arrays are manipulated. Originally developed some years ago, the package has recently been upgraded to support | ||
| lazy access, and to add missing features necessary for handling all the environmental data known to the authors. It is now a realistic option for production data access in environmental science and more widely. The API is based on that of h5py (which is a Python wrapper around the HDF5 C library, and hence is not thread-safe), with some API extensions to help optimise remote access. | ||
|
|
||
| # Statement of need | ||
|
|
||
| HDF5 is probably the most important data format in physical science, used across the piste. It is especially important in environmental science, not least because NetCDF4 is HDF5 under the hood. From satellite missions to climate models and radar systems, the default binary format has been HDF5 for decades. While newer formats are starting to get mindshare, there are petabytes, if not exabytes, of existing HDF5 data, and there are still many good use-cases for creating new data in HDF5. However, despite this history, there are few libraries for reading HDF5 file data that do not depend on the official HDF5 library maintained by the HDF Group, and in particular, there are none that can be used with Python. | ||
| While the HDF5 C library is reliable and performant, and battle-tested over decades, there are some caveats to depending upon it. First, it is not thread-safe. Second, the code is large and complex, and should anything happen to the financial stability of the HDF Group, it is not obvious that the C code could be maintained. From a long-term curation perspective this last constraint is a concern. | ||
|
|
||
| The original implementation of pyfive (by JH and BM), which included all the low-level functionality to deal with the internals of an HDF5 file, was developed with POSIX access in mind. The recent upgrades were developed with performant remote access to curated data as the primary motivation, but with the additional motivations of having a lightweight HDF5 reader that can be deployed in resource-constrained or operating-system-constrained environments (such as mobile), and one that could be maintained long-term as a reference reader for curation purposes. The lightweight deployment consequences of a pure-Python HDF5 reader need no further introduction, but as additional motivation we now expand on the issues around remote access and curation. | ||
|
|
||
| Taking remote access first, one of the reasons for the rapid adoption of pure-Python tools like xarray with zarr has been the ability to use thread-safe parallelism with dask. Any Python solution based on the HDF5 C library could not meet this requirement, which led to the development of kerchunk-mediated direct access to chunked HDF5 data (https://fsspec.github.io/kerchunk/). However, in practice, using kerchunk requires the data provider to generate kerchunk indices to support remote users, and it leads to problems keeping indices synchronised with changing datasets. Pyfive was developed to provide the benefits of kerchunk without the need for provider support. Because pyfive can access and cache (in the client) the b-tree (index) on a variable-by-variable basis, most of the benefits of kerchunk are gained without any of the constraints. The one advantage left to kerchunk is that the kerchunk index is always a contiguous object accessible with one get transaction; this is not necessarily the case with the b-tree, unless the source data has been repacked to ensure contiguous metadata using a tool like h5repack. However, in practice, for many use cases, b-tree extraction with pyfive will be comparable in performance to obtaining a kerchunk index, and completely invisible to the user. | ||
|
|
||
| The risk of depending on a complex code maintained by one private company, when data access must be maintained over decades and potentially centuries, can only be mitigated by ensuring that the data format is well documented, that data writers use only the documented features, and that public code exists which can be relatively easily maintained. The HDF Group have provided good documentation for the core features of HDF5, which include all those of interest to the weather and climate community who motivated this reboot of pyfive. While there is a community of developers beyond the HDF Group (including some at the publicly funded Unidata institution), recent events suggest that, given most of those developers and their existing funding are US based, some spreading of risk would be desirable. To that end, a pure-Python code, which is relatively small and maintained by an international constituency, alongside the existing C code, provides some assurance that the community can maintain HDF5 access for the foreseeable future. | ||
|
|
||
| # Examples | ||
|
|
||
| A notable feature of the recent pyfive upgrade is that it was carried out with thread-safety and remote access using fsspec (filesystem-spec.readthedocs.io) in mind. We provide two examples of using pyfive to access remote data: one in S3, and one behind a modern HTTP web server. | ||
|
|
||
| [email protected] When we have this is markdown, can you please put two python examples in here as above! | ||
|
|
||
| # Mathematics | ||
|
|
||
| Single dollars ($) are required for inline mathematics e.g. $f(x) = e^{\pi/x}$ | ||
|
|
||
| Double dollars make self-standing equations: | ||
|
|
||
| $$\Theta(x) = \left\{\begin{array}{l} | ||
| 0\textrm{ if } x < 0\cr | ||
| 1\textrm{ else} | ||
| \end{array}\right.$$ | ||
|
|
||
| You can also use plain \LaTeX for equations | ||
| \begin{equation}\label{eq:fourier} | ||
| \hat f(\omega) = \int_{-\infty}^{\infty} f(x) e^{i\omega x} dx | ||
| \end{equation} | ||
| and refer to \autoref{eq:fourier} from text. | ||
|
|
||
| # Citations | ||
|
|
||
| Citations to entries in paper.bib should be in | ||
| [rMarkdown](http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html) | ||
| format. | ||
|
|
||
| If you want to cite a software repository URL (e.g. something on GitHub without a preferred | ||
| citation) then you can do it with the example BibTeX entry below for @fidgit. | ||
|
|
||
| For a quick reference, the following citation commands can be used: | ||
| - `@author:2001` -> "Author et al. (2001)" | ||
| - `[@author:2001]` -> "(Author et al., 2001)" | ||
| - `[@author1:2001; @author2:2001]` -> "(Author1 et al., 2001; Author2 et al., 2002)" | ||
|
|
||
| # Figures | ||
|
|
||
| Figures can be included like this: | ||
|  | ||
| and referenced from text using \autoref{fig:example}. | ||
|
|
||
| Figure sizes can be customized by adding an optional second parameter: | ||
| { width=20% } | ||
|
|
||
| # Acknowledgements | ||
|
|
||
| We acknowledge contributions from Brigitta Sipocz, Syrtis Major, and Semyeong | ||
| Oh, and support from Kathryn Johnston during the genesis of this project. | ||
|
|
||
| # References | ||
Binary file not shown.
JH is the original implementer - this paragraph shouldn't include me (BM)
many thanks @bmaranville 🍺
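A note on the lazy access and client-side index caching described in the Summary and Statement of need of paper.md: the toy sketch below may help readers picture the request pattern. It is not pyfive's implementation or API. The byte layout (a table of 64-bit chunk offsets followed by fixed-size int32 chunks) is invented for illustration and is far simpler than an HDF5 b-tree, and `CountingStore`, `LazyVariable` and `build_toy_file` are hypothetical names. Assuming that layout, nothing is read when the variable is created, the first element access fetches the index once, and later accesses fetch only the chunk they need.

```python
import struct


class CountingStore:
    """Hypothetical stand-in for remote storage that serves byte ranges."""

    def __init__(self, payload: bytes):
        self._payload = payload
        self.requests = 0  # number of range requests served so far

    def get_range(self, offset: int, size: int) -> bytes:
        self.requests += 1
        return self._payload[offset:offset + size]


class LazyVariable:
    """Toy 1-D variable of little-endian int32 values in fixed-size chunks.

    Assumption: the chunk index is a flat table of uint64 byte offsets.
    A real HDF5 file stores this information in a b-tree instead.
    """

    def __init__(self, store, index_offset, n_chunks, chunk_len):
        self._store = store
        self._index_offset = index_offset
        self._n_chunks = n_chunks
        self._chunk_len = chunk_len
        self._index = None  # fetched on first data access, then reused

    def _chunk_index(self):
        # Fetch the index once and cache it in the client.
        if self._index is None:
            raw = self._store.get_range(self._index_offset, 8 * self._n_chunks)
            self._index = struct.unpack(f"<{self._n_chunks}Q", raw)
        return self._index

    def __getitem__(self, i):
        # Locate the chunk holding element i, then read only that chunk.
        chunk, within = divmod(i, self._chunk_len)
        offset = self._chunk_index()[chunk]
        raw = self._store.get_range(offset, 4 * self._chunk_len)
        return struct.unpack(f"<{self._chunk_len}i", raw)[within]


def build_toy_file(values, chunk_len):
    """Pack values as [offset table][chunk 0][chunk 1]...

    len(values) must be a multiple of chunk_len.
    """
    chunks = [values[i:i + chunk_len] for i in range(0, len(values), chunk_len)]
    index_size = 8 * len(chunks)
    offsets = [index_size + 4 * chunk_len * k for k in range(len(chunks))]
    body = struct.pack(f"<{len(chunks)}Q", *offsets)
    for chunk in chunks:
        body += struct.pack(f"<{chunk_len}i", *chunk)
    return body, len(chunks)


payload, n_chunks = build_toy_file(list(range(100, 112)), chunk_len=4)
store = CountingStore(payload)
var = LazyVariable(store, index_offset=0, n_chunks=n_chunks, chunk_len=4)

requests_before = store.requests  # creating the variable reads nothing
first = var[5]    # one index request plus one chunk request
second = var[10]  # index already cached: one chunk request only
print(requests_before, first, second, store.requests)
```

In this toy run the index is requested once however many elements are read afterwards, which is the property the paper attributes to caching the b-tree on a variable-by-variable basis.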