Description
I've been doing a full read-through of the book to try and decide where a new section on "recommendations when using conda" might best fit. I'm wondering if some reorganisation of sections might help users and contributors. I've had a stab at an example draft outline below to start some discussion.
Please note that I'm not wedded to this proposal at all, but I thought I should mention to the team that as the current structure evolves, I'm starting to find it difficult to know where to look for specific things.
Overview - as is, but add a table of contents describing what each section aims to do.
Introduction - new section pulling info from a few existing sections and providing context for what's to come. Very high-level concepts like "there are lots of computation and storage resources in Australia", "for Big Data, data storage and compute must be considered together", "for Big Data it's best to utilise compute close to the data"... Include some of the list of platforms here (https://acdguide.github.io/BigData/platforms/platforms-intro.html), but save the "analysis environments" (e.g. OOD, ARE) for the Computation section below.
Data storage - overview taken from https://acdguide.github.io/BigData/data_storage.html#about-large-scale-data
- Data standards - new subsection giving a high-level overview of CF conventions (and others?) and pointing to other resources
- Chunking - taken from https://acdguide.github.io/BigData/chunking.html
- Data formats - includes subsubsections on NetCDF (https://acdguide.github.io/BigData/data_storage.html#netcdf, https://acdguide.github.io/BigData/format_metadata.html, https://acdguide.github.io/BigData/data_storage.html#netcdf-zarr), Zarr (https://acdguide.github.io/BigData/data_storage.html#zarr), etc.
- Analysis-ready data - taken from https://acdguide.github.io/BigData/data_storage.html#advice-on-writing-datasets-for-efficient-use, also subsubsection on Pangeo Forge from https://acdguide.github.io/BigData/data_storage.html#pangeo-forge-an-open-source-framework-for-extraction-transformation-and-loading-of-scientific-data
- Tools for streamlining data access - taken from https://acdguide.github.io/BigData/accessing_data.html#methods-of-accessing-data-and-metadata
Computation - generic overview taken from https://acdguide.github.io/BigData/computations.html#general-tools, https://acdguide.github.io/BigData/computations.html#command-line-tools and https://acdguide.github.io/BigData/computations.html#other-languages-matlab-r-etc
- Analysis environments - New section including tools like conda, Jupyter and platforms like ARE, OOD (taken from https://acdguide.github.io/BigData/platforms/platforms-nci-ood.html), EasyHub?...
- Chunking - "Some tools provide explicitly chunked data models that enable you to distribute your computations across multiple chunks in parallel". Different from the Data storage/Chunking section in that it provides a computational perspective. Show Scott's great examples using dask (https://acdguide.github.io/BigData/computations.html#common-tasks)
- Tools for efficient computation - taken from https://acdguide.github.io/BigData/tools/intro.html#available-tools-and-their-best-use
- Examples - New section for including example notebooks/scripts??
Resources - taken from https://acdguide.github.io/BigData/resources.html#resources
I think this includes all the existing sections/information (and adds a few). I'm not sure about the division into "Data storage" and "Computation" sections because they're so heavily related. I'm interested to hear what people think (@paigem, @hot007, @paolap). Please do feel free to tell me that this is not needed.
P.S. There are also a number of "hints" and "asides" scattered throughout the book. I wonder if putting these in boxes would help with clarity (e.g. https://acdguide.github.io/BigData/accessing_data.html#working-with-authorised-catalogues)?