
Commit

Merge branch 'main' into dependabot/github_actions/pre-commit-ci/lite-action-1.1.0
camposandro authored Nov 14, 2024
2 parents 57d4806 + 76a8ab1 commit 0e829f1
Showing 24 changed files with 349 additions and 151 deletions.
56 changes: 56 additions & 0 deletions .github/ISSUE_TEMPLATE/4-release_tracker.md
@@ -0,0 +1,56 @@
---
name: Release Tracker
about: Request a release batch of the LSDB, HATS, and HATS-import packages.
title: 'Release: '
labels: ''
assignees: 'delucchi-cmu'

---


## Packages to release

Please check the packages you would like to have released:

- [ ] hats
- [ ] lsdb
- [ ] hats-import

## Additional context

e.g. major/minor/patch, deadlines, blocking issues, breaking changes, folks to notify

<!-- DON'T EDIT BELOW HERE -->



<!-- ================= -->
<!-- PROCESS CHECKLIST -->
<!-- ================= -->

## hats release

- [ ] tag in github
- [ ] confirm on [pypi](https://pypi.org/manage/project/hats/releases/)
- [ ] request new conda-forge version (see [similar issue](https://github.com/conda-forge/hats-feedstock/issues/3)) and approve
- [ ] confirm on [conda-forge](https://anaconda.org/conda-forge/hats)

## lsdb release

- [ ] tag in github
- [ ] confirm on [pypi](https://pypi.org/manage/project/lsdb/releases/)
- [ ] request new conda-forge version (see [similar issue](https://github.com/conda-forge/hats-feedstock/issues/3))
- [ ] confirm tagged hats version and approve
- [ ] confirm on [conda-forge](https://anaconda.org/conda-forge/lsdb)

## hats-import release

- [ ] tag in github
- [ ] confirm on [pypi](https://pypi.org/manage/project/hats-import/releases/)
- [ ] request new conda-forge version (see [similar issue](https://github.com/conda-forge/hats-feedstock/issues/3)) and approve
- [ ] confirm tagged hats version and approve
- [ ] confirm on [conda-forge](https://anaconda.org/conda-forge/hats-import)

## Release announcement

- [ ] send release summary to LSSTC slack #lincc-frameworks-lsdb
12 changes: 5 additions & 7 deletions docs/getting-started.rst
@@ -50,9 +50,7 @@ LSDB can also be installed from source on `GitHub <https://github.com/astronomy-
advanced installation instructions in the :doc:`contribution guide </developer/contributing>`.

.. tip::
Installing optional dependencies

There are some extra dependencies that can make running LSDB in a jupyter
There are some extra dependencies that can make running LSDB in a Jupyter
environment easier, or connecting to a variety of remote file systems.

These can be installed with the ``full`` extra.
@@ -64,8 +62,8 @@ advanced installation instructions in the :doc:`contribution guide </developer/c
Quickstart
--------------------------

LSDB is built on top of `Dask DataFrame <https://docs.dask.org/en/stable/dataframe.html>`_ which allows workflows
to run in parallel on distributed environments, and scale to large, out of memory datasets. For this to work,
LSDB is built on top of `Dask DataFrame <https://docs.dask.org/en/stable/dataframe.html>`_, which allows workflows
to run in parallel on distributed environments and scale to large, out-of-memory datasets. For this to work,
Catalogs are loaded **lazily**, meaning that only the metadata is loaded at first. This way, LSDB can plan
how tasks will be executed in the future without actually doing any computation. See our :doc:`tutorials </tutorials>`
for more information.
@@ -94,8 +92,8 @@ Catalog, and specify which columns we want to use from it.


Here we can see the lazy representation of an LSDB catalog object, showing its metadata such as the column
names and their types without loading any data (See the ellipsis in the table as placeholders where you would
usually see values).
names and their types without loading any data. The ellipses in the table act as placeholders where you would
usually see values.
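
A minimal sketch of this lazy-loading step, where the catalog location and column names are purely illustrative placeholders and the `columns=` keyword reflects the column selection described above:

```python
import lsdb

# Lazy: only the catalog metadata (column names and types) is read here.
catalog = lsdb.read_hats(
    "path/to/my_catalog",          # hypothetical HATS catalog location
    columns=["ra", "dec", "mag"],  # hypothetical column subset
)
catalog  # in a notebook, this displays the schema with ellipses as value placeholders
```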

.. important::

7 changes: 4 additions & 3 deletions docs/tutorials.rst
@@ -14,6 +14,7 @@ An introduction to LSDB's core features and functionality

Getting data into LSDB <tutorials/getting_data>
Filtering large catalogs <tutorials/filtering_large_catalogs>
Joining catalogs <tutorials/join_catalogs>
Exporting results <tutorials/exporting_results>

Advanced Topics
@@ -27,10 +28,10 @@ A more in-depth look into how LSDB works

Topic: Import catalogs <tutorials/import_catalogs>
Topic: Manual catalog verification <tutorials/manual_verification>
Topic: Accessing Remote Data <tutorials/remote_data>
Topic: Accessing remote data <tutorials/remote_data>
Topic: Margins <tutorials/margins>
Topic: Index Tables <tutorials/index_table>
Topic: Performance Testing <tutorials/performance>
Topic: Index tables <tutorials/pre_executed/index_table>
Topic: Performance testing <tutorials/performance>

Example Science Use Cases
---------------------------
Binary file modified docs/tutorials/_static/crossmatching-performance.png
6 changes: 3 additions & 3 deletions docs/tutorials/exporting_results.ipynb
@@ -8,9 +8,9 @@
"\n",
"You can save the catalogs that result from running your workflow to disk, in parquet format, using the `to_hats` call. \n",
"\n",
"You must provide a `base_catalog_path`, which is the output path for your catalog directory, and (optionally) a name for your catalog, `catalog_name`. The `catalog_name` is the catalog's internal name and therefore may differ from the catalog's base directory name. If the directory already exists and you want to overwrite its content set the `overwrite` flag to True. Do not forget to provide the necessary credentials, as `storage_options` to the UPath construction, when trying to export the catalog to protected remote storage.\n",
"You must provide a `base_catalog_path`, which is the output path for your catalog directory, and (optionally) a name for your catalog, `catalog_name`. The `catalog_name` is the catalog's internal name and therefore may differ from the catalog's base directory name. If the directory already exists and you want to overwrite its content, set the `overwrite` flag to True. Do not forget to provide the necessary credentials, as `storage_options` to the UPath construction, when trying to export the catalog to protected remote storage.\n",
"\n",
"For example, to save a catalog that contains the results of crossmatching Gaia with ZTF to `\"./my_catalogs/gaia_x_ztf\"` one could run:\n",
"For example, to save a catalog that contains the results of crossmatching Gaia with ZTF to `\"./my_catalogs/gaia_x_ztf\"`, one could run:\n",
"```python\n",
"gaia_x_ztf_catalog.to_hats(base_catalog_path=\"./my_catalogs/gaia_x_ztf\", catalog_name=\"gaia_x_ztf\")\n",
"```\n",
@@ -36,7 +36,7 @@
"└── provenance_info.json\n",
"```\n",
"\n",
"The data is partitioned spatially and it is stored, in `parquet` format, according to the area of the respective partitions in the sky. Each parquet file represents a partition. The higher the `Norder` for a partition, the smaller it is in area. As a result, because partitions contain (approximately) the same number of points, those in a directory with a larger `Norder` hold data for denser regions of the sky. All this information is encoded in metadata files that exist at the root of our catalog."
"The data is partitioned spatially and is stored, in `parquet` format, according to the area of the respective partitions in the sky. Each parquet file represents a partition. The higher the `Norder` for a partition, the smaller it is in area. As a result, because partitions contain (approximately) the same number of points, those in a directory with a larger `Norder` hold data for denser regions of the sky. All this information is encoded in metadata files that exist at the root of our catalog."
]
},
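
The `to_hats` call above also works with protected remote storage: as noted earlier, the credentials are passed as storage options when constructing the `UPath`. A hedged sketch, where the bucket name and credential keywords are placeholders for whatever your remote filesystem (here assumed to be S3) expects:

```python
from upath import UPath

# Keyword arguments on UPath are forwarded as storage options to the filesystem.
base_path = UPath(
    "s3://my-bucket/catalogs/gaia_x_ztf",  # hypothetical destination
    key="MY_ACCESS_KEY",                   # hypothetical credentials
    secret="MY_SECRET_KEY",
)

# overwrite=True replaces any existing content at the destination, as described above.
gaia_x_ztf_catalog.to_hats(base_catalog_path=base_path, catalog_name="gaia_x_ztf", overwrite=True)
```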
{
22 changes: 11 additions & 11 deletions docs/tutorials/filtering_large_catalogs.ipynb
@@ -9,11 +9,11 @@
"source": [
"# Filtering large catalogs\n",
"\n",
"Large astronomical surveys contain a massive volume of data. Billion object, multi-terabyte sized catalogs are challenging to store and manipulate because they demand state-of-the-art hardware. Processing them is expensive, both in terms of runtime and memory consumption, and performing it in a single machine has become impractical. LSDB is a solution that enables scalable algorithm execution. It handles loading, querying, filtering and crossmatching astronomical data (of HATS format) in a distributed environment. \n",
"Large astronomical surveys contain a massive volume of data. Billion-object, multi-terabyte-sized catalogs are challenging to store and manipulate because they demand state-of-the-art hardware. Processing them is expensive, both in terms of runtime and memory consumption, and doing so on a single machine has become impractical. LSDB is a solution that enables scalable algorithm execution. It handles loading, querying, filtering, and crossmatching astronomical data (of HATS format) in a distributed environment. \n",
"\n",
"In this tutorial, we will demonstrate how to:\n",
"\n",
"1. Setup a Dask client for distributed processing\n",
"1. Set up a Dask client for distributed processing\n",
"2. Load an object catalog with a set of desired columns\n",
"3. Select data from regions of the sky\n",
"4. Preview catalog data"
@@ -40,14 +40,14 @@
"\n",
"Dask is a framework that allows us to take advantage of distributed computing capabilities. \n",
"\n",
"With Dask, the operations defined in a workflow (e.g. this notebook) are added to a task graph which optimizes their order of execution. The operations are not immediately computed, that's for us to decide. As soon as we trigger the computations, Dask distributes the workload across its multiple workers and tasks are run efficiently in parallel. Usually the later we kick off the computations the better.\n",
"With Dask, the operations defined in a workflow (e.g. this notebook) are added to a task graph which optimizes their order of execution. The operations are not immediately computed; that's for us to decide. As soon as we trigger the computations, Dask distributes the workload across its multiple workers, and tasks are run efficiently in parallel. Usually, the later we kick off the computations, the better.\n",
"\n",
"Dask creates a client by default, if we do not instantiate one. If we do, we may, among others:\n",
"Dask creates a client by default, if we do not instantiate one. If we do, we may set options such as:\n",
"\n",
"- Specify the number of workers and the memory limit for each of them.\n",
"- Adjust the address for the dashboard to profile the operations while they run (by default it serves on port _8787_).\n",
"- Adjust the address for the dashboard that profiles the operations while they run (by default, it serves on port _8787_).\n",
"\n",
"For additional information please refer to https://distributed.dask.org/en/latest/client.html."
"For additional information, please refer to https://distributed.dask.org/en/latest/client.html."
]
},
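
A minimal sketch of the client setup described above; the worker count, memory limit, and dashboard port below are illustrative values rather than recommendations:

```python
from dask.distributed import Client

# Start a local cluster: 4 workers with 2 GiB each, dashboard on the default port 8787.
client = Client(n_workers=4, memory_limit="2GiB", dashboard_address=":8787")
print(client.dashboard_link)  # open this URL to profile tasks while they run

# ... define and compute the LSDB workflow here ...

client.close()  # shut the workers down once the workflow is done
```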
{
@@ -72,7 +72,7 @@
"\n",
"We will load a small 5 degree radius cone from the `ZTF DR14` object catalog. \n",
"\n",
"Catalogs represent tabular data and are internally subdivided into partitions based on their positions in the sky. When processing a catalog, each worker is expected to be able to load a single partition at a time into memory for processing. Therefore, when loading a catalog, it's crucial to __specify the subset of columns__ we need for our science pipeline. Failure to specify these columns results in loading the entire partition table, which not only increases the usage of worker memory but also impacts runtime performance significantly."
"Catalogs represent tabular data and are internally subdivided into partitions based on their positions in the sky. When processing a catalog, each worker is expected to be able to load a single partition at a time into memory for processing. Therefore, when loading a catalog, it's crucial to __specify the subset of columns__ we need for our science pipeline. Failure to specify these columns results in loading the entire partition table, which not only increases the usage of worker memory, but also impacts runtime performance significantly."
]
},
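
A hedged sketch of this loading step: the catalog path and column names are placeholders, and the cone-search call mirrors the 5-degree region used in this tutorial (argument names may differ between LSDB versions):

```python
import lsdb

# Load lazily, keeping only the columns the science pipeline actually needs.
ztf_object = lsdb.read_hats(
    "path/to/ztf_dr14/object",             # hypothetical catalog location
    columns=["oid", "ra", "dec", "nobs"],  # hypothetical column subset
)

# Keep only a 5-degree-radius cone (5 degrees = 18000 arcseconds).
ztf_cone = ztf_object.cone_search(ra=280.0, dec=-30.0, radius_arcsec=5 * 3600)
```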
{
@@ -102,7 +102,7 @@
"id": "9061e278-b55b-433a-af34-45b31d5295cc",
"metadata": {},
"source": [
"The catalog has been loaded lazily, we can see its metadata but no actual data is there yet. We will be defining more operations in this notebook. Only when we call `compute()` on the resulting catalog are operations executed, i.e. data is loaded from disk into memory for processing."
"The catalog has been loaded lazily: we can see its metadata, but no actual data is there yet. We will be defining more operations in this notebook. Only when we call `compute()` on the resulting catalog are operations executed; i.e., data is loaded from disk into memory for processing."
]
},
{
@@ -134,7 +134,7 @@
"id": "bdd68cda",
"metadata": {},
"source": [
"Throughout this notebook we will use the Catalog's `plot_pixels` method to display the HEALPix of each resulting catalog as filters are applied."
"Throughout this notebook, we will use the Catalog's `plot_pixels` method to display the HEALPix of each resulting catalog as filters are applied."
]
},
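
A one-line sketch of that call, reusing the cone-filtered catalog from the assumed example above:

```python
# Plot the HEALPix pixel map of the filtered catalog's sky coverage.
ztf_cone.plot_pixels()
```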
{
@@ -330,7 +330,7 @@
"\n",
"Computing an entire catalog requires loading all of its resulting data into memory, which is expensive and may lead to out-of-memory issues. \n",
"\n",
"Often our goal is to have a peek at a slice of data to make sure the workflow output is reasonable (e.g. to assess if some new created columns are present and their values have been properly processed). `head()` is a pandas-like method which allows us to preview part of the data for this purpose. It iterates over the existing catalog partitions, in sequence, and finds up to `n` number of rows.\n",
"Often, our goal is to have a peek at a slice of data to make sure the workflow output is reasonable (e.g., to assess if some new created columns are present and their values have been properly processed). `head()` is a pandas-like method which allows us to preview part of the data for this purpose. It iterates over the existing catalog partitions, in sequence, and finds up to `n` number of rows.\n",
"\n",
"Notice that this method implicitly calls `compute()`."
]
@@ -350,7 +350,7 @@
"id": "2233de79-1dc4-4737-a5a9-794e8f1c3d9b",
"metadata": {},
"source": [
"By default the first 5 rows of data will be shown but we can specify a higher number if we need."
"By default, the first 5 rows of data will be shown, but we can specify a higher number if we need."
]
},
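
A short sketch of the preview step, again reusing the assumed `ztf_cone` catalog:

```python
# Preview a handful of rows without materializing the whole catalog.
first_rows = ztf_cone.head()    # first 5 rows by default
more_rows = ztf_cone.head(20)   # request up to 20 rows instead
```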
{
2 changes: 1 addition & 1 deletion docs/tutorials/getting_data.ipynb
@@ -100,7 +100,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"When invoking `read_hats` only metadata information about that catalog (e.g. sky coverage, number of total rows and column schema) is loaded into memory! Notice that the ellipses in the previous catalog representation are just placeholders.\n",
"When invoking `read_hats`, only metadata information about that catalog (e.g. sky coverage, number of total rows, and column schema) is loaded into memory! Notice that the ellipses in the previous catalog representation are just placeholders.\n",
"\n",
"You will find that most use cases start with **LAZY** loading and planning operations, followed by more expensive **COMPUTE** operations. The data is only loaded into memory when we trigger the workflow computations, usually with a `compute` call.\n",
"\n",
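A compact sketch of that lazy-then-compute pattern, with a placeholder path and column subset:

```python
import lsdb

catalog = lsdb.read_hats("path/to/catalog", columns=["ra", "dec"])  # LAZY: metadata only
df = catalog.compute()  # COMPUTE: the data is loaded into memory and returned as a DataFrame
```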
16 changes: 4 additions & 12 deletions docs/tutorials/import_catalogs.ipynb
@@ -11,8 +11,8 @@
"\n",
"This notebook presents two modes of importing catalogs to HATS format:\n",
"\n",
"1. `lsdb.from_dataframe()` method: helpful to load smaller catalogs from a single dataframe. data should have fewer than 1-2 million rows and the pandas dataframe should be less than 1-2G in-memory. if your data is larger, the format is complicated, you need more flexibility, or you notice any performance issues when importing with this mode, use the next mode.\n",
"2. `hats-import` package: for large datasets (1G - 100s of TB). this is a purpose-built map-reduce pipeline for creating HATS catalogs from various datasets. in this notebook, we use a very basic dataset and basic import options. please see [the full package documentation](https://hats-import.readthedocs.io/) if you need to do anything more complicated."
"1. The `lsdb.from_dataframe()` method is useful for loading smaller catalogs from a single DataFrame. The data should have fewer than 1-2 million rows, and the pandas DataFrame should occupy less than 1-2 GB in memory. If your data is larger, has a complex format, requires greater flexibility, or if you encounter performance issues with this method, consider using the next mode.\n",
"2. The hats-import package is designed for large datasets (from 1 GB to hundreds of terabytes). This is a purpose-built map-reduce pipeline for creating HATS catalogs from various datasets. in this notebook, we use a very basic dataset and simple import options. Please see [the full package documentation](https://hats-import.readthedocs.io/) if you need to do anything more complicated."
]
},
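
A hedged sketch of the first mode, `lsdb.from_dataframe()`, using the `small_sky` CSV introduced below; the `ra_column`/`dec_column` keywords are assumptions to check against the `from_dataframe` documentation for your LSDB version:

```python
import lsdb
import pandas as pd

# Read the small input table into memory first.
df = pd.read_csv("tests/data/raw/small_sky/small_sky.csv")

# Build a HATS-partitioned catalog directly from the DataFrame.
catalog = lsdb.from_dataframe(
    df,
    catalog_name="small_sky",
    ra_column="ra",    # assumed keyword for the right-ascension column
    dec_column="dec",  # assumed keyword for the declination column
)
```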
{
@@ -38,7 +38,7 @@
"id": "529d9e23",
"metadata": {},
"source": [
"We will be importing `small_sky` from a single CSV file.\n",
"We will be importing `small_sky` from a single CSV file. (If you did not install `lsdb` from source, you can find the file [here](https://github.com/astronomy-commons/lsdb/blob/main/tests/data/raw/small_sky/small_sky.csv) and modify the paths accordingly.)\n",
"\n",
"Let's define the input and output paths:"
]
@@ -190,7 +190,7 @@
"collapsed": false
},
"source": [
"Let's read both catalogs, from disk, and check that the two methods produced the same output:"
"Let's read both catalogs (from disk) and check that the two methods produced the same output:"
]
},
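
One hedged way to make that comparison, assuming both catalogs were written to the placeholder output paths below and are small enough to fit in memory:

```python
import lsdb
import pandas as pd

from_dataframe_catalog = lsdb.read_hats("output/from_dataframe/small_sky")  # placeholder path
hats_import_catalog = lsdb.read_hats("output/hats_import/small_sky")        # placeholder path

# Materialize both catalogs and compare the resulting tables row for row.
pd.testing.assert_frame_equal(
    from_dataframe_catalog.compute(),
    hats_import_catalog.compute(),
)
```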
{
@@ -263,14 +263,6 @@
"source": [
"tmp_dir.cleanup()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "240241dc",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {