Support on-the-fly HATS web-servers #480

Open
2 of 3 tasks
hombit opened this issue Mar 28, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

@hombit (Contributor) commented Mar 28, 2025

Feature request

@fxpineau developed an on-the-fly HATS server for VizieR catalogs. Unfortunately, we currently cannot use it because of an fsspec limitation: the HTTP response is exposed as a streaming file that does not support seek(), while pyarrow must seek to the parquet footer:

import pyarrow as pa
import pyarrow.parquet  # read_table lives in the parquet submodule
from upath import UPath

with UPath(VIZIER_HATS_PARQUET_FILE_URL).open('rb') as fh:
    pa.parquet.read_table(fh)
File ~/.virtualenvs/lsdb/lib/python3.12/site-packages/pyarrow/parquet/core.py:1793, in read_table(source, columns, use_threads, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification)
   1787     warnings.warn(
   1788         "Passing 'use_legacy_dataset' is deprecated as of pyarrow 15.0.0 "
   1789         "and will be removed in a future version.",
   1790         FutureWarning, stacklevel=2)
   1792 try:
-> 1793     dataset = ParquetDataset(
   1794         source,
   1795         schema=schema,
   1796         filesystem=filesystem,
   1797         partitioning=partitioning,
   1798         memory_map=memory_map,
   1799         read_dictionary=read_dictionary,
   1800         buffer_size=buffer_size,
   1801         filters=filters,
   1802         ignore_prefixes=ignore_prefixes,
   1803         pre_buffer=pre_buffer,
   1804         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   1805         decryption_properties=decryption_properties,
   1806         thrift_string_size_limit=thrift_string_size_limit,
   1807         thrift_container_size_limit=thrift_container_size_limit,
   1808         page_checksum_verification=page_checksum_verification,
   1809     )
   1810 except ImportError:
   1811     # fall back on ParquetFile for simple cases when pyarrow.dataset
   1812     # module is not available
   1813     if filters is not None:

File ~/.virtualenvs/lsdb/lib/python3.12/site-packages/pyarrow/parquet/core.py:1360, in ParquetDataset.__init__(self, path_or_paths, filesystem, schema, filters, read_dictionary, memory_map, buffer_size, partitioning, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification, use_legacy_dataset)
   1356 if single_file is not None:
   1357     fragment = parquet_format.make_fragment(single_file, filesystem)
   1359     self._dataset = ds.FileSystemDataset(
-> 1360         [fragment], schema=schema or fragment.physical_schema,
   1361         format=parquet_format,
   1362         filesystem=fragment.filesystem
   1363     )
   1364     return
   1366 # check partitioning to enable dictionary encoding

File ~/.virtualenvs/lsdb/lib/python3.12/site-packages/pyarrow/_dataset.pyx:1443, in pyarrow._dataset.Fragment.physical_schema.__get__()

File ~/.virtualenvs/lsdb/lib/python3.12/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/.virtualenvs/lsdb/lib/python3.12/site-packages/pyarrow/error.pxi:89, in pyarrow.lib.check_status()

File ~/.virtualenvs/lsdb/lib/python3.12/site-packages/fsspec/implementations/http.py:732, in HTTPStreamFile.seek(self, loc, whence)
    730 if loc == self.loc and whence == 0:
    731     return
--> 732 raise ValueError("Cannot seek streaming HTTP file")

ValueError: Cannot seek streaming HTTP file

Before submitting
Please check the following:

  • I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
  • I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
  • If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.
@hombit hombit added the enhancement New feature or request label Mar 28, 2025
@fxpineau

The URL of CDS on-the-fly HATS products is: https://vizcat.cds.unistra.fr/hats

@fxpineau

(Technically, it consists of a small CGI script plus Apache rewrite rules that transform parquet file paths into GET queries on the qat2s.cgi used in production for VizieR large tables.)
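The path-to-query rewrite described above can be sketched in Python. The `Norder=k/Dir=d/Npix=n.parquet` layout is the standard HATS partition path; the query parameter names below are purely hypothetical stand-ins for illustration, not the actual qat2s.cgi interface:

```python
import re

def hats_path_to_query(path: str) -> str:
    """Map a HATS parquet partition path to a CGI GET query string.

    The Norder/Dir/Npix layout follows the HATS specification; the
    returned parameter names are hypothetical, not the real qat2s.cgi
    interface.
    """
    m = re.fullmatch(r".*/Norder=(\d+)/Dir=(\d+)/Npix=(\d+)\.parquet", path)
    if m is None:
        raise ValueError(f"not a HATS partition path: {path}")
    norder, _dir, npix = m.groups()
    # Hypothetical parameter names, for illustration only.
    return f"/qat2s.cgi?order={norder}&pix={npix}"

print(hats_path_to_query("/hats/table/Norder=3/Dir=0/Npix=42.parquet"))
# → /qat2s.cgi?order=3&pix=42
```

On the server, the equivalent mapping would live in Apache rewrite rules rather than Python, so clients see ordinary static-looking parquet URLs.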
