New trajectory class with parquet roundtrip functionality #1206

Open
wants to merge 7 commits into
base: main

Collaborator

@esoteric-ephemera esoteric-ephemera commented Mar 12, 2025

In support of removing ionic step / calcs reversed info from MP's mongo task collection, this adds a Trajectory class which interfaces with pyarrow/parquet, pymatgen's Trajectory, and ASE's Trajectory. Verified that the parquet roundtrip is lossless (model-dumped hashes of emmet Trajectory objects are identical before and after parquet conversion).

Since the site still needs energy convergence info, using parquet lets us retrieve just the energy data from the trajectory without loading the rest of it.

This is a middle-ground solution until the emmet-archival PR is ready.


codecov-commenter commented Mar 12, 2025

Codecov Report

Attention: Patch coverage is 82.31293% with 26 lines in your changes missing coverage. Please review.

Project coverage is 90.05%. Comparing base (947ecc8) to head (83e1b8d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| emmet-core/emmet/core/trajectory.py | 82.31% | 26 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1206      +/-   ##
==========================================
- Coverage   90.13%   90.05%   -0.08%     
==========================================
  Files         147      148       +1     
  Lines       14506    14653     +147     
==========================================
+ Hits        13075    13196     +121     
- Misses       1431     1457      +26     


Collaborator Author

esoteric-ephemera commented Mar 12, 2025

@tschaume @yang-ruoxi: this should be ready to test with the trajectories endpoint; the roundtrip is working fine. The only addition I might want to make is support for site properties (e.g., selective dynamics, magmoms, and velocities tags), but I don't think we have any in the ionic steps.

And @tsmathis, when you have time: any comments about the parquet serialization are appreciated. This is a very specific implementation of parquet serialization for an emmet object.

@tsmathis
Collaborator

Re: arrow + parquet writing, is the long-term intention to keep writing individual trajectory objects to individual parquet files?

@esoteric-ephemera
Collaborator Author

No, this is a short-term solution. To build a performant index for the task collection, removing the ionic steps / calcs reversed helps a lot (it reduces the task collection size by half). We still want to serve that info up, and we need the total energy by ionic step for the convergence graph in the website task view.

Serving up individual parquet files permits partial retrieval of energy data by task_id, and also lets users retrieve full trajectory info.

pa_table = pa.table(pa_config)
if file_name:
    with zopen(str(file_name), "wb") as f:
        pa_pq.write_table(pa_table, f)
Collaborator

@tsmathis tsmathis Mar 14, 2025

Correct me if I'm wrong, but the compression formats that are parquet compatible (‘SNAPPY’, ‘GZIP’, ‘BROTLI’, ‘LZ4’, ‘ZSTD’) don't mesh with zopen's formats.

I see that the test parquet file has a .gz extension; I'm guessing gzip is the only format that would work with zopen?

I would lean towards dropping monty here, sticking with pyarrow.parquet's read/write behavior, and supporting all the compression types that are parquet compatible. The default would then be write_table's own default compression, i.e., snappy: pyarrow.parquet.write_table.

Collaborator Author

Yeah, monty doesn't support these. Switched to the default pyarrow compression, thanks for the suggestion!

@tsmathis
Collaborator

> No this is a short-term solution: To build a performant index for the task collection, removing the ionic steps / calcs reversed helps a lot (reduces the task collection size by half). We still want to serve that info up, and need the total energy by ionic step for the convergence graph in the website task view
>
> Serving up individual parquet files permits partial retrieval of energy data by task_id, and also lets users retrieve full trajectory info

Okay, this is fine as is for now if it gets us where we need to be.
