-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New trajectory class with parquet rountrip functionality #1206
base: main
Are you sure you want to change the base?
New trajectory class with parquet rountrip functionality #1206
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1206 +/- ##
==========================================
- Coverage 90.13% 90.05% -0.08%
==========================================
Files 147 148 +1
Lines 14506 14653 +147
==========================================
+ Hits 13075 13196 +121
- Misses 1431 1457 +26 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@tschaume @yang-ruoxi : should be ready to test with the trajectories endpoint, roundtrip is working fine. Only additions I might want to make are supporting site properties, but I don't think we have any in the ionic steps (e.g., selective dynamics, magmoms, and velocities tags) And @tsmathis when you have time : any comments about the parquet serialization are appreciated - this is a very specific implementation of parquet serialization for an emmet object |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
re: arrow + parquet writing, is the long term intention to keep writing individual trajectory objects to individual parquet files? |
No this is a short-term solution: To build a performant index for the task collection, removing the ionic steps / calcs reversed helps a lot (reduces the task collection size by half). We still want to serve that info up, and need the total energy by ionic step for the convergence graph in the website task view Serving up individual parquet files permits partial retrieval of energy data by task_id, and also lets users retrieve full trajectory info |
emmet-core/emmet/core/trajectory.py
Outdated
pa_table = pa.table(pa_config) | ||
if file_name: | ||
with zopen(str(file_name), "wb") as f: | ||
pa_pq.write_table(pa_table, f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but the compression formats that are parquet compatible (‘SNAPPY’, ‘GZIP’, ‘BROTLI’, ‘LZ4’, ‘ZSTD’) don't mesh with zopen
's formats.
I see in the test files there is a .gz
extension for the test parquet file, I'm guessing this is the only format that would work with zopen
?
I would opt towards dropping monty here and sticking with pyarrow.parquet's read/write behavior and support all the compression types that are parquet compatible. And have the default be the default compression format for write_table, i.e., snappy
: pyarrow.parquet.write_table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah monty
doesn't support these - switched to default pyarrow compression, thanks for the suggestion!
Okay, fine as is for now then if this will get us where we need to be. |
In support of removing ionic step / calcs reversed info from MP's mongo task collection, this adds a
Trajectory
class which interfaces with pyarrow/parquet, pymatgen'sTrajectory
, and ASE'sTrajectory
. Verified that parquet rountrip works perfectly (model-dumped hashes of emmetTrajectory
objects are identical before and after parquet conversion).Since the site still needs energy convergence info, using parquet lets us partially retrieve the energy data from the trajectory
This is a middle-ground solution until the emmet-archival PR is ready