-
Notifications
You must be signed in to change notification settings - Fork 74
New trajectory class with parquet rountrip functionality #1206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1206 +/- ##
==========================================
- Coverage 90.15% 90.06% -0.09%
==========================================
Files 147 148 +1
Lines 14509 14663 +154
==========================================
+ Hits 13080 13206 +126
- Misses 1429 1457 +28 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@tschaume @yang-ruoxi : should be ready to test with the trajectories endpoint, roundtrip is working fine. Only additions I might want to make are supporting site properties, but I don't think we have any in the ionic steps (e.g., selective dynamics, magmoms, and velocities tags) And @tsmathis when you have time : any comments about the parquet serialization are appreciated - this is a very specific implementation of parquet serialization for an emmet object |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
re: arrow + parquet writing, is the long term intention to keep writing individual trajectory objects to individual parquet files? |
No this is a short-term solution: To build a performant index for the task collection, removing the ionic steps / calcs reversed helps a lot (reduces the task collection size by half). We still want to serve that info up, and need the total energy by ionic step for the convergence graph in the website task view Serving up individual parquet files permits partial retrieval of energy data by task_id, and also lets users retrieve full trajectory info |
emmet-core/emmet/core/trajectory.py
Outdated
pa_table = pa.table(pa_config) | ||
if file_name: | ||
with zopen(str(file_name), "wb") as f: | ||
pa_pq.write_table(pa_table, f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong, but the compression formats that are parquet compatible (‘SNAPPY’, ‘GZIP’, ‘BROTLI’, ‘LZ4’, ‘ZSTD’) don't mesh with zopen
's formats.
I see in the test files there is a .gz
extension for the test parquet file, I'm guessing this is the only format that would work with zopen
?
I would opt towards dropping monty here and sticking with pyarrow.parquet's read/write behavior and support all the compression types that are parquet compatible. And have the default be the default compression format for write_table, i.e., snappy
: pyarrow.parquet.write_table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah monty
doesn't support these - switched to default pyarrow compression, thanks for the suggestion!
Okay, fine as is for now then if this will get us where we need to be. |
Thank you @esoteric-ephemera! Once we get around to trying to put the new Task API into production, I'll have to double-check whether it's more performant to put the data needed for the convergence graph back into the collection or retrieve them through parquet files (preferably the same @tsmathis is using for the builders). |
84b956c
to
d522ea4
Compare
…s bandgaps, add approx to tests
@tschaume, when the |
@esoteric-ephemera This comes down to this line in the workflow file. I can't remember why I decided to only install an editable version of emmet-core for testing the other emmet namespace packages. We could change that line to |
PS: actually I remember :) The next step in the action should install the current emmet-core over any version defined in the requirements file. |
OK makes sense, but for releasing upper version pins on dependencies, this can cause CI issues |
If the upper version pin is on dependencies other than |
In support of removing ionic step / calcs reversed info from MP's mongo task collection, this adds a
Trajectory
class which interfaces with pyarrow/parquet, pymatgen'sTrajectory
, and ASE'sTrajectory
. Verified that parquet rountrip works perfectly (model-dumped hashes of emmetTrajectory
objects are identical before and after parquet conversion).Since the site still needs energy convergence info, using parquet lets us partially retrieve the energy data from the trajectory
This is a middle-ground solution until the emmet-archival PR is ready