Profile data dumping #6723
Conversation
Data dumping for groups and profiles (#6723)

This PR adds functionality to incrementally dump profile and group data.

Public API
The data dumping feature can be used from the CLI via the commands in `cmd_process.py`, `cmd_group.py`, and `cmd_profile.py`, as well as through the Python facades in `src/aiida/tools/dumping/facades.py`. The internal implementation of the feature is contained in the private classes and methods of `src/aiida/tools/dumping/`.

Configuration of dumping
To organize the extensive options, a data class for the config options is used (the `DumpConfig` pydantic model).

State of dumping folder
The dumping functionality tracks the state of the dumped folder with respect to AiiDA's database via a JSON log file, so that subsequent dumps only apply the incremental changes.

Execution of the dump
The dump itself is orchestrated by the engine and the manager classes. Finally, code that was previously contained in the `ProcessDumper` class has been split into more modular components.
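To make the configuration approach above concrete, here is a minimal sketch of such a config data class. The actual `DumpConfig` model in this PR may define different fields; the option names below are assumptions for illustration only.

```python
# Minimal sketch of a pydantic config model for dump options, as described
# above. The actual `DumpConfig` in this PR may differ; the option names
# below are assumptions for illustration only.
from pathlib import Path
from typing import Optional

from pydantic import BaseModel


class DumpConfig(BaseModel):
    """Bundle the many dump options instead of passing them around individually."""

    dump_root: Optional[Path] = None  # assumed: top-level output directory
    overwrite: bool = False           # assumed: overwrite existing dump folders
    include_inputs: bool = True       # assumed: also dump calculation input files
    include_outputs: bool = True      # assumed: also dump calculation output files
    flat: bool = False                # assumed: do not create nested sub-directories


# A facade could then be configured from such an object, e.g.:
# dumper = ProfileDumper.from_config(DumpConfig(overwrite=True))
```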
Thanks for the huge work. Let's merge this giant.
Congratulations 🎊 It's hard to imagine how hard it is to bring > 7000 lines of changes to aiida-core; nice work.
Don't forget to open an issue if you still think it is a good idea 😉
So, while I finalize things here, I'd say this is ready for a first review. At least the user-facing CLI in `cmd_process.py`, `cmd_group.py`, and `cmd_profile.py`, as well as the user-facing Python API in `src/aiida/tools/dumping/facades.py`, which defines the `ProcessDumper`, `GroupDumper`, and `ProfileDumper` facades (simple, user-facing entry points). For anybody reviewing, start from there (a short usage sketch also follows the component overview below).

I'd say we release the feature as a kind of beta, testing version. Testing all the edge cases and the internal implementation probably won't be possible before the planned release. I made almost all methods private, and the user doesn't interact with the internal implementation at all anyway, so we can still modify things there without worrying about backwards compatibility. Maybe someone with more experience can chime in on what a common approach is here. One could also prepend the internal classes with a leading underscore, or move the files to a subdirectory like `_internal`, but I haven't seen that anywhere else in the code base, so maybe it's fine as it is.

The main components of the implementation are the following:
dumping/
├── config.py: `DumpConfig` pydantic model that holds the various configuration options (as well as some helper Enum classes)
├── detect.py: `DumpChangeDetector` class to detect changes between dump operations (based on information from the JSON log file), and `DumpNodeQuery` class to query nodes from AiiDA's DB with filters (e.g., time-based, code, computer, already dumped nodes, etc.)
├── engine.py: Top-level orchestrator of the dumping operation (setup and teardown operations, e.g., initialize classes such as the `DumpLogger`, prepare the output directory, perform the dump, save the log)
├── facades.py: User-facing `Dumper` classes with `from_config`, `dump`, and methods to verify the passed AiiDA entities
├── logger.py: Mainly the `DumpLogger` class that keeps track of dumped entities and their paths, the dump time, and the groups-to-nodes and nodes-to-groups mappings
├── managers/ (classes responsible for actually executing the various dump operations)
│   ├── deletion.py: `DeletionManager` class that takes care of deleting directories and log entries for entities deleted from AiiDA's DB between dump operations (as detected by the `DumpChangeDetector`)
│   ├── process.py: `ProcessDumpManager` to orchestrate dumping of processes, and helper classes (in the previously released `verdi process dump` feature, all functionality was contained in the `ProcessDumper` class instead; this is now more modular). For each encountered process, there are various possible actions: `skip` (node already dumped and not necessary to dump again), `dump_primary` (first, "normal" node dump), `dump_duplicate` (node already dumped elsewhere, but dumped again, e.g., for a duplicated group or a node contained in two groups), and `symlink` (node dump directory already exists, so create a symlink)
│   └── profile.py: `ProfileDumpManager` class that orchestrates all necessary operations when dumping a profile (group and node deletions, group updates (relabel, node removal/addition), dumping of new nodes and groups)
├── mapping.py: `GroupNodeMapping` class that holds the group-to-nodes and nodes-to-groups mapping, as well as functionality to get the mappings from AiiDA's DB and calculate the diff between two mappings (used by the `DumpChangeDetector` and stored via the `DumpLogger`)
└── utils/
    ├── helpers.py: Various helper classes (mainly dataclasses), e.g., `DumpTimes` to track last and current dump time, containers for group and node changes, and store classes to hold entities to be dumped/deleted
    └── paths.py: `DumpPaths` class to track the top-level dump path and sub-paths during the dumping, and that compiles various `staticmethod`s for path modifications during the dumping
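As referenced above, here is a rough usage sketch of the facades. Only `from_config` and `dump` are named explicitly in this PR; the constructor arguments and option names below are assumptions for illustration.

```python
# Rough usage sketch of the facade classes described above. Constructor
# signatures and option names are assumptions for illustration only.
from aiida import load_profile, orm
from aiida.tools.dumping.facades import GroupDumper, ProcessDumper, ProfileDumper

load_profile()

# Dump a single process, analogous to `verdi process dump`
process = orm.load_node(1234)              # assumed PK, for illustration
ProcessDumper(process).dump()               # constructor signature is assumed

# Dump a group, passing configuration options as keyword arguments
group = orm.load_group('my-group')          # assumed group label
GroupDumper(group, overwrite=True).dump()   # option name is assumed

# Incrementally dump the whole profile
ProfileDumper().dump()
```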
@unkcpz mentioned using a tree data structure to represent the dumping directory/relationships, rather than the "flat" log I have now. This aligns well with the data organization via groups and nested workflows, and it could also allow for quick diffs, so I think it's a good idea. However, if I completely modify the implementation now, the feature will never make it into v2.7, so I'd rather open an issue and spend some time investigating that approach later on.
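If the tree-based representation is explored later, a minimal sketch of such a structure could look like the following. This is purely hypothetical and not part of this PR.

```python
# Purely hypothetical sketch of a tree-based dump log, as suggested by
# @unkcpz, instead of the current flat JSON log. Not part of this PR.
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class DumpTreeNode:
    """One entry of the dump directory tree (a group, workflow, or calculation)."""

    uuid: str
    path: Path
    children: list[DumpTreeNode] = field(default_factory=list)

    def find(self, uuid: str) -> DumpTreeNode | None:
        """Depth-first lookup, e.g., to decide whether a node was already dumped."""
        if self.uuid == uuid:
            return self
        for child in self.children:
            if (match := child.find(uuid)) is not None:
                return match
        return None
```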
@edan-bainglass mentioned adding `.dump` methods to the ORM classes. I also think this is interesting, and it could possibly already be added now, at least for processes, groups, and profiles. One can do that by just instantiating the facade classes, passing any configuration options via `kwargs`, and calling `dump` on the facade, similar to how the functionality is exposed via the CLI (see the sketch below). In addition, we discussed removing the (quite lengthy) Python dictionary representations of the dump output directory structures that I currently use for the integration tests of the top-level dumper classes, in favor of regression tests or some other representation, e.g., YAML.

@mikibonacci, I changed the default process dump path such that it uses the node label, if one is available. Could you please check if it now works for your AiiDAlab QE use case?
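A minimal sketch of how such a `.dump` convenience method could delegate to the facade; where the method would live and the facade's constructor signature are assumptions for illustration.

```python
# Hypothetical sketch of the `.dump` convenience method discussed above: it
# simply delegates to the user-facing facade. The facade's constructor
# signature and available kwargs are assumptions for illustration only.
from aiida import orm
from aiida.tools.dumping.facades import ProcessDumper


class DumpableProcessNode(orm.ProcessNode):
    """Illustration only; in practice the method would be added to `orm.ProcessNode` itself."""

    def dump(self, **kwargs):
        """Dump this process by delegating to the user-facing facade."""
        # Configuration options are forwarded as keyword arguments.
        return ProcessDumper(self, **kwargs).dump()
```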
Also pinging other people in the team for notification and dog-fooding: @mbercx, @agoscinski, @khsrali, @superstar54.
Other notes:
- `--delete-missing` option?
- `graph_traversal_rules` when updating directories after a node was deleted.
- `graph_traversal_rules` and add `get_nodes_dump` to `src/aiida/tools/graph/graph_traversers.py`, as well as `AiidaEntitySet` from `src/aiida/tools/graph/age_entities.py`, etc., to first obtain the nodes, and then run the dumping (see the sketch below).
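As a rough illustration of the last point (first collect the nodes via graph traversal, then run the dump), a hypothetical `get_nodes_dump` could look roughly like this. The function does not exist in aiida-core yet; its name, signature, and simplified traversal are assumptions.

```python
# Hypothetical sketch of a `get_nodes_dump` helper, illustrating the idea of
# first collecting the nodes via graph traversal and only then dumping them.
# This function does not exist in aiida-core; name and behaviour are assumptions.
from aiida import orm


def get_nodes_dump(starting_pks: list[int]) -> set[int]:
    """Collect the given processes and all of their called descendants."""
    pks: set[int] = set()
    for pk in starting_pks:
        node = orm.load_node(pk)
        pks.add(node.pk)
        if isinstance(node, orm.ProcessNode):
            # `called_descendants` follows the CALL links of a workflow downwards
            pks.update(descendant.pk for descendant in node.called_descendants)
    return pks


# The dumping engine could then operate on `get_nodes_dump([...])` instead of
# querying the database itself.
```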