Profile data dumping #6723

Merged
merged 86 commits into from
Jun 6, 2025

GeigerJ2
Contributor

@GeigerJ2 GeigerJ2 commented Jan 22, 2025

So, while I finalize things here, I'd say this is ready for a first review, at least for the user-facing CLI in cmd_process.py, cmd_group.py, and cmd_profile.py, as well as the user-facing Python API in src/aiida/tools/dumping/facades.py, which defines the ProcessDumper, GroupDumper, and ProfileDumper facades (simple, user-facing entry points). Anybody reviewing should start from there.

I'd say we release the feature as a kind of beta or testing version. Testing all the edge cases and the internal implementation probably won't be possible before the planned release. I made almost all methods private, and the user doesn't interact with the internal implementation at all anyway, so we can still modify things there without worrying about backwards compatibility. Maybe someone with more experience can chime in on what a common approach is here. One could also prepend a leading underscore to the internal classes, or move the files to a subdirectory like _internal, but I haven't seen that anywhere else in the code base, so maybe it's fine as it is.

The main components of the implementation are the following:

dumping/
├── config.py: DumpConfig pydantic model that holds the various configuration options (as well as some helper Enum classes)
├── detect.py: DumpChangeDetector class to detect changes between dump operations (based on information from the JSON log file), and DumpNodeQuery class to query nodes from AiiDA's DB with filters (e.g., time-based, code, computer, already dumped nodes, etc.)
├── engine.py: Top-level orchestrator of the dumping operation (setup and teardown operations, e.g., initialize classes such as the DumpLogger, prepare output directory, perform dump, save log)
├── facades.py: User-facing Dumper classes with from_config, dump, and methods to verify the passed AiiDA entities
├── logger.py: Mainly DumpLogger class that keeps track of dumped entities and their paths, the dump time, and the groups-to-nodes and nodes-to-groups mappings
├── managers/ (classes responsible for actually executing the various dump operations)
│ ├── deletion.py: DeletionManager class that takes care of deleting directories and log entries for entities deleted from AiiDA's DB between dump operations (as detected by the DumpChangeDetector)
│ ├── process.py: ProcessDumpManager to orchestrate dumping of processes, and helper classes (in the previously released verdi process dump feature, all functionality was contained in the ProcessDumper class instead; this is now more modular). For each encountered process, there are various possible actions (skip (node already dumped and not necessary to dump again), dump_primary (first, "normal" node dump), dump_duplicate (node already dumped elsewhere, but dump again, e.g., for duplicated group or node contained in two groups), symlink (node dump directory already exists, so make symlink))
│ └── profile.py: ProfileDumpManager class that orchestrates all necessary operations when dumping a profile (group and node deletions, group updates (relabel, node removal/addition), dumping of new nodes and groups)
├── mapping.py: GroupNodeMapping class that holds the group-to-nodes and nodes-to-groups mapping, as well as functionality to get the mappings from AiiDA's DB and calculate the diff between two mappings (used by the DumpChangeDetector and stored via the DumpLogger)
└── utils/
    ├── helpers.py: Various helper classes (mainly dataclasses), e.g., DumpTimes to track last and current dump time, containers for group and node changes, store classes to hold entities to be dumped/deleted
    └── paths.py: DumpPaths class that tracks the top-level dump path and the sub-paths during the dumping, and that collects various static methods for path modifications during the dumping
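
The four per-process actions listed under managers/process.py can be pictured as a small decision function. The following is only a sketch under stated assumptions: the DumpAction enum, the choose_action function, and its three boolean inputs are hypothetical names made up for illustration, not the actual code in this PR.

```python
from enum import Enum


class DumpAction(Enum):
    """Possible actions for one encountered process (names taken from the PR text)."""

    SKIP = "skip"
    DUMP_PRIMARY = "dump_primary"
    DUMP_DUPLICATE = "dump_duplicate"
    SYMLINK = "symlink"


def choose_action(already_dumped: bool, same_target: bool, prefer_symlinks: bool) -> DumpAction:
    """Pick an action for a single process node (hypothetical decision logic).

    already_dumped:  the log already holds a dump path for this node.
    same_target:     the requested target directory is the one already logged.
    prefer_symlinks: assumed option to link to an existing dump instead of copying.
    """
    if not already_dumped:
        # First, "normal" dump of the node.
        return DumpAction.DUMP_PRIMARY
    if same_target:
        # Node is already where it should be, nothing to do.
        return DumpAction.SKIP
    if prefer_symlinks:
        # The dump directory exists elsewhere, so link to it.
        return DumpAction.SYMLINK
    # Node dumped elsewhere (e.g. it is contained in two groups): dump it again.
    return DumpAction.DUMP_DUPLICATE
```

For example, a node that is contained in two groups would yield dump_primary for the first group and, depending on the assumed prefer_symlinks option, either symlink or dump_duplicate for the second.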

@unkcpz mentioned using a tree data structure to represent the dumping directory/relationships, rather than a "flat" log as I have it now. This aligns well with the data organization via groups and nested workflows, and it could also allow for quick diffs, so I think it's a good idea. However, if I completely modify the implementation now, the feature will never make it into v2.7, so I'd rather open an issue and spend some time investigating that approach later on.
@edan-bainglass mentioned adding .dump methods to the ORM classes. I also think this is interesting, and it could possibly be added now already, at least for processes, groups, and profiles. One can do that by just instantiating the facade classes, passing any configuration options via kwargs, and calling dump on the facade, similar to how the functionality is exposed via the CLI. In addition, we discussed removing the (quite lengthy) Python dictionary representations of the dump output directory structures I currently use for the integration tests of the top-level dumper classes, in favor of regression tests or some other representation such as YAML.
@mikibonacci, I changed the default process dump path such that it uses the node label, if one is available. Could you please check if it now works for your AiiDAlab QE use case?
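
To illustrate the .dump idea for ORM classes mentioned above, here is a minimal sketch of an entity method that only instantiates a facade, forwards the keyword options, and calls dump. Both classes are stand-ins written for this example (they are not the aiida.orm or facade classes), and the output_path argument and the label-pk directory name are assumptions.

```python
from pathlib import Path
from typing import Any, Dict


class ProcessDumperStandIn:
    """Stand-in for a facade: holds an entity plus options and exposes dump()."""

    def __init__(self, entity: "ProcessNodeStandIn", **config: Any) -> None:
        self.entity = entity
        self.config: Dict[str, Any] = config

    def dump(self, output_path: Path) -> Path:
        # A real facade would write files; this sketch only computes the target path.
        return Path(output_path) / f"{self.entity.label}-{self.entity.pk}"


class ProcessNodeStandIn:
    """Stand-in for an ORM entity that gains a thin dump() wrapper."""

    def __init__(self, pk: int, label: str) -> None:
        self.pk = pk
        self.label = label

    def dump(self, output_path: Path, **config: Any) -> Path:
        # Delegate to the facade, just like a CLI command would.
        return ProcessDumperStandIn(self, **config).dump(output_path)


node = ProcessNodeStandIn(pk=42, label="relax")
target = node.dump(Path("out"), overwrite=True)
```

Keeping the entity method this thin means the CLI and the Python entry point share one code path, so options only need to be validated in one place.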

Pinging also other people in the team for notification and dog-fooding, @mbercx, @agoscinski, @khsrali @superstar54.

Other notes:

  • Add experimental warning in the same way as in pydantic PR
  • What happens if I delete a calculation that was called by another workchain, from AiiDA's DB, and I run with the --delete-missing option?
  • Possibly use graph traversal rules for recursion during node selection (especially process nodes)
  • Possibly also use graph_traversal_rules when updating directories after a node was deleted.
  • Possibly use graph_traversal_rules and add get_nodes_dump to src/aiida/tools/graph/graph_traversers.py, as well as AiidaEntitySet from src/aiida/tools/graph/age_entities.py, etc., to first obtain the nodes, and then run the dumping.

@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch 3 times, most recently from 9597527 to 1c4b67b Compare January 23, 2025 16:07
@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch from ce20e4c to 2dfe2ca Compare January 28, 2025 16:26
@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch 5 times, most recently from 06bc55e to b795eda Compare February 17, 2025 16:30
@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch 7 times, most recently from 02acb56 to 8e0cfdc Compare February 25, 2025 17:50
@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch from 38d2940 to 8d12f11 Compare March 12, 2025 08:42
@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch 11 times, most recently from 458cabd to 355b0e4 Compare April 2, 2025 16:33
@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch from 3f859fc to 4188200 Compare June 4, 2025 11:52
@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch from c9428b2 to 35638a3 Compare June 4, 2025 13:45
@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch from 37459ef to 65b7ec9 Compare June 4, 2025 13:57
@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch from 28a5b61 to 0008f8d Compare June 4, 2025 14:11
@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch from fce0e8a to 5d53642 Compare June 4, 2025 15:49
@GeigerJ2 GeigerJ2 force-pushed the feature/verdi-profile-mirror branch from 41791ca to 4193ebd Compare June 5, 2025 11:39
@GeigerJ2 GeigerJ2 changed the title WIP: Profile data dumping Profile data dumping Jun 5, 2025
@GeigerJ2
Contributor Author

GeigerJ2 commented Jun 5, 2025

@agoscinski

Data dumping for groups and profiles (#6723)

This PR adds functionality to incrementally dump profile and group data
into a human-readable output folder, and refactors the internal logic of
the recently released process dumping.

Public API

The data dumping feature can be used from the CLI via the
verdi {profile|group|process} dump commands. Furthermore,
the classes aiida.manage.configuration.Profile, aiida.orm.Group and
aiida.orm.ProcessNode are extended by a new public member function
dump that takes the same dumping options as the CLI entry points.

The internal implementation of the feature is contained in the private
module aiida.tools._dumping, which is currently excluded from
codecov. Further testing and modifications will be applied based on
user feedback in smaller, more manageable PRs.

Configuration of dumping

To organize the extensive options, a data class for the config options is
created using pydantic BaseModel in the aiida.tools._dumping.config
module. For each type of dumping (process/group/profile), different
options are available. They are held in three config classes,
ProcessDumpConfig, ProfileDumpConfig, and GroupDumpConfig, all
inheriting from BaseDumpConfig. The *DumpConfig classes use the
mixin pattern to organize these options via the TimeFilterMixin,
EntityFilterMixin, ProcessHandlingMixin and GroupManagementMixin,
since not every option applies to every type of entity being dumped. The
new CLI entry points and the new member function dump all map their
inputs to the respective config class. By mapping both inputs to the
*DumpConfig classes the validation process is unified, reducing code
duplication.
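
The mixin layout described above can be sketched as follows. To keep the example free of third-party imports it swaps the pydantic BaseModel for standard-library dataclasses, so it is not the PR's implementation. The class names come from the text; the fields (overwrite, past_days, organize_by_groups) and the from_options helper are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class BaseDumpConfig:
    overwrite: bool = False  # hypothetical option shared by all dump types

    @classmethod
    def from_options(cls, **options: Any) -> "BaseDumpConfig":
        """Single validation entry point for both CLI and Python callers."""
        allowed = {f.name for f in fields(cls)}
        unknown = sorted(set(options) - allowed)
        if unknown:
            raise ValueError(f"{cls.__name__} does not accept: {', '.join(unknown)}")
        return cls(**options)


@dataclass
class TimeFilterMixin:
    past_days: Optional[int] = None  # hypothetical time filter

    def __post_init__(self) -> None:
        if self.past_days is not None and self.past_days < 0:
            raise ValueError("past_days must be non-negative")


@dataclass
class GroupManagementMixin:
    organize_by_groups: bool = True  # hypothetical group option


@dataclass
class ProcessDumpConfig(BaseDumpConfig):
    pass  # a single process needs neither time filters nor group options


@dataclass
class GroupDumpConfig(TimeFilterMixin, BaseDumpConfig):
    pass


@dataclass
class ProfileDumpConfig(TimeFilterMixin, GroupManagementMixin, BaseDumpConfig):
    pass
```

With this layout, ProfileDumpConfig.from_options(past_days=7) is accepted, while ProcessDumpConfig.from_options(past_days=7) raises a ValueError because the option is not part of that config class.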

State of dumping folder

The dumping functionality tracks the state of the dumped folder with respect to
the AiiDA database. This requires a persistent storage of the current
state of the dumping folder as well as logic comparing the state of the
dumping folder and the database. To prevent expensive file reads of the
dumping folder, the state is stored in a JSON file after the dumping
process. The logic to evaluate and track the state and compare it with
the database is contained in the modules aiida.tools._dumping.tracking
and aiida.tools._dumping.mapping, and changes since the last dump
(new/deleted nodes/groups, relabeled groups, node membership changes,
etc.) are then picked up via the module aiida.tools._dumping.detector.
This enables incremental dumping of data in a way that the
human-readable output folder of the dumping feature tracks the state of
the AiiDA DB as it evolves.
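
As a rough illustration of this state tracking, the sketch below stores a group-to-nodes mapping in a JSON file and diffs it against the current mapping. The function names and the file layout are assumptions for this example and do not reflect the actual tracking, mapping, or detector modules. Detecting relabeled groups would additionally need stable group identifiers, which this sketch leaves out.

```python
import json
from pathlib import Path
from typing import Dict, List, Set

Mapping = Dict[str, Set[str]]  # group label -> set of node identifiers


def load_state(log_file: Path) -> Mapping:
    """Read the mapping stored by the previous dump; empty if there was none."""
    if not log_file.exists():
        return {}
    raw = json.loads(log_file.read_text())
    return {group: set(nodes) for group, nodes in raw.items()}


def save_state(log_file: Path, mapping: Mapping) -> None:
    """Persist the mapping so the next dump does not have to scan the folder."""
    serializable = {group: sorted(nodes) for group, nodes in mapping.items()}
    log_file.write_text(json.dumps(serializable, indent=2))


def diff_mappings(old: Mapping, new: Mapping) -> Dict[str, object]:
    """Compare the previous state with the current one."""
    common = set(old) & set(new)
    added: Dict[str, List[str]] = {
        g: sorted(new[g] - old[g]) for g in common if new[g] - old[g]
    }
    removed: Dict[str, List[str]] = {
        g: sorted(old[g] - new[g]) for g in common if old[g] - new[g]
    }
    return {
        "new_groups": sorted(set(new) - set(old)),
        "deleted_groups": sorted(set(old) - set(new)),
        "nodes_added": added,
        "nodes_removed": removed,
    }
```

A dump run would then call load_state, build the current mapping from the database, act on the result of diff_mappings, and finish with save_state.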

Execution of the dump

The aiida.tools._dumping.engine module is responsible for the
top-level orchestration of the dumping process (including reading in the
JSON state file from the previous dump, or deleting it if overwrite
mode is selected), as well as common setup and teardown operations. For
group and profile dumping, changes in AiiDA's DB since the last dump
that are not yet reflected in the dumping output folder are then
applied incrementally. This includes deleting output directories of
nodes and groups that were previously dumped but were since deleted from
AiiDA's DB, applying group relabeling carried out by the user, as well
as dumping new nodes and groups. This functionality is contained in the
aiida.tools._dumping.executors module, which provides the DeletionExecutor,
ProcessDumpExecutor, GroupDumpExecutor, and ProfileDumpExecutor
classes and forms the core of the feature implementation.
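
The incremental update described above (delete, relabel, then dump what is new) might look like the following on the file system. This is a simplified sketch, not the executor classes from the PR: apply_changes and its arguments are hypothetical, and creating an empty directory stands in for dumping actual content.

```python
import shutil
from pathlib import Path
from typing import Dict, Iterable


def apply_changes(
    root: Path,
    deleted: Iterable[str],
    relabeled: Dict[str, str],
    new: Iterable[str],
) -> None:
    """Bring an existing dump folder in line with the detected changes."""
    # 1. Remove directories of entities that no longer exist in the database.
    for name in deleted:
        shutil.rmtree(root / name, ignore_errors=True)
    # 2. Rename directories of groups that were relabeled by the user.
    for old_label, new_label in relabeled.items():
        if (root / old_label).exists():
            (root / old_label).rename(root / new_label)
    # 3. Create directories for new entities (placeholder for the real dump).
    for name in new:
        (root / name).mkdir(parents=True, exist_ok=True)
```

Running deletions first and additions last keeps a relabeled group from colliding with a stale directory that still carries the new label.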

Finally, code that was previously contained in the ProcessDumper class
is now moved to the ProcessDumpExecutor, while a ProcessDumper
"facade" class is still provided via the aiida.tools._dumping.facades
module, and exposed as public API for backwards compatibility.

Contributor

@agoscinski agoscinski left a comment

Thanks for the huge work. Let's merge this giant.

@GeigerJ2 GeigerJ2 merged commit 88df72a into aiidateam:main Jun 6, 2025
27 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in aiida-core v2.7.0 Jun 6, 2025
@GeigerJ2 GeigerJ2 deleted the feature/verdi-profile-mirror branch June 6, 2025 07:25
@unkcpz
Member

unkcpz commented Jun 6, 2025

Congratulations 🎊 It's hard to imagine how hard it is to bring > 7000 lines of changes to aiida-core, nice work.

using a tree data structure to represent the dumping directory/relationships, rather than a "flat" log as I have it now.

Don't forget to open an issue if you still think it is a good idea 😉

Labels
None yet
Projects
Status: Done
5 participants