Replies: 3 comments
-
@jfbooth Larger, self-contained PODs with flags that let the user select the step(s) they want to run are preferable. The framework does not currently support intra- or inter-POD data dependencies, because such dependencies tend to make maintenance and debugging difficult. It is also a good idea (though not strictly required at this point) to include unit tests for the different option combinations with your POD code. That said, there may be a way to structure the framework to allow self-contained workflows with "chained" PODs that depend on one another's results/output. If there is demand for it, we can revisit the possibility of integrating it in the future.
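As a rough illustration of the flag-based layout, here is a minimal sketch of a two-step POD driver. Everything in it is an assumption made for illustration: the step names (`track`, `composite`), the `--steps` and `--track-file` options, the JSON track format, and the placeholder functions are hypothetical, and none of it is part of the framework's API.

```python
import argparse
import json

# Hypothetical step names; a real POD would use its own.
STEPS = ("track", "composite")


def parse_args(argv=None):
    """Parse the step-selection flags for the (hypothetical) two-step POD."""
    parser = argparse.ArgumentParser(description="Two-step cyclone POD (sketch)")
    parser.add_argument("--steps", nargs="+", choices=STEPS, default=list(STEPS),
                        help="which step(s) to run; default is both")
    parser.add_argument("--track-file", default=None,
                        help="existing track data, needed when step 1 is skipped")
    args = parser.parse_args(argv)
    # Step 2 on its own needs user-supplied tracks.
    if "track" not in args.steps and args.track_file is None:
        parser.error("--track-file is required when only 'composite' is run")
    return args


def track_cyclones():
    """Placeholder for step 1: returns a list of track records."""
    return [{"id": 1, "lat": 45.0, "lon": -30.0}]


def load_tracks(path):
    """Read user-supplied tracks; the JSON format is assumed for this sketch."""
    with open(path) as fh:
        return json.load(fh)


def composite_fields(tracks):
    """Placeholder for step 2: summarizes fields around each track."""
    return {"n_tracks": len(tracks)}


def run(args):
    """Run the selected steps in order and return their results."""
    results = {}
    tracks = None
    if "track" in args.steps:
        tracks = track_cyclones()
        results["tracks"] = tracks
    if "composite" in args.steps:
        if tracks is None:
            tracks = load_tracks(args.track_file)
        results["composite"] = composite_fields(tracks)
    return results
```

Because `parse_args` accepts an explicit argument list, each option combination (both steps, step 1 only, step 2 only with a track file) can be exercised from a unit test without touching `sys.argv`.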
-
Migrated this from issues, where it was posted under the title "Does the framework allow implementation of a POD that uses output from a different POD."
-
I believe we'll inevitably be drawn towards including this functionality as PODs become more complex; two self-evident use cases are
For the purposes of package development, I think this jump in complexity is the point at which the framework itself should shift to being implemented in terms of a third-party workflow engine, rather than the ad hoc data pipeline we've currently written ourselves. To meet the design needs of the project, such an engine would need to 1) be embeddable; 2) run entirely in user space (i.e., not be based on a client-server architecture); and 3) preferably be Python-centric. From my notes, I've singled out luigi and its extensions sci-luigi and luigi analysis workflow as meeting these criteria, but we'll need to re-examine this when we're ready to implement this functionality.
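To make the core job of such an engine concrete, here is a minimal standard-library sketch of dependency-ordered execution. It is not luigi and not framework code: the POD names, the `pod_dependencies` mapping, and the runner callables are all hypothetical, and a real workflow engine would add caching, failure handling, and output targets on top of this.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical POD names; each maps to the set of PODs whose output it needs.
pod_dependencies = {
    "cyclone_tracking": set(),
    "cyclone_composites": {"cyclone_tracking"},
}


def execution_order(dependencies):
    """Return POD names ordered so each POD comes after its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


def run_chain(dependencies, runners):
    """Run each POD in dependency order, passing it its prerequisites' outputs."""
    outputs = {}
    for name in execution_order(dependencies):
        inputs = {dep: outputs[dep] for dep in dependencies[name]}
        outputs[name] = runners[name](inputs)
    return outputs


# Hypothetical runners standing in for real POD entry points.
runners = {
    "cyclone_tracking": lambda inputs: ["track_a", "track_b"],
    "cyclone_composites": lambda inputs: len(inputs["cyclone_tracking"]),
}

results = run_chain(pod_dependencies, runners)
```

`TopologicalSorter` raises `graphlib.CycleError` if the declared dependencies are circular, which is one of the checks a chained-POD design would need before running anything.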
-
My group has developed a POD that currently works as follows. Step 1: track extratropical cyclones in space and time. Step 2: use the cyclone tracks to grab other variable fields in the vicinity of the cyclones. Currently, the POD is set up with options that allow a user to choose to run (1) Step 1 and Step 2, (2) Step 1 only, or (3) Step 2 only, with the user providing a set of track data in a specific format. However, we are now thinking that, for the sake of debugging, it might be easier if the two steps were two separate PODs. With two separate PODs, what would the workflow be if someone wanted to run both of them? The question can be generalized to: would you prefer bigger, self-contained PODs that include flags for options, or smaller PODs, some of which might depend on output from the others?
I would appreciate some feedback on this. (Also note, if this question is outside the scope of the "Issues" list, I apologize - email me and we can discuss offline - Jimmy [email protected]).