README.md
+12 -12
@@ -20,7 +20,7 @@

 # `pdpp`

-`pdpp` is a command-line interface for facilitating the creation and maintenance of transparent and reproducible data workflows. `pdpp` adheres to principles espoused by Patrick Ball in his manifesto on ['Principled Data Processing'](https://www.youtube.com/watch?v=ZSunU9GQdcI). `pdpp` can be used to create 'tasks', populate task directories with the requisite subdirectories, link together tasks' inputs and outputs, and execute the pipeline using the `doit` [suite of automation tools](https://pydoit.org/).
+`pdpp` is a command-line interface for facilitating the creation and maintenance of transparent and reproducible data workflows. `pdpp` adheres to principles espoused by Patrick Ball in his manifesto on ['Principled Data Processing'](https://www.youtube.com/watch?v=ZSunU9GQdcI). `pdpp` can be used to create 'tasks', populate task directories with the requisite subdirectories, link together tasks' inputs and outputs, and execute the pipeline using the `doit` [suite of automation tools](https://pydoit.org/).

 `pdpp` is also capable of producing rich visualizations of the data processing workflows it creates:

@@ -34,12 +34,12 @@ Each task directory contains at minimum three subdirectories:
 2. `output`, which contains all of the task's local data outputs (also referred to as 'targets')
 3. `src`, which contains all of the task's source code (which, ideally, would be contained within a single script file)

-The `pdpp` package adds two additional constraints to Patrick Ball's original formulation of PDP:
+The `pdpp` package adds two additional constraints to Patrick Ball's original formulation of PDP:

 1. All local data files that the workflow needs but that none of its tasks generates must be included in the `_import_` directory, which `pdpp` places at the same directory level as the overall workflow during project initialization.
 2. All local data files produced by the workflow as project outputs must be routed into the `_export_` directory, which `pdpp` places at the same directory level as the overall workflow during project initialization.

-These additional constraints disambiguate the input and output of the overall workflow, which permits `pdpp` workflows to be embedded within one another.
+These additional constraints disambiguate the input and output of the overall workflow, which permits `pdpp` workflows to be embedded within one another.


 ## Installation Prerequisites
## Installation Prerequisites
@@ -67,15 +67,15 @@ Doing so should produce a directory tree similar to this one:



-For the purposes of this example, a `.csv` file containing some toy data has been added to the `_import_` directory.
+For the purposes of this example, a `.csv` file containing some toy data has been added to the `_import_` directory.

 At this point, we're ready to add our first task to the project. To do this, we'll use the `new` command:

 ```bash
 pdpp new
 ```

-Upon executing the command, `pdpp` will request a name for the new task. We'll call it 'task_1'. Once the name has been supplied, `pdpp` will display an interactive menu that lets users specify which other tasks in the project contain files that 'task_1' will depend upon.
+Upon executing the command, `pdpp` will request a name for the new task. We'll call it 'task_1'. Once the name has been supplied, `pdpp` will display an interactive menu that lets users specify which other tasks in the project contain files that 'task_1' will depend upon.



@@ -96,7 +96,7 @@ new_rows = []

 with open('../input/example_data.csv', 'r') as f1:
     r = csv.reader(f1)
-    for row in r:
+    for row in r:
         new_row = [int(row[0]) + 1, int(row[1]) + 1]
         new_rows.append(new_row)

@@ -112,7 +112,7 @@ After running `task_1.py`, a new file called `example_data_plus_one.csv` should
 pdpp rig
 ```

-Select `_export_` from the list of tasks available, then select `task_1` (and not `_import_`); finally, select `example_data_plus_one.csv` as the only dependency for `_export_`.
+Select `_export_` from the list of tasks available, then select `task_1` (and not `_import_`); finally, select `example_data_plus_one.csv` as the only dependency for `_export_`.

 Once `_export_` has been rigged, this project is a complete (if exceedingly simple) example of a `pdpp` workflow. The workflow imports a simple `.csv` file, adds one to each number in the file, and exports the resulting modified `.csv` file. `pdpp` workflows can be visualized using the built-in visualization suite like so:

@@ -124,7 +124,7 @@ The above command will prompt users for two pieces of information: the output fo



-In `pdpp` visualizations, the box-like nodes represent tasks, the nodes with folded corners represent data files, and the nodes with two tabs on the left-hand side represent source code.
+In `pdpp` visualizations, the box-like nodes represent tasks, the nodes with folded corners represent data files, and the nodes with two tabs on the left-hand side represent source code.

 One may execute the entire workflow by using either of the following two commands (they are functionally identical):

@@ -148,7 +148,7 @@ When a workflow is run, the `doit` automation suite -- atop which `pdpp` is buil
 -- task_1
 ```

-This is because `doit` checks the relative ages of each task's inputs and outputs at runtime; if any given task has any outputs (or 'targets', in `doit` nomenclature) that are older than one or more of the task's inputs (or 'dependencies', in `doit` nomenclature), that task must be re-run. If all of a task's inputs are older than its outputs, the task does not need to be run. This means that a `pdpp`/`doit` pipeline can be run as often as the user desires without needlessly wasting time or computing power: tasks will only be re-run if changes to 'upstream' files necessitate it. You can read more about this feature of the `doit` suite [here](https://pydoit.org/tasks.html).
+This is because `doit` checks the relative ages of each task's inputs and outputs at runtime; if any given task has any outputs (or 'targets', in `doit` nomenclature) that are older than one or more of the task's inputs (or 'dependencies', in `doit` nomenclature), that task must be re-run. If all of a task's inputs are older than its outputs, the task does not need to be run. This means that a `pdpp`/`doit` pipeline can be run as often as the user desires without needlessly wasting time or computing power: tasks will only be re-run if changes to 'upstream' files necessitate it. You can read more about this feature of the `doit` suite [here](https://pydoit.org/tasks.html).


 ## Usage from the Command Line
@@ -170,7 +170,7 @@ Adds a new custom task to a `pdpp` project and launches an interactive rigging s

 ### `pdpp sub`

-Adds a new sub-project task to a `pdpp` project and launches an interactive rigging session for it (see `pdpp rig` below for more information). Sub-project tasks are distinct `pdpp` projects nested inside the main project -- structurally, they function identically to all other `pdpp` projects. Their dependencies are defined as any local files contained inside their `_import_` directory (which functions as if it were an `input` directory for a task) and their targets are defined as any local files contained inside their `_export_` directory (which functions as if it were an `output` directory for a task).
+Adds a new sub-project task to a `pdpp` project and launches an interactive rigging session for it (see `pdpp rig` below for more information). Sub-project tasks are distinct `pdpp` projects nested inside the main project -- structurally, they function identically to all other `pdpp` projects. Their dependencies are defined as any local files contained inside their `_import_` directory (which functions as if it were an `input` directory for a task) and their targets are defined as any local files contained inside their `_export_` directory (which functions as if it were an `output` directory for a task).


 ### `pdpp rig`
@@ -179,7 +179,7 @@ Launches an interactive rigging session for a selected task, which allows users

 ### `pdpp run` or `doit`

-Runs the project. The `pdpp run` command provides basic functionality; the `doit` command accepts arguments that give users a great deal of control and specificity. More information about the `doit` command can be found [here](https://pydoit.org/cmd-run.html).
+Runs the project. The `pdpp run` command provides basic functionality; the `doit` command accepts arguments that give users a great deal of control and specificity. More information about the `doit` command can be found [here](https://pydoit.org/cmd-run.html).

 ### `pdpp graph`

@@ -196,4 +196,4 @@ Incorporates an already-PDP compliant directory (containing `input`, `output`, a

 ### `pdpp enable`

-Allows users to toggle tasks 'on' or 'off'; tasks that are 'off' will not be executed when `pdpp run` or `doit` is used.
+Allows users to toggle tasks 'on' or 'off'; tasks that are 'off' will not be executed when `pdpp run` or `doit` is used.
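
The `task_1.py` fragment in the `@@ -96,7 +96,7 @@` hunk shows only the read loop. As a minimal sketch of what the whole script might look like, the function below reads the two-column file, adds one to each value, and writes the result. The function name `add_one_to_csv`, the absence of a header row, and the `../output/` destination are assumptions for illustration; only the input path and the output file name `example_data_plus_one.csv` come from the README text.

```python
import csv
from pathlib import Path


def add_one_to_csv(src: Path, dst: Path) -> list[list[int]]:
    """Read two integer columns from src, add one to each value, write to dst."""
    new_rows = []
    with open(src, "r", newline="") as f1:
        for row in csv.reader(f1):
            if not row:  # skip blank lines
                continue
            new_rows.append([int(row[0]) + 1, int(row[1]) + 1])

    # Create the output directory if it does not exist yet (assumed layout).
    dst.parent.mkdir(parents=True, exist_ok=True)
    with open(dst, "w", newline="") as f2:
        csv.writer(f2).writerows(new_rows)
    return new_rows


# Example call from the task's src directory (output path is an assumption):
# add_one_to_csv(Path("../input/example_data.csv"),
#                Path("../output/example_data_plus_one.csv"))
```

Taking the paths as parameters rather than hard-coding them keeps the sketch testable outside a `pdpp` project.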
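
The re-run rule described in the `@@ -148,7 +148,7 @@` hunk (a task must run again when any of its outputs is older than any of its inputs) can be sketched as a timestamp comparison. This illustrates the rule as the README words it and is not `doit`'s implementation; `doit`'s own up-to-date check is configurable and may rely on file content signatures rather than modification times. The function name `needs_rerun` and the treatment of missing or absent outputs are assumptions.

```python
from pathlib import Path
from typing import Iterable


def needs_rerun(inputs: Iterable[Path], outputs: Iterable[Path]) -> bool:
    """Return True if any output is missing or older than any input."""
    inputs = list(inputs)
    outputs = list(outputs)

    # Assumption: a task with no outputs, or with a missing output, is stale.
    if not outputs or any(not p.exists() for p in outputs):
        return True
    # With no inputs there is nothing that could be newer than the outputs.
    if not inputs:
        return False

    newest_input = max(p.stat().st_mtime for p in inputs)
    oldest_output = min(p.stat().st_mtime for p in outputs)
    return oldest_output < newest_input
```

Comparing the oldest output against the newest input is enough here, because a single stale output triggers a re-run of the whole task.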