README.md
+9 -16 (9 additions & 16 deletions)
@@ -8,19 +8,14 @@ Many of the data science functions have been moved to wvu [https://github.com/Wi
8
8
9
9
10
10
11
-
I would like to share what I have found to be a very effective personal Jupyter workflow for data science development.
11
+
<a href="https://github.com/WinVector/wvpy">wvpy</a> is a very effective personal Jupyter workflow for data science development.
12
12
13
-
<center>
14
-
<img style="display:block; margin-left:auto; margin-right:auto;" src="https://win-vector.com/wp-content/uploads/2022/08/DALL%C2%B7E-2022-08-20-14.51.57-An-Effective-Personal-Jupyter-Data-Science-Workflow.png" alt="DALL E 2022 08 20 14 51 57 An Effective Personal Jupyter Data Science Workflow" title="DALL·E 2022-08-20 14.51.57 - An Effective Personal Jupyter Data Science Workflow.png" border="0" width="300" height="300" />
15
-
<br>
16
-
DALL-E "An Effective Personal Jupyter Data Science Workflow"
17
-
</center>
18
13
19
14
<a href="https://jupyter.org">Jupyter</a> (nee IPython) workbooks are JSON documents that allow a data scientist to mix: code, markdown, results, images, and graphs. They are a great contribution to scientific reproducibility, as they can contain a number of steps that can all be re-run in batch. They serve a similar role to literate programming, SWEAVE, and rmarkdown/knitr. The main design difference is Jupyter notebooks do not separate specification from presentation, which causes a number of friction points. They are not legible without a tool (such as JupyterLab, Jupyter Notebook, Visual Studio Code, PyCharm, or other IDEs), they are fairly incompatible with source control (as they may contain images as binary blobs, and many of the tools alter the notebook on opening), and they make <code>grep</code>ing/searching difficult.
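
Because a notebook is just a JSON document, its cells can be read with nothing more than the standard library. The following is a minimal sketch, not wvpy code: the notebook content is hand-written illustrative data in the nbformat 4 layout, and <code>cell_sources</code> is a hypothetical helper name.

```python
import json

# A tiny hand-written notebook in the nbformat 4 JSON layout (illustrative data only).
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# A plot\n", "Some notes."]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": ["x = 1 + 1\n", "print(x)"],
            # Stored results like this are what make raw notebooks noisy in source control.
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
        },
    ],
})


def cell_sources(nb_text, cell_type):
    """Return the source text of every cell of the given type (hypothetical helper)."""
    nb = json.loads(nb_text)
    sources = []
    for cell in nb["cells"]:
        if cell["cell_type"] == cell_type:
            src = cell["source"]
            # nbformat allows source to be a single string or a list of lines.
            sources.append(src if isinstance(src, str) else "".join(src))
    return sources


code_cells = cell_sources(notebook_json, "code")
print(code_cells[0])
```

Only the cell sources are kept here; the stored outputs and execution counts are ignored, which is the part of the file that tends to change on every run.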
20
15
21
-
The above issues are fortunately all <em>inessential difficulties</em>. Python is a very code-oriented work environment, so most tools expose a succinct programmable interface. The tooling exposed by the Python packages <a href="https://pypi.org/project/ipython/">IPython</a>, <a href="https://pypi.org/project/nbformat/">nbformat</a>, and <a href="https://pypi.org/project/nbconvert/">nbconvert</a> is very powerful and convenient. With only a little organizing code I was able to build a very powerful personal data science workflow that I have found works very well for clients.
16
+
The above issues are fortunately all <em>inessential difficulties</em>. Python is a very code-oriented work environment, so most tools expose a succinct programmable interface. The tooling exposed by the Python packages <a href="https://pypi.org/project/ipython/">IPython</a>, <a href="https://pypi.org/project/nbformat/">nbformat</a>, and <a href="https://pypi.org/project/nbconvert/">nbconvert</a> is very powerful and convenient. With only a little organizing code we were able to build a very powerful personal data science workflow that we have found works very well for clients.
22
17
23
-
I share this small amount of code in the package <a href="https://pypi.org/project/wvpy/">wvpy</a>. This is easiest to demonstrate in action, both in <a href="https://win-vector.com/2022/08/20/an-effective-personal-jupyter-data-science-workflow/">this article</a> and in a video demonstration <a href="https://youtu.be/cQ-tCwD4moc">here</a>.
18
+
We share this small amount of code in the package <a href="https://pypi.org/project/wvpy/">wvpy</a>. This is easiest to demonstrate in action, both in <a href="https://win-vector.com/2022/08/20/an-effective-personal-jupyter-data-science-workflow/">this article</a> and in a video demonstration <a href="https://youtu.be/cQ-tCwD4moc">here</a>.
24
19
25
20
The first feature is: converting Jupyter notebooks (which are JSON files ending with a <code>.ipynb</code> suffix) to and from simple Python code that is more compatible with source control (such as Git).
26
21
@@ -59,7 +54,7 @@ from "plot.ipynb" to "plot.py"
59
54
</pre>
60
55
</code>
61
56
62
-
The resulting Python file is shown <a href="https://github.com/WinVector/wvpy/blob/main/examples/worksheets/plot.py">here</a>. The idea is: the entire file is pure Python, with the non-Python blocks in multi-line strings. This file has all results and meta-data stripped out, and a small amount of whitespace regularization. This ".py" format is exactly the right format for source control: we get reliable and legible differences. In my personal practice I don't check ".ipynb" files in to source control, but only the matching ".py" files. This discipline makes <code>grep</code>ing and searching for items in the project as easy as finding items in code.
57
+
The resulting Python file is shown <a href="https://github.com/WinVector/wvpy/blob/main/examples/worksheets/plot.py">here</a>. The idea is: the entire file is pure Python, with the non-Python blocks in multi-line strings. This file has all results and meta-data stripped out, and a small amount of whitespace regularization. This ".py" format is exactly the right format for source control: we get reliable and legible differences. In my personal practice I don't always check ".ipynb" files in to source control, but only the matching ".py" files. This discipline makes <code>grep</code>ing and searching for items in the project as easy as finding items in code.
63
58
64
59
In the ".py" file "begin text", "end text", and "end code" markers show where the Jupyter cell boundaries are. This allows reliable conversion from the ".py" file back to a Jupyter notebook. PyCharm and others have a similar notebook representation strategy.
65
60
@@ -121,14 +116,12 @@ This gives a simplified output as below.
121
116
122
117
<img style="display:block; margin-left:auto; margin-right:auto;" src="https://win-vector.com/wp-content/uploads/2022/08/Screen-Shot-2022-08-20-at-12.43.40-PM.png" alt="Screen Shot 2022 08 20 at 12 43 40 PM" title="Screen Shot 2022-08-20 at 12.43.40 PM.png" border="0" width="335" height="465" />
123
118
124
-
For already executed sheets one would use the standard Jupyter-supplied command <code>jupyter nbconvert --to html plot.ipynb</code>; the merit of the rendering here is parameterization of notebooks and stripping of input and prompt ids. The strategy here is to be lightweight and stand-alone, and not a plug-in such as the strategy pursued by <a href="https://github.com/mwouts/jupytext">jupytext</a> or <a href="https://www.fast.ai/2022/07/28/nbdev-v2/">nbdev</a>, or targeting fully camera-ready reports via <a href="https://www.fast.ai/2022/07/28/nbdev-v2/">Quarto</a>. I feel the <a href="https://github.com/WinVector/wvpy">wvpy</a> approach maximizes productivity during development, with minimal plug-in and install burdens.
119
+
For already executed sheets one would use the standard Jupyter-supplied command <code>jupyter nbconvert --to html plot.ipynb</code>; the merit of the rendering here is parameterization of notebooks and stripping of input and prompt ids. The strategy here is to be lightweight and stand-alone, and not a plug-in such as the strategy pursued by <a href="https://github.com/mwouts/jupytext">jupytext</a> or <a href="https://www.fast.ai/2022/07/28/nbdev-v2/">nbdev</a>, or targeting fully camera-ready reports via <a href="https://www.fast.ai/2022/07/28/nbdev-v2/">Quarto</a>. We feel the <a href="https://github.com/WinVector/wvpy">wvpy</a> approach maximizes productivity during development, with minimal plug-in and install burdens.
125
120
126
-
We also supply a <a href="https://github.com/WinVector/wvpy/blob/main/pkg/wvpy/jtools.py#L281">simple class for holding render tasks</a>, including inserting arbitrary initialization code for each run. This makes it very easy to render the same Jupyter workbook for different targets (say the same analysis for each city in a state) and even parallelize the rendering using standard Python tools such as <code>multiprocessing.Pool</code>. This parameterized running allows simple management of fairly large projects. If I need to run a great many variations of a notebook I use the <a href="https://github.com/WinVector/wvpy/blob/main/pkg/wvpy/jtools.py#L281">JTask container</a> and either a for loop or <code>multiprocessing.Pool</code> over the tasks in Python (remember, when we have Python we don't have to perform all steps at the GUI or even in a shell!). A small example of the method is found <a href="https://github.com/WinVector/wvpy/tree/main/examples/param_worksheet">here</a>, where a single Jupyter notebook <a href="https://github.com/WinVector/wvpy/blob/main/examples/param_worksheet/ParamExample.ipynb">ParamExample.ipynb</a> is used by <a href="https://github.com/WinVector/wvpy/blob/main/examples/param_worksheet/run_examples.py">run_examples.py</a> to produce the multiple per-date HTML, PDF, and PNG files found in the <a href="https://github.com/WinVector/wvpy/tree/main/examples/param_worksheet">directory</a>.
121
+
We also supply a <a href="https://github.com/WinVector/wvpy/blob/main/pkg/wvpy/jtools.py#L281">simple class for holding render tasks</a>, including inserting arbitrary initialization code for each run. This makes it very easy to render the same Jupyter workbook for different targets (say the same analysis for each city in a state) and even parallelize the rendering using standard Python tools such as <code>multiprocessing.Pool</code>. This parameterized running allows simple management of fairly large projects. If we need to run a great many variations of a notebook we use the <a href="https://github.com/WinVector/wvpy/blob/main/pkg/wvpy/jtools.py#L281">JTask container</a> and either a for loop or <code>multiprocessing.Pool</code> over the tasks in Python (remember, when we have Python we don't have to perform all steps at the GUI or even in a shell!). A small example of the method is found <a href="https://github.com/WinVector/wvpy/tree/main/examples/param_worksheet">here</a>, where a single Jupyter notebook <a href="https://github.com/WinVector/wvpy/blob/main/examples/param_worksheet/ParamExample.ipynb">ParamExample.ipynb</a> is used by <a href="https://github.com/WinVector/wvpy/blob/main/examples/param_worksheet/run_examples.py">run_examples.py</a> to produce the multiple per-date HTML, PDF, and PNG files found in the <a href="https://github.com/WinVector/wvpy/tree/main/examples/param_worksheet">directory</a>.
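
The "one notebook, many parameterized runs" pattern described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the wvpy API: <code>RenderTask</code>, <code>run_task</code>, and <code>run_all</code> are hypothetical names, and <code>run_task</code> is a placeholder that only computes an output file name where a real workflow would call the renderer.

```python
from dataclasses import dataclass
from multiprocessing import Pool


@dataclass(frozen=True)
class RenderTask:
    """Hypothetical stand-in for a render task; this is NOT wvpy's JTask."""
    sheet_name: str      # notebook to render, without suffix
    init_code: str       # initialization code to run before the notebook body
    output_suffix: str   # distinguishes the outputs of different runs


def run_task(task):
    """Placeholder render step: a real version would render the notebook here."""
    return f"{task.sheet_name}{task.output_suffix}.html"


def run_all(tasks, processes=1):
    """Run tasks in a plain loop, or across worker processes when processes > 1."""
    if processes <= 1:
        return [run_task(t) for t in tasks]
    with Pool(processes) as pool:
        return pool.map(run_task, tasks)


# One task per city, all driven from the same (hypothetical) notebook name.
tasks = [
    RenderTask("ParamExample", f"city = {city!r}", f"_{city}")
    for city in ["Oakland", "Fresno"]
]
outputs = run_all(tasks)  # serial; pass processes=2 to use multiprocessing.Pool
print(outputs)
```

When calling <code>run_all</code> with more than one process from a script, it is safest to do so under an <code>if __name__ == "__main__":</code> guard, since some platforms start worker processes by re-importing the main module.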
127
122
128
-
I have found the quickest development workflow is to work with the ".ipynb" Jupyter notebooks (usually in Visual Studio Code, and setting any values that were supposed to come from the <code>wvpy.render_workbook</code> by hand after checking they are not set in <code>globals()</code>). Then when the worksheet is working I convert it to ".py" using <code>wvpy.pysheet</code> and check that in to source control.
123
+
We have found the quickest development workflow is to work with the ".ipynb" Jupyter notebooks (usually in Visual Studio Code, and setting any values that were supposed to come from the <code>wvpy.render_workbook</code> by hand after checking they are not set in <code>globals()</code>). Then when the worksheet is working we convert it to ".py" using <code>wvpy.pysheet</code> and check that in to source control.
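
The "set it by hand unless it is already in <code>globals()</code>" check mentioned above might look something like the sketch below. The helper <code>set_default</code> and the parameter name <code>report_city</code> are hypothetical, chosen only for illustration.

```python
def set_default(namespace, name, value):
    """Assign namespace[name] = value only when name is not already defined."""
    if name not in namespace:
        namespace[name] = value
    return namespace[name]


# In a notebook cell, globals() is the notebook's own namespace. If a batch
# render already injected report_city, the hand-set development value is skipped.
set_default(globals(), "report_city", "Oakland")  # hypothetical parameter name
print(globals()["report_city"])
```

This keeps the notebook runnable interactively during development while letting a parameterized batch run override the value.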
129
124
130
-
As a side-note I find Python is a developer-first community, which is very refreshing. Capabilities (such as Jupyter, nbconvert, and nbformat) are released as code under generous open source licenses and documentation instead of being trapped in monolithic applications. This means one can take advantage of their capabilities using only a small amount of code. And under the mentioned assumption that Python is a developer-first community, small amounts of code are considered easy integrations. wvpy is offered in the same spirit: it is available for use from PyPI <a href="https://pypi.org/project/wvpy/">here</a> under a BSD 3-clause License and has its code available for re-use or adaptation <a href="https://github.com/WinVector/wvpy">here</a> under the same license. It isn't a big project, but it has made working on client projects and teaching data science a bit easier for me.
125
+
As a side-note, I find Python is a developer-first community, which is very refreshing. Capabilities (such as Jupyter, nbconvert, and nbformat) are released as code under generous open source licenses and documentation instead of being trapped in monolithic applications. This means one can take advantage of their capabilities using only a small amount of code. And under the mentioned assumption that Python is a developer-first community, small amounts of code are considered easy integrations. wvpy is offered in the same spirit: it is available for use from PyPI <a href="https://pypi.org/project/wvpy/">here</a> under a BSD 3-clause License and has its code available for re-use or adaptation <a href="https://github.com/WinVector/wvpy">here</a> under the same license. It isn't a big project, but it has made working on client projects and teaching data science a bit easier for me.
131
126
132
-
In conclusion, that is my current personal Jupyter workflow. It improves compatibility with source control, ease of search, and automatic rendering of many worksheets in a parameterized manner. I feel this addresses the primary pain points of working with Jupyter worksheets.
133
-
134
-
I'll be offering private (and hopefully someday public) training on the workflow (including notebook parameterization to run many jobs from a single source, use of <code>multiprocessing.Pool</code> for speedup, and <code>IPython.display.display; IPython.display.Markdown</code> for custom results) going forward.
127
+
<a href="https://win-vector.com">Win Vector LLC</a> will be offering private (and hopefully someday public) training on the workflow (including notebook parameterization to run many jobs from a single source, use of <code>multiprocessing.Pool</code> for speedup, and <code>IPython.display.display; IPython.display.Markdown</code> for custom results) going forward.