
Commit b541030

HyukjinKwon and Fokko Driesprong committed
[SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:

- add a notebook with a Binder integration, which allows users to try PySpark in a live notebook. Please [try this here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- reuse this notebook as a quickstart guide in the PySpark documentation.

Note that Binder turns a Git repository into a collection of interactive notebooks. It works based on a Docker image: once somebody builds the image, other people can reuse it against a specific commit. Therefore, if we run Binder with images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.

I made a simple demo to make it easier to review. Please see:

- [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the "Live Notebook" link on the main page won't work until this PR is merged.
- [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html)

When reviewing the notebook file itself, please give me direct feedback, which I will appreciate and address. Another way might be to:

- open [the notebook](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb),
- edit / change / update the notebook. Please feel free to change it however you want; I can apply it as-is, or update it a bit more when I apply it to this PR,
- download it as a `.ipynb` file:

  ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png)

- upload the `.ipynb` file here in a GitHub comment. Then I will push a commit with that file, crediting you correctly, of course.
- Alternatively, push a commit into this PR directly if that's easier for you (if you're a committer).

References:

- https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
- https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html (my own blog post .. :-))
- https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

To improve PySpark's usability. Quickstart guides such as those referenced above are very friendly for Python users, and PySpark should offer the same.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a documentation page and exposes a live notebook to PySpark users.

### How was this patch tested?

Manually tested, and GitHub Actions builds will test it.

Closes #29491 from HyukjinKwon/SPARK-32204.

Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
1 parent 1354cf0 commit b541030

File tree

11 files changed: +1244 lines, -6 lines


.github/workflows/build_and_test.yml

Lines changed: 3 additions & 2 deletions
```diff
@@ -226,7 +226,7 @@ jobs:
       run: |
         # TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
         # See also https://github.com/sphinx-doc/sphinx/issues/7551.
-        pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme
+        pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme ipython nbsphinx
     - name: Install R 4.0
       run: |
         sudo sh -c "echo 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> /etc/apt/sources.list"
@@ -245,10 +245,11 @@ jobs:
         ruby-version: 2.7
     - name: Install dependencies for documentation generation
       run: |
+        # pandoc is required to generate PySpark APIs as well in nbsphinx.
         sudo apt-get install -y libcurl4-openssl-dev pandoc
         # TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
         # See also https://github.com/sphinx-doc/sphinx/issues/7551.
-        pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme
+        pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme ipython nbsphinx
         gem install jekyll jekyll-redirect-from rouge
         sudo Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2'), repos='https://cloud.r-project.org/')"
     - name: Scala linter
```

binder/apt.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+openjdk-8-jre
```

binder/postBuild

Lines changed: 24 additions & 0 deletions
```diff
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This file is used for Binder integration to install PySpark available in
+# Jupyter notebook.
+
+VERSION=$(python -c "exec(open('python/pyspark/version.py').read()); print(__version__)")
+pip install "pyspark[sql,ml,mllib]<=$VERSION"
```
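The `VERSION=$(python -c ...)` one-liner above reads `__version__` out of `python/pyspark/version.py` without importing the (not-yet-installed) package. A minimal sketch of that trick in pure Python, using a hypothetical temporary `version.py` as a stand-in for the real file:

```python
# Sketch of how binder/postBuild extracts the version string. The file path
# and version value here are stand-ins created just for this demo.
import os
import tempfile

# Stand-in for python/pyspark/version.py, which defines __version__.
version_file = os.path.join(tempfile.mkdtemp(), "version.py")
with open(version_file, "w") as f:
    f.write('__version__ = "3.1.0.dev0"\n')

# Same idea as the shell one-liner: exec the file's source, then read the
# __version__ name it defined (here captured in an explicit namespace dict).
namespace = {}
with open(version_file) as f:
    exec(f.read(), namespace)
version = namespace["__version__"]
print(version)

# The shell script then runs: pip install "pyspark[sql,ml,mllib]<=$VERSION"
```

This avoids `import pyspark`, which would fail inside the Binder image before PySpark is installed.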

dev/create-release/spark-rm/Dockerfile

Lines changed: 2 additions & 1 deletion
```diff
@@ -36,7 +36,7 @@ ARG APT_INSTALL="apt-get install --no-install-recommends -y"
 # TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
 # See also https://github.com/sphinx-doc/sphinx/issues/7551.
 # We should use the latest Sphinx version once this is fixed.
-ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1"
+ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1 ipython==7.16.1 nbsphinx==0.7.1"
 ARG GEM_PKGS="jekyll:4.0.0 jekyll-redirect-from:0.16.0 rouge:3.15.0"

 # Install extra needed repos and refresh.
@@ -75,6 +75,7 @@ RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \
   pip3 install $PIP_PKGS && \
   # Install R packages and dependencies used when building.
   # R depends on pandoc*, libssl (which are installed above).
+  # Note that PySpark doc generation also needs pandoc due to nbsphinx
   $APT_INSTALL r-base r-base-dev && \
   $APT_INSTALL texlive-latex-base texlive texlive-fonts-extra texinfo qpdf && \
   Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \
```

dev/lint-python

Lines changed: 16 additions & 0 deletions
```diff
@@ -196,6 +196,22 @@ function sphinx_test {
     return
   fi

+  # TODO(SPARK-32666): Install nbsphinx in Jenkins machines
+  PYTHON_HAS_NBSPHINX=$("$PYTHON_EXECUTABLE" -c 'import importlib.util; print(importlib.util.find_spec("nbsphinx") is not None)')
+  if [[ "$PYTHON_HAS_NBSPHINX" == "False" ]]; then
+    echo "$PYTHON_EXECUTABLE does not have nbsphinx installed. Skipping Sphinx build for now."
+    echo
+    return
+  fi
+
+  # TODO(SPARK-32666): Install ipython in Jenkins machines
+  PYTHON_HAS_IPYTHON=$("$PYTHON_EXECUTABLE" -c 'import importlib.util; print(importlib.util.find_spec("ipython") is not None)')
+  if [[ "$PYTHON_HAS_IPYTHON" == "False" ]]; then
+    echo "$PYTHON_EXECUTABLE does not have ipython installed. Skipping Sphinx build for now."
+    echo
+    return
+  fi
+
   echo "starting $SPHINX_BUILD tests..."
   pushd python/docs &> /dev/null
   make clean &> /dev/null
```
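The probe used above relies on `importlib.util.find_spec`, which reports whether a module is importable without actually importing it (so a missing optional dependency doesn't raise). A self-contained sketch of that check:

```python
# Minimal sketch of the optional-dependency probe used in dev/lint-python:
# importlib.util.find_spec returns a ModuleSpec if the module can be found
# on the import path, and None otherwise, without importing the module.
import importlib.util


def has_module(name):
    """Return True if `name` is importable in the current environment."""
    return importlib.util.find_spec(name) is not None


print(has_module("json"))                # standard library: True
print(has_module("no_such_module_xyz"))  # not installed: False
```

Note that the lookup is by module name as spelled on the import path, so the check is case-sensitive.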

dev/requirements.txt

Lines changed: 2 additions & 0 deletions
```diff
@@ -4,3 +4,5 @@ PyGithub==1.26.0
 Unidecode==0.04.19
 sphinx
 pydata_sphinx_theme
+ipython
+nbsphinx
```

docs/README.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -63,7 +63,7 @@ See also https://github.com/sphinx-doc/sphinx/issues/7551.
 -->

 ```sh
-$ sudo pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme
+$ sudo pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme ipython nbsphinx
 ```

 ## Generating the Documentation HTML
````

python/docs/source/conf.py

Lines changed: 13 additions & 1 deletion
```diff
@@ -45,8 +45,20 @@
     'sphinx.ext.viewcode',
     'sphinx.ext.mathjax',
     'sphinx.ext.autosummary',
+    'nbsphinx',  # Converts Jupyter Notebook to reStructuredText files for Sphinx.
+    # For ipython directive in reStructuredText files. It is generated by the notebook.
+    'IPython.sphinxext.ipython_console_highlighting'
 ]

+# Links used globally in the RST files.
+# These are defined here to allow link substitutions dynamically.
+rst_epilog = """
+.. |binder| replace:: Live Notebook
+.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
+.. |examples| replace:: Examples
+.. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
+""".format(os.environ.get("RELEASE_TAG", "master"))
+
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ['_templates']

@@ -84,7 +96,7 @@

 # List of patterns, relative to source directory, that match files and
 # directories to ignore when looking for source files.
-exclude_patterns = ['_build']
+exclude_patterns = ['_build', '.DS_Store', '**.ipynb_checkpoints']

 # The reST default role (used for this markup: `text`) to use for all
 # documents.
```
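The `rst_epilog` change above appends the same substitution definitions to every RST file, with `{0}` filled from the `RELEASE_TAG` environment variable (falling back to `master` for development builds). A small sketch of how that substitution resolves, using a shortened epilog for brevity:

```python
# Sketch of the rst_epilog construction in python/docs/source/conf.py.
# The epilog text here is abbreviated; only the mechanism is shown.
import os

# Ensure the fallback path is taken in this demo (no release tag set).
os.environ.pop("RELEASE_TAG", None)

rst_epilog = """
.. |binder| replace:: Live Notebook
.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
""".format(os.environ.get("RELEASE_TAG", "master"))

print(rst_epilog)
```

On a release build, setting `RELEASE_TAG` (for example to `v3.1.0`) makes the |binder| link point Binder at the released tag's image instead of `master`.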

python/docs/source/getting_started/index.rst

Lines changed: 4 additions & 0 deletions
```diff
@@ -20,3 +20,7 @@
 Getting Started
 ===============

+.. toctree::
+   :maxdepth: 2
+
+   quickstart
```
