PySpark Cookbook

A collection of useful, copy-pasteable, standalone PySpark code snippets, each shown with its output to explain the behavior of commonly used functions.

The source code of snippets is rendered as HTML and hosted at http://isabekov.github.io/pyspark-cookbook/.

Development Environment

Emacs with org-mode is used as the development environment. Compared to Jupyter notebooks, the source code is easier to keep in a version control system because it is just plain text.

./screenshots/example.png

Development dependencies

| Software                 | Version  | Comment |
|--------------------------+----------+---------|
| Emacs                    | 29.2     | main development environment |
| Python                   | 3.11.6   | works with pyspark >= 3.4.0 (see discussion) |
| python-pyspark           | 3.4.0    | Python API for Spark (large-scale data processing library) |
| python-py4j              | 0.10.9.7 | enables Python programs to dynamically access Java (dependency of PySpark) |
| python-pandas            | 2.0.2    | Python data analysis library |
| python-pyarrow           | 15.0.0   | bindings to Apache Arrow (dependency of PySpark) |
| python-tabulate          | 0.9.0    | needed to convert dataframes into org-table format |
| Java Runtime Environment | 17.0.10  | newer versions do not work with PySpark 3.4.0 |
| PYNT (Emacs package)     | >1.0     | interactive kernel for Python in Emacs; see the repository for installation instructions |
| org-export               | 64ac299  | command line tool needed for HTML export, requires Emacs (see repository) |
| GNU readline             | 8.2.13   | library needed for correct invocation of Python in Emacs on macOS |

Install Python and Python Packages

Depending on the operating system, install Python and the packages py4j, pyspark, pandas, pyarrow and tabulate using the system package manager or pip.
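To confirm that the packages ended up in the interpreter Emacs will use, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the helper name installed_versions and the REQUIRED list are illustrative and not part of this repository.

```python
from importlib import metadata

# Distribution names as they would be passed to pip (taken from the list above).
REQUIRED = ["py4j", "pyspark", "pandas", "pyarrow", "tabulate"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:10s} {version or 'NOT INSTALLED'}")
```

The printed versions can then be compared by hand against the table of development dependencies.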

MacOS-specific Installation

Install GNU readline:

$ pip install gnureadline

Replace libedit with readline:

$ python -m override_readline

Details can be found here.
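One way to see which line-editing library a given Python interpreter picks up is sketched below. It relies on the readline.backend attribute (available from Python 3.13) and falls back to looking for the text "libedit" in the module docstring on older versions; the function name readline_backend is illustrative only.

```python
def readline_backend():
    """Return "libedit", "readline", or None if no readline module is importable."""
    try:
        import readline
    except ImportError:
        return None
    # Python 3.13+ exposes the backend name directly ("readline" or "editline").
    backend = getattr(readline, "backend", None)
    if backend is not None:
        return "libedit" if backend == "editline" else "readline"
    # Older versions: the docstring mentions libedit when it is the backend.
    return "libedit" if "libedit" in (readline.__doc__ or "") else "readline"


if __name__ == "__main__":
    print(readline_backend())
```

Running this before and after the override step should show whether the change took effect for that interpreter.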

PYNT Installation

Install the codebook module with the pip package manager:

$ pip install  git+https://github.com/ebanner/pynt

On Arch Linux, pip refuses to install packages system-wide by default, so pass an extra argument:

$ pip install --break-system-packages  git+https://github.com/ebanner/pynt

Open Emacs. Install pynt in Emacs through MELPA.

M-x package-install RET pynt

where RET is just the “Enter” key.

To fix the following error during evaluation of code blocks:

ModuleNotFoundError: No module named 'notebook.services'

Find the offending import in the installed codebook module (the Python part of PYNT):

$ grep -i kernelmanager /usr/lib/python3.11/site-packages/codebook/manager.py
from notebook.services.kernels.kernelmanager import MappingKernelManager

This is the line that fails. Change it in manager.py to

from jupyter_server.services.kernels.kernelmanager import MappingKernelManager
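The manual edit can also be scripted. The sketch below assumes the failing line is the notebook.services import named in the error message; patch_manager, OLD_IMPORT and NEW_IMPORT are illustrative names, not part of PYNT, and the path in the usage section is the one shown in the grep command and may differ on other systems.

```python
from pathlib import Path

# Assumed failing import (inferred from the ModuleNotFoundError above).
OLD_IMPORT = "from notebook.services.kernels.kernelmanager import MappingKernelManager"
# Replacement import given in this README.
NEW_IMPORT = "from jupyter_server.services.kernels.kernelmanager import MappingKernelManager"


def patch_manager(path):
    """Rewrite the import in the given file; return True if anything changed."""
    path = Path(path)
    source = path.read_text()
    if OLD_IMPORT not in source:
        return False  # already patched, or a different version of the file
    path.write_text(source.replace(OLD_IMPORT, NEW_IMPORT))
    return True


if __name__ == "__main__":
    target = Path("/usr/lib/python3.11/site-packages/codebook/manager.py")
    if target.exists():
        print("patched" if patch_manager(target) else "nothing to change")
    else:
        print(f"{target} not found; adjust the path for your system")
```

Writing to a system site-packages directory usually requires elevated privileges, so the script may need to be run accordingly.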

Java Runtime Installation

PySpark Cookbook’s recipes were tested in Emacs using Java Runtime Environment 17.0.10. Set it as the default:

$ export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
$ sudo ln -s /usr/lib/jvm/java-17-openjdk /usr/lib/jvm/default

Newer versions of Java are not compatible with PySpark v3.4.0.
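Since the Java version matters here, it can be worth checking which runtime is actually on the PATH before starting a Spark session. The following is a rough sketch: java_major_version is a hypothetical helper that parses the banner printed by java -version, and the sample strings in the comments are typical of OpenJDK output rather than taken from this repository.

```python
import re
import shutil
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: "1.8.0_292" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


if __name__ == "__main__":
    if shutil.which("java") is None:
        print("java not found on PATH")
    else:
        # `java -version` writes its banner to stderr.
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
        major = java_major_version(result.stderr + result.stdout)
        print(f"Java major version: {major}")
        if major is not None and major > 17:
            print("Newer than 17: not compatible with PySpark 3.4.0 according to this README")
```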

Install org-export

$ git clone https://github.com/nhoffman/org-export.git
$ cd org-export
$ sudo install -D -m 755 org-export* /usr/local/bin

Export to HTML

To produce the HTML page with PySpark code snippets, run:

$ make index.html

To render examples of converting PySpark tables from the pretty format to the orgtbl format (see the tabulate package for a description of the formats), run:

$ make test_ps2org.html
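To give a feel for what such a conversion involves, here is a dependency-free sketch. It is not the repository's own post-processing function; pretty_to_orgtbl is a hypothetical name, and the sample input imitates the bordered text that DataFrame.show() prints. The idea is to drop the top and bottom borders and turn the inner border into an org-mode rule line.

```python
def pretty_to_orgtbl(text):
    """Convert a +---+ bordered text table into org-table syntax."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    out = []
    for i, ln in enumerate(lines):
        is_border = ln.startswith("+") and ln.endswith("+")
        if is_border:
            if i in (0, len(lines) - 1):
                continue  # org tables have no top or bottom border
            out.append("|" + ln[1:-1] + "|")  # inner border becomes a rule line
        else:
            out.append(ln)  # header and data rows already use | separators
    return "\n".join(out)


SAMPLE = """
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+
"""

if __name__ == "__main__":
    print(pretty_to_orgtbl(SAMPLE))
```

The result is valid org-table syntax even though the cells are not padded; org-mode realigns the columns when the table is edited.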

Execution of Code Blocks in org-mode

Navigate to any snippet outside the “Functions” chapter (which only provides service functions for post-processing the output). Make sure that the cursor is inside a Python code block:

#+begin_src python :post pretty2orgtbl(data=*this*)
  ...
#+end_src

Press C-c C-c (i.e. Ctrl-c twice). Emacs will execute the source code block inside a Python session and display the output.