generated from byuibigdata/docker_guide_big
0 parents
commit c8c4cb9
Showing 27 changed files with 58,719 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,139 @@ | ||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
*.py[cod] | ||
*$py.class | ||
|
||
# C extensions | ||
*.so | ||
|
||
# Distribution / packaging | ||
.Python | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
downloads/ | ||
eggs/ | ||
.eggs/ | ||
lib/ | ||
lib64/ | ||
parts/ | ||
sdist/ | ||
var/ | ||
wheels/ | ||
pip-wheel-metadata/ | ||
share/python-wheels/ | ||
*.egg-info/ | ||
.installed.cfg | ||
*.egg | ||
MANIFEST | ||
|
||
# PyInstaller | ||
# Usually these files are written by a python script from a template | ||
# before PyInstaller builds the exe, so as to inject date/other infos into it. | ||
*.manifest | ||
*.spec | ||
|
||
# Installer logs | ||
pip-log.txt | ||
pip-delete-this-directory.txt | ||
|
||
# Unit test / coverage reports | ||
htmlcov/ | ||
.tox/ | ||
.nox/ | ||
.coverage | ||
.coverage.* | ||
.cache | ||
nosetests.xml | ||
coverage.xml | ||
*.cover | ||
*.py,cover | ||
.hypothesis/ | ||
.pytest_cache/ | ||
|
||
# Translations | ||
*.mo | ||
*.pot | ||
|
||
# Django stuff: | ||
*.log | ||
local_settings.py | ||
db.sqlite3 | ||
db.sqlite3-journal | ||
|
||
# Flask stuff: | ||
instance/ | ||
.webassets-cache | ||
|
||
# Scrapy stuff: | ||
.scrapy | ||
|
||
# Sphinx documentation | ||
docs/_build/ | ||
|
||
# PyBuilder | ||
target/ | ||
|
||
# Jupyter Notebook | ||
.ipynb_checkpoints | ||
|
||
# IPython | ||
profile_default/ | ||
ipython_config.py | ||
|
||
# pyenv | ||
.python-version | ||
|
||
# pipenv | ||
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. | ||
# However, in case of collaboration, if having platform-specific dependencies or dependencies | ||
# having no cross-platform support, pipenv may install dependencies that don't work, or not | ||
# install all needed dependencies. | ||
#Pipfile.lock | ||
|
||
# PEP 582; used by e.g. github.com/David-OConnor/pyflow | ||
__pypackages__/ | ||
|
||
# Celery stuff | ||
celerybeat-schedule | ||
celerybeat.pid | ||
|
||
# SageMath parsed files | ||
*.sage.py | ||
|
||
# Environments | ||
.env | ||
.venv | ||
env/ | ||
venv/ | ||
ENV/ | ||
env.bak/ | ||
venv.bak/ | ||
|
||
# Spyder project settings | ||
.spyderproject | ||
.spyproject | ||
|
||
# Rope project settings | ||
.ropeproject | ||
|
||
# mkdocs documentation | ||
/site | ||
|
||
# mypy | ||
.mypy_cache/ | ||
.dmypy.json | ||
dmypy.json | ||
|
||
# Pyre type checker | ||
.pyre/ | ||
|
||
# ignore the database stuff created by the docker containers | ||
/data/postgresql | ||
/data/spark-warehouse | ||
/data/Open990 | ||
/data/irs990 | ||
/data/vermont | ||
|
||
# ignore draft folder in scripts | ||
/scripts/draft |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
Creative Commons Legal Code | ||
|
||
CC0 1.0 Universal | ||
|
||
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE | ||
LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN | ||
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS | ||
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES | ||
REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS | ||
PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM | ||
THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED | ||
HEREUNDER. | ||
|
||
Statement of Purpose | ||
|
||
The laws of most jurisdictions throughout the world automatically confer | ||
exclusive Copyright and Related Rights (defined below) upon the creator | ||
and subsequent owner(s) (each and all, an "owner") of an original work of | ||
authorship and/or a database (each, a "Work"). | ||
|
||
Certain owners wish to permanently relinquish those rights to a Work for | ||
the purpose of contributing to a commons of creative, cultural and | ||
scientific works ("Commons") that the public can reliably and without fear | ||
of later claims of infringement build upon, modify, incorporate in other | ||
works, reuse and redistribute as freely as possible in any form whatsoever | ||
and for any purposes, including without limitation commercial purposes. | ||
These owners may contribute to the Commons to promote the ideal of a free | ||
culture and the further production of creative, cultural and scientific | ||
works, or to gain reputation or greater distribution for their Work in | ||
part through the use and efforts of others. | ||
|
||
For these and/or other purposes and motivations, and without any | ||
expectation of additional consideration or compensation, the person | ||
associating CC0 with a Work (the "Affirmer"), to the extent that he or she | ||
is an owner of Copyright and Related Rights in the Work, voluntarily | ||
elects to apply CC0 to the Work and publicly distribute the Work under its | ||
terms, with knowledge of his or her Copyright and Related Rights in the | ||
Work and the meaning and intended legal effect of CC0 on those rights. | ||
|
||
1. Copyright and Related Rights. A Work made available under CC0 may be | ||
protected by copyright and related or neighboring rights ("Copyright and | ||
Related Rights"). Copyright and Related Rights include, but are not | ||
limited to, the following: | ||
|
||
i. the right to reproduce, adapt, distribute, perform, display, | ||
communicate, and translate a Work; | ||
ii. moral rights retained by the original author(s) and/or performer(s); | ||
iii. publicity and privacy rights pertaining to a person's image or | ||
likeness depicted in a Work; | ||
iv. rights protecting against unfair competition in regards to a Work, | ||
subject to the limitations in paragraph 4(a), below; | ||
v. rights protecting the extraction, dissemination, use and reuse of data | ||
in a Work; | ||
vi. database rights (such as those arising under Directive 96/9/EC of the | ||
European Parliament and of the Council of 11 March 1996 on the legal | ||
protection of databases, and under any national implementation | ||
thereof, including any amended or successor version of such | ||
directive); and | ||
vii. other similar, equivalent or corresponding rights throughout the | ||
world based on applicable law or treaty, and any national | ||
implementations thereof. | ||
|
||
2. Waiver. To the greatest extent permitted by, but not in contravention | ||
of, applicable law, Affirmer hereby overtly, fully, permanently, | ||
irrevocably and unconditionally waives, abandons, and surrenders all of | ||
Affirmer's Copyright and Related Rights and associated claims and causes | ||
of action, whether now known or unknown (including existing as well as | ||
future claims and causes of action), in the Work (i) in all territories | ||
worldwide, (ii) for the maximum duration provided by applicable law or | ||
treaty (including future time extensions), (iii) in any current or future | ||
medium and for any number of copies, and (iv) for any purpose whatsoever, | ||
including without limitation commercial, advertising or promotional | ||
purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each | ||
member of the public at large and to the detriment of Affirmer's heirs and | ||
successors, fully intending that such Waiver shall not be subject to | ||
revocation, rescission, cancellation, termination, or any other legal or | ||
equitable action to disrupt the quiet enjoyment of the Work by the public | ||
as contemplated by Affirmer's express Statement of Purpose. | ||
|
||
3. Public License Fallback. Should any part of the Waiver for any reason | ||
be judged legally invalid or ineffective under applicable law, then the | ||
Waiver shall be preserved to the maximum extent permitted taking into | ||
account Affirmer's express Statement of Purpose. In addition, to the | ||
extent the Waiver is so judged Affirmer hereby grants to each affected | ||
person a royalty-free, non transferable, non sublicensable, non exclusive, | ||
irrevocable and unconditional license to exercise Affirmer's Copyright and | ||
Related Rights in the Work (i) in all territories worldwide, (ii) for the | ||
maximum duration provided by applicable law or treaty (including future | ||
time extensions), (iii) in any current or future medium and for any number | ||
of copies, and (iv) for any purpose whatsoever, including without | ||
limitation commercial, advertising or promotional purposes (the | ||
"License"). The License shall be deemed effective as of the date CC0 was | ||
applied by Affirmer to the Work. Should any part of the License for any | ||
reason be judged legally invalid or ineffective under applicable law, such | ||
partial invalidity or ineffectiveness shall not invalidate the remainder | ||
of the License, and in such case Affirmer hereby affirms that he or she | ||
will not (i) exercise any of his or her remaining Copyright and Related | ||
Rights in the Work or (ii) assert any associated claims and causes of | ||
action with respect to the Work, in either case contrary to Affirmer's | ||
express Statement of Purpose. | ||
|
||
4. Limitations and Disclaimers. | ||
|
||
a. No trademark or patent rights held by Affirmer are waived, abandoned, | ||
surrendered, licensed or otherwise affected by this document. | ||
b. Affirmer offers the Work as-is and makes no representations or | ||
warranties of any kind concerning the Work, express, implied, | ||
statutory or otherwise, including without limitation warranties of | ||
title, merchantability, fitness for a particular purpose, non | ||
infringement, or the absence of latent or other defects, accuracy, or | ||
the present or absence of errors, whether or not discoverable, all to | ||
the greatest extent permissible under applicable law. | ||
c. Affirmer disclaims responsibility for clearing rights of other persons | ||
that may apply to the Work or any use thereof, including without | ||
limitation any person's Copyright and Related Rights in the Work. | ||
Further, Affirmer disclaims responsibility for obtaining any necessary | ||
consents, permissions or other rights required for any use of the | ||
Work. | ||
d. Affirmer understands and acknowledges that Creative Commons is not a | ||
party to this document and has no duty or obligation with respect to | ||
this CC0 or use of the Work. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,129 @@ | ||
# Docker | ||
|
||
## Introduction | ||
|
||
> Docker is to containers as Google is to search | ||
'A container is a special type of process that is isolated from other processes. Containers are assigned resources that no other process can access, and they cannot access any resources not explicitly assigned to them.' The advantage of containers is that a company can create an isolated computer within a computer, which provides security and consistency. [raygun.com](https://raygun.com/blog/what-is-docker/#:~:text=In%20conclusion%2C%20Docker%20is%20popular,create%20vast%20economies%20of%20scale.) | ||
|
||
### What are containers? | ||
|
||
'Containers sit on top of a physical server and its host OS—for example, Linux or Windows. Each container shares the host OS kernel and, usually, the binaries and libraries, too. Shared components are read-only. Containers are thus exceptionally “light”—they are only megabytes in size and take just seconds to start, versus gigabytes and minutes for a Virtual Machines.' | ||
|
||
'Containers also reduce management overhead. Because they share a common operating system, only a single operating system needs care and feeding for bug fixes, patches, and so on. In short, containers are lighter weight and more portable than VMs.' [blog.netapp.com](https://blog.netapp.com/blogs/containers-vs-vms/) | ||
|
||
### Why Docker? | ||
|
||
'Docker enables developers to easily pack, ship, and run any application as a lightweight, portable, self-sufficient container, which can run virtually anywhere. Containers gives you instant application portability.' | ||
|
||
Containers do this by enabling developers to isolate code into a single container. This makes it easier to modify and update the program. It also lets enterprises, as Docker points out, break up big development projects among multiple smaller Agile teams and automate the delivery of new software in containers. [zdnet.com](https://www.zdnet.com/article/what-is-docker-and-why-is-it-so-darn-popular/) | ||
|
||
> Docker is to containers as GitHub is to Git | ||
Docker brings several new things to the table that the earlier technologies didn't. The first is it's made containers easier and safer to deploy and use than previous approaches. In addition, because Docker's partnering with the other container powers, including Canonical, Google, Red Hat, and Parallels, on its key open-source component libcontainer, it's brought much-needed standardization to containers. | ||
|
||
Since then Docker donated "its software container format and its runtime, as well as the associated specifications," to The Linux Foundation's Open Container Project. Specifically, "Docker has taken the entire contents of the libcontainer project, including nsinit, and all modifications needed to make it run independently of Docker, and donated it to this effort." [zdnet.com](https://www.zdnet.com/article/what-is-docker-and-why-is-it-so-darn-popular/) | ||
|
||
### Docker for data science? | ||
|
||
Using Docker containers means you don't have to deal with "works on my machine" problems. The main advantage Docker provides is standardization: you define the parameters of your container once and run it wherever Docker is installed. This in turn provides a few major advantages: | ||
|
||
1. __Reproducibility:__ Everyone has the same OS, the same versions of tools etc. If it works on your machine, it works on everyone's machine. | ||
2. __Portability:__ This means that moving from local development to a super-computing cluster is easy. Also, if you're working on open source data science projects you can provide collaborators with an easy way to bypass setup hassle. | ||
3. __Docker Hub:__ You can take advantage of the community to find pre-built images ([search here](https://hub.docker.com/search?q=data%20science&type=image)). | ||
|
||
Another huge advantage – learning to use Docker will make you a better engineer, or turn you into a data scientist with super powers. Many systems rely on Docker, and it will help you turn your ML projects into applications and deploy models into production. [dagshub.com](https://dagshub.com/blog/setting-up-data-science-workspace-with-docker/) | ||
|
||
## Getting started using `docker run` | ||
|
||
1. [Install Docker Desktop](https://www.docker.com/get-started) (Windows users will need to [install WSL-2](windows_wsl2.md).) | ||
2. [Create a Docker Hub account](https://hub.docker.com/signup) | ||
3. Pull the [jupyter/all-spark-notebook](https://hub.docker.com/r/jupyter/all-spark-notebook) image: `docker pull jupyter/all-spark-notebook` | ||
4. Create a Docker network: `docker network create n451` | ||
5. Start your Docker all-spark-notebook container, mapping a folder path on your computer to a Docker volume. I have included my path (`/Users/hathawayj/git/BYUI451/docker_guide/data`), which you will need to change. The path to the right of `:` will stay the same. | ||
|
||
We will see how to [create a Docker compose yaml](https://docs.docker.com/compose/) a little later. The Docker compose yaml includes a PostgreSQL and Adminer container as well. You can read about creating those containers using `docker run` at [database.md](database.md). In trying to get all three containers to communicate, you will see the need for step 4 above. | ||
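
Running step 4 a second time fails because the network name is already taken. Here is a minimal sketch of a guard that creates the network only when it is missing. `ensure_network` is a hypothetical helper name, not part of this repo; `docker network inspect` and `docker network create` are standard Docker CLI subcommands.

```bash
# Hypothetical helper: create the Docker network only if it does not exist yet.
# `docker network inspect` exits non-zero when the named network is missing,
# so the `||` branch runs `docker network create` only in that case.
ensure_network() {
  docker network inspect "$1" >/dev/null 2>&1 || docker network create "$1"
}

# Usage: ensure_network n451
```

This keeps step 4 safe to rerun when you come back to the guide later.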
|
||
_Note that the command line versions require that the full local volume path is specified. We will be able to use relative file paths with the yaml._ | ||
|
||
__Command Line: Mac__ | ||
|
||
```bash | ||
docker run --name spark -it \ | ||
-p 8888:8888 -p 4040:4040 -p 4041:4041 \ | ||
-v /Users/hathawayj/git/BYUI451/docker_guide/data:/home/jovyan/data \ | ||
-v /Users/hathawayj/git/BYUI451/docker_guide/scripts:/home/jovyan/scripts \ | ||
-v /Users/hathawayj/git/BYUI451/docker_guide/scratch:/home/jovyan/scratch \ | ||
--network n451 \ | ||
jupyter/all-spark-notebook | ||
``` | ||
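
The Mac command above hardcodes one person's absolute path. As a sketch, the `-v` flags could instead be derived from the directory of your cloned repo. `volume_flags` and `print_run_command` are hypothetical helper names, and the sketch assumes the repo contains `data`, `scripts`, and `scratch` folders and that its path has no spaces. It only prints the command so you can check the paths before running it.

```bash
# Hypothetical helper: print one -v flag per shared folder under the repo root.
# The `--` stops printf from treating the leading "-v" as one of its own options.
volume_flags() {
  local root="$1"
  local dir
  for dir in data scripts scratch; do
    printf -- '-v %s/%s:/home/jovyan/%s\n' "$root" "$dir" "$dir"
  done
}

# Hypothetical helper: print the full `docker run` command for review.
# Defaults to the current directory; the unquoted $(...) is split on
# whitespace on purpose, which is why paths with spaces are not supported.
print_run_command() {
  local root="${1:-$(pwd)}"
  echo docker run --name spark -it \
    -p 8888:8888 -p 4040:4040 -p 4041:4041 \
    $(volume_flags "$root") \
    --network n451 \
    jupyter/all-spark-notebook
}
```

Run `print_run_command` from the repo directory, check the printed paths, then paste the command into your terminal.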
|
||
__Command Line: Windows__ | ||
|
||
```bat | ||
docker run --name spark -it ^ | ||
-p 8888:8888 -p 4040:4040 -p 4041:4041 ^ | ||
-v C:/git/BYUI451/docker_guide/data:/home/jovyan/data ^ | ||
-v C:/git/BYUI451/docker_guide/scripts:/home/jovyan/scripts ^ | ||
-v C:/git/BYUI451/docker_guide/scratch:/home/jovyan/scratch ^ | ||
--network n451 ^ | ||
jupyter/all-spark-notebook | ||
``` | ||
|
||
__Docker Desktop__ | ||
|
||
<img src="docker_startup.png" width="400" /> | ||
|
||
6. Now open [http://localhost:8888/lab](http://localhost:8888/lab?token=) and append `?token=` followed by your token to the end of the URL. | ||
|
||
You can find the token in the terminal or in the logs. | ||
|
||
| Terminal | Docker Desktop Logs | | ||
|----------|---------------------| | ||
|<img src="terminal_token.png" width="400" /> | <img src="docker_desktop_logs.png" width="400" /> | | ||
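
Instead of copying the token by eye, you could filter it out of the container logs. This is a sketch only: `extract_token` is a hypothetical helper, and it assumes the log contains a URL with a `token=` query parameter whose value is a lowercase hexadecimal string.

```bash
# Hypothetical helper: read log text on stdin and print the first token found.
# grep -o prints each "token=<hex>" match on its own line, head keeps the
# first match, and cut drops the "token=" prefix.
extract_token() {
  grep -o 'token=[0-9a-f]*' | head -n 1 | cut -d= -f2
}

# Usage (needs the running container named "spark" from step 5):
#   docker logs spark 2>&1 | extract_token
```

The `2>&1` is there because the notebook server may write its startup messages to standard error rather than standard output.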
|
||
With `docker run` we can get a full Spark environment up and running on our computer in minutes. In this container, we can practice our Spark magic and even speed up some of the work we would do in pandas. Spend some time in Jupyter Lab getting used to Spark. Here are some great links to help you with pyspark. | ||
|
||
- [pyspark-examples](https://github.com/spark-examples/pyspark-examples) | ||
- [PySpark Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf) | ||
- [PySpark SQL Cheat Sheet](https://intellipaat.com/mediaFiles/2019/03/PySpark-SQL-cheat-sheet.jpg) | ||
|
||
You could try using the [master files](https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf) from the United States IRS 990 forms about non-profit companies. Find the four region `.csv` files and explore. | ||
|
||
## Getting started using `docker-compose` | ||
|
||
To use this section, I am assuming the following. | ||
|
||
- You have cloned your template repo to your local computer. | ||
- You have a terminal open at the file path of this cloned repo. | ||
- You have reviewed the [database.md](database.md) guide on the postgresql and Adminer containers. | ||
- You have examined the [docker-compose.yml](docker-compose.yml) file. | ||
|
||
We can create a Docker Compose `.yml` that automates a bit of the work we went through above. Once the `.yml` is created, we can simply tell `docker-compose` to build our Docker containers. Here are the steps: | ||
|
||
1. Clone this repository to your computer. | ||
2. Open your terminal and navigate to the directory of the repo you just cloned. (Run `pwd` on Mac or `cd` on Windows to see your working directory.) | ||
3. With your terminal open in that directory, run `docker-compose -p c451 -f docker-compose.yml up`. | ||
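
The steps above depend on the terminal being in the repo directory. A small sketch of a check for that, assuming only that `docker-compose.yml` sits at the top of the cloned repo; `compose_up_command` is a hypothetical helper name, and it prints the command rather than running it.

```bash
# Hypothetical helper: print the compose command only when docker-compose.yml
# is in the current directory; otherwise explain what to do and return 1.
compose_up_command() {
  if [ ! -f docker-compose.yml ]; then
    echo "docker-compose.yml not found in $(pwd); cd to the cloned repo first" >&2
    return 1
  fi
  echo "docker-compose -p c451 -f docker-compose.yml up"
}
```

Calling `compose_up_command` from the wrong folder gives a clear message instead of a Compose error about a missing file.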
|
||
One difference is that each Docker container now has a new name. | ||
|
||
| docker-compose name | docker run name | | ||
| ----------------------- | ------------------- | | ||
| _db_451_ | db | | ||
| _spark_451_ | spark | | ||
| _adminer_451_ | adminer | | ||
|
||
With these new names, a few commands and inputs need to be updated. For example, to get into the new Postgres container we would run `docker exec -it db_451 sh`. | ||
|
||
## Other readme.md files | ||
|
||
- [Postgres database Docker support](database.md) | ||
- [Docker CLI and psql](command_line_containers.md) | ||
- [Spark Guide using our Docker containers](https://github.com/BYUI451/spark_guide) | ||
|
||
## References | ||
|
||
- [raygun.com](https://raygun.com/blog/what-is-docker/#:~:text=In%20conclusion%2C%20Docker%20is%20popular,create%20vast%20economies%20of%20scale.) | ||
- [blog.netapp.com](https://blog.netapp.com/blogs/containers-vs-vms/) | ||
- [zdnet.com](https://www.zdnet.com/article/what-is-docker-and-why-is-it-so-darn-popular/) | ||
- [dagshub.com](https://dagshub.com/blog/setting-up-data-science-workspace-with-docker/) | ||
- [backup and restore postgresql](https://docs.bitnami.com/installer/infrastructure/mapp/administration/backup-restore-postgresql/) and [docker and postgresql](https://markheath.net/post/exploring-postgresql-with-docker) |