Dockerize mara #7

Open
wants to merge 20 commits into
base: master

Conversation

gathineou
Contributor

PR for discussing and brainstorming on the docker-related mara implementations

@jankatins
Member

jankatins commented May 16, 2019

I had a few additions and problems :-( :

  • pip had problems installing (or rather uninstalling old versions), so I added --ignore-installed.
  • When the same .venv was used for both local and in-docker dev, the venv had wrong paths for whichever side had not created it. I moved the in-docker venv to .venv-docker and duplicated the code to create/update it in the entrypoint.sh script. A better fix would be to make the venv path configurable in the Makefile.
  • The ensure-pushed script would reproducibly fail the first time it ran in an environment (i.e. in-docker vs. directly local), so I made the code run the script twice, once as a dry run.
  • I needed a way to create the etl db, so I added a small command which basically copies the first part of the migration code in mara-db (flask app.cli.ensure-etl-db). This might be something for the data-integration package.
  • I wanted postgresql client version 11.

I ended up with this for the dev docker container (i.e. the source is linked into the container via a volume):

# don't care about slim, we anyway have lots of reasons to poke into the container...
FROM python:3.7-stretch

RUN ["mkdir", "-p", "/mara"]
WORKDIR /mara
VOLUME /mara

RUN groupadd -r mara && useradd --no-log-init -r -g mara mara

# Install latest stable postgresql-client from the official repository
RUN \
    apt-get update && apt-get install -y --no-install-recommends gnupg dirmngr \
    # https://github.com/inversepath/usbarmory-debian-base_image/issues/9#issuecomment-466594168
    && mkdir ~/.gnupg && echo "disable-ipv6" >> ~/.gnupg/dirmngr.conf \
	&& apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys B97B0AFCAA1A47F044F244A07FCC7D46ACCC4CF8 \
	&& echo "deb http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg main" > /etc/apt/sources.list.d/pgdg.list \
    # ugly fix to install postgresql-client without errors in slim-stretch image
    && mkdir -p /usr/share/man/man1 /usr/share/man/man7 \
    && apt-get update && apt-get install -y --no-install-recommends \
          git \
          dialog \
          coreutils \
          graphviz \
          python3-dev \
          python3-venv \
          rsync \
          nano \
          telnet \
          postgresql-client

# The entrypoint installs all packages on first start...

COPY ./docker/dev/entrypoint.sh /
RUN ["chmod", "+x", "/entrypoint.sh"]


EXPOSE 5000

ENV MARA_ENVIRONMENT docker-dev
ENV FLASK_APP "/mara/app/app.py"
ENV FLASK_DEBUG 1
# preactivate the environment, so you can straight do stuff like run pipelines with docker exec
ENV PATH /mara/.venv-docker/bin:/usr/local/bin:/usr/local/sbin:/usr/sbin:/usr/bin:/sbin:/bin

USER mara

ENTRYPOINT ["/entrypoint.sh"]
CMD ["flask-no-reload"]

entrypoint.sh:

#!/usr/bin/env bash
set -x
set -e

# Create the virtual env for the first time, which is only used for docker
# This prevents problems when a user uses the same source repo for both working
# via docker and local
venv_dir=".venv-docker"
if [ ! -d "${venv_dir}" ]; then
    mkdir -p "${venv_dir}"
    (cd "${venv_dir}" && python3 -m venv --copies .)
    # add the project directory to path, so we might use that to edit source packages
    echo $(pwd) > "$(echo ${venv_dir}/lib/*/site-packages)/mara-path.pth"
    # install minimum set of required packages
    # wheel needs to be early to be able to build wheels
    # --ignore-installed: https://github.com/moby/moby/issues/12327
    # "EnvironmentError: [Errno 39] Directory not empty: '/mara/.venv-docker/lib/python3.7/site-packages/~ip/_internal'"
    "${venv_dir}/bin/python3" -m pip install --ignore-installed  --upgrade pip wheel requests setuptools pipdeptree
    # Workaround problems with un-vendored urllib3/requests in pip on ubuntu/debian
    # This forces .venv/bin/pip to use the vendored versions of urllib3 from the installed requests version
    # see https://stackoverflow.com/a/46970344/1380673
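    # the brace list and the glob must stay outside the quotes, otherwise the shell does not expand them
    rm -vf "${venv_dir}"/share/python-wheels/{requests,chardet,urllib3}-*.whl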
fi

source "${venv_dir}/bin/activate"

if [ "$1" = "flask-reload" ]; then
    exec flask run --with-threads --host 0.0.0.0 --reload --eager-loading
elif [ "$1" = "flask-no-reload" ]; then
    exec flask run --with-threads --host 0.0.0.0 --no-reload --eager-loading
elif [ "$1" = "migrate" ]; then
    flask app.cli.ensure-etl-db
    flask mara_db.migrate
elif [ "$1" = "update-packages" ] ; then
    for package_dir in $(mkdir -p packages; cd packages; find . -maxdepth 1 -mindepth 1 -type d) ; do
        # I've no clue why, but this has to run twice to work: the first run fails, the second one succeeds.
        # "|| true" keeps "set -e" from aborting on the expected first failure.
        .scripts/mara-app/ensure-pushed.sh "packages/${package_dir}" > /dev/null 2>&1 || true
        .scripts/mara-app/ensure-pushed.sh "packages/${package_dir}"
    done
    "${venv_dir}/bin/python3" -m pip install --ignore-installed --requirement=requirements.txt.freeze --src=./packages --upgrade --exists-action=w
    PYTHONWARNINGS="ignore" "${venv_dir}/bin/python3" -m pip install --requirement=requirements.txt --src=./packages --upgrade --exists-action=w
    # copy newer script versions
    rsync --archive --recursive --itemize-changes  --delete packages/mara-app/.scripts/ .scripts/mara-app/
    "${venv_dir}/bin/pipdeptree" --warn=fail
    # write freeze file
    # pkg-resources is automatically added on ubuntu, but breaks the install.
    # https://stackoverflow.com/a/40167445/1380673
    "${venv_dir}/bin/python3" -m pip freeze | grep -v "pkg-resources" > requirements.txt.freeze
    flask app.cli.ensure-etl-db
    flask mara_db.migrate
elif [ "$1" = "install-packages" ] ; then
    for package_dir in $(mkdir -p packages; cd packages; find . -maxdepth 1 -mindepth 1 -type d) ; do
        # I've no clue why, but this has to run twice to work: the first run fails, the second one succeeds.
        # "|| true" keeps "set -e" from aborting on the expected first failure.
        .scripts/mara-app/ensure-pushed.sh "packages/${package_dir}" > /dev/null 2>&1 || true
        .scripts/mara-app/ensure-pushed.sh "packages/${package_dir}"
    done
    "${venv_dir}/bin/python3" -m pip install --ignore-installed  --requirement=requirements.txt.freeze --src=./packages --upgrade --exists-action=w
    rsync --archive --recursive --itemize-changes  --delete packages/mara-app/.scripts/ .scripts/mara-app/
    flask app.cli.ensure-etl-db
    flask mara_db.migrate
else
    exec "$@"
fi

I've also built something for our production environment (i.e. the code is copied into the docker image):

Dockerfile

# use a slim image to get a smaller size
FROM python:3.7-slim-stretch

RUN ["mkdir", "-p", "/mara"]
WORKDIR /mara

RUN groupadd -r mara && useradd --no-log-init -r -g mara mara

COPY requirements.txt.freeze /mara/

RUN \
    apt-get update && apt-get install -y --no-install-recommends gnupg dirmngr \
    # https://github.com/inversepath/usbarmory-debian-base_image/issues/9#issuecomment-466594168
    && mkdir ~/.gnupg && echo "disable-ipv6" >> ~/.gnupg/dirmngr.conf \
	&& apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys B97B0AFCAA1A47F044F244A07FCC7D46ACCC4CF8 \
	&& echo "deb http://apt.postgresql.org/pub/repos/apt/ stretch-pgdg main" > /etc/apt/sources.list.d/pgdg.list \
    # ugly fix to install postgresql-client without errors in slim-stretch image
    && mkdir -p /usr/share/man/man1 /usr/share/man/man7 \
    && apt-get update && apt-get install -y --no-install-recommends \
	  git \
      curl \
      dialog \
      coreutils \
      graphviz \
      python3-dev \
      python3-venv \
      rsync \
      postgresql-client \
      gcc \
	&& pip install --no-cache-dir  -r requirements.txt.freeze \
	&& apt-get purge -y --auto-remove git gnupg dirmngr gcc \
	&& rm -rf /var/lib/apt/lists/* ;

COPY ./docker/prod/entrypoint.sh /
RUN ["chmod", "+x", "/entrypoint.sh"]

COPY ./docker/prod/local_setup.py /mara/app/local_setup.py
COPY ./app/ /mara/app/

EXPOSE 5000

ENV MARA_ENVIRONMENT docker-prod
ENV FLASK_APP="/mara/app/app.py"

USER mara

ENTRYPOINT ["/entrypoint.sh"]
CMD ["flask-no-reload"]

entrypoint.sh

#!/usr/bin/env bash
set -x
set -e


if [ "$1" = "flask-reload" ]; then
    exec flask run --with-threads --host 0.0.0.0 --reload --eager-loading
elif [ "$1" = "flask-no-reload" ]; then
    exec flask run --with-threads --host 0.0.0.0 --no-reload --eager-loading
else
    exec "$@"
fi

local_setup.py

# ...

# On local/in-docker, this is filled with some values
defaults = {}

_sentinel = object()


def e(name, default=_sentinel):
    # look up NAME in the environment first, then in defaults, then fall back to the given default
    ret = os.environ.get(name.upper(), defaults.get(name))
    if ret is None:
        if default is not _sentinel:
            return default
        raise KeyError(f"{name} not found in defaults or environment")
    return ret

# Cached so the config is not re-read each time the DBs are looked up
__dwh_db = mara_db.dbs.PostgreSQLDB(user=e("dwh_user"),
                                        host=e("dwh_host"),
                                        port=int(e("dwh_port")),
                                        database=e("dwh_database"),
                                        password=e("dwh_password"))
__mara_db = mara_db.dbs.PostgreSQLDB(user=e("mara_user"),
                                         host=e("mara_host"),
                                         port=int(e("mara_port")),
                                         database=e("mara_database"),
                                         password=e("mara_password"))
@patch(mara_db.config.databases)
def databases():
    return {
        # the project requires two databases: 'mara' for the app itself, and 'dwh' for the etl
        'dwh': __dwh_db,
        'mara': __mara_db,
# ...
# Again cached...
__max_number_of_parallel_tasks = int(e("max_number_of_parallel_tasks", 11))
patch(data_integration.config.max_number_of_parallel_tasks)(lambda: __max_number_of_parallel_tasks)
# ...
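
For anyone who wants to try the lookup order outside the app, here is a minimal standalone sketch of the same idea. The function name `lookup` and the fallback value in `defaults` are made up for illustration and are not part of any mara package; only the order (environment, then `defaults`, then the explicit default) is taken from the snippet above.

```python
import os

# Hypothetical fallback values; in the real file this dict is filled for local/in-docker use.
defaults = {"dwh_port": "5432"}

# Distinguishes "no default given" from an explicit default such as None.
_sentinel = object()


def lookup(name, default=_sentinel):
    """Resolve a setting: environment (upper-cased name) first, then defaults, then `default`."""
    value = os.environ.get(name.upper(), defaults.get(name))
    if value is None:
        if default is not _sentinel:
            return default
        raise KeyError(f"{name} not found in defaults or environment")
    return value


if __name__ == "__main__":
    os.environ["DWH_HOST"] = "192.168.99.1"  # example value only
    print(lookup("dwh_host"))                          # taken from the environment
    print(int(lookup("dwh_port")))                     # taken from the defaults dict
    print(lookup("max_number_of_parallel_tasks", 11))  # explicit default
```

Environment values always arrive as strings, which is why the port and the task count are wrapped in `int(...)` at the call site.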

I've put the Dockerfile/entrypoint.sh into `./docker/{dev|prod}/` and it's now called like this:

### Docker

For local dev, it's assumed that you want to use the locally installed
postgresql instead of one in a docker container (a real disk is faster
than a virtualized one; this wouldn't matter on Linux).

To be able to use a locally installed postgresql, make postgresql listen
on the interface the docker container has access to:

```bash
λ docker-machine ip
192.168.99.100
λ ifconfig |grep 192.168.99
	inet 192.168.99.1 netmask 0xffffff00 broadcast 192.168.99.255
# on windows use ipconfig and a manual search
# -> This means postgresql has to listen at the 192.168.99.1 address
```
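
If you prefer not to search the `ifconfig`/`ipconfig` output by eye, the matching can be sketched with the standard `ipaddress` module. This is only an illustration: the function name `host_address_for` is made up, and the `/24` prefix is an assumption taken from the `0xffffff00` netmask shown above.

```python
import ipaddress


def host_address_for(docker_machine_ip, host_addresses, prefix_length=24):
    """Return the first host address in the same network as the docker-machine VM, or None."""
    # strict=False lets us pass a host address (x.x.x.100) instead of the network address (x.x.x.0)
    network = ipaddress.ip_network(f"{docker_machine_ip}/{prefix_length}", strict=False)
    for address in host_addresses:
        if ipaddress.ip_address(address) in network:
            return address
    return None


if __name__ == "__main__":
    # Candidate addresses as they might be collected from ifconfig/ipconfig (example values)
    candidates = ["127.0.0.1", "10.0.0.5", "192.168.99.1"]
    print(host_address_for("192.168.99.100", candidates))
```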

This needs two changes to the postgresql config files:
* in `postgresql.conf`, you need to add `listen_addresses = 'localhost,192.168.99.1'` and 
* in `pg_hba.conf`, you need to add `host    all             all             192.168.99.100/32        trust`

Afterwards, restart postgresql. This depends on the docker host (in virtualbox)
already being up and running: on mac, the postgresql server starts before the
vbox is online, so I always have to restart postgresql.
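
The two config lines can also be generated for your own addresses. The helper below is a hypothetical sketch, not part of this repo; it only formats the lines described above. Note that `trust` lets anyone who can connect from that address log in without a password, so another method may be a better fit outside a throwaway dev setup.

```python
def postgresql_config_lines(host_ip, docker_machine_ip, auth_method="trust"):
    """Build the postgresql.conf and pg_hba.conf lines for the given addresses."""
    listen_line = f"listen_addresses = 'localhost,{host_ip}'"
    # /32 restricts the rule to exactly the docker-machine address
    hba_line = f"host    all             all             {docker_machine_ip}/32        {auth_method}"
    return listen_line, hba_line


if __name__ == "__main__":
    listen_line, hba_line = postgresql_config_lines("192.168.99.1", "192.168.99.100")
    print(listen_line)  # goes into postgresql.conf
    print(hba_line)     # goes into pg_hba.conf
```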

To build the container:

```bash
# development version, where the source code is on a VOLUME and you run pipelines in the browser
docker build -t mara-app:dev -f ./docker/dev/Dockerfile .
# Production version, where the source code is copied into the container
docker build -t mara-app:prod -f ./docker/prod/Dockerfile .
```

Running a development container:

```bash
cp docker/.env.example development.env
# edit development.env -> see comment in the file
# If you use a local postgresql, you need to set the DWH_HOST and MARA_HOST to the ipaddress 
# from above (e.g. 192.168.99.1)
docker run -i -t --rm --name mara-app --mount type=bind,source=$(pwd),target=/mara -p 5000:5000 --net=bridge --env-file=development.env mara-app:dev
# docker exec also has the python venv in path:
docker exec -ti mara-app flask data_integration.ui.run --path utils
```
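
Before starting the container it can help to check that the env file sets everything the app will ask for. The sketch below is an assumption-heavy illustration: the required names are derived from the `e("dwh_user")`-style lookups in `local_setup.py` (upper-cased) and assume `defaults` is empty, the contents of `docker/.env.example` are not known here, and the parser only handles plain `KEY=VALUE` lines and `#` comments.

```python
# Names derived from the lookups in local_setup.py; an assumption, adjust to your setup.
REQUIRED = [
    f"{db}_{field}".upper()
    for db in ("dwh", "mara")
    for field in ("user", "host", "port", "database", "password")
]


def parse_env_file(text):
    """Parse simple KEY=VALUE lines; blank lines, '#' comments and lines without '=' are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value
    return values


def missing_settings(values, required=REQUIRED):
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    example = "# example only\nDWH_HOST=192.168.99.1\nMARA_HOST=192.168.99.1\n"
    print(missing_settings(parse_env_file(example)))
```

To check a real file, pass its contents instead, e.g. `parse_env_file(open("development.env").read())`.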

Running the production container:

```bash
cp docker/.env.example production.env
# edit production.env -> see comment in the file
docker run -i -t --rm --name mara-app -p 5000:5000 --net=bridge --env-file=production.env mara-app:prod
# run the utils pipeline in the same container
docker exec -ti mara-app flask data_integration.ui.run --path utils
```

@jankatins
Member

Sizes:

λ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
mara-app            dev                 a06cceefb076        4 hours ago         1.03GB
mara-app            prod                c2e0d861fd4c        15 hours ago        424MB

Still Meh :-(

@soobrosa

@martin-loetzsch @jankatins who can resolve the conflict?

@jankatins
Member

@gathineou is anyone still working on this?

@leo-schick
Member

I copied the Dockerfiles of this branch and built a docker environment myself which is quite similar to this one. In case there is interest, I could share this in a separate PoC repository.

@soobrosa

@leo-schick would be more than happy to try!

@leo-schick
Member

leo-schick commented Apr 1, 2022

@soobrosa I have now published my simplified current version at https://github.com/mara/docker. The repository is currently private, but you should be able to see it as a member of the mara organisation.

My docker images are based on this repo with some additional changes from my side. The suggestions from @jankatins are not considered (yet). PRs are welcome 🤟

P.S. I did not make any changes to the postgres-cstore_ftd container/image since I do not use cstore_ftd.

@gathineou
Contributor Author

> @gathineou is anyone still working on this?

Hi @jankatins 👋🏽 excuse my late reply, I was off for a while and overlooked this.
Unfortunately, no one has worked on this for a while, so there is not much progress from our side.
