
PyConTW ETL


Using Airflow to implement our ETL pipelines

Dags

  1. ods/opening_crawler: Crawlers written by @Rain. These openings can be used for the recruitment board, which was implemented by @tai271828 and @stacy.
  2. ods/survey_cake: A manually triggered uploader that uploads questionnaire data to BigQuery. The uploader should be invoked after we receive the SurveyCake questionnaire.
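
For orientation, below is a minimal sketch of how a manually triggered DAG such as ods/survey_cake might be declared under Airflow 1.10.x. The dag id, task id, and upload_to_bigquery callable are hypothetical placeholders, not the actual implementation.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def upload_to_bigquery(**context):
    # Placeholder: the real task would parse the questionnaire and load it into BigQuery.
    pass


with DAG(
    dag_id="ods_survey_cake_uploader",  # hypothetical dag id
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,  # manually triggered, as described above
    catchup=False,
) as dag:
    upload = PythonOperator(
        task_id="upload_questionnaire",
        python_callable=upload_to_bigquery,
        provide_context=True,  # Airflow 1.10.x style
    )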

Prerequisites

  1. Install Python 3.8+
  2. Get Docker
  3. Install Git
  4. Get npm

Install

  1. docker pull puckel/docker-airflow:1.10.9
  2. Python dependencies:
    1. virtualenv venv
      • . venv/bin/activate
    2. pip install poetry
    3. poetry install
  3. npm dependencies, for the linter, formatter, and commit linter (optional):
    1. brew install npm
    2. npm ci

Commit

  1. git add <files>
  2. npm run check: Apply all linters and formatters
  3. npm run commit

Run

Local environment: Python script

  1. . venv/bin/activate
  2. . ./.env.sh
  3. cd contrib
  4. Check the command for each script in contrib/README.md
  5. python xxx.py

Local environment: Docker

You would need to set up the Snowflake connection manually; contact @davidtnfsh if you don't have those secrets.

⚠ WARNING: About .env
Please don't use the .env for local development, or it might screw up the production tables.

  1. Build docker image:
    • Build the production image: docker build -t davidtnfsh/pycon_etl:prod --cache-from davidtnfsh/pycon_etl:prod -f Dockerfile . Even if you only want the dev/test image, build this image first, because the dev/test image is built on top of it (see below).
    • Build the dev/test image: docker build -t davidtnfsh/pycon_etl:test --cache-from davidtnfsh/pycon_etl:prod -f Dockerfile.test .
  2. Fill in some secrets:
    1. For dev/test: cp .env.template .env.staging. For production: cp .env.template .env.production instead.
    2. Follow the instructions in .env.<staging|production> and fill in your secrets. If you are just running the staging instance for development as a sandbox and are not going to access any specific third-party service, leaving .env.staging as-is should be fine.
  3. Start the Airflow server:
    • production: docker run --rm -p 80:8080 --name airflow -v $(pwd)/dags:/usr/local/airflow/dags -v $(pwd)/service-account.json:/usr/local/airflow/service-account.json --env-file=./.env.production davidtnfsh/pycon_etl:prod webserver
    • dev/test: docker run --rm -p 80:8080 --name airflow -v $(pwd)/dags:/usr/local/airflow/dags -v $(pwd)/service-account.json:/usr/local/airflow/service-account.json --env-file=./.env.staging davidtnfsh/pycon_etl:test webserver
    • Note that the only differences are the env file and the image tag.
  4. Enter localhost or 127.0.0.1 in your browser's address bar to open the Airflow page.
    • If port 80 is already in use, select a different host port in the -p argument to docker run and enter localhost:<host port> in the address bar.


BigQuery (Optional)

  1. Set up GCP authentication: https://googleapis.dev/python/google-api-core/latest/auth.html
    • After invoking gcloud auth application-default login, you'll get a credentials file at $HOME/.config/gcloud/application_default_credentials.json. Invoke export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json" if you have a key file. A short verification sketch follows this list.
    • service-account.json: Please contact @david30907d via email, Telegram, or Discord. Don't worry about this JSON file if you are just running the sandbox staging instance for development.
  2. Give Toy-Examples a try
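
Once authentication is configured, a quick sanity check is to list the project's datasets from Python. This is only a sketch; the project id is the one used in the tutorial below, and listing datasets requires the corresponding BigQuery permissions.

import os

from google.cloud import bigquery

# Either GOOGLE_APPLICATION_CREDENTIALS points at a key file, or the
# application-default credentials from `gcloud auth application-default login` are used.
print("GOOGLE_APPLICATION_CREDENTIALS =", os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"))

client = bigquery.Client(project="pycontw-225217")
print([dataset.dataset_id for dataset in client.list_datasets()])  # raises if auth is not set up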

Deployment

  1. Manually deploy to the Google Compute Engine instance
  2. Fill in airflow.cfg with the Google OAuth client ID and secret (Ref: setting-up-google-authentication)

CI/CD

Please check .github/workflows for details

Tutorials

BigQuery Example:

from google.cloud import bigquery

client = bigquery.Client(project='pycontw-225217')

# Perform a query.
QUERY = '''
    SELECT scenario.day2checkin.attr.diet FROM `pycontw-225217.ods.ods_opass_attendee_timestamp`
'''
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

for row in rows:
    print(row.diet)
