Data GCP 🚀

Data Engineering Platform for Pass Culture on Google Cloud Platform (GCP)

📚 Overview

This repository contains the core components of our data platform:

  • Airflow DAGs for workflow orchestration
  • DBT models for data transformation
  • ML models for machine learning services
  • ETL jobs for data processing

📖 Documentation

πŸ—οΈ Architecture

+-- orchestration
| +-- dags
|    +-- dependencies
|    +-- jobs
|    +-- data_gcp_dbt
+-- jobs
| +-- etl_jobs
|   +-- external
|     +-- ...
|   +-- internal
|     +-- ...
| +-- ml_jobs
|   +-- ...

🚀 Getting Started

Prerequisites

  • Google Cloud CLI
  • Access to our GCP service accounts
  • Make installed
    • Linux: sudo apt install make
    • macOS: brew install make
  • Required system libraries, installed via the repository's make targets
    • Linux: make install_ubuntu_libs
    • macOS: make install_macos_libs

Installation

  1. Clone the repository

    git clone git@github.com:pass-culture/data-gcp.git
    cd data-gcp
  2. Install the project

    make install

    This installs all requirements for the orchestration part into a single virtual environment and sets up pre-commit hooks for code quality.

Troubleshooting

macOS

If you hit MySQL client errors while installing dependencies, you may need to set the following environment variables. Add them to your ~/.zshrc:

export MYSQLCLIENT_LDFLAGS="-L/opt/homebrew/opt/mysql-client/lib -lmysqlclient -rpath /usr/local/mysql/lib"
export MYSQLCLIENT_CFLAGS="-I/opt/homebrew/opt/mysql-client/include -I/opt/homebrew/opt/mysql-client/include/mysql"

πŸ› οΈ Development

Creating New Microservices

ML Microservice

MS_NAME=my_microservice make create_microservice_ml

ETL Microservice (Internal)

MS_NAME=my_microservice make create_microservice_etl_internal

ETL Microservice (External)

MS_NAME=my_microservice make create_microservice_etl_external

Install specific dependencies

uv sync --group <airflow|dbt|dev|docs>

Run pre-commit hooks

make ruff_fix
make ruff_check
make sqlfluff_fix
make sqlfluff_check
make sqlfmt_fix
make sqlfmt_check

View the uv lock file in a human-readable form

uv manages dependencies through a lock file, uv.lock, which is not easy to read. You can generate a human-readable file from uv.lock with:

python automations/export_requirements.py export-requirements

or, to add a prefix to the generated file names:

python automations/export_requirements.py export-requirements --prefix "new_"

⚠️ Do not commit these files; they are only there to help you understand the dependencies. ⚠️

Compute the diff of requirements between two branches

python automations/export_requirements.py diff-requirements --branch1 {first_branch} --branch2 {second_branch}

or, to write the output to a file named package_versions.diff:

python automations/export_requirements.py diff-requirements --branch1 {first_branch} --branch2 {second_branch} --write-to-file

Example:

python automations/export_requirements.py diff-requirements --branch1 master --branch2 refactor/remove-hardcoded-deps-in-pyproject.toml --write-to-file

This generates a file named package_versions.diff containing the diff of the requirements between the two branches.
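
As a rough picture of what such a diff contains, the sketch below compares two sets of pinned versions with the standard library's difflib. It is an illustration under assumptions, not the script's actual logic or output format. The function diff_pins, the labels, and the sample versions are hypothetical.

```python
import difflib


def diff_pins(
    old: dict[str, str],
    new: dict[str, str],
    old_label: str = "branch1",
    new_label: str = "branch2",
) -> str:
    """Return a unified diff of two {package: version} mappings."""
    old_lines = [f"{name}=={version}" for name, version in sorted(old.items())]
    new_lines = [f"{name}=={version}" for name, version in sorted(new.items())]
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=old_label, tofile=new_label, lineterm=""
    )
    return "\n".join(diff)


# Hypothetical pins for two branches.
master_pins = {"requests": "2.31.0", "pandas": "2.2.0"}
feature_pins = {"requests": "2.32.3", "pandas": "2.2.0", "numpy": "1.26.4"}

print(diff_pins(master_pins, feature_pins, "master", "feature"))
```

Lines prefixed with - exist only on the first branch, lines prefixed with + only on the second, and unchanged pins appear as context. Identical inputs produce an empty string.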

🔄 CI/CD

Our CI/CD pipelines are managed through GitHub Actions. See the workflows documentation for details.

🤝 Contributing

  1. Create a new branch for your feature
  2. Make your changes
  3. Submit a pull request

πŸ“ License

This project is licensed under the Mozilla Public License Version 2.0 - see the LICENSE file for details.
