Development Lifecycle

Trunk-Based Development

The Giga DataOps Platform project follows the concept of Trunk-Based Development: User Stories are worked on in short-lived branches and opened as PRs, which are then merged to main once approved by another developer.

The main branch serves as the most up-to-date version of the code base.

Naming Conventions

Branch Names

Refer to Conventional Commits.

PR Title

[<Feature/Fix/Release/Hotfix>](<issue-id>) <Short desc>
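For example (the issue ID and description below are made up for illustration):

[Feature](1234) Add delimiter detection to the CSV upload flow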

PR Template

pull_request_template.md

Development Workflow

  • Branch off from main to ensure you get the latest code.
  • Name your branch according to the Naming Conventions.
  • Keep your commits self-contained and your PRs small and tailored to a specific feature as much as possible.
  • Push your commits, open a PR and fill in the PR template.
  • Request a review from 1 other developer.
  • Once approved, rebase/squash your commits into main (see the sketch after this list). Rule of thumb:
    • If the PR contains 1 or 2 commits, perform a Rebase.
    • If the PR contains several commits that build toward a larger feature, perform a Squash.
    • If the PR contains several commits that are relatively unrelated (e.g., an assortment of bug fixes), perform a Rebase.
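A minimal command-line sketch of the two options, assuming you merge from your own terminal rather than through the PR merge button:

# Rebase: replay your branch's commits on top of the latest main
git fetch origin
git rebase origin/main

# Squash: combine the branch's commits into one before merging
# (mark all but the first commit as "squash" in the editor that opens)
git rebase -i origin/main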

Local Development

File Structure Walkthrough

  • azure/ - Contains all configuration for Azure DevOps pipelines.
  • dagster/ - Contains all custom Dagster code.
  • docs/ - Contains all Markdown files for the Backstage TechDocs.
  • infra/ - Contains all Kubernetes & Helm configuration.
  • spark/ - Contains Docker build items for the custom Spark and Hive Metastore images.
  • oauth2-proxy/ - Contains all Docker build items for the custom OAuth2 Proxy image.

Pre-requisites

Required

As-needed

  • Kubernetes
    • If you are using Docker Desktop on Windows, you can use the bundled Kubernetes distribution.
  • Helm

Windows Subsystem for Linux (WSL)

Skip this step if you are on Linux or Mac.

  1. Check your USERPROFILE directory for a file named .wslconfig. You can navigate to this directory by opening the file explorer and entering %USERPROFILE% in the address bar. If the file does not exist, create it.
  2. Ensure the following contents are in the file:
    [wsl2]
    memory=16GB
    swap=20GB
    
    This assumes a workstation with 4 cores, 32GB RAM, and 1TB of storage. Adjust the values accordingly if your hardware differs; ideally, do not give WSL more than half of your available RAM.
  3. Install WSL. You may be prompted to restart your device.
  4. In a separate PowerShell/Command Prompt (CMD) terminal, run:
    wsl --set-default-version 2
  5. Open the Microsoft Store, search for and install Ubuntu.
  6. In the PowerShell/CMD terminal, run:
    wsl --set-default Ubuntu
  7. In the start menu, Ubuntu should show up in the recently added programs. Open it.
  8. You will be prompted for a new username and password. Enter any credentials and make sure to remember them. You may be prompted to restart again.
  9. If you are not prompted to restart, close Ubuntu and open it again. You should now have a working WSL installation.
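Optionally, you can confirm the setup from a PowerShell/CMD terminal (not inside Ubuntu); your distribution should be listed with VERSION 2:

wsl --list --verbose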

Important

From this point on, all commands should be run inside the Ubuntu terminal, unless otherwise specified.

Docker

  1. Install Docker Desktop. You may be prompted to restart your device.
  2. Open the Docker Desktop app and go to settings.
  3. Ensure you have the recommended settings in the Docker Desktop General, Resources, and Kubernetes panels.

    [!NOTE] WSL integration settings are only applicable if you are on Windows.

  4. Wait for the Kubernetes installation to complete.
  5. To test if everything is set up correctly, run this inside an Ubuntu terminal:
    docker image ls -a
    kubectl get all
    If you get no errors, you're good to go!

Kubernetes

Kubernetes is installed as part of the Docker Desktop installation. You can optionally install the kubectx and kubens plugins to make it easier to switch between contexts/namespaces.

Install Krew:

  1. Run the following:

    (
     set -x; cd "$(mktemp -d)" &&
     OS="$(uname | tr '[:upper:]' '[:lower:]')" &&
     ARCH="$(uname -m | sed -e 's/x86_64/amd64/' -e 's/\(arm\)\(64\)\?.*/\1\2/' -e 's/aarch64$/arm64/')" &&
     KREW="krew-${OS}_${ARCH}" &&
     curl -fsSLO "https://github.com/kubernetes-sigs/krew/releases/latest/download/${KREW}.tar.gz" &&
     tar zxvf "${KREW}.tar.gz" &&
     ./"${KREW}" install krew
    )
  2. Add the Krew path to your system PATH by appending to your shell config (replace ~/.bashrc with ~/.zshrc if you use zsh), i.e. run the following:

    echo 'export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"' >> ~/.bashrc
  3. Load your new shell config:

    # bash
    source ~/.bashrc
    
    # zsh
    source ~/.zshrc
  4. Download the Krew plugin list

    kubectl krew update
  5. Install kubectx and kubens

    kubectl krew install ctx
    kubectl krew install ns
  6. Test if installation is ok:

    kubectl ctx
    kubectl ns

asdf

  1. Install asdf (a minimal install sketch follows this list).
  2. Test installation:
    asdf
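A minimal sketch for step 1, using the Git-based install from the asdf docs. The version tag is only an example, so check the asdf releases page for the latest; this also assumes bash, so adapt the shell config file for zsh:

git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.14.1
echo '. "$HOME/.asdf/asdf.sh"' >> ~/.bashrc
source ~/.bashrc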

Python

  1. Install Python build dependencies:
    • macOS
      brew install openssl readline sqlite3 xz zlib tcl-tk
    • Linux/WSL
      sudo apt-get update
      sudo apt-get install -y build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev curl libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
  2. Install Python
    asdf plugin add python
    asdf install python 3.11.7
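You will likely also want to make this the default interpreter and verify it. On classic asdf versions this is done with asdf global (newer releases use asdf set instead):

asdf global python 3.11.7
python --version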

Poetry

  1. Install Poetry:
    asdf plugin add poetry
    asdf install poetry 1.7.1
  2. Add Poetry path to your shell config:
    echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
  3. Reload shell config:
    source ~/.bashrc
  4. Test installation:
    poetry --version
  5. Set recommended settings:
    poetry config virtualenvs.in-project true

Task

  1. Install Task:
    sh -c "$(curl --location https://taskfile.dev/install.sh)" -- -d -b ~/.local/bin
  2. Test installation:
    task --version

Cloning and Installation

  1. git clone the repository to your workstation (see the sketch after this list).
  2. Run initial setup:
    task setup
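A minimal sketch of both steps; the repository URL and directory name are placeholders, so substitute the actual ones for this project:

git clone <repository-url>
cd <repository-directory>
task setup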

Environment Setup

Dagster, Spark, and Hive each have their own .env file. The contents of these files can be provided upon request. There are also .env.example files which you can use as a reference: copy the contents of each example file into a new file named .env in the same directory, then supply your own values.
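For example, assuming the example files sit at the top level of each component's folder (adjust the paths if they differ in your checkout, and do the same for the Hive .env):

cp dagster/.env.example dagster/.env
cp spark/.env.example spark/.env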

Ensure that the Pre-requisites have already been set up and all the necessary command-line executables are in your PATH.

Setting up your own warehouse and lakehouse

Make sure to fill in WAREHOUSE_USERNAME and LAKEHOUSE_USERNAME with the same value, preferably your first name. This will create a new warehouse-local-YOUR_NAME and lakehouse-local-YOUR_NAME, respectively, in our deployed ADLS dev environment.

We do this to avoid sharing warehouses among developers. With this, each developer essentially has two environments to work with: their own and the shared dev environment.
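For example, with a first name of Jane, the relevant entries in your .env would look like this (the value itself is just an illustration):

WAREHOUSE_USERNAME=jane
LAKEHOUSE_USERNAME=jane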

  • warehouse-local - This is where the bronze, silver, and gold Delta Lake tables reside. This is also what Trino uses to get the data needed for data-ingestion to run properly (e.g., it reads the ingested schemas from here).

  • lakehouse-local - This is where files uploaded by data-ingestion, or manually through ADLS, will reside. This is also where intermediate assets generated by the ops in our Dagster pipelines will reside.

Running the Application

# spin up Docker containers
task

# Follow Docker logs
task logs

# List all tasks (inspect Taskfile.yml to see the actual commands being run)
task -l

Once you have initialized Dagster, follow the steps below to initialize the schemas needed for the upload flow in giga-ingestion to work properly.

  1. Go to localhost:3001 to access the Dagster UI.
  2. Go to Jobs.
  3. Click on admin__create_lakehouse_local_job and materialize the asset. This spawns a job in the Runs tab which creates your own lakehouse folder and copies the schemas from ADLS into raw/schema.
  4. Afterwards, go to Overview > Sensors and turn on migrations__schema_sensor. This spawns runs that populate the needed schema tables (7 at the time of writing).
  5. Validate by running data-ingestion and trino together, then starting the Upload File Flow. It should no longer error out on the first screen.

Housekeeping

At the end of your development tasks, stop the containers to free resources:

task stop

Adding dependencies

Example: Adding dagster-azure

# cd to relevant folder
cd dagster

# Add the dependency using poetry
poetry add dagster-azure

# Re-run task
task