Allow to install custom python libraries #592

Open
Maleware opened this issue Mar 7, 2025 · 1 comment

Comments

@Maleware
Member

Maleware commented Mar 7, 2025

Current Situation

If you want to use non-standard python libraries in an Airflow job, you need to build a custom image, pip install those libraries, and then use that custom image in your cluster.

Preferred Situation

You can configure a requirements.txt, which is then installed into the Airflow deployment.

Example

E.g. you want to use pandas==2.2.2 in a DAG; currently you would need to set up a CI/CD pipeline for building and deploying a custom Airflow image. The Dockerfile would look like this:

FROM oci.stackable.tech/sdp/airflow:${AIRFLOW_VERSION}-stackable${STACKABLE_VERSION}

ARG PYTHON_VERSION=3.9

# Install custom python libraries
RUN pip install \
    --no-cache-dir \
    --upgrade \
    pandas==2.2.2

Although this is fairly easy to do, it implies maintenance and resources. I consider this a fairly common use case, so we should think about whether we could cover it with something like the following (no strong opinion on the naming, where it should live in the CRD, or how):

---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 2.9.3
  clusterConfig:
    loadExamples: false
    exposeConfig: false
    credentialsSecret: simple-airflow-credentials
    requirements:
      configMap:
        name: custom-requirements

and a configMap

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-requirements
data:
  requirements.txt: |
    pandas==2.2.2 

I think a solution at the operator level would remove the pain of constructing and maintaining a build pipeline for the cluster. It moves the maintenance effort into the Airflow operator, but the operator already needs attention anyway (Stackable versions, product versions).

However, I can't evaluate how much effort we would need to put in to achieve this and what kind of risks it would imply.

@razvan
Member

razvan commented Mar 10, 2025

This approach has the major downside that it installs the DAG requirements in Airflow's own virtual environment, which may lead to conflicts and break the Airflow stacklet.

Stackable should make it as easy as possible to use isolated DAG environments and encourage the use of the Python*Operator approach as described here.
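
For illustration, a minimal DAG sketch using the PythonVirtualenvOperator, so that pandas only exists in the task's own virtualenv and never touches Airflow's environment (the dag_id and version pin are made up for this example):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def transform_with_pandas():
    # pandas is only available inside the task's virtualenv,
    # not in Airflow's own environment, so imports go inside the callable
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})
    print(df.describe())


with DAG(
    dag_id="pandas_in_virtualenv",
    start_date=datetime(2025, 3, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonVirtualenvOperator(
        task_id="transform",
        python_callable=transform_with_pandas,
        requirements=["pandas==2.2.2"],
    )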

Venvs can be built off-site and provisioned with PEX, venv-pack or conda-pack as described here.
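
And a sketch of the pre-built venv variant with the ExternalPythonOperator, assuming the environment (e.g. unpacked from a venv-pack archive) is already available on the worker; the path below is hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import ExternalPythonOperator


def transform_with_pandas():
    # resolved from the pre-built venv, not from Airflow's environment
    import pandas as pd

    print(pd.__version__)


with DAG(
    dag_id="pandas_in_prebuilt_venv",
    start_date=datetime(2025, 3, 1),
    schedule=None,
    catchup=False,
) as dag:
    ExternalPythonOperator(
        task_id="transform",
        # hypothetical path where the pre-built venv was unpacked on the worker
        python="/opt/venvs/pandas-env/bin/python",
        python_callable=transform_with_pandas,
    )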
