Allow to install custom python libraries #592

Open
Maleware opened this issue Mar 7, 2025 · 1 comment

Comments

@Maleware
Member

Maleware commented Mar 7, 2025

Current Situation

If you want to use non-standard python libraries in an Airflow job, you need to build a custom image, pip install those libraries, and then use that custom image in your cluster.

Preferred Situation

You can configure a requirements.txt, which is then installed into the Airflow deployment.

Example

E.g. you want to use pandas==2.2.2 in a DAG; currently you would need to set up a CI/CD pipeline for building and deploying a custom Airflow image. The Dockerfile would look like this:

FROM oci.stackable.tech/sdp/airflow:${AIRFLOW_VERSION}-stackable${STACKABLE_VERSION}

ARG PYTHON_VERSION=3.9

# Install custom python libraries
RUN pip install \
    --no-cache-dir \
    --upgrade \
    pandas==2.2.2

Although this is fairly easy to do, it implies maintenance and resources. I consider this a fairly common use case, so we should think about whether we could cover it with something like the following (no strong opinion on the naming, where it should live in the CRD, or how):

---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 2.9.3
  clusterConfig:
    loadExamples: false
    exposeConfig: false
    credentialsSecret: simple-airflow-credentials
    requirements:
      configMap:
        name: custom-requirements

and a configMap

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-requirements
data:
  requirements.txt: |
    pandas==2.2.2 

I think a solution at the operator level would remove the pain of constructing and maintaining a build pipeline for the cluster. It moves the maintenance effort into the Airflow operator, but the operator already needs attention anyway (Stackable versions, product versions).

However, I can't evaluate how much effort we would need to put in to achieve this and what kind of risks it would imply.

@razvan
Member

razvan commented Mar 10, 2025

This approach has the major downside that it installs the DAG requirements in Airflow's own virtual environment, which may lead to conflicts and break the Airflow stacklet.

Stackable should make it as easy as possible to use isolated DAG environments and encourage the use of the Python*Operator approach as described here.
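
For illustration, a minimal DAG sketch using the PythonVirtualenvOperator, so that pandas only exists in the task's own virtualenv and never touches Airflow's environment (the dag_id and version pin are made up for this example):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def transform_with_pandas():
    # pandas is only available inside the task's virtualenv,
    # not in Airflow's own environment, so imports go inside the callable
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3]})
    print(df.describe())


with DAG(
    dag_id="pandas_in_virtualenv",
    start_date=datetime(2025, 3, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonVirtualenvOperator(
        task_id="transform",
        python_callable=transform_with_pandas,
        requirements=["pandas==2.2.2"],
    )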

Venvs can be built off-site and provisioned with PEX, venv-pack or conda-pack as described here.
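
And a sketch of the pre-built venv variant with the ExternalPythonOperator, assuming the environment (e.g. unpacked from a venv-pack archive) is already available on the worker; the path below is hypothetical:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import ExternalPythonOperator


def transform_with_pandas():
    # resolved from the pre-built venv, not from Airflow's environment
    import pandas as pd

    print(pd.__version__)


with DAG(
    dag_id="pandas_in_prebuilt_venv",
    start_date=datetime(2025, 3, 1),
    schedule=None,
    catchup=False,
) as dag:
    ExternalPythonOperator(
        task_id="transform",
        # hypothetical path where the pre-built venv was unpacked on the worker
        python="/opt/venvs/pandas-env/bin/python",
        python_callable=transform_with_pandas,
    )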
