This repository provides an example Apache Airflow pipeline for local development of machine learning projects. The pipeline demonstrates how to fetch news articles from an open data source and apply a zero-shot classification NLP model to classify them into predefined categories.
In this example, we leverage Apache Airflow to automate the following steps:
- Data Loader: The pipeline retrieves news articles from an open data source and prepares them for further processing.
- Text Classification: We use a pre-trained NLP model for zero-shot classification to assign relevant categories to the news articles. This model can classify text into various categories without prior training on specific datasets (a minimal sketch follows this list).
- Results Aggregation: This task aggregates the classified news articles.
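To make the classification and aggregation steps more concrete, here is a minimal, self-contained sketch of zero-shot classification with the Hugging Face `transformers` library (it needs `transformers` and a backend such as `torch` installed). The model name, candidate categories, and sample headlines below are illustrative assumptions, not values taken from this repository's tasks.

```python
from collections import Counter

from transformers import pipeline

# Load a general-purpose zero-shot classifier (illustrative model choice).
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Hypothetical category set and sample headlines.
categories = ["markets", "technology", "politics", "sports"]
headlines = [
    "Central bank raises interest rates for the third time this year",
    "Chipmaker unveils a new processor aimed at AI workloads",
]

# Classify each headline; the labels come back sorted by score, so the
# first label is the model's best guess for that text.
predicted = [classifier(text, candidate_labels=categories)["labels"][0] for text in headlines]

# Aggregate: count how many articles fall into each category.
print(Counter(predicted))
```

The zero-shot pipeline scores every candidate label against the input text and returns the labels ranked by score, which is why no task-specific fine-tuning is required.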
The first two tasks are executed in separate Docker containers. Running them in their own containers keeps the pipeline modular and makes resource usage easier to control.
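The wiring of these tasks into containers lives in `dags/news_classifier.py`. As a rough, hypothetical sketch of the general pattern (the DAG id is the one mentioned in the steps below, but the task ids, image names, commands, and mount target are assumptions rather than the repository's actual code), tasks can be launched in their own containers with Airflow's `DockerOperator`, with the repository's `data` directory bind-mounted into each container:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount

# Shared data directory on the host; this is the path you point at your own
# clone of the repository during setup (see the steps below).
DATA_DIR = "<path_to_your_airflow-ml_repo>/data"

with DAG(
    dag_id="financial_news",          # DAG id used in this repository
    start_date=datetime(2024, 1, 1),  # illustrative value
    schedule=None,
    catchup=False,
) as dag:
    load_news = DockerOperator(
        task_id="load_news",                # assumed task id
        image="news-loader:latest",         # assumed custom image
        command="python load_news.py",      # assumed entrypoint
        mounts=[Mount(source=DATA_DIR, target="/data", type="bind")],
    )

    classify_news = DockerOperator(
        task_id="classify_news",            # assumed task id
        image="news-classifier:latest",     # assumed custom image
        command="python classify_news.py",  # assumed entrypoint
        mounts=[Mount(source=DATA_DIR, target="/data", type="bind")],
    )

    # The classifier runs after the loader; results aggregation follows downstream.
    load_news >> classify_news
```

The custom images such operators refer to are the ones built when you start the stack with `docker compose up --build`, as described in the steps below.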
To use this pipeline for local development, follow the steps below:
- Ensure that your Docker Engine has sufficient memory allocated, as running the pipeline may require more memory in certain cases.
- Change the path to your local repository in `dags/news_classifier.py`: replace `<path_to_your_airflow-ml_repo>/data` with your own path.
- Before the first Airflow run, prepare the environment by executing the following steps:
  - If you are working on Linux, specify the `AIRFLOW_UID` by running:

    ```bash
    echo -e "AIRFLOW_UID=$(id -u)" > .env
    ```

  - Perform the database migration and create the initial user account by running:

    ```bash
    docker compose up airflow-init
    ```

    The created user account has the login `airflow` and the password `airflow`.
- Start Airflow and build the custom images used to run tasks in Docker containers:

  ```bash
  docker compose up --build
  ```
- Access the Airflow web interface in your browser at http://localhost:8080.
- Trigger the DAG `financial_news` to initiate the pipeline execution.
- When you are finished working and want to clean up your environment, run:

  ```bash
  docker compose down --volumes --rmi all
  ```