Welcome to the third tutorial in our series! At this point, you've already written your first DAG and used some basic
operators. Now it's time to build a small but meaningful data pipeline -- one that retrieves data from an external
source, loads it into a database, and cleans it up along the way.

This tutorial introduces the ``SQLExecuteQueryOperator``, a flexible and modern way to execute SQL in Airflow. We'll use
it to interact with a local Postgres database, which we'll configure in the Airflow UI.

By the end of this tutorial, you'll have a working pipeline that:

- Downloads a CSV file
- Loads the data into a staging table
- Cleans the data and upserts it into a target table

Along the way, you'll gain hands-on experience with Airflow's UI, connection system, SQL execution, and DAG authoring
patterns.

Want to go deeper as you go? Here are two helpful references:

- The `SQLExecuteQueryOperator <https://airflow.apache.org/docs/apache-airflow-providers-common-sql/stable/_api/airflow/providers/common/sql/operators/sql/index.html#airflow.providers.common.sql.operators.sql.SQLExecuteQueryOperator>`_ documentation
- The `Postgres provider <https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/index.html>`_ documentation

Let's get started!
Initial setup
-------------

.. caution::
    You'll need Docker installed to run this tutorial. We'll be using Docker Compose to launch Airflow locally. If you
    need help setting it up, check out the :doc:`Docker Compose quickstart guide </howto/docker-compose/index>`.

To run our pipeline, we need a working Airflow environment. Docker Compose makes this easy and safe -- no system-wide
installs required. Just open your terminal and run the following:

.. code-block:: bash

    # Start up all services
    docker compose up

Once Airflow is up and running, visit the UI at ``http://localhost:8080``.

Log in with:

- **Username:** ``airflow``
- **Password:** ``airflow``

You'll land in the Airflow dashboard, where you can trigger DAGs, explore logs, and manage your environment.
Create a Postgres Connection
----------------------------

Before our pipeline can write to Postgres, we need to tell Airflow how to connect to it. In the UI, open the **Admin >
Connections** page and click the + button to add a new connection. Note the Connection Id value -- the tasks below use
``tutorial_pg_conn`` -- and fill in the connection details for your local Postgres.

.. Screenshot: Add Connection form in Airflow's web UI with Postgres details filled in.
Save the connection. This tells Airflow how to reach the Postgres database running in your Docker environment.
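If you want to sanity-check the connection before building anything on top of it, here is a minimal sketch -- it assumes
the ``tutorial_pg_conn`` id above and is meant to be run from inside a task in this environment -- that issues a trivial
query through a ``PostgresHook``:

.. code-block:: python

    from airflow.providers.postgres.hooks.postgres import PostgresHook

    # Uses the connection created in the UI; a failure here points at the
    # connection settings rather than at the pipeline code.
    hook = PostgresHook(postgres_conn_id="tutorial_pg_conn")
    print(hook.get_records("SELECT 1;"))  # expected: [(1,)]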
Next, we'll start building the pipeline that uses this connection.

Create tables for staging and final data
----------------------------------------

Let's begin with table creation. We'll create two tables:

- ``employees_temp``: a staging table used for raw data
- ``employees``: the cleaned and deduplicated destination

We'll use the ``SQLExecuteQueryOperator`` to run the SQL statements needed to create these tables.

.. code-block:: python

    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

    create_employees_table = SQLExecuteQueryOperator(
        task_id="create_employees_table",
        conn_id="tutorial_pg_conn",
        sql="""
            CREATE TABLE IF NOT EXISTS employees (
                "Serial Number" NUMERIC PRIMARY KEY,
                "Company Name" TEXT,
                "Employee Markme" TEXT,
                "Description" TEXT,
                "Leave" INTEGER
            );""",
    )
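The staging table is created the same way. Here's a sketch of that second operator -- it assumes ``employees_temp``
mirrors the ``employees`` columns and is rebuilt from scratch on every run:

.. code-block:: python

    create_employees_temp_table = SQLExecuteQueryOperator(
        task_id="create_employees_temp_table",
        conn_id="tutorial_pg_conn",
        sql="""
            DROP TABLE IF EXISTS employees_temp;
            CREATE TABLE employees_temp (
                "Serial Number" NUMERIC PRIMARY KEY,
                "Company Name" TEXT,
                "Employee Markme" TEXT,
                "Description" TEXT,
                "Leave" INTEGER
            );""",
    )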
You can optionally place these SQL statements in ``.sql`` files inside your ``dags/`` folder and pass the file path to
the ``sql=`` argument. This can be a great way to keep your DAG code clean. For example, put the ``employees`` DDL in
``dags/sql/employees_schema.sql`` and reference it like this:

.. code-block:: python

    create_employees_table = SQLExecuteQueryOperator(
        task_id="create_employees_table",
        conn_id="tutorial_pg_conn",
        sql="sql/employees_schema.sql",
    )

Load data into the staging table
--------------------------------

Next, we'll download a CSV file, save it locally, and load it into ``employees_temp`` using the ``PostgresHook``.
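Here's a minimal sketch of that task. The source URL and local file path are placeholders -- point them at your own CSV
and at a directory that is writable inside your Airflow environment:

.. code-block:: python

    import os

    import requests

    from airflow.decorators import task
    from airflow.providers.postgres.hooks.postgres import PostgresHook


    @task
    def get_data():
        # Placeholder URL and path -- adjust both for your environment.
        url = "https://example.com/employees.csv"
        data_path = "/opt/airflow/dags/files/employees.csv"
        os.makedirs(os.path.dirname(data_path), exist_ok=True)

        # Download the CSV and save it locally.
        response = requests.get(url, timeout=30)
        with open(data_path, "w") as file:
            file.write(response.text)

        # Bulk-load the file into the staging table through the Postgres connection.
        postgres_hook = PostgresHook(postgres_conn_id="tutorial_pg_conn")
        conn = postgres_hook.get_conn()
        cur = conn.cursor()
        with open(data_path, "r") as file:
            cur.copy_expert(
                "COPY employees_temp FROM STDIN WITH CSV HEADER DELIMITER AS ',' QUOTE '\"'",
                file,
            )
        conn.commit()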
This task gives you a taste of combining Airflow with native Python and SQL hooks -- a common pattern in real-world
pipelines.
Merge and clean the data
------------------------

Now let's deduplicate the data and merge it into our final table. We'll select only the completely unique records from
the staging table, then check whether any employee ``Serial Numbers`` are already in the database -- if they are, those
records get updated with the new data. To do this, we'll write a task that runs a SQL ``INSERT ... ON CONFLICT DO
UPDATE``.
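Here's a sketch of that task -- it assumes the column layout shown earlier and uses the ``"Serial Number"`` primary key
as the conflict target:

.. code-block:: python

    from airflow.decorators import task
    from airflow.providers.postgres.hooks.postgres import PostgresHook


    @task
    def merge_data():
        # Keep only fully distinct rows from staging, then upsert them into employees.
        query = """
            INSERT INTO employees
            SELECT * FROM (
                SELECT DISTINCT * FROM employees_temp
            ) t
            ON CONFLICT ("Serial Number") DO UPDATE
            SET
                "Company Name" = excluded."Company Name",
                "Employee Markme" = excluded."Employee Markme",
                "Description" = excluded."Description",
                "Leave" = excluded."Leave";
        """
        hook = PostgresHook(postgres_conn_id="tutorial_pg_conn")
        conn = hook.get_conn()
        cur = conn.cursor()
        cur.execute(query)
        conn.commit()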
Completing our DAG
------------------

We've developed our tasks; now we need to wrap them in a DAG, which lets us define when and how tasks should run and
state any dependencies that tasks have on other tasks. The DAG below is configured to:

* run every day at midnight starting on Jan 1, 2021,
* only run once in the event that days are missed, and
* time out after 60 minutes.

Putting all of the pieces together, we have our completed DAG.
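The tasks from the previous sections live inside a ``@dag``-decorated function. As a sketch of that wrapper (the
schedule, start date, catchup setting, and timeout reflect the configuration described above; the task wiring comment
is illustrative):

.. code-block:: python

    import datetime

    import pendulum

    from airflow.decorators import dag


    @dag(
        dag_id="process_employees",
        schedule="0 0 * * *",  # daily at midnight
        start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
        catchup=False,  # don't backfill runs for days that were missed
        dagrun_timeout=datetime.timedelta(minutes=60),
    )
    def ProcessEmployees():
        # Define create_employees_table, create_employees_temp_table, get_data and
        # merge_data here (see the sections above), then wire them together, e.g.:
        # [create_employees_table, create_employees_temp_table] >> get_data() >> merge_data()
        ...


    dag = ProcessEmployees()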
Save this DAG as ``dags/process_employees.py``. After a short delay (see
`dag_dir_list_interval <https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#dag-dir-list-interval>`_),
it will show up in the UI.
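If you'd rather not wait for the scheduler to pick the file up, a quick sanity check -- a sketch that assumes it's run
somewhere the ``airflow`` package is importable and the ``dags/`` folder is the working directory -- is to load the
folder with a ``DagBag``:

.. code-block:: python

    from airflow.models import DagBag

    # Parse the dags/ folder directly and confirm our DAG imported cleanly.
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert "process_employees" in dag_bag.dags, dag_bag.import_errors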
Trigger and explore your DAG
----------------------------

Open the Airflow UI and find the ``process_employees`` DAG in the list. Toggle it "on" using the slider, then trigger a
run using the play button.

You can watch each task as it runs in the **Grid** view, and explore logs for each step.