Data Pipeline for Data Science, Part 4: Airflow & Data Pipelines

⚡️Hudson Ⓜ️endes
5 min read · Sep 16, 2020

Deploy Operators and DAGs to an AWS-hosted Apache Airflow and execute your Data Pipelines with DAG and Data Lineage visualisation.

DAG as illustrated by Apache Airflow

Wanna hear occasional rants about Tensorflow, Keras, DeepLearning4J, Python and Java?

Join me on twitter @ twitter.com/hudsonmendes!

Taking Machine Learning models to production is a battle. I share my learnings (and my sorrows) there, so we can learn together!

Data Pipeline for Data Science Series

This is a large tutorial that we have tried to keep conveniently digestible for the occasional reader, so it is divided into the following parts:

Part 1: Problem/Solution Fit
Part 2: TMDb Data “Crawler”
Part 3: Infrastructure As Code
Part 4: Airflow & Data Pipelines
(soon available) Part 5: DAG, Film Review Sentiment Classifier Model
(soon available) Part 6: DAG, Data Warehouse Building
(soon available) Part 7: Scheduling and Landings

The Problem: Deploying DAGs with Airflow

This project has the following problem statement:

Data Analysts must be able to produce reports on-demand, as well as run several roll-ups and drill-down queries into what the Review Sentiment is for both IMDb films and IMDb actors/actresses, based on their TMDb Film Reviews; And the Sentiment Classifier must be our own.

In Part 3 we set up the entire Airflow & Redshift infrastructure.

Now all we have to do is use it: run our DAGs on Airflow and deploy our Data Warehouse into AWS Redshift, so that we can run our analytics.

The full source code for this solution can be found at https://github.com/hudsonmendes/nanodataeng-capstone.

Deploying DAGs with Jupyter? No!

Our code to deploy DAGs lives in a Jupyter Notebook, but this choice was made only because of the explanatory nature of this project.

Very Important: your code to deploy DAGs must NOT be in a Jupyter Notebook. You should instead use a Continuous Integration system such as Jenkins.

Your Continuous Integration pipeline must cover the following:

  • Build the code
  • Run lint + static code analysis
  • Run unit & integration tests
  • Deploy the DAGs to the server

Shared Components

The following components are used throughout the DAG setup process.

AWS EC2 KeyPair

In the previous article (Part 3) we created a very important file: the private key of the "udacity-data-eng-nano" EC2 key pair.

This file will now be used to SCP (copy) the DAG files into our Airflow Server.
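
A minimal sketch of how that connection can be opened from Python, assuming the paramiko package, an Amazon Linux host (user "ec2-user"), and the private key file sitting next to the notebook; the host name and file name below are placeholders:

import paramiko

AIRFLOW_HOST = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com"  # placeholder public DNS
KEY_FILE = "udacity-data-eng-nano.pem"                    # assumed private key file name


def connect_airflow_ssh() -> paramiko.SSHClient:
    # Opens an SSH session to the Airflow EC2 instance using the key pair.
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname=AIRFLOW_HOST, username="ec2-user", key_filename=KEY_FILE)
    return ssh


ssh = connect_airflow_ssh()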

SCP Upload Procedure

As the next step, we have a routine that copies files from the local machine into our Airflow DAGs folder.

Once files are dropped into that folder, they are automatically picked up by Airflow, as long as there are no import errors!
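
A minimal sketch of such a routine, using the scp Python package and reusing the SSH client opened in the key-pair sketch above; the remote dags path is an assumption and should match your Airflow installation:

from scp import SCPClient

DAGS_FOLDER = "/home/ec2-user/airflow/dags"  # assumed remote dags folder


def upload_to_dags_folder(local_paths):
    # Copies local files/folders straight into the Airflow dags folder.
    with SCPClient(ssh.get_transport()) as scp_client:
        scp_client.put(local_paths, remote_path=DAGS_FOLDER, recursive=True)


upload_to_dags_folder(["dags/"])  # illustrative local layout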

Airflow Folder Structure

In order to run our Data Pipeline, we need a few folders created:

  • Data: where the raw data copied from HTTP sources will be dropped
  • Model: where we place the TensorFlow SavedModel
  • Working: temporary folder for files being generated
  • Images: where we place the assets used in the generated PDF report
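
A sketch of how those folders could be created over the same SSH connection; the base path is an assumption and should match the Airflow home on your server:

AIRFLOW_HOME = "/home/ec2-user/airflow"  # assumed Airflow home on the EC2 instance

for folder in ("data", "model", "working", "images"):
    # mkdir -p is idempotent: it only creates the folder if it is missing.
    _, stdout, stderr = ssh.exec_command(f"mkdir -p {AIRFLOW_HOME}/{folder}")
    print(stdout.read().decode(), stderr.read().decode())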

Airflow Configuration

Our DAGs use a number of Variables and Connections (check the Airflow documentation to better understand these concepts).

Here we set up both, so that our DAGs can run:
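
Here is a hedged sketch of that setup, driving the Airflow CLI over the same SSH connection; the variable and connection names are illustrative rather than the project's actual ones, and the flags shown are Airflow 2.x style (1.10.x uses "airflow variables -s" and "airflow connections -a"):

# Illustrative variables/connections only; replace them with the ones your DAGs expect.
commands = [
    "airflow variables set data_folder /home/ec2-user/airflow/data",
    "airflow variables set model_folder /home/ec2-user/airflow/model",
    # Redshift is commonly reached through a postgres-style connection URI.
    "airflow connections add redshift "
    "--conn-uri 'postgres://user:password@redshift-host:5439/dbname'",
]

for command in commands:
    _, stdout, stderr = ssh.exec_command(command)
    print(stdout.read().decode(), stderr.read().decode())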

Uploading Operators

DAGs are graphs formed by operators (the vertices) connected to one another by edges (their dependencies). In this step we shall upload our custom operators to the DAGs folder.
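
For illustration only (the operator below is hypothetical, not one of the project's actual operators), a custom operator is just a subclass of BaseOperator that implements execute:

import requests

from airflow.models import BaseOperator


class HttpToLocalOperator(BaseOperator):
    """Hypothetical operator: downloads an HTTP resource into a local file."""

    def __init__(self, url: str, destination: str, **kwargs):
        super().__init__(**kwargs)
        self.url = url
        self.destination = destination

    def execute(self, context):
        # Fetch the resource and write it to the destination path.
        response = requests.get(self.url)
        response.raise_for_status()
        with open(self.destination, "wb") as file:
            file.write(response.content)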

Finally, Uploading DAGs

The DAGs (Directed Acyclic Graphs) are the main program pieces of our Data Pipeline.

Once these are uploaded, provided they have no import errors, Airflow picks them up automatically and presents them in the Airflow web interface.
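
For reference, a minimal sketch of what such a DAG file can look like, wiring the hypothetical operator from the previous section into a small graph; the dag_id, URLs and paths are placeholders:

from datetime import datetime

from airflow import DAG
from http_to_local_operator import HttpToLocalOperator  # hypothetical module/operator

with DAG(
    dag_id="tmdb_reviews_pipeline",  # illustrative dag_id
    start_date=datetime(2020, 9, 1),
    schedule_interval=None,          # triggered manually from the web interface
    catchup=False,
) as dag:
    download_films = HttpToLocalOperator(
        task_id="download_films",
        url="https://example.com/films.json",  # placeholder URL
        destination="/home/ec2-user/airflow/data/films.json",
    )
    download_reviews = HttpToLocalOperator(
        task_id="download_reviews",
        url="https://example.com/reviews.json",  # placeholder URL
        destination="/home/ec2-user/airflow/data/reviews.json",
    )

    # The >> operator declares the edges of the graph: films before reviews.
    download_films >> download_reviews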

Launching DAGs in the Airflow Web Interface

Once Airflow has recognised the DAGs, it will display them in its web interface, as in the image below:

In order to launch them, you must first enable them (the on/off toggle) and then click the trigger button.

After doing that, you can check the DAG execution illustration and the Data Lineage illustration, as shown below:
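
If you prefer the command line, the same enable-and-launch step can be done with the Airflow CLI over SSH (Airflow 2.x syntax shown; 1.10.x uses "airflow unpause" and "airflow trigger_dag"); the dag_id is the illustrative one from the DAG sketch above:

for command in (
    "airflow dags unpause tmdb_reviews_pipeline",  # enable the DAG (the on/off toggle)
    "airflow dags trigger tmdb_reviews_pipeline",  # launch a new DAG run
):
    _, stdout, stderr = ssh.exec_command(command)
    print(stdout.read().decode(), stderr.read().decode())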

In Summary

Using Python, we have:

  1. Used scp to copy the DAG Operators and the DAGs
  2. Used the Web Interface to Launch the Data Pipelines
  3. Checked the DAG Run

Now that our Data Pipeline has run, our Data Warehouse is waiting for us to run queries and achieve our objective of running drill-downs and roll-ups on our facts table.

Next Steps

In the next article, Part 5: DAG, Film Review Sentiment Classifier Model, we will understand what the Classification Model Trainer DAG does, and how we leveraged Data Science to produce our Data Warehouse.

Find the end-to-end solution source code at https://github.com/hudsonmendes/nanodataeng-capstone.

Wanna keep in Touch? LinkedIn!

My name is Hudson Mendes (@hudsonmendes). I’m a 38-year-old coder, husband, father of 3, ex-Startup Founder, ex-Peloton L7 Staff Tech Lead/Manager, nearly a BSc in Computer Science from the University of London, and a Senior AI/ML Engineering Manager.

I’ve been on the Software Engineering road for 22+ years, and occasionally write about topics of interest to other Senior Engineering Managers and Staff Machine Learning Engineers, with a bit of focus (but not exclusively) on Natural Language Processing.

Join me there, and I will keep you in the loop with my new learnings in AI/ML/LLMs and beyond!
