Data Pipeline for Data Science, Part 3: Infrastructure As Code

⚡️Hudson Ⓜ️endes
9 min read · Sep 16, 2020

Deploying AWS EC2 Machines, AWS Redshift and Apache Airflow using a repeatable, interactive process coded into an explanatory Jupyter Notebook.

Wanna hear occasional rants about Tensorflow, Keras, DeepLearning4J, Python and Java?

Join me on twitter @ twitter.com/hudsonmendes!

Taking Machine Learning models to production is a battle, and I share my learnings (and my sorrows) there, so we can learn together!

Data Pipeline for Data Science Series

This is a large tutorial that we have broken into conveniently small parts for the occasional reader:

Part 1: Problem/Solution Fit
Part 2: TMDb Data “Crawler”
Part 3: Infrastructure As Code
Part 4: Airflow & Data Pipelines
(soon available) Part 5: DAG, Film Review Sentiment Classifier Model
(soon available) Part 6: DAG, Data Warehouse Building
(soon available) Part 7: Scheduling and Landings

The Problem: Setup Data Pipeline Infrastructure

This project has the following problem statement:

Data Analysts must be able to produce reports on-demand, as well as run several roll-ups and drill-down queries into what the Review Sentiment is for both IMDb films and IMDb actors/actresses, based on their TMDb Film Reviews; And the Sentiment Classifier must be our own.

In Part 2 we figured out a way to link IMDb Ids with TMDb Ids, so that we could collect TMDb Film Reviews using AWS Lambda.

In order to produce the Fact Tables (Data Warehouse) that we need as the final product of our Data Analytics, we must set up a fair amount of infrastructure.

Brace yourselves, this will be a slightly longer article.

The full source code for this solution can be found at https://github.com/hudsonmendes/nanodataeng-capstone.

Jupyter Notebooks: Infrastructure as Code

Although Jupyter Notebooks sometimes get overused, and may not be ideal for every task, it has become increasingly common to find Infrastructure as Code in this format.

The benefit of having a notebook is that the explanation is "runnable". A similar effect may be achieved with code comments, but the markdown support and ease of use of notebooks make them more welcoming for writers.

Infrastructure Components

As defined in our notebook:

Components of our infrastructure

One by one:

  • Data Lake: for the present project, this is nothing more than an AWS S3 Bucket that will hold our files.
  • Airflow: must be installed on an EC2 Machine that, in our case, can also be a Spot Instance (so we can save some money).
  • Classifier Model: must be trained and, in this case, can be left in a local folder inside the EC2 Machine after it is trained by the Airflow DAG.
  • Data Warehouse: will live inside an AWS Redshift Database.

From these requirements, we can quickly infer that the following components are needed:

  1. AWS S3 Bucket
  2. AWS EC2 KeyPair
  3. AWS IAM Role, AWS IAM Policy, and AWS IAM Instance Profile for the AWS EC2 Machine running as a Spot Instance
  4. AWS EC2 Security Group for Airflow (with the necessary Ingress Rules)
  5. AWS EC2 Spot Instance Request that will launch our EC2 Machine
  6. Finally, the AWS EC2 Machine, where we will install our Airflow.
  7. The AWS EC2 ElasticIP used to access the AWS EC2 Machine.
  8. AWS Redshift Cluster
  9. AWS IAM Role and AWS IAM Policy for the Redshift Database, allowing access to our AWS S3 Bucket for our COPY commands.
  10. AWS Redshift Database

Let's now see how each one of those is created.

Shared Components

We will start by installing the shared components that are used across our code.

Installing Boto3

Boto3 is the Python library that we use in order to manage our AWS account programmatically. We install the library by doing the following:
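Something along these lines works inside the notebook:

```python
# Install boto3 into the notebook's environment
!pip install boto3
```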

AWS EC2 KeyPair

We must create a KeyPair that we will link to our AWS EC2 Machine, but that will also be used in order to SSH/SCP programmatically into our machine to install dependencies and copy our DAG files.
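A sketch of that step with boto3 (the key pair name, local path and region below are assumptions, not the notebook's exact values):

```python
import os
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # region is an assumption

KEY_PAIR_NAME = "airflow-key-pair"                    # hypothetical key pair name
KEY_PAIR_PATH = f"./{KEY_PAIR_NAME}.pem"              # hypothetical local path

# Create the KeyPair and persist the private key so we can SSH/SCP into the machine later
key_pair = ec2.create_key_pair(KeyName=KEY_PAIR_NAME)
with open(KEY_PAIR_PATH, "w") as pem_file:
    pem_file.write(key_pair["KeyMaterial"])
os.chmod(KEY_PAIR_PATH, 0o400)                        # private keys must not be world-readable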

Airflow: Infrastructure

To set up Airflow, we must first install some Linux dependencies, so that we can later install Airflow itself, configure it, and copy our DAG files.

Let's see how it goes, step by step:

AWS IAM Role, AWS IAM Policy, and AWS IAM Instance Profile

See below how we create the:

  • AWS IAM Role,
  • AWS IAM Policy and
  • AWS IAM Instance Profile
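A sketch of those three calls with boto3 (the role, policy, profile and bucket names are assumptions, not the exact ones used in the notebook):

```python
import json
import boto3

iam = boto3.client("iam")

ROLE_NAME = "airflow-ec2-role"                  # hypothetical names
PROFILE_NAME = "airflow-ec2-instance-profile"
BUCKET_NAME = "my-data-lake-bucket"             # assumption: your Data Lake bucket

# Role that EC2 instances are allowed to assume
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)

# Inline policy granting the EC2 machine access to the Data Lake bucket
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            f"arn:aws:s3:::{BUCKET_NAME}",
            f"arn:aws:s3:::{BUCKET_NAME}/*",
        ],
    }],
}
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="airflow-ec2-s3-access",
    PolicyDocument=json.dumps(s3_policy),
)

# Instance profile that the Spot Instance will carry
iam.create_instance_profile(InstanceProfileName=PROFILE_NAME)
iam.add_role_to_instance_profile(InstanceProfileName=PROFILE_NAME, RoleName=ROLE_NAME)
```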

AWS EC2 Security Group

In order to configure the firewall rules that allow inbound requests to Airflow, we must create an AWS EC2 Security Group.
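A sketch of that creation (the group name and wide-open CIDR ranges are assumptions; port 8080 is Airflow's default webserver port):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Security Group created in the default VPC; pass VpcId= if you use a custom one
airflow_sg = ec2.create_security_group(
    GroupName="airflow-security-group",
    Description="Inbound rules for the Airflow EC2 machine",
)
ec2.authorize_security_group_ingress(
    GroupId=airflow_sg["GroupId"],
    IpPermissions=[
        # SSH, so we can install dependencies and copy DAG files
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        # Airflow webserver (default port 8080)
        {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
)
```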

AWS EC2 Spot Instance Request

With all other components ready, we can now request our Spot Instance that will create the AWS EC2 Machine that we need in order to run Airflow.
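A sketch of the Spot Instance Request (the AMI id, instance type and names below are placeholders, not the notebook's exact values):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

SECURITY_GROUP_ID = "sg-0123456789abcdef0"  # id returned when creating the Security Group above

spot_request = ec2.request_spot_instances(
    InstanceCount=1,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-0abcdef1234567890",              # hypothetical Ubuntu AMI
        "InstanceType": "t3.medium",                      # sizing is an assumption
        "KeyName": "airflow-key-pair",                    # the KeyPair created earlier
        "SecurityGroupIds": [SECURITY_GROUP_ID],
        "IamInstanceProfile": {"Name": "airflow-ec2-instance-profile"},
    },
)
spot_request_id = spot_request["SpotInstanceRequests"][0]["SpotInstanceRequestId"]
```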

AWS EC2 Machine

The AWS EC2 Machine is created by the Spot Instance Request, not directly by us, so we need to wait until it becomes available.

Once available, we grab hold of the instance id.
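A sketch of that wait, using boto3 waiters:

```python
# spot_request_id is the value returned by the Spot Instance Request above
waiter = ec2.get_waiter("spot_instance_request_fulfilled")
waiter.wait(SpotInstanceRequestIds=[spot_request_id])

# Read the instance id back from the fulfilled request
described = ec2.describe_spot_instance_requests(SpotInstanceRequestIds=[spot_request_id])
instance_id = described["SpotInstanceRequests"][0]["InstanceId"]

# And wait for the machine itself to be up before we try to SSH into it
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```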

AWS EC2 ElasticIP

We now allocate and attach an AWS EC2 Elastic IP to the EC2 machine, so that we can access it via SSH and via HTTP.
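A sketch of the allocation and association:

```python
# Allocate a new Elastic IP and attach it to the instance created by the Spot Request
elastic_ip = ec2.allocate_address(Domain="vpc")
ec2.associate_address(
    AllocationId=elastic_ip["AllocationId"],
    InstanceId=instance_id,            # obtained in the previous step
)
airflow_host = elastic_ip["PublicIp"]  # we will SSH and browse Airflow through this IP
```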

Airflow: Dependencies

Setting up a machine remotely requires us to SSH to run commands and SCP to copy files over. Fortunately, this is easy to achieve with two libraries called paramiko and scp.

Preparing SSH and SCP

First we install the dependencies:
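Again, a single notebook cell is enough:

```python
# paramiko gives us SSH; scp gives us file copies over that SSH connection
!pip install paramiko scp
```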

And now we create a few methods that will help us run the commands:
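A sketch of such helpers (the host, user and key path are assumptions based on the resources created above):

```python
import paramiko
from scp import SCPClient

AIRFLOW_HOST = "x.x.x.x"                  # assumption: the Elastic IP allocated above
AIRFLOW_USER = "ubuntu"                   # assumption: default user of the chosen AMI
KEY_PAIR_PATH = "./airflow-key-pair.pem"  # the private key saved earlier

def create_ssh_client() -> paramiko.SSHClient:
    """Opens an SSH connection to the Airflow EC2 machine."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname=AIRFLOW_HOST, username=AIRFLOW_USER, key_filename=KEY_PAIR_PATH)
    return ssh

def run_command(command: str) -> str:
    """Runs a shell command on the remote machine and returns its output."""
    ssh = create_ssh_client()
    try:
        _, stdout, stderr = ssh.exec_command(command)
        output, errors = stdout.read().decode(), stderr.read().decode()
        if errors:
            print(errors)
        return output
    finally:
        ssh.close()

def copy_file(local_path: str, remote_path: str) -> None:
    """Copies a local file to the remote machine over SCP."""
    ssh = create_ssh_client()
    try:
        with SCPClient(ssh.get_transport()) as scp_client:
            scp_client.put(local_path, remote_path)
    finally:
        ssh.close()
```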

Once these preparations are done, we can start installing the Linux Dependencies that must be available in order to run Airflow.

Airflow Packages & Local MySQL Database for Airflow

In this step we install Airflow pip packages, that will be later used to run Airflow.

Although we do not use MySQL for our Data Warehouse, Airflow needs a database in order to keep track of DAGs, runs, etc. So we install a local instance of MySQL here to hold the Airflow metadata.
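A rough sketch of that remote installation, using the run_command helper defined earlier (package names, versions and the MySQL credentials below are assumptions and will vary with your base image):

```python
# Base packages plus the Airflow distribution with MySQL support
run_command("sudo apt-get update -y")
run_command("sudo apt-get install -y python3-pip libmysqlclient-dev mysql-server")
run_command("pip3 install 'apache-airflow[mysql]'")

# A local MySQL database for the Airflow metadata (see the warning below)
run_command("sudo mysql -e \"CREATE DATABASE IF NOT EXISTS airflow;\"")
run_command("sudo mysql -e \"CREATE USER IF NOT EXISTS 'airflow'@'localhost' IDENTIFIED BY 'airflow';\"")
run_command("sudo mysql -e \"GRANT ALL PRIVILEGES ON airflow.* TO 'airflow'@'localhost';\"")
```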

VERY IMPORTANT: In your production environment you should NOT have a local MySQL database. Instead, you should have your database set up in AWS RDS or similar. Otherwise, if you lose your Spot Instance, you will also lose track of runs, data lineage, etc.

Initializing Airflow DB & Configuration (.ini file)

Airflow installs the database structures it needs in order to run. That is done using the command airflow initdb.

Configuration for Airflow must be done by changing a .ini file.

It would be possible to change the configuration using sed (a popular command line tool), but managing the regular expressions would be harder than we'd want it to be. Instead, we use a package called crudini to do it.
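A sketch of the initialisation and configuration over SSH (the connection string matches the hypothetical MySQL setup above; the sql_alchemy_conn and executor keys are standard Airflow 1.10 settings):

```python
# First run generates ~/airflow/airflow.cfg and a default metadata database
run_command("airflow initdb")

# Point Airflow at the local MySQL database instead, editing the .ini file with crudini
AIRFLOW_CFG = "~/airflow/airflow.cfg"
run_command("pip3 install crudini")
run_command(f"crudini --set {AIRFLOW_CFG} core sql_alchemy_conn "
            "mysql://airflow:airflow@localhost:3306/airflow")
run_command(f"crudini --set {AIRFLOW_CFG} core executor LocalExecutor")

# Re-initialise, now against MySQL
run_command("airflow initdb")
```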

Creating DAGs folder

We now create the folder where our DAGs code should live.
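With the helper, and assuming Airflow's default dags_folder location:

```python
# Create the folder Airflow watches for DAG files (default: ~/airflow/dags)
run_command("mkdir -p ~/airflow/dags")
```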

Airflow: DAG Dependencies & Launch

DAGs (Directed Acyclic Graphs) are graph-based representations of workflows. When implemented using the Airflow library, Airflow is capable of presenting them visually, as well as running them.

DAG as illustrated by Apache Airflow

We must now set up Airflow (Variables and Connections) as well as copy the DAG files, so that Airflow can recognise these files and present them in the system.
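As an illustration, using the Airflow 1.10 CLI through our helpers (the variable/connection names, values and file paths below are hypothetical placeholders, not the project's exact ones):

```python
# Airflow Variables and Connections that the DAGs can read at runtime
run_command("airflow variables --set s3_bucket my-data-lake-bucket")
run_command("airflow connections --add --conn_id redshift --conn_type postgres "
            "--conn_host <redshift-endpoint> --conn_port 5439 "
            "--conn_schema reviews --conn_login awsuser --conn_password <password>")

# Copy a DAG file into the dags_folder so Airflow can pick it up
copy_file("dags/example_dag.py", "/home/ubuntu/airflow/dags/example_dag.py")  # hypothetical file
```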

Installing DAG dependencies

For this setup, all DAGs run "in-process" (on the same machine as Airflow itself). Alternative configurations are possible and recommended, but for the scope of this project, we will not change this.

In order to run our DAGs, we must install the dependencies that they need:
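The exact requirements ship with the DAG code in the next parts; as an illustration, the installation goes through the same helper:

```python
# Placeholder list: the real dependencies are defined by the DAGs themselves
run_command("pip3 install pandas boto3 requests")
```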

Starting Airflow

We will now start Airflow and leave its components running as daemons (-D) instead of interactively, so that they do not die with our SSH session or with a keyboard interrupt sequence.
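A sketch of that launch, still over SSH:

```python
# Run the scheduler and the webserver as daemons so they survive the SSH session
run_command("airflow scheduler -D")
run_command("airflow webserver -D -p 8080")
```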

The last infrastructure component that we require is Redshift, which we will show next.

AWS Redshift

Redshift is a columnar database based on Postgres that is ideal for common Data Warehouse query patterns, such as roll-ups and drill-downs, widely used in Data Analysis.

Installing Dependencies

We will need to manage resources in EC2, IAM and Redshift. We therefore create their clients using Boto3.
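A sketch of those clients (the region is an assumption):

```python
import boto3

REGION = "us-east-1"  # assumption: use your own region

ec2 = boto3.client("ec2", region_name=REGION)
iam = boto3.client("iam", region_name=REGION)
redshift = boto3.client("redshift", region_name=REGION)
```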

AWS IAM Role for RedShift

We must now create the role that our Redshift Cluster will assume.

Important: this role must have Read access to our AWS S3 Bucket, from where it will run COPY commands.
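A sketch of that role (the role name is hypothetical, and the AWS-managed read-only S3 policy stands in for the notebook's exact policy):

```python
import json

ROLE_NAME = "redshift-s3-read-role"  # hypothetical name

# Role that the Redshift cluster can assume
redshift_assume_role = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName=ROLE_NAME,
                AssumeRolePolicyDocument=json.dumps(redshift_assume_role))

# Read access to S3, so COPY commands can load files from the Data Lake bucket
iam.attach_role_policy(RoleName=ROLE_NAME,
                       PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess")

redshift_role_arn = iam.get_role(RoleName=ROLE_NAME)["Role"]["Arn"]
```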

AWS EC2 Security Group for Redshift

We now set up the Security Group containing the Ingress Rules that make our Redshift server accessible from the internet.

Obviously, review the ingress rule to fit your security needs.

We also allocate an Elastic IP that we can use in order to access our Redshift.
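A sketch of both steps (the group name is hypothetical, and the wide-open CIDR should be tightened as noted above):

```python
# Security Group opening the default Redshift port (5439) to the internet
redshift_sg = ec2.create_security_group(
    GroupName="redshift-security-group",
    Description="Inbound rules for the Redshift cluster",
)
ec2.authorize_security_group_ingress(
    GroupId=redshift_sg["GroupId"],
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 5439, "ToPort": 5439,
                    "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
)

# Elastic IP that the cluster will be reachable on
redshift_eip = ec2.allocate_address(Domain="vpc")
```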

AWS Redshift Cluster

We will now request the creation of our Redshift Cluster, using specific credentials and the Elastic IP that we previously allocated.
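A sketch of the request (cluster identifier, sizing and credentials are assumptions; the role, security group and Elastic IP come from the previous steps):

```python
redshift.create_cluster(
    ClusterIdentifier="reviews-cluster",               # hypothetical identifier
    ClusterType="single-node",
    NodeType="dc2.large",                               # sizing is an assumption
    DBName="reviews",
    MasterUsername="awsuser",
    MasterUserPassword="<choose-a-strong-password>",
    IamRoles=[redshift_role_arn],                       # role created above
    VpcSecurityGroupIds=[redshift_sg["GroupId"]],       # security group created above
    ElasticIp=redshift_eip["PublicIp"],                 # the Elastic IP allocated above
    PubliclyAccessible=True,
)

# Wait until the cluster is available and grab its endpoint
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="reviews-cluster")
cluster = redshift.describe_clusters(ClusterIdentifier="reviews-cluster")["Clusters"][0]
redshift_endpoint = cluster["Endpoint"]["Address"]
```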

Data Warehouse Database

Using the ipython-sql Jupyter extension we can have entire notebook cells running as SQL.

We will now connect our notebook to the database in order to run them:
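A sketch of that connection (the driver and credentials are assumptions matching the hypothetical cluster above):

```python
!pip install ipython-sql psycopg2-binary
%load_ext sql

# Build the connection string from the cluster endpoint obtained earlier
connection_string = f"postgresql://awsuser:<password>@{redshift_endpoint}:5439/reviews"
%sql $connection_string
```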

Once connected, we will create our Dimension Tables:
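The real schema lives in the repository; an illustrative dimension-table cell (table and column names here are hypothetical) looks like this:

```sql
%%sql
CREATE TABLE IF NOT EXISTS dim_films (
    film_id BIGINT PRIMARY KEY,
    imdb_id VARCHAR(16),
    tmdb_id BIGINT,
    title   VARCHAR(512)
);
```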

We then create our Fact Tables:
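And a matching illustrative fact-table cell (again, hypothetical names):

```sql
%%sql
CREATE TABLE IF NOT EXISTS fact_film_review_sentiments (
    review_id   BIGINT PRIMARY KEY,
    film_id     BIGINT REFERENCES dim_films (film_id),
    sentiment   SMALLINT,     -- 0 = negative, 1 = positive (assumption)
    reviewed_at TIMESTAMP
);
```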

And that's it, we now have our Database Structure ready to receive the Data that our Data Pipeline will continuously process and deliver into it.

In Summary

Using Python, boto3 and paramiko, we have set up our entire infrastructure:

  1. AWS Networking
  2. AWS EC2 Machine created via Spot Instance Request
  3. AWS Redshift

We are now ready to install/update our DAGs and launch them, so our Data Pipeline starts running.

Next Steps

In the next article, Part 4: Airflow & Data Pipelines, we will copy our DAG code and launch our DAGs, so that our Data Pipeline starts creating the Data Warehouse that we need.

Source Code

Find the end-to-end solution source code at https://github.com/hudsonmendes/nanodataeng-capstone.

Wanna keep in Touch? LinkedIn!

My name is Hudson Mendes (@hudsonmendes). I’m a 38-year-old coder, husband, father of 3, ex Startup Founder, ex-Peloton L7 Staff Tech Lead/Manager, nearly a BSc in Computer Science from the University of London, and a Senior AI/ML Engineering Manager.

I’ve been on the Software Engineering road for 22+ years, and I occasionally write about topics of interest to other Senior Engineering Managers and Staff Machine Learning Engineers, with a bit of focus (but not exclusively) on Natural Language Processing.

Join me there, and I will keep you in the loop with my new learnings in AI/ML/LLMs and beyond!

⚡️Staff AI/ML Engineer & Senior Engineering Manager, #NLP, opinions are my own. https://linkedin.com/in/hudsonmendes