Data Pipeline for Data Science, Part 1: Problem/Solution Fit

⚡️Hudson Ⓜ️endes
8 min read · Sep 16, 2020

Learn how to create a Data Pipeline for Data Science through a step-by-step series covering the end-to-end delivery of a Data Engineering solution employing Tensorflow, Amazon S3, Redshift, EC2 and Apache Airflow.

Wanna hear occasional rants about Tensorflow, Keras, DeepLearning4J, Python and Java?

Join me on twitter @ twitter.com/hudsonmendes!

Taking Machine Learning models to production is a battle, and there I share my learnings (and my sorrows), so we can learn together!

Data Pipeline for Data Science Series

This is a large tutorial that we have tried to keep conveniently small for the occasional reader, so it is divided into the following parts:

Part 1: Problem/Solution Fit
Part 2: TMDb Data "Crawler"
Part 3: Infrastructure As Code
Part 4: Airflow & Data Pipelines
(soon available) Part 5: DAG, Film Review Sentiment Classifier Model
(soon available) Part 6: DAG, Data Warehouse Building
(soon available) Part 7: Scheduling and Landings

The Problem: TMDb Review Sentiment Classification

This series will focus on a real-world problem chosen to be solved as part of my Udacity Data Engineering Nanodegree graduation.

Suppose that we have been tasked with the following:

Data Analysts must be able to produce reports on-demand, as well as run several roll-up and drill-down queries into what the Review Sentiment is for both IMDb films and IMDb actors/actresses, based on their TMDb Film Reviews; and the Sentiment Classifier must be our own.

As Data Science Engineers, we write down a few notes about this task:

  1. "Produce reports on-demand"
  2. Run roll-up / drill-down queries
  3. Classify sentiment and "Sentiment Classifier must be our own"
  4. Do TMDb and IMDb data have any correlation?

Although there are several unanswered questions at this point, these four notes already help us a lot in drafting what a solution could look like.

“Produce reports on-demand”

Here we know that reports will need to be produced "on-demand", that is, whenever they are needed.

That already implies that producing the reports required by the Data Analysts will need some sort of automation, so that the team can always produce a more up-to-date report whenever newer/fresher data arrives.

With just that information, it is not yet possible to say which automation technique is most suitable to solve our problem, only that automation will be required.

Run roll-up / drill-down queries

That note makes it clear that, even though there will be a summarised report, ad-hoc queries will need to be run against the processed data. Therefore our Data Engineering solution will also have to ship a database.

From the fact that roll-up and drill-down queries are specifically mentioned, we can already derive a number of other small requirements:

  1. On-demand report generation creates a single write event (writes should not be intensive)
  2. Several roll-up and drill-down operations are expected, therefore multiple reads (reads will be intensive; see the query sketch after this list)
  3. A columnar database storage system might be more suitable, given that these types of queries selectively bring columns into the query
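
To make those read patterns concrete, here is a minimal sketch of a roll-up and a drill-down query. The fact and dimension tables (fact_review_sentiment, dim_actor, dim_film) are hypothetical placeholders, not the schema we will actually build later in the series:

# Hypothetical schema, for illustration only.
# Roll-up: aggregate review sentiment from individual reviews up to the actor level.
ROLL_UP_SENTIMENT_BY_ACTOR = """
    SELECT a.actor_name,
           AVG(f.sentiment) AS avg_sentiment,
           COUNT(*)         AS total_reviews
      FROM fact_review_sentiment f
      JOIN dim_actor a ON a.actor_id = f.actor_id
     GROUP BY a.actor_name
     ORDER BY avg_sentiment DESC;
"""

# Drill-down: from a single actor down to the individual films they appear in.
DRILL_DOWN_FILMS_FOR_ACTOR = """
    SELECT a.actor_name,
           m.film_title,
           AVG(f.sentiment) AS avg_sentiment
      FROM fact_review_sentiment f
      JOIN dim_actor a ON a.actor_id = f.actor_id
      JOIN dim_film  m ON m.film_id  = f.film_id
     WHERE a.actor_name = %(actor_name)s  -- parameter bound at query time
     GROUP BY a.actor_name, m.film_title;
"""

Note that both queries touch only a handful of columns at a time, which is exactly the access pattern that columnar storage serves well.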

The requirements above do give us a lot of insight into what our solution could look like, but let's first evaluate our other notes.

Classify sentiment and “Sentiment Classifier must be our own”

Sentiment Analysis models are functions trained using Machine Learning techniques that are capable of understanding Natural Language and performing tasks such as classification.

"Our own model" (presented as a requirement) means that the model is or will be trained using a specific technology, such as PyTorch, Tensorflow, or other.

In our case, we shall use a Tensorflow model that has the following architecture:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# vocab_size, emb_dims, seq_len and lstm_units are hyperparameters defined elsewhere
model = Sequential([
    # Embedding: turns integer-encoded tokens into dense vectors; mask_zero ignores padding
    Embedding(
        input_dim=vocab_size,
        output_dim=emb_dims,
        input_length=seq_len,
        mask_zero=True),
    # LSTM: recurrent layer that reads the review token sequence
    LSTM(
        lstm_units,
        dropout=0.5,
        recurrent_dropout=0.5),
    # Two output classes: negative and positive
    Dense(
        2,
        activation='softmax')
])

model.compile(
    optimizer='rmsprop',
    loss='sparse_categorical_crossentropy',
    metrics=['acc'])

Our solution will somehow have to use that model to classify review sentiments and then make the results available in the database shipped to the Data Analysts.
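
As a minimal sketch of that classification step, assuming the reviews are tokenised with the same (hypothetical) tokenizer and sequence length used during training, and assuming class index 1 maps to positive (1) and class index 0 to negative (-1):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def classify_reviews(model, tokenizer, review_texts, seq_len):
    # Convert raw review text into the padded integer sequences the model expects.
    sequences = tokenizer.texts_to_sequences(review_texts)
    padded = pad_sequences(sequences, maxlen=seq_len, padding='post')

    # Softmax over two classes; assumption: index 0 = negative, index 1 = positive.
    probabilities = model.predict(padded)
    predicted_classes = np.argmax(probabilities, axis=1)

    # Map to the -1/+1 encoding that will be written into the Data Warehouse.
    return [1 if c == 1 else -1 for c in predicted_classes]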

Do TMDb and IMDb data have any correlation?

TMDb does allow you to find films by imdb_id.

However, you have to request the films one by one, which undermines our ability to batch process the data. This will need to be addressed.

Once addressed, we should have TMDb and IMDb information linked at the film level, so that we can connect these two different datasets into a single database.
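
As a sketch of that lookup, TMDb's v3 "find" endpoint resolves an external imdb_id into a TMDb film record, one request at a time (the API key below is a placeholder):

import requests

TMDB_API_KEY = "<your-tmdb-api-key>"  # placeholder

def find_film_by_imdb_id(imdb_id):
    # TMDb /find endpoint: resolves external ids such as IMDb ids, one per request.
    response = requests.get(
        f"https://api.themoviedb.org/3/find/{imdb_id}",
        params={"api_key": TMDB_API_KEY, "external_source": "imdb_id"})
    response.raise_for_status()
    return response.json().get("movie_results", [])

# One call per film, e.g. find_film_by_imdb_id("tt0111161"), which is precisely
# the one-by-one limitation that our batch processing will have to work around.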

The Solution: Film Review Sentiments Data Warehouse

Data Pipelines are, at a high level, composed of the following steps:

  1. Stage the Raw Data
  2. Wrangle Raw Data into your Dimensions and Facts
  3. Run Data Analysis Queries

The two main options for going through these 3 steps are ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform), and both are viable here.

Shipping a RDBMS-based Data Warehouse

Our requirements tell us to ship a database that allows our Data Analysis team to run SQL queries.

Therefore we can assume that loading the data into an RDBMS-based Data Warehouse is a safe choice.

Choosing an RDBMS Data Warehouse makes the ETL (Extract, Transform, Load) approach more convenient.

Also, both an RDBMS Data Warehouse and an ELT-based alternative would easily allow us to automate the report that we are required to deliver on demand. Therefore the infrastructure we selected will be able to deliver these two requirements perfectly.

Raw File Storage of choice: Amazon S3

Amazon S3 provides us with elastic storage at very competitive prices, based on a 2019 pricing assessment.

The benefits of using S3 are not limited to it being elastic and relatively inexpensive. S3 offers seamless integration with other Amazon services (such as Redshift), from which we can benefit immediately.
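
As a minimal sketch (bucket and key names below are hypothetical), staging a raw file into S3 takes a single boto3 call:

import boto3

s3 = boto3.client("s3")

# Stage a locally produced raw file into our hypothetical Data Lake bucket,
# from where Redshift will later be able to COPY it directly.
s3.upload_file(
    Filename="data/raw/tmdb_reviews.json",
    Bucket="nanodataeng-datalake",
    Key="raw/tmdb/reviews/tmdb_reviews.json")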

RDBMS of choice: Amazon Redshift

Once files are stored in Amazon S3, deploying our Data Warehouse to Amazon Redshift becomes the most logical choice.

Processing large datasets requires us to make use of something similar to Postgres' COPY commands.

Redshift, which is based on Postgres, is capable of running COPY commands directly from S3 as if the files were local.

That enormously simplifies the "Extract" stage of our ETL, turning its tasks into nothing more than SQL statements run against the database.
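
As a sketch of what that "Extract" step reduces to, assuming a hypothetical staging table, bucket and IAM role (the real ones come from the Infrastructure as Code part of the series), the COPY statement could be run like this:

import psycopg2

COPY_STAGING_REVIEWS = """
    COPY staging_reviews
    FROM 's3://nanodataeng-datalake/raw/tmdb/reviews/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
    FORMAT AS JSON 'auto'
    REGION 'us-east-1';
"""

# Connection details are placeholders; Redshift speaks the Postgres protocol.
connection = psycopg2.connect(
    host="<redshift-endpoint>", port=5439,
    dbname="warehouse", user="analyst", password="<password>")
with connection, connection.cursor() as cursor:
    cursor.execute(COPY_STAGING_REVIEWS)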

Classifying Sentiments with a Tensorflow Model

In our case, the data written into our fact tables must have the text of its review dimension classified as either positive (1) or negative (-1).

Tensorflow allows us to train and save a model that can then be easily accessed via either:

  1. A model-serving endpoint (e.g. Tensorflow Serving), or
  2. The Tensorflow Saved Model Python API, loading the model directly from a folder

(Soon, I will write about Keras models being served with DeepLearning4J).

Deploying an endpoint as part of the Data Pipeline is achievable, but definitely not as convenient as just loading a file from a folder. So if possible, using the Tensorflow Saved Model Python API would be our tool of choice.
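
A minimal sketch of that load-from-a-folder approach, assuming the training DAG saves the model to a directory that the Airflow workers can read (the path below is hypothetical):

import tensorflow as tf

MODEL_DIR = "/opt/airflow/models/review_sentiment"  # hypothetical shared path

# The training DAG persists the model (defined earlier) in the SavedModel format...
model.save(MODEL_DIR)

# ...and the warehouse-building DAG later loads it in-process,
# with no separate serving infrastructure required.
reloaded_model = tf.keras.models.load_model(MODEL_DIR)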

But before we make that decision, let's think about how we streamline all those components.

Data Pipeline as DAGs running in Airflow

In order to have consistent results when executing our Data Pipeline, we must automate its steps.

There is a large number of options for automation across many different platforms and technologies, such as:

  1. Spring Cloud Data Flow (JVM)
  2. Apache Airflow (Python)
  3. Apache Beam (Java/Python/Others)

Given that we want to be able to load the Tensorflow Saved Model in-process (without the need to deploy it to a separate infrastructure), being able to run our Pipelines in Python is preferable.

Apache Airflow is hugely popular for implementing Data Pipelines, it suffices for our requirements, and it also happens to be the platform with which the project team holds the most knowledge and experience.
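
For readers less familiar with Airflow, this is roughly what a minimal DAG skeleton looks like (using the Airflow 1.10 import paths current at the time of writing; the dag_id and callable are illustrative, not the exact DAGs built later in the series):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def _train_model(**context):
    # Placeholder for the actual training logic covered later in the series.
    pass

with DAG(
        dag_id="review_sentiment_classifier_training",
        start_date=datetime(2020, 9, 1),
        schedule_interval=None,  # triggered on demand
        catchup=False) as dag:

    train_model = PythonOperator(
        task_id="train_model",
        python_callable=_train_model,
        provide_context=True)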

DAG: Film Review Sentiment Classifier Training

For our solution, we define that the model training will be performed by a DAG.

Provided that the training data is properly placed in our Data Lake, the scope of our DAG shall be:

  1. Download the Training Data
  2. Train the Model
  3. Deploy the Model to a Directory accessible from Airflow
DAG as illustrated by Apache Airflow

For the purpose of this demonstration, we did not include Quality Checks on the model. However, this DAG should be responsible for checking that the model meets the required quality after it is trained.

Further improvements could also be made should the Tensorflow Extended (TFX) guidelines be followed and their library adopted, but those are, again, out of scope for the present demonstration.

DAG: Film Review Sentiment Data Warehouse Building

Our solution will also provide a Data Warehouse Building DAG, responsible for running all our Data Pipeline steps, including sentiment classification using the model that was previously made available.

The steps should be as follows (sketched as an Airflow task chain after the list):

  1. Copy/Download/Stage the Data into our DW Amazon S3 bucket
  2. Stage the Data into Amazon Redshift, using the COPY command
  3. Batch insert Staged Data into our Dimension and Facts tables
  4. Run the Sentiment Classification
  5. Produce the PDF summary report
  6. Upload the PDF report to our DW bucket, Output Reports folder
DAG as illustrated by Apache Airflow
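
As an illustrative sketch (operator and task names are placeholders; the real DAG is covered in Part 6), those six steps could be chained as Airflow task dependencies like this:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# DummyOperator stands in for each real task, purely to illustrate the dependencies.
with DAG(
        dag_id="film_review_sentiment_warehouse",
        start_date=datetime(2020, 9, 1),
        schedule_interval=None,  # run on demand by the Data Analysts
        catchup=False) as dag:

    stage_to_s3 = DummyOperator(task_id="stage_raw_data_to_s3")
    copy_to_redshift = DummyOperator(task_id="copy_staged_data_to_redshift")
    load_dims_facts = DummyOperator(task_id="load_dimensions_and_facts")
    classify_sentiment = DummyOperator(task_id="classify_review_sentiment")
    build_report = DummyOperator(task_id="produce_pdf_summary_report")
    upload_report = DummyOperator(task_id="upload_report_to_s3")

    (stage_to_s3 >> copy_to_redshift >> load_dims_facts
        >> classify_sentiment >> build_report >> upload_report)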

Data Analysts should be able to run this DAG on demand, whenever a fresh version is needed.

Additionally, the solution provided must allow us to effortlessly schedule the execution for a specific time.

No landing-time constraints have been brought up by our analysis, but should they become a reality, Apache Airflow will also allow us to keep track of how well we stay within our time limits to deliver the reports.

In Summary

By looking at the requirements, our solution will be defined as follows:

  1. Our approach to the Data Pipeline will be an ETL
  2. Raw data will be Stored in Amazon S3
  3. We will ship a Data Warehouse into the RDBMS Amazon Redshift
  4. Our ETL "Extract" will run COPY commands against Redshift
  5. Our model will be trained with Tensorflow as an Airflow DAG
  6. Our ETL will run as an Airflow DAG

Next Steps

In the next article, Part 2: TMDb API Data “Crawler”, we will deep dive into how we used AWS Lambda to crawl the TMDb film information (using their API) with parallelisation.

Source Code

Find the end-to-end solution source code at https://github.com/hudsonmendes/nanodataeng-capstone.

Wanna keep in Touch? LinkedIn!

My name is Hudson Mendes (@hudsonmendes). I’m a 38-year-old coder, husband, father of 3, ex Startup Founder, ex-Peloton L7 Staff Tech Lead/Manager, nearly a BSc in Computer Science from the University of London, and a Senior AI/ML Engineering Manager.

I’ve been on the Software Engineering road for 22+ years, and occasionally write about topics of interest to other Senior Engineering Managers and Staff Machine Learning Engineers, with a bit of focus (but not exclusively) on Natural Language Processing.

Join me there, and I will keep you in the loop with my new learnings in AI/ML/LLMs and beyond!
