Photo by EJ Strat on Unsplash

ETL is an automated process that takes raw data, extracts and transforms the information required for analysis, and loads it into a data warehouse. There are different ways to build an ETL pipeline; in this post we’ll be using three main tools:

  • Airflow: one of the most powerful platforms used by Data Engineers for orchestrating workflows.
  • AWS ECS/Fargate: a container management service that makes it easy to run, stop, and manage your containers.
  • AWS S3: Amazon Simple Storage Service, used for object storage.

The architecture that we will be building follows the schema below:

Photo by Ian Battaglia on Unsplash

In this post, we will set up an EC2 instance in a private subnet and use SSM Session Manager to access the instance, which hosts a Jupyter Notebook server. We will then use port forwarding with AWS Session Manager to access the server from our local machine.

We'll set up this infrastructure without opening inbound ports, setting up bastion hosts, or managing SSH keys!
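As a rough sketch of the port-forwarding step, the snippet below assembles the AWS CLI arguments for a Session Manager port-forwarding session using the `AWS-StartPortForwardingSession` document. The instance ID is hypothetical, and port 8888 is assumed only because it is the usual Jupyter default; neither value comes from the post.

```python
import json
import shlex


def build_port_forward_command(instance_id, remote_port=8888, local_port=8888):
    """Build the AWS CLI arguments for an SSM port-forwarding session."""
    # The session document expects its parameters as lists of strings.
    parameters = {
        "portNumber": [str(remote_port)],
        "localPortNumber": [str(local_port)],
    }
    return [
        "aws", "ssm", "start-session",
        "--target", instance_id,
        "--document-name", "AWS-StartPortForwardingSession",
        "--parameters", json.dumps(parameters),
    ]


# Hypothetical instance ID, for illustration only.
command = build_port_forward_command("i-0123456789abcdef0")

# Print a shell-safe version of the command that could be pasted into a terminal.
print(shlex.join(command))
```

Running the printed command requires the AWS CLI and the Session Manager plugin on the local machine; once the session is open, the notebook would be reachable on the chosen local port.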

Why use Session Manager?

It is well known that we cannot connect directly to a private EC2 instance unless there is VPN connectivity, Direct Connect, or another source of network connectivity to the VPC. …

Photo by Khadeeja Yasser on Unsplash

Alarms exist to notify us when our system behaves in an unexpected way, which warrants manual intervention to correct. When we have multiple systems in a production environment and an error passes unnoticed, the consequences can be catastrophic.

An alarm should be created when the system cannot automatically recover and human intervention is required. If an alert fires too frequently, it can lead to longer response times or even be missed.
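One way to keep alerts from firing too often is to alarm only when errors persist across several consecutive periods. The sketch below illustrates that idea in plain Python; the function name, the threshold, and the number of periods are assumptions made for illustration, not the configuration used in the article.

```python
def should_alarm(error_counts, threshold=1, consecutive_periods=3):
    """Return True when the last `consecutive_periods` values all reach the threshold."""
    if len(error_counts) < consecutive_periods:
        # Not enough history yet to decide.
        return False
    recent = error_counts[-consecutive_periods:]
    return all(count >= threshold for count in recent)


# A single failed period that recovers on its own does not alarm.
transient = [0, 0, 2, 0, 0]
# Failures that persist for three periods in a row do.
persistent = [0, 1, 3, 2, 4]

print(should_alarm(transient))   # False
print(should_alarm(persistent))  # True
```

Requiring several breaching periods trades a slightly slower notification for fewer false alarms, which helps keep each alert actionable.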

In this article, we will be building an alarm notification pipeline for an AWS Lambda function. For that, we will be using three AWS services: AWS Lambda, Simple…

Photo by Sai Kiran Anagani on Unsplash

While working with Terraform Lambda modules, I had a hard time finding the best repository architecture to automate my Lambda deployments. I couldn’t find any article that I could use as a guideline, which is why I’m writing this one.

When working on big projects, project organization is a must-have. In this story, you will be presented with one way to organize your repositories that facilitates the deployment procedure and makes it much more scalable.

In this article, we'll be building the following repository architecture:

  • Lambda-Module: Repository containing the Terraform Lambda module.
  • Lambda-Infra: Repository containing Terraform code for deployment…

Photo by Max Duzij on Unsplash

Terraform, the infrastructure-as-code offering from HashiCorp, is an amazing tool for building, changing, and versioning infrastructure safely and efficiently.

While using Terraform to build a project that I’m designing on Amazon Web Services (AWS), I came across the need to set up an API Gateway endpoint that takes records and puts them into an SQS queue, which serves as an event source for a Lambda function.
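To make the last hop of that flow concrete, here is a minimal sketch of a Lambda handler that consumes a batch of SQS messages. It assumes the records sent through API Gateway are JSON strings; the payload fields and the returned summary are hypothetical and not taken from the post.

```python
import json


def lambda_handler(event, context):
    """Process the batch of SQS messages delivered to the function."""
    processed = []
    for record in event.get("Records", []):
        # Each SQS record carries the original message as a string in "body".
        payload = json.loads(record["body"])
        processed.append(payload)
    return {"processed": len(processed), "items": processed}


# Local check with a hand-written event that mimics the SQS record shape.
sample_event = {
    "Records": [
        {"messageId": "1", "body": json.dumps({"id": 42, "value": "a"})},
        {"messageId": "2", "body": json.dumps({"id": 43, "value": "b"})},
    ]
}
result = lambda_handler(sample_event, None)
print(result["processed"])  # 2
```

Calling the handler locally with a hand-built event like this is a cheap way to check the parsing logic before wiring up the queue and the event source mapping.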

In this post, I would like to share with you each step required to build this infrastructure. …

Building a multi-label classifier doesn't seem like a difficult task using Keras, but when you are dealing with a highly imbalanced dataset with more than 30 different labels and multiple losses, it can become quite tricky.

In this post, we'll go through the definition of a multi-label classifier, multiple losses, text preprocessing, and a step-by-step explanation of how to build a multi-output RNN-LSTM in Keras.
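As a small illustration of what "multi-label" means for the targets, the sketch below encodes the labels of one sample as a 0/1 vector in which several positions can be active at once. The class names are hypothetical and the helper is not part of Keras; it only shows the shape of the target a multi-label model is trained against.

```python
def multi_hot(labels, classes):
    """Encode a set of labels as a 0/1 vector ordered like `classes`."""
    active = set(labels)
    return [1 if name in active else 0 for name in classes]


# Hypothetical class names, for illustration only.
classes = ["weather_related", "request", "medical_help", "missing_people"]

# One message can belong to several classes at once.
print(multi_hot(["request", "medical_help"], classes))  # [0, 1, 1, 0]
# A message can also belong to no class at all.
print(multi_hot([], classes))  # [0, 0, 0, 0]
```

This differs from multi-class classification, where exactly one position in the vector would be set for each sample.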

The dataset that we'll be working on consists of natural disaster messages that are classified into 36 different classes. The dataset was provided by Figure Eight. Example of input messages:

['Weather update - a cold front from Cuba that could pass over Haiti', 'Is the Hurricane over or is it not over', 'Looking for someone but no name', 'UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.'…

As a Python developer, you must be familiar with the Python package installation procedure: pip install <package_name>. But maybe you have never asked yourself how this works behind the curtain. This post briefly explains how you can upload your own package to PyPI so that other programmers can use it.

Part I: Build a Python package

First, you need to have your code ready for upload. In this tutorial, I’ll be using the package gradient_descent as an example. The code folder tree is organized as follows:


It’s important that your…

When working as a data scientist at a start-up, we should all be prepared for new projects and new demands. My learning curve at the company keeps growing every day, and I thought that it would be nice to write down some of my personal work. What better way to do that than publishing it on Medium?


Exploratory Data Analysis is the process of investigating data in order to uncover its main characteristics and patterns, spot anomalies, test hypotheses, and check assumptions, answering questions related to business or real-world applications of how the data could…

Daniel Da Costa

Data Scientist at Cyberlabs | MS École CentraleSupélec | BSc PUC-Rio
