December 1, 2022

An Actionable Guide to AWS Glue for Data Engineers

In this article, learn about AWS Glue in depth and how it can benefit data engineers. Also, get a tutorial on how to set up and perform ETL Jobs on AWS Glue.

What is AWS Glue?

AWS Glue is a fully managed ETL service that helps you create jobs (based on Apache Spark, Python, or AWS Glue Studio) to perform extract, transform, and load (ETL) tasks on datasets of almost any size.

AWS Glue is serverless, so there is no infrastructure to manage.

By supporting batch ETL and streaming workloads in a single service, AWS Glue serves a wide range of workloads and user types.

Being serverless essentially means that if you want to run Spark jobs, you don't have to set up and manage EC2 clusters for them.

It simplifies the discovery, preparation, movement, and integration of data from multiple sources for analytics users.

It also includes additional products and data ops tooling for authoring, running jobs, and implementing business workflows.

Besides analytics, it can be used for machine learning and application development.

Its major strength is, therefore, consolidating the main data integration capabilities (data discovery, modern ETL, cleansing, transforming, and centralised cataloguing) into a single service, which lowers the barrier to entry for building an ETL pipeline.

For Data Engineers, it enables building complex data integration pipelines.

What makes this more interesting is the wide range of AWS and third-party services that can be integrated.

This means many use cases and scenarios are covered, and we can go a step further and create AWS Glue custom blueprints to automate the building of data integration pipelines even more.

Below is a primer on the important bits you need to know as a Data Engineer to get up and running with AWS Glue.

What can AWS Glue help you achieve as a data engineer?

🟢 Discover and organise data

  • It helps you store, index, and search for data across multiple data sources and sinks by cataloguing all your data in AWS.
  • AWS Glue has automatic data discovery. It uses crawlers to discover schema information and add it to your AWS Glue Data Catalogue (see the sketch after this list).
  • It helps you manage schemas and permissions by validating and controlling access to your databases and tables.
  • It helps you connect to a wide variety of data sources.
  • You can tap into multiple data sources, both on-premises and on AWS, using AWS Glue connections to build your data lake.
  • Examples of these data sources include Amazon S3, Amazon Redshift, Amazon RDS instances, MongoDB, Kafka, and others.
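As a quick illustration of automatic data discovery, below is a minimal sketch that creates and starts a crawler with boto3. The crawler name, IAM role, database name, and bucket path are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names -- replace with your own role, database, and bucket path.
glue.create_crawler(
    Name="students-crawler",
    Role="glue-admin-access",                 # IAM role with access to the bucket
    DatabaseName="students_database",         # catalogue database to populate
    Targets={"S3Targets": [{"Path": "s3://my_bucket/data/students_database/"}]},
)

# Run the crawler; discovered tables appear in the Data Catalogue when it finishes.
glue.start_crawler(Name="students-crawler")
```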

🟢 Transform, prepare, and clean data for analysis

  • Visually transform data with a drag-and-drop interface. Create your ETL process in the drag-and-drop job editor and it will automatically generate the code to extract, transform, and load your data (a sketch of such a script follows this list).
  • You can build complex and highly automated ETL pipelines with simple job scheduling. Glue lets you invoke jobs on a schedule, on demand, or based on an event.
  • Also, it gives the ability to clean and transform streaming data in transit, thus, enabling continuous data consumption to your target data store. In-transit data can also be encrypted via SSL for security.
  • Automatic data cleaning and deduplication via machine learning.
  • Has built-in job notebooks via AWS Glue Studio, which are serverless and let you get started quickly with minimal setup effort.
  • With AWS Glue interactive sessions, you can interactively explore, prepare, edit, debug, and test your data. Using your favourite IDE or notebook, you can explore, experiment, and process data interactively.
  • Define, detect, and remediate sensitive data. AWS Glue's sensitive data detection allows you to identify sensitive data in your pipelines and lakes and process it accordingly.
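For context, here is a minimal sketch of the kind of PySpark script a Glue job runs; the database, table, mappings, and output path are hypothetical, and the Glue Studio editor generates something along these lines for you.

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read from a catalogue table (hypothetical names).
students = glue_context.create_dynamic_frame.from_catalog(
    database="students_database", table_name="students_csv"
)

# Transform: rename/cast columns.
mapped = ApplyMapping.apply(
    frame=students,
    mappings=[("id", "string", "id", "int"), ("name", "string", "name", "string")],
)

# Load: write the result back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my_bucket/target/"},
    format="parquet",
)

job.commit()
```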

🟢 Build and monitor data pipelines

  • Dynamically scale resources based on workload. It automatically scales resources up and down as needed. Workers are assigned to jobs only when they are needed.
  • Automate jobs with event-based triggers. Use event-based triggers to start crawlers or AWS Glue jobs, and design a chain of dependent jobs and crawlers.
  • Run your AWS Glue jobs, and then monitor them with automated monitoring tools such as the Apache Spark UI, AWS Glue job run insights, and AWS CloudTrail (see the sketch after this list for starting and monitoring a run with the API).
  • Set up workflows for ETL and integration activities. Create workflows for ETL and integration activities using multiple crawlers, jobs, and triggers.
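A minimal sketch of starting a job run and polling its status with boto3; the job name is a hypothetical placeholder.

```python
import time
import boto3

glue = boto3.client("glue")

# Start a run of a previously created job (hypothetical name).
run_id = glue.start_job_run(JobName="students-etl-job")["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    run = glue.get_job_run(JobName="students-etl-job", RunId=run_id)["JobRun"]
    print("Job state:", run["JobRunState"])
    if run["JobRunState"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```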


What are the fundamental concepts of AWS Glue?

Key Concepts of AWS Glue

(Image source: https://docs.aws.amazon.com/glue/latest/dg/components-key-concepts.html)

🔷 Data catalogue: A persistent metadata store

This is a managed service that lets you store, annotate and share metadata which can be used to query and transform data. 

Metadata can be data location, schema information, data types, and data classification.

There is one Data Catalogue per AWS account per region.
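As an illustration, here is a minimal sketch of reading a table's metadata (location, schema, classification) from the Data Catalogue with boto3; the database and table names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Fetch the catalogue entry for a table (hypothetical names).
table = glue.get_table(DatabaseName="students_database", Name="students_csv")["Table"]

print("Location:", table["StorageDescriptor"]["Location"])
print("Columns:", [(c["Name"], c["Type"]) for c in table["StorageDescriptor"]["Columns"]])
print("Classification:", table.get("Parameters", {}).get("classification"))
```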

🔷 Databases

A database in AWS Glue is a set of associated Data Catalogue table definitions organised into a logical group. 

Therefore, tables belong to databases.

Interestingly, databases can contain tables from more than one data source. 

Deleting a database also deletes all the tables within it.

Databases can be created from the Databases page within the AWS Glue console, or via the API.
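A minimal sketch of creating a database with the AWS Glue API; the name and description are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Create a logical grouping for table definitions (hypothetical name).
glue.create_database(
    DatabaseInput={
        "Name": "students_database",
        "Description": "Catalogue database for the students dataset",
    }
)
```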

🔷 Tables 

These are the metadata definitions that represent your data. 

Tables are just a representation of your schemas.

The underlying data itself resides elsewhere, such as in S3.

Tables can only belong to one database at a time.

There are multiple ways of creating tables/adding table definitions to the data catalogue. 

The most common ways are running a crawler, or creating them manually via the AWS Glue console or the AWS Glue API (a sketch of the API approach follows below).
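For illustration, here is a minimal sketch of adding a CSV table definition through the API; the database, table, columns, and S3 location are hypothetical, and the SerDe settings shown are the usual ones for comma-delimited text.

```python
import boto3

glue = boto3.client("glue")

# Register a table definition pointing at CSV files in S3 (hypothetical names).
glue.create_table(
    DatabaseName="students_database",
    TableInput={
        "Name": "students_csv",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "int"},
                {"Name": "name", "Type": "string"},
            ],
            "Location": "s3://my_bucket/data/students_database/students_csv/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```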

🔷 Partitions

Partitioning, as AWS puts it, is an important technique for organising datasets, so they can be queried efficiently.  

It involves arranging your data in a hierarchical directory structure according to the distinct values of one or more columns. 

In AWS Glue, a partition is a logical entity rather than an actual service; partitions are represented as columns in a Glue table.

Folders where data is stored in S3 (physical entities) are mapped to partitions (logical entities).

For example, sales/log data can be added to your S3 with partitions for years, months, and days like s3://my_bucket/sales/year=2019/month=Jan/day=1/.  

This makes it easier for services such as Glue, Amazon Athena, and Amazon Redshift Spectrum to filter your data by partition instead of reading the entire dataset.

Below is an image of an AWS folder structure that is partitioned according to multiple column names.

AWS Folder Structure

(Image source: https://docs.aws.amazon.com/glue/latest/dg/crawler-s3-folder-table-partition.html)
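To take advantage of partitioning in a Glue job, you can push a partition predicate down when reading from the catalogue, so only the matching S3 folders are scanned. A minimal sketch, assuming a hypothetical sales database and table and a glue_context created as in the earlier script sketch:

```python
# Inside a Glue job script, with glue_context already created as shown earlier.
sales_jan_2019 = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="sales",
    # Only partitions matching this predicate are read from S3.
    push_down_predicate="year == '2019' and month == 'Jan'",
)
```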

🔷 Crawler

Crawlers automatically infer the metadata from your data sources and use it to populate your table definitions.

Without crawlers, all schema definitions would need to be added to your tables manually; crawlers automate this for you.

Crawlers use what are known as classifiers to infer the format and structure of your data. 

There are custom and built-in classifiers.  

Custom classifiers (which you define yourself) run first to try to recognise the schema of your data.

If none matches, the crawler defaults to built-in classifiers. 

The inferred schema is then saved to your Data Catalogue.

Crawlers are also involved in creating connections with data stores from which they get the data.
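As an example of a custom classifier, here is a minimal sketch that registers a CSV classifier and attaches it to a crawler; all names, the role, and the header list are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# A custom CSV classifier with an explicit header (hypothetical columns).
glue.create_classifier(
    CsvClassifier={
        "Name": "students-csv-classifier",
        "Delimiter": ",",
        "ContainsHeader": "ABSENT",
        "Header": ["id", "name"],
    }
)

# Crawlers try custom classifiers first, then fall back to the built-in ones.
glue.create_crawler(
    Name="students-crawler-custom",
    Role="glue-admin-access",
    DatabaseName="students_database",
    Targets={"S3Targets": [{"Path": "s3://my_bucket/data/students_database/"}]},
    Classifiers=["students-csv-classifier"],
)
```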

🔷 Connections

They are data catalogue objects that contain the properties required to connect to particular data stores. 

They contain data such as login credentials, URI strings, and virtual private cloud (VPC) information. 

Some of the data stores we can connect to are JDBC sources, AWS services (Amazon RDS, Amazon Redshift, Amazon DocumentDB), Kafka, MongoDB, or a network inside an AWS VPC.
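A minimal sketch of creating a JDBC connection through the API; the URL, credentials, and VPC details are hypothetical placeholders (in practice, prefer storing credentials in AWS Secrets Manager).

```python
import boto3

glue = boto3.client("glue")

# A JDBC connection to a hypothetical PostgreSQL instance.
glue.create_connection(
    ConnectionInput={
        "Name": "students-postgres",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://my-db-host:5432/students",
            "USERNAME": "glue_user",
            "PASSWORD": "replace-me",  # prefer Secrets Manager in real use
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
        },
    }
)
```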

🔷 Jobs 

This is the business logic required to perform ETL work. 

It consists of a transformation script, data sources, and data targets. 

ETL jobs can be initiated by triggers that can be scheduled or triggered by events.

There are 3 types of jobs on AWS Glue:

  • Spark: Runs in a Spark environment managed by AWS Glue
  • Streaming ETL: Similar to Spark jobs, but performs work on streaming data
  • Python Shell: Runs a Python script as a shell

In addition, jobs can write output files in data formats such as JSON, CSV, Apache Parquet, Apache Avro, or ORC (Optimised Row Columnar).
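A minimal sketch of creating a Spark job with the API; the job name, role, and script location are hypothetical, and the worker settings are just an example.

```python
import boto3

glue = boto3.client("glue")

# Define a Spark ETL job that runs a script stored in S3 (hypothetical names).
glue.create_job(
    Name="students-etl-job",
    Role="glue-admin-access",
    Command={
        "Name": "glueetl",                                   # Spark job type
        "ScriptLocation": "s3://my_bucket/scripts/students_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
    DefaultArguments={"--TempDir": "s3://my_bucket/temp-dir/"},
)
```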

🔷 Triggers 

The trigger initiates an ETL job.  

These can be scheduled to run hourly, daily, or on a custom schedule, so we don't have to start ETL jobs manually.

This is a similar concept to running cron jobs. 

Apart from being scheduled, ETL jobs can be triggered after the successful completion of another ETL job.

Other modes are On-demand and EventBridge event triggers.
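Below is a minimal sketch of the scheduled and conditional trigger types via the API; the job names and cron expression are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger: run the job every day at 06:00 UTC (hypothetical job name).
glue.create_trigger(
    Name="daily-students-etl",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "students-etl-job"}],
    StartOnCreation=True,
)

# Conditional trigger: start a second job after the first one succeeds.
glue.create_trigger(
    Name="after-students-etl",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "students-etl-job",
                "State": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "students-reporting-job"}],
    StartOnCreation=True,
)
```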

🔷 Dev endpoints 

It is basically an environment that you can use to develop and test your AWS Glue scripts.  

It's essentially an abstracted cluster.  

These can be expensive, hence, it is important to shut them down after use.

How to create an ETL Job on AWS Glue?

We are now well-positioned to go ahead to create our first ETL Job on AWS Glue. 

We will also create a corresponding trigger for this job.  

Along the way, we will touch on concepts like jobs, triggers, databases, tables, and data sources that we have already discussed.

Let’s dive right into it: 

✅ Prerequisites: Create an AWS account 

Before starting, note that you should tear down the services you spin up for this tutorial afterwards so you don't incur unnecessary costs.

Steps to Load Data: 

1. Create an S3 bucket

  • We will use this S3 bucket to store the dataset we need to load into AWS Glue
  • Under the bucket, create a /data folder where we will add the initial dataset
  • Also, create a /temp-dir folder that AWS Glue will use as its temporary folder. We will supply this folder name to AWS Glue for storing temporary files
  • Under /data, we can add our database. Essentially, this will also be a folder, for example: /students_database
  • Under our database, we can create another folder to represent our table. Something like: /students_csv (see the sketch after this list for doing the same with the AWS SDK)
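If you prefer to script this step, here is a minimal sketch using boto3; the bucket name is a hypothetical placeholder, and S3 "folders" are just zero-byte keys ending in a slash.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-glue-tutorial-bucket"  # hypothetical; bucket names must be globally unique

# Outside us-east-1, pass CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3.create_bucket(Bucket=bucket)

# Create the folder structure used in this tutorial.
for prefix in ("data/students_database/students_csv/", "temp-dir/"):
    s3.put_object(Bucket=bucket, Key=prefix)
```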

2. Create an IAM role that we will use for AWS Glue

  • Under Roles, create a new AWS service role by selecting Glue
  • Under permissions, give it AdministratorAccess. Note that this is not usually recommended in practice, but for the purposes of this quick start, it will do
  • Give the role any name, for example glue-admin-access, and create it

3. On the AWS Glue console, create a database

During creation, you will be prompted for the Location, to which you can add the S3 folder that we created for the database. That is s3://…./students_database/.

4. Create a table within your database

To create a table, you will also be required to add the link to the actual data file. Also, you will need to specify your column names manually.

You can also utilise crawlers to load data into your tables.

✅ Creating Glue Jobs

1. Create the job properties

Some of these are its name, IAM role (that we had previously created), job type, and Glue version. 

You will also define the S3 path of the script file for the job.

2. Choose a data source

For this example, we can choose the S3 table /students_csv that we had previously created.

3. Choose a transformation type

In our case, change the schema. 

4. Choose a data target

Under target, you’ll specify the data store, connection (to the data store), and database name.  

A data format is also required for some data stores, such as S3. 

5. Create the target folder in S3 and run the job 

This usually takes some time, depending on the size of the data.

The result of your job should be saved in the S3 bucket to the target folder specified.

✅ Set up a trigger 

Create a trigger and add its properties.

These include the name, trigger type (schedule, job event, EventBridge event, or on-demand), frequency, and start time.

To sum up

In this blog, I have highlighted the key components of the AWS Glue service. 

I also showed you how to set up and perform ETL Jobs on Glue, featuring other services like AWS S3. 

Apart from the numerous benefits of this service for data engineers, AWS Glue has an overall positive net effect on any organisation using it.

It enables data to be processed easily and in a more scalable way, while also reducing server costs related to self-hosting ETL tools such as Spark and Python.

Information on the pricing of the service is also worth mentioning. 

AWS charges users a monthly fee to store and access metadata in the Glue Data Catalogue.

For ETL jobs and crawlers, there is also a per-second charge with AWS Glue pricing, with a minimum duration of 10 minutes or 1 minute (depending on the Glue version).

A per-second charge also applies when connecting to a development endpoint for interactive development.

You can find more details on the AWS Glue pricing page.

What Next?

Here are some references and further reading resources that you can check out to further strengthen your knowledge and confidence in using AWS Glue. 

Remember, practice makes perfect, and it is the best way to learn and stay up to date.

AWS Glue Tutorial for Beginners - https://youtu.be/dQnRP6X8QAU

AWS official documentation for Glue - https://docs.aws.amazon.com/glue/index.html

Case study - Hudl: https://aws.amazon.com/blogs/big-data/how-hudl-built-a-cost-optimized-aws-glue-pipeline-with-apache-hudi-datasets/

Streaming ETL Jobs: https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html

PySpark: https://github.com/hyunjoonbok/PySpark

If you have any questions related to what I shared in this blog or would like to seek clarity on AWS Glue, do get in touch with me.