Learn about AWS Glue and data engineering with AWS, plus a tutorial on how to set up and perform ETL jobs on AWS Glue.
AWS Glue is a fully managed ETL service that helps you create jobs (based on Apache Spark, Python, or AWS Glue Studio) to perform extract, transform, and load (ETL) tasks on datasets of almost any size.
AWS Glue is serverless, so there is no infrastructure to manage.
It flexibly supports workloads such as batch ETL and streaming in a single service, catering to many types of users.
Being serverless essentially means that if you want to run Spark jobs, you don't have to set up and manage EC2 clusters to do so.
✔️It simplifies the discovery, preparation, movement, and integration of data from multiple sources for analytics users.
It also includes additional products and data ops tooling for authoring, running jobs, and implementing business workflows.
Besides analytics, it can be used for machine learning and application development.
⭐Its major strength is, therefore, consolidating major data integration capabilities (data discovery, modern ETL, cleansing, transforming, and centralised cataloguing) into a single service, and reducing the barriers to entry for creating an ETL service.
For Data Engineers, it enables building complex data integration pipelines.
What makes this more interesting is the wide range of AWS services and others that can be integrated.
This means many use cases and scenarios are covered, and we can go a step further and create custom AWS Glue blueprints to simplify the automation of building data integration pipelines even more.
Below is a primer on the important bits you need to know as a Data Engineer to get up and running with AWS Glue.
1. Discover and organise data
2. Transform, prepare, and clean data for analysis
3. Build and monitor data pipelines
(Image: AWS Glue components and key concepts. Source: https://docs.aws.amazon.com/glue/latest/dg/components-key-concepts.html)
1. Data catalogue: A persistent metadata store
This is a managed service that lets you store, annotate and share metadata which can be used to query and transform data.
👉 Metadata can be data location, schema information, data types, and data classification.
You are limited to one data catalogue per AWS region (per account).
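As a quick sketch, you can inspect the catalogue's metadata with boto3, the AWS SDK for Python (the region name here is an illustrative assumption):

```python
import boto3

# One Data Catalogue per region (per account): connect to a specific region.
glue = boto3.client("glue", region_name="us-east-1")  # region is an assumption

# List every database, then the table metadata stored within each one.
for db in glue.get_databases()["DatabaseList"]:
    print("Database:", db["Name"])
    for table in glue.get_tables(DatabaseName=db["Name"])["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "n/a")
        print("  Table:", table["Name"], "->", location)
```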
2. Databases
A database in AWS Glue is a set of associated Data Catalogue table definitions organised into a logical group; therefore, tables belong to databases.
Interestingly, databases can contain tables from more than one data source.
👉 Deleting a database also deletes all the tables within it.
Databases can be created from the Databases page within the AWS Glue console.
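Besides the console, a database can also be created programmatically; here is a minimal boto3 sketch (the database name matches the tutorial later in this post):

```python
import boto3

glue = boto3.client("glue")

# Create a logical database in the Data Catalogue.
glue.create_database(
    DatabaseInput={
        "Name": "students_database",
        "Description": "Logical group of table definitions for the tutorial",
    }
)
```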
3. Tables
Tables are the metadata definitions that represent your data.
They are just a representation of your schemas; the data itself resides in another location, such as S3.
Tables can only belong to one database at a time.
There are multiple ways of creating tables and adding table definitions to the data catalogue.
👉 Some of the most common ways are running a crawler, or creating them manually via the AWS Glue console or the AWS Glue API.
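For illustration, here is a hedged sketch of adding a table definition manually via the AWS Glue API with boto3; the column names, S3 path, and CSV SerDe settings are assumptions:

```python
import boto3

glue = boto3.client("glue")

# Add a table definition: metadata only, the data itself stays in S3.
glue.create_table(
    DatabaseName="students_database",
    TableInput={
        "Name": "students_csv",
        "StorageDescriptor": {
            "Columns": [  # hypothetical schema
                {"Name": "student_id", "Type": "int"},
                {"Name": "name", "Type": "string"},
            ],
            "Location": "s3://my_bucket/students_database/students_csv/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```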
4. Partitions
Partitioning, as AWS puts it, is an important technique for organising datasets so they can be queried efficiently.
👉 It involves arranging your data in a hierarchical directory structure according to the distinct values of one or more columns.
In AWS Glue, a partition is a logical entity rather than an actual service; it represents partition columns in a Glue table.
Folders where data is stored in S3 (physical entities) are mapped to partitions (logical entities).
👉 For example, sales/log data can be added to your S3 with partitions for years, months, and days, like s3://my_bucket/sales/year=2019/month=Jan/day=1/.
This makes it easier for services such as Glue, Amazon Athena, and Amazon Redshift Spectrum to filter your data by partition instead of reading the entire dataset.
For example, an S3 folder structure can be partitioned according to multiple column names, as in the sketch below. 👇
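Assuming a hypothetical sales table registered in the catalogue, a Glue job can filter by partition at read time with a pushdown predicate, so only the matching S3 folders are scanned:

```python
# Hypothetical S3 layout, partitioned by two columns (year and month):
#   s3://my_bucket/sales/year=2019/month=Jan/part-0000.csv
#   s3://my_bucket/sales/year=2019/month=Feb/part-0000.csv
#   s3://my_bucket/sales/year=2020/month=Jan/part-0000.csv

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read only the year=2019 partitions instead of scanning the whole dataset.
sales_2019 = glue_context.create_dynamic_frame.from_catalog(
    database="sales_database",             # hypothetical database
    table_name="sales",                    # hypothetical table
    push_down_predicate="year == '2019'",  # partition filter applied at read time
)
```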
5. Crawlers
Crawlers automatically recognise the metadata of your data sources and use it to create your table definitions.
Without crawlers, all schema definitions would need to be added to your tables manually; crawlers do this for you.
🎯 Crawlers use what are known as classifiers to infer the format and structure of your data.
There are custom and built-in classifiers.
Custom classifiers (which you provide the code for) first run to try to recognise the schema of your data.
If none matches, the crawler falls back to the built-in classifiers.
The data inferred is then saved to your data catalogue.
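Here is a minimal sketch of defining and starting a crawler with boto3; the crawler name, IAM role ARN, and S3 path are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 path and writes the inferred
# table definitions into a catalogue database.
glue.create_crawler(
    Name="students_crawler",                          # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueRole",   # placeholder IAM role ARN
    DatabaseName="students_database",
    Targets={"S3Targets": [{"Path": "s3://my_bucket/students_database/"}]},
)

# Run it on demand; the inferred schemas land in the Data Catalogue.
glue.start_crawler(Name="students_crawler")
```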
6. Connections
🎯 Crawlers reach the data stores they read from through connections.
A connection is a data catalogue object that contains the properties required to connect to a particular data store.
🎯 Connections hold data such as login credentials, URI strings, and virtual private cloud (VPC) information.
Some of the data stores you can connect to are JDBC sources, AWS services (RDS, Redshift, DocumentDB), Kafka, MongoDB, or a network within an AWS VPC.
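As a sketch, creating a JDBC connection with boto3 could look like this; the connection name, URL, and credentials are placeholders:

```python
import boto3

glue = boto3.client("glue")

# A connection stores the properties needed to reach a data store,
# e.g. a JDBC connection to a MySQL-compatible database.
glue.create_connection(
    ConnectionInput={
        "Name": "my_rds_connection",  # placeholder name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://my-db.example.com:3306/mydb",
            "USERNAME": "admin",       # placeholder credentials; prefer
            "PASSWORD": "REPLACE_ME",  # AWS Secrets Manager in practice
        },
    }
)
```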
7. Jobs
A job is the business logic required to perform ETL work.
It consists of a transformation script, data sources, and data targets.
ETL jobs can be initiated by triggers that are scheduled or fired by events.
🎯 There are 3 types of jobs on AWS Glue: Spark (batch ETL), streaming ETL, and Python shell jobs.
In addition, jobs can write output files in data formats such as JSON, CSV, Apache Parquet, Apache Avro, or ORC (Optimised Row Columnar).
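A minimal Glue Spark job script might look like the sketch below; the database, table, column mapping, and output path are assumptions:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: a table definition from the Data Catalogue.
source = glue_context.create_dynamic_frame.from_catalog(
    database="students_database", table_name="students_csv"
)

# Transform: rename/cast columns via a mapping (hypothetical schema).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("student_id", "int", "id", "int"),
        ("name", "string", "name", "string"),
    ],
)

# Target: write the result to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my_bucket/students_output/"},
    format="parquet",
)

job.commit()
```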
8. Triggers
A trigger initiates an ETL job.
Triggers can be scheduled to run daily, hourly, or on a custom schedule, so we do not have to start ETL jobs manually at the push of a button.
This is a similar concept to running cron jobs.
🎯 Apart from being scheduled, ETL jobs can be triggered by the successful completion of another ETL job.
Other modes are on-demand and EventBridge event triggers.
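For example, a scheduled trigger can be created with boto3 roughly as follows; the trigger and job names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# A scheduled trigger, much like a cron job: run the ETL job daily at 00:00 UTC.
glue.create_trigger(
    Name="daily_students_trigger",      # placeholder name
    Type="SCHEDULED",
    Schedule="cron(0 0 * * ? *)",       # AWS cron syntax: daily at midnight UTC
    Actions=[{"JobName": "students_etl_job"}],  # placeholder job name
    StartOnCreation=True,
)
```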
9. Dev endpoints
A dev endpoint is an environment that you can use to develop and test your AWS Glue scripts.
It's essentially an abstracted cluster.
Dev endpoints can be expensive, so it is important to shut them down after use.
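A hedged boto3 sketch of creating, and just as importantly deleting, a dev endpoint (the endpoint name and role ARN are placeholders):

```python
import boto3

glue = boto3.client("glue")

# Spin up a dev endpoint (an abstracted cluster) for interactive development.
glue.create_dev_endpoint(
    EndpointName="my_dev_endpoint",                    # placeholder name
    RoleArn="arn:aws:iam::123456789012:role/GlueRole", # placeholder IAM role ARN
    NumberOfNodes=2,
)

# Dev endpoints bill while they run, so delete them as soon as you are done.
glue.delete_dev_endpoint(EndpointName="my_dev_endpoint")
```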
We are now well-positioned to go ahead to create our first ETL Job on AWS Glue.
We will also create a corresponding trigger for this job.
Along the way, we will touch upon concepts like jobs, triggers, databases, tables, and data sources that we have already discussed.
Let’s dive right into it:
Before starting, remember to tear down the resources you spin up for this tutorial afterwards, so as not to incur unnecessary costs.
Steps to Load Data:
1. Create an S3 bucket
2. Create an IAM role that we will use for AWS Glue
3. On the AWS Glue console, create a database
During creation, you will be prompted for the Location, where you can add the S3 folder that we created for the database, i.e. s3://…./students_database/.
4. Create a table within your database
To create a table, you will also be required to add the link to the actual data file.
Also, you will need to specify your column names manually.
You can also utilise crawlers to populate your table definitions automatically, or script the setup with boto3, as shown below.
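Here is a rough boto3 sketch of steps 1 and 3; the bucket name is hypothetical, and note that in regions other than us-east-1, create_bucket also needs a CreateBucketConfiguration:

```python
import boto3

# Step 1: create the S3 bucket that will hold the data.
s3 = boto3.client("s3")
s3.create_bucket(Bucket="my-glue-tutorial-bucket")  # hypothetical bucket name

# Step 3: create the catalogue database.
glue = boto3.client("glue")
glue.create_database(DatabaseInput={"Name": "students_database"})
```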
Steps to Create an ETL Job:
1. Configure the job properties
Some of these are its name, IAM role (that we had previously created), job type, and Glue version.
You will also define the S3 path of a script file for the job.
2. Choose a data source
For this example, we can choose the S3 table /students_csv that we had previously created.
3. Choose a transformation type
In our case, change the schema.
4. Choose a data target
Under target, you’ll specify the data store, connection (to the data store), and database name.
A data format is also required for some data stores, such as S3.
5. Create the target folder in S3 and run the job
This usually takes some time, depending on the size of the data.
The result of your job should be saved in the S3 bucket to the target folder specified.
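The same job can also be created and run programmatically; here is a hedged sketch with boto3, where the job name, role, and script path are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Register the job: the script lives at the S3 path defined in step 1.
glue.create_job(
    Name="students_etl_job",
    Role="GlueRole",        # placeholder IAM role
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://my-glue-tutorial-bucket/scripts/students_etl.py",
    },
    GlueVersion="4.0",
)

# Start a run and check its status.
run = glue.start_job_run(JobName="students_etl_job")
status = glue.get_job_run(JobName="students_etl_job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED
```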
6. Create a trigger and add its properties
These include name, trigger type (schedule, job event, an EventBridge event, or on-demand), frequency, and start time.
AWS Glue Blueprints are a helpful tool within the AWS Glue service that can make your data ETL (Extract, Transform, Load) processes more accessible and efficient.
Imagine having a bunch of data scattered across various sources, like databases, logs, or files, and you need to clean, transform, and prepare that data for analysis or storage.
AWS Glue Blueprints are pre-configured, customisable templates that help you do that without starting from scratch.
AWS Glue Blueprints serve several essential functions.
Instead of spending hours or days writing code from scratch to transform your data, you can use a blueprint as a starting point.
They provide a foundation for your ETL jobs.
AWS Glue Blueprints are designed based on best practices, which means they are structured to help you follow industry standards and optimise your data transformations.
While they offer predefined settings and configurations, you can tailor them to your specific data transformation needs.
This allows you to maintain flexibility and control over your ETL process.
Using these blueprints can significantly reduce the time it takes to create ETL jobs.
This is especially valuable when dealing with large volumes of data.
When multiple team members work on ETL processes, blueprints help ensure a consistent approach across projects.
As they are based on established practices, using blueprints can help reduce the risk of errors in your data transformation jobs.
Since AWS Glue is a serverless service, it can automatically scale to handle varying workloads.
Another advantage of blueprints is that they fit seamlessly into this scalable architecture.
Using AWS Glue Blueprints is relatively straightforward.
Start by choosing the blueprint that best matches your data transformation needs.
AWS provides a variety of blueprints for different scenarios, such as data lake creation, data migration, and more.
Once you've selected a blueprint, you can customise it to align with your specific requirements.
You can configure data sources, transformation steps, and destinations as needed.
After customisation, you can run the ETL job based on the blueprint.
AWS Glue takes care of the underlying infrastructure so that you can focus on the transformation logic.
AWS Glue also provides monitoring and debugging tools to help you monitor your ETL jobs and make improvements as necessary.
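At the API level, registering and running a blueprint looks roughly like the sketch below with boto3; the blueprint archive, parameters, and role ARN are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Register a blueprint from a packaged archive published to S3.
glue.create_blueprint(
    Name="data_lake_blueprint",  # hypothetical name
    BlueprintLocation="s3://my-glue-tutorial-bucket/blueprints/data_lake.zip",
)

# Run the blueprint with parameters; this generates the underlying workflow.
glue.start_blueprint_run(
    BlueprintName="data_lake_blueprint",
    Parameters='{"WorkflowName": "my_workflow", "SourcePath": "s3://my_bucket/raw/"}',
    RoleArn="arn:aws:iam::123456789012:role/GlueRole",  # placeholder role ARN
)
```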
In essence, AWS Glue Blueprints are like pre-made ETL workflows that can save you time, effort, and potential headaches when dealing with data transformation tasks in the AWS cloud.
They're a great tool for efficient data processing in your AWS arsenal.
In this blog, I have highlighted the key components of the AWS Glue service.
I also showed you how to set up and perform ETL Jobs on Glue, featuring other services like AWS S3.
Apart from the numerous benefits of this service for data engineers, AWS Glue has an overall positive net effect on any organisation using it.
It enables data to be processed easily and in a more scalable way, while also reducing server costs related to self-hosting ETL tools such as Spark and Python.
Information on the pricing of the service is also worth mentioning.
AWS charges users a monthly fee to store and access metadata in the Glue Data Catalogue.
For ETL jobs and crawlers, there is also a per-second charge with AWS Glue pricing, with a minimum duration of 10 minutes or 1 minute (depending on the Glue version).
AWS also charges per second to connect to a development endpoint for interactive development.
You can find more details about AWS pricing here: https://aws.amazon.com/glue/pricing/
Here are some references and further reading resources that you can check out to further strengthen your knowledge and confidence in using AWS Glue.
Remember, practice makes perfect; it is the best way to learn and stay up to date.
AWS Glue Tutorial for Beginners: https://youtu.be/dQnRP6X8QAU
AWS official documentation for Glue: https://docs.aws.amazon.com/glue/index.html
Case study - Hudl: https://aws.amazon.com/blogs/big-data/how-hudl-built-a-cost-optimized-aws-glue-pipeline-with-apache-hudi-datasets/
Streaming ETL Jobs: https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html
If you have any questions related to what I shared in this blog or would like to seek clarity on AWS Glue, do get in touch with me.