A Practical AWS Glue Guide for Data Engineers

Learn about AWS Glue and data engineering with AWS, plus a tutorial on how to set up and run ETL jobs on AWS Glue.


What is AWS Glue?

AWS Glue is a fully managed ETL service that helps you create jobs (based on Apache Spark, Python, or AWS Glue Studio) to perform extract, transform, and load (ETL) tasks on datasets of almost any size.

AWS Glue is serverless, so there is no infrastructure to manage.

With flexible support for workloads such as ETL and streaming in a single service, AWS Glue caters to many kinds of workloads and users.

Being serverless essentially means that if you want to run Spark jobs, you don't have to set up entire EC2 clusters to do so.

✔️It simplifies the discovery, preparation, movement, and integration of data from multiple sources for analytics users.

It also includes additional productivity and DataOps tooling for authoring, running jobs, and implementing business workflows.

Besides analytics, it can be used for machine learning and application development.

⭐ Its major strength, therefore, is consolidating key data integration capabilities (data discovery, modern ETL, cleansing, transforming, and centralised cataloguing) into a single service, reducing the barrier to entry for building an ETL pipeline.

For Data Engineers, it enables building complex data integration pipelines.

What makes this even more interesting is the wide range of AWS and third-party services it can integrate with.

This means many use cases are covered, and we can go a step further and create custom AWS Glue blueprints to simplify the automation of building data integration pipelines even more.

Below is a primer on the important bits you need to know as a Data Engineer to get up and running with AWS Glue.

📍 Preparing for a tech job? Work with the best MAANG mentors for customised guidance.

Try a free 1:1 trial session.

What can AWS Glue help you achieve as a data engineer?

1. Discover and organise data

• It helps you store, index, and search for data across multiple data sources and sinks by cataloguing all your data in AWS.
• AWS Glue has automatic data discovery. It uses crawlers to discover schema information and add it to your AWS Glue Data Catalogue (see the sketch after this list).
• It helps you manage schemas and permissions by validating and controlling access to your databases and tables.
• It helps you connect to a wide variety of data sources.
• You can tap into multiple data sources, both on-premises and on AWS, using AWS Glue connections to build your data lake.
• Examples of these data sources include Amazon S3, Amazon Redshift, Amazon RDS instances, MongoDB, Kafka, and others.
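
For instance, creating and running a crawler can be scripted with boto3 (the AWS SDK for Python). Below is a minimal sketch; the crawler name, S3 path, and IAM role are hypothetical placeholders, not values from this article:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that scans an S3 prefix and writes the discovered
# table metadata into the "students_database" Data Catalogue database.
glue.create_crawler(
    Name="students-crawler",                              # hypothetical name
    Role="glue-admin-access",                             # IAM role with Glue permissions
    DatabaseName="students_database",
    Targets={"S3Targets": [{"Path": "s3://my_bucket/data/students_database/"}]},
)

# Run it on demand; the inferred schema lands in the Data Catalogue.
glue.start_crawler(Name="students-crawler")
```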


2. Transform, prepare, and clean data for analysis

• Visually transform data with a drag-and-drop interface. Create your ETL process in the drag-and-drop job editor, and it will automatically generate the code to extract, transform, and load your data (a skeleton of such a script is sketched after this list).
• You can build complex and highly automated ETL pipelines with simple job scheduling. Glue lets you invoke jobs on a schedule, on demand, or based on an event.
• It also gives you the ability to clean and transform streaming data in transit, enabling continuous data consumption into your target data store. In-transit data can also be encrypted via SSL for security.
• Automatic data cleaning and deduplication via machine learning.
• Built-in job notebooks via AWS Glue Studio, which are serverless and let you get started quickly with minimal setup effort.
• With AWS Glue interactive sessions, you can interactively explore, prepare, edit, debug, and test your data. Using your favourite IDE or notebook, you can explore, experiment, and process data interactively.
• Define, detect, and remediate sensitive data. AWS Glue's sensitive data detection allows you to identify sensitive data in your pipelines and lakes and process it accordingly.
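
To give a feel for what such a script looks like, here is a minimal PySpark Glue job skeleton. It assumes a hypothetical students_database database and students_csv table already exist in the Data Catalogue, and a hypothetical output path; a script generated by the visual editor will differ in detail:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the table registered in the Data Catalogue.
students = glue_context.create_dynamic_frame.from_catalog(
    database="students_database", table_name="students_csv"
)

# Transform: rename/retype columns (the "change schema" transformation).
mapped = ApplyMapping.apply(
    frame=students,
    mappings=[("name", "string", "student_name", "string"),
              ("score", "string", "score", "int")],   # hypothetical columns
)

# Load: write the result to the target folder as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my_bucket/output/students/"},  # hypothetical path
    format="parquet",
)

job.commit()
```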


3. Build and monitor data pipelines

• Dynamically scale resources based on workload. AWS Glue automatically scales resources up and down as needed, and workers are assigned to jobs only when they are needed.
• Automate jobs with event-based triggers. Use event-based triggers to start crawlers or AWS Glue jobs, and design a chain of dependent jobs and crawlers.
• Run your AWS Glue jobs, and then monitor them with automated monitoring tools such as the Apache Spark UI, AWS Glue job run insights, and AWS CloudTrail (a small monitoring sketch follows this list).
• Set up workflows for ETL and integration activities. Create workflows for ETL and integration activities using multiple crawlers, jobs, and triggers.
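
As a small illustration of programmatic monitoring, job run states can also be polled with boto3; the job name below is hypothetical:

```python
import boto3

glue = boto3.client("glue")

# List the most recent runs of a job and print their states
# (e.g. RUNNING, SUCCEEDED, FAILED).
response = glue.get_job_runs(JobName="students-etl-job", MaxResults=5)
for run in response["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("ErrorMessage", ""))
```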


What Are the Fundamental Concepts of AWS Glue?

[Image: AWS Glue components and key concepts]

(Image source: https://docs.aws.amazon.com/glue/latest/dg/components-key-concepts.html)

📌 Connect one on one with top tech experts; upskill strategically to bag your dream career.

1. Data catalogue: A persistent metadata store

This is a managed service that lets you store, annotate, and share metadata, which can be used to query and transform data.

👉 Metadata can be data location, schema information, data types, and data classification.

You get one Data Catalogue per AWS region per account.
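
The catalogue can also be queried programmatically. A brief sketch with boto3; the database name is hypothetical:

```python
import boto3

glue = boto3.client("glue")

# List the databases in the Data Catalogue, then the tables in one of them.
for db in glue.get_databases()["DatabaseList"]:
    print("database:", db["Name"])

tables = glue.get_tables(DatabaseName="students_database")   # hypothetical database
for table in tables["TableList"]:
    location = table.get("StorageDescriptor", {}).get("Location", "")
    print("table:", table["Name"], "->", location)
```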

2. Databases

A database in AWS Glue is a set of associated Data Catalogue table definitions organised into a logical group.

Therefore, tables belong to databases.

Interestingly, databases can contain tables from more than one data source.

👉 Deleting a database also deletes all the tables within it.

Databases can be created from the Databases page of the AWS Glue console, or programmatically, as sketched below.
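
A minimal sketch of creating a database with boto3; the name and description are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Create a logical database in the Data Catalogue to group table definitions.
glue.create_database(
    DatabaseInput={
        "Name": "students_database",
        "Description": "Table definitions for the student dataset",
    }
)
```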

3. Tables

These are the metadata definitions that represent your data.

Tables are just a representation of your schemas.

The data itself actually resides in another location, such as S3.

Tables can only belong to one database at a time.

There are multiple ways of creating tables and adding table definitions to the Data Catalogue.

👉 Some of the most common ways are: running a crawler, creating them manually via the AWS Glue console, or using the AWS Glue API.
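
For illustration, here is a minimal sketch of registering a CSV table through the API with boto3; the database, columns, and S3 location are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Register a table definition that points at CSV files in S3.
glue.create_table(
    DatabaseName="students_database",
    TableInput={
        "Name": "students_csv",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "student_id", "Type": "int"},
                {"Name": "name", "Type": "string"},
                {"Name": "score", "Type": "int"},
            ],
            "Location": "s3://my_bucket/data/students_database/students_csv/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```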

4. Partitions

Partitioning, as AWS puts it, is an important technique for organising datasets so they can be queried efficiently.

👉 It involves arranging your data in a hierarchical directory structure according to the distinct values of one or more columns.

In AWS Glue, a partition is a logical entity rather than an actual service; it is represented as partition columns in a Glue table.

Folders where data is stored in S3 (physical entities) are mapped to partitions (logical entities).

👉 For example, sales/log data can be added to S3 with partitions for years, months, and days, like s3://my_bucket/sales/year=2019/month=Jan/day=1/.

This makes it easier for services such as Glue, Amazon Athena, and Amazon Redshift Spectrum to filter your data by partition instead of scanning the entire dataset.

Below is an image of an S3 folder structure that is partitioned according to multiple column names. 👇

[Image: S3 folder structure partitioned by multiple column names]

(Source: https://docs.aws.amazon.com/glue/latest/dg/crawler-s3-folder-table-partition.html)
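
In a Glue job, that partition layout lets you push filters down to the source so only the relevant S3 prefixes are read. A brief sketch, assuming a partitioned table like the one above has already been catalogued (database and table names are hypothetical):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read only the January 2019 partitions instead of scanning the whole dataset.
sales = glue_context.create_dynamic_frame.from_catalog(
    database="sales_database",          # hypothetical catalogue database
    table_name="sales_logs",            # hypothetical partitioned table
    push_down_predicate="year == '2019' and month == 'Jan'",
)
print(sales.count())
```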

5. Crawler

Crawlers are a way to automatically recognise the metadata of your data sources and use it to define your tables.

Without crawlers, all schema definitions would have to be added to your tables manually; crawlers automate this for you.

🎯 Crawlers use what are known as classifiers to infer the format and structure of your data.

There are custom and built-in classifiers.

Custom classifiers (which you define yourself) run first and try to recognise the schema of your data.

If none matches, the crawler falls back to the built-in classifiers.

The inferred metadata is then saved to your Data Catalogue.

🎯 Crawlers also work with connections to the data stores from which they get the data.
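
As an illustration, a custom CSV classifier can be defined with boto3 and attached to a crawler so it is tried before the built-in classifiers; the names, columns, and paths below are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Define a custom CSV classifier describing the expected header.
glue.create_classifier(
    CsvClassifier={
        "Name": "students-csv-classifier",
        "Delimiter": ",",
        "ContainsHeader": "PRESENT",
        "Header": ["student_id", "name", "score"],
    }
)

# Attach it to a crawler; it runs before the built-in classifiers.
glue.create_crawler(
    Name="students-crawler-with-classifier",
    Role="glue-admin-access",
    DatabaseName="students_database",
    Classifiers=["students-csv-classifier"],
    Targets={"S3Targets": [{"Path": "s3://my_bucket/data/students_database/"}]},
)
```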

6. Connections

These are Data Catalogue objects that contain the properties required to connect to particular data stores.

🎯 They contain data such as login credentials, URI strings, and virtual private cloud (VPC) information.

Some of the data sources we can connect to are JDBC databases, AWS services (RDS, Redshift, DocumentDB), Kafka, MongoDB, or a network connection to an AWS VPC.
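
A minimal sketch of storing a JDBC connection with boto3; the connection name, URL, and credentials are hypothetical placeholders:

```python
import boto3

glue = boto3.client("glue")

# Store the connection properties for a JDBC data store in the Data Catalogue.
glue.create_connection(
    ConnectionInput={
        "Name": "students-postgres-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://example-host:5432/students",
            "USERNAME": "glue_user",
            "PASSWORD": "replace-me",   # prefer AWS Secrets Manager in practice
        },
    }
)
```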

7. Jobs

A job is the business logic required to perform ETL work.

It consists of a transformation script, data sources, and data targets.

ETL jobs can be initiated by triggers, which can be scheduled or fired by events.

🎯 There are 3 types of jobs on AWS Glue:

• Spark: Runs in an Apache Spark environment managed by AWS Glue
• Streaming ETL: Similar to a Spark job, but performs ETL on streaming data
• Python shell: Runs a Python script as a shell

In addition, jobs can write output files in data formats such as JSON, CSV, Apache Parquet, Apache Avro, or ORC (Optimised Row Columnar).
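
Jobs can also be registered and started through the API. A minimal sketch with boto3; the job name, script path, role, and worker settings are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Register a Spark ETL job that runs the script uploaded to S3.
glue.create_job(
    Name="students-etl-job",                               # hypothetical name
    Role="glue-admin-access",                              # IAM role for Glue
    Command={
        "Name": "glueetl",                                 # Spark job type
        "ScriptLocation": "s3://my_bucket/scripts/students_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    NumberOfWorkers=2,
    WorkerType="G.1X",
    DefaultArguments={"--TempDir": "s3://my_bucket/temp-dir/"},
)

# Start a run of the job on demand.
glue.start_job_run(JobName="students-etl-job")
```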

8. Triggers

A trigger initiates an ETL job.

Triggers can be scheduled to run daily, hourly, or on a custom schedule, so you don't have to start ETL jobs manually at the push of a button.

This is a similar concept to running cron jobs.

🎯 Apart from being scheduled, ETL jobs can also be triggered on the successful completion of another ETL job.

Other modes are on-demand and EventBridge event triggers.
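
A minimal sketch of a scheduled trigger created with boto3; the trigger and job names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Schedule the job to run every day at 02:00 UTC (cron syntax, like a cron job).
glue.create_trigger(
    Name="students-daily-trigger",             # hypothetical name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "students-etl-job"}],
    StartOnCreation=True,
)
```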

9. Dev endpoints

A development endpoint is an environment that you can use to develop and test your AWS Glue scripts.

It's essentially an abstracted cluster.

Dev endpoints can be expensive, so it is important to shut them down after use.

How to Create an ETL Job on AWS Glue?

We are now well-positioned to create our first ETL job on AWS Glue.

We will also create a corresponding trigger for this job.

Along the way, we will touch upon concepts like jobs, triggers, databases, tables, and data sources that we have already discussed.

Let’s dive right into it:

✅ Prerequisites: Create an AWS account

Before starting, note that you should tear down the resources you spin up for this tutorial once you are done, so you don't incur unnecessary costs.

Steps to Load Data:

1. Create an S3 bucket

• We will use this S3 bucket to store the dataset we need to load into AWS Glue
• Under the bucket, create a /data folder where we will add the initial dataset
• Also, create a /temp-dir folder that AWS Glue will use as its temporary folder. We will supply this folder name to AWS Glue so it can store its temporary files there
• Under /data, we can add our database. Essentially, these will also be folders, for example: /students_database
• Under our database folder, we can create another folder to represent our table. Something like: /students_csv (see the sketch after this list)
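
If you prefer to script this setup, here is a brief sketch with boto3; the bucket name and local file are hypothetical (bucket names must be globally unique):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-glue-tutorial-bucket"   # hypothetical; pick your own unique name

# Create the bucket (outside us-east-1, add a CreateBucketConfiguration
# with your region as the LocationConstraint).
s3.create_bucket(Bucket=bucket)

# "Folders" in S3 are just key prefixes; create the temp prefix and upload
# the initial dataset under the database/table prefix.
s3.put_object(Bucket=bucket, Key="temp-dir/")
s3.upload_file(
    "students.csv",                                           # hypothetical local file
    bucket,
    "data/students_database/students_csv/students.csv",
)
```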


2. Create an IAM role that we will use for AWS Glue

• Under Roles, create a new AWS service role by selecting Glue
• Under permissions, give it Administrator Access. Note that this is not usually recommended in practice, but for the purposes of this quick start, it will do
• Give the role any name, for example glue-admin-access, and create it (see the sketch after this list)
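
The same role can be created with boto3 instead of the console; a minimal sketch, using the hypothetical role name from above:

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy letting the AWS Glue service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="glue-admin-access",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Administrator access keeps the tutorial simple; scope this down in practice.
iam.attach_role_policy(
    RoleName="glue-admin-access",
    PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
)
```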


3. On the AWS Glue console, create a database

During creation, you will be prompted for the Location, where you can add the S3 folder that we created for the database, that is s3://…./students_database/.

4. Create a table within your database

To create a table, you will also be required to add the link to the actual data file.

Also, you will need to specify your column names manually.

You can also use crawlers to create the table definitions automatically.

✅ Creating Glue Jobs

1. Set the job properties

Some of these are its name, the IAM role (that we created previously), the job type, and the Glue version.

You will also define the S3 path for the job's script file.

2. Choose a data source

For this example, we can choose the S3 table /students_csv that we created previously.

3. Choose a transformation type

In our case, change the schema.

4. Choose a data target

Under target, you’ll specify the data store, the connection (to the data store), and the database name.

A data format is also required for some data stores, such as S3.

5. Create the target folder in S3 and run the job

This usually takes some time, depending on the size of the data.

The result of your job should be saved in the S3 bucket, under the target folder you specified.
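
If you would rather start the run and watch it from code, here is a brief sketch with boto3; the job name is hypothetical:

```python
import time
import boto3

glue = boto3.client("glue")

# Kick off the job and poll until it reaches a terminal state.
run_id = glue.start_job_run(JobName="students-etl-job")["JobRunId"]
while True:
    run = glue.get_job_run(JobName="students-etl-job", RunId=run_id)
    state = run["JobRun"]["JobRunState"]
    print("job state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```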

✅ Set up a trigger

Create a trigger and add its properties.

These include the name, trigger type (schedule, job event, EventBridge event, or on demand), frequency, and start time.

📌 Learn AWS Glue in detail, with practical application.

Try 1:1 mentorship with me.

⭐ What Are AWS Glue Blueprints?

AWS Glue Blueprints are a helpful tool within the AWS Glue service that can make your ETL (Extract, Transform, Load) processes more accessible and efficient.

Imagine having a bunch of data scattered across various sources, like databases, logs, or files, and you need to clean, transform, and prepare that data for analysis or storage.

AWS Glue Blueprints are pre-configured, customisable templates that help you do that without starting from scratch.

What's their function?

AWS Glue Blueprints serve several essential functions.

1. Accelerating development

Instead of spending hours or days writing code from scratch to transform your data, you can use a blueprint as a starting point.

They provide a foundation for your ETL jobs.

2. Best practices

AWS Glue Blueprints are designed based on best practices, which means they are structured to help you follow industry standards and optimise your data transformations.

3. Customisability

While they offer predefined settings and configurations, you can tailor them to your specific data transformation needs.

This allows you to maintain flexibility and control over your ETL process.

4. Time savings

Using these blueprints can significantly reduce the time it takes to create ETL jobs.

This is especially valuable when dealing with large volumes of data.

5. Consistency

When multiple team members work on ETL processes, blueprints help ensure a consistent approach across projects.

6. Error reduction

As they are based on established practices, using blueprints can help reduce the risk of errors in your data transformation jobs.

7. Scalability

Since AWS Glue is a serverless service, it can automatically scale to handle varying workloads.

Another advantage of blueprints is that they fit seamlessly into this scalable architecture.

How Do You Use AWS Glue Blueprints?

Using AWS Glue Blueprints is relatively straightforward.

1. Select a blueprint

Start by choosing the blueprint that best matches your data transformation needs.

AWS provides a variety of blueprints for different scenarios, such as data lake creation, data migration, and more.

2. Customise

Once you've selected a blueprint, you can customise it to align with your specific requirements.

You can configure data sources, transformation steps, and destinations as needed.

3. Execute

After customisation, you can run the ETL job based on the blueprint.

AWS Glue takes care of the underlying infrastructure so that you can focus on the transformation logic.

4. Monitor and iterate

AWS Glue also provides monitoring and debugging tools to help you monitor your ETL jobs and make improvements as necessary.
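
Blueprints can also be registered and run programmatically. A brief sketch with boto3, assuming a blueprint archive has already been uploaded to S3; the names, parameters, paths, and account number below are hypothetical:

```python
import json
import boto3

glue = boto3.client("glue")

# Register a blueprint from a layout archive stored in S3.
glue.create_blueprint(
    Name="s3-to-parquet-blueprint",
    BlueprintLocation="s3://my_bucket/blueprints/s3_to_parquet.zip",
)

# Run the blueprint with parameters; this generates the workflow it defines.
run = glue.start_blueprint_run(
    BlueprintName="s3-to-parquet-blueprint",
    Parameters=json.dumps({"WorkflowName": "students-workflow",
                           "InputPath": "s3://my_bucket/data/"}),
    RoleArn="arn:aws:iam::123456789012:role/glue-admin-access",  # hypothetical account
)
print(run["RunId"])
```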

In essence, AWS Glue Blueprints are like pre-made ETL workflows that can save you time, effort, and potential headaches when dealing with data transformation tasks in the AWS cloud.

They're a great tool for efficient data processing in your AWS arsenal.

Summarising

In this blog, I have highlighted the key components of the AWS Glue service.

I also showed you how to set up and run ETL jobs on Glue, featuring other services like Amazon S3.

Apart from the numerous benefits of this service for data engineers, AWS Glue has an overall positive net effect on any organisation using it.

It enables data to be processed easily and in a more scalable way, while also reducing the server costs of self-hosting ETL tools such as Spark and Python.

Information on the pricing of the service is also worth mentioning.

AWS charges users a monthly fee to store and access metadata in the Glue Data Catalogue.

For ETL jobs and crawlers, there is also a per-second charge, with a minimum duration of 10 minutes or 1 minute (depending on the Glue version).

A per-second charge also applies when connecting to a development endpoint for interactive development.

You can find more details about AWS Glue pricing here: https://aws.amazon.com/glue/pricing/

What Next

Here are some references and further reading resources that you can check out to further strengthen your knowledge and confidence in using AWS Glue.

Remember, practice makes perfect, and it is the best way to learn and stay up to date.

AWS Glue Tutorial for Beginners: https://youtu.be/dQnRP6X8QAU

AWS official documentation for Glue: https://docs.aws.amazon.com/glue/index.html

Case study - Hudl: https://aws.amazon.com/blogs/big-data/how-hudl-built-a-cost-optimized-aws-glue-pipeline-with-apache-hudi-datasets/

Streaming ETL Jobs: https://docs.aws.amazon.com/glue/latest/dg/add-job-streaming.html

PySpark: https://github.com/hyunjoonbok/PySpark

If you have any questions related to what I shared in this blog or would like to seek clarity on AWS Glue, do get in touch with me.