AWS Glue: Crawler, Catalog, and ETL Tool

Glue is a sticky wet substance that binds things together when it dries. It is also the name for a new serverless offering from Amazon called AWS Glue. Cool Marketing for sure!

So what is AWS Glue?

Glue can crawl the data assets in your AWS environment and store what it finds in a catalog. You have probably heard the word metadata; that is exactly the kind of data Glue discovers and stores. Metadata is simply ‘data about data,’ or a description of your data assets. For instance, if I have a Parquet file in S3, I would simply have a Glue crawler scan that bucket for data about that file. The crawler would explore the structure of the file, its data types, and other schema details and store them in the Glue catalog. This catalog can then be used as a directory of data within your AWS environment. Of course, Glue needs the right security permissions to scan the data.
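To make the crawl step concrete, here is a minimal sketch of pointing a crawler at an S3 path with boto3, the AWS SDK for Python. The helper names are my own, and the crawler name, role ARN, database, and bucket path are placeholders, not anything Glue creates for you. The function takes the Glue client as an argument so you can see the exact request before anything is sent.

```python
def crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for Glue's create_crawler call.

    All four values are placeholders you supply; the role must already
    grant Glue permission to read the S3 path.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client, name, role_arn, database, s3_path):
    """Register the crawler, then kick off one crawl (hypothetical helper)."""
    request = crawler_request(name, role_arn, database, s3_path)
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=name)
    return request
```

With boto3 installed and credentials configured, you would pass in `boto3.client("glue")` as `glue_client`. The crawl runs asynchronously, so the tables show up in the catalog some time after the call returns.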

Once the data about your data is in the catalog it can be used as a service to do things. It is indexed, searchable, and you can query the catalog just like you can query a database. So think of Glue as a central repository for data about your data in your organization.
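As a rough illustration of reading the catalog back, the sketch below lists every table in one catalog database along with its column names and types, using the Glue `get_tables` call from boto3. The function name and the database name you pass in are my own placeholders, and it assumes the tables were stored with a standard storage descriptor.

```python
def list_table_schemas(glue_client, database):
    """Return {table_name: [(column_name, column_type), ...]} for one database.

    Hypothetical helper: it follows NextToken so large databases are read
    page by page.
    """
    schemas = {}
    kwargs = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        for table in page.get("TableList", []):
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = [
                (col["Name"], col.get("Type", "")) for col in columns
            ]
        token = page.get("NextToken")
        if not token:
            return schemas
        kwargs["NextToken"] = token  # ask for the next page
```

Calling it with `boto3.client("glue")` and a database name gives you a plain dictionary you can search or print, which is the "directory of data" idea in miniature.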

Another very powerful use for this catalog is the three most expensive letters in data: E-T-L. ETL stands for Extract, Transform, and Load, which means I can change data or combine data together. Glue even comes with a built-in job scheduler. ETL usually requires expensive tools and people, and AWS Glue can alleviate some of that burden. (See below for pricing.)

What kind of data can I use AWS Glue with? Glue supports S3, Aurora, all other AWS RDS engines, Redshift, and common database engines running on EC2 in your VPC (Virtual Private Cloud).

How much does AWS Glue cost? Glue has an hourly rate and you are billed by the second for crawlers (data discovery) and ETL jobs. You also pay for the storage of data in the AWS Glue Catalog. The first million objects stored are free and the first million accesses are free.

Let’s take a look at an example of pricing:


Say you store one million tables in the catalog and make two million requests to it per month. Let’s say you also use crawlers to find new tables, and they run for 30 minutes and consume 2 DPUs.

Your storage cost is $0, as the storage for your first million tables is free. Your first million requests are also free. You will be billed for one million requests above the free tier, which is $1.

Crawlers are billed at $0.44 per DPU-Hour, so you will pay for 2 DPUs * 1/2 hour at $0.44 per DPU-Hour or $0.44. This is a total monthly bill of $1.44.
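The same arithmetic can be written as a short script. This is only a sketch of the example above: the rates are the ones quoted in this post ($1 per million requests beyond the free tier and $0.44 per DPU-Hour), the function name is my own, and storage is left out because the example stays inside the free million objects. Check the current AWS price list before relying on any of these numbers.

```python
FREE_REQUESTS = 1_000_000            # first million requests are free
PRICE_PER_MILLION_REQUESTS = 1.00    # USD, as quoted in the example
CRAWLER_PRICE_PER_DPU_HOUR = 0.44    # USD, as quoted in the example


def monthly_glue_cost(requests, crawler_dpus, crawler_seconds):
    """Return (request_cost, crawler_cost, total) in USD.

    Assumes catalog storage stays within the free tier, as in the example.
    Crawler time is given in seconds because billing is per second.
    """
    billable_requests = max(0, requests - FREE_REQUESTS)
    request_cost = billable_requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    crawler_hours = crawler_seconds / 3600
    crawler_cost = crawler_dpus * crawler_hours * CRAWLER_PRICE_PER_DPU_HOUR
    total = request_cost + crawler_cost
    return round(request_cost, 2), round(crawler_cost, 2), round(total, 2)


# Two million requests, 2 DPUs for 30 minutes
print(monthly_glue_cost(2_000_000, 2, 30 * 60))  # (1.0, 0.44, 1.44)
```

Changing the inputs shows how the bill scales: doubling the crawler run time doubles only the crawler portion, while anything up to one million requests costs nothing.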

Let’s wrap it up: Nothing to install!

AWS Glue is serverless. What does serverless mean? Serverless means I don’t have to stand anything up. AWS Glue is just a service that is always available. It is multi-tenant and highly secure; it is just there! So in order to use Glue, all you have to do is use it. You can find it within your AWS Console. You only pay for the service while your jobs are running. AWS takes care of the elasticity and the infrastructure for you. This means you don’t have to keep an EC2 instance up and running to use it. Nice!

Data Nerd! Walking the Data wire for 30 years. If you are serious about data and analytics then I might be interesting to you!
