
A straightforward method to automate data ingestion from S3 buckets (the data lake) into a Redshift cluster (the data warehouse) using AWS Glue.
Create a Redshift cluster inside a VPC, with at least one security group assigned to it.
Set up the security group within the VPC.
Go to VPC > Security > Security Groups, find the group attached to the Redshift cluster, and click Inbound rules > Edit inbound rules. Add a new rule of type All TCP, set the source to Custom, and select the same security group you are editing (a self-referencing rule); a scripted version of this step is sketched below.
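If you prefer to script this step, the following boto3 sketch adds the same self-referencing All TCP rule; the security group ID is a placeholder for the group attached to your cluster.

import boto3

# Placeholder: ID of the security group attached to the Redshift cluster.
REDSHIFT_SG_ID = "sg-0123456789abcdef0"

ec2 = boto3.client("ec2")

# Self-referencing rule: allow all TCP traffic from members of the same group.
ec2.authorize_security_group_ingress(
    GroupId=REDSHIFT_SG_ID,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": REDSHIFT_SG_ID}],
        }
    ],
)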
Create a new role, or edit one you already have, to grant your Glue jobs permission to run and to read from S3. Go to IAM > Roles > Create Role, select Glue as the trusted service, and look for the following two policies.
Then you will need to create and attach two more policies:
1. IAMPassRole
2. S3ReadAccess (policy document shown below)
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::*/*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::*"
        }
    ]
}
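For reference, a minimal boto3 sketch of the same role setup follows. The role name is a placeholder, and the inline policy simply reuses the S3 read policy above; attach the PassRole policy and the Glue service policies the same way, as described in the console steps.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets AWS Glue assume the role.
glue_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Placeholder role name.
iam.create_role(
    RoleName="glue-s3-to-redshift-role",
    AssumeRolePolicyDocument=json.dumps(glue_trust_policy),
)

# The S3ReadAccess policy shown above, attached as an inline policy.
s3_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::*/*",
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": "arn:aws:s3:::*",
        },
    ],
}

iam.put_role_policy(
    RoleName="glue-s3-to-redshift-role",
    PolicyName="S3ReadAccess",
    PolicyDocument=json.dumps(s3_read_policy),
)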
S3 Wheel File
Download the following wheel file (link); it provides the redshift_connector package that the job uses to manage the connection to Redshift and write to it. If you specified a bucket in the S3 policy, make sure to upload the file to that same bucket.
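If you want to script the upload, a short boto3 sketch is shown below; the bucket name and key mirror the placeholder path used later in the job parameters and should be replaced with your own.

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and key: use the same path you will reference in --additional-python-modules.
s3.upload_file(
    Filename="redshift_connector-1.2-py3-none-any.whl",
    Bucket="your_bucket",
    Key="path_1/redshift_connector/redshift_connector-1.2-py3-none-any.whl",
)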
Glue Connections
Create two connections. Go to Glue > Connections > Add Connection:
Connection name: glue-connection-tovpc
Connection type: Network
VPC: Same as your Redshift cluster
Subnet: Subnet within your VPC
Security groups: Choose the one you previously created
Connection name: glue-connection-toredshift
Connection type: JDBC
JDBC URL: Go to Redshift > MyRedshiftCluster > General information > JDBC URL
Username: A Redshift user with write permissions on the target tables.
Password: The password for that user.
VPC: Same as your Redshift cluster
Subnet: Subnet within your VPC
Security groups: Choose the one you previously created
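The same two connections can also be created with boto3, as sketched below; the subnet, security group, availability zone, JDBC URL, and credentials are placeholders taken from your own VPC and cluster.

import boto3

glue = boto3.client("glue")

# Shared network placement: placeholder values from your VPC.
placement = {
    "SubnetId": "subnet-0123456789abcdef0",
    "SecurityGroupIdList": ["sg-0123456789abcdef0"],
    "AvailabilityZone": "us-east-1a",  # AZ of the subnet above
}

# Network connection used to reach resources inside the VPC.
glue.create_connection(
    ConnectionInput={
        "Name": "glue-connection-tovpc",
        "ConnectionType": "NETWORK",
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": placement,
    }
)

# JDBC connection pointing at the Redshift cluster.
glue.create_connection(
    ConnectionInput={
        "Name": "glue-connection-toredshift",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://mycluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
            "USERNAME": "awsuser",        # Redshift user with write permissions
            "PASSWORD": "your_password",  # placeholder
        },
        "PhysicalConnectionRequirements": placement,
    }
)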
Glue Job Configuration
Create a new Glue Job:
Name: automate-s3-to-redshift
IAM role: Role that you previously created.
Type: Spark / Python Shell
Glue version: Spark 2.4, Python 3 (Glue Version 2.0) / Python 3 (Glue Version 1.0)
This job runs: A new script to be authored by you
Connections: Add the two that were created in previous steps.
Go to Actions > Edit Job > Security configuration > Job parameters:
Key: --additional-python-modules
Value: s3://your_bucket/path_1/redshift_connector/redshift_connector-1.2-py3-none-any.whl (the S3 object URI of the wheel)
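For completeness, an equivalent job definition via boto3 might look like the sketch below; the role name, script location, and wheel path are placeholders.

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="automate-s3-to-redshift",
    Role="glue-s3-to-redshift-role",  # placeholder: the role created earlier
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",  # Spark job; use "pythonshell" for a Python shell job
        "ScriptLocation": "s3://your_bucket/scripts/automate-s3-to-redshift.py",  # placeholder
        "PythonVersion": "3",
    },
    Connections={"Connections": ["glue-connection-tovpc", "glue-connection-toredshift"]},
    DefaultArguments={
        "--additional-python-modules": "s3://your_bucket/path_1/redshift_connector/redshift_connector-1.2-py3-none-any.whl"
    },
)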
Glue Job Script
The wheel installs the redshift_connector package, which the job script uses to open a connection to the cluster and load the data from S3 into Redshift. A minimal version of that script is sketched below.
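This sketch assumes a CSV dataset, a single target table, and that the cluster has an attached IAM role able to read the source bucket; every endpoint, credential, table, path, and ARN is a placeholder to replace with your own values.

import redshift_connector  # installed from the wheel via --additional-python-modules

# Placeholders: replace with your cluster endpoint, credentials, and data locations.
HOST = "mycluster.abc123.us-east-1.redshift.amazonaws.com"
DATABASE = "dev"
USER = "awsuser"
PASSWORD = "your_password"
TARGET_TABLE = "public.my_table"
SOURCE_PATH = "s3://your_bucket/path_to_data/"
COPY_ROLE_ARN = "arn:aws:iam::123456789012:role/redshift-copy-role"  # role attached to the cluster

# Open a connection to Redshift through the connections attached to the job.
conn = redshift_connector.connect(
    host=HOST,
    database=DATABASE,
    user=USER,
    password=PASSWORD,
)

try:
    cursor = conn.cursor()
    # Load the files from the data lake into the warehouse table.
    cursor.execute(
        f"COPY {TARGET_TABLE} "
        f"FROM '{SOURCE_PATH}' "
        f"IAM_ROLE '{COPY_ROLE_ARN}' "
        f"FORMAT AS CSV IGNOREHEADER 1"
    )
    conn.commit()
finally:
    conn.close()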
The pipeline from S3 to Redshift is now automated.