
A straightforward method to automate data ingestion from S3 buckets (the data lake) into a Redshift cluster (the data warehouse) using AWS Glue.
Create a Redshift cluster inside a VPC, with at least one security group assigned to it.
Set up the security group within the VPC.
Go to VPC > Security > Security Groups, find the group attached to the Redshift cluster, and click Inbound rules > Edit inbound rules. Add a new rule of type All TCP, set the source to Custom, and select the same security group you are editing (a self-referencing rule); a scripted version of this step is sketched below.
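If you prefer to script this step, the following boto3 sketch adds the same self-referencing All TCP rule; the security group ID is a placeholder for the group attached to your cluster.

import boto3

# Placeholder: ID of the security group attached to the Redshift cluster.
REDSHIFT_SG_ID = "sg-0123456789abcdef0"

ec2 = boto3.client("ec2")

# Self-referencing rule: allow all TCP traffic from members of the same group.
ec2.authorize_security_group_ingress(
    GroupId=REDSHIFT_SG_ID,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": REDSHIFT_SG_ID}],
        }
    ],
)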
Create a new role, or edit one you already have, to grant your Glue jobs permission to run and to read from S3. Go to IAM > Roles > Create Role, select Glue as the trusted service, and look for the following two policies.
Then you will need to create and attach two more policies:
1. IAMPassRole
2. S3ReadAccess (policy document shown below)
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::*/*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::*"
        }
    ]
}
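For reference, a minimal boto3 sketch of the same role setup follows. The role name is a placeholder, and the inline policy simply reuses the S3 read policy above; attach the PassRole policy and the Glue service policies the same way, as described in the console steps.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets AWS Glue assume the role.
glue_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Placeholder role name.
iam.create_role(
    RoleName="glue-s3-to-redshift-role",
    AssumeRolePolicyDocument=json.dumps(glue_trust_policy),
)

# The S3ReadAccess policy shown above, attached as an inline policy.
s3_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::*/*",
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": "arn:aws:s3:::*",
        },
    ],
}

iam.put_role_policy(
    RoleName="glue-s3-to-redshift-role",
    PolicyName="S3ReadAccess",
    PolicyDocument=json.dumps(s3_read_policy),
)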
S3 Wheel File
Download the following wheel file (link); it provides the redshift_connector package that the job uses to manage the connection to Redshift and write to it. If you specified a bucket in the S3 policy, make sure to upload the file to that same bucket.
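If you want to script the upload, a short boto3 sketch is shown below; the bucket name and key mirror the placeholder path used later in the job parameters and should be replaced with your own.

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and key: use the same path you will reference in --additional-python-modules.
s3.upload_file(
    Filename="redshift_connector-1.2-py3-none-any.whl",
    Bucket="your_bucket",
    Key="path_1/redshift_connector/redshift_connector-1.2-py3-none-any.whl",
)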
Glue Connections
Create two connections. Go to Glue > Connections > Add Connection:
Connection name: glue-connection-tovpc
Connection type: Network
VPC: Same as your Redshift cluster
Subnet: Subnet within your VPC
Security groups: Choose the one you previously created
Connection name: glue-connection-toredshift
Connection type: JDBC
JDBC URL: Go to Redshift > MyRedshiftCluster > General information > JDBC URL
Username: A Redshift user with write permissions on the target tables.
Password: The password for that user.
VPC: Same as your Redshift cluster
Subnet: Subnet within your VPC
Security groups: Choose the one you previously created
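The same two connections can also be created with boto3, as sketched below; the subnet, security group, availability zone, JDBC URL, and credentials are placeholders taken from your own VPC and cluster.

import boto3

glue = boto3.client("glue")

# Shared network placement: placeholder values from your VPC.
placement = {
    "SubnetId": "subnet-0123456789abcdef0",
    "SecurityGroupIdList": ["sg-0123456789abcdef0"],
    "AvailabilityZone": "us-east-1a",  # AZ of the subnet above
}

# Network connection used to reach resources inside the VPC.
glue.create_connection(
    ConnectionInput={
        "Name": "glue-connection-tovpc",
        "ConnectionType": "NETWORK",
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": placement,
    }
)

# JDBC connection pointing at the Redshift cluster.
glue.create_connection(
    ConnectionInput={
        "Name": "glue-connection-toredshift",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://mycluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev",
            "USERNAME": "awsuser",        # Redshift user with write permissions
            "PASSWORD": "your_password",  # placeholder
        },
        "PhysicalConnectionRequirements": placement,
    }
)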
Glue Job Configuration
Create a new Glue Job:
Name: automate-s3-to-redshift
IAM role: Role that you previously created.
Type: Spark / Python Shell
Glue version: Spark 2.4, Python 3 (Glue Version 2.0) / Python 3 (Glue Version 1.0)
This job runs: A new script to be authored by you
Connections: Add the two that were created in previous steps.
Go to Actions > Edit Job > Security configuration > Job parameters:
Key: --additional-python-modules
Value: s3://your_bucket/path_1/redshift_connector/redshift_connector-1.2-py3-none-any.whl (the S3 object URI of the wheel)
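For completeness, an equivalent job definition via boto3 might look like the sketch below; the role name, script location, and wheel path are placeholders.

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="automate-s3-to-redshift",
    Role="glue-s3-to-redshift-role",  # placeholder: the role created earlier
    GlueVersion="2.0",
    Command={
        "Name": "glueetl",  # Spark job; use "pythonshell" for a Python shell job
        "ScriptLocation": "s3://your_bucket/scripts/automate-s3-to-redshift.py",  # placeholder
        "PythonVersion": "3",
    },
    Connections={"Connections": ["glue-connection-tovpc", "glue-connection-toredshift"]},
    DefaultArguments={
        "--additional-python-modules": "s3://your_bucket/path_1/redshift_connector/redshift_connector-1.2-py3-none-any.whl"
    },
)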
Glue Job Script
The wheel installs the redshift_connector package, which the job script uses to open a connection to the cluster and load the data from S3 into Redshift. A minimal version of that script is sketched below.
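This sketch assumes a CSV dataset, a single target table, and that the cluster has an attached IAM role able to read the source bucket; every endpoint, credential, table, path, and ARN is a placeholder to replace with your own values.

import redshift_connector  # installed from the wheel via --additional-python-modules

# Placeholders: replace with your cluster endpoint, credentials, and data locations.
HOST = "mycluster.abc123.us-east-1.redshift.amazonaws.com"
DATABASE = "dev"
USER = "awsuser"
PASSWORD = "your_password"
TARGET_TABLE = "public.my_table"
SOURCE_PATH = "s3://your_bucket/path_to_data/"
COPY_ROLE_ARN = "arn:aws:iam::123456789012:role/redshift-copy-role"  # role attached to the cluster

# Open a connection to Redshift through the connections attached to the job.
conn = redshift_connector.connect(
    host=HOST,
    database=DATABASE,
    user=USER,
    password=PASSWORD,
)

try:
    cursor = conn.cursor()
    # Load the files from the data lake into the warehouse table.
    cursor.execute(
        f"COPY {TARGET_TABLE} "
        f"FROM '{SOURCE_PATH}' "
        f"IAM_ROLE '{COPY_ROLE_ARN}' "
        f"FORMAT AS CSV IGNOREHEADER 1"
    )
    conn.commit()
finally:
    conn.close()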
The pipeline from S3 to Redshift is now automated.