
Data flowing into the Data Lake inevitably changes. Change data capture (CDC) captures table-level changes in the source database and delivers them downstream to a receiving system or endpoint, keeping the two systems synchronised. CDC can feed real-time systems as well as batch and micro-batch systems. Typical objectives are to support end-user analytics (real-time or batch-based) and to avoid downtime during a database migration (e.g. from on premises to AWS).
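To make the mechanics concrete, here is a minimal Python sketch of how a downstream consumer might apply CDC events to stay synchronised. The event shape (an op field plus before/after row images) follows the common Debezium-style envelope; the in-memory dict is just a stand-in for the real target store.

```python
# A minimal sketch of replaying CDC events against a downstream target.
# The op/before/after envelope follows the common Debezium convention;
# the dict stands in for any keyed receiving store.

def apply_change(target: dict, event: dict) -> None:
    """Apply a single CDC event to a keyed target store."""
    key = event["key"]
    if event["op"] in ("c", "u"):        # create / update
        target[key] = event["after"]     # upsert the new row image
    elif event["op"] == "d":             # delete
        target.pop(key, None)            # remove the row if present

# Replaying the stream in order keeps the target in sync with the source.
target_table = {}
for event in [
    {"op": "c", "key": 1, "after": {"id": 1, "name": "alice"}},
    {"op": "u", "key": 1, "after": {"id": 1, "name": "alicia"}},
    {"op": "d", "key": 1},
]:
    apply_change(target_table, event)

print(target_table)  # {} -- the insert and update were superseded by the delete
```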
There are two common approaches: streaming changes in real time from on premises to AWS using Kafka, and migrating the data from on premises to AWS using AWS Database Migration Service (DMS).
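As a hedged sketch of the Kafka route, the snippet below consumes a Debezium-style CDC topic and lands the change records in S3 as micro-batches. The topic name, broker address, and bucket are placeholder assumptions, not values from the original setup.

```python
# Consuming a Debezium-style CDC topic and landing records in S3.
# Topic, broker, and bucket names below are placeholder assumptions.
import json
import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "onprem.inventory.orders",               # hypothetical CDC topic
    bootstrap_servers=["broker1:9092"],      # placeholder broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
s3 = boto3.client("s3")

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 500:                    # flush in micro-batches
        s3.put_object(
            Bucket="my-data-lake",           # placeholder bucket
            Key=f"cdc/orders/{message.offset}.json",
            Body=json.dumps(batch),
        )
        batch = []
```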


On-premises SQL -> AWS EC2-hosted SQL -> use AWS Database Migration Service (DMS) to move the data and schemas from on premises to AWS
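A minimal boto3 sketch of that DMS path, assuming the source and target endpoints and a replication instance are already provisioned; all ARNs and identifiers below are placeholders. The full-load-and-cdc migration type performs the initial bulk copy and then keeps replicating changes, which is what enables a no-downtime cutover.

```python
# Creating a DMS replication task: full load followed by ongoing CDC.
# All ARNs and identifiers are placeholder assumptions.
import json
import boto3

dms = boto3.client("dms")

table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

response = dms.create_replication_task(
    ReplicationTaskIdentifier="onprem-to-aws-cdc",        # placeholder
    SourceEndpointArn="arn:aws:dms:...:endpoint:source",  # placeholder ARN
    TargetEndpointArn="arn:aws:dms:...:endpoint:target",  # placeholder ARN
    ReplicationInstanceArn="arn:aws:dms:...:rep:inst",    # placeholder ARN
    MigrationType="full-load-and-cdc",  # initial copy, then ongoing changes
    TableMappings=json.dumps(table_mappings),
)
print(response["ReplicationTask"]["Status"])
```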
If you are not migrating the database and application but leaving them on premises, and you still want the data to flow into the Data Lake, you can use DMS (ongoing database replication) for large volumes, or AWS DataSync (bulk file transfer).
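And a hedged sketch of the DataSync alternative, assuming the on-premises and S3 locations have already been registered with the service; both location ARNs are placeholders.

```python
# Running a DataSync transfer from a pre-registered on-premises NFS
# location to a pre-registered S3 location. ARNs are placeholders.
import boto3

datasync = boto3.client("datasync")

task = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:...:location/loc-onprem-nfs",  # placeholder
    DestinationLocationArn="arn:aws:datasync:...:location/loc-s3",     # placeholder
    Name="onprem-files-to-data-lake",
)
execution = datasync.start_task_execution(TaskArn=task["TaskArn"])
print(execution["TaskExecutionArn"])
```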

Considerations for data migrations: