
Databricks Lakeflow (replaces Airflow)
Databricks Lakeflow is built on top of Databricks Workflows and Delta Live Tables. It provides Airflow-style orchestration built into the Databricks ecosyst…
Read More »

A straightforward method to automate data ingestion from S3 buckets (data lake) to a Redshift (data warehouse) cluster using Glue. Create a Redshift cluster…
Read More »
[Data engineering lifecycle from “Fundamentals of Data Engineering” by Joe Reis and Matt Housley] Data Ingestion Challenges Data ingestion can be complicated. There are usu…
Read More »
AWS Glue is a metadata catalogue service with Extract-Transform-Load logic. The Glue catalogue is based on Hive, with a MySQL DB and a Java front end. Glue &…
Read More »
Data flowing into the Data Lake changes over time. Changes to data tables are captured by change data capture (CDC). Changes in the source database are delivered …
Read More »
Amazon Redshift is a petabyte-scale columnar data warehouse that is very efficient in storing raw data and collecting data from various sources. Redshift su…
Read More »
Data products are the end result of file or data movements to the cloud, ETL, processing, de-duplication, curation, and storage in a consumable layer. There is …
Read More »
A typical Technology Stack for a Data Lake. S3 as the Golden Source. Snowflake as a corporate Data Share with SQL use cases. If AWS-S3 and Redshift are not pro…
Read More »
(ETL engine in the above could be AWS Glue) There are various ways to define performance and what that means. A simple way to be consistent with management is …
Read More »