
Parquet file format for Data Lakes
Parquet is a file format standard used in many enterprises. It allows the standardisation of files and provides a common framework for queries and storage. Par…
Read More »

Databricks and Snowflake overlap in many areas. Firms deploying both need to clearly demarcate the epics and use case journeys to be supported by the technolog…
Read More »
A straightforward method to automate data ingestion from S3 buckets (data lake) to a Redshift (data warehouse) cluster, using Glue. Create a Redshift cluster…
Read More »
[Data engineering lifecycle from “Fundamentals of Data Engineering” by Joe Reis and Matt Housley] Data Ingestion Challenges: Data ingestion can be complicated. There are usu…
Read More »
AWS Glue is a metadata catalogue service with Extract-Transform-Load logic. The Glue catalogue is based on Hive, with a MySQL DB and a Java front end. Glue &…
Read More »
Data flowing into the Data Lake changes over time. Table changes are captured by change data capture (CDC). Changes in the source database are delivered …
Read More »
Amazon Redshift is a petabyte-scale columnar data warehouse that is very efficient in storing raw data and collecting data from various sources. Redshift su…
Read More »
Data products are the end result of file or data movements to the cloud, ETL, processing, de-duplication, curation and storage in a consumable layer. There is …
Read More »
In simple terms we can identify the differences between Data Lakes and Data Warehouses. Data Lake: A data lake is a centralized repository, usually a platform,…
Read More »
Digital Transformation: Digital transformation is neither a magic solution nor a buffet of word salads. DT is roughly defined as the integration of digital technologie…
Read More »
A typical Technology Stack for a Data Lake. S3 as the Golden Source. Snowflake as a corporate Data Share with SQL use cases. If AWS-S3 and Redshift are not pro…
Read More »
(ETL engine in the above could be AWS Glue) There are various ways to define performance and what that means. A simple way to be consistent with management is …
Read More »