
A data lake is a centralized repository that lets an organization store structured and unstructured data at any scale. Data can be stored as-is, without first imposing a schema, and analyzed in many ways, from dashboards and visualizations to big data processing, real-time analytics, and machine learning that guides better decisions.
It is estimated that the global datasphere will grow to 175 zettabytes by 2025, and roughly 90% of that data will be unstructured or semi-structured (JSON, nested JSON, XML, HTML, PDF, etc.). Many solutions exist for storing and processing structured data, but when the data may be structured, semi-structured, or unstructured, a data lake comes into the picture.
A data lake maintains data in its native format and handles the three Vs of big data (volume, velocity, and variety) while providing tools for analyzing, querying, and processing it. Data lakes remove many of the restrictions of a typical data warehouse by providing effectively unlimited space, no file-size limits, schema-on-read, and multiple ways to access the data, including programmatic APIs, SQL-like queries, and REST calls.
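To make schema-on-read concrete, here is a minimal sketch in Python: raw records are written with no declared schema, and a structure is imposed only when the data is read. The file name and field names are hypothetical.

```python
import json

import pandas as pd

# Write raw, schema-less records. Schema-on-write (a warehouse) would
# require declaring the table structure before loading; here we don't.
records = [
    {"id": 1, "event": "click", "meta": {"page": "/home"}},
    {"id": 2, "event": "view"},  # fields may vary record to record
]
with open("events.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Impose a schema only at read time: select just the columns needed now.
df = pd.read_json("events.jsonl", lines=True)
print(df[["id", "event"]])
```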
These scale and variety requirements are key business and technology drivers behind the need for a data lake. Many companies build theirs on a cloud object storage service such as Google Cloud Storage or Amazon S3, or on a distributed file system such as Apache Hadoop HDFS; an AWS data lake is typically built around S3.
For smaller datasets and less complex queries, a simple data lake architecture on AWS would include:
- An ODBC connection to the source SQL Server database
- A table upload step (S3 copy command, an ingestion tool, AWS DMS, or AWS DataSync) to move the data from the database to S3 (a minimal upload sketch follows this list)
- AWS Glue to fetch the table schemas and create the target tables over S3, via an ETL job or a crawler (see the crawler sketch below)
- Amazon Athena, a lightweight serverless query engine, to query the S3 tables cataloged by Glue (see the query example below)
- Amazon QuickSight or another dashboard tool connected to Athena
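For the upload step, the simplest path is copying a table extract (here a local CSV) into S3 with boto3. A minimal sketch; the bucket, prefix, and file names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Land the raw extract under a dedicated "raw" prefix; downstream Glue
# jobs or crawlers treat each prefix as one table's location.
s3.upload_file(
    Filename="orders_extract.csv",            # hypothetical local extract
    Bucket="my-data-lake-bucket",             # hypothetical bucket
    Key="raw/orders/orders_extract.csv",
)
```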
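One way Glue can pick up the schema is a crawler that scans the S3 prefix and registers a matching table in the Glue Data Catalog. A sketch assuming a pre-existing IAM role and catalog database; all names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# The crawler infers the schema from the files under the S3 path and
# creates/updates a corresponding table in the Data Catalog.
glue.create_crawler(
    Name="orders-crawler",                                   # hypothetical
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # hypothetical role
    DatabaseName="datalake_db",                              # hypothetical catalog DB
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/orders/"}]},
)
glue.start_crawler(Name="orders-crawler")
```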
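Once the table is cataloged, Athena can query it directly from S3; a dashboard tool such as QuickSight issues similar queries through its Athena connector. A sketch with boto3; the database, table, and result-bucket names are hypothetical, and Athena requires an S3 output location for results:

```python
import time

import boto3

athena = boto3.client("athena")

# Submit the query; Athena runs it against the S3 data registered
# in the Glue Data Catalog.
resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS n FROM orders GROUP BY status",
    QueryExecutionContext={"Database": "datalake_db"},                        # hypothetical
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-bucket/athena-results/"},
)
qid = resp["QueryExecutionId"]

# Poll until the query finishes (simplified; production code should
# use backoff and handle errors more carefully).
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```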