
A data lake is a centralized repository that lets an organization store structured and unstructured data at any scale. Data can be stored as-is, without first imposing a schema, and analyzed in many ways, from dashboards and visualizations to big data processing, real-time analytics, and machine learning that guides better decisions.
It is estimated that the global datasphere will grow to 175 zettabytes by 2025, and roughly 90% of that data will be unstructured or semi-structured (JSON, nested JSON, XML, HTML, PDF, etc.). Many solutions exist for storing and processing structured data, but when the data may be structured, semi-structured, or unstructured, a data lake comes into the picture.
A data lake maintains data in its native format and handles the three Vs of big data (volume, velocity, and variety) while providing tools for analyzing, querying, and processing it. Data lakes remove many of the restrictions of a typical data warehouse by providing effectively unlimited space, no file-size limits, schema-on-read, and multiple ways to access the data, including programmatic APIs, SQL-like queries, and REST calls.
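To make schema-on-read concrete, here is a minimal sketch in Python: raw records are written with no declared schema, and a structure is imposed only when the data is read. The file name and field names are hypothetical.

```python
import json

import pandas as pd

# Write raw, schema-less records. Schema-on-write (a warehouse) would
# require declaring the table structure before loading; here we don't.
records = [
    {"id": 1, "event": "click", "meta": {"page": "/home"}},
    {"id": 2, "event": "view"},  # fields may vary record to record
]
with open("events.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Impose a schema only at read time: select just the columns needed now.
df = pd.read_json("events.jsonl", lines=True)
print(df[["id", "event"]])
```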
These scale and variety requirements are key business and technology drivers behind the need for a data lake. Many companies build theirs on a cloud object storage service such as Google Cloud Storage or Amazon S3, or on a distributed file system such as Apache Hadoop HDFS; an AWS data lake is typically built around S3.
For smaller datasets and less complex queries, a simple data lake architecture on AWS would include:
- An ODBC connection to the source SQL Server database
- A table upload step (S3 copy command, an ingestion tool, AWS DMS, or AWS DataSync) to move the data from the database to S3 (a minimal upload sketch follows this list)
- AWS Glue to fetch the table schemas and create the target tables over S3, via an ETL job or a crawler (see the crawler sketch below)
- Amazon Athena, a lightweight serverless query engine, to query the S3 tables cataloged by Glue (see the query example below)
- Amazon QuickSight or another dashboard tool connected to Athena
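For the upload step, the simplest path is copying a table extract (here a local CSV) into S3 with boto3. A minimal sketch; the bucket, prefix, and file names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Land the raw extract under a dedicated "raw" prefix; downstream Glue
# jobs or crawlers treat each prefix as one table's location.
s3.upload_file(
    Filename="orders_extract.csv",            # hypothetical local extract
    Bucket="my-data-lake-bucket",             # hypothetical bucket
    Key="raw/orders/orders_extract.csv",
)
```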
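One way Glue can pick up the schema is a crawler that scans the S3 prefix and registers a matching table in the Glue Data Catalog. A sketch assuming a pre-existing IAM role and catalog database; all names are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# The crawler infers the schema from the files under the S3 path and
# creates/updates a corresponding table in the Data Catalog.
glue.create_crawler(
    Name="orders-crawler",                                   # hypothetical
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # hypothetical role
    DatabaseName="datalake_db",                              # hypothetical catalog DB
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/raw/orders/"}]},
)
glue.start_crawler(Name="orders-crawler")
```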
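Once the table is cataloged, Athena can query it directly from S3; a dashboard tool such as QuickSight issues similar queries through its Athena connector. A sketch with boto3; the database, table, and result-bucket names are hypothetical, and Athena requires an S3 output location for results:

```python
import time

import boto3

athena = boto3.client("athena")

# Submit the query; Athena runs it against the S3 data registered
# in the Glue Data Catalog.
resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS n FROM orders GROUP BY status",
    QueryExecutionContext={"Database": "datalake_db"},                        # hypothetical
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-bucket/athena-results/"},
)
qid = resp["QueryExecutionId"]

# Poll until the query finishes (simplified; production code should
# use backoff and handle errors more carefully).
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```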