AWS Links to Big Data

Big Data

How to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Hive and Hue and work with Amazon DynamoDB, Amazon Redshift, Amazon QuickSight, Amazon Athena and Amazon Kinesis.

AWS Marketplace for Big Data

Data Ingestion and Transfer

Amazon Kinesis Agent for Data Ingestion https://github.com/awslabs/amazon-kinesis-agent
Apache Flume https://flume.apache.org/ can be installed and run on Amazon EC2 instances.
S3DistCp http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html can copy data between Amazon S3 buckets or from HDFS to Amazon S3, including across AWS accounts.
Apache Sqoop https://cwiki.apache.org/confluence/display/SQOOP/Home supports the transfer of data between Hadoop and structured data stores such as Amazon RDS.
AWS IoT https://aws.amazon.com/iot/ can collect and handle large quantities of data coming from a variety of sources, and makes it easy to use AWS services like AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Machine Learning, and Amazon DynamoDB.
AWS DataSync https://aws.amazon.com/datasync/ is a data transfer service that makes it easy for you to automate moving data between on-premises storage and Amazon S3 or Amazon Elastic File System (Amazon EFS).
Amazon FSx for Lustre https://aws.amazon.com/fsx/lustre/ provides a high-performance file system optimized for fast processing of workloads such as machine learning, high performance computing (HPC), video processing, financial modeling, and electronic design automation (EDA).
AWS Glue DataBrew https://aws.amazon.com/glue/features/databrew/ is a visual data preparation tool for cleaning and normalizing data to prepare it for analytics and machine learning.
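
The S3DistCp entry above can be made concrete as an Amazon EMR step. This is a minimal sketch, assuming an EMR release where s3-dist-cp is launched through command-runner.jar. The helper name build_s3distcp_step, the HDFS path, and the bucket name are hypothetical placeholders. The code only builds the step definition and does not contact AWS.

```python
def build_s3distcp_step(src, dest, name="S3DistCp copy"):
    """Return an EMR step definition that runs s3-dist-cp from src to dest.

    Assumption: the cluster exposes s3-dist-cp via command-runner.jar.
    """
    if not src or not dest:
        raise ValueError("both src and dest are required")
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # --src and --dest are the two required s3-dist-cp options
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


# Hypothetical source and destination, for illustration only
step = build_s3distcp_step("hdfs:///output/logs", "s3://example-bucket/logs/")
print(step["HadoopJarStep"]["Args"])
```

To submit the step, the dictionary could be passed to boto3's EMR client, for example add_job_flow_steps(JobFlowId=..., Steps=[step]), with the ID of a running cluster.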

Big Data Streaming and Amazon Kinesis

Overview of Amazon Kinesis Data Firehose https://aws.amazon.com/kinesis/data-firehose/
Amazon Kinesis Data Analytics – SQL Functions https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sql-reference-functions.html
Using the Schema Discovery Feature on Streaming Data https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sch-dis.html
Apache Spark Streaming enables high-throughput, fault-tolerant, and scalable processing of live data streams. It divides the incoming data streams into batches before sending them to the Spark engine for processing. http://spark.apache.org/streaming/
Amazon Managed Streaming for Apache Kafka (Amazon MSK) https://aws.amazon.com/msk/ is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data.
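
The micro-batch model described in the Spark Streaming entry above can be illustrated without Spark. The following is a plain-Python sketch of the idea, not Spark's API: micro_batches is a hypothetical helper that groups timestamped records into fixed-length intervals, so that each batch can be handed to a processing step as a unit. The sample events are made up for illustration.

```python
from itertools import groupby


def micro_batches(records, interval):
    """Group (timestamp, value) records into batches of `interval` seconds.

    Assumes records arrive ordered by timestamp, as they would in a live stream.
    Yields (batch_start_time, [values]) pairs.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    # Records whose timestamps fall in the same interval share a window index
    for window, group in groupby(records, key=lambda r: int(r[0] // interval)):
        yield window * interval, [value for _, value in group]


# Made-up events: (seconds since stream start, payload)
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = list(micro_batches(events, interval=1))

for start, batch in batches:
    # Each batch would be sent to the processing engine as one unit
    print(start, len(batch), batch)
```

A shorter interval lowers latency at the cost of more, smaller batches, which is the main tuning trade-off in this style of stream processing.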

Data Lake Concepts and Building a Serverless Data Lake

What is a data lake? https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Building Data Lakes on AWS (AWS white paper) https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf
AWS Lake Formation https://aws.amazon.com/lake-formation/ is a service that makes it easy to set up a secure data lake in days.
S3 Object Lifecycle Management http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
How to set up cross-origin resource sharing (CORS) http://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-cors.html
EMR File System (EMRFS) consistent view https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-consistent-view.html
Quick Start Data Lake with SnapLogic https://aws.amazon.com/quickstart/architecture/data-lake-with-snaplogic/ builds a data lake environment on AWS in about 15 minutes by deploying SnapLogic components and AWS services such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift.
AWS Lake Formation Workshop https://lakeformation.aworkshop.io/

