Big Data | Trilogix Cloud

How to use Amazon EMR to process data with the broad ecosystem of Hadoop tools, such as Hive and Hue, and how to work with Amazon DynamoDB, Amazon Redshift, Amazon QuickSight, Amazon Athena and Amazon Kinesis.

AWS Marketplace for Big Data

1. Data Ingestion and Transfer

Amazon Kinesis Agent for Data Ingestion
Apache Flume can be installed and run on Amazon EC2 instances.
S3DistCp can copy data between Amazon S3 buckets, or from HDFS to Amazon S3, including across AWS accounts.
Apache Sqoop supports the transfer of data between Hadoop and structured data stores such as Amazon RDS.
AWS IoT can collect and handle large quantities of data coming from a variety of sources and makes it easy to use AWS services like AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Machine Learning, and Amazon DynamoDB.
AWS DataSync is a data transfer service that makes it easy for you to automate moving data between on-premises storage and Amazon S3 or Amazon Elastic File System (Amazon EFS).
Amazon FSx for Lustre provides a high-performance file system optimized for fast processing of workloads such as machine learning, high performance computing (HPC), video processing, financial modeling, and electronic design automation (EDA).
AWS Glue DataBrew is a visual data preparation tool for cleaning and normalizing data to prepare it for analytics and machine learning.
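
As an illustration of the S3DistCp item above, the copy is commonly submitted to a running EMR cluster as a step that runs s3-dist-cp through command-runner.jar. The sketch below only builds the step definition as a plain Python dictionary in the shape the EMR API expects. The function name, the step name, and the example paths are hypothetical, and actually submitting the step (for example with an AWS SDK) is left out.

```python
from typing import Any, Dict


def s3distcp_step(src: str, dest: str, name: str = "S3DistCp copy") -> Dict[str, Any]:
    """Build an EMR step definition that runs s3-dist-cp (hypothetical helper)."""
    # --src and --dest are the two required S3DistCp options.
    args = ["s3-dist-cp", f"--src={src}", f"--dest={dest}"]
    return {
        "Name": name,
        # Assumption: keep the cluster running if the copy fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


# Hypothetical paths: copy from HDFS on the cluster to an S3 bucket.
step = s3distcp_step("hdfs:///data/logs", "s3://example-bucket/logs/")
print(step["HadoopJarStep"]["Args"])
```

For a cross-account copy, the destination bucket's policy would also need to grant access to the role the cluster runs under; that part is not shown here.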

2. Big Data Streaming and Amazon Kinesis

Overview of Amazon Kinesis Data Firehose
Amazon Kinesis Data Analytics – SQL Functions
Using the Schema Discovery Feature on Streaming Data
Apache Spark Streaming enables high-throughput, fault-tolerant, and scalable processing of live data streams. It divides the incoming data streams into batches before sending them to the Spark engine for processing.
Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data.
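
The batching behaviour described in the Spark Streaming item above can be pictured with a small, library-free sketch. It is not Spark code: it is a toy model, assuming time-ordered events and a fixed batch interval, that groups a stream into consecutive batches the way a micro-batch engine would before handing each batch to the processing step. The function name and the sample events are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(events: Iterable[Event], batch_interval: float) -> Iterator[List[str]]:
    """Group time-ordered events into consecutive fixed-length intervals."""
    batch: List[str] = []
    window_end = batch_interval  # first interval covers [0, batch_interval)
    for ts, payload in events:
        # Close every interval that ended before this event arrived.
        # Intervals with no events are emitted as empty batches.
        while ts >= window_end:
            yield batch
            batch = []
            window_end += batch_interval
        batch.append(payload)
    if batch:
        yield batch  # flush the final, partially filled interval


events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (3.4, "d")]
for i, batch in enumerate(micro_batches(events, batch_interval=1.0)):
    print(i, batch)
```

With a one-second interval, the first two events land in the same batch, and the quiet interval between 2.0 and 3.0 seconds yields an empty batch.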

3. Data Lake Concepts and Building a Serverless Data Lake

What is a data lake?
Building Data Lakes on AWS: AWS white paper.
AWS Lake Formation is a service that makes it easy to set up a secure data lake in days.
S3 Object Lifecycle Management
How to set up cross-origin resource sharing (CORS)
EMR File System (EMRFS) consistent view
Quick Start Data Lake with SnapLogic builds a data lake environment on AWS in about 15 minutes by deploying SnapLogic components and AWS services such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift.
AWS Lake Formation Workshop

4. Hadoop Frameworks (Hive, Presto, Pig etc.)

About Amazon EMR Releases: each release comprises different big-data applications, components, and features that you select to have Amazon EMR install and configure when you create a cluster.
Apache Hive
Differences and Considerations for Hive on Amazon EMR
Presto on Amazon EMR
Apache Pig
Piggy Bank is a place for Pig users to share their functions.
Apache Spark on Amazon EMR
Apache MXNet
How do I restart a service in Amazon EMR?
Amazon EMR now supports a public EMR artifact repository for Maven builds

5. Hadoop User Interfaces

View Web Interfaces Hosted on Amazon EMR Clusters
View On-Cluster Application User Interfaces
Launching the Hue Web Interface
Apache Zeppelin
JupyterHub allows you to host multiple instances of a single-user Jupyter notebook server.

6. Spark

Spark Or Hadoop: Which Is The Best Big Data Framework? (Blog post from Data Science Central.)
Apache Spark home page
Spark RDD Programming Guide
Spark Streaming Programming Guide
Use Apache Spark with Amazon SageMaker

7. Management and Monitoring