How to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Hive and Hue and work with Amazon DynamoDB, Amazon Redshift, Amazon QuickSight, Amazon Athena and Amazon Kinesis.
- Amazon Kinesis Agent for Data Ingestion
- Apache Flume can be installed and run on Amazon EC2 instances.
- S3DistCp can copy data between Amazon S3 buckets, or from HDFS to Amazon S3, including across AWS accounts.
- Apache Sqoop supports the transfer of data between Hadoop and structured data stores such as Amazon RDS.
- AWS IoT can collect and handle large quantities of data coming from a variety of sources and makes it easy to use AWS services like AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Machine Learning, and Amazon DynamoDB.
- AWS DataSync is a data transfer service that makes it easy for you to automate moving data between on-premises storage and Amazon S3 or Amazon Elastic File System (Amazon EFS).
- Amazon FSx for Lustre provides a high-performance file system optimized for fast processing of workloads such as machine learning, high performance computing (HPC), video processing, financial modeling, and electronic design automation (EDA).
- AWS Glue DataBrew is a visual data preparation tool for cleaning and normalizing data to prepare it for analytics and machine learning.
- Overview of Amazon Kinesis Data Firehose
- Amazon Kinesis Data Analytics – SQL Functions
- Using the Schema Discovery Feature on Streaming Data
- Apache Spark Streaming enables high-throughput, fault-tolerant, and scalable processing of live data streams. It divides the incoming data streams into batches before sending them to the Spark engine for processing.
- Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data.
- What is a data lake?
- Building Data Lakes on AWS (AWS white paper).
- AWS Lake Formation is a service that makes it easy to set up a secure data lake in days.
- S3 Object Lifecycle Management
- How to set up cross-origin resource sharing (CORS)
- EMR File System (EMRFS) consistent view
- Quick Start Data Lake with SnapLogic builds a data lake environment on AWS in about 15 minutes by deploying SnapLogic components and AWS services such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift.
- AWS Lake Formation Workshop
- About Amazon EMR Releases. Each release comprises different big-data applications, components, and features that you select to have Amazon EMR install and configure when you create a cluster.
- Apache Hive
- Differences and Considerations for Hive on Amazon EMR
- Presto on Amazon EMR
- Apache Pig
- Piggy Bank is a place for Pig users to share their functions.
- Apache Spark on Amazon EMR
- Apache MXNet
- How do I restart a service in Amazon EMR?
- Amazon EMR now supports a public EMR artifact repository for Maven builds
- View Web Interfaces Hosted on Amazon EMR Clusters
- View On-Cluster Application User Interfaces
- Launching the Hue Web Interface
- Apache Zeppelin
- JupyterHub allows you to host multiple instances of a single-user Jupyter notebook server.
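
The Spark Streaming entry above says that incoming streams are divided into batches before they reach the Spark engine. As a rough illustration of that micro-batching idea only, here is a minimal sketch in plain Python. It does not use the Spark API; the names `micro_batches`, `word_counts`, and `batch_interval` are hypothetical, and it assumes events arrive in timestamp order. Unlike a real streaming engine, it simply skips intervals that contain no events.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape: (arrival time in seconds, text payload).
Event = Tuple[float, str]


def micro_batches(
    events: Iterable[Event], batch_interval: float
) -> Iterator[Tuple[int, List[str]]]:
    """Group time-ordered events into fixed-length intervals.

    Yields (interval index, payloads) for each non-empty interval.
    Assumes events are already sorted by arrival time.
    """
    current_index = None
    current: List[str] = []
    for timestamp, payload in events:
        index = int(timestamp // batch_interval)
        if current_index is not None and index != current_index:
            # The interval has closed: hand the finished batch downstream.
            yield current_index, current
            current = []
        current_index = index
        current.append(payload)
    if current:
        # Flush the last, possibly partial, batch.
        yield current_index, current


def word_counts(batch: List[str]) -> Dict[str, int]:
    """Stand-in for the per-batch job: count whitespace-separated words."""
    counts: Counter = Counter()
    for line in batch:
        counts.update(line.split())
    return dict(counts)


if __name__ == "__main__":
    stream = [(0.2, "a b"), (0.9, "a"), (1.1, "c"), (3.5, "a c")]
    for index, batch in micro_batches(stream, batch_interval=1.0):
        print(index, word_counts(batch))
```

Each yielded batch is processed as one small, independent job, which is the property the entry describes; the real engine adds distribution, fault tolerance, and scheduling on top of this.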