Elastic MapReduce (EMR)
Elastic MapReduce (EMR) is a service that is based on the open source framework Apache Hadoop and allows you to process large data sets of all kinds in Amazon’s cloud. The Amazon EMR website lists a few examples: log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
The Getting Started guide demonstrates how to import data from Twitter and analyze how often the word “Kindle” was used in a positive or negative way in tweets. (Not surprisingly, the vast majority of tweets in the sample were positive: 479 to 13.)
Amazon EMR stores the data on S3 (petabytes of it, if necessary), and the data analysis is done on clusters of EC2 instances. You can manage the service through the AWS Management Console. Developers can write EMR programs in a variety of popular programming languages.
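To give a rough idea of what such a program does, here is a minimal sketch of the map and reduce steps for the “Kindle” example, written in plain Python. It is not the code from the Getting Started guide and it does not call EMR or Hadoop; the word lists, the function names `mapper` and `reducer`, and the sample tweets are all assumptions made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical word lists, chosen only for this illustration.
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "broken", "terrible"}


def mapper(tweets: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit ("positive", 1) or ("negative", 1) per matching tweet."""
    for tweet in tweets:
        # Naive tokenization: lowercase and split on whitespace only,
        # so a token such as "Kindle!" would not match.
        words = set(tweet.lower().split())
        if "kindle" not in words:
            continue
        if words & POSITIVE:
            yield ("positive", 1)
        elif words & NEGATIVE:
            yield ("negative", 1)


def reducer(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the counts for each key."""
    totals: Counter = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Made-up sample data standing in for the imported tweets.
sample = [
    "I love my Kindle",
    "My Kindle is broken again",
    "Great weather today",
    "kindle battery life is awesome",
]

result = reducer(mapper(sample))
print(result)
```

On a real cluster, the framework would run many copies of the map step in parallel over chunks of the data in S3 and then route the emitted pairs to the reduce step; here both steps simply run in one process.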
Redshift is Amazon’s data warehouse service that allows you to analyze vast amounts of structured data. The main difference from EMR is that it supports SQL, and you can therefore use existing business intelligence (BI) tools. It is possible to process unstructured data in EMR and bring it to Redshift for further analysis with your BI tools. Redshift is also the better choice for long-term data storage.
Data warehousing is an old discipline. What’s new is the amount of data that has to be processed. Conventional data warehouse solutions allow you to analyze data on multiple machines, but only the cloud has the resources for really big data. However, Redshift is also interesting for small organizations that can’t afford expensive on-premises data warehouse solutions.
Probably the biggest advantage of Redshift is that you don’t have to manage the cluster yourself and can just focus on data analysis. In addition, you get the typical pay-as-you-go pricing and can leverage the elasticity of the cloud for growing amounts of data.
Amazon just launched Kinesis, another data analytics service, in November of 2013. The main difference from EMR and Redshift is that Kinesis is for real-time processing.
You can pull large data streams into Kinesis, analyze the data, and store it in S3 or DynamoDB. It is also possible to emit the data for further analysis to EMR or Redshift. However, the main point of Kinesis is that you can react in real time to certain events that are hidden in large amounts of data.
A typical example is real-time log file analysis of hundreds or thousands of servers to filter and emit important data to a dashboard or to trigger alerts. Another example is clickstream analysis of large websites to dynamically change advertising strategies.
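The log analysis case can be sketched in a few lines of plain Python. This is only an illustration of the idea of reacting to events as records arrive; it does not use the Kinesis API, and the function name `error_alerts`, the window size, the threshold, and the sample log lines are all assumptions.

```python
from collections import deque
from typing import Iterable, Iterator


def error_alerts(
    records: Iterable[str], window: int = 5, threshold: int = 3
) -> Iterator[int]:
    """Yield the index of each record at which an alert should fire.

    An alert fires when at least `threshold` of the last `window`
    records contain the string "ERROR".
    """
    recent: deque = deque(maxlen=window)
    for index, line in enumerate(records):
        recent.append("ERROR" in line)
        if sum(recent) >= threshold:
            yield index
            # Start a fresh window so one burst triggers one alert.
            recent.clear()


# Made-up log lines standing in for a stream from many servers.
log_stream = [
    "web01 INFO request ok",
    "web02 ERROR timeout",
    "web03 ERROR timeout",
    "web01 INFO request ok",
    "web04 ERROR disk full",
    "web02 INFO request ok",
]

alerts = list(error_alerts(log_stream))
print(alerts)
```

Because the function is a generator, it processes one record at a time and never needs the whole stream in memory, which is the property that matters for real-time processing.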
This post concludes my AWS series. Please let me know how you liked it. I hope you now have a basic understanding of Amazon’s cloud. Rest assured that this won’t be the last time I blog about this fascinating new world for IT pros.