Elastic MapReduce (EMR) ^
Elastic MapReduce (EMR) is a service based on the open source framework Apache Hadoop that allows you to process large data sets of all kinds in Amazon’s cloud. The Amazon EMR website lists a few examples: log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
The Getting Started guide demonstrates how to import data from Twitter and analyze how often the word “Kindle” was used in a positive or negative way in tweets. (Not surprisingly, the vast majority of tweets in the sample were positive: 479 to 13.)
Amazon EMR stores the data on S3 (in petabytes, if necessary), and the data analysis is done on EMR clusters of EC2 instances. You can manage the service through the AWS Management Console. Developers can write EMR programs in a variety of popular programming languages.
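To make the Hadoop-style processing model a little more concrete, here is a minimal local sketch in Python of the map, shuffle, and reduce steps, applied to the kind of tweet counting the Getting Started guide describes. It is not the guide's code and does not touch EMR. The word lists, the sample tweets, and all function names are assumptions made up for illustration, and the whitespace tokenizer ignores punctuation entirely.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical word lists; a real sentiment job would use a much richer model.
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "broken", "awful"}


def map_tweet(tweet):
    """Map step: emit one (sentiment, 1) pair per tweet that mentions 'kindle'."""
    words = set(tweet.lower().split())
    if "kindle" not in words:
        return []
    # Positive words win if a tweet contains both kinds (a simplification).
    if words & POSITIVE:
        return [("positive", 1)]
    if words & NEGATIVE:
        return [("negative", 1)]
    return [("neutral", 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(tweets):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_tweet(t) for t in tweets)
    return reduce_counts(shuffle(mapped))


# Made-up sample input, not data from the guide.
sample_tweets = [
    "I love my new Kindle",
    "Kindle screen is broken already",
    "Great battery life on the Kindle",
    "Reading on the train today",
]
result = run_job(sample_tweets)
print(result)  # {'positive': 2, 'negative': 1}
```

On a real cluster, the framework runs the map step on many machines in parallel and handles the shuffle for you; the sketch only shows how the work is split up.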
Redshift ^
Redshift is Amazon’s data warehouse service that allows you to analyze vast amounts of structured data. The main difference from EMR is that you query it with standard SQL, so you can use your existing business intelligence (BI) tools. It is possible to process unstructured data in EMR and bring it to Redshift for further analysis with your BI tools. Redshift is also the better choice for long-term data storage.
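As a rough illustration of the kind of SQL aggregate a BI tool would send to a data warehouse, the following Python sketch runs a GROUP BY query against an in-memory SQLite database. SQLite is only a local stand-in so the example runs anywhere; it is not Redshift, and the table name, columns, and rows are invented for this example.

```python
import sqlite3

# In-memory SQLite is a local stand-in here, not Redshift.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (region TEXT, views INTEGER)")

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO page_views (region, views) VALUES (?, ?)",
    [("eu", 120), ("us", 300), ("eu", 80), ("asia", 50)],
)

# A typical BI-style aggregate: total views per region, largest first.
query = """
    SELECT region, SUM(views) AS total_views
    FROM page_views
    GROUP BY region
    ORDER BY total_views DESC
"""
rows = conn.execute(query).fetchall()
for region, total in rows:
    print(region, total)
conn.close()
```

The point is only that the query itself is plain SQL, which is why existing reporting tools can work with a SQL-based warehouse without custom code.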
Data warehousing is an old discipline. What’s new is the amount of data that has to be processed. Conventional data warehouse solutions allow you to analyze data on multiple machines, but only the cloud has the resources for really big data. However, Redshift is also interesting for small organizations that can’t afford expensive on-premises data warehouse solutions.
Probably the biggest advantage is that you don’t have to manage the cluster yourself and can focus on data analysis. In addition, you get the typical pay-as-you-go pricing and can leverage the elasticity of the cloud as your data grows.
Kinesis ^
Amazon just launched Kinesis, another data analytics service, in November 2013. The main difference from EMR and Redshift is that Kinesis is for real-time processing.
You can pull large data streams into Kinesis, analyze the data, and store it in S3 or DynamoDB. It is also possible to emit the data to EMR or Redshift for further analysis. However, the main point of Kinesis is that you can react in real time to certain events that are hidden in large amounts of data.
A typical example is real-time log file analysis of hundreds or thousands of servers to filter and emit important data to a dashboard or to trigger alerts. Another example is clickstream analysis of large websites to dynamically change advertising strategies.
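The log analysis example can be sketched in a few lines of Python. This is a local simulation of the idea, not Kinesis code: it reads log records one at a time and raises an alert when the number of errors in a trailing time window reaches a threshold. The record layout, the function name `error_alerts`, the window length, and the threshold are all assumptions chosen for illustration, and the records are assumed to arrive in timestamp order.

```python
from collections import deque


def error_alerts(events, window_seconds=60, threshold=3):
    """Yield (timestamp, count) whenever the number of ERROR events seen in
    the trailing window reaches the threshold."""
    recent = deque()  # timestamps of recent ERROR events
    for timestamp, server, level in events:
        if level != "ERROR":
            continue
        recent.append(timestamp)
        # Drop errors that have fallen out of the trailing window.
        while recent and recent[0] <= timestamp - window_seconds:
            recent.popleft()
        if len(recent) >= threshold:
            yield timestamp, len(recent)


# Hypothetical log records: (seconds since start, server name, level).
log_stream = [
    (0, "web-1", "INFO"),
    (5, "web-2", "ERROR"),
    (20, "web-1", "ERROR"),
    (30, "web-3", "INFO"),
    (40, "web-3", "ERROR"),
    (130, "web-2", "ERROR"),
]

alerts = list(error_alerts(log_stream))
print(alerts)  # [(40, 3)]
```

Because the function is a generator, it handles each record as it arrives instead of waiting for a complete batch, which is the essential difference between stream processing and the batch style of EMR.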
This post concludes my AWS series. Please let me know how you liked it. I hope you now have a basic understanding of Amazon’s cloud. Rest assured that this won’t be the last time I blog about this fascinating new world for IT pros.