Azure Data Lake is a set of Microsoft services that makes it easier to scale and manage the analysis of large datasets. In this article, we'll look at what your business could do with it and how to get started.

Paul Schnackenburg

Paul Schnackenburg works part time as an IT teacher as well as running his own business in Australia. He has MCSE, MCT, MCTS and MCITP certifications. Follow his blog TellITasITis.

ADL Store vs. Azure Blob storage

Azure Data Lake (ADL) is actually two different services: ADL Store and ADL Analytics. Let's start with the Store. This is where you keep all the data you want to analyze. You may ask why vanilla Azure Blob storage doesn't do the trick. At the time of writing, each subscription has a limit of 200 storage accounts, and each account has a limit of 500 TB. More concerning, the maximum size of a block blob is about 4.75 TB, and the maximum size of an append blob is about 195 GB. Moreover, Azure Blob storage isn't optimized for parallel processing in the way that ADL Store is.
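To put those limits in perspective, here is a rough sketch of the arithmetic in Python. The constants are the figures quoted above, which may have changed since; the function name and the example dataset are made up for illustration.

```python
import math

# Limits quoted in the article (they may have changed since it was written).
MAX_ACCOUNTS_PER_SUBSCRIPTION = 200
MAX_TB_PER_ACCOUNT = 500
MAX_BLOCK_BLOB_TB = 4.75

def blob_storage_plan(dataset_tb, largest_file_tb):
    """Return how a dataset would have to be laid out in plain Blob storage."""
    accounts = math.ceil(dataset_tb / MAX_TB_PER_ACCOUNT)
    return {
        "accounts_needed": accounts,
        "fits_in_one_subscription": accounts <= MAX_ACCOUNTS_PER_SUBSCRIPTION,
        "largest_file_fits_in_one_blob": largest_file_tb <= MAX_BLOCK_BLOB_TB,
    }

# Upper bound for one subscription: 200 accounts x 500 TB each.
print(MAX_ACCOUNTS_PER_SUBSCRIPTION * MAX_TB_PER_ACCOUNT, "TB")

# Hypothetical dataset: 1,200 TB in total, with one 6 TB file.
print(blob_storage_plan(dataset_tb=1200, largest_file_tb=6))
```

In this made-up case the data would have to be spread over three storage accounts, and the 6 TB file would have to be split by hand because it exceeds the block blob limit.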

ADL Store, on the other hand, has no limits on account size, file size, or number of files. It is also built on a hierarchical file system, in contrast with Azure Blob storage, which is an object store. Encryption is optional, and you can choose between Microsoft-managed keys and your own keys stored in Azure Key Vault.
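One practical consequence of the hierarchical versus object store distinction shows up when you rename a folder. The following is a deliberately simplified model in Python, not a description of how either service is implemented internally; the file names and both helper functions are hypothetical.

```python
# Flat object store: every object is addressed by its full key.
flat_store = {
    "logs/2017/01.tsv": b"...",
    "logs/2017/02.tsv": b"...",
    "reports/summary.csv": b"...",
}

def rename_prefix_flat(store, old, new):
    """Renaming a 'folder' means rewriting every key under the prefix."""
    touched = 0
    for key in list(store):  # iterate over a snapshot while mutating the dict
        if key.startswith(old + "/"):
            store[new + key[len(old):]] = store.pop(key)
            touched += 1
    return touched

# Hierarchical file system: directories are real nodes in a tree.
tree = {
    "logs": {"2017": {"01.tsv": b"...", "02.tsv": b"..."}},
    "reports": {"summary.csv": b"..."},
}

def rename_dir_tree(root, old, new):
    """Renaming a directory moves one entry, however much data it holds."""
    root[new] = root.pop(old)
    return 1

print(rename_prefix_flat(flat_store, "logs", "archive"))  # 2 keys rewritten
print(rename_dir_tree(tree, "logs", "archive"))           # 1 entry moved
```

With two files the difference is trivial, but the flat version does work proportional to the number of objects under the prefix, which matters when a folder holds millions of files.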

You can ingest data into ADL Store in several ways. You can copy it from a local server or from Azure Storage blobs, or stream it in from Event Hubs, Storm on HDInsight, or Azure Stream Analytics. You can also load it from your relational databases using Apache Sqoop or Azure Data Factory, or upload your web server logs. And if the amount of data is really huge, you can ship disks to Azure. You can then access the data through the WebHDFS-compatible REST API, or from Hadoop-compatible tools via a URI of the form adl://StoreName.azuredatalakestore.net.
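As a rough sketch of how the adl:// address relates to the REST endpoint, the Python below turns an adl:// URI into a WebHDFS-style HTTPS URL and builds a request without sending it. The /webhdfs/v1 prefix and the GETFILESTATUS and LISTSTATUS operations come from the standard WebHDFS API; the file path, the helper names, and the token are placeholders, and obtaining a real Azure Active Directory token is outside the scope of this sketch.

```python
from urllib.parse import quote, urlencode, urlparse
from urllib.request import Request

def adl_to_webhdfs_url(adl_uri, op, **params):
    """Translate adl://<store>.azuredatalakestore.net/<path> into an HTTPS URL."""
    parsed = urlparse(adl_uri)
    if parsed.scheme != "adl":
        raise ValueError("expected an adl:// URI")
    query = urlencode({"op": op, **params})
    return f"https://{parsed.netloc}/webhdfs/v1{quote(parsed.path)}?{query}"

def build_request(adl_uri, op, token):
    """Build (but do not send) a request; the bearer token is a placeholder."""
    return Request(
        adl_to_webhdfs_url(adl_uri, op),
        headers={"Authorization": f"Bearer {token}"},
    )

# Hypothetical file in a folder called "sample".
url = adl_to_webhdfs_url(
    "adl://StoreName.azuredatalakestore.net/sample/SearchLog.tsv",
    "GETFILESTATUS",
)
print(url)
```

Passing the result of build_request to urllib.request.urlopen would perform the call, provided the token is valid and the account exists.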

A short aside is in order here, as one argument is that big data was born with Hadoop. Doug Cutting co-created Hadoop and its Hadoop Distributed File System (HDFS). Its genesis was a Google File System paper published in October 2003, followed by another Google paper, "MapReduce: Simplified Data Processing on Large Clusters." The problem that companies such as Google and Facebook were trying to solve was how to analyze gigantic amounts of information that didn't fit on a big traditional database server or data warehouse cluster. The solution was a typical "cloud" approach: spread the data across many commodity servers, build a file system that "expects" hardware and component failures, and make it appear as a single data store for analysis.
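The idea of a file system that "expects" failures can be illustrated with a toy model in Python. This is not how HDFS is implemented; it only shows the principle of splitting data into blocks and keeping several copies of each block on different servers (three copies is the HDFS default). The node names, block size, and function names are all made up.

```python
def place_blocks(data, block_size, nodes, replicas=3):
    """Split data into blocks and assign each block to `replicas` distinct nodes."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for index, block in enumerate(blocks):
        # Consecutive nodes on a ring; distinct as long as replicas <= len(nodes).
        holders = [nodes[(index + r) % len(nodes)] for r in range(replicas)]
        placement[index] = (block, holders)
    return placement

def read_all(placement, failed_nodes):
    """Reassemble the data, reading each block from any surviving replica."""
    out = []
    for index in sorted(placement):
        block, holders = placement[index]
        if all(holder in failed_nodes for holder in holders):
            raise IOError(f"block {index} lost: all replicas failed")
        out.append(block)
    return "".join(out)

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
placement = place_blocks("the quick brown fox jumps over the lazy dog", 8, nodes)

# Two of the five servers are down, yet the data still reads as one piece.
print(read_all(placement, failed_nodes={"node-b", "node-d"}))
```

Because every block lives on three servers, losing any two of them leaves at least one copy of each block, and the caller never sees which server answered.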

So ADL Store is a place to put your data for analysis. One way to run that analysis is Azure HDInsight, a Microsoft service that spins up a Hadoop cluster for you to manage. It supports many open-source Hadoop ecosystem projects, such as Apache Spark, Storm, HBase, and Kafka. This flexibility does mean you'll need to know Hadoop and how to manage the cluster.

ADL Analytics on the other hand lets you do the analysis on a cluster that Microsoft manages for you. There's no need to remember to destroy your cluster when your job's finished because that's all part of the service.

Azure Data Lake Analytics

The point of ADL Analytics is to put big data analytics in the hands of developers who don't know Hadoop and all the associated technologies. Instead, you use U-SQL, a language evolved from SCOPE, an internal big data language at Microsoft. U-SQL combines the declarative nature of SQL with the extensibility of C# and big data ideas such as "schema on read" and reducers.
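If "schema on read" and reducers are new to you, the following Python sketch shows the two ideas in miniature. It is an analogy, not U-SQL: the raw text carries no types, the schema is applied only when the data is read, and a reducer then folds the rows into one total per key. The sample data and column names are invented.

```python
import csv
import io
from collections import defaultdict

# Raw, untyped text as it might sit in the store (hypothetical sample data).
RAW = "en-us\t73\nen-gb\t614\nen-us\t24\nde-de\t305\n"

# The schema lives with the query, not with the stored file.
SCHEMA = [("region", str), ("duration", int)]

def extract(raw, schema):
    """Apply column names and types to each row at read time."""
    for row in csv.reader(io.StringIO(raw), delimiter="\t"):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

def reduce_by(rows, key, value):
    """Group rows by key and sum a value column, like a simple reducer."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)

print(reduce_by(extract(RAW, SCHEMA), "region", "duration"))
# {'en-us': 97, 'en-gb': 614, 'de-de': 305}
```

A different query could read the same file with a different schema, which is the point: nothing about the stored data has to change.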

With ADL Analytics, you don't need to manage infrastructure. You simply pick how much power you need, and you only pay for the job as it's running. Both ADL services are integrated with Azure Active Directory, letting you control access easily.

To get started with U-SQL, have a look at the U-SQL tutorials and the U-SQL GitHub repository.

If you have an Azure subscription, you can tag along with my steps here for creating an ADL Store and an Analytics account:

  1. Log in at portal.azure.com and click on New.
  2. Pick Data + Analytics and then Data Lake Store.
  3. Give it a name, put it in a new Resource group (so you can easily delete it all in one go later), and pick a location.

Creating a new ADL Store

  4. You'll need to choose your encryption; I picked Not enabled. Click Create.
  5. After creating the Store, click the tile and then select Data Explorer.
  6. Here you can create subfolders for your files. I created a sample folder and uploaded some sample files from the U-SQL tutorials mentioned above into it.

Uploading data to the Store

  7. To create an Analytics account, select New > Data + Analytics again, and pick Data Lake Analytics this time.
  8. Name it (note that it only allows lowercase letters and numbers).
  9. I chose to put the Analytics account into the same ARM resource group as the Store, but you could optionally create a new one.
  10. Click on Data Lake Store to link the Analytics account to the Store you created earlier. Pin the account to your dashboard. Click Create.

Creating an ADL Analytics account

  11. After provisioning the account, go into it.
  12. Click on the New Job icon to enter a script to run. I used a sample U-SQL script. Notice the slider lets you pick how much power to assign to the job, from 1 to 250, and shows the corresponding cost. Click Submit job.

ADL Analytics test script run
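To get a feel for how the slider setting and the run time combine into a bill, here is a back-of-the-envelope sketch in Python. It assumes that cost grows linearly with the number of units and with the run time, which is my reading of "pay for the job as it's running." The rate is a made-up number, so plug in the figure the portal shows you.

```python
MIN_UNITS, MAX_UNITS = 1, 250  # range of the slider described above

def estimate_job_cost(units, runtime_minutes, rate_per_unit_hour):
    """Cost of one job: units x hours x rate. The caller supplies the rate."""
    if not MIN_UNITS <= units <= MAX_UNITS:
        raise ValueError(f"units must be between {MIN_UNITS} and {MAX_UNITS}")
    return units * (runtime_minutes / 60) * rate_per_unit_hour

# Hypothetical rate of 2.00 per unit-hour, chosen only to keep the numbers easy.
for units in (1, 10, 250):
    cost = estimate_job_cost(units, runtime_minutes=6, rate_per_unit_hour=2.00)
    print(f"{units:>3} units for 6 minutes -> {cost:.2f}")
```

More units only pay off if the job actually finishes faster, so it's worth trying a script at a low setting first.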

Note that as both ADL services are fully built on Azure Resource Manager, you can use PowerShell, the cross-platform CLI, or the API to create ADL resources.
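As a hedged illustration of what that could look like, the Python sketch below generates an ARM-template-shaped JSON document for a Store and a linked Analytics account. The resource type names are the ADL resource providers; the API version, the property names, the account names, and the location are assumptions that you should check against the current template reference before deploying anything. The name check mirrors the lowercase-letters-and-numbers rule from the portal and is applied to both names here for simplicity.

```python
import json
import re

def adl_template(store_name, analytics_name, location):
    """Return an ARM-template-shaped dict for a Store plus a linked Analytics account."""
    for name in (store_name, analytics_name):
        # The portal accepted only lowercase letters and numbers.
        if not re.fullmatch(r"[a-z0-9]+", name):
            raise ValueError(f"invalid account name: {name!r}")
    api_version = "2016-11-01"  # assumption: verify against the template reference
    store = {
        "type": "Microsoft.DataLakeStore/accounts",
        "apiVersion": api_version,
        "name": store_name,
        "location": location,
    }
    analytics = {
        "type": "Microsoft.DataLakeAnalytics/accounts",
        "apiVersion": api_version,
        "name": analytics_name,
        "location": location,
        # Create the Store first, then link the Analytics account to it.
        "dependsOn": [f"[resourceId('Microsoft.DataLakeStore/accounts', '{store_name}')]"],
        "properties": {  # assumption: property names as I recall them
            "defaultDataLakeStoreAccount": store_name,
            "dataLakeStoreAccounts": [{"name": store_name}],
        },
    }
    return {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [store, analytics],
    }

# Hypothetical names and region.
print(json.dumps(adl_template("mystore01", "myanalytics01", "eastus2"), indent=2))
```

The generated JSON could be saved to a file and handed to whichever deployment tool you prefer. Generating it from code makes it easy to stamp out identical test and production environments.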

Obviously, the portal isn't an efficient UI for production work, so add-ins are available for Visual Studio (including the free Community Edition) and Visual Studio Code. As your data grows, it also becomes important to plan your ADL Stores for performance.

ADL tools in Visual Studio

Like so many other Azure services, ADL is a "better together" story. It integrates with Azure Data Factory, Stream Analytics, Power BI, Data Catalog, SQL Data Warehouse, SQL Server Integration Services (SSIS), and Event Hubs.

Recent additions include built-in support for ADL Store in the latest version of Hortonworks Data Platform (HDP®) 2.6.1 and support for ADL Store in Azure Analysis Services (not to be confused with ADL Analytics) and SQL Server Data Tools.

There's no up-front cost for ADL Store, and you can choose pay-as-you-go pricing for the storage ($0.039/GB per month at the time of writing). Or, if you know how much capacity you're going to use, you can pick a commitment package to reduce the overall cost.
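For a quick sense of scale, this small Python sketch applies the pay-as-you-go figure quoted above to a few data volumes. It covers storage only (any other charges aren't modeled), the price may have changed, and treating 1 TB as 1,024 GB is my assumption.

```python
PRICE_PER_GB_MONTH = 0.039  # pay-as-you-go figure quoted above; may have changed

def monthly_storage_cost(terabytes, gb_per_tb=1024):
    """Pay-as-you-go storage cost per month for a given volume of data."""
    return terabytes * gb_per_tb * PRICE_PER_GB_MONTH

# 1 TB, 50 TB, and 1,024 TB (one petabyte under the 1,024 convention).
for tb in (1, 50, 1024):
    print(f"{tb:>5} TB -> ${monthly_storage_cost(tb):,.2f} per month")
```

At the larger volumes, it's worth comparing the result with the commitment packages.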

Conclusion

Obviously, this short article only scratches the surface of what's possible with big data analysis using Azure. The incredible thing to me is that I can play around with this type of technology without a large investment in infrastructure. And if I had a project that needed this technology, I could scale out to petabytes and only pay for what I used. We live in interesting times indeed.

© 4sysops 2006 - 2017