Azure Data Lake overview - Paul Schnackenburg - Fri, Sep 22 2017
ADL Store vs. Azure Blob storage
Azure Data Lake (ADL) is actually two different services: ADL Store and ADL Analytics. Let's start with the Store. This is where you keep all the data you want to analyze. You may ask why vanilla Azure Blob storage doesn't do the trick. The reason is its limits: each subscription can have at most 200 storage accounts, and each account tops out at 500 TB. More concerning, the maximum size of a block blob is about 4.75 TB, and an append blob is limited to 195 GB. Moreover, Azure storage isn't optimized for parallel processing in the same way that ADL Store is.
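To put those limits in perspective, here is a small arithmetic sketch. It uses only the figures quoted above; the function names and the 20 TB example file are made up for illustration.

```python
import math

# Limits quoted in the paragraph above (as of the article's writing).
MAX_ACCOUNTS_PER_SUBSCRIPTION = 200
MAX_TB_PER_ACCOUNT = 500
MAX_BLOCK_BLOB_TB = 4.75


def max_subscription_capacity_tb() -> int:
    """Upper bound on Blob storage per subscription, in TB."""
    return MAX_ACCOUNTS_PER_SUBSCRIPTION * MAX_TB_PER_ACCOUNT


def block_blobs_needed(file_size_tb: float) -> int:
    """How many block blobs a single large file would have to be split into."""
    return math.ceil(file_size_tb / MAX_BLOCK_BLOB_TB)


print(max_subscription_capacity_tb())  # 100000 TB, i.e. roughly 100 PB
print(block_blobs_needed(20))          # 5
```

The ceiling per subscription is large, but a single 20 TB file would already have to be split across five block blobs, which is the kind of bookkeeping ADL Store is meant to spare you.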
ADL Store, on the other hand, has no limits on account sizes, file sizes, or number of files. It is also built on a hierarchical file system, in contrast with Azure Blob storage, which is an object store. Encryption is optional, and you can choose between Microsoft-managed keys or your own keys stored in Azure Key Vault.
You can ingest your data into ADL Store in several ways. You can copy it from a local server or from Azure Storage blobs, or stream it in from Event Hubs, HDInsight Storm, or Azure Stream Analytics. You can also load it from your relational databases using Apache Sqoop or Azure Data Factory, or bring in your web server log data. And if it's a really huge amount of data, you can ship disks to Azure for offline import. You can then access the data through the WebHDFS-compatible REST API, while Hadoop-compatible tools address the store as adl://StoreName.azuredatalakestore.net.
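As a rough illustration of what addressing the Store looks like, the sketch below only builds the URLs and makes no network calls. It assumes the REST endpoint follows the usual WebHDFS convention of an https host with a /webhdfs/v1 path prefix and an op query parameter; the store name mystore and the path are hypothetical, and a real request would also need an Azure Active Directory bearer token, which is not shown.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(store_name: str, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS-style REST URL for a path in an ADL Store (assumed layout)."""
    host = f"{store_name}.azuredatalakestore.net"
    clean_path = quote(path.strip("/"))  # keeps "/" between folders intact
    query = urlencode({"op": op, **params})
    return f"https://{host}/webhdfs/v1/{clean_path}?{query}"


def adl_uri(store_name: str, path: str) -> str:
    """Build the adl:// URI that Hadoop-compatible tools use for the same path."""
    return f"adl://{store_name}.azuredatalakestore.net/{path.strip('/')}"


print(webhdfs_url("mystore", "/samples/data", "LISTSTATUS"))
# https://mystore.azuredatalakestore.net/webhdfs/v1/samples/data?op=LISTSTATUS
print(adl_uri("mystore", "/samples/data"))
# adl://mystore.azuredatalakestore.net/samples/data
```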
A short aside is in order here, as one argument is that big data was born with Hadoop. Doug Cutting co-created Hadoop, including the Hadoop Distributed File System (HDFS). (Its genesis, in turn, was a Google File System paper published in October 2003, followed by another Google paper: "MapReduce: Simplified Data Processing on Large Clusters.") The problem that Google and Facebook were trying to solve was how to analyze gigantic amounts of information that didn't fit on a big traditional database server or data warehouse cluster. The solution was a typical "cloud" approach: spread the data across many commodity servers, build a file system that "expects" hardware and component failures, and make it appear as a single data store for analysis.
So ADL Store is a place to put your data for analysis. Microsoft offers Azure HDInsight, a service that spins up a Hadoop cluster for you to manage. It's fully compatible with many open-source Hadoop add-on packages such as Apache Spark, Storm, HBase, and Kafka. This flexibility does mean you'll need to know Hadoop and how to manage the cluster.
ADL Analytics, on the other hand, lets you do the analysis on a cluster that Microsoft manages for you. There's no need to remember to destroy your cluster when your job's finished because that's all part of the service.
Azure Data Lake Analytics
The point of ADL Analytics is to put big data analytics in the hands of developers who don't know Hadoop and all the associated technologies. Instead, you use U-SQL, a language that evolved from SCOPE, an internal big data language at Microsoft. U-SQL combines the declarative nature of SQL with the extensibility of C# and big data ideas such as "schema on read" and reducers.
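If "schema on read" is a new idea, the following sketch may help. It is plain Python, not U-SQL, and only illustrates the concept: the raw file carries no type information, and the column names and types are applied at the moment the data is read. The sample rows and the schema are invented for the example.

```python
import csv
import io

# Raw tab-separated text as it might sit in the store: no header, no types.
RAW = "1\tcontoso\t12.5\n2\tfabrikam\t7.25\n"

# The schema lives with the query, not with the file.
SCHEMA = [("id", int), ("customer", str), ("amount", float)]


def read_with_schema(text, schema):
    """Yield one dict per row, casting each field as the schema dictates."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for row in reader:
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


rows = list(read_with_schema(RAW, SCHEMA))
total = sum(row["amount"] for row in rows)
print(rows[0])  # {'id': 1, 'customer': 'contoso', 'amount': 12.5}
print(total)    # 19.75
```

A different job could read the same file with a different schema, which is the main practical benefit over loading data into fixed tables first.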
With ADL Analytics, you don't need to manage infrastructure. You simply pick how much power you need, and you only pay for the job as it's running. Both ADL services are integrated with Azure Active Directory, letting you control access easily.
If you have an Azure subscription, you can follow along with my steps here for creating an ADL Store and an Analytics account:
- Log in at portal.azure.com and click on New.
- Pick Data + Analytics and then Data Lake Store.
- Give it a name, put it in a new Resource group (so you can easily delete it all in one go later), and pick a location.
- You'll need to choose your encryption; I picked Not enabled. Click Create.
- After creating the Store, click the tile and then select Data Explorer.
- Here you can create subfolders for your files. I created a sample folder and uploaded some sample files from the U-SQL tutorials mentioned above into it.
Uploading data to the Store
- To create an Analytics account, select New > Data + Analytics again, and pick Data Lake Analytics this time.
- Name it (note it only allows lowercase letters and numbers).
- I chose to put the Analytics account into the same ARM resource group as the Store, but you could optionally create a new one.
- Click on Data Lake Store to link the Analytics account to the Store you created earlier. Pin the account to your dashboard. Click Create.
- After provisioning the account, go into it.
- Click on the New Job icon to enter a script to run. I picked the sample from this page. Notice the slider lets you pick how much power to assign to the job from 1 to 250 and the corresponding cost scale. Click Submit job.
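Two constraints from the steps above lend themselves to a quick check before you submit anything: the Analytics account name (lowercase letters and numbers only) and the power slider (1 to 250). The sketch below encodes both. The cost formula, units times hours times a per-unit hourly price, is my assumption about how the "cost scale" works, and the price passed in is a placeholder rather than a real rate. Any length limit on the name is not checked.

```python
import re

# Lowercase letters and digits only, as the portal requires.
NAME_PATTERN = re.compile(r"[a-z0-9]+")


def is_valid_account_name(name: str) -> bool:
    """Return True if the name contains only lowercase letters and digits."""
    return NAME_PATTERN.fullmatch(name) is not None


def estimate_job_cost(units: int, runtime_minutes: float,
                      price_per_unit_hour: float) -> float:
    """Estimate a job's cost, assuming billing = units x hours x hourly price."""
    if not 1 <= units <= 250:
        raise ValueError("units must be between 1 and 250")
    return units * (runtime_minutes / 60) * price_per_unit_hour


print(is_valid_account_name("adlanalytics01"))   # True
print(is_valid_account_name("ADL-Analytics"))    # False
print(estimate_job_cost(10, 30, 2.0))            # 10.0 (placeholder price)
```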
Note that as both ADL services are fully built on Azure Resource Manager, you can use PowerShell, the cross-platform CLI, or the API to create ADL resources.
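To give a feel for the API route, this sketch builds the Azure Resource Manager URLs you would send a PUT request to in order to create each account. It makes no calls. The provider namespaces Microsoft.DataLakeStore and Microsoft.DataLakeAnalytics follow the standard ARM resource ID layout, but treat the api-version value as an assumption to verify, and note that the subscription ID, resource group, and account names are placeholders. A real request also needs an authorization header and a JSON body with at least a location.

```python
from urllib.parse import urlencode

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2016-11-01"  # assumed; check the current REST reference

PROVIDERS = {
    "store": "Microsoft.DataLakeStore",
    "analytics": "Microsoft.DataLakeAnalytics",
}


def account_url(kind: str, subscription_id: str,
                resource_group: str, account_name: str) -> str:
    """Build the ARM URL for an ADL Store or Analytics account."""
    provider = PROVIDERS[kind]
    resource_id = (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/{provider}/accounts/{account_name}"
    )
    query = urlencode({"api-version": API_VERSION})
    return f"{ARM_ENDPOINT}{resource_id}?{query}"


print(account_url("store", "0000", "adl-demo", "mystore"))
print(account_url("analytics", "0000", "adl-demo", "myanalytics"))
```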
Obviously, the portal isn't an efficient UI for production, and thus there are add-ins available for Visual Studio (including the free Community Edition) and Visual Studio Code. Planning your ADL Stores for performance is important as your data grows.
Like so many other Azure services, ADL is a "better together" story. It integrates with Azure Data Factory, Stream Analytics, Power BI, Data Catalog, SQL Data Warehouse, SQL Server Integration Services (SSIS), and Event Hubs.
Recent additions include built-in support for ADL Store in the latest version of Hortonworks Data Platform (HDP®) 2.6.1 and support for ADL Store in Azure Analysis Services (not to be confused with ADL Analytics) and SQL Server Data Tools.
There's no up-front cost for ADL Store, and you can choose pay-as-you-go pricing for the storage ($0.039/GB per month). Or if you know how much capacity you're going to use, you can pick a commitment package to save on cost overall.
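For a quick back-of-the-envelope estimate at the pay-as-you-go rate quoted above, something like the following works. It assumes a flat per-GB rate with no tiers and ignores any other charges, so treat the result as a rough storage-only figure.

```python
# Pay-as-you-go storage price quoted above, in USD per GB per month.
PAYG_PRICE_PER_GB_MONTH = 0.039


def monthly_storage_cost(gigabytes: float) -> float:
    """Storage-only monthly cost in USD, assuming a flat per-GB rate."""
    return round(gigabytes * PAYG_PRICE_PER_GB_MONTH, 2)


print(monthly_storage_cost(1000))    # 39.0  (about 1 TB)
print(monthly_storage_cost(10000))   # 390.0 (about 10 TB)
```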
Obviously, this short article only scratches the surface of what's possible with big data analysis using Azure. The incredible thing to me is that I can play around with this type of technology without a large investment in infrastructure. And if I had a project that needed this technology, I could scale out to petabytes and only pay for what I used. We live in interesting times indeed.