- Azure Purview: Data governance for on-premises, multicloud, and SaaS data - Wed, Feb 17 2021
Dude—where's my data?
Unless you're a very small business, you need to know what data you have, where it's stored, who's got access to it, and how it's being moved around as it's processed. The saying that "data is the new oil" certainly brings home the fact that data for most businesses is a strategic asset that needs to be managed accordingly.
Microsoft 365 offers Microsoft Information Protection (MIP) and Data Loss Prevention (DLP) for documents and email stored in SharePoint Online, OneDrive for Business, and Exchange Online, plus file shares and SharePoint Server on-premises. MIP helps you set tags on documents, either manually or by scanning the content and automatically identifying bank account details or social security numbers, for example. Based on the tags, a policy is applied that could include encrypting the document or adding visual indicators to it. The protection follows the file, so if a user shares sensitive documents with an unauthorized party via personal email, USB stick, or cloud storage, the recipient won't be able to open them.
MIP (and DLP) covers email and documents in Microsoft 365 and is an area of active development by Microsoft. Over the last year, Microsoft has added the ability to match based on your own business-specific data patterns and to train a machine learning model to recognize more complex documents, such as contracts.
Azure Purview uses the same engine to identify PII and other sensitive data stored in databases or data lakes.
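To make the idea of scanning content for sensitive data more concrete, here is a minimal sketch of pattern-based classification. It is not the engine Microsoft uses; the pattern names, the regular expressions, and the 60 percent threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical, simplified patterns. Real sensitive information types add
# checksums, keyword proximity, and confidence levels on top of pattern matching.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def classify_value(value: str) -> set[str]:
    """Return the names of all patterns that match somewhere in the value."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(value)}


def classify_column(values: list[str], threshold: float = 0.6) -> set[str]:
    """Tag a column with a classification when enough sampled values match it."""
    if not values:
        return set()
    counts: dict[str, int] = {}
    for value in values:
        for name in classify_value(value):
            counts[name] = counts.get(name, 0) + 1
    return {name for name, hits in counts.items() if hits / len(values) >= threshold}


sample = ["123-45-6789", "987-65-4321", "n/a"]
print(classify_column(sample))  # {'US_SSN'}
```

Sampling a column and applying a threshold, rather than tagging on a single hit, is one simple way to keep a stray value from labeling an entire column as sensitive.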
Purview overview
Azure Purview is a data governance solution that helps you understand your business data across the entire data estate and maintain control over its usage. The underlying platform in Azure Purview is Apache Atlas, an open-source project for the governance of data assets and metadata management.
Azure Purview has three main areas. The Purview Data Map provides fast scanning across your data estate and shows lineage, i.e., the origin and transformation of data over its lifecycle. Both asset- and column-level lineages are shown for supported data sources.
The second area is the Data Catalog, which presents discovered data sources so that users with the right permissions can understand what data there is and where it's stored. The third, Data Insights, provides reports that show what assets you have, the glossary terms (see below) used across them, and your classification and labeling results.
A glossary is a naming convention used by business users of the data; "SKU name" and "shipment address" are examples. In contrast, the tags applied at the table, column, or file level are classifications identifying the data sensitivity level. Scan rules let you define how file types are handled and which classification rules to use, and a scan rule set combines several scan rules for easy management of the scanning process.
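The relationship between these terms can be sketched as a small data model. This is only an illustration of the concepts, not Purview's actual schema; every class, field, and rule name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScanRule:
    file_extension: str          # which file type this rule handles
    classifications: list[str]   # classification rules to run on that file type


@dataclass
class ScanRuleSet:
    """A named bundle of scan rules that can be attached to a scan as one unit."""
    name: str
    rules: list[ScanRule] = field(default_factory=list)

    def classifications_for(self, filename: str) -> list[str]:
        for rule in self.rules:
            if filename.lower().endswith(rule.file_extension):
                return rule.classifications
        return []  # file types without a matching rule are skipped


@dataclass
class Asset:
    name: str
    glossary_terms: list[str] = field(default_factory=list)   # business vocabulary
    classifications: list[str] = field(default_factory=list)  # sensitivity tags


rule_set = ScanRuleSet(
    name="finance-default",
    rules=[
        ScanRule(".csv", ["US_SSN", "CREDIT_CARD"]),
        ScanRule(".parquet", ["US_SSN"]),
    ],
)

orders = Asset(name="orders.CSV", glossary_terms=["shipment address"])
orders.classifications = rule_set.classifications_for(orders.name)
print(orders.classifications)  # ['US_SSN', 'CREDIT_CARD']
```

Note how the glossary term describes what the data means to the business, while the classifications describe how sensitive it is; the two are tracked separately on the same asset.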
The crux of a service such as Azure Purview is making sure it covers all the data sources your business is using. In this early public preview, it's quite heavily weighted toward Microsoft sources, but AWS S3 is coming (revealed in an Azure Friday video), and I'm sure many other third-party sources will follow. Today, these data sources are available:
- SQL Server on-premises
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
- Azure Blob Storage
- Azure Data Explorer
- Azure SQL DB
- Azure SQL DB Managed Instance
- Azure Synapse Analytics (formerly SQL DW)
- Azure Cosmos DB
- Power BI
These data sources use different file formats, and Microsoft documents which formats are supported. In addition to the list above, Azure Data Share is supported in Azure Purview for lineage. This is a way to share large datasets with external partners without having to provide FTP access or create copies of the data.
Azure Data Factory is a service used to extract, transform, and load (ETL) data from one system to another and is also supported for Azure Purview lineage. Stored procedures are used in databases for many tasks, including joining table A to table B to produce table C. Azure Purview automatically inventories these procedures to understand data lineage. They aren't just analyzed statically but also at runtime, so the lineage records that "this procedure ran at this time / date and produced this result." Lineage is helpful in two main scenarios: impact analysis when a change is planned and root cause analysis if there's an issue with data quality.
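Both lineage scenarios boil down to walking a graph in one direction or the other. The sketch below assumes lineage is available as simple (input, output) pairs; the asset names are made up, and this is not how Purview exposes lineage, just a way to picture the two questions.

```python
from collections import deque

# Hypothetical lineage edges: (input asset, output asset).
EDGES = [
    ("table_a", "table_c"),   # a stored procedure joins A and B into C
    ("table_b", "table_c"),
    ("table_c", "sales_report"),
]


def _build_graph(edges, reverse=False):
    """Map each asset to its direct outputs (or direct inputs when reversed)."""
    graph = {}
    for src, dst in edges:
        key, value = (dst, src) if reverse else (src, dst)
        graph.setdefault(key, set()).add(value)
    return graph


def _reachable(graph, start):
    """Breadth-first search returning every asset reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impact_of(asset, edges=EDGES):
    """Impact analysis: everything downstream that a change to asset could affect."""
    return _reachable(_build_graph(edges), asset)


def root_causes_of(asset, edges=EDGES):
    """Root cause analysis: everything upstream that feeds into asset."""
    return _reachable(_build_graph(edges, reverse=True), asset)


print(sorted(impact_of("table_a")))            # ['sales_report', 'table_c']
print(sorted(root_causes_of("sales_report")))  # ['table_a', 'table_b', 'table_c']
```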
The sensitive information types (SITs) in Azure Purview are exactly the same 200+ that are in M365; however, since Azure Purview doesn't rely on M365, you manage your SITs and classification rules directly in Purview studio.
In a larger deployment, Azure Purview will touch on many areas of your business. Microsoft provides good deployment guidance on how to think about Purview in your business, questions to ask to establish a good architecture, which stakeholders in your organization to involve, and what business scenarios to focus on. The guidance also includes a deployment plan in four phases, from pilot to production.
Features in Azure Purview
To make sure Azure Purview can access all the various data sources, it relies on Azure Key Vault to manage credentials. There's a role-based access control (RBAC) model with specific Azure Purview permissions. The Purview Data Reader Role provides read-only access to the portal and can read all content, except for scan bindings. The Purview Data Curator Role builds on the reader role: it can also edit information about assets, classification definitions, and glossary terms, and it can apply classifications and glossary terms to assets. The Purview Data Source Administrator Role does not have access to the portal but can manage all aspects of scanning the data.
Azure already has a service, Azure Data Catalog (ADC) Gen 1, that provides a subset of what Purview is bringing. It was to be replaced by ADC Gen 2, which is now Azure Purview (codename Babylon). If your business is using ADC Gen 1, be aware that (unless something changes in the preview period) a "migration" involves exporting glossaries and other data using the API and then importing the data into Purview via CSV import.
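The split between these three roles can be pictured as a simple permission table. The role names come from the description above, but the permission strings are hypothetical labels I've made up for the sketch; they are not real Azure RBAC actions.

```python
# Hypothetical permission labels mapped to the three roles described above.
ROLE_PERMISSIONS = {
    "Purview Data Reader": {"portal_access", "read_content"},
    "Purview Data Curator": {
        "portal_access",
        "read_content",
        "edit_assets",
        "edit_classification_definitions",
        "edit_glossary",
        "apply_classifications",
        "apply_glossary_terms",
    },
    # No portal access, but full control over scanning.
    "Purview Data Source Administrator": {"manage_scans"},
}


def is_allowed(roles: list[str], permission: str) -> bool:
    """A user is allowed an action if any of their roles grants it."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


print(is_allowed(["Purview Data Reader"], "edit_glossary"))                # False
print(is_allowed(["Purview Data Source Administrator"], "portal_access"))  # False
print(is_allowed(
    ["Purview Data Reader", "Purview Data Source Administrator"], "manage_scans"
))                                                                         # True
```

The last call shows why someone who both sets up scans and browses the results in the portal would need two role assignments.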
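The export-then-import step might look something like the sketch below, which turns a list of glossary terms into CSV text. The input structure and the "Name" and "Definition" column headers are placeholders of my own; check the import template Purview provides for the layout it actually expects.

```python
import csv
import io


def glossary_to_csv(terms: list[dict]) -> str:
    """Serialize glossary terms to CSV text.

    The column names are assumptions for illustration, not Purview's
    documented import format.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["Name", "Definition"], lineterminator="\n"
    )
    writer.writeheader()
    for term in terms:
        writer.writerow(
            {"Name": term["name"], "Definition": term.get("definition", "")}
        )
    return buffer.getvalue()


# Hypothetical terms, as they might come back from an export of the old catalog.
exported = [
    {"name": "SKU name", "definition": "Label for a stock keeping unit"},
    {"name": "shipment address"},
]
print(glossary_to_csv(exported))
```

Using the `csv` module instead of joining strings by hand means definitions containing commas or quotes are escaped correctly.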
Note that today, an Azure Purview account can only scan data sources in the same tenant that it exists in. I suspect that many Microsoft partners will want to unlock this power as a managed service for their clients through Azure Lighthouse.
Getting started with Azure Purview
Setting up an account can be done in the portal or in PowerShell. In this preview phase, there are a few prerequisites to take care of. Each account comes with either four or sixteen capacity units (CUs), each of which supports one API call per second. At the moment, there's no cost, but as Microsoft gains more experience with expected loads, I think they'll be more prescriptive about which size to pick. The preview comes with some limitations, including which browser to use.
Picking a platform size
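Since each capacity unit supports one API call per second, choosing between the two sizes is simple arithmetic. Here is a small sketch of that reasoning; the function name and the idea of sizing from a peak request rate are my own assumptions, not official sizing guidance.

```python
CU_SIZES = (4, 16)           # the two platform sizes offered in the preview
CALLS_PER_SECOND_PER_CU = 1  # each capacity unit supports one API call per second


def pick_capacity(peak_calls_per_second: float) -> int:
    """Return the smallest platform size whose throughput covers the expected peak."""
    for size in CU_SIZES:
        if size * CALLS_PER_SECOND_PER_CU >= peak_calls_per_second:
            return size
    raise ValueError(
        f"{peak_calls_per_second} calls/s exceeds the largest size "
        f"({max(CU_SIZES)} CUs)"
    )


print(pick_capacity(3))   # 4
print(pick_capacity(10))  # 16
```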
When your account has been provisioned, you assign RBAC permissions to the appropriate accounts, connect to the Azure Key Vault, and start adding data sources. You probably don't want to run this against your production data, so to try it out, Microsoft provides a starter kit with sample data.
Conclusion
On my wish list is integration with another data-related Azure service, Synapse, which may be coming. The service is only just out in preview, and I look forward to seeing how it evolves. It definitely completes Microsoft's data governance portfolio.