Mark Beckner
Quick Start Guide to Azure Data Factory, Azure Data Lake Server, and Azure Data Warehouse
ISBN 978-1-5474-1735-3
e-ISBN (PDF) 978-1-5474-0127-7
e-ISBN (EPUB) 978-1-5474-0129-1
Library of Congress Control Number: 2018962033
Bibliographic information published by the Deutsche Nationalbibliothek: The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.
© 2019 Mark Beckner
Published by Walter de Gruyter Inc., Boston/Berlin
Printing and binding: CPI books GmbH, Leck
Typesetting: MacPS, LLC, Carmel
www.degruyter.com
About De|G PRESS
Five Stars as a Rule
De|G PRESS, the startup born out of one of the world's most venerable publishers, De Gruyter, promises to bring you an unbiased, valuable, and meticulously edited work on important topics in the fields of business, information technology, computing, engineering, and mathematics. By selecting the finest authors to present, without bias, information necessary for their chosen topic for professionals, in the depth you would hope for, we wish to satisfy your needs and earn our five-star ranking.
In keeping with these principles, the books you read from De|G PRESS will be practical, efficient and, if we have done our job right, yield many returns on their price.
We invite businesses to order our books in bulk in print or electronic form as a best solution to meeting the learning needs of your organization, or parts of your organization, in a most cost-effective manner.
There is no better way to learn about a subject in depth than from a book that is efficient, clear, well organized, and information rich. A great book can provide life-changing knowledge. We hope that with De|G PRESS books you will find that to be the case.
DOI 10.1515/9781547401277-202
Acknowledgments
Thanks to my editor, Jeff Pepper, who worked with me to come up with this quick start approach, and to Triston Arisawa for jumping in to verify the accuracy of the numerous exercises that are presented throughout this book.
DOI 10.1515/9781547401277-203
About the Author
Mark Beckner is an enterprise solutions expert. With over 20 years of experience, he leads his firm Inotek Group, specializing in business strategy and enterprise application integration with a focus in health care, CRM, supply chain and business technologies. He has authored numerous technical books, including Administering, Configuring, and Maintaining Microsoft Dynamics 365 in the Cloud, Using Scribe Insight, BizTalk 2013 Recipes, BizTalk 2013 EDI for Health Care, BizTalk 2013 EDI for Supply Chain Management, Microsoft Dynamics CRM API Development, and more. Beckner also helps up-and-coming coders, programmers, and aspiring tech entrepreneurs reach their personal and professional goals. Mark has a wide range of experience, including specialties in BizTalk Server, SharePoint, Microsoft Dynamics 365, Silverlight, Windows Phone, SQL Server, SQL Server Reporting Services (SSRS), .NET Framework, .NET Compact Framework, C#, VB.NET, ASP.NET, and Scribe. Beckner’s expertise has been featured in Computerworld, Entrepreneur, IT Business Edge, SD Times, UpStart Business Journal, and more. He graduated from Fort Lewis College with a bachelor’s degree in computer science and information systems. Mark and his wife, Sara, live in Colorado with their two children, Ciro and Iyer.
DOI 10.1515/9781547401277-204
Contents
Chapter 1: Copying Data to Azure SQL Using Azure Data Factory
  Creating a Local SQL Instance
  Creating an Azure SQL Database
  Building a Basic Azure Data Factory Pipeline
  Monitoring and Alerts
  Summary
Chapter 2: Azure Data Lake Server and ADF Integration
  Creating an Azure Data Lake Storage Resource
  Using Git for Code Repository
  Building a Simple ADF Pipeline to Load ADLS
  Combining into a Single ADF Pipeline
  Data Lake Analytics
  Summary
Chapter 3: Azure Data Warehouse and Data Integration using ADF or External Tables
  Creating an ADW Instance
  ADW Performance and Pricing
  Connecting to Your Data Warehouse
  Modeling a Very Simple Data Warehouse
  Load Data Using ADF
  Using External Tables
  Summary
Index
Introduction
A challenge was presented to me: distill the essence of Azure Data Factory (ADF), Azure Data Lake Server (ADLS), and Azure Data Warehouse (ADW) into a short, fast quick start guide. There's a tremendous amount of territory to cover when diving into these technologies! What I hope to accomplish in this book is the following:
1. Lay out the steps to set up each environment and perform basic development within it.
2. Show how to move data between the various environments and components, including local SQL and Azure instances.
3. Demystify some of the more elusive features (for example, check out the overview of External Tables at the end of Chapter 3!).
4. Save you time!
By the end of this book, you'll understand how to set up an ADF pipeline integration with multiple sources and destinations in very little time. You'll know how to create an ADLS instance and move data in a variety of formats into it. You'll be able to build a data warehouse that can be populated with an ADF process or by using external tables. And you'll have a fair understanding of permissions, monitoring the various environments, and doing development across components.
Dive in. Have fun. There is a ton of value packed into this little book!
DOI 10.1515/9781547401277-206
Chapter 1
Copying Data to Azure SQL Using Azure Data Factory
In this chapter we'll build out several components to illustrate how data can be copied between data sources using Azure Data Factory (ADF). The easiest way to illustrate this is by using a simple on-premises local SQL instance with data that will be copied to a cloud-based Azure SQL instance. We'll create an ADF pipeline that uses an integration runtime component to establish the connection to the local SQL database. A simple map will be created within the pipeline to show how the columns in the local table map to the Azure table. The full flow of this model is shown in Figure 1.1.
Figure 1.1: Components and flow of data being built in this chapter
Creating a Local SQL Instance
To build this simple architecture, a local SQL instance will need to be in place. We'll create a single table called Customers. By putting a few records into the table, we can then use it as the base to load data into the new Azure SQL instance you create in the next section of this chapter. The table script and the script to load records into this table are shown in Listing 1.1. A screenshot of the local SQL Server instance is shown in Figure 1.2.
DOI 10.1515/9781547401277-001
Figure 1.2: Creating a table on a local SQL Server instance
Listing 1.1: The Local SQL Customers Table with Data
CREATE TABLE [dbo].[Customers](
    [CustomerID] [nchar](10) NOT NULL,
    [LastName] [varchar](50) NULL,
    [FirstName] [varchar](50) NULL,
    [Birthday] [date] NULL,
    [CreatedOn] [datetime] NULL,
    [ModifiedOn] [datetime] NULL
) ON [PRIMARY]
GO

INSERT [dbo].[Customers] ([CustomerID], [LastName], [FirstName], [Birthday], [CreatedOn], [ModifiedOn])
VALUES (N'CUST001', N'Jones', N'Jim',
    CAST(N'1980-10-01' AS Date),
    CAST(N'2017-09-01T12:01:04.000' AS DateTime),
    CAST(N'2018-04-01T11:31:45.000' AS DateTime))
GO

INSERT [dbo].[Customers] ([CustomerID], [LastName], [FirstName], [Birthday], [CreatedOn], [ModifiedOn])
VALUES (N'CUST002', N'Smith', N'Jen',
    CAST(N'1978-03-04' AS Date),
    CAST(N'2018-01-12T01:34:12.000' AS DateTime),
    CAST(N'2018-01-12T01:45:12.000' AS DateTime))
GO
Creating an Azure SQL Database
Now we'll create an Azure SQL database. It will contain a single table called Contact. To begin with, this Contact table will contain its own record, separate from data in any other location, but it will eventually be populated with the data copied from the local SQL database. To create the Azure database, you'll need to log into portal.azure.com. Once you've successfully logged in, you'll see a list of available actions in the left-hand navigation toolbar. To create the database, click on the SQL databases menu item and then click the Add button, as shown in Figure 1.3.
Figure 1.3: Adding a new Azure SQL database
You can now enter the information that pertains to your new database. You can get more information on these properties by clicking the icon next to each label. Some additional details on several of these properties are noted as follows:
1. Database name—the database name will be referenced in a variety of locations, so name it just like you would a local database (in this case, we'll refer to it as InotekDemo).
2. Subscription—you'll have several potential options here, based on what you have purchased. Figure 1.3 shows Visual Studio Enterprise, as that is the MSDN subscription that is available. Your options will look different depending on licensing.
3. Select source—go with a new blank database for this exercise, but you could base it on an existing template or backup if one was available that matched your needs.
4. Server—this is the name of the database server you will connect to and where your new database will live. You can use the default or you can create your own (see Figure 1.4). A database server will allow you to separate your databases and business functions from one another. This server will be called "Demoserverinotek" with serveradmin as the login name.

Figure 1.4: Configuring the new Azure SQL Server where the database will reside

5. Pricing tier—pricing in Azure is a little overwhelming, and you'll want to think about costs across all components before you decide. For now, we'll select the Basic tier, which allows for up to 2 GB of data.
When you're ready, click the Create button and the deployment process in Azure will begin. You'll see a notification on your toolbar (see Figure 1.5) that shows the status of this deployment. After a minute or two your database deployment will be complete and you'll be able to click on the new database and see information about it.
Figure 1.5: Notification of the deployment in process
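For readers who prefer scripting to the portal, a Basic-tier database like this one can also be created with T-SQL. This is only a sketch, not the approach used in this chapter: it assumes you are connected to the master database of the new logical server (Demoserverinotek) with the serveradmin login, and it uses the standard Azure SQL Database EDITION, SERVICE_OBJECTIVE, and MAXSIZE options.

-- Run while connected to the master database of the Azure SQL logical server.
-- Creates the InotekDemo database on the Basic tier with a 2 GB size cap.
CREATE DATABASE [InotekDemo]
(
    EDITION = 'Basic',
    SERVICE_OBJECTIVE = 'Basic',
    MAXSIZE = 2 GB
);

Either way you end up with the same kind of Basic-tier database described above.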
There are several ways to connect to this new server. You can use the Azure tools or you can connect from a local SQL tool like SQL Server Enterprise Manager. Using Enterprise Manager requires that you enter information about the SQL Server you are trying to connect to. To connect to your Azure server, click on the Connection strings property of your database in Azure. You’ll want to grab the server name from here and enter it into your server connection window (shown in Figure 1.6).
Figure 1.6: Connecting to the new Azure SQL Server from a local Enterprise Manager connection window
Next, you’ll type in the login and password and click Connect. If you haven’t connected to an Azure SQL instance before, you will be asked to log into Azure. If this occurs, click the Sign in button and enter the credentials that you used to connect to the Azure portal (see Figure 1.7). Once authenticated, you’ll be able to select whether to add the specific IP you’re on or add the full subnet. You’ll be required to do this each time you connect to your SQL instance from a new IP.
Figure 1.7: The first time you connect from Enterprise Manager on a new computer, you will be required to enter this information
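The portal prompt shown in Figure 1.7 is the easiest way to whitelist your client IP, but as a point of reference the same server-level firewall rules can be managed with T-SQL through the sp_set_firewall_rule procedure in the master database. The rule name and IP addresses below are placeholders, not values from this exercise.

-- Run in the master database of the Azure SQL logical server.
-- Creates or updates a server-level firewall rule for a single client IP.
EXECUTE sp_set_firewall_rule
    @name = N'ClientWorkstation',        -- placeholder rule name
    @start_ip_address = '203.0.113.10',  -- placeholder start of IP range
    @end_ip_address = '203.0.113.10';    -- placeholder end of IP range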
With the credentials entered accurately, Enterprise Manager will connect to your new SQL Azure instance and you’ll be able to create artifacts just like you would with a local SQL instance. For this exercise, we’ll create a table called Contact in the InotekDemo database where we’ll eventually upload data from a local SQL instance. The table looks like that shown in Figure 1.8, with SQL script shown in Listing 1.2.
Figure 1.8: Creating a table in the Azure database from Enterprise Manager
Listing 1.2: A new table created on the Azure SQL Server database
CREATE TABLE [dbo].[Contact](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [First] [nchar](20) NULL,
    [Last] [nchar](20) NULL,
    [DOB] [date] NULL,
    [LastModified] [datetime] NULL
) ON [PRIMARY]

In addition to SQL Enterprise Manager, you can also use the Query Editor tool that's available in the Azure web interface. To illustrate how this tool works, we'll use it to insert a record into the table. To do this, click on the Query editor link in the Azure navigation bar (see Figure 1.9). A window will open where you'll be able to see your objects and type in standard SQL commands.
Figure 1.9: Using the Query Editor tool available in the Azure portal
You'll have basic access to write queries and view SQL objects. To insert a record, use a standard insert script like that shown in Figure 1.10 and click the Run button. You'll see information about your SQL transaction on the two available tabs, Results and Messages. You can save your query or open a new one; both of these actions use your local file path and are not actions that take place within Azure itself. You can also use the INSERT script shown in Listing 1.3.
Listing 1.3: Alternative Insert Script
INSERT INTO dbo.Contact (First,Last,DOB) VALUES ('John','Doe','2000-01-01')
Figure 1.10: Inserting a record using the Azure portal Query tool
You can also edit the data in your tables through an editing interface in Azure by clicking on the Edit Data button. This will open an editable grid version of your table where you can modify your existing data or create new records. New records are added using the Create New Row button, shown in Figure 1.11.
Figure 1.11: Editing data directly in the Azure Query tool
Building a Basic Azure Data Factory Pipeline
At this point, you have a local SQL database and a hosted Azure SQL database, both populated with a small amount of data. We'll now look at how to pull the data from your local SQL instance into your Azure SQL instance using Azure Data Factory (ADF). To create a new ADF instance, click on the Create a resource link on your main Azure navigation toolbar and then select Analytics. Click on the Data Factory icon in the right-hand list, as shown in Figure 1.12.
Figure 1.12: Creating a new Data Factory resource
In the configuration screen that opens, you'll be required to enter a name for your new ADF process. You'll also need to select several other properties, one of which is the Resource Group. For ease of reference and organization, we'll put this ADF component in the same Resource Group that was used for the SQL Azure server instance created earlier in this chapter. Figure 1.13 shows the base configuration for the data factory. Click the Create button once the configuration has been completed.
Figure 1.13: A new data factory configuration
The creation of the ADF will take a few moments, but eventually you’ll see a notification that your deployment has been completed. You can see this on the notification bar, where you can also click the Go to resource button (see Figure 1.14). You can also access this via the All resources option on the main Azure portal navigation toolbar.
Figure 1.14: Notification indicating that the data factory has been deployed
Clicking this button will take you to an overview screen of your new ADF. You can also access this overview screen by clicking All resources from the main Azure navigation toolbar and clicking on the name of the ADF that was just created. You’ll see a button on this overview screen for Author & Monitor (see Figure 1.15). Click it to open a new window where the actual ADF development will take place. One item to note—some functions in Azure will not work in Internet Explorer. Only Chrome and Edge are officially supported. If you run into functionality issues, try another browser!
Figure 1.15: To begin development of an ADF, click on the Author & Monitor button
For development within the ADF web framework, there are several options that will be available for you to choose from (see Figure 1.16). We’ll look at using the Create pipeline functionality and SSIS Integration Runtime later in this book (both of which allow for development of custom multistep processes), but for now we’ll look at Copy Data, which is a templated pipeline that lets you quickly configure a process to copy data from one location to another. We’ll use this to copy data from the local SQL instance to the Azure instance. Note that in cases where it is necessary to move data from a local SQL instance into an Azure instance, we can use all of these approaches (a pipeline, the copy data tab, and the SSIS integration runtime), as well as others, which we won’t detail here (like replication, external integration platforms, local database jobs, etc.!).
Figure 1.16: Use the Copy Data button for development within the ADF
To use the Copy Data component, click on the Copy Data icon on the ADF screen. You will see a new area open where you can configure the copy process. Underlying this web interface is a standard ADF pipeline, but instead of using the pipeline development interface, you can use a more workflow-oriented view to configure things. You’ll begin by naming the task and stating whether it will run once or whether it will run on a recurring schedule. If you select the recurring schedule option, you’ll see several items that will allow you to define your preferences. Note that one of the options is a Tumbling Window, which basically means that the process will complete before the timer starts again so that two processes of the same type don’t overlap one another. A standard scheduler will run at a given time, regardless of whether the previous instance has completed or not. As you can see in Figure 1.17, the schedule options are typical.
Figure 1.17: Setting the name and schedule attributes of the process
The next step in the process, after the base properties have been defined, is to specify the source of the data that will be copied. There are many options and many existing connectors that allow for rapid connection to a variety of data sources. For this exercise we’ll point to the local SQL database with the customer data. Click on the Database tab, click the Create new connection button, and select SQL Server from the list of connectors (see Figure 1.18).
Figure 1.18: Creating a SQL source
When the SQL Server icon is selected, you’ll have an extensive set of options to determine how to connect to your source database. First, you’ll need to create an integration runtime to connect to a local SQL instance. The integration runtime will allow you to install an executable on the local machine that will enable ADF to communicate with it. Once the integration runtime has been created, you’ll be able to see your local SQL Server and configure your ADF pipeline to point at the source table you are after. Here are the steps to set up this source: 1. On the New Linked Service window (Figure 1.19), set the name of your service and then click the +New link in the Connect via integration runtime dropdown.
Figure 1.19: Creating a new integration runtime
2. On the next screen that opens, select the Self-Hosted button and click Next.
3. Next, give the integration runtime a descriptive name and click Next.
4. There will now be a screen that has several options—you can use the Express setup or the Manual setup. As you can see in Figure 1.20, the installation of the integration runtime on your local machine (where the local instance of SQL Server exists) is secured through two encrypted authentication keys. If you use the Manual setup, you'll need to reference these keys. The Express setup will reference them for you. For this exercise, click the Express setup link.

Figure 1.20: Options for installing the integration runtime on a local machine

5. When you click on the Express setup link, an executable will download, and you'll need to run it. This will take some time to install. A progress indicator will appear on the screen while installation is taking place (see Figure 1.21).
Figure 1.21: The installation progress of the local executable for the integration runtime
6. When the installation has completed, you’ll be able to open the Integration Runtime Configuration Manager on your local computer. To validate that this configuration tool is working, click on the Diagnostics tab and test a connection to your local database that you plan to connect to from Azure. Figure 1.22 shows a validated test (note the checkmark next to the Test button). You can connect to virtually any type of local database, not just SQL Server (just select the type of connection you need to test from the dropdown).
Figure 1.22: Testing the connection from the newly installed local integration runtime
7. Finally, back in the Azure portal on the Integration Runtime Setup screen, click the Finish button.
When you have finished creating and installing the Integration Runtime component, you'll find yourself back on the New Linked Service window. You'll need to specify the server name, database name, and connection information. If you tested your connection in the locally installed Integration Runtime Configuration Manager as shown in the steps above, you can reuse the same information here. Figure 1.23 shows the connection to a local database instance configured in the ADF screen that has been tested and validated.
Figure 1.23: Testing connectivity to the local database from the Azure portal
Click the Finish button. The source data store will now be created. You'll then be able to proceed to the next step of setting up the Copy Data pipeline. Do this by clicking the Next button on the main screen, which will pop up a new window where you can indicate what table set or query you will be using to extract data from the local database. For this exercise, there is only a single table available to select, which is called Customers. Click this table and a preview of the current data will be shown (see Figure 1.24).
Figure 1.24: Selecting the source table(s) where data will be copied from
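For this exercise the entire Customers table is selected, but the same screen also accepts a query in place of a table. If you only wanted rows changed since the last run, a source query along these lines could be used; the one-hour window is just an illustration and would need to match whatever schedule you chose earlier.

-- Pulls only the Customers rows modified in the last hour instead of the full table.
SELECT CustomerID, LastName, FirstName, Birthday, CreatedOn, ModifiedOn
FROM dbo.Customers
WHERE ModifiedOn >= DATEADD(HOUR, -1, GETDATE());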
Click Next and you will see a screen where filtering can be applied. We’ll leave the default setting as no filtering. Clicking Next on the filter page will take you to the Destination stage of configuring the Copy Data pipeline. For the current solution, the destination is going to be the SQL Azure database that was created earlier in this chapter. The setup of the destination is like the setup of the source, except in this case no additional integration runtime will need to be set up or configured, since the destination is an Azure database. Click the Azure tab and press the Create new connection button. Select the Azure SQL Database option, as shown in Figure 1.25.
Figure 1.25: Selecting an Azure SQL Database as the destination
Click the Continue button and a new screen will appear within which you can configure the connection to the Azure database. Here, you will set the name of the connection and then select the default AutoResolveIntegrationRuntime option for the integration runtime component. This integration runtime will allow the pipeline to connect to any Azure SQL servers within the current framework. Select the Azure subscription that you used to create your Azure SQL Server earlier in this chapter and then select the server and database from the dropdowns that will auto-populate based on what is available. Enter the appropriate credentials and then test the connection. Figure 1.26 shows the configuration for the Azure SQL connection.
Figure 1.26: The configuration screen for the destination database
Click Finish to complete the creation of the destination connection. You'll be returned to the main screen of the Copy Data pipeline configuration flow. If you click on the All tab, you'll see both the source and destination connections. You can roll your mouse over them to see a quick view of the configuration for each (see Figure 1.27).
Figure 1.27: Source and destination connections are now available
Continue with the configuration process by clicking Next, which will allow you to define the targeted table in the Azure SQL database where the source data will be copied to. There is only a single table that we’ve created in the Azure database. This table will show in the dropdown as Contact. Select this table and click Next. A screen will open where the table mapping between the source and the destination takes place. We’re working with two simple tables, so the mapping is one to one. Figure 1.28 shows the completed mapping for the tables that are being used in this exercise. In cases where your source tables don’t match cleanly with the target, you’ll need to write a process on your local SQL instance that will load the data into a staging table that will allow for ease of mapping. You can always write custom mapping logic in a custom Azure pipeline, but leaving the logic at the database level, when possible, will generally ease your development efforts.
Figure 1.28: Perform the mapping of the source columns to the target columns
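One lightweight variation on the staging-table idea is a view on the local instance that renames the source columns to match the Azure Contact table, so the Copy Data mapping stays one to one. This is only a sketch; the view name is invented for illustration and the column list is taken from the two tables defined earlier in this chapter.

-- Exposes the local Customers table with column names that match dbo.Contact,
-- so the Copy Data wizard can map the columns one to one.
CREATE VIEW dbo.vw_ContactStaging
AS
SELECT
    FirstName  AS [First],
    LastName   AS [Last],
    Birthday   AS [DOB],
    ModifiedOn AS [LastModified]
FROM dbo.Customers;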
Click Next to continue. You’ll see options to set fault tolerance, which we’ll leave defaulted for now. For large datasets, you may want to simply skip rows that have errors. You can define your preference here and then click Next. You’ll see a summary of all the configurations that have been done. Review this summary (shown in Figure 1.29), as the next step you’ll take is to deploy the Copy Data pipeline. Click Next when you have reviewed the summary. The deployment of the completed pipeline will now take place.
Figure 1.29: Summary of the work that has been done
With the pipeline deployed, you’ll be able to test it. There are several ways to navigate around Azure, but for this exercise click the Pencil icon on the left side of the screen. This will show the pipeline you just developed (which has a single step of Copy Data) and the two datasets configured. You’ll want to test your process. This can be done most easily by clicking the Debug button in the upper toolbar of your pipeline. To see this button, you’ll need to click on the pipeline name in the left-hand toolbar. Figure 1.30 shows the various tabs that need to be clicked to be able to press the Debug button.
Figure 1.30: Clicking on the Debug button to test the process
When the Debug button is clicked, the process will begin to run. It takes a second for it to spin up, and you can monitor the progress of the run in the lower output window. The process will instantiate and begin to execute. When it completes, it will show either that it succeeded or that it failed. In either case, you can click the eyeglasses icon to see details about what happened during the run (see Figure 1.31).
Figure 1.31: Process has completed successfully, click the eyeglasses icon to see details
By clicking the Details button, you’ll see the number of records that were read, the number that were successfully written, and details about rows that failed. As shown in Figure 1.32, the two records that were in the source database were successfully copied to the target. You can verify that the data is truly in the targeted table by running a select query against the Azure SQL table and reviewing the records that were loaded.
Figure 1.32: Summary showing that the data was successfully copied from the source to the target
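To double-check the result outside of ADF, connect to the Azure database as before and run a simple select against the destination table; the rows copied from the local Customers table should appear alongside the record inserted through the Query Editor earlier.

-- Confirms the copied rows landed in the Azure Contact table.
SELECT ID, [First], [Last], DOB, LastModified
FROM dbo.Contact
ORDER BY ID;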
Monitoring and Alerts
The pipeline that was just created was set up to run on a scheduled basis, which means the data will continue to copy to the destination table every hour. To see the status, click on the gauge (Monitor) icon on the left-hand side of the screen. This will show you a history of the pipelines that have run, along with several options for sorting and viewing metrics. Figure 1.33 shows the current pipeline's audit history.
Figure 1.33: Seeing the history of the scheduled pipelines
From this same monitoring area, you can see the health of your integration runtimes by clicking on the Integration Runtimes tab at the top of the screen. You can also click on Alerts and Metrics, both of which will open new tabs within your browser. When you click Monitoring, you will land on the main Azure portal monitoring page, where there are endless options for seeing what is going on within your various Azure components. Some of the key areas to note here are the Activity Log (click here to see all activity across your Azure solution) and Metrics (build your own custom metrics to see details across your deployed components). But for monitoring your ADF solution itself, you'll most likely want to remain within the Monitor branch of the ADF screens.
When you click Alerts, you'll find that you can set up a variety of rules that will allow you to monitor your databases. These alerts are primarily intended for administrative monitoring, but there are dozens of alert types that can be configured, and you'll want to become familiar with what is in here to decide whether you're interested in setting up notifications. Just like the main Monitoring page, these alerts apply more to the generic Azure portal than they do specifically to ADF, but you can certainly monitor events related to your ADF environment. You can also build notifications directly into your ADF pipelines as you build them out. To set up an alert from the main Alert page, click on +New alert rule on the toolbar, as shown in Figure 1.34.
Figure 1.34: Creating a new alert
A new screen will open where a rule can be created. Three steps need to be taken to create a rule: define the alert condition, name the alert (which ultimately gets included in the notification that is sent), and define who gets notified. An example of a configured alert rule is shown in Figure 1.35. You can set up notifications such as emails, SMS texts, and voice alerts, along with a variety of others. The number of configuration options here is extensive.
Figure 1.35: Creating an alert rule
Summary
A lot of territory has been covered in this chapter. We've looked at creating an ADF pipeline that uses an integration runtime to connect to a local SQL Server instance and copy data from a table on that local instance to a table in an Azure SQL database. We've looked at the options for configuring and working with each of these components, as well as how to create basic monitoring and alerting. In the next chapter, we'll expand into deeper customization and development of an ADF pipeline and look at how additional data sources can be brought into the mix by working with an Azure Data Lake Server.
Chapter 2
Azure Data Lake Server and ADF Integration
Having explored the basic approach to copying data to an Azure SQL instance using an ADF pipeline in the previous chapter, we'll now expand our data storage options by working with Azure Data Lake Server (ADLS) storage. With ADLS, just about any type of data can be uploaded, such as spreadsheets, CSV files, database dumps, and other formats. In this chapter, we'll look at uploading CSV and database extracts containing simple sales information to ADLS. This sales information will be related to the customer data that was used in the previous chapter. For example, the CSV file will contain the customer ID that can be related to the record in the Azure SQL database. This relationship will be made concrete in the next chapter when we pull the data from Azure SQL and ADLS into an Azure Data Warehouse.
To move data into ADLS, we'll build out a pipeline in ADF. In the previous chapter we used the "Copy Data" pipeline template to move the data into SQL Azure; we'll make a copy of this and alter it to push data to ADLS. To create the copy, we'll build a code repository in Git and connect it to the ADF development instance. Once the data has been loaded into ADLS, we'll look at using an Azure Data Lake Analytics instance for basic querying against the data lake. The flow of the data and components of this chapter are shown in Figure 2.1.
Figure 2.1: Components and flow of data being built in this chapter (diagram elements: Git repository, on-premise machine, integration runtime, local SQL Server, CSV and TXT files, Azure, ADF pipeline, ADLS, ADLA query)

DOI 10.1515/9781547401277-002
Creating an Azure Data Lake Storage Resource
It's possible to create an Azure Data Lake Storage resource as part of the ADF custom pipeline setup, but we'll create the ADLS resource first and then incorporate it into a copy of the pipeline that was created in the previous chapter. To create this ADLS instance, click on +Create a resource in the main Azure portal navigation and then click on Storage. Next, click on the Data Lake Storage Gen 1 option under the featured components (see Figure 2.2).
Figure 2.2: Creating a new ADLS resource
When you have selected the Data Lake Storage Gen1 resource option, a new screen layover will open where you can configure the ADLS top-level information. Figure 2.3 shows the configuration used for this exercise. When you are done setting the properties, click Create, which will deploy the ADLS resource. The properties set on this page include:
1. Name—note that you can only name your ADLS resource with lowercase letters and numbers. The name must be unique across all existing ADLS instances, so you'll have to experiment a little!
2. Subscription, Resource Group, and Location—for this example use the same information that you used to create the Azure components in Chapter 1.
3. Pricing package—pricing on ADLS is inexpensive, but care should be taken when determining what plan you want to utilize. If you click the information circle next to payments, a new window will open where you can calculate your potential usage costs.
4. Encryption settings—by default, encryption is enabled. The security keys are managed within the ADLS resource. These keys will be used later when the ADLS resource is referenced from the ADF pipeline that will be created.
Figure 2.3: Configuration of top-level settings for the ADLS instance
When the deployment of the ADLS resource has completed, you'll get a notification (similar to that shown in Figure 2.4). You'll be able to navigate to it by clicking either on the Go to resource button in the notification window or by clicking on All resources in the main Azure toolbar.
Figure 2.4: A notification will appear when the resource has been created
Opening the newly created ADLS resource shows several actions that can be taken, as shown in Figure 2.5. The default screen shows usages, costs incurred, and other reporting metrics. The navigation bar to the left shows options for setting up alerts and monitoring usage, like what was described at the end of Chapter 1. Most importantly, there’s the option Data explorer, which is the heart of configuration and access to data within the data lake itself.
Figure 2.5: Accessing Data explorer in ADLS
Clicking the Data explorer button will open your data lake server in a new window. Your first step will be to create a new folder. This folder will contain the data that you’ll be uploading to the data lake. For this exercise we’ll call it SalesData. In Figure 2.6 you can see the navigation frame on the left and the newly created SalesData folder.
Figure 2.6: Creating a new folder
We'll manually upload a CSV file now. Later in this chapter we'll use an ADF pipeline to upload data automatically. Click the SalesData folder to open it. Once inside the folder, click the Upload button. Figure 2.7 shows the data contained in the CSV that will be uploaded for this discussion (there's an image of it in Excel and Notepad so that you can see it is just a comma separated list of values).
Figure 2.7: The data being uploaded, shown in Excel and Notepad
You'll see three columns in this CSV file. The first column is Customer ID, which matches the ID column in the Azure SQL Contact table from Chapter 1. The second column is Sale Date and the third is Sale Amount. There are several sales for each customer. Of course, you aren't limited to CSVs in a data lake—you can upload just about anything. A real-world scenario would be a large export of sales data from an ERP system in flat file format, uploaded hourly or daily to the ADLS instance. We'll look at this real-world application in more detail in Chapter 3. Figure 2.8 shows the CSV uploaded into the data lake.
Figure 2.8: Uploading a file manually to ADLS
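Purely for illustration, a file with the three-column layout just described might look like the rows below. These values are hypothetical placeholders, not the data shown in Figures 2.7 and 2.8, and they assume a header row and Customer ID values that line up with the integer ID column of the Contact table from Chapter 1.

Customer ID,Sale Date,Sale Amount
1,2018-05-02,150.00
1,2018-05-16,75.50
2,2018-05-03,240.25
2,2018-05-21,60.00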
Your uploaded file can now be seen in the main data explorer menu. You can now click on the context menu of the file. One option is to preview the file, which lets you see all of the data within the file. In most data lake scenarios, you’ll be dealing with extremely large files that have a variety of formats, so the preview functionality is critical as you navigate through your files to find the information and structure you are after. Figure 2.9 shows the context menu with a preview of the CSV that was uploaded.
Figure 2.9: Previewing the data that was uploaded
Using Git for Code Repository
You've just created an ADLS instance and uploaded a file manually. This has limited application, and you really need to figure out how to automate this through code. However, before moving on to creating an ADF pipeline and the related components needed for automating the transfer of data into your new ADLS, let's pause to look at code repositories. With pipelines in your ADF, you'll likely want to back things up or import pipelines into other instances that you create in the future. To export or import pipelines and related components, you must have a code repository to integrate with. Azure gives you two options: Git and Azure DevOps. We'll break down getting Git set up and a repository created so that ADF components can be exported and imported through it.
The first step of this process is to get your Git repository associated with Azure. Navigating to the root ADF page, you'll see an option for Set up Code Repository. Clicking on this will allow you to enter your repository information (see Figure 2.10). If you don't see this icon, you will need to click on the repository button in the toolbar at the top of the screen within the Author and Monitor window where pipeline development takes place.
Figure 2.10: Setting up a code repository
If you don't already have access to Git, follow these four steps to set up a repository and connect ADF to it:
1. Go to github.com and set up an account (or log into an existing account).
2. Create a new code repository. Once the repository has been set up, you'll need to initialize it. The repository home page will have command line information to initialize your new repository with a readme document. To run these commands, you'll need to download Git and install it on your local machine. Once it has been installed locally, open a command prompt window and type the commands listed. In the end, you should have a repository set up that has a single readme document. Note that you'll most likely have to play around with these steps to get them to work—keep at it until you succeed, but it may take you a bit of work! If you don't want to mess around with command lines (who would?), make sure you click on the Initialize this repository with a README option when you create your repository (see Figure 2.11).
Figure 2.11: Make sure and initialize your Git repository
3. Once the repository has been set up, you will be able to reference it from within ADF. Click on the Set up Code Repository button on the main ADF page.
4. A configuration screen will open within Azure that requests your Git user account. Entering this will allow you to connect to Git and select the repository you've just created. Once you've connected successfully to the repository, you'll see that Git is now integrated into your Authoring area (see Figure 2.12).
Figure 2.12: The repository will show in the ADF screen once added successfully
As soon as you have your repository connected, the components you've created in Azure will automatically be synchronized with your Git repository. After completing the connection and logging into the Git web interface, you should see something like what is shown in Figure 2.13. Here, the dataset, pipeline, linked service, trigger, and integration runtime folders have all been created, based on the components that have been created to date in the ADF solution. Within each of the folders are the component files. These files could be imported into other ADF instances by copying them into the repository for that ADF solution. Azure and Git perform continual synchronizations once connected, so your data is always backed up and accessible.
Figure 2.13: Once connected, the code in your Azure ADF will automatically sync with Git
To understand how to use the repository, let's take a quick look at importing a pipeline from another solution that may have been developed. We'll assume that this solution already exists somewhere and we are trying to take a copy of that pipeline and upload it to the current ADF instance. To do this, follow these steps:
1. Determine which files you want to copy over. In the case of the pipeline created in Chapter 1, for example, there is a single pipeline file, two dataset files, an integration runtime, linked services, and other components. You can see in Figure 2.14 the resources on the left in ADF, along with the pipeline files in the Git folder.
Figure 2.14: The JSON pipeline file in Git
2. You can transfer just the core pipeline file or any of the other associated files (e.g., data sources, integration runtimes, etc.) individually. Each will import with its original configuration. If, for example, you import a pipeline that references several data sources but you don't import the data sources, the pipeline will still open. You'll just have to set up new data sources to link into it.
3. Every file in ADF exports as a JSON file. You can edit these files before importing if you want. For example, Listing 2.1 shows the CopyLocalDataToAzure pipeline as implemented in Chapter 1. If you want to alter the name, just edit the JSON file's name property (there are two of them you'll need to set). If you want to alter the mappings, change the columnMappings node information. For advanced developers, editing the JSON of existing pipelines and related components can save time and allow you to build out multiple processes with common logic with relative ease.
Listing 2.1: JSON of pipeline
{ "name": "CopyLocalDataToAzure",