ADF Spark Activity

Ashish Kumar Azure September 21, 2018

Introduction

Spark Activity is one of the data transformation activities supported by Azure Data Factory. This activity runs the specified Spark program on your Apache Spark cluster in Azure HDInsight.

Prerequisite for ADF Spark Activity

1. Create a Azure Storage Account and select account type as Blob storage.

2. Create an Apache Spark cluster in Azure HDInsight and Associate the Azure storage account (Blob storage).

While creating HDInsight Spark Cluster select Primary Storage Type as Azure Storage & select storage account that you have created.

Note- for ADF Spark Activity Blob Storage should be Primary Storage only. You can add ADL as secondary storage if you want to access input Data from ADL.

3. Create Folder Structure for Spark Job in Blob Storage.

Once you create a spark cluster with Storage as Blob Storage, it will create a container in Blob storage with Cluster name. Create Spark Job folder structure in this container.

Folder Structure should be like below.

SparkWordCount --[ Folder ]

WordCount-0.0.1-SNAPSHOT.jar --[ File ]

files --[ Folder ]

input1.txt --[ File ]

input2.txt --[ File ]

SparkWordCount.properties --[ File ]

--put all files that want to pass to spark job

jars --[ Folder ]

package1.jar --[ File ]

package2.jar --[ File ]

--put dependency jars

logs --[ Folder ]

Output --[ Folder ]

How to create folder in Blob storage.

There is two way to create folder.

1. Using Azure Storage Explorer. It will be install on your VM from where you want to access.

2. Using putty . If you are using putty and creating dir

Hadoop fs -mkdir wasbs://<containername>@<accountname>.blob.core.windows.net/<path>

Hadoop fs -mkdir wasbs://mycontainer@myaccount.blob.core.windows.net/SparkWordCount

It will not show as folder in Azure portal unless you are not putting any file in this dir.

How to create a pipeline with Spark activity.

1. Create a data factory if not exist.

2. Create an Azure Storage linked service to link your Azure storage that is associated with your HDInsight Spark cluster to the data factory.

3. Create an Azure HDInsight linked service to link your Apache Spark cluster in Azure HDInsight to the data factory.

4. Create a dataset that refers to the Azure Storage linked service. Currently, you must specify an output dataset for an activity even if there is no output being produced.

5. Create a pipeline with Spark activity that refers to the HDInsight linked service created in #2. The activity is configured with the dataset you created in the previous step as an output dataset.

The output dataset is what drives the schedule (hourly, daily, etc.). Therefore, you must specify the output dataset even though the activity does not really produce an output.

Create Data Factory

1. Log in to the Azure portal.

2. Click NEW on the left menu, click Data + Analytics, and click Data Factory.

3. In the New data factory blade, enter SparkDF for the Name.

4. Select the Azure subscription where you want the data factory to be created.

5. Select an existing resource group or create an Azure resource group.

6. Select Pin to dashboard option.

7. Click Create on the New data factory blade.

8. You see the data factory being created in the dashboard of the Azure portal as follows:

9. After the data factory has been created successfully, you see the data factory page, which shows you the contents of the data factory. If you do not see the data factory page, click the tile for your data factory on the dashboard.

How to create folder in Blob storage ?

There is two way to create folder.

1. Using Azure Storage Explorer. It will be install on your VM from where you want to access.

2. Using putty . If you are using putty and creating dir

Hadoop fs -mkdir wasbs://<containername>@<accountname>.blob.core.windows.net/<path>

Hadoop fs -mkdir wasbs://mycontainer@myaccount.blob.core.windows.net/SparkWordCount

It will not show as folder in Azure portal unless you are not putting any file in this dir.

How to create a pipeline with Spark activity.

1. Create a data factory if not exist.

2. Create an Azure Storage linked service to link your Azure storage that is associated with your HDInsight Spark cluster to the data factory.

3. Create an Azure HDInsight linked service to link your Apache Spark cluster in Azure HDInsight to the data factory.

4. Create a dataset that refers to the Azure Storage linked service. Currently, you must specify an output dataset for an activity even if there is no output being produced.

5. Create a pipeline with Spark activity that refers to the HDInsight linked service created in #2. The activity is configured with the dataset you created in the previous step as an output dataset.

The output dataset is what drives the schedule (hourly, daily, etc.). Therefore, you must specify the output dataset even though the activity does not really produce an output.

Create Data Factory

1. Log in to the Azure portal.

2. Click NEW on the left menu, click Data + Analytics, and click Data Factory.

3. In the New data factory blade, enter SparkDF for the Name.

4. Select the Azure subscription where you want the data factory to be created.

5. Select an existing resource group or create an Azure resource group.

6. Select Pin to dashboard option.

7. Click Create on the New data factory blade.

8. You see the data factory being created in the dashboard of the Azure portal as follows:

Create linked services

In this step, you create two linked services, one to link your Spark cluster to your data factory, and the other to link your Azure storage to your data factory.

Create Azure Storage linked service

In this step, you link your Azure Storage account to your data factory. A dataset you create in a step later in this walkthrough refers to this linked service. The HDInsight linked service that you define in the next step refers to this linked service too.

1. Click Author and deploy on the Data Factory blade for your data factory. You should see the Data Factory Editor.

2. Click New data store and choose Azure storage.

3. You should see the JSON script for creating an Azure Storage linked service in the editor.

4. Replace account name and account key with the name and access key of your Azure storage account.

To get account name and account key

Azure Storage Account (Blob Storage) > Access Key

5. To deploy the linked service, click Deploy on the command bar. After the linked service is deployed successfully, the Draft-1 window should disappear and you see AzureStorageLinkedService in the tree view on the left.

JSON Script ( AzureStorageLinkedService )

{

"name": "AzureStorageLinkedService",

"properties": {

"description": "",

"type": "AzureStorage",

"typeProperties": {

"connectionString": "DefaultEndpointsProtocol=https; AccountName=abc ;AccountKey=****

}

Create HDInsight linked service

In this step, you create Azure HDInsight linked service to link your HDInsight Spark cluster to the data factory. The HDInsight cluster is used to run the Spark program specified in the Spark activity of the pipeline in this sample.

1. Click ... More on the toolbar, click New compute, and then click HDInsight cluster.

2. Copy and paste the following snippet to the Draft-1 window. In the JSON editor, do the following steps:

1. Specify the URI for the HDInsight Spark cluster. For example: https://<sparkclustername>.azurehdinsight.net/.

2. Specify the name of the user who has access to the Spark cluster.

3. Specify the password for user.

4. Specify the Azure Storage linked service that is associated with the HDInsight Spark cluster. In this example, it is: AzureStorageLinkedService.

JSON Script ( HDInsightLinkedService)

{

"name": "HDInsightLinkedService",

"properties": {

"description": "",

"type": "HDInsight",

"typeProperties": {

"clusterUri": "https://<sparkclustername>.azurehdinsight.net",

"userName": "abc",

"password": "*****"

"linkedServiceName": "AzureStorageLinkedService"

}

Create output dataset

The output dataset is what drives the schedule (hourly, daily, etc.). Therefore, you must specify an output dataset for the spark activity in the pipeline even though the activity does not really produce any output. Specifying an input dataset for the activity is optional.

1. In the Data Factory Editor, click ... More on the command bar, click New dataset, and select Azure Blob storage.

2. Copy and paste the following snippet to the Draft-1 window. The JSON snippet defines a dataset called OutputDataset.

Blob container created by spark cluster - adfspark

Spark job root folder - SparkWordCount

Job output folder - Output

JSON Script ( WordCountOutputDataset)

{

"name": "WordCountOutputDataset",

"properties": {

"published": false,

"type": "AzureBlob",

"linkedServiceName": "AzureStorageLinkedService",

"typeProperties": {

"fileName": "WordCountOutput.txt",

"folderPath": "adfspark/SparkWordCount/Output",

"format": {

"type": "TextFormat",

"columnDelimiter": " "

}

"availability": {

"frequency": "Day",

"interval": 1,

"offset": "15:25:00"

}

Create pipeline

1. In the Data Factory Editor, click � More on the command bar, and then click New pipeline.

2. Replace the script in the Draft-1 window with the following script:

3. The type property is set to HDInsightSpark.

4. The rootPath is set to adfspark/SparkWordCount where adfspark is the Azure Blob container and SparkWordCount is folder in that container. In this example, the Azure Blob Storage is the one that is associated with the Spark cluster. You can upload the file to a different Azure Storage. If you do so, create an Azure Storage linked service to link that storage account to the data factory. Then, specify the name of the linked service as a value for the sparkJobLinkedService property.

5. The entryFilePath is set to the WordCount-0.0.1-SNAPSHOT.jar, which is the spark program jar file.

6. The getDebugInfo property is set to Always, which means the log files are always generated (success or failure).

The outputs section has one output dataset. You must specify an output dataset even if the spark program does not produce any output. The output dataset drives the schedule for the pipeline (hourly, daily, etc.).

JSON Script ( WordCountPipeline)

{

"name": "WordCountPipeline",

"properties": {

"activities": [

{

"type": "HDInsightSpark",

"typeProperties": {

"rootPath": "adfspark/SparkWordCount",

"entryFilePath": "WordCount-0.0.1-SNAPSHOT.jar",

"arguments": [ "arg1", "arg2" ],

"sparkConfig": {

"spark.executor.memory": "512m"

"className": "wordcount.WordCount",

"getDebugInfo": "Always"

"outputs": [

{

"name": "WordCountOutputDataset"

}

"scheduler": {

"frequency": "Day",

"interval": 1,

"offset": "15:25:00"

"name": "MySparkActivity",

"description": "This activity invokes the Spark program",

"linkedServiceName": "HDInsightLinkedService"

}

"start": "2017-05-23T15:24:00Z",

"end": "2017-05-27T00:00:00Z",

"isPaused": false,

"pipelineMode": "Scheduled"

}

Monitor pipeline

Click X to close Data Factory Editor blades and to navigate back to the Data Factory home page. Click Monitor and Manage to launch the monitoring application in another tab.

0 Comments

Be first to comment on this post.

You must be logged in to post a comment..

River IQ

ADF Spark Activity

0 Comments