Spark Session On Synapse Spark Pool

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


Interactive Spark Session on Synapse Spark Pool

Install package

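The install cell is empty in this copy. Presumably it installs the package that provides the Synapse Spark magic; the package name below is an assumption based on the feature this notebook describes:

```
# Assumed install command for the AzureML Synapse extension
pip install azureml-synapse
```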

For JupyterLab, please additionally run:


PLEASE restart the kernel and then refresh the web page before starting a Spark session.

0. How to leverage Spark magic for an interactive Spark experience

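The cell here is empty in this copy. As a sketch: the extension registers a `%synapse` line magic (for session management) and a `%%synapse` cell magic (for running code in the remote Spark session), and IPython's built-in help syntax can show its usage (the exact invocation is an assumption):

```
# Show help for the Synapse Spark magic (assumed invocation)
%synapse ?
```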

1. Start Synapse Session

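The start cells are empty in this copy. A minimal sketch of starting a session, assuming a Synapse Spark pool has already been attached to the AzureML workspace as a compute target; the option name and placeholder are assumptions:

```
# Start an interactive session on an attached Synapse Spark pool (assumed syntax)
%synapse start --compute-target <attached synapse compute name>
```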

2. Data preparation

Three types of datastore are supported in Synapse Spark, and there are two ways to load the data.

Datastore Type    Data Access
Blob              Credential
Adlsgen1          Credential & Credential-less
Adlsgen2          Credential & Credential-less

Example 1: Data loading by HDFS path

Read data from Blob

# Set up an access key or SAS token for the storage account
sc._jsc.hadoopConfiguration().set("fs.azure.account.key.<storage account name>.blob.core.windows.net", "<access key>")
sc._jsc.hadoopConfiguration().set("fs.azure.sas.<container name>.<storage account name>.blob.core.windows.net", "<sas token>")

df = spark.read.parquet("wasbs://<container name>@<storage account name>.blob.core.windows.net/<path>")

Read data from Adlsgen1

# Set up a service principal that has access to the data.
# If no credential is set up, the user identity is used for access control.
sc._jsc.hadoopConfiguration().set("fs.adl.account.<storage account name>.oauth2.access.token.provider.type", "ClientCredential")
sc._jsc.hadoopConfiguration().set("fs.adl.account.<storage account name>.oauth2.client.id", "<client id>")
sc._jsc.hadoopConfiguration().set("fs.adl.account.<storage account name>.oauth2.credential", "<client secret>")
sc._jsc.hadoopConfiguration().set("fs.adl.account.<storage account name>.oauth2.refresh.url", "https://login.microsoftonline.com/<tenant id>/oauth2/token")

df = spark.read.csv("adl://<storage account name>.azuredatalakestore.net/<path>")

Read data from Adlsgen2

# Set up a service principal that has access to the data.
# If no credential is set up, the user identity is used for access control.
sc._jsc.hadoopConfiguration().set("fs.azure.account.auth.type.<storage account name>.dfs.core.windows.net", "OAuth")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth.provider.type.<storage account name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.id.<storage account name>.dfs.core.windows.net", "<client id>")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.secret.<storage account name>.dfs.core.windows.net", "<client secret>")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.endpoint.<storage account name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant id>/oauth2/token")

df = spark.read.csv("abfss://<container name>@<storage account name>.dfs.core.windows.net/<path>")
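The three HDFS-style URIs above differ only in scheme and host layout. As an illustration, here is a small hypothetical helper (not part of the original notebook) that assembles each URI from its components; it only formats strings, so the access configuration shown above must still be set separately:

```python
# Hypothetical helpers that assemble the HDFS-style URIs used above.

def blob_uri(container: str, account: str, path: str) -> str:
    """wasbs:// URI for Azure Blob storage."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path}"

def adls_gen1_uri(account: str, path: str) -> str:
    """adl:// URI for Azure Data Lake Storage Gen1 (no container level)."""
    return f"adl://{account}.azuredatalakestore.net/{path}"

def adls_gen2_uri(container: str, account: str, path: str) -> str:
    """abfss:// URI for Azure Data Lake Storage Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

print(blob_uri("data", "mystore", "raw/sales.parquet"))
# wasbs://data@mystore.blob.core.windows.net/raw/sales.parquet
```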


Example 2: Data loading by AML Dataset

You can create a tabular dataset by following the guidance, then use to_spark_dataframe() to load the data.

%%synapse

import azureml.core
print(azureml.core.VERSION)

from azureml.core import Workspace, Dataset
ws = Workspace.get(name='<workspace name>', subscription_id='<subscription id>', resource_group='<resource group>')
ds = Dataset.get_by_name(ws, "<tabular dataset name>")
df = ds.to_spark_dataframe()

# You can do more data transformation on the Spark dataframe

3. Session Metadata

After the session has started, you can check the session's metadata and find the links to the Synapse portal.

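The cell is empty in this copy; presumably the metadata is printed with the magic's meta subcommand (the subcommand name is an assumption):

```
# Show metadata and Synapse portal links for the current session (assumed syntax)
%synapse meta
```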

4. Stop Session

When the current session reaches a timeout, dead, or failed status, you must explicitly stop it before starting a new one.

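The stop cell is empty in this copy; a sketch of the stop command (assumed syntax, mirroring the start command above):

```
# Explicitly stop the current session before starting a new one (assumed syntax)
%synapse stop
```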