Spark Session On Synapse Spark Pool
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
Interactive Spark Session on Synapse Spark Pool
Install package
If you are using JupyterLab, additionally run:
Please restart the kernel and refresh the web page before starting a Spark session.
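The install step above can be sketched as follows (the azureml-synapse package provides the %synapse magics; pin a version as appropriate for your environment):

```shell
# Install the AzureML Synapse package, which provides the %synapse / %%synapse magics
pip install azureml-synapse
```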
0. How to leverage Spark Magic for an interactive Spark experience
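A typical first cell asks the magic for its help text, which lists the available subcommands (a sketch; run this inside a notebook cell after installing azureml-synapse):

```python
%synapse ?
```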
1. Start Synapse Session
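A session can be started with the %synapse line magic, where -c names the Synapse Spark pool compute attached to your workspace (the compute target name below is a placeholder):

```python
%synapse start -c <compute target name>
```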
2. Data preparation
Three types of datastores are supported in Synapse Spark, and there are two ways to load the data.
| Datastore Type | Data Access |
|---|---|
| Blob | Credential |
| Adlsgen1 | Credential & Credential-less |
| Adlsgen2 | Credential & Credential-less |
Example 1: Data loading by HDFS path
Read data from Blob
# set up the access key or SAS token
sc._jsc.hadoopConfiguration().set("fs.azure.account.key.<storage account name>.blob.core.windows.net", "<access key>")
sc._jsc.hadoopConfiguration().set("fs.azure.sas.<container name>.<storage account name>.blob.core.windows.net", "<sas token>")
df = spark.read.parquet("wasbs://<container name>@<storage account name>.blob.core.windows.net/<path>")
Read data from Adlsgen1
# set up a service principal that has access to the data
# If no credential is set up, the user identity will be used for access control
sc._jsc.hadoopConfiguration().set("fs.adl.account.<storage account name>.oauth2.access.token.provider.type","ClientCredential")
sc._jsc.hadoopConfiguration().set("fs.adl.account.<storage account name>.oauth2.client.id", "<client id>")
sc._jsc.hadoopConfiguration().set("fs.adl.account.<storage account name>.oauth2.credential", "<client secret>")
sc._jsc.hadoopConfiguration().set("fs.adl.account.<storage account name>.oauth2.refresh.url", "https://login.microsoftonline.com/<tenant id>/oauth2/token")
df = spark.read.csv("adl://<storage account name>.azuredatalakestore.net/<path>")
Read data from Adlsgen2
# set up a service principal that has access to the data
# If no credential is set up, the user identity will be used for access control
sc._jsc.hadoopConfiguration().set("fs.azure.account.auth.type.<storage account name>.dfs.core.windows.net","OAuth")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth.provider.type.<storage account name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.id.<storage account name>.dfs.core.windows.net", "<client id>")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.secret.<storage account name>.dfs.core.windows.net", "<client secret>")
sc._jsc.hadoopConfiguration().set("fs.azure.account.oauth2.client.endpoint.<storage account name>.dfs.core.windows.net", "https://login.microsoftonline.com/<tenant id>/oauth2/token")
df = spark.read.csv("abfss://<container name>@<storage account name>.dfs.core.windows.net/<path>")
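As a quick cross-check of the three URI schemes above, small helpers (hypothetical, for illustration only) can assemble the paths from their parts:

```python
def blob_uri(container: str, account: str, path: str) -> str:
    # wasbs:// scheme for Blob storage
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path}"

def adlsgen1_uri(account: str, path: str) -> str:
    # adl:// scheme for ADLS Gen1
    return f"adl://{account}.azuredatalakestore.net/{path}"

def adlsgen2_uri(container: str, account: str, path: str) -> str:
    # abfss:// scheme for ADLS Gen2
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"

print(blob_uri("data", "mystorage", "raw/input.parquet"))
# wasbs://data@mystorage.blob.core.windows.net/raw/input.parquet
```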
Example 2: Data loading by AML Dataset
You can create a tabular dataset by following the guidance, then use to_spark_dataframe() to load the data.
%%synapse
import azureml.core
print(azureml.core.VERSION)
from azureml.core import Workspace, Dataset
ws = Workspace.get(name='<workspace name>', subscription_id='<subscription id>', resource_group='<resource group>')
ds = Dataset.get_by_name(ws, "<tabular dataset name>")
df = ds.to_spark_dataframe()
# You can do more data transformation on spark dataframe
3. Session Metadata
After the session has started, you can check the session's metadata and find links to the Synapse portal.
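The metadata can be displayed with the meta subcommand of the line magic (a sketch; run it in a notebook cell while a session is active):

```python
%synapse meta
```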
4. Stop Session
When the current session reaches a timeout, dead, or failed state, you must explicitly stop it before starting a new one.
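Stopping the session is likewise a line magic (a sketch; run it in a notebook cell):

```python
%synapse stop
```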