This sample makes use of Databricks Delta time-travel capabilities, which can efficiently version large-scale datasets. You can query a versioned dataset by specifying either a version number or a timestamp.
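For example, a Delta table can be pinned to a snapshot in either of these two ways (a minimal sketch, assuming a Delta table at the hypothetical path `/mnt/datalake/my_table` and the `spark` session that Databricks notebooks provide):

```python
# Query by version number: read the table exactly as it was at version 1.
df_v1 = (
    spark.read.format("delta")
    .option("versionAsOf", 1)
    .load("/mnt/datalake/my_table")
)

# Query by timestamp: read the latest snapshot at or before this time.
df_past = (
    spark.read.format("delta")
    .option("timestampAsOf", "2021-06-01")
    .load("/mnt/datalake/my_table")
)
```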
In this sample, we use the Azure Databricks credential passthrough feature to securely access the data lake. Credential passthrough is recommended over using the storage access key because it ensures that only users who have access to the underlying ADLS Gen2 storage can access it from Azure Databricks.
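With credential passthrough enabled on the cluster, a notebook can read from ADLS Gen2 directly via an `abfss://` URI with no keys or secrets configured. A minimal sketch, assuming a container named datalake and a placeholder storage account name:

```python
# No storage key or service principal secret appears here: with credential
# passthrough, the request is authorized using the notebook user's Azure AD
# identity. The container and account names below are placeholders.
df = spark.read.csv(
    "abfss://datalake@<your_storage_account_name>.dfs.core.windows.net/path/to/data.csv",
    header=True,
)
display(df)
```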
When using a High Concurrency cluster with credential passthrough, the cluster can be shared among multiple users: it takes on the identity of whichever user is running a command and accesses the data lake as that user. For Standard clusters, only a single user's identity can be tied to each cluster, so every user needs to create their own cluster when using credential passthrough.
Currently, MLflow is not supported on High Concurrency clusters with credential passthrough. Because this sample makes use of MLflow, it requires Standard clusters instead.
ACLs give you the ability to apply finer-grained access control to directories and files. Azure RBAC permissions are evaluated first; if they do not grant access, ACLs are then evaluated. For more details, please refer to the official doc.
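If you prefer to script the ACL assignment that the portal steps below perform manually, a sketch using the `azure-storage-file-datalake` SDK might look like the following; the account name, container name, and Azure AD object ID are placeholders, and the recursive update also propagates the entry to existing files (covering the optional Storage Explorer step below):

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: substitute your storage account name and the Azure AD
# object ID of the user who should receive READ and EXECUTE (r-x) access.
service = DataLakeServiceClient(
    account_url="https://<your_storage_account_name>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
root = service.get_file_system_client("datalake").get_directory_client("/")

# Update the ACL recursively so existing files and directories pick up
# the new entry as well (the "propagate" behavior described below).
root.update_access_control_recursive(acl="user:<user-object-id>:r-x")
```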
The official doc advises:
- Will you be accessing your data in a more interactive, ad-hoc way, perhaps developing an ML model or building an operational dashboard? In that case, we recommend that you use Azure Active Directory (Azure AD) credential passthrough.
- Will you be running automated, scheduled workloads that require one-off access to the containers in your data lake? Then using service principals to access Azure Data Lake Storage is preferred.
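For the scheduled-workload case, service-principal access is configured on the Spark session instead of relying on passthrough. A minimal sketch, where the account name, tenant ID, application ID, and secret-scope/key names are all placeholders and the client secret is read from a Databricks secret scope:

```python
# All <...> values are placeholders; the secret is pulled from a Databricks
# secret scope rather than hard-coded in the notebook.
account = "<your_storage_account_name>.dfs.core.windows.net"

spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{account}",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", "<application-id>")
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{account}",
    dbutils.secrets.get(scope="<secret-scope>", key="<client-secret-key>"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{account}",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
)
```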
- Ensure you have the right permissions to ADLS Gen2 through ACLs:
  - Navigate to your Storage account in the Azure Portal, then click Containers -> container (datalake) -> Manage ACL.
  - Add your READ and EXECUTE permissions and click Save.
  - [Optional] If you have any existing files in the Data Lake container, you may need to propagate the ACL permissions:
    - Open Microsoft Azure Storage Explorer.
    - Navigate to the storage account and right-click the container to select Propagate Access Control Lists.
    - Note: if Propagate Access Control Lists cannot be found, try updating Azure Storage Explorer to the latest version.
- In your Databricks workspace, create a cluster, ensuring you set the Cluster Mode to Standard and select Enable credential passthrough for user-level data access.
- Import the `data_versioning.py` notebook into your Databricks workspace. Set the `your_storage_account_name` variable to the name of the storage account created as part of the Azure resources provisioned in the IaC (Terraform) step (see the sketch after this list).
- Run the notebook.
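For illustration, the edit in `data_versioning.py` amounts to something like the following; the variable name comes from the notebook itself, while the value and the derived path are assumptions:

```python
# Replace the placeholder with the storage account name from the Terraform
# output; the path construction below is an illustrative assumption.
your_storage_account_name = "<your_storage_account_name>"
adls_path = f"abfss://datalake@{your_storage_account_name}.dfs.core.windows.net/"
```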
See here for more information on Credential Passthrough.